Data Set

What Does Data Set Mean?

A data set is a structured collection of data points related to a particular subject. A collection of related data sets is called a database.


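Data sets can be tabular or non-tabular. Tabular data sets contain structured data that is organized by rows and columns. Non-tabular data sets contain data that does not fit a fixed row-and-column layout, such as nested records in formats like JSON or XML, where brackets or tags mark the boundaries of each element.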
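As a rough illustration of the difference, the sketch below loads one tabular and one non-tabular data set using Python's standard library. The sensor names and readings are made-up sample values, not data from any real source.

```python
import csv
import io
import json

# Tabular: every row has the same columns (hypothetical sample data).
csv_text = "sensor,reading\nA,21.5\nB,19.0\n"
rows = list(csv.DictReader(io.StringIO(csv_text)))

# Non-tabular: nested records whose shape can differ from item to item.
json_text = '{"site": "lab", "sensors": [{"id": "A"}, {"id": "B", "alerts": 3}]}'
record = json.loads(json_text)

# A tabular value is addressed by row and column name.
print(rows[0]["reading"])
# A non-tabular value is addressed by a path through the nesting.
print(record["sensors"][1]["alerts"])
```

Note that the second sensor carries an `alerts` field the first one lacks, which is the kind of irregularity a fixed set of columns does not express naturally.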

Data sets can also be categorized by the type of information they contain. Popular types of data sets include:

  • Numerical – data is expressed in numbers rather than natural language.
  • Bivariate – contains two related variables.
  • Multivariate – contains three or more related variables.
  • Categorical – each variable takes one of a limited set of categories or labels; a variable with exactly two possible values is the special case known as binary.
  • Correlation – values in the data set have a relationship with each other.
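For a bivariate numerical data set, one common way to quantify the relationship between the two variables is the Pearson correlation coefficient. The text above does not name a specific measure, so this choice is an assumption, and the `hours` and `scores` values are hypothetical.

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length numeric sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of products of deviations, and sums of squared deviations.
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

# Hypothetical bivariate data set: study hours and test scores.
hours = [1, 2, 3, 4, 5]
scores = [52, 55, 61, 64, 70]
r = pearson(hours, scores)
print(round(r, 3))
```

A result close to 1 or -1 suggests a strong linear relationship, while a result near 0 suggests little or none.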

Techopedia Explains Data Set

In computing, the term data set originated with IBM mainframes, where its meaning was similar to that of a file. Today, the term is often associated with big data analytics, machine learning (ML) and artificial intelligence (AI).

Machine learning

Large data sets are required to train machine learning algorithms. After the initial training, additional data sets are used to check for overfitting and to validate the model's ability to interpret new data accurately.
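One simple way to obtain a separate data set for validation is to hold back part of the available data before training. The sketch below assumes a plain hold-out split; the function name `split_holdout`, the 20% fraction and the fixed seed are illustrative choices, not a prescribed method.

```python
import random

def split_holdout(records, validation_fraction=0.2, seed=0):
    """Shuffle a copy of the records and split it into (train, validation)."""
    shuffled = list(records)
    random.Random(seed).shuffle(shuffled)  # seeded so the split is repeatable
    n_val = int(len(shuffled) * validation_fraction)
    return shuffled[n_val:], shuffled[:n_val]

# Hypothetical data set of 100 records, represented here by integers.
data = list(range(100))
train, validation = split_holdout(data)
print(len(train), len(validation))
```

A model that scores well on `train` but much worse on `validation` is likely overfitting, because the validation records played no part in training.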

Data sets for training machine learning algorithms can either be created in-house or acquired from a data set repository. If large data sets are not available, data scientists can use smaller data sets produced by random sampling.
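As a minimal sketch of random sampling, the example below draws a simple random sample without replacement from a larger collection. The helper name `sample_subset`, the sizes and the seed are assumptions made for illustration.

```python
import random

def sample_subset(records, k, seed=0):
    """Return k distinct records chosen uniformly at random."""
    return random.Random(seed).sample(records, k)

# Hypothetical source collection of 10,000 records.
population = list(range(10_000))
subset = sample_subset(population, 500)
print(len(subset))
```

Because every record has the same chance of selection, the smaller data set tends to preserve the overall distribution of the source, although rare values may be missed in a small sample.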

Mean, Median, Mode

Mean, median and mode are measures of a data set's central tendency. The idea of central tendency is to represent the contents of a large data set with a single value that indicates the middle of the data set's distribution.

The mean (average) is found by adding all numbers in the data set and then dividing the sum by the number of values in the set. The median is the middle value of a data set that has been ordered from least to greatest; when the set has an even number of values, the median is the average of the two middle values. The mode is the value that occurs most often in a data set.
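The three measures can be computed with Python's standard `statistics` module. The values below are a made-up sample chosen so that the arithmetic is easy to check by hand.

```python
import statistics

# Hypothetical data set, already ordered from least to greatest.
values = [2, 3, 3, 5, 7, 10]

mean = statistics.mean(values)      # (2 + 3 + 3 + 5 + 7 + 10) / 6 = 5
median = statistics.median(values)  # even count: (3 + 5) / 2 = 4.0
mode = statistics.mode(values)      # 3 appears twice, more than any other value

print(mean, median, mode)
```

Here the three measures differ (5, 4.0 and 3) because the large value 10 pulls the mean upward while leaving the median and mode unchanged.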

