How to put everything in order with your dataset

A data set, also known as a sample, was a familiar concept long before people began talking about machine learning. In this post, we'll look at ML data sets specifically and explain how raw data is used for ML modeling, from what a data set consists of to how to build one properly.


In ML, a data set is a collection of the data under study. It can be spread across several databases, collected from hardware sensors, or gathered by tools in real time. The term also refers to a sample drawn from the available data to train a particular machine learning model.

For a practical example, think of an Excel spreadsheet with sales figures.

A data set that is ready for supervised machine learning is structured, processed information, usually represented in tabular form. The rows of such a table are the objects, and the columns are their features.

When we speak about an ML classification model, there are two types of variables:

  • independent variables (predictors), such as season, type of goods, and price;
  • dependent variables, such as a forecast of whether goods will have a low or high sales volume. These are the variables that are calculated from one or more predictors.

During machine learning, a model is built so that it can classify objects from the original set.
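
As a minimal sketch of such a table, the snippet below stores a few sales records as rows and separates the predictors from the dependent variable. The records, the column names, and the helper `split_features_target` are all hypothetical and chosen only for illustration.

```python
# Hypothetical sales records: each row is one object (a product in a season).
rows = [
    {"season": "winter", "product_type": "coat", "price": 120.0, "sales_volume": "high"},
    {"season": "summer", "product_type": "coat", "price": 120.0, "sales_volume": "low"},
    {"season": "summer", "product_type": "sandals", "price": 35.0, "sales_volume": "high"},
    {"season": "winter", "product_type": "sandals", "price": 35.0, "sales_volume": "low"},
]

FEATURES = ["season", "product_type", "price"]  # independent predictors (columns)
TARGET = "sales_volume"                         # dependent variable (class label)


def split_features_target(records):
    """Separate each row into a feature vector and a target label."""
    X = [[record[name] for name in FEATURES] for record in records]
    y = [record[TARGET] for record in records]
    return X, y


X, y = split_features_target(rows)
print(X[0], "->", y[0])
```

A classifier would then be trained to map each feature vector in `X` to its label in `y`.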

In practice, however, ML tasks go well beyond classification.

Find out more about classification in ML here.

Data sampling

The main data set is usually referred to as the general population. The process of collecting a set from the population is called data sampling. A sample is a finite subset of the elements of a population that can be examined to understand the behavior of the population as a whole. For example, the general population is all site visitors, and the sample includes 250 randomly selected visitors.

The probabilistic model of data generation assumes that a sample is drawn from the general population at random. If all of its elements are selected at random and independently of one another, the sample is called a simple random sample. Simple random sampling is a mathematical model of a series of independent experiments and is commonly used in statistical analysis.
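
The visitor example can be sketched in a few lines of Python. This is only an illustration: the population of 10,000 visitor IDs and the helper name `simple_random_sample` are assumptions, and the draw is made without replacement, so no visitor appears twice.

```python
import random


def simple_random_sample(population, sample_size, seed=None):
    """Draw sample_size distinct elements; every subset is equally likely."""
    rng = random.Random(seed)  # a fixed seed makes the draw reproducible
    return rng.sample(population, sample_size)


# Hypothetical population: the IDs of all site visitors.
visitors = list(range(10_000))
sample = simple_random_sample(visitors, 250, seed=42)
print(len(sample))
```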

ML modeling works in a similar way. Each stage of machine learning requires its own set of data:

  • training a model requires training data, which is used to fit the algorithm (parameter optimization);
  • assessing the quality of the model requires a test sample, which ideally should be independent of the data the model was trained on;
  • choosing the best machine learning model requires a validation set.
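
One possible way to obtain these separate sets is to shuffle the records once and cut the result into three non-overlapping parts. The sketch below assumes a hypothetical helper `train_val_test_split` and arbitrary 10% fractions for the validation and test parts; neither comes from a specific library.

```python
import random


def train_val_test_split(records, val_fraction=0.1, test_fraction=0.1, seed=0):
    """Shuffle the records and cut them into three non-overlapping parts."""
    shuffled = list(records)  # copy, so the caller's list is left untouched
    random.Random(seed).shuffle(shuffled)
    n_val = int(len(shuffled) * val_fraction)
    n_test = int(len(shuffled) * test_fraction)
    val = shuffled[:n_val]
    test = shuffled[n_val:n_val + n_test]
    train = shuffled[n_val + n_test:]
    return train, val, test


records = list(range(1000))  # stand-in for 1000 data rows
train, val, test = train_val_test_split(records)
print(len(train), len(val), len(test))
```

Because every record lands in exactly one part, the model never sees the evaluation data during training.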


How the training and evaluation data sets are formed depends on the class of machine learning task:

  • for classification tasks, the data must be divided so that the proportion of each class of objects in the resulting sets is the same as in the original population;
  • for regression tasks, the target variable must have the same distribution in the resulting sets used for training and quality control.
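
For classification, this kind of division is commonly called a stratified split. A minimal sketch, assuming a hypothetical helper `stratified_split` and an invented imbalanced data set of 90 "low" and 10 "high" labels, could look like this:

```python
import random
from collections import defaultdict


def stratified_split(records, label_of, eval_fraction=0.2, seed=0):
    """Split so each class keeps roughly its original share in both parts."""
    by_class = defaultdict(list)
    for record in records:
        by_class[label_of(record)].append(record)

    rng = random.Random(seed)
    train, evaluation = [], []
    for members in by_class.values():
        rng.shuffle(members)
        # Take the same fraction from every class.
        n_eval = round(len(members) * eval_fraction)
        evaluation.extend(members[:n_eval])
        train.extend(members[n_eval:])
    return train, evaluation


# Hypothetical imbalanced data: 90 "low" and 10 "high" sales labels.
data = [(i, "low") for i in range(90)] + [(i, "high") for i in range(90, 100)]
train, evaluation = stratified_split(data, label_of=lambda r: r[1])
print(len(train), len(evaluation))
```

For a regression task, one option (an assumption, not something the text prescribes) is to bin the target variable into ranges and pass the bin as the label. Libraries such as scikit-learn offer a comparable option through the `stratify` argument of `train_test_split`.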

If these conditions are met, the sizes of the training and evaluation samples may differ significantly. For example, the validation data set may be as small as 10% of the total population. The main thing when forming samples is not to mix the training data set with the evaluation data (testing and validation), as this risks overfitting the model without the evaluation revealing it. In that case, the model will get high quality scores during development but will not show the same results on real data.
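
The effect can be shown with a deliberately extreme toy example. In the sketch below, everything is hypothetical: the only feature is a unique ID, the labels are random, and the "model" merely memorizes its training data. Scored on the data it was trained on, it looks perfect; scored on held-out data, it does no better than guessing.

```python
import random

rng = random.Random(0)


def make_records(n, start_id):
    """Hypothetical data: a unique ID as the only feature, a random label."""
    return [(start_id + i, rng.choice(["low", "high"])) for i in range(n)]


train = make_records(500, start_id=0)
test = make_records(200, start_id=500)  # IDs never seen during training

# A "model" that simply memorizes the training set.
memory = dict(train)
labels = [label for _, label in train]
# Fall back to the most common training label for unseen inputs.
fallback = max(sorted(set(labels)), key=labels.count)


def predict(feature):
    return memory.get(feature, fallback)


def accuracy(records):
    return sum(predict(x) == y for x, y in records) / len(records)


print("scored on training data:", accuracy(train))
print("scored on held-out data:", accuracy(test))
```

If the training records were mixed into the evaluation set, the first, misleading number is what the evaluation would report.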

Where to get general-purpose datasets?

Many ready-to-use datasets are available on ML research platforms.

Government datasets:

  • The U.S. National Center for Education Statistics. Data on educational institutions and educational demographics in the United States;
  • Data USA. Comprehensive visualization of publicly available U.S. data.

Housing data:

  • Boston Housing Dataset. Contains information about housing in Boston collected by the U.S. Census Bureau. It was obtained from the StatLib archive and was widely used in the literature for evaluating algorithms.

Economics and Finance:

  • Google Trends. Study and analyze data on Internet search activity and trends around the world;
  • American Economic Association (AEA). A good source of data on U.S. macroeconomics.

Specialized datasets for ML

Computer vision:

  • Google’s Open Images. A collection of several million images “that have been tagged spanning over 6,000 categories” under a Creative Commons license;
  • Labeled Faces in the Wild. A set of 13,000 labeled images of people's faces for use in applications that involve facial recognition;
  • Stanford Dogs Dataset. Contains 20,580 images from 120 dog breeds;
  • Indoor Scene Recognition. Dataset for recognizing the interior of buildings. Contains 15,620 images and 67 categories.

Sentiment analysis:

  • Sentiment140. A popular dataset of 1.6 million tweets with emoticons removed;
  • Twitter U.S. Airline Sentiment. A set of data from Twitter about U.S. airlines dating back to February 2015, divided into positive, negative, and neutral tweets.

Natural language processing:

  • SMS Spam Collection in English. A dataset of 5,574 SMS messages in English, each labeled as spam or legitimate;
  • Yelp Reviews. A dataset from Yelp containing more than 5 million reviews;
  • UCI’s Spambase. A dataset of emails labeled as spam or non-spam.

Check this brilliant collection of links for data sets.

The most complicated part of the work is collecting and structuring your own raw data. Before it can be used, a lot of manual work is needed: removing duplicates, reconciling mismatched fields, and clearing out irrelevant figures. This work is usually delegated to AI experts. A separate issue is the proper storage of dynamically generated data: you have to plan for regular updates while keeping the historical data somewhere. Do not hesitate to turn to an ML research lab for assistance if you want to do custom ML research.
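
Part of this cleanup can be automated. The following is a minimal sketch, not a complete cleaning pipeline: the helper `clean_records`, the field names, and the sample rows are all hypothetical. It drops exact duplicates and rows in which a required field is missing.

```python
def clean_records(records, required_fields):
    """Drop exact duplicates and rows with missing required fields."""
    seen = set()
    cleaned = []
    for record in records:
        # Skip rows where a required field is absent or empty.
        if any(record.get(field) in (None, "") for field in required_fields):
            continue
        # Use the sorted field/value pairs as a hashable fingerprint.
        key = tuple(sorted(record.items()))
        if key in seen:
            continue
        seen.add(key)
        cleaned.append(record)
    return cleaned


raw = [
    {"product": "coat", "price": 120.0},
    {"product": "coat", "price": 120.0},    # exact duplicate
    {"product": "sandals", "price": None},  # missing price
    {"product": "sandals", "price": 35.0},
]
print(clean_records(raw, ["product", "price"]))
```

Real data usually needs more than this, for example near-duplicate detection or unit normalization, which is where the manual work comes in.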