Train/Test
Evaluate Your Model
In Machine Learning we create models to predict the outcome of certain events, like in the previous chapter where we predicted the CO2 emission of a car when we knew the weight and engine size.
To measure if the model is good enough, we can use a method called Train/Test.
What is Train/Test
Train/Test is a method to measure the accuracy of your model.
It is called Train/Test because you split the data set into two sets: a training set and a testing set.
You train the model using the training set.
You test the model using the testing set.
Start With a Data Set
Start with a data set you want to test.
Our data set represents 100 customers in a shop and their shopping habits: how many minutes each customer stays in the shop and how much money they spend.
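A data set like this can be simulated with numpy. The sketch below is only an illustration: the seed, the normal distributions and the names x and y are assumptions, not values given in the text.

```python
import numpy as np

# Assumed seed, used only so the example gives the same numbers on every run
np.random.seed(2)

# x: minutes each of the 100 customers stays in the shop (assumed distribution)
x = np.random.normal(3, 1, 100)

# y: money spent; dividing by x makes spending fall as the stay gets longer
y = np.random.normal(150, 40, 100) / x

print(len(x), len(y))
```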
Split Into Train/Test
The training set should be a random selection of 80% of the original data.
The testing set should be the remaining 20%.
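One simple way to make the split is to slice the arrays. This is a sketch under the assumption that the data was generated in random order (as in the simulated data set here), so the first 80 rows are already a random selection. For data stored in a meaningful order you would shuffle first.

```python
import numpy as np

# Assumed synthetic data set (seed and distributions are illustrative)
np.random.seed(2)
x = np.random.normal(3, 1, 100)
y = np.random.normal(150, 40, 100) / x

# First 80% of the rows for training
train_x = x[:80]
train_y = y[:80]

# Remaining 20% of the rows for testing
test_x = x[80:]
test_y = y[80:]

print(len(train_x), len(test_x))
```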
Display the Training Set
Display the same kind of scatter plot, this time with only the training set.
Display the Testing Set
To make sure the testing set is not completely different, we will take a look at the testing set as well.
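Both scatter plots can be drawn side by side to compare them. The sketch below assumes the simulated data set and the 80/20 slicing used for illustration. The file name is hypothetical, and in an interactive session plt.show() can be used instead of saving.

```python
import matplotlib.pyplot as plt
import numpy as np

# Assumed synthetic data set and split (illustrative values)
np.random.seed(2)
x = np.random.normal(3, 1, 100)
y = np.random.normal(150, 40, 100) / x
train_x, train_y = x[:80], y[:80]
test_x, test_y = x[80:], y[80:]

# Shared axes make the two point clouds directly comparable
fig, (ax_train, ax_test) = plt.subplots(1, 2, sharex=True, sharey=True)

ax_train.scatter(train_x, train_y)
ax_train.set_title("Training set")

ax_test.scatter(test_x, test_y)
ax_test.set_title("Testing set")

# Hypothetical output file name
fig.savefig("train_test_scatter.png")
```

If the two plots show roughly the same shape, the testing set is a fair sample of the same data.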
Fit the Data Set
What does the data set look like? In my opinion, the best fit would be a polynomial regression, so let us draw a line of polynomial regression.
To draw a line through the data points, we use the plot() method of the matplotlib module:
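A minimal sketch of the fit and the line, assuming the simulated data set and a polynomial of degree 4. The degree, the 0 to 6 range for the line and the file name are illustrative choices, not values stated in the text.

```python
import matplotlib.pyplot as plt
import numpy as np

# Assumed synthetic data set and split (illustrative values)
np.random.seed(2)
x = np.random.normal(3, 1, 100)
y = np.random.normal(150, 40, 100) / x
train_x, train_y = x[:80], y[:80]

# Fit a polynomial to the training set only (degree 4 is an assumption)
model = np.poly1d(np.polyfit(train_x, train_y, 4))

# Evenly spaced x values at which to draw the fitted line
line_x = np.linspace(0, 6, 100)

fig, ax = plt.subplots()
ax.scatter(train_x, train_y)
ax.plot(line_x, model(line_x))

# Hypothetical output file name; use plt.show() in an interactive session
fig.savefig("polynomial_fit.png")
```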
The result supports my suggestion that a polynomial regression fits the data set, even though it would give us some weird results if we try to predict values outside of the data set. Example: the line indicates that a customer spending 6 minutes in the shop would make a purchase worth 200. That is probably a sign of overfitting.
But what about the R-squared score? The R-squared score is a good indicator of how well the model fits my data set.
R2
Remember R2, also known as R-squared?
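It measures the relationship between the x axis and the y axis. The value usually ranges from 0 to 1, where 0 means no relationship and 1 means totally related. On data the model was not fitted to, the score can even drop below 0, which means the model predicts worse than simply using the mean.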
The sklearn module has a method called r2_score() that will help us find this relationship.
In this case we would like to measure the relationship between the minutes a customer stays in the shop and how much money they spend.
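To show what the score measures, the sketch below computes R-squared on the training set directly from its definition, 1 - SS_res / SS_tot, using only numpy. The helper r_squared is a hypothetical name, and the data set and degree are illustrative assumptions. With scikit-learn installed, sklearn.metrics.r2_score(train_y, model(train_x)) should return the same number.

```python
import numpy as np

def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    ss_res = np.sum((y_true - y_pred) ** 2)           # unexplained variation
    ss_tot = np.sum((y_true - np.mean(y_true)) ** 2)  # total variation
    return 1.0 - ss_res / ss_tot

# Assumed synthetic data set and split (illustrative values)
np.random.seed(2)
x = np.random.normal(3, 1, 100)
y = np.random.normal(150, 40, 100) / x
train_x, train_y = x[:80], y[:80]

# Polynomial model fitted on the training set (degree 4 is an assumption)
model = np.poly1d(np.polyfit(train_x, train_y, 4))

# Score the model on the same data it was trained on
train_r2 = r_squared(train_y, model(train_x))
print(train_r2)
```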
Bring in the Testing Set
Now we have made a model that is OK, at least when it comes to training data.
Now we want to test the model with the testing data as well, to see if it gives us the same result.
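The same score can be computed on the testing set. In this sketch the model is still fitted on the training set only, and the testing set is used purely for scoring. The helper r_squared, the data set and the degree are illustrative assumptions.

```python
import numpy as np

def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - np.mean(y_true)) ** 2)
    return 1.0 - ss_res / ss_tot

# Assumed synthetic data set and split (illustrative values)
np.random.seed(2)
x = np.random.normal(3, 1, 100)
y = np.random.normal(150, 40, 100) / x
train_x, train_y = x[:80], y[:80]
test_x, test_y = x[80:], y[80:]

# Fit on the training set only (degree 4 is an assumption)
model = np.poly1d(np.polyfit(train_x, train_y, 4))

# Score on data the model has never seen
test_r2 = r_squared(test_y, model(test_x))
print(test_r2)
```

A testing score close to the training score suggests the model generalizes. A much lower testing score suggests it has overfitted the training data.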
Predict Values
Now that we have established that our model is OK, we can start predicting new values.
The example predicted that the customer would spend 22.88 dollars, which seems to correspond to the diagram.
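A prediction is made by calling the fitted model with a new x value. The sketch below asks how much a customer who stays 5 minutes would spend. The 5 minute input, the data set and the degree are assumptions for illustration, so the printed number depends on those choices.

```python
import numpy as np

# Assumed synthetic data set and split (illustrative values)
np.random.seed(2)
x = np.random.normal(3, 1, 100)
y = np.random.normal(150, 40, 100) / x
train_x, train_y = x[:80], y[:80]

# Polynomial model fitted on the training set (degree 4 is an assumption)
model = np.poly1d(np.polyfit(train_x, train_y, 4))

# Hypothetical new customer who stays 5 minutes in the shop
minutes = 5
predicted_spend = model(minutes)
print(predicted_spend)
```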
