Data Sets for Testing and Training


Model
A model is a representation of a real-world process. Once fitted, it is used to make predictions on new data such as the test data.

Three data sets are used at different stages of creating the model:

Data Set Name	Used to
Training Data Set	Fit the model
Validation Data Set	Tune hyperparameters and detect overfitting during training
Test Data Set	Evaluate the final model

Training dataset
It is used to fit the parameters of the model.

Validation dataset
The validation dataset provides an unbiased evaluation of a model fit on the training dataset while tuning the model's hyperparameters. It is also used for early stopping: training is halted when the error on the validation dataset starts to increase, which is a sign of overfitting to the training dataset.
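
Early stopping can be sketched in a few lines. The example below is only an illustration under stated assumptions: the synthetic data, the degree-9 polynomial features, the learning rate `lr` and the `patience` value are all invented here and do not come from the text above. It trains by gradient descent on the training portion, checks the validation error after every step, and keeps the weights that gave the lowest validation error.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data (assumed for illustration): 40 noisy samples of a cubic
x = rng.uniform(-1, 1, 40)
y = x - 2 * x**2 + 0.5 * x**3 + rng.normal(0, 0.3, 40)

# Polynomial features x^0 .. x^9, flexible enough to overfit
X = np.vander(x, 10, increasing=True)

# First 25 samples for training, remaining 15 for validation
X_train, y_train = X[:25], y[:25]
X_val, y_val = X[25:], y[25:]


def mse(X, y, w):
    """Mean squared error of the linear model X @ w."""
    return float(np.mean((X @ w - y) ** 2))


w = np.zeros(X.shape[1])
best_w, best_val, best_epoch = w.copy(), mse(X_val, y_val, w), 0
lr, patience, bad_epochs = 0.05, 20, 0  # hypothetical settings

for epoch in range(1, 5001):
    # One gradient descent step on the training error only
    grad = 2 * X_train.T @ (X_train @ w - y_train) / len(y_train)
    w -= lr * grad

    # Monitor the validation error and remember the best weights so far
    val = mse(X_val, y_val, w)
    if val < best_val:
        best_w, best_val, best_epoch = w.copy(), val, epoch
        bad_epochs = 0
    else:
        bad_epochs += 1
        if bad_epochs >= patience:
            break  # no improvement for `patience` epochs in a row

print("last epoch run:", epoch)
print("best epoch:", best_epoch, "validation MSE:", best_val)
```

The weights in `best_w`, not the final `w`, are the ones to keep, because later steps may already have started fitting noise in the training data.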

Test / holdout dataset
It is used to provide an unbiased evaluation of a final model fit on the training dataset.
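
The three roles can be seen together in one small sketch. Everything specific in it is an assumption made for illustration: the synthetic data, the 60/20/20 proportions, and the choice of polynomial degree as the hyperparameter. Candidate models are fit on the training portion, the degree is chosen on the validation portion, and the test portion is touched only once at the end.

```python
import numpy as np

rng = np.random.default_rng(1)

# Synthetic data (assumed for illustration): 100 noisy samples of a cubic
x = rng.uniform(-2, 2, 100)
y = x - 2 * x**2 + 0.5 * x**3 + rng.normal(0, 1, 100)

# Shuffle the indices once, then slice them 60 / 20 / 20
idx = rng.permutation(100)
train, val, test = idx[:60], idx[60:80], idx[80:]


def mse(coeffs, xs, ys):
    """Mean squared error of a polynomial with the given coefficients."""
    return float(np.mean((np.polyval(coeffs, xs) - ys) ** 2))


best_degree, best_coeffs, best_val_mse = None, None, np.inf
for degree in range(1, 7):
    # Training set: fit the parameters (polynomial coefficients)
    coeffs = np.polyfit(x[train], y[train], degree)
    # Validation set: compare hyperparameter choices (the degree)
    val_mse = mse(coeffs, x[val], y[val])
    if val_mse < best_val_mse:
        best_degree, best_coeffs, best_val_mse = degree, coeffs, val_mse

# Test set: a single evaluation of the model that was finally chosen
test_mse = mse(best_coeffs, x[test], y[test])
print("chosen degree:", best_degree)
print("validation MSE:", best_val_mse, "test MSE:", test_mse)
```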

Split Into Train and Test Data Sets
Initial data set = Train data set + Test data set.

Example
Train data set (70 %) + Test data set (30 %) = Initial data set (total data set)
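
One common way to make such a split is scikit-learn's `train_test_split`. The sketch below uses a toy array of 20 rows, and the `random_state=0` seed is an assumption added only to make the shuffle repeatable.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy data (assumed for illustration): 20 samples with one feature each
X = np.arange(20).reshape(-1, 1)
y = np.arange(20)

# test_size=0.3 keeps 30 % of the rows for testing, the rest for training
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0
)

print(len(X_train), len(X_test))  # 14 6
```

The rows are shuffled before splitting by default, so the test rows are a random sample rather than the last 30 % of the array.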

The following program generates a synthetic data set of 20 points, fits a linear regression model to it, and plots the fitted line.

Python Program

import numpy as npobj
import matplotlib.pyplot as pltobj
from sklearn.linear_model import LinearRegression

# generate 20 noisy samples of a cubic function
npobj.random.seed(2)
x = 2 - 3 * npobj.random.normal(0, 1, 20)
y = x - 2 * (x ** 2) + 0.5 * (x ** 3) + npobj.random.normal(-3, 3, 20)

# transform data to include another axis
x = x[:, npobj.newaxis]
y = y[:, npobj.newaxis]

# fit the model and predict on the same points
model = LinearRegression()
model.fit(x, y)
y_pred = model.predict(x)

# plot the data points and the fitted line
pltobj.scatter(x, y, s=10)
pltobj.plot(x, y_pred, color='r')
pltobj.show()
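
The program above fits and predicts on the same 20 points. A possible variant that holds out 30 % of them for testing is sketched below. It is not part of the original program: the use of `train_test_split`, the `random_state=0` seed and mean squared error as the metric are assumptions, and it uses the conventional `np` alias instead of `npobj`.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

# Same synthetic data as in the program above
np.random.seed(2)
x = 2 - 3 * np.random.normal(0, 1, 20)
y = x - 2 * (x ** 2) + 0.5 * (x ** 3) + np.random.normal(-3, 3, 20)
x = x[:, np.newaxis]  # scikit-learn expects a 2-D feature array

# Hold out 30 % of the points; the model never sees them while fitting
x_train, x_test, y_train, y_test = train_test_split(
    x, y, test_size=0.3, random_state=0
)

model = LinearRegression()
model.fit(x_train, y_train)

# Compare the error on the data used for fitting with the held-out error
train_mse = mean_squared_error(y_train, model.predict(x_train))
test_mse = mean_squared_error(y_test, model.predict(x_test))
print("train MSE:", train_mse)
print("test MSE:", test_mse)
```

The test error is the more honest estimate of how the model behaves on unseen data. With only 6 test points it will vary noticeably from one split to another.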
