Machine learning / Data Sets for Testing and Training

Model
A model is a representation of a real-world process. Once fitted, it is used to make predictions on new data, such as the test data.

Three data sets are used at different stages of creating the model. They are:
Data Set Name          Used to
Training Data Set      Fit the model's parameters
Validation Data Set    Tune hyperparameters and detect overfitting during training
Test Data Set          Evaluate the final model

Training dataset
It is used to fit the parameters of the model.

Validation dataset
The validation dataset provides an unbiased evaluation of a model fit on the training dataset while the model's hyperparameters are being tuned. It is also used for early stopping: training is halted when the error on the validation dataset starts to increase, because that is a sign of overfitting to the training dataset.
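
The early-stopping idea can be sketched as follows. This is a minimal illustration and not a definitive implementation: the synthetic data, the plain gradient-descent loop, and the names learning_rate, max_epochs and patience are all assumptions made for the example.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data (an assumption for illustration): 30 features, only 3 informative.
n_features = 30
true_w = np.zeros(n_features)
true_w[:3] = [2.0, -1.0, 0.5]
X = rng.normal(size=(60, n_features))
y = X @ true_w + rng.normal(scale=1.0, size=60)

# Hold out the last 20 observations as the validation set.
X_train, y_train = X[:40], y[:40]
X_val, y_val = X[40:], y[40:]

w = np.zeros(n_features)
learning_rate = 0.01   # hypothetical step size
max_epochs = 2000      # hypothetical upper bound on training length
patience = 10          # epochs to wait for an improvement before stopping

best_val_mse = np.inf
best_w = w.copy()
epochs_without_improvement = 0
val_history = []

for epoch in range(max_epochs):
    # One gradient-descent step on the training mean squared error.
    gradient = 2 * X_train.T @ (X_train @ w - y_train) / len(y_train)
    w = w - learning_rate * gradient

    # Monitor the error on the validation set, which is never used for fitting.
    val_mse = float(np.mean((X_val @ w - y_val) ** 2))
    val_history.append(val_mse)

    if val_mse < best_val_mse:
        best_val_mse = val_mse
        best_w = w.copy()
        epochs_without_improvement = 0
    else:
        epochs_without_improvement += 1
        if epochs_without_improvement >= patience:
            break  # validation error stopped improving: likely overfitting

print(f"stopped after {len(val_history)} epochs, best validation MSE = {best_val_mse:.3f}")
```

The weights kept in best_w are the ones from the epoch with the lowest validation error, not the ones from the last epoch.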

Test / holdout dataset
It is used to provide an unbiased evaluation of a final model fit on the training dataset.

Split Into Train and Test Data Sets
Initial data set = Train Data Set + Test Data Set.

Example
Train Data Set (70 %) + Test Data Set (30 %) = Initial Data Set (Total Data Set)
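
A 70/30 split like this one could be produced with scikit-learn's train_test_split. The sketch below assumes a made-up data set of 100 observations; the array contents and the random_state value are placeholders chosen for the example.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Placeholder data (an assumption): 100 observations with 2 features each.
X = np.arange(200).reshape(100, 2)
y = np.arange(100)

# test_size=0.30 keeps 30 % of the rows for testing and 70 % for training.
# random_state fixes the shuffle so the split is reproducible.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.30, random_state=42
)

print(X_train.shape, X_test.shape)  # (70, 2) (30, 2)
```

The rows are shuffled before splitting by default, so the test set is a random sample of the initial data set and not simply its last 30 rows.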

Apply a linear regression model to this dataset
Python Program
import numpy as npobj
import matplotlib.pyplot as pltobj
from sklearn.linear_model import LinearRegression
npobj.random.seed(2)
x = 2 - 3 * npobj.random.normal(0, 1, 20)
y = x - 2 * (x ** 2) + 0.5 * (x ** 3) + npobj.random.normal(-3, 3, 20)
# transform data to include another axis
x = x[:, npobj.newaxis]
y = y[:, npobj.newaxis]
model = LinearRegression()
# fit the model, then predict on the same inputs
model.fit(x, y)
y_pred = model.predict(x)
pltobj.scatter(x , y, s=10)
pltobj.plot(x , y_pred, color='r')
pltobj.show()
Output: a scatter plot of the data points with the fitted regression line drawn in red.
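
The program above fits and predicts on the full data set. To tie it back to the train/test split, the variant below holds out 30 % of the same synthetic data and reports the error on both parts. It is only a sketch: the use of mean squared error as the metric and the random_state value are assumptions, and the standard np alias replaces npobj.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

# Same synthetic data as in the program above.
np.random.seed(2)
x = 2 - 3 * np.random.normal(0, 1, 20)
y = x - 2 * (x ** 2) + 0.5 * (x ** 3) + np.random.normal(-3, 3, 20)
x = x[:, np.newaxis]  # scikit-learn expects a 2-D feature array

# Hold out 30 % of the observations as the test data set.
x_train, x_test, y_train, y_test = train_test_split(
    x, y, test_size=0.30, random_state=0
)

# Fit on the training data only.
model = LinearRegression()
model.fit(x_train, y_train)

# Compare the error on data the model has seen with data it has not seen.
train_mse = mean_squared_error(y_train, model.predict(x_train))
test_mse = mean_squared_error(y_test, model.predict(x_test))
print(f"train MSE = {train_mse:.2f}, test MSE = {test_mse:.2f}")
```

With only 20 observations the two errors can differ a lot from one split to another, so the numbers should be read as an illustration of the procedure and not as a reliable estimate.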

