Train-Test Split Explained Simply
Train-test split is one of the most important concepts in machine learning. It helps us check whether a model can make
good predictions on new, unseen data instead of only memorising the data it has already seen.
What is Train-Test Split?
Train-test split means dividing a dataset into two parts:
| Dataset Part | Purpose |
| --- | --- |
| Training Set | Used to train the model |
| Testing Set | Used to evaluate the model on unseen data |
Train-test split = train the model on one part and test it on another part
Why Do We Need It?
If we train and test a model on the same data, the model may appear very accurate because it has already seen the answers.
This does not prove that the model will work well in real life.
Simple Analogy
If a student only practises past exam questions and then gets tested on the same questions, the result may look excellent.
But the real test is whether the student can answer new questions.
Key idea: Testing data checks how well the model performs on new examples.
Training Data vs Testing Data
| Training Data | Testing Data |
| --- | --- |
| Used for learning patterns | Used for checking performance |
| The model sees this data | The model has not seen this data |
| Helps the model learn | Helps measure generalisation |
What is Generalisation?
Generalisation means the model can perform well on new data, not just the data used during training.
Example
A customer purchase model should correctly predict behaviour for future customers, not only the customers in the training dataset.
Good model = good performance on unseen data
Common Split Ratios
A common approach is to use most of the data for training and the rest for testing.
| Split Ratio | Meaning | Common Use |
| --- | --- | --- |
| 80% / 20% | 80% training, 20% testing | Very common choice |
| 70% / 30% | 70% training, 30% testing | Useful when a larger test set is wanted |
| 75% / 25% | 75% training, 25% testing | Balanced option |
Simple rule: 80/20 is a good starting point for many beginner machine learning projects.
Train-Test Split in Python
Scikit-learn provides a simple function called train_test_split.
```python
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(
    X,
    y,
    test_size=0.2,
    random_state=42
)
```
This means 20% of the data will be used for testing and 80% will be used for training.
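As a sketch of what the split produces, using a tiny synthetic dataset (the numbers here are made up purely for illustration):

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy dataset: 10 samples, 2 features (values are arbitrary)
X = np.arange(20).reshape(10, 2)
y = np.array([0, 1, 0, 1, 0, 1, 0, 1, 0, 1])

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# With 10 rows and test_size=0.2, 8 rows train and 2 rows test
print(len(X_train), len(X_test))  # 8 2
```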
What Do X and y Mean?
| Term | Meaning |
| --- | --- |
| X | Features / input columns |
| y | Target / output column |
| X_train | Features used for training |
| X_test | Features used for testing |
| y_train | Correct answers for training data |
| y_test | Correct answers for testing data |
Simple Example
Suppose we are building a model to predict whether a customer will purchase a product.
- Features: Age, EstimatedSalary
- Target: Purchased (0 or 1)
```python
X = df[["Age", "EstimatedSalary"]]
y = df["Purchased"]

X_train, X_test, y_train, y_test = train_test_split(
    X,
    y,
    test_size=0.2,
    random_state=42
)
```
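The snippet above assumes a DataFrame df already exists. A self-contained sketch with hypothetical toy data (the column values below are invented for illustration) might look like this:

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical toy data standing in for a real customer dataset
df = pd.DataFrame({
    "Age": [22, 25, 47, 52, 46, 56, 55, 60, 62, 61],
    "EstimatedSalary": [15000, 29000, 48000, 60000, 52000,
                        83000, 78000, 90000, 95000, 88000],
    "Purchased": [0, 0, 1, 1, 0, 1, 1, 1, 1, 1],
})

X = df[["Age", "EstimatedSalary"]]
y = df["Purchased"]

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# 10 rows, test_size=0.2 -> 8 training rows, 2 testing rows
print(X_train.shape, X_test.shape)  # (8, 2) (2, 2)
```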
What is random_state?
The random_state value makes the split repeatable.
Without it, the data is shuffled differently each run, so you get a different split every time you execute the code.
Why Use random_state=42?
It lets you and anyone else running your code get exactly the same train-test split and compare results easily.
Tip: random_state does not have to be 42. It just needs to be fixed if you want repeatable results.
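One way to see what random_state does is to run the split twice with the same seed and confirm the results match. This sketch assumes a tiny synthetic dataset:

```python
import numpy as np
from sklearn.model_selection import train_test_split

X = np.arange(10).reshape(10, 1)
y = np.arange(10)

# Same random_state -> identical splits on every run
a_train, a_test, _, _ = train_test_split(X, y, test_size=0.2, random_state=42)
b_train, b_test, _, _ = train_test_split(X, y, test_size=0.2, random_state=42)

print((a_test == b_test).all())  # True
```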
Stratified Split for Classification
In classification problems, it is often useful to keep the same class proportions in both training and testing sets.
Example
If 70% of customers did not purchase and 30% did purchase, we may want both train and test sets to keep a similar balance.
```python
X_train, X_test, y_train, y_test = train_test_split(
    X,
    y,
    test_size=0.2,
    random_state=42,
    stratify=y
)
```
Useful for: classification datasets where class balance matters.
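To illustrate, the following sketch uses made-up imbalanced labels (70% class 0, 30% class 1) and checks that stratify=y preserves the balance in both subsets:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Imbalanced toy labels: 70% class 0, 30% class 1
y = np.array([0] * 70 + [1] * 30)
X = np.arange(100).reshape(100, 1)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

# Both subsets keep the 70/30 class balance
print(y_train.mean(), y_test.mean())  # 0.3 0.3
```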
Important Note for Time Series Data
For time series data, we usually do not randomly split the data because time order matters.
Correct Idea
Train on past data and test on future data.
Time series split = past data for training, future data for testing
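A minimal sketch of this idea, using a toy ordered series as a stand-in for time-stamped data:

```python
import numpy as np

# Toy ordered series: the first 80% (oldest) trains, the last 20% (newest) tests
series = np.arange(100)
split_point = int(len(series) * 0.8)

train, test = series[:split_point], series[split_point:]

# Every training value comes before every testing value
print(train[-1], test[0])  # 79 80
```

scikit-learn also provides TimeSeriesSplit in sklearn.model_selection for cross-validation that respects time order.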
Common Beginner Mistakes
- Training and testing on the same data
- Forgetting to split before evaluating the model
- Using random split for time series data
- Not using stratify for imbalanced classification data
- Comparing models using different splits
Remember: The test set should represent data the model has not seen before.
Why It Matters in Machine Learning
Train-test split helps us measure whether a model is useful in real-world situations.
A model that performs well only on training data may fail when used with new data.
| Result | Possible Meaning |
| --- | --- |
| Good training performance, poor test performance | Possible overfitting |
| Poor training and test performance | Possible underfitting |
| Good training and test performance | Model generalises well |
Quick Practice
A dataset has 1,000 rows. You use an 80/20 train-test split.
Question: How many rows are used for training and testing?
Answer:
- Training rows: 800
- Testing rows: 200
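The arithmetic can be checked in code; this sketch uses a dummy 1,000-row dataset:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Dummy dataset with 1,000 rows
X = np.zeros((1000, 1))
y = np.zeros(1000)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

print(len(X_train), len(X_test))  # 800 200
```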
Key Takeaway
Train-test split allows us to train a model on one part of the data and evaluate it on another part.
This helps us understand whether the model can generalise to new, unseen data.
Simple rule: Train on one part, test on another part.
Want to Learn More?
Explore our practical courses in Data Analysis, Machine Learning and AI to apply train-test split in real-world projects.