Train-Test Split Explained Simply

Train-test split is one of the most important concepts in machine learning. It helps us check whether a model can make good predictions on new, unseen data instead of only memorising the data it has already seen.

What is Train-Test Split?

Train-test split means dividing a dataset into two parts:

| Dataset Part | Purpose |
|---|---|
| Training Set | Used to train the model |
| Testing Set | Used to evaluate the model on unseen data |

Train-test split = train the model on one part and test it on another part

Why Do We Need It?

If we train and test a model on the same data, the model may appear very accurate because it has already seen the answers. This does not prove that the model will work well in real life.

Simple Analogy

If a student only practises past exam questions and then gets tested on the same questions, the result may look excellent. But the real test is whether the student can answer new questions.

Key idea: Testing data checks how well the model performs on new examples.
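
The gap between seen and unseen data can be illustrated with a small sketch. It assumes synthetic data from scikit-learn's make_classification and a 1-nearest-neighbour classifier, chosen here only because it effectively memorises the rows it is fitted on. Neither is part of the original example.

```python
from sklearn.datasets import make_classification
from sklearn.neighbors import KNeighborsClassifier

# Synthetic data (an assumption for illustration); flip_y adds label noise
X, y = make_classification(
    n_samples=500, n_features=10, flip_y=0.2, random_state=0
)

# Hold out the last 100 rows; make_classification already shuffles the rows
X_seen, y_seen = X[:400], y[:400]
X_new, y_new = X[400:], y[400:]

# A 1-nearest-neighbour model effectively memorises the rows it was fitted on
model = KNeighborsClassifier(n_neighbors=1)
model.fit(X_seen, y_seen)

seen_accuracy = model.score(X_seen, y_seen)  # scored on rows it has already seen
new_accuracy = model.score(X_new, y_new)     # scored on rows it has never seen

print(f"Accuracy on seen data:   {seen_accuracy:.2f}")
print(f"Accuracy on unseen data: {new_accuracy:.2f}")
```

The score on the seen rows looks perfect because the model is being asked about answers it already stored. The score on the held-out rows is the more honest estimate.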

Training Data vs Testing Data

| Training Data | Testing Data |
|---|---|
| Used for learning patterns | Used for checking performance |
| The model sees this data | The model has not seen this data |
| Helps model learn | Helps measure generalisation |

What is Generalisation?

Generalisation means the model can perform well on new data, not just the data used during training.

Example

A customer purchase model should correctly predict behaviour for future customers, not only the customers in the training dataset.

Good model = good performance on unseen data

Common Split Ratios

A common approach is to use most of the data for training and the rest for testing.

| Split Ratio | Meaning | Common Use |
|---|---|---|
| 80% / 20% | 80% training, 20% testing | Very common choice |
| 70% / 30% | 70% training, 30% testing | Useful when you want a larger test set for a steadier performance estimate |
| 75% / 25% | 75% training, 25% testing | Balanced option |

Simple rule: 80/20 is a good starting point for many beginner machine learning projects.
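
To see what these ratios mean in row counts, here is a minimal sketch using scikit-learn's train_test_split on placeholder data. The 1,000-row array is made up for illustration.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Placeholder data: 1,000 rows with one feature and a dummy target
X = np.arange(1000).reshape(-1, 1)
y = np.zeros(1000)

split_sizes = {}
for test_size in (0.2, 0.25, 0.3):
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=test_size, random_state=42
    )
    # Record how many rows ended up in each part
    split_sizes[test_size] = (len(X_train), len(X_test))
    print(f"test_size={test_size}: {len(X_train)} training rows, {len(X_test)} testing rows")
```

The test_size argument is the fraction kept for testing, so an 80/20 split is written as test_size=0.2.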

Train-Test Split in Python

Scikit-learn provides a simple function called train_test_split.

from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(
    X,
    y,
    test_size=0.2,
    random_state=42
)

This means 20% of the data will be used for testing and 80% will be used for training.

What Do X and y Mean?

| Term | Meaning |
|---|---|
| X | Features / input columns |
| y | Target / output column |
| X_train | Features used for training |
| X_test | Features used for testing |
| y_train | Correct answers for training data |
| y_test | Correct answers for testing data |

Simple Example

Suppose we are building a model to predict whether a customer will purchase a product.

Features

  • Age
  • Estimated Salary

Target

Purchased: 0 (did not purchase) or 1 (purchased)

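from sklearn.model_selection import train_test_split

# df is assumed to be a pandas DataFrame that is already loaded
# and contains the three columns used below.
X = df[["Age", "EstimatedSalary"]]  # features (inputs)
y = df["Purchased"]                 # target (0 or 1)

# Hold out 20% of the rows for testing; fix the seed for a repeatable split.
X_train, X_test, y_train, y_test = train_test_split(
    X,
    y,
    test_size=0.2,
    random_state=42
)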
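
For a version that runs from start to finish, the sketch below builds a small made-up DataFrame with the same column names, splits it, and scores a model on the held-out rows. The data, the rule used to generate the Purchased labels, and the choice of a shallow decision tree are all assumptions for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Made-up customer data for illustration only (not a real dataset)
rng = np.random.default_rng(0)
df = pd.DataFrame({
    "Age": rng.integers(18, 65, size=200),
    "EstimatedSalary": rng.integers(20_000, 120_000, size=200),
})
# Hypothetical rule used only to generate labels for this sketch
df["Purchased"] = ((df["Age"] > 40) | (df["EstimatedSalary"] > 90_000)).astype(int)

X = df[["Age", "EstimatedSalary"]]
y = df["Purchased"]

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Fit on the training rows only, then score on the held-out rows
model = DecisionTreeClassifier(max_depth=3, random_state=42)
model.fit(X_train, y_train)
test_accuracy = model.score(X_test, y_test)

print(X_train.shape, X_test.shape)
print(f"Test accuracy: {test_accuracy:.2f}")
```

The model never sees X_test or y_test during fit, which is what makes the final score a check on unseen data.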

What is random_state?

The random_state value makes the split repeatable. Without it, scikit-learn shuffles the rows differently on each run, so the training and testing sets change every time you run the code.

Why Use random_state=42?

It allows you and your teacher to get the same train-test split and compare results easily.

Tip: random_state does not have to be 42. It just needs to be fixed if you want repeatable results.
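
A quick way to check this behaviour is to split the same data several times and compare the results. The sketch below uses placeholder arrays, and the second seed value (7) is arbitrary.

```python
import numpy as np
from sklearn.model_selection import train_test_split

X = np.arange(100).reshape(-1, 1)  # placeholder features: 100 rows
y = np.arange(100)                 # placeholder target

# Same random_state twice: the two splits are identical
_, X_test_a, _, _ = train_test_split(X, y, test_size=0.2, random_state=42)
_, X_test_b, _, _ = train_test_split(X, y, test_size=0.2, random_state=42)
same_seed_matches = np.array_equal(X_test_a, X_test_b)

# A different random_state: the rows are shuffled differently
_, X_test_c, _, _ = train_test_split(X, y, test_size=0.2, random_state=7)
different_seed_matches = np.array_equal(X_test_a, X_test_c)

print("Same seed, same test rows:", same_seed_matches)
print("Different seed, same test rows:", different_seed_matches)
```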

Stratified Split for Classification

In classification problems, it is often useful to keep the same class proportions in both training and testing sets.

Example

If 70% of customers did not purchase and 30% did purchase, we may want both train and test sets to keep a similar balance.

X_train, X_test, y_train, y_test = train_test_split(
    X,
    y,
    test_size=0.2,
    random_state=42,
    stratify=y
)

Useful for: classification datasets where class balance matters.
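
The 70% / 30% case can be checked with a short sketch. The arrays below are placeholders built to match those proportions.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Placeholder data: 70 non-purchasers (0) and 30 purchasers (1)
y = np.array([0] * 70 + [1] * 30)
X = np.arange(100).reshape(-1, 1)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

# Share of purchasers (class 1) in each part
train_share = y_train.mean()
test_share = y_test.mean()
print(f"Purchasers in training set: {train_share:.0%}")
print(f"Purchasers in testing set:  {test_share:.0%}")
```

With stratify=y, both parts keep the 30% share of purchasers. Without it, the share in a small test set can drift away from 30% by chance.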

Important Note for Time Series Data

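For time series data, we usually do not split the data randomly because time order matters. A random split would let the model train on rows that come after some of the test rows, which can make the test result look better than it would be in real use.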

Correct Idea

Train on past data and test on future data.

Time series split = past data for training, future data for testing
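
One simple way to do this, sketched below on a made-up daily series, is to sort the rows by date and pass shuffle=False to train_test_split so the last rows become the test set. The column names date and sales are hypothetical.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Made-up daily series for illustration: 100 consecutive days
df = pd.DataFrame({
    "date": pd.date_range("2024-01-01", periods=100, freq="D"),
    "sales": np.arange(100),
})
df = df.sort_values("date")  # make sure rows are in time order

# shuffle=False keeps the row order: first 80% for training, last 20% for testing
train_df, test_df = train_test_split(df, test_size=0.2, shuffle=False)

print("Train:", train_df["date"].min().date(), "to", train_df["date"].max().date())
print("Test: ", test_df["date"].min().date(), "to", test_df["date"].max().date())
```

Every training date comes before every testing date, so the model is evaluated only on data from its own future.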

Common Beginner Mistakes

  • Training and testing on the same data
  • Forgetting to split the data before training the model
  • Using random split for time series data
  • Not using stratify for imbalanced classification data
  • Comparing models using different splits

Remember: The test set should represent data the model has not seen before.
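
To avoid the last mistake in the list, split once and reuse the same training and testing sets for every model. The sketch below assumes synthetic data and two arbitrary example models; neither comes from the original example.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

# Synthetic classification data (an assumption for illustration)
X, y = make_classification(n_samples=500, n_features=10, random_state=0)

# Split once, then reuse the same four objects for every model
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

models = {
    "k-nearest neighbours": KNeighborsClassifier(n_neighbors=5),
    "decision tree": DecisionTreeClassifier(max_depth=3, random_state=42),
}

scores = {}
for name, model in models.items():
    model.fit(X_train, y_train)                 # same training rows for each model
    scores[name] = model.score(X_test, y_test)  # same testing rows for each model
    print(f"{name}: test accuracy {scores[name]:.2f}")
```

Because both models are scored on identical test rows, a difference in accuracy reflects the models and not the luck of the split.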

Why It Matters in Machine Learning

Train-test split helps us measure whether a model is useful in real-world situations. A model that performs well only on training data may fail when used with new data.

| Result | Possible Meaning |
|---|---|
| Good training performance, poor test performance | Possible overfitting |
| Poor training and test performance | Possible underfitting |
| Good training and test performance | Model generalises well |
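
The table can be applied by printing the training and testing scores side by side. The sketch below does this for decision trees of different depths on synthetic, noisy data. The data and the depth values are assumptions for illustration, and which row of the table each depth falls into depends on the dataset.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic, noisy data (an assumption for illustration)
X, y = make_classification(
    n_samples=1000, n_features=20, flip_y=0.2, random_state=0
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

results = {}
for depth in (1, 4, None):  # None places no limit on how deep the tree can grow
    tree = DecisionTreeClassifier(max_depth=depth, random_state=42)
    tree.fit(X_train, y_train)
    # Store (training accuracy, testing accuracy) for this depth
    results[depth] = (tree.score(X_train, y_train), tree.score(X_test, y_test))
    print(f"max_depth={depth}: train={results[depth][0]:.2f}, test={results[depth][1]:.2f}")
```

The unlimited tree fits its training rows perfectly here, so a clearly lower test score for it is the overfitting pattern from the first row of the table.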

Quick Practice

A dataset has 1,000 rows. You use an 80/20 train-test split.

Question: How many rows are used for training and testing?

Answer:

  • Training rows: 800
  • Testing rows: 200

Key Takeaway

Train-test split allows us to train a model on one part of the data and evaluate it on another part. This helps us understand whether the model can generalise to new, unseen data.

Simple rule: Train on one part, test on another part.

2012 - 2026 © London Academy of IT Limited. All Rights Reserved.
UKPRN: 10045491. Registered in England & Wales with company no. 07923992.