Train-Test Split Explained Simply
Train-test split is one of the most important concepts in machine learning. It helps us check whether a model can make
good predictions on new, unseen data instead of only memorising the data it has already seen.
What is Train-Test Split?
Train-test split means dividing a dataset into two parts:
| Dataset Part | Purpose |
| --- | --- |
| Training Set | Used to train the model |
| Testing Set | Used to evaluate the model on unseen data |
Train-test split = train the model on one part and test it on another part
Why Do We Need It?
If we train and test a model on the same data, the model may appear very accurate because it has already seen the answers.
This does not prove that the model will work well in real life.
Simple Analogy
If a student only practises past exam questions and then gets tested on the same questions, the result may look excellent.
But the real test is whether the student can answer new questions.
Key idea: Testing data checks how well the model performs on new examples.
Training Data vs Testing Data
| Training Data | Testing Data |
| --- | --- |
| Used for learning patterns | Used for checking performance |
| The model sees this data | The model has not seen this data |
| Helps the model learn | Helps measure generalisation |
What is Generalisation?
Generalisation means the model can perform well on new data, not just the data used during training.
Example
A customer purchase model should correctly predict behaviour for future customers, not only the customers in the training dataset.
Good model = good performance on unseen data
Common Split Ratios
A common approach is to use most of the data for training and the rest for testing.
| Split Ratio | Meaning | Common Use |
| --- | --- | --- |
| 80% / 20% | 80% training, 20% testing | Very common choice |
| 70% / 30% | 70% training, 30% testing | Useful when a larger test set is wanted |
| 75% / 25% | 75% training, 25% testing | Balanced option |
Simple rule: 80/20 is a good starting point for many beginner machine learning projects.
Train-Test Split in Python
Scikit-learn provides a simple function called train_test_split.
```python
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(
    X,
    y,
    test_size=0.2,
    random_state=42
)
```
This means 20% of the data will be used for testing and 80% will be used for training.
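As a sketch of what the split produces, using a tiny synthetic dataset (the numbers here are made up purely for illustration):

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy dataset: 10 samples, 2 features (values are arbitrary)
X = np.arange(20).reshape(10, 2)
y = np.array([0, 1, 0, 1, 0, 1, 0, 1, 0, 1])

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# With 10 rows and test_size=0.2, 8 rows train and 2 rows test
print(len(X_train), len(X_test))  # 8 2
```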
What Do X and y Mean?
| Term | Meaning |
| --- | --- |
| X | Features / input columns |
| y | Target / output column |
| X_train | Features used for training |
| X_test | Features used for testing |
| y_train | Correct answers for training data |
| y_test | Correct answers for testing data |
Simple Example
Suppose we are building a model to predict whether a customer will purchase a product.
- Features: Age, EstimatedSalary
- Target: Purchased (0 or 1)
```python
X = df[["Age", "EstimatedSalary"]]
y = df["Purchased"]

X_train, X_test, y_train, y_test = train_test_split(
    X,
    y,
    test_size=0.2,
    random_state=42
)
```
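The snippet above assumes a DataFrame df already exists. A self-contained sketch with hypothetical toy data (the column values below are invented for illustration) might look like this:

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical toy data standing in for a real customer dataset
df = pd.DataFrame({
    "Age": [22, 25, 47, 52, 46, 56, 55, 60, 62, 61],
    "EstimatedSalary": [15000, 29000, 48000, 60000, 52000,
                        83000, 78000, 90000, 95000, 88000],
    "Purchased": [0, 0, 1, 1, 0, 1, 1, 1, 1, 1],
})

X = df[["Age", "EstimatedSalary"]]
y = df["Purchased"]

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# 10 rows, test_size=0.2 -> 8 training rows, 2 testing rows
print(X_train.shape, X_test.shape)  # (8, 2) (2, 2)
```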
What is random_state?
The random_state value makes the split repeatable.
Without it, the data is shuffled differently each run, so you get a different split every time you execute the code.
Why Use random_state=42?
It lets you and anyone else running your code get exactly the same train-test split and compare results easily.
Tip: random_state does not have to be 42. It just needs to be fixed if you want repeatable results.
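One way to see what random_state does is to run the split twice with the same seed and confirm the results match. This sketch assumes a tiny synthetic dataset:

```python
import numpy as np
from sklearn.model_selection import train_test_split

X = np.arange(10).reshape(10, 1)
y = np.arange(10)

# Same random_state -> identical splits on every run
a_train, a_test, _, _ = train_test_split(X, y, test_size=0.2, random_state=42)
b_train, b_test, _, _ = train_test_split(X, y, test_size=0.2, random_state=42)

print((a_test == b_test).all())  # True
```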
Stratified Split for Classification
In classification problems, it is often useful to keep the same class proportions in both training and testing sets.
Example
If 70% of customers did not purchase and 30% did purchase, we may want both train and test sets to keep a similar balance.
```python
X_train, X_test, y_train, y_test = train_test_split(
    X,
    y,
    test_size=0.2,
    random_state=42,
    stratify=y
)
```
Useful for: classification datasets where class balance matters.
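To illustrate, the following sketch uses made-up imbalanced labels (70% class 0, 30% class 1) and checks that stratify=y preserves the balance in both subsets:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Imbalanced toy labels: 70% class 0, 30% class 1
y = np.array([0] * 70 + [1] * 30)
X = np.arange(100).reshape(100, 1)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

# Both subsets keep the 70/30 class balance
print(y_train.mean(), y_test.mean())  # 0.3 0.3
```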
Important Note for Time Series Data
For time series data, we usually do not randomly split the data because time order matters.
Correct Idea
Train on past data and test on future data.
Time series split = past data for training, future data for testing
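A minimal sketch of this idea, using a toy ordered series as a stand-in for time-stamped data:

```python
import numpy as np

# Toy ordered series: the first 80% (oldest) trains, the last 20% (newest) tests
series = np.arange(100)
split_point = int(len(series) * 0.8)

train, test = series[:split_point], series[split_point:]

# Every training value comes before every testing value
print(train[-1], test[0])  # 79 80
```

scikit-learn also provides TimeSeriesSplit in sklearn.model_selection for cross-validation that respects time order.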
Common Beginner Mistakes
- Training and testing on the same data
- Forgetting to split before evaluating the model
- Using random split for time series data
- Not using stratify for imbalanced classification data
- Comparing models using different splits
Remember: The test set should represent data the model has not seen before.
Why It Matters in Machine Learning
Train-test split helps us measure whether a model is useful in real-world situations.
A model that performs well only on training data may fail when used with new data.
| Result | Possible Meaning |
| --- | --- |
| Good training performance, poor test performance | Possible overfitting |
| Poor training and test performance | Possible underfitting |
| Good training and test performance | Model generalises well |
Quick Practice
A dataset has 1,000 rows. You use an 80/20 train-test split.
Question: How many rows are used for training and testing?
Answer:
- Training rows: 800
- Testing rows: 200
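The arithmetic can be checked in code; this sketch uses a dummy 1,000-row dataset:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Dummy dataset with 1,000 rows
X = np.zeros((1000, 1))
y = np.zeros(1000)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

print(len(X_train), len(X_test))  # 800 200
```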
Key Takeaway
Train-test split allows us to train a model on one part of the data and evaluate it on another part.
This helps us understand whether the model can generalise to new, unseen data.
Simple rule: Train on one part, test on another part.
Want to Learn More?
Explore our practical courses in Data Analysis, Machine Learning and AI to apply train-test split in real-world projects.