Data Distribution Explained Simply
Data distribution shows how values are spread across a dataset. Understanding distributions helps us identify patterns,
unusual values, skewness and the overall behaviour of data before using it for analysis or machine learning.
What is Data Distribution?
Data distribution describes how often different values appear in a dataset. It helps us understand whether values are
grouped together, spread out, balanced or concentrated on one side.
Simple Meaning
If we collect exam scores from students, the distribution shows how many students scored low, medium or high marks.
Distribution = how values are spread across the dataset
Why Data Distribution Matters
Before building charts, reports or machine learning models, we need to understand the shape of the data.
Distribution helps us answer important questions.
- Are most values low, medium or high?
- Are there unusual values or outliers?
- Is the data balanced or skewed?
- Does the data look consistent?
- Which analysis or model may be suitable?
Key idea: Understanding distribution helps us understand the behaviour of data.
Example: Student Scores
Imagine the following exam scores:
45, 50, 55, 60, 65, 70, 75, 80, 85
These values are evenly spaced from 45 to 85, covering low, medium and high scores. Looking at the distribution makes this pattern easier to see.
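To check this numerically, the following sketch summarises the nine scores above with pandas. The variable names are illustrative only.

```python
import pandas as pd

# The nine exam scores from the example above.
scores = pd.Series([45, 50, 55, 60, 65, 70, 75, 80, 85])

mean_score = scores.mean()
median_score = scores.median()
score_range = scores.max() - scores.min()

print("Mean:", mean_score)
print("Median:", median_score)
print("Range:", score_range)
```

The mean and the median are both 65 here, which is what we would expect when values sit evenly around the centre.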
Histogram: The Most Common Way to View Distribution
A histogram shows how many values fall into different ranges. It is one of the best charts for understanding distribution.
| Score Range | Number of Students |
| --- | --- |
| 40–50 | 2 |
| 51–60 | 3 |
| 61–70 | 5 |
| 71–80 | 4 |
| 81–90 | 1 |
Simple rule: Use a histogram when you want to see the shape of numerical data.
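As a rough sketch of how a table like this could be produced, the code below groups the nine scores from the earlier example into the same ranges with `pandas.cut`. The bin edges and labels are assumptions chosen to mirror the ranges in the table, and the resulting counts differ from the table because the data is different.

```python
import pandas as pd

# The nine exam scores from the earlier example.
scores = pd.Series([45, 50, 55, 60, 65, 70, 75, 80, 85])

# Assumed bin edges: each bin includes its upper edge, so 50 falls in "40-50".
edges = [40, 50, 60, 70, 80, 90]
labels = ["40-50", "51-60", "61-70", "71-80", "81-90"]

# Assign each score to a range, then count how many scores fall in each range.
binned = pd.cut(scores, bins=edges, labels=labels, include_lowest=True)
counts = binned.value_counts(sort=False)

print(counts)
```

A histogram is a picture of these counts: one bar per range, with the bar height equal to the count.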
Normal Distribution
A normal distribution is a balanced, bell-shaped distribution where most values are close to the average and fewer values
appear at the extremes.
Example
Many exam scores may cluster around the average, with fewer very low or very high scores.
| Feature | Meaning |
| --- | --- |
| Balanced shape | Values are spread symmetrically around the centre |
| Mean near the centre | The average represents the data well |
| Few extreme values | Very low and very high values are less common |
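These features can be checked on simulated data. The sketch below draws hypothetical scores from a normal distribution with NumPy; the mean of 70, the standard deviation of 10 and the sample size are all assumptions made for illustration.

```python
import numpy as np

# Hypothetical data: 10,000 simulated scores (mean 70, standard deviation 10).
rng = np.random.default_rng(seed=42)
scores = rng.normal(loc=70, scale=10, size=10_000)

mean = scores.mean()
median = np.median(scores)
std = scores.std()

# Share of values that lie within one standard deviation of the mean.
within_one_std = np.mean(np.abs(scores - mean) <= std)

print(f"Mean: {mean:.2f}")
print(f"Median: {median:.2f}")
print(f"Share within one standard deviation: {within_one_std:.2f}")
```

For normally distributed data the mean and median should be close together, and roughly 68% of values should fall within one standard deviation of the mean.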
Skewed Distribution
A skewed distribution is not balanced. Most values may be concentrated on one side, while a small number of values stretch
toward the other side.
| Type | Meaning | Example |
| --- | --- | --- |
| Right-skewed | Most values are low, with a few very high values | Income data |
| Left-skewed | Most values are high, with a few very low values | Easy exam scores |
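Important: Skewed data can make the mean misleading, because a few extreme values pull it toward the tail. The median is often a better guide to the typical value.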
Distribution and Outliers
Distribution also helps us identify outliers. Outliers are values that are very different from the rest of the data.
Example
Monthly sales: £4,800, £5,000, £5,100, £5,200, £25,000
The value £25,000 is much higher than the others. A histogram or box plot can help reveal this unusual value.
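One common way to flag such values in code is the 1.5 × IQR rule, which is also the convention many box plots use for drawing outlier points. The sketch below applies it to the monthly sales figures above. The rule is a widely used heuristic rather than the only definition of an outlier, and the variable names are illustrative.

```python
import pandas as pd

# Monthly sales from the example above, in pounds.
sales = pd.Series([4800, 5000, 5100, 5200, 25000])

# Interquartile range: the spread of the middle 50% of the data.
q1 = sales.quantile(0.25)
q3 = sales.quantile(0.75)
iqr = q3 - q1

# Values beyond 1.5 * IQR from the quartiles are flagged as outliers.
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr
outliers = sales[(sales < lower_fence) | (sales > upper_fence)]

print("Fences:", lower_fence, upper_fence)
print("Outliers:", outliers.tolist())
```

With these figures only 25000 falls outside the fences, which matches what we see by eye.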
Data Distribution in Python
You can view distribution using a histogram in Python:
import pandas as pd
import matplotlib.pyplot as plt

# Exam scores from the earlier example
scores = [45, 50, 55, 60, 65, 70, 75, 80, 85]
df = pd.DataFrame({"score": scores})

# Draw a histogram with 5 equal-width bins
df["score"].plot(kind="hist", bins=5)
plt.title("Distribution of Scores")
plt.xlabel("Score")
plt.ylabel("Number of students")
plt.show()
Checking Skewness in Python
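Pandas can also calculate skewness, a single number that summarises the direction of skew: positive values suggest a right-skewed distribution, negative values a left-skewed one, and values near zero a roughly symmetric one.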
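import pandas as pd

# Six similar values plus one very large value
data = [20, 22, 23, 25, 26, 28, 100]
s = pd.Series(data)

print("Mean:", s.mean())      # pulled upward by the value 100
print("Median:", s.median())  # middle value, barely affected by the outlier
print("Skewness:", s.skew())  # positive for right-skewed data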
Tip: If the mean is much higher than the median, the data may be right-skewed.
Why Distribution Matters in Machine Learning
Machine learning models learn patterns from data. If the data distribution is unusual, imbalanced or contains outliers,
model performance can be affected.
- Skewed features may need transformation
- Outliers may affect model training
- Class imbalance can affect classification models
- Feature scaling may be needed for some algorithms
- Distribution helps us understand whether data is suitable for modelling
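To illustrate the first point in the list above, the sketch below applies a log transform to a hypothetical right-skewed feature and compares skewness before and after. The data is simulated from a log-normal distribution purely for illustration. A log transform is one common option for positive, right-skewed values, not the only one.

```python
import numpy as np
import pandas as pd

# Hypothetical right-skewed feature, simulated from a log-normal distribution.
rng = np.random.default_rng(seed=0)
feature = pd.Series(rng.lognormal(mean=4.0, sigma=1.0, size=5_000))

# The log transform needs strictly positive values, which holds for this data.
log_feature = np.log(feature)

skew_before = feature.skew()
skew_after = log_feature.skew()

print(f"Skewness before: {skew_before:.2f}")
print(f"Skewness after:  {skew_after:.2f}")
```

The skewness should drop from a clearly positive value to something close to zero. Real data will rarely behave this neatly, so it is worth checking the histogram again after any transformation.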
Business Example
A company analyses customer spending. Most customers spend between £20 and £100, but a few customers spend over £2,000.
Business Insight
The spending data is right-skewed. The company may want to analyse high-spending customers separately because they behave
differently from typical customers.
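A minimal sketch of that separation is shown below. The spending figures are hypothetical, made up to match the description above, and the £2,000 cut-off is an assumption taken from the example rather than a general rule.

```python
import pandas as pd

# Hypothetical customer spending in pounds (illustrative values only).
spending = pd.Series([25, 40, 55, 60, 75, 80, 95, 100, 2100, 2500])

# Assumed cut-off for "high-spending" customers, based on the example above.
HIGH_SPEND_THRESHOLD = 2000

high_spenders = spending[spending > HIGH_SPEND_THRESHOLD]
typical_customers = spending[spending <= HIGH_SPEND_THRESHOLD]

print("Overall mean:", spending.mean())
print("Typical customers, median:", typical_customers.median())
print("High spenders:", high_spenders.tolist())
```

In this made-up data the overall mean is 513, far above what a typical customer spends, while the median for the typical group is 67.5. Analysing the two groups separately avoids that distortion.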
Quick Practice
Look at the following data:
10, 12, 14, 15, 16, 18, 90
Questions:
- Is the data balanced or skewed?
- Is there an outlier?
- Would the mean or median better represent the typical value?
Suggested answer: The data is right-skewed, 90 is an outlier, and the median (15) is likely more representative than the mean (25).
Common Beginner Mistake
A common mistake is calculating only the average and ignoring the shape of the data.
This can hide important patterns, outliers and skewness.
Remember: Always look at the distribution before making conclusions from data.
Key Takeaway
Data distribution shows how values are spread across a dataset. It helps us understand patterns, detect outliers,
identify skewness and prepare data for better analysis and machine learning.
Simple rule: Distribution tells us the shape and behaviour of data.