Understanding Dispersion: Range, IQR, and the Box Plot
Understanding how data spreads is just as important as knowing its center. In this post, you’ll learn how to describe dispersion with the range, the interquartile range (IQR), and the box plot, and how these tools help you spot outliers, gauge variability, and prepare better data for machine learning models.
📚 This post is part of the "Intro to Statistics" series
🔙 Previously: Measuring the Center: Mean, Median, and Mode Explained
🔜 Next: Measuring Variability: Variance and Standard Deviation
📏 Range: A Simple Start
- Range = Largest value − Smallest value
\[ \text{Range} = x_{\max} - x_{\min} \]
- Gives a basic idea of spread
- But it depends only on the two most extreme values, so a single outlier can distort it
💡 Example:
If data = [5, 6, 6, 7, 95] →
\[ \text{Range} = 95 - 5 = 90 \]
🛑 That huge gap is because of one extreme value (an outlier).
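To see this sensitivity in numbers, here is a minimal Python sketch. The helper name `value_range` is only an illustrative choice, not a library function.

```python
def value_range(values):
    """Return the range: largest value minus smallest value."""
    return max(values) - min(values)

data = [5, 6, 6, 7, 95]
print(value_range(data))        # 90, driven almost entirely by the 95

# Dropping the extreme value shows how much a single point matters.
print(value_range(data[:-1]))   # 2
```

One value out of five moves the range from 2 to 90, which is why the range alone can be misleading.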
📦 Interquartile Range (IQR)
To handle outliers, we use the IQR, which is based on quartiles:
Quartile | Meaning |
---|---|
Q1 | 25% of data is below this point |
Q2 | 50% of data is below this point (the median) |
Q3 | 75% of data is below this point |
🧮 Formula:
\[ \text{IQR} = Q_3 - Q_1 \]
- IQR focuses on the middle 50% of the data
- It is largely unaffected by extreme values, because it ignores the lowest 25% and the highest 25% of the data
💡 Example:
Given ordered data:
[2, 4, 5, 7, 8, 10, 12, 15, 20, 22]
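- \( Q_1 = 5 \) (25th percentile: the median of the lower half [2, 4, 5, 7, 8])
- \( Q_3 = 15 \) (75th percentile: the median of the upper half [10, 12, 15, 20, 22])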
\[ \text{IQR} = 15 - 5 = 10 \]
This means the middle half of the data spans 10 units.
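The example above can be reproduced with a short Python sketch. It assumes the "median of each half" convention for quartiles (for an odd number of points, the overall median is left out of both halves). The `quartiles` helper is hypothetical and written only for this illustration.

```python
from statistics import median

def quartiles(values):
    """Return (Q1, Q2, Q3) using the median-of-halves convention.

    Assumption: for an odd number of points the overall median is
    left out of both halves. Other conventions give other answers.
    """
    ordered = sorted(values)
    n = len(ordered)
    half = n // 2
    lower = ordered[:half]        # values below the median position
    upper = ordered[n - half:]    # values above the median position
    return median(lower), median(ordered), median(upper)

data = [2, 4, 5, 7, 8, 10, 12, 15, 20, 22]
q1, q2, q3 = quartiles(data)
iqr = q3 - q1
print(q1, q3, iqr)   # 5 15 10
```

Keep in mind that quartile conventions differ between tools. NumPy's default linear interpolation, for example, gives 5.5 and 14.25 for this same data, so small differences from a hand calculation are expected rather than a sign of a bug.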
📊 Box Plot: Best of Both Worlds
A box plot visually summarizes:
- The smallest and largest values that are not outliers (the ends of the whiskers)
- Q1, Q2 (median), and Q3
- Any outliers (points more than 1.5×IQR below Q1 or above Q3)
It’s one of the best visual tools to understand:
- Center
- Spread
- Skewness
- Outliers
🖼️ Visual: Anatomy of a Box Plot
- Each of the four sections (lower whisker, lower half of the box, upper half of the box, upper whisker) covers about 25% of the data
- The box spans from Q1 to Q3
- The line in the middle is the median (Q2)
- Points outside the whiskers are outliers
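The numbers a box plot draws can be computed by hand. The sketch below assumes median-of-halves quartiles and the common 1.5×IQR rule for outliers. The function name `box_plot_stats`, the parameter `k`, and the sample data are all hypothetical and chosen only for illustration.

```python
from statistics import median

def box_plot_stats(values, k=1.5):
    """Compute the values a box plot displays (median-of-halves quartiles)."""
    ordered = sorted(values)
    n = len(ordered)
    half = n // 2
    q1 = median(ordered[:half])
    q2 = median(ordered)
    q3 = median(ordered[n - half:])
    iqr = q3 - q1
    lower_fence = q1 - k * iqr    # anything below this is an outlier
    upper_fence = q3 + k * iqr    # anything above this is an outlier
    inside = [x for x in ordered if lower_fence <= x <= upper_fence]
    outliers = [x for x in ordered if x < lower_fence or x > upper_fence]
    return {
        "q1": q1, "median": q2, "q3": q3,
        "whisker_low": inside[0],     # smallest non-outlier
        "whisker_high": inside[-1],   # largest non-outlier
        "outliers": outliers,
    }

# Hypothetical sample: ten ordinary values plus one extreme value (60).
sample = [2, 4, 5, 7, 8, 10, 12, 15, 20, 22, 60]
stats = box_plot_stats(sample)
print(stats)
```

For this sample the box runs from 5 to 20, the whiskers reach 2 and 22, and 60 is reported as the only outlier because it lies above the upper fence of 20 + 1.5 × 15 = 42.5.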
🎯 Why Not Just Use the Mean?
While central tendency is important, it’s not enough.
We need to know how spread out the data is — especially when comparing groups.
🧠 The box plot helps you see both center and variability.
🤖 Why Dispersion Matters in Machine Learning
- Outlier Detection: IQR and box plots help identify outliers, which can strongly affect model performance.
- Feature Selection: Features with near-zero dispersion carry little information, while features with very high dispersion may need scaling or other special handling.
- Comparing Groups: Box plots make it easy to compare distributions across classes or experimental groups.
- Data Preprocessing: Understanding spread helps guide normalization, scaling, and robust imputation strategies.
In machine learning, understanding and visualizing data dispersion is essential for building reliable, interpretable models and for effective data cleaning.
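As one possible illustration of the preprocessing point, the sketch below scales a feature by centering it on the median and dividing by the IQR, so the scaling itself is not driven by extreme values. This is an assumption about what "robust" scaling could look like, not a prescribed method. The helper `robust_scale` and the feature values are hypothetical, and the quartiles use the median-of-halves convention.

```python
from statistics import median

def robust_scale(values):
    """Center on the median and divide by the IQR (median-of-halves quartiles)."""
    ordered = sorted(values)
    n = len(ordered)
    half = n // 2
    q1 = median(ordered[:half])
    q3 = median(ordered[n - half:])
    iqr = q3 - q1
    if iqr == 0:
        raise ValueError("IQR is zero; this feature cannot be scaled this way")
    center = median(ordered)
    return [(x - center) / iqr for x in values]

feature = [10, 12, 13, 14, 15, 70]   # hypothetical feature column
scaled = robust_scale(feature)
print(scaled)
```

The typical values land close to zero, while the extreme value stays far away from them and remains easy to flag for inspection.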
📌 Try It Yourself
Q: Consider these two datasets:
- 📦 Dataset A: {10, 12, 12, 13, 13, 13, 14, 15}
- 📦 Dataset B: {10, 12, 13, 14, 15, 70}
The two datasets have nearly the same median (13 and 13.5). Which one has a larger range, and what does that tell you about its spread?
💡 Answer
✅ Dataset B: its range is 70 − 10 = 60, versus 15 − 10 = 5 for Dataset A.
This tells us that Dataset B includes a more extreme value — possibly an outlier — which greatly increases its range.
Bonus: Why might the IQR be a better measure of spread than the range in some cases?
💡 Answer
✅ The Interquartile Range (IQR) measures the spread of the middle 50% of the data.
It is barely affected by a few extreme values, so it is a more reliable indicator of typical variability when the data is skewed or contains outliers.
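To check both answers numerically, here is a small Python sketch that reports the median, range, and IQR for each dataset. It assumes median-of-halves quartiles, and `spread_summary` is a hypothetical helper name.

```python
from statistics import median

def spread_summary(values):
    """Return (median, range, IQR) with median-of-halves quartiles."""
    ordered = sorted(values)
    n = len(ordered)
    half = n // 2
    iqr = median(ordered[n - half:]) - median(ordered[:half])
    return median(ordered), ordered[-1] - ordered[0], iqr

dataset_a = [10, 12, 12, 13, 13, 13, 14, 15]
dataset_b = [10, 12, 13, 14, 15, 70]

print(spread_summary(dataset_a))   # (13.0, 5, 1.5)
print(spread_summary(dataset_b))   # (13.5, 60, 3)
```

The range of Dataset B is 12 times that of Dataset A, while its IQR is only twice as large. The single value 70 dominates the range but barely touches the IQR.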
🧠 Level Up: Understanding Variability Beyond the Range
While the range gives a simple measure of spread, it’s very sensitive to outliers — extreme values can distort your understanding.
- 📊 The IQR zeroes in on the middle 50% of data, making it more robust when outliers exist.
- 📦 The box plot visually separates the central bulk of data from outliers, showing you skewness and spread at a glance.
- 🔍 These tools are especially important in fields like finance, biology, and machine learning where outliers are common.
Mastering these measures will help you make better decisions and spot patterns that average measures alone can miss.
🔁 Summary
Measure | What it tells us | Sensitive to outliers? |
---|---|---|
Range | Max − Min | ✅ Yes |
IQR | Spread of middle 50% | ❌ No (robust to a few extreme values) |
Box Plot | Visual of quartiles & outliers | ❌ No (outliers are drawn as separate points) |
✅ Up Next
Next, we’ll go deeper into numeric measures of variability:
- Variance
- Standard Deviation
And we’ll learn how to calculate and visualize them!
Stay tuned.