Correlation Between Variables: Contingency Tables and Scatter Plots

Posted May 10, 2025 Updated Jun 22, 2025

By Hoda Osama

Understanding how two variables relate is a core step in data analysis — and that’s where correlation comes in.

But choosing the right correlation method depends on your data type: are your variables categorical or quantitative? In this post, we’ll break down the key approaches — from contingency tables to scatter plots — so you can analyze relationships with confidence.

📚 This post is part of the "Intro to Statistics" series

🔙 Previously: A Real-World Statistics Example

🔜 Next: Understanding Pearson's R

🎓 Real-Life Case: Study Habits and Exam Performance

Imagine a high school counselor wants to investigate the relationship between how often students study and whether they pass or fail a weekly quiz.

She surveys 30 students and records two things:

📚 Study Time Category: Rarely, Sometimes, Often
✅ Quiz Result: Pass or Fail

🧮 Step 1: The Contingency Table

A contingency table (also called a two-way table) is used for categorical variables. Each cell shows how often one combination of categories occurs.

Study Frequency \ Quiz Result	Pass	Fail	Total
Rarely	3	7	10
Sometimes	6	4	10
Often	9	1	10
Total	18	12	30
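If you'd like to see where those counts come from, here is a minimal sketch that tallies them from one record per student. The `records` list is an assumption: it is rebuilt from the table's counts, since the post doesn't list the individual survey answers, and the variable names are just for illustration.

```python
from collections import Counter

# One (study frequency, quiz result) tuple per student.
# Assumption: rebuilt from the counts in the table above.
records = (
    [("Rarely", "Pass")] * 3 + [("Rarely", "Fail")] * 7
    + [("Sometimes", "Pass")] * 6 + [("Sometimes", "Fail")] * 4
    + [("Often", "Pass")] * 9 + [("Often", "Fail")] * 1
)

rows = ["Rarely", "Sometimes", "Often"]
cols = ["Pass", "Fail"]

# Count how often each combination of categories occurs.
cell_counts = Counter(records)

# Arrange the counts as table[row][col], then add the margins.
table = {r: {c: cell_counts[(r, c)] for c in cols} for r in rows}
row_totals = {r: sum(table[r].values()) for r in rows}
col_totals = {c: sum(table[r][c] for r in rows) for c in cols}
grand_total = sum(row_totals.values())

print(f"{'':<10}{'Pass':>6}{'Fail':>6}{'Total':>7}")
for r in rows:
    print(f"{r:<10}{table[r]['Pass']:>6}{table[r]['Fail']:>6}{row_totals[r]:>7}")
print(f"{'Total':<10}{col_totals['Pass']:>6}{col_totals['Fail']:>6}{grand_total:>7}")
```

The same tally is a one-liner in most data libraries, but counting by hand makes it clear that a contingency table is nothing more than counts of pairs.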

🔁 Step 2: Conditional Proportions

Raw counts are hard to compare across groups, especially when the groups differ in size. So we calculate the percentage of each outcome within each group.

For example:

Among students who study Rarely, 3/10 passed = 30%
Among those who study Often, 9/10 passed = 90%

Study Frequency	% Passed	% Failed
Rarely	30%	70%
Sometimes	60%	40%
Often	90%	10%

✅ These are conditional proportions — percentages within each row.
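The row percentages can be reproduced with a few lines of code. This is a small sketch, not the only way to do it; the helper name `row_proportions` is made up for this example.

```python
# Counts copied from the contingency table in Step 1.
counts = {
    "Rarely": {"Pass": 3, "Fail": 7},
    "Sometimes": {"Pass": 6, "Fail": 4},
    "Often": {"Pass": 9, "Fail": 1},
}

def row_proportions(table):
    """Divide each cell by its row total (proportions conditional on the row)."""
    result = {}
    for row, cells in table.items():
        total = sum(cells.values())
        result[row] = {col: n / total for col, n in cells.items()}
    return result

conditional = row_proportions(counts)
for row, cells in conditional.items():
    print(f"{row:<10} passed {cells['Pass']:.0%}   failed {cells['Fail']:.0%}")
```

Each row of the result adds up to 100%, which is a quick way to check that you conditioned on the rows and not on the columns.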

📊 Step 3: Understanding Proportions — Quick Summary

We use conditional proportions to look within groups, and marginal proportions to summarize a variable on its own.

Conditional example:
Among those who study Rarely → 3/10 passed = 30%
Marginal example:
Overall pass rate → 18/30 = 60%
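
To make the difference concrete, this short sketch computes both kinds of proportion from the same counts. The variable names are illustrative only.

```python
# Counts copied from the contingency table in Step 1.
counts = {
    "Rarely": {"Pass": 3, "Fail": 7},
    "Sometimes": {"Pass": 6, "Fail": 4},
    "Often": {"Pass": 9, "Fail": 1},
}
grand_total = sum(sum(cells.values()) for cells in counts.values())

# Marginal proportions: one variable on its own, ignoring the other.
marginal_result = {
    result: sum(cells[result] for cells in counts.values()) / grand_total
    for result in ("Pass", "Fail")
}
marginal_study = {row: sum(cells.values()) / grand_total for row, cells in counts.items()}

# Conditional proportion: pass rate within the "Rarely" group only.
pass_given_rarely = counts["Rarely"]["Pass"] / sum(counts["Rarely"].values())

print(f"Overall pass rate (marginal):    {marginal_result['Pass']:.0%}")
print(f"Pass rate among Rarely (cond.):  {pass_given_rarely:.0%}")
```

The denominator is what changes: the grand total (30) for a marginal proportion, the group total (10) for a conditional one.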

📚 Want a full breakdown with examples, visual tables, and when to use each?
👉 Read: Conditional vs. Marginal Proportions →

🔍 Step 4: Interpreting the Categorical Correlation

In this sample, students who study more often pass at higher rates.
We can see a positive association in the conditional proportions:

Rarely study → low pass rate (30%)
Sometimes study → middle pass rate (60%)
Often study → high pass rate (90%)

➡️ But a contingency table on its own doesn’t quantify the strength of the association; it only shows the pattern.

🔄 Step 5: Let’s Make It Quantitative

Now let’s change the scenario:

The counselor asks students for their exact number of study hours per week and records their quiz scores out of 100.

Here’s a sample:

Hours Studied	Quiz Score
2	50
3	55
5	65
7	70
8	76
10	85
12	92

📈 Step 6: Scatter Plot

This type of plot is perfect for quantitative variables.
It helps us visually assess correlation:

Each point = one student
X-axis: Hours studied
Y-axis: Quiz score

You’ll notice that, in this sample, students who studied more hours scored higher, and the points fall close to a straight line.
This is a strong positive relationship.
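
In practice you would draw this with a plotting library, but to keep things dependency-free here is a rough text-only sketch of the same idea. The helper `text_scatter` is made up for this post and assumes small, non-negative whole-number x values.

```python
hours = [2, 3, 5, 7, 8, 10, 12]
scores = [50, 55, 65, 70, 76, 85, 92]

def text_scatter(xs, ys, y_step=10):
    """Return a rough text scatter plot with one '*' per (x, y) point.

    Assumes xs are small non-negative integers (they are used as columns).
    """
    y_top = (max(ys) // y_step) * y_step      # label of the highest row
    y_bottom = (min(ys) // y_step) * y_step   # label of the lowest row
    lines = []
    for level in range(y_top, y_bottom - 1, -y_step):
        row = [" "] * (max(xs) + 1)
        for x, y in zip(xs, ys):
            # A point goes in the row whose band [level, level + y_step) contains y.
            if level <= y < level + y_step:
                row[x] = "*"
        lines.append(f"{level:>3} | " + " ".join(row))
    lines.append("    +" + "--" * (max(xs) + 1))
    # X-axis labels show only the last digit so every column stays one character wide.
    lines.append("      " + " ".join(str(x % 10) for x in range(max(xs) + 1)))
    return "\n".join(lines)

print(text_scatter(hours, scores))
```

Even in this crude form, the stars climb from the lower left to the upper right, which is the visual signature of a positive relationship.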

✅ Best Practices When Exploring Variable Relationships

Start with simple visuals like scatter plots or tables before jumping into modeling.
Use scatter plots to spot linear or curved patterns between numeric variables.
For categorical data, contingency tables show how categories relate.
Always ask: “Can this help my model make better predictions?”
Use a correlation metric (like Pearson’s r) to put a number on the strength of a linear relationship between numeric variables.
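
As a small preview of that last point, here is a sketch that computes Pearson's r for the hours and scores from Step 5 using the textbook formula. The function name `pearson_r` is just a label for this example; the next post in the series covers what the number means.

```python
from math import sqrt

hours = [2, 3, 5, 7, 8, 10, 12]
scores = [50, 55, 65, 70, 76, 85, 92]

def pearson_r(xs, ys):
    """Pearson correlation: co-variation divided by the product of spreads."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    co_variation = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = sum((x - mean_x) ** 2 for x in xs)
    spread_y = sum((y - mean_y) ** 2 for y in ys)
    return co_variation / sqrt(spread_x * spread_y)

r = pearson_r(hours, scores)
print(f"Pearson's r = {r:.3f}")  # about 0.997 for this sample
```

A value this close to +1 matches what the scatter plot suggests: a strong, positive, nearly linear relationship in this small sample.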

⚠ Common Pitfalls to Avoid

Assuming correlation means causation — just because two variables move together doesn’t mean one causes the other.
Ignoring outliers — one extreme value can distort your scatter plot or correlation result.
Overlooking non-linear patterns — not all relationships are straight lines. Try other visuals or transformations.
Using the wrong chart — don’t use a scatter plot for categorical data; use a contingency table instead.
Forgetting to check variable types — always know what kind of data you're working with before analyzing relationships.
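
Two of these pitfalls are easy to demonstrate with numbers. The sketch below reuses the textbook Pearson formula; the curved data is generated from y = x², and the extra "outlier" student is hypothetical and not part of the counselor's data.

```python
from math import sqrt

def pearson_r(xs, ys):
    """Pearson correlation computed from deviations around the means."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    co_variation = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = sum((x - mean_x) ** 2 for x in xs)
    spread_y = sum((y - mean_y) ** 2 for y in ys)
    return co_variation / sqrt(spread_x * spread_y)

# Pitfall: non-linear pattern. y = x^2 is a perfect curve, but on a
# symmetric range it has no linear trend, so r comes out as 0.
xs = [-3, -2, -1, 0, 1, 2, 3]
ys = [x ** 2 for x in xs]
r_curve = pearson_r(xs, ys)

# Pitfall: outliers. Start from the sample in Step 5...
hours = [2, 3, 5, 7, 8, 10, 12]
scores = [50, 55, 65, 70, 76, 85, 92]
r_clean = pearson_r(hours, scores)

# ...then add ONE hypothetical student (made up for this demo):
# 12 hours of study but a score of 30.
r_outlier = pearson_r(hours + [12], scores + [30])

print(f"r for the curve:        {r_curve:.3f}")
print(f"r without the outlier:  {r_clean:.3f}")
print(f"r with the outlier:     {r_outlier:.3f}")
```

A correlation of 0 for the curve does not mean "no relationship", and a single unusual point is enough to pull r down noticeably. Both are reasons to look at the plot before trusting the number.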

🧠 Level Up: Choosing the Right Correlation Approach Based on Data Types

Correlation analysis isn’t one-size-fits-all — the type of variables determines the best method:

📊 For two quantitative variables, measures like Pearson's r capture linear relationships.
📋 For two categorical variables, contingency tables and tests like Chi-square help assess association.
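🔄 For one categorical and one quantitative variable, methods like point-biserial correlation (when the categorical variable has two levels) or ANOVA (to compare group means across categories) are used.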

Understanding your data types helps you pick an appropriate analysis technique.
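
For the categorical case, here is a hedged sketch of the chi-square statistic worked out by hand on the study-frequency table. It is only an illustration: the p-value shortcut `exp(-chi2 / 2)` is valid for exactly 2 degrees of freedom, which happens to be what a 3×2 table has. For other table sizes you would normally rely on a statistics library.

```python
from math import exp

# Observed counts copied from the contingency table in Step 1.
observed = {
    "Rarely": {"Pass": 3, "Fail": 7},
    "Sometimes": {"Pass": 6, "Fail": 4},
    "Often": {"Pass": 9, "Fail": 1},
}
rows = list(observed)
cols = ["Pass", "Fail"]

row_totals = {r: sum(observed[r].values()) for r in rows}
col_totals = {c: sum(observed[r][c] for r in rows) for c in cols}
n = sum(row_totals.values())

# Expected count under "no association": row total * column total / n.
chi2 = 0.0
for r in rows:
    for c in cols:
        expected = row_totals[r] * col_totals[c] / n
        chi2 += (observed[r][c] - expected) ** 2 / expected

dof = (len(rows) - 1) * (len(cols) - 1)

# Shortcut that holds ONLY for dof == 2: upper-tail probability = exp(-chi2 / 2).
p_value = exp(-chi2 / 2)

print(f"chi-square = {chi2:.2f}, dof = {dof}, p = {p_value:.4f}")
```

For this table the statistic comes out to 7.5 with a p-value of roughly 0.02. Treat that as a rough guide: the expected count in each Fail cell is 4, below the common rule of thumb of 5, so the chi-square approximation is shaky with only 30 students.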

📌 Try It Yourself

Q: Imagine you're analyzing students’ test scores, and a few unusually high scores raise the mean. Which measure of center gives a more accurate picture of the typical student’s performance — mean or median?

💡 Answer

✅ Median. It is resistant to outliers, whereas the mean is pulled toward extreme values. The median depends only on the middle of the sorted data, so a few unusually high scores won't distort it, making it more reliable in such cases.
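
You can check the answer yourself with a tiny example. The scores below are made up purely for illustration: five typical results and two unusually high ones.

```python
from statistics import mean, median

# Made-up scores for illustration only: mostly mid-range, two very high.
scores = [60, 62, 65, 67, 70, 98, 100]

typical_mean = mean(scores)
typical_median = median(scores)

print(f"mean   = {typical_mean:.1f}")
print(f"median = {typical_median}")
```

The two high scores drag the mean up to about 74.6, above five of the seven students, while the median stays at 67, right in the middle of the typical group.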

🤖 Why It Matters in Machine Learning

In machine learning, understanding relationships between variables helps you:

📊 Choose the right features for your model (feature selection).
📉 Detect multicollinearity: features that are highly correlated with each other carry redundant information, which can make a linear model's coefficients unstable and hard to interpret.
🧪 Engineer new features based on strong associations (e.g., combining study time and pass rate).
📈 Pick the right models — strong linear correlation? Consider regression. Categorical outcomes? Try classification.

Learning how to interpret contingency tables and scatter plots builds your EDA skills, a core part of every data science pipeline.
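
As a rough sketch of the multicollinearity check, the code below compares every pair of features and flags the ones that are almost perfectly correlated. The feature set is hypothetical: `minutes` is simply `hours` converted to minutes, `sleep` is made-up data, and the 0.9 cutoff is an arbitrary choice for this demo, not a standard.

```python
from itertools import combinations
from math import sqrt

def pearson_r(xs, ys):
    """Pearson correlation computed from deviations around the means."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    co_variation = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = sum((x - mean_x) ** 2 for x in xs)
    spread_y = sum((y - mean_y) ** 2 for y in ys)
    return co_variation / sqrt(spread_x * spread_y)

hours = [2, 3, 5, 7, 8, 10, 12]
features = {
    "hours": hours,
    "minutes": [h * 60 for h in hours],   # same information in another unit
    "sleep": [7, 6, 8, 5, 7, 6, 8],       # HYPOTHETICAL data, invented for the demo
}

THRESHOLD = 0.9  # arbitrary cutoff chosen for this example

flagged = []
for name_a, name_b in combinations(features, 2):
    r = pearson_r(features[name_a], features[name_b])
    print(f"{name_a:>7} vs {name_b:<7} r = {r:+.2f}")
    if abs(r) > THRESHOLD:
        flagged.append((name_a, name_b))

print("Highly correlated pairs:", flagged)
```

Here only `hours` and `minutes` get flagged, which makes sense: they are the same measurement in different units, so keeping both adds nothing for a model.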

✅ Conclusion

Type of Data	Tool to Use	Example
Categorical (Nominal/Ordinal)	Contingency Table	Study Frequency vs Pass/Fail
Quantitative	Scatter Plot	Hours Studied vs Quiz Score

🧠 Choose the right tool based on your variable types.

💬 Got a question or suggestion?

Leave a comment below — I’d love to hear your thoughts or help if something was unclear.

🔜 Up Next

Next, we’ll calculate the Pearson correlation coefficient (r) — a number that tells us how strong a linear relationship really is.

statistics, beginner

This post is licensed under CC BY 4.0 by the author.