
Regression: Predicting Relationships Between Variables with the Best Fit Line


If correlation shows the strength of a relationship, regression goes further — it models that relationship to make predictions.

In this post, you’ll learn how to use linear regression to fit the best line through data, calculate slope and intercept, and assess prediction quality using residuals and R². Whether you’re analyzing trends or building machine learning models, this is a core concept you’ll return to often.


📚 This post is part of the "Intro to Statistics" series

🔙 Previously: Pearson’s r, which quantifies the strength and direction of a linear relationship

🔜 Next: Mastering Randomness


🎯 A Creative Case Study: Predicting Exam Scores from Study Hours

Imagine you collect data on students’ hours studied and their corresponding exam scores. You plot these points on a scatter plot and calculate Pearson’s r:

\[ r = 0.93 \]

This indicates a strong positive linear relationship: generally, more study hours relate to higher exam scores.
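
To make this concrete, here is a minimal Python sketch of how you might compute Pearson’s r yourself; the study-hours and exam-score values below are hypothetical, so they won’t reproduce the exact 0.93 above.

```python
import numpy as np

# Hypothetical data: hours studied and corresponding exam scores
hours = np.array([1, 2, 3, 4, 5, 6, 7, 8], dtype=float)
scores = np.array([52, 55, 61, 68, 70, 80, 85, 91], dtype=float)

# np.corrcoef returns the 2x2 correlation matrix; the off-diagonal
# entry is Pearson's r between the two variables
r = np.corrcoef(hours, scores)[0, 1]
print(f"Pearson's r = {r:.2f}")
```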


📊 Drawing the Regression Line & Understanding Residuals

We want to draw a regression line through the scatter plot that best fits the data and allows us to predict exam scores based on study hours.

But how do we know which line fits best?

  • For each data point, take the vertical distance from the point to the regression line (observed value minus predicted value); this signed distance is called a residual.

  • A residual is positive when the point lies above the regression line.

  • A residual is negative when the point lies below the regression line.

*Figure: Scatter plot with regression line and residuals, visualizing the vertical distances from data points to the line.*
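
To see residuals in code, here is a small sketch that measures each point against a candidate line; the intercept and slope below are made-up illustrative values, not a fitted result.

```python
import numpy as np

hours = np.array([1, 2, 3, 4, 5, 6, 7, 8], dtype=float)
scores = np.array([52, 55, 61, 68, 70, 80, 85, 91], dtype=float)

# A candidate line y-hat = a + b*x (illustrative values only)
a, b = 45.0, 5.5
predicted = a + b * hours

# Residual = observed y minus predicted y
residuals = scores - predicted
for x, res in zip(hours, residuals):
    side = "above" if res >= 0 else "below"
    print(f"x = {x:.0f}: residual = {res:+.1f} (point lies {side} the line)")
```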


🧮 Minimizing Residuals: Ordinary Least Squares (OLS)

We want the regression line that minimizes the sum of squared residuals (RSS):

\[ RSS = \sum (y_i - \hat{y}_i)^2 \]

  • We square residuals because positive and negative residuals would otherwise cancel out (for the best-fit line, the raw residuals sum to exactly zero).

  • Squaring also penalizes larger errors more heavily.

This method is called Ordinary Least Squares (OLS) — the most common technique for fitting regression lines.

*Figure: Visualizing the sum of squared residuals; the squares drawn on each residual show how RSS accumulates.*
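
As a sketch of why OLS wins, the snippet below compares the RSS of two arbitrary candidate lines against the least-squares fit (obtained here with NumPy’s polyfit); the candidate coefficients are made up, and the OLS line should come out with the smallest RSS.

```python
import numpy as np

hours = np.array([1, 2, 3, 4, 5, 6, 7, 8], dtype=float)
scores = np.array([52, 55, 61, 68, 70, 80, 85, 91], dtype=float)

def rss(a, b):
    """Sum of squared residuals for the line y-hat = a + b*x."""
    residuals = scores - (a + b * hours)
    return np.sum(residuals ** 2)

# np.polyfit with deg=1 returns [slope, intercept] for the OLS line
b_ols, a_ols = np.polyfit(hours, scores, deg=1)

for label, a, b in [("guess 1", 45.0, 5.5),
                    ("guess 2", 50.0, 5.0),
                    ("OLS fit", a_ols, b_ols)]:
    print(f"{label}: RSS = {rss(a, b):.1f}")
```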


📈 The Regression Equation

The regression line predicts values of \( y \) (dependent variable) from \( x \) (independent variable) as:

\[ \hat{y} = a + b x \]

Where:

  • \( \hat{y} \) = predicted value of \( y \)
  • \( a \) = intercept (value of \( y \) when \( x = 0 \))
  • \( b \) = slope (change in \( y \) for one unit increase in \( x \))

🧮 Calculating the Regression Coefficients

Using the data, we calculate:

\[ b = r \times \frac{s_y}{s_x} \]

\[ a = \bar{y} - b \bar{x} \]

Where:

  • \( r \) = Pearson’s correlation coefficient between \( x \) and \( y \)
  • \( s_y \) = standard deviation of \( y \)
  • \( s_x \) = standard deviation of \( x \)
  • \( \bar{y} \) = mean of \( y \)
  • \( \bar{x} \) = mean of \( x \)
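
Here is a sketch of these two formulas in plain NumPy, using the same hypothetical data as above; the result should match what a library OLS routine returns.

```python
import numpy as np

hours = np.array([1, 2, 3, 4, 5, 6, 7, 8], dtype=float)
scores = np.array([52, 55, 61, 68, 70, 80, 85, 91], dtype=float)

r = np.corrcoef(hours, scores)[0, 1]  # Pearson's r
s_x = np.std(hours, ddof=1)           # sample standard deviation of x
s_y = np.std(scores, ddof=1)          # sample standard deviation of y

b = r * s_y / s_x                     # slope
a = scores.mean() - b * hours.mean()  # intercept

print(f"y-hat = {a:.2f} + {b:.2f} * x")
print(f"Predicted score after 5 hours of study: {a + b * 5:.1f}")
```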

Why Is This the Best Fitting Line?

  • Because it minimizes the sum of squared residuals, no other straight line has a smaller total squared vertical distance from the data points.

  • It’s the line with the smallest prediction error on the observed data.


📊 Assessing the Fit: \( R^2 \) (Coefficient of Determination)

The statistic \( R^2 \) measures how well the regression line predicts \( y \):

\[ R^2 = r^2 \]

  • \( r \) gives the direction and strength of the linear relationship.

  • \( R^2 \) tells you how much better the regression line predicts \( y \) than simply using the mean of \( y \).

  • \( R^2 \) also represents the proportion of variance in \( y \) explained by \( x \).
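
The sketch below shows both views of \( R^2 \) on the hypothetical data from earlier: squaring Pearson’s r, and comparing the model’s RSS to the total variation around the mean. For simple linear regression the two routes give the same number.

```python
import numpy as np

hours = np.array([1, 2, 3, 4, 5, 6, 7, 8], dtype=float)
scores = np.array([52, 55, 61, 68, 70, 80, 85, 91], dtype=float)

b, a = np.polyfit(hours, scores, deg=1)      # OLS slope and intercept
predicted = a + b * hours

rss = np.sum((scores - predicted) ** 2)      # error left after the model
tss = np.sum((scores - scores.mean()) ** 2)  # error of "just predict the mean"

r = np.corrcoef(hours, scores)[0, 1]
print(f"R^2 via r squared:   {r ** 2:.3f}")
print(f"R^2 via 1 - RSS/TSS: {1 - rss / tss:.3f}")
```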


*Figure: Comparing high and low \( R^2 \) values: a high \( R^2 \) indicates a good fit, a low \( R^2 \) a poor fit.*


🤖 Why Regression Matters in Machine Learning

Linear regression isn’t just a statistics tool — it’s one of the simplest yet most powerful building blocks in machine learning.

  • 📈 Used for predictive modeling, especially for numeric outcomes like prices, scores, or trends.
  • 🔍 Helps in feature evaluation and understanding how each variable influences predictions.
  • 🧪 Forms the basis of more advanced models like Ridge, Lasso, and Logistic Regression.
  • 📉 Residuals and R² are essential tools for evaluating model accuracy.

Even in complex ML systems, understanding linear regression strengthens your intuition and debugging skills.
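
As a bridge to ML tooling, here is a minimal sketch of the same fit with scikit-learn (assuming it is installed); note that the library expects the inputs as a 2-D array of shape (n_samples, n_features).

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Same hypothetical data, reshaped to a single-feature column
hours = np.array([1, 2, 3, 4, 5, 6, 7, 8], dtype=float).reshape(-1, 1)
scores = np.array([52, 55, 61, 68, 70, 80, 85, 91], dtype=float)

model = LinearRegression().fit(hours, scores)
print(f"slope = {model.coef_[0]:.2f}, intercept = {model.intercept_:.2f}")
print(f"R^2 on the training data = {model.score(hours, scores):.3f}")
```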


🧠 Level Up: When Linear Regression Isn't Enough

Linear regression is powerful, but it assumes a straight-line relationship. When that doesn’t hold, it can mislead more than inform.

  • 📈 Use polynomial regression if the trend curves (e.g., U-shaped or exponential); see the sketch after this list.
  • 🔍 Apply log or square root transformations to linearize skewed relationships.
  • 🤖 In machine learning, try models like decision trees, random forests, or gradient boosting for non-linear patterns.
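
For instance, here is a quick sketch of fitting a curved trend with a degree-2 polynomial via NumPy; the U-shaped data is made up, and the drop in RSS shows how badly a straight line fits it.

```python
import numpy as np

# Hypothetical U-shaped data: a straight line fits this badly
x = np.array([-3, -2, -1, 0, 1, 2, 3], dtype=float)
y = np.array([9.2, 4.1, 1.3, 0.2, 0.8, 4.3, 8.9], dtype=float)

linear = np.polyfit(x, y, deg=1)     # straight-line fit
quadratic = np.polyfit(x, y, deg=2)  # curved (quadratic) fit

for name, coeffs in [("linear", linear), ("quadratic", quadratic)]:
    rss = np.sum((y - np.polyval(coeffs, x)) ** 2)
    print(f"{name} fit: RSS = {rss:.1f}")
```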

Understanding when linear regression fails is just as important as knowing when it works. That’s how you build smarter models.


✅ Best Practices for Linear Regression
  • Always visualize your data to verify a linear relationship.
  • Check for and address outliers — they can distort your model.
  • Interpret both the slope and the intercept; they tell different stories.
  • Use domain knowledge to avoid nonsense predictions (e.g. negative income).

⚠️ Common Pitfalls
  • ❌ Assuming correlation implies causation.
  • ❌ Using regression with non-linear data without transformation.
  • ❌ Ignoring units — slope has meaning only in context (e.g. dollars per hour).
  • ❌ Forgetting to evaluate model fit using residual plots and R².

📌 Try It Yourself: Regression Quiz

Q1: What does a residual represent in a linear regression model?

💡 Show Answer

Answer: The vertical distance between an actual data point and the predicted value on the regression line.

Q2: Why do we minimize the sum of squared residuals (RSS) instead of just summing the residuals directly?

💡 Show Answer

Answer: Because positive and negative residuals would cancel each other out. Squaring them avoids this and emphasizes larger errors.

Q3: In the equation \( \hat{y} = a + bx \), what does the slope \( b \) represent?

💡 Show Answer

Answer: It represents the **expected change in \( y \)** for every 1-unit increase in \( x \).

Q4: What does the \( R^2 \) value (coefficient of determination) tell us about a regression model?

💡 Show Answer

Answer: It tells us what percentage of the variance in the dependent variable \( y \) is explained by the independent variable \( x \).

Q5: Which of the following statements is true about correlation and causation?

💡 Show Answer

Answer: Correlation does not imply causation. Just because two variables move together doesn’t mean one causes the other.

Bonus: You build a model to predict house prices, and the slope of your regression line is 50,000. What does this mean?

💡 Show Answer

Answer: For every 1-unit increase in the input variable (e.g., number of rooms), the predicted house price increases by $50,000.


🔁 Summary

| Concept | Meaning |
| --- | --- |
| Residual | The vertical distance between an observed data point and the regression line |
| Sum of Squared Residuals (RSS) | The total of squared residuals minimized in ordinary least squares regression |
| Regression Equation | \( \hat{y} = a + b x \); predicts \( y \) from \( x \) |
| Slope (\( b \)) | Change in predicted \( y \) for a one-unit increase in \( x \) |
| Intercept (\( a \)) | Predicted value of \( y \) when \( x = 0 \) |
| Pearson’s \( r \) | Correlation coefficient showing the strength and direction of a linear relationship |
| Coefficient of Determination (\( R^2 \)) | Proportion of variance in \( y \) explained by \( x \) via the regression model |

💬 Got a question or suggestion?

Leave a comment below — I’d love to hear your thoughts or help if something was unclear.


✅ Up Next

In the next post, we’ll dive into Randomness and Probability:

  • Understanding randomness in data and processes
  • Key probability concepts and rules
  • How probability helps us make sense of uncertainty
  • Practical examples to build your intuition

Stay tuned!

This post is licensed under CC BY 4.0 by the author.