Regression: Predicting Relationships Between Variables with the Best Fit Line

In our last post, we explored Pearson’s r, which measures the strength and direction of a linear relationship between two variables. Now, we take the next step: using regression to model and predict relationships with a best-fitting line.


📚 This post is part of the "Intro to Statistics" series

🔙 Previously: Pearson’s r, which quantifies the strength and direction of a linear relationship

🔜 Next: Mastering Randomness


🎯 A Creative Case Study: Predicting Exam Scores from Study Hours

Imagine you collect data on students’ hours studied and their corresponding exam scores. You plot these points on a scatter plot and calculate Pearson’s r:

\[ r = 0.93 \]

This indicates a strong positive linear relationship: generally, more study hours relate to higher exam scores.
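
To make this concrete, here is a minimal Python sketch (using NumPy) that computes Pearson’s r for a small, made-up dataset of study hours and exam scores. The numbers are purely illustrative, so the resulting r will not be exactly the 0.93 from the example above.

```python
import numpy as np

# Hypothetical data: hours studied and exam scores for 8 students
# (made-up numbers for illustration only)
hours = np.array([1, 2, 3, 4, 5, 6, 7, 8], dtype=float)
scores = np.array([52, 55, 61, 64, 70, 74, 76, 83], dtype=float)

# Pearson's r: covariance of x and y divided by the product of their standard deviations
r = np.corrcoef(hours, scores)[0, 1]
print(f"Pearson's r = {r:.2f}")
```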


📊 Drawing the Regression Line & Understanding Residuals

We want to draw a regression line through the scatter plot that best fits the data and allows us to predict exam scores based on study hours.

But how do we know which line fits best?

  • For each data point, calculate the vertical distance to the regression line — this distance is called a residual (see the sketch after this list).

  • Positive residuals are distances where points lie above the regression line.

  • Negative residuals are distances where points lie below the regression line.

*Figure: scatter plot with regression line, visualizing residuals as the vertical distances from data points to the line*
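
Here is a short sketch, continuing with the same made-up dataset, of how residuals are computed for a candidate line. The intercept and slope used here are just a guess to illustrate the idea, not the best-fitting line.

```python
import numpy as np

# Same hypothetical data as above
hours = np.array([1, 2, 3, 4, 5, 6, 7, 8], dtype=float)
scores = np.array([52, 55, 61, 64, 70, 74, 76, 83], dtype=float)

# Any candidate line y_hat = a + b*x produces one residual per data point
a, b = 48.0, 4.0  # arbitrary intercept and slope, chosen only to illustrate residuals
predicted = a + b * hours
residuals = scores - predicted  # positive: point lies above the line; negative: below

for x, y, res in zip(hours, scores, residuals):
    print(f"x={x:.0f}  observed={y:.0f}  predicted={a + b * x:.1f}  residual={res:+.1f}")
```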


🧮 Minimizing Residuals: Ordinary Least Squares (OLS)

We want the regression line that minimizes the sum of squared residuals (RSS):

\[ RSS = \sum (y_i - \hat{y}_i)^2 \]

  • We square residuals because positive and negative residuals would otherwise cancel out (sum of residuals = 0).

  • Squaring also penalizes larger errors more heavily.

This method is called Ordinary Least Squares (OLS) — the most common technique for fitting regression lines.

*Figure: visualizing the sum of squared residuals, with squares representing each residual and how RSS is calculated*
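
A minimal sketch of this idea, again with the illustrative dataset: we compute the RSS for the OLS line (fitted here with NumPy’s `polyfit`) and for an arbitrary guessed line, to see that the OLS line achieves the smaller RSS.

```python
import numpy as np

hours = np.array([1, 2, 3, 4, 5, 6, 7, 8], dtype=float)
scores = np.array([52, 55, 61, 64, 70, 74, 76, 83], dtype=float)

def rss(a, b, x, y):
    """Sum of squared residuals for the line y_hat = a + b*x."""
    residuals = y - (a + b * x)
    return np.sum(residuals ** 2)

# OLS fit (np.polyfit returns the slope first, then the intercept, for degree 1)
b_ols, a_ols = np.polyfit(hours, scores, deg=1)

# Compare the RSS of the OLS line with an arbitrary alternative line
print(f"OLS line:     RSS = {rss(a_ols, b_ols, hours, scores):.1f}")
print(f"Guessed line: RSS = {rss(48.0, 4.0, hours, scores):.1f}")  # always >= the OLS RSS
```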


📈 The Regression Equation

The regression line predicts values of \( y \) (dependent variable) from \( x \) (independent variable) as:

\[ \hat{y} = a + b x \]

Where:

  • \( \hat{y} \) = predicted value of \( y \)
  • \( a \) = intercept (value of \( y \) when \( x = 0 \))
  • \( b \) = slope (change in \( y \) for one unit increase in \( x \))

🧮 Calculating the Regression Coefficients

Using the data, we calculate:

\[ b = r \times \frac{s_y}{s_x} \]

\[ a = \bar{y} - b \bar{x} \]

Where:

  • \( r \) = Pearson’s correlation coefficient between \( x \) and \( y \)
  • \( s_y \) = standard deviation of \( y \)
  • \( s_x \) = standard deviation of \( x \)
  • \( \bar{y} \) = mean of \( y \)
  • \( \bar{x} \) = mean of \( x \)
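
Here is a small sketch that plugs the illustrative dataset into these formulas to obtain \( b \) and \( a \), then uses the fitted line to make a prediction.

```python
import numpy as np

hours = np.array([1, 2, 3, 4, 5, 6, 7, 8], dtype=float)
scores = np.array([52, 55, 61, 64, 70, 74, 76, 83], dtype=float)

# Plug the summary statistics into the textbook formulas
r = np.corrcoef(hours, scores)[0, 1]
s_x, s_y = np.std(hours, ddof=1), np.std(scores, ddof=1)  # sample standard deviations
x_bar, y_bar = hours.mean(), scores.mean()

b = r * s_y / s_x       # slope
a = y_bar - b * x_bar   # intercept

print(f"regression line: y_hat = {a:.2f} + {b:.2f} * x")
print(f"predicted score for 5 hours of study: {a + b * 5:.1f}")
```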

Why Is This the Best Fitting Line?

  • Because it minimizes the sum of squared residuals (errors), no other straight line has a smaller total squared distance from the points.

  • It’s the line with the smallest prediction error on the observed data.


📊 Assessing the Fit: \( R^2 \) (Coefficient of Determination)

The statistic \( R^2 \) measures how well the regression line predicts \( y \):

\[ R^2 = r^2 \]

  • \( r \) gives the direction and strength of the linear relationship.

  • \( R^2 \) tells how much better the regression line predicts \( y \) compared to simply using the mean of \( y \).

  • \( R^2 \) also represents the proportion of variance in \( y \) explained by \( x \).
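
The following sketch computes \( R^2 \) both ways on the illustrative dataset: via \( 1 - RSS/TSS \), which compares the regression line to the mean-only prediction, and via \( r^2 \), to show that the two agree for simple linear regression.

```python
import numpy as np

hours = np.array([1, 2, 3, 4, 5, 6, 7, 8], dtype=float)
scores = np.array([52, 55, 61, 64, 70, 74, 76, 83], dtype=float)

# Fit the OLS line and compute its predictions
b, a = np.polyfit(hours, scores, deg=1)
predicted = a + b * hours

# R^2 = 1 - RSS / TSS, where TSS is the error of the "just use the mean" model
rss = np.sum((scores - predicted) ** 2)
tss = np.sum((scores - scores.mean()) ** 2)
r_squared = 1 - rss / tss

# For simple linear regression this equals Pearson's r squared
r = np.corrcoef(hours, scores)[0, 1]
print(f"R^2 via 1 - RSS/TSS : {r_squared:.3f}")
print(f"r squared           : {r ** 2:.3f}")
```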


*Figure: comparing high and low \( R^2 \) values: a high \( R^2 \) indicates a good fit, a low \( R^2 \) a poor fit*


⚠️ Important Tips

  • Correlation is NOT causation. A strong relationship doesn’t prove one variable causes the other.

  • Outliers can distort regression results. Always visualize your data and consider the effect of outliers before trusting the model (a quick illustration follows).
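
As a quick illustration of the second point, the sketch below adds a single extreme, made-up data point to the illustrative dataset and refits the line; the slope changes noticeably.

```python
import numpy as np

hours = np.array([1, 2, 3, 4, 5, 6, 7, 8], dtype=float)
scores = np.array([52, 55, 61, 64, 70, 74, 76, 83], dtype=float)

# Fit with and without a single extreme point
b1, a1 = np.polyfit(hours, scores, deg=1)

hours_out = np.append(hours, 9.0)
scores_out = np.append(scores, 20.0)  # one student who studied a lot but scored very low
b2, a2 = np.polyfit(hours_out, scores_out, deg=1)

print(f"slope without outlier: {b1:.2f}")
print(f"slope with outlier   : {b2:.2f}")  # a single point can flatten or even flip the slope
```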


📌 Try It Yourself: Regression Quiz

Q1: What does a residual represent in regression analysis?

💡 Show Answer

The vertical distance between a data point and the regression line.

Q2: Why do we minimize the sum of squared residuals instead of summing residuals directly?

💡 Show Answer

Because positive and negative residuals would cancel each other out if summed directly.

Q3: What does the slope (\(b\)) in the regression equation \( \hat{y} = a + bx \) represent?

💡 Show Answer

The change in \( y \) for each one unit increase in \( x \).

Q4: What does \( R^2 \) (coefficient of determination) indicate?

💡 Show Answer

How much of the variance in \( y \) is explained by the independent variable \( x \) through the regression line.

Q5: Which statement is TRUE about correlation and causation?

💡 Show Answer

Correlation does not imply causation. A strong relationship does not mean one variable causes the other.

🔁 Summary

| Concept | Meaning |
| --- | --- |
| Residual | The vertical distance between an observed data point and the regression line |
| Sum of Squared Residuals (RSS) | The total of squared residuals, minimized in ordinary least squares regression |
| Regression Equation | \( \hat{y} = a + b x \); predicts \( y \) from \( x \) |
| Slope (\( b \)) | Change in predicted \( y \) for a one-unit increase in \( x \) |
| Intercept (\( a \)) | Predicted value of \( y \) when \( x = 0 \) |
| Pearson’s \( r \) | Correlation coefficient showing the strength and direction of a linear relationship |
| Coefficient of Determination (\( R^2 \)) | Proportion of variance in \( y \) explained by \( x \) via the regression model |

✅ Up Next

In the next post, we’ll dive into Randomness and Probability:

  • Understanding randomness in data and processes
  • Key probability concepts and rules
  • How probability helps us make sense of uncertainty
  • Practical examples to build your intuition

Stay tuned!

This post is licensed under CC BY 4.0 by the author.