Understanding Normal Distribution

📌 What Is the Normal Distribution (Gaussian Distribution)?

The normal distribution (or Gaussian distribution) is a type of continuous probability distribution for a real-valued random variable. It describes how many natural phenomena and errors in measurements are distributed. The graph is symmetric and bell-shaped.


📚 This post is part of the "Intro to Statistics" series

🔙 Previously: Mean, Variance, and Standard Deviation of Random Variables

🔜 Next: Understanding Z-Distribution and Using the Z-Table


The Probability Density Function (PDF) for the Normal Distribution

The equation for the PDF of a normal distribution is:

\[ f(x) = \frac{1}{\sigma \sqrt{2\pi}} \exp \left( -\frac{(x - \mu)^2}{2\sigma^2} \right) \]

Where:

  • \( \mu \) is the mean (location parameter) of the distribution, which defines where the peak of the bell curve is located.
  • \( \sigma \) is the standard deviation (scale parameter), which controls the width of the bell curve.
  • \( \exp \) is the exponential function, \( \exp(t) = e^{t} \). Its argument here is never positive, so the density is largest at \( x = \mu \) and falls off quickly as \( x \) moves away from the mean.

The leading factor \( \frac{1}{\sigma \sqrt{2\pi}} \) normalizes the curve so that the total area under it equals 1.
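
As a minimal sketch, the PDF formula can be written out term by term and compared against Python's standard-library `statistics.NormalDist`. The function name `normal_pdf` and the parameter values (mean 100, standard deviation 15) are assumptions chosen only for illustration.

```python
import math
from statistics import NormalDist


def normal_pdf(x: float, mu: float = 0.0, sigma: float = 1.0) -> float:
    """Density of a normal distribution with mean mu and standard deviation sigma."""
    if sigma <= 0:
        raise ValueError("sigma must be positive")
    coeff = 1.0 / (sigma * math.sqrt(2.0 * math.pi))   # normalizing factor
    exponent = -((x - mu) ** 2) / (2.0 * sigma ** 2)    # never positive
    return coeff * math.exp(exponent)


# Hypothetical parameters, not taken from any real data set.
mu, sigma = 100.0, 15.0
peak = normal_pdf(mu, mu, sigma)             # the density is highest at x = mu
reference = NormalDist(mu, sigma).pdf(mu)    # standard-library value for comparison
print(peak, reference)
```

The two printed values should agree up to floating-point rounding, which is a quick sanity check that the hand-written formula matches the library.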


📊 Understanding the Equation

The density depends on \( x \) only through the squared distance \( (x - \mu)^2 \), so the curve is symmetric about the mean and decays exponentially in that squared distance.

  • The total area under the curve equals 1, and the probability that \( X \) falls in an interval is the area under the curve over that interval.
  • The variable \( x \) can take any value from \( -\infty \) to \( +\infty \), meaning the distribution extends infinitely in both directions.
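
The two properties above (total area of 1 and symmetry around the mean) can be checked numerically. The sketch below uses a simple trapezoid rule over \( \mu \pm 10\sigma \), which is an approximation to the infinite range; the helper `trapezoid_area` and the parameters are hypothetical and chosen for illustration only.

```python
from statistics import NormalDist


def trapezoid_area(f, a: float, b: float, n: int = 20_000) -> float:
    """Approximate the integral of f over [a, b] with n equal-width trapezoids."""
    h = (b - a) / n
    total = 0.5 * (f(a) + f(b))
    for i in range(1, n):
        total += f(a + i * h)
    return total * h


dist = NormalDist(mu=2.0, sigma=0.5)   # hypothetical parameters

# The tails beyond 10 standard deviations are negligibly small,
# so a finite range stands in for (-infinity, +infinity).
lo = dist.mean - 10 * dist.stdev
hi = dist.mean + 10 * dist.stdev
total_area = trapezoid_area(dist.pdf, lo, hi)

# Symmetry: points the same distance on either side of the mean have equal density.
left = dist.pdf(dist.mean - 0.7)
right = dist.pdf(dist.mean + 0.7)

print(f"area under the curve: {total_area:.6f}")
print(f"density at mu - 0.7 and mu + 0.7: {left:.6f}, {right:.6f}")
```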

🔄 Important Characteristics of Normal Distribution

  • \( \mu \) describes the location of the distribution, i.e., where the center of the bell curve lies.
  • \( \sigma \) sets the scale of the distribution, i.e., how spread out the values are around the mean.
  • The probability for any given range can be found using the cumulative distribution function (CDF).

🧮 Example of Normal Distribution

For any normal distribution:

  • About 68% of values lie between \( \mu - \sigma \) and \( \mu + \sigma \).
  • About 95% of values lie between \( \mu - 2\sigma \) and \( \mu + 2\sigma \).
  • About 99.7% of values lie between \( \mu - 3\sigma \) and \( \mu + 3\sigma \).
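
These three percentages are rounded values. As a small sketch, they can be recomputed with the error function from Python's `math` module, using the identity \( P(|X - \mu| \le k\sigma) = \operatorname{erf}(k / \sqrt{2}) \), which holds for any normal distribution. The function name `prob_within` is an assumption for illustration.

```python
import math


def prob_within(k: float) -> float:
    """Probability that a normal variable lies within k standard deviations of its mean."""
    return math.erf(k / math.sqrt(2.0))


for k in (1, 2, 3):
    print(f"within {k} standard deviation(s): {prob_within(k):.2%}")
```

The result does not depend on \( \mu \) or \( \sigma \), which is why the rule applies to every normal distribution.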

📈 Visualizing the 68%, 95%, and 99.7% Rule

Here's a visual showing the 68%, 95%, and 99.7% areas under the curve:

[Figure: Normal Distribution - Empirical Rule]


How to Calculate Probabilities Using the Normal Distribution

To calculate the probability that a variable \( X \) lies within a specific range:

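  • We use the Cumulative Distribution Function (CDF), \( F(x) = P(X \le x) \), which gives the area under the curve from \( -\infty \) to a specified \( x \). The probability of a range is then the difference of two CDF values: \( P(a \le X \le b) = F(b) - F(a) \).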
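
A minimal sketch of a range probability, computed as the CDF at the upper bound minus the CDF at the lower bound, using the standard-library `statistics.NormalDist`. The helper `prob_between` and the example numbers (scores with mean 70 and standard deviation 10) are hypothetical.

```python
from statistics import NormalDist


def prob_between(a: float, b: float, mu: float, sigma: float) -> float:
    """P(a <= X <= b) for a normal X, as a difference of two CDF values."""
    dist = NormalDist(mu, sigma)
    return dist.cdf(b) - dist.cdf(a)


# Hypothetical example: scores with mean 70 and standard deviation 10.
p = prob_between(60.0, 85.0, mu=70.0, sigma=10.0)
print(f"P(60 <= X <= 85) = {p:.4f}")

# One standard deviation either side of the mean gives roughly 68%.
one_sd = prob_between(60.0, 80.0, mu=70.0, sigma=10.0)
print(f"P(60 <= X <= 80) = {one_sd:.4f}")
```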

🤖 Why It Matters for Machine Learning

  • In Linear Regression, residuals are ideally normally distributed; this assumption underpins the usual confidence intervals and hypothesis tests for the coefficients.
  • Gaussian Naive Bayes classifier assumes features are normally distributed within each class.
  • Many statistical tests (like t-tests or ANOVA) assume normality, and they are often used in feature selection.
  • The Central Limit Theorem justifies normal approximations in ensemble learning, bootstrapping, and model evaluation.
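
To make the Gaussian Naive Bayes point concrete, here is a toy sketch with a single feature: fit one normal distribution per class, then pick the class with the largest prior times density. The training values, the class labels, and the `classify` helper are all made up for illustration; this is not how any particular library implements it.

```python
from statistics import NormalDist, mean, stdev

# Hypothetical one-feature training data for two classes.
train = {
    "small": [4.9, 5.1, 5.0, 5.3, 4.8],
    "large": [6.8, 7.1, 6.9, 7.3, 7.0],
}

total = sum(len(values) for values in train.values())

# One normal distribution per class, fitted from the sample mean and standard deviation.
models = {label: NormalDist(mean(values), stdev(values)) for label, values in train.items()}

# Class priors estimated from class frequencies.
priors = {label: len(values) / total for label, values in train.items()}


def classify(x: float) -> str:
    """Return the class with the highest prior * Gaussian likelihood at x."""
    scores = {label: priors[label] * models[label].pdf(x) for label in models}
    return max(scores, key=scores.get)


print(classify(5.2))
print(classify(6.7))
```

With several features, the "naive" part is multiplying the per-feature densities together as if the features were independent within each class.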

🧠 Level Up: Understanding the Normal Distribution in Detail
  • The normal distribution is foundational in statistics. It is used in hypothesis testing, confidence intervals, and in many natural and social sciences.
  • The 68-95-99.7 rule: This empirical rule highlights the percentage of data that falls within 1, 2, and 3 standard deviations from the mean.
  • The central limit theorem states that, for independent observations from a distribution with finite variance, the sampling distribution of the sample mean approaches a normal distribution as the sample size increases, regardless of the shape of the original distribution.
  • In practice, many natural phenomena and measurement errors are approximately normal because they are the sum of many small, independent effects, which is the situation the central limit theorem describes.
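
The central limit theorem can be illustrated with a small simulation. The sketch below draws from a skewed exponential distribution (mean 1, standard deviation 1) and looks at the distribution of sample means; the sample size, number of trials, and seed are arbitrary choices for illustration.

```python
import math
import random
from statistics import mean, stdev

rng = random.Random(0)  # fixed seed so repeated runs draw the same samples


def sample_mean(n: int) -> float:
    """Mean of n independent draws from a skewed Exponential(1) distribution."""
    return mean([rng.expovariate(1.0) for _ in range(n)])


n, trials = 50, 2000  # illustrative sizes, not recommendations
means = [sample_mean(n) for _ in range(trials)]

center = mean(means)   # expected to be near the population mean, 1.0
spread = stdev(means)  # expected to be near 1 / sqrt(n)

# Share of sample means within one standard deviation of their center.
within_one_sd = sum(abs(m - center) <= spread for m in means) / trials

print(f"mean of sample means: {center:.3f}")
print(f"sd of sample means:   {spread:.3f} (theory: {1 / math.sqrt(n):.3f})")
print(f"share within one sd:  {within_one_sd:.3f}")
```

Even though individual exponential draws are far from bell-shaped, the share of sample means within one standard deviation should land near the 68% expected for a normal distribution.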

✅ Best Practices for Normal Distribution
  • Check if your data is approximately symmetric before assuming normality.
  • Use Q-Q plots or histograms to assess normality visually.
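  • With large samples, use a normal approximation for the sampling distribution of the mean (thanks to the CLT); the raw data themselves do not become normal just because the sample is large.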
  • Understand when standardizing (Z-scores) is appropriate.
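
As a quick sketch of standardizing, a Z-score rescales each value to "how many standard deviations from the mean", \( z = (x - \mu) / \sigma \). The function name `z_scores` and the data are hypothetical; the example uses the population standard deviation (`pstdev`), which is an assumption about how the data should be treated.

```python
from statistics import mean, pstdev


def z_scores(values: list[float]) -> list[float]:
    """Standardize values to mean 0 and (population) standard deviation 1."""
    mu = mean(values)
    sigma = pstdev(values)
    if sigma == 0:
        raise ValueError("all values are identical, so z-scores are undefined")
    return [(x - mu) / sigma for x in values]


# Hypothetical data with mean 5 and population standard deviation 2.
data = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]
z = z_scores(data)
print(z)
```

Standardizing changes the location and scale of the data but not its shape, so skewed data stays skewed after conversion to Z-scores.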

⚠️ Common Pitfalls
  • ❌ Assuming data is normal without checking (especially for small samples).
  • ❌ Using normal distribution with categorical or non-continuous data.
  • ❌ Confusing the normal distribution with the uniform distribution.
  • ❌ Misinterpreting the standard deviation as covering 100% of data.

📌 Try It Yourself: Normal Distribution

Q1: What is the normal distribution also called?

✅ It's also known as the Gaussian distribution.

Named after Carl Friedrich Gauss, who helped develop the mathematical theory behind it.


Q2: What does the standard deviation \( \sigma \) control in a normal distribution?

✅ It controls the spread (width) of the bell curve.

A larger \( \sigma \) means a wider curve; a smaller \( \sigma \) results in a tighter, narrower shape.


Q3: What percentage of values fall between \( \mu - 3\sigma \) and \( \mu + 3\sigma \)?

✅ Approximately 99.7% of values fall within this range.

This is part of the Empirical Rule for normal distributions.


Q4: How about the range \( \mu - 2\sigma \) to \( \mu + 2\sigma \)?

✅ Around 95% of values fall in this range.

The 95% level is commonly used for confidence intervals in statistics, where the more precise multiplier is about \( 1.96\sigma \).


Q5: What does the cumulative distribution function (CDF) tell us?

✅ The CDF gives the probability that a random variable is less than or equal to a certain value.

It's useful for computing probabilities over ranges: for a continuous variable, any single exact point has probability 0, so only ranges carry probability.


Q6: How much of the distribution lies within one standard deviation from the mean?

✅ About 68% of the data lies between \( \mu - \sigma \) and \( \mu + \sigma \).

This forms the central region of the normal curve.


Summary of Key Points

  • The normal distribution is symmetric and bell-shaped.
  • The mean \( \mu \) determines the location of the peak.
  • The standard deviation \( \sigma \) controls the spread.
  • About 68% of values lie within one standard deviation (\( \mu \pm \sigma \)).
  • About 95% of values lie within two standard deviations (\( \mu \pm 2\sigma \)).
  • About 99.7% of values lie within three standard deviations (\( \mu \pm 3\sigma \)).

💬 Got a question or suggestion?

Leave a comment below. I'd love to hear your thoughts or help if something was unclear.


🔜 Up Next

Next, we'll explore the Z-Distribution, a standardized version of the normal distribution that is used to calculate probabilities and percentiles.

Stay tuned!

This post is licensed under CC BY 4.0 by the author.