⚙️ Optimization in Machine Learning: From Gradient Descent to Newton’s Method

⚙️ What Is Optimization in ML?

Optimization is the core process that drives learning in machine learning models.
It’s how we find the best parameters — like weights in a neural network — that reduce error as much as possible.

Imagine you’re hiking through a mountainous landscape: the goal is to reach the lowest valley.
This landscape represents the loss surface, and optimization algorithms guide your steps to find that minimum.



📚 This post is part of the "Intro to Calculus" series

🔙 Previously: What is the Hessian? Understanding Curvature and Optimization in Machine Learning



🎯 Loss Functions

A loss function measures how far off a model’s predictions are from actual labels.

Examples:

  • Mean Squared Error: \( L = \frac{1}{n}\sum (y_i - \hat{y}_i)^2 \)
  • Cross-Entropy for classification

We want to minimize the loss by adjusting parameters.
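
To make this concrete, here is a minimal NumPy sketch of both losses; the label and prediction arrays are made-up values for illustration only:

```python
import numpy as np

# Made-up labels and predictions, purely for illustration
y_true = np.array([1.0, 0.0, 1.0, 1.0])
y_pred = np.array([0.9, 0.2, 0.8, 0.6])

# Mean Squared Error: average squared residual
mse = np.mean((y_true - y_pred) ** 2)

# Binary cross-entropy: heavily penalizes confident wrong predictions
eps = 1e-12  # guard against log(0)
bce = -np.mean(y_true * np.log(y_pred + eps)
               + (1 - y_true) * np.log(1 - y_pred + eps))

print(f"MSE: {mse:.4f}")  # 0.0625
print(f"BCE: {bce:.4f}")
```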


🔽 Gradient Descent (First-Order)

Gradient descent uses the gradient (the vector of first derivatives) to take steps in the direction of steepest descent.

Update rule: \[ \theta_{\text{next}} = \theta - \eta \cdot \nabla L(\theta) \]

  • \( \eta \): learning rate (step size)
  • \( \nabla L(\theta) \): gradient of the loss function

Too large \( \eta \) → overshooting; too small → slow convergence.


📉 Python: Gradient Descent Demo

import numpy as np
import matplotlib.pyplot as plt

# Loss function and derivative
f = lambda x: x**2
df = lambda x: 2*x

x_vals = [2.0]  # starting point
lr = 0.1

# Take 10 gradient steps: x_new = x - lr * f'(x)
for _ in range(10):
    x_new = x_vals[-1] - lr * df(x_vals[-1])
    x_vals.append(x_new)

# Plot
x = np.linspace(-2, 2, 100)
y = f(x)
plt.plot(x, y)
plt.scatter(x_vals, [f(v) for v in x_vals], color='red')  # loss at each iterate
plt.title("Gradient Descent on f(x) = x²")
plt.xlabel("x")
plt.ylabel("f(x)")
plt.show()

📊 Visual: Gradient Descent Steps on a Curve

This plot shows how gradient descent iteratively steps toward the minimum on \( f(x) = x^2 \):

[Figure: Gradient descent path showing red dots moving toward the bottom of a parabola]
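
The learning-rate warning above is easy to demonstrate. Below is a small variation of the demo, with arbitrary illustrative values of \( \eta \):

```python
# Effect of the learning rate on f(x) = x**2 (derivative: 2x)
df = lambda x: 2 * x

def gradient_descent(lr, steps=10, x0=2.0):
    """Run `steps` gradient-descent updates from x0; return the final iterate."""
    x = x0
    for _ in range(steps):
        x = x - lr * df(x)
    return x

print(gradient_descent(lr=0.1))  # ~0.21: steady progress toward 0
print(gradient_descent(lr=0.9))  # ~0.21 as well, but it oscillated across 0 to get there
print(gradient_descent(lr=1.1))  # ~12.4: each step overshoots further -> divergence
```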


🧠 Newton’s Method (Second-Order)

Newton’s method uses the Hessian to adjust step size based on curvature:

\[ \theta_{\text{next}} = \theta - H^{-1} \nabla L(\theta) \]

Where:

  • \( H \): Hessian matrix (second derivatives)

Because the update accounts for curvature, it is very accurate near optima, but computing and inverting the Hessian is expensive for high-dimensional models.

🧪 Python: Newton’s Method, Symbolically (2D)

import sympy as sp

x, y = sp.symbols('x y')
f = x**2 + y**2

# Gradient: vector of first partial derivatives
grad = sp.Matrix([sp.diff(f, var) for var in (x, y)])
# Hessian: matrix of second partial derivatives
H = sp.hessian(f, (x, y))

# Newton step: theta - H^{-1} * grad
theta = sp.Matrix([x, y])
theta_next = theta - H.inv() * grad
print(sp.simplify(theta_next))  # Matrix([[0], [0]]): one step lands exactly at the minimum
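
The quadratic above is a special case: Newton’s method reaches the minimum in a single step. To see how it behaves on a non-quadratic function, here is a small numeric sketch; the function \( f(x) = e^x - 2x \) (minimized at \( x = \ln 2 \)) and the starting point are arbitrary illustrative choices:

```python
import math

# f(x) = exp(x) - 2x, a non-quadratic function with minimum at x = ln 2
df  = lambda x: math.exp(x) - 2   # f'(x)
d2f = lambda x: math.exp(x)       # f''(x)

x = 2.0  # arbitrary starting point
for i in range(6):
    x = x - df(x) / d2f(x)        # 1D Newton update: x - f'(x) / f''(x)
    print(f"step {i + 1}: x = {x:.6f}")

print(f"true minimum: x = {math.log(2):.6f}")  # 0.693147
```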

🖼️ Visual: Newton’s Method and Curvature

This 3D surface shows how Newton’s Method uses curvature to jump directly to the bottom:

  • Red arrow = gradient direction
  • Blue point = Newton’s update lands at the minimum

[Figure: 3D surface of \( f(x, y) = x^2 + y^2 \) with gradient and Newton-step arrows]


⚖️ Comparing First- and Second-Order Methods

| Method | Uses | Pros | Cons |
|---|---|---|---|
| Gradient Descent | First derivative | Simple, scalable | Slower convergence |
| Newton’s Method | Second derivative | Faster near minimum | Expensive (Hessian inverse) |

🤖 Relevance to Machine Learning

Optimization is the engine behind how machine learning models learn. Whether it’s a simple linear model or a deep neural network, optimization determines how the model improves its predictions over time.

  • Model Training: Algorithms like gradient descent adjust weights to minimize loss functions — it's how models learn from data.
  • Neural Networks: Every training step in deep learning is an optimization update, typically using stochastic gradient descent or its variants.
  • Convex vs. Non-Convex Loss: Optimization challenges differ depending on whether the loss surface is smooth and convex (easy) or full of local minima/saddles (hard).
  • Learning Rate Tuning: Choosing the right learning rate is essential — too small and the model learns slowly, too large and it may diverge.
  • Advanced Optimizers: Algorithms like Adam, RMSProp, and L-BFGS use momentum, adaptive learning rates, or curvature approximations to improve training.
  • Batch Training: Optimization behaves differently when using mini-batches, full batches, or stochastic updates (see the sketch after this list).
  • Loss Surface Geometry: Understanding optimization gives insight into why some models get stuck in saddle points, overshoot, or fail to converge.
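
As a minimal sketch of that batch-training point, here is a toy comparison of full-batch and mini-batch gradient descent on synthetic linear-regression data; the data, batch size, and learning rate are all made-up choices for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data for a 1-parameter model: y = 3x + noise
X = rng.uniform(-1, 1, size=200)
y = 3 * X + 0.1 * rng.normal(size=200)

def grad_mse(w, Xb, yb):
    """Gradient of the MSE loss for the model y_hat = w * x."""
    return 2 * np.mean((w * Xb - yb) * Xb)

lr, w_full, w_mini = 0.1, 0.0, 0.0
for epoch in range(100):
    # Full batch: one exact gradient step per epoch
    w_full -= lr * grad_mse(w_full, X, y)
    # Mini-batches: several cheap, noisy steps per epoch
    for start in range(0, len(X), 32):
        w_mini -= lr * grad_mse(w_mini, X[start:start + 32], y[start:start + 32])

print(f"full batch: w = {w_full:.3f}")  # both end up near the true slope of 3
print(f"mini-batch: w = {w_mini:.3f}")
```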

🧠 Level Up
  • 🧭 Optimization = Learning: Every ML model “learns” by minimizing a loss function — optimization is the engine behind it.
  • 🚀 Gradient Descent: First-order methods are simple and scale well, making them the default in most neural networks.
  • 🔁 Learning Rate Tuning: Adjusting the learning rate can mean the difference between convergence and failure — even for the same algorithm.
  • 🔺 Newton’s Method: Second-order updates use curvature to move faster near minima — especially in low-dimensional or convex problems.
  • 🧠 Adaptive Methods: Optimizers like Adam and RMSProp approximate curvature and adjust step sizes automatically per parameter.

✅ Best Practices
  • 🧠 Understand the loss surface: Visualizing or analyzing loss curvature helps choose the right optimizer and learning rate.
  • 🔢 Start with simple functions: Practice gradient descent and Newton’s method on 1D and 2D functions to build intuition.
  • 📉 Tune the learning rate: Test multiple values — even the best optimizer fails with a poor step size.
  • 🔁 Compare first and second-order updates: Understand when Newton’s method outperforms basic gradient descent (e.g., near optima).
  • ⚙️ Use auto-diff tools: Frameworks like TensorFlow and PyTorch compute gradients and Hessians efficiently and correctly (a minimal PyTorch sketch follows this list).
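
To illustrate, here is a minimal sketch using PyTorch’s autodiff (assuming PyTorch is installed); it recomputes the gradient and Hessian of the same \( f(x, y) = x^2 + y^2 \) used earlier:

```python
import torch

# f(v) = v1^2 + v2^2, matching the symbolic SymPy example
def f(v):
    return (v ** 2).sum()

v = torch.tensor([2.0, 1.0], requires_grad=True)  # arbitrary evaluation point

# Gradient via reverse-mode autodiff
f(v).backward()
print(v.grad)  # tensor([4., 2.]) = [2x, 2y]

# Full Hessian via torch.autograd.functional
H = torch.autograd.functional.hessian(f, v.detach())
print(H)  # tensor([[2., 0.], [0., 2.]])
```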

⚠️ Common Pitfalls
  • Choosing a bad learning rate: Too large and you overshoot; too small and training stalls.
  • Ignoring convergence behavior: Watch for oscillation, divergence, or vanishing updates.
  • Over-relying on Newton’s method: It's powerful, but impractical for high-dimensional models due to Hessian computation.
  • Misinterpreting gradient direction: The gradient shows steepest increase — you must subtract it to minimize loss.
  • Confusing global vs. local minima: Optimization typically finds a local minimum, which is not necessarily the best possible model.

📌 Try It Yourself

📊 Gradient Descent Step: What’s the next value of \( x \) if we start at \( x = 2 \) on \( f(x) = x^2 \) with learning rate \( \eta = 0.1 \)?

🧠 Step-by-step:
- Gradient: \( f'(x) = 2x \), so at \( x = 2 \), \( f'(2) = 4 \)
- Update rule: \( x_{\text{next}} = x - \eta \cdot f'(x) = 2 - 0.1 \cdot 4 = 1.6 \)

✅ Final Answer: \( x_{\text{next}} = 1.6 \)

📊 Newton’s Method Step: On \( f(x) = x^2 \), what is the Newton update from \( x = 2 \)?

🧠 Step-by-step:
- Gradient: \( f'(x) = 2x \), so \( f'(2) = 4 \)
- Hessian (2nd derivative): \( f''(x) = 2 \)
- Newton update: \( x_{\text{next}} = x - \frac{f'(x)}{f''(x)} = 2 - \frac{4}{2} = 0 \)

✅ Final Answer: \( x_{\text{next}} = 0 \)

📊 Compare Steps: Which method moves more aggressively from \( x = 2 \) on \( f(x) = x^2 \)?

🧠 Answer:
- Gradient Descent took \( x \rightarrow 1.6 \)
- Newton’s Method took \( x \rightarrow 0 \)

✅ Newton's Method is more aggressive and goes straight to the minimum in this case.
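
A quick sanity check of both answers in plain Python:

```python
# Verify both worked examples on f(x) = x**2, starting from x = 2
x, lr = 2.0, 0.1
df  = lambda x: 2 * x  # f'(x)
d2f = lambda x: 2.0    # f''(x)

print(x - lr * df(x))      # gradient descent step -> 1.6
print(x - df(x) / d2f(x))  # Newton step           -> 0.0
```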

🔁 Summary: What You Learned

| 🧠 Concept | 📌 Description |
|---|---|
| Loss Function | Measures model error — optimization tries to minimize it |
| Gradient Descent | First-order method that uses slope to step toward minima |
| Learning Rate | Controls step size — critical for convergence |
| Newton’s Method | Uses both gradient and Hessian for curvature-aware optimization |
| Hessian Matrix | Matrix of second derivatives — shows local curvature |
| Convex vs Non-Convex | Optimization behavior changes based on loss surface shape |
| Optimizer Behavior | Different optimizers converge at different speeds and directions |
| Python Tools | Libraries like NumPy and SymPy can simulate and visualize optimization steps |

💬 Got a question or suggestion?

Leave a comment below — I’d love to hear your thoughts or help if something was unclear.

This post is licensed under CC BY 4.0 by the author.