Stochastic Gradient Descent (SGD) – Simple Explanation & Examples

Stochastic Gradient Descent (SGD) is one of the most popular and foundational optimization algorithms used to train machine learning models, especially deep learning models on large datasets.

What is Stochastic Gradient Descent?

At its core, Gradient Descent is an iterative method used to find the minimum of a function. In machine learning, this "function" is the loss or cost function, which measures how far off our model's predictions are from the actual values. By minimizing this function, we improve the model's accuracy.

Unlike traditional (or "batch") gradient descent, which computes the gradient of the loss over all training examples before making a single update, Stochastic Gradient Descent updates the model's parameters using just one training sample at a time. This "stochastic" (random) approach makes each update much cheaper and the overall process more scalable.

The SGD Update Formula

The update rule for SGD is simple yet powerful. For each training example, the model parameters (weights) are updated in the opposite direction of the gradient of the loss function.

θ = θ - η · ∇J(θ; x⁽ⁱ⁾, y⁽ⁱ⁾)

Where:

  • θ (theta) represents the model's parameters (weights).
  • η (eta) is the learning rate, a small positive value that controls the size of each update step.
  • ∇J(θ; x⁽ⁱ⁾, y⁽ⁱ⁾) is the gradient of the loss function J with respect to the parameters θ, calculated for a single training example (x⁽ⁱ⁾, y⁽ⁱ⁾).

This process is repeated for many passes over the entire dataset (known as epochs) until the model converges.
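
To make the update rule concrete, here is a minimal NumPy sketch that fits a small linear regression model with plain SGD. The synthetic data, learning rate, and epoch count are illustrative assumptions, not values taken from any particular library or paper.

```python
import numpy as np

# Hypothetical data: 200 samples, 3 features, for illustration only.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
true_w = np.array([2.0, -1.0, 0.5])
y = X @ true_w + rng.normal(scale=0.1, size=200)

theta = np.zeros(3)   # model parameters θ
eta = 0.01            # learning rate η
n_epochs = 20

for epoch in range(n_epochs):
    # Shuffle so the "stochastic" sample order differs each epoch.
    for i in rng.permutation(len(X)):
        x_i, y_i = X[i], y[i]
        prediction = x_i @ theta
        # Gradient of the squared-error loss J = 1/2 (x_i·θ - y_i)² for one sample.
        grad = (prediction - y_i) * x_i
        # SGD update: θ = θ - η · ∇J(θ; x_i, y_i)
        theta -= eta * grad

print(theta)  # should end up close to true_w
```

Each pass of the outer loop is one epoch; the inner loop applies one parameter update per training example, which is exactly what distinguishes SGD from batch gradient descent.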

SGD vs. Batch vs. Mini-Batch Gradient Descent

Understanding the differences between the three main types of gradient descent is key:

| Type                 | Update Frequency                 | Pros                                     | Cons                        |
|----------------------|----------------------------------|------------------------------------------|-----------------------------|
| Batch GD             | Once per epoch (all data)        | Stable convergence                       | Very slow; memory intensive |
| Stochastic GD (SGD)  | For every single sample          | Fast; good for large datasets            | Noisy/unstable updates      |
| Mini-Batch GD        | For every small batch of samples | Best of both worlds; stable and efficient | Requires tuning batch size  |
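
The three variants differ only in how much data feeds each gradient estimate. The sketch below reuses the illustrative linear-regression setup from above and makes that choice a single parameter: batch_size=1 recovers SGD, batch_size=len(X) recovers batch gradient descent, and anything in between is mini-batch.

```python
import numpy as np

def minibatch_gd(X, y, eta=0.01, batch_size=32, n_epochs=20, seed=0):
    """Gradient descent for linear regression with a configurable batch size."""
    rng = np.random.default_rng(seed)
    theta = np.zeros(X.shape[1])
    for epoch in range(n_epochs):
        idx = rng.permutation(len(X))
        for start in range(0, len(X), batch_size):
            batch = idx[start:start + batch_size]
            X_b, y_b = X[batch], y[batch]
            # Average gradient over the batch: batch_size=1 is SGD,
            # batch_size=len(X) is batch GD, anything else is mini-batch.
            grad = X_b.T @ (X_b @ theta - y_b) / len(batch)
            theta -= eta * grad
    return theta
```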

When to Use SGD

SGD is particularly effective in several scenarios:

  • Large Datasets: When you have millions of training examples, processing them one by one is far more computationally feasible than using the entire dataset for each update.
  • Neural Networks: Most deep learning models are trained with SGD or one of its variants (such as Adam or RMSprop) because of their efficiency on large datasets.
  • Online Learning: In scenarios where data arrives in a stream, SGD can update the model on the fly as each new piece of data comes in, without needing to re-train on the entire dataset.
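
As one concrete example of the online-learning case, scikit-learn's SGDRegressor exposes a partial_fit method that applies SGD updates using only the data that has just arrived. The streaming loop and synthetic "chunks" below are illustrative assumptions standing in for a real data source.

```python
import numpy as np
from sklearn.linear_model import SGDRegressor

rng = np.random.default_rng(0)
model = SGDRegressor(learning_rate="constant", eta0=0.01)

for step in range(100):
    # In a real system each chunk would come from a live stream.
    X_chunk = rng.normal(size=(10, 3))
    y_chunk = X_chunk @ np.array([2.0, -1.0, 0.5]) + rng.normal(scale=0.1, size=10)
    # partial_fit updates the weights with only the new data,
    # without revisiting anything seen earlier.
    model.partial_fit(X_chunk, y_chunk)

print(model.coef_)
```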