Binary classification. Sigmoid function, cross-entropy loss.
Logistic regression is the “hello world” of modern classification: a linear score turned into a probability, trained by a loss that directly matches how Bernoulli (yes/no) data is generated. It’s simple enough to fully understand, but rich enough to connect straight into neural networks.
Logistic regression models P(y = 1 | x) = σ(w·x + b). Train w (and b) by minimizing binary cross-entropy (negative log-likelihood). The gradient has a clean form: ∇ = (ŷ − y)x (and bias gradient ŷ − y), making it easy to optimize with gradient descent.
In binary classification, each example has features x ∈ ℝᵈ and a label y ∈ {0, 1}. You want a model that, given x, outputs a probability that the label is 1:
Many models can produce a hard decision (0/1), but logistic regression is designed to produce a calibrated probability in [0, 1].
A linear model like w·x + b can be any real number: negative, > 1, etc. That’s not a valid probability.
We need two ingredients: a linear score, and a function that squashes that score into a valid probability.
First, the linear score:
z = w·x + b
This is the raw “evidence” for the positive class.
Logistic regression chooses the sigmoid (logistic) function:
σ(z) = 1 / (1 + e^(−z))
So the model is:
ŷ = P(y = 1 | x) = σ(w·x + b)
The score z measures where x lies relative to a hyperplane.
w·x + b = 0
So logistic regression is a linear classifier in geometry, but a probabilistic model in output.
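The model so far can be sketched in a few lines. This is a minimal sketch assuming NumPy; the function names `predict_proba` and `predict_label` are illustrative, not from any particular library.

```python
import numpy as np

def sigmoid(z):
    """Logistic function: maps any real score to (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

def predict_proba(w, b, x):
    """P(y = 1 | x) = sigma(w . x + b)."""
    return sigmoid(np.dot(w, x) + b)

def predict_label(w, b, x, threshold=0.5):
    """Hard 0/1 decision at the given threshold."""
    return int(predict_proba(w, b, x) >= threshold)
```

With w = (0.8, −0.4), b = −0.2, x = (2, 1) this gives z = 1.0 and ŷ ≈ 0.73, matching the worked example later in the lesson.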
A key reason logistic regression is so standard is that it models the log-odds as linear.
Define odds:
odds = P(y=1|x) / P(y=0|x) = p / (1 − p)
Log-odds (logit):
logit(p) = log(p / (1 − p))
Logistic regression assumes:
log(p / (1 − p)) = w·x + b
Solve for p:
Let z = w·x + b.
p / (1 − p) = e^z
p = e^z (1 − p)
p = e^z − e^z p
p + e^z p = e^z
p(1 + e^z) = e^z
p = e^z / (1 + e^z) = 1 / (1 + e^(−z)) = σ(z)
This shows sigmoid isn’t arbitrary: it’s what you get when you say “log-odds are linear in features.”
Sometimes we fold the bias into the weight vector by adding a constant feature x₀ = 1.
Define extended feature vector x̃ = [1, x₁, …, x_d] and w̃ = [b, w₁, …, w_d]. Then:
z = w̃·x̃
This can simplify implementations and derivations.
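The bias-folding trick is easy to verify numerically. A small sketch assuming NumPy, with arbitrary example values:

```python
import numpy as np

# Fold the bias into the weights by prepending a constant-1 feature.
w = np.array([0.8, -0.4])
b = -0.2
x = np.array([2.0, 1.0])

x_tilde = np.concatenate(([1.0], x))   # [1, x1, ..., xd]
w_tilde = np.concatenate(([b], w))     # [b, w1, ..., wd]

# The two formulations give the same score z.
z_plain = np.dot(w, x) + b
z_folded = np.dot(w_tilde, x_tilde)
```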
Logistic regression is a linear score passed through a sigmoid. The linear predictor is the simplest way to combine features:
z = w·x + b = ∑ⱼ wⱼ xⱼ + b
Each feature contributes additively, weighted by wⱼ.
The decision boundary is the set of points where the model is indifferent, i.e. where ŷ = 0.5.
Since σ(0) = 0.5, we have:
ŷ = 0.5 ⇔ z = 0 ⇔ w·x + b = 0
That equation describes a hyperplane.
Predicted class often uses a threshold:
predict 1 if ŷ ≥ 0.5 (equivalently z ≥ 0)
We want a function that is monotonically increasing, maps all of ℝ into (0, 1), and is smooth (differentiable everywhere). Sigmoid has these properties.
Key values: σ(0) = 0.5, σ(z) → 1 as z → +∞, σ(z) → 0 as z → −∞, and σ(−z) = 1 − σ(z).
Training needs gradients. Sigmoid has a famously convenient derivative.
Let p = σ(z) = 1 / (1 + e^(−z)).
Differentiate:
p = (1 + e^(−z))^(−1)
∂p/∂z = −1 · (1 + e^(−z))^(−2) · ∂/∂z (1 + e^(−z))
∂/∂z (1 + e^(−z)) = −e^(−z)
So:
∂p/∂z = (1 + e^(−z))^(−2) · e^(−z)
Now rewrite in terms of p:
p = 1 / (1 + e^(−z))
1 − p = e^(−z) / (1 + e^(−z))
Therefore:
p(1 − p) = [1 / (1 + e^(−z))] · [e^(−z) / (1 + e^(−z))] = e^(−z) / (1 + e^(−z))²
Thus:
∂p/∂z = p(1 − p)
This compact form is one reason logistic regression is so convenient.
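The identity ∂p/∂z = p(1 − p) can be checked against a finite-difference approximation. A small sketch assuming NumPy:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Check dp/dz = p(1 - p) against a central finite difference.
z = 0.7
eps = 1e-6
numeric = (sigmoid(z + eps) - sigmoid(z - eps)) / (2 * eps)
p = sigmoid(z)
analytic = p * (1 - p)   # the closed-form derivative
```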
Because log-odds are linear:
log(p/(1−p)) = w·x + b
Each weight wⱼ has a direct interpretation: increasing xⱼ by one unit (holding the other features fixed) adds wⱼ to the log-odds, i.e. multiplies the odds by e^(wⱼ).
Caution: interpretation depends on feature scaling. If one feature is measured in large units, its weight will tend to be smaller.
| Component | Linear regression (for y ∈ ℝ) | Logistic regression (for y ∈ {0,1}) |
|---|---|---|
| Score | w·x + b | w·x + b |
| Output | ŷ = score | ŷ = σ(score) ∈ (0,1) |
| Typical loss | squared error | binary cross-entropy |
| Probabilistic meaning | Gaussian noise assumption | Bernoulli likelihood |
This sets up the next step: choosing a loss that matches Bernoulli labels.
For classification, we don’t just want “close numeric values.” We want well-calibrated probabilities: the loss should reward predicting p close to 1 when y = 1 and close to 0 when y = 0.
Binary cross-entropy (BCE) comes directly from maximum likelihood estimation for a Bernoulli model.
Assume for each input x, the label y is drawn as:
P(y = 1 | x) = p
P(y = 0 | x) = 1 − p
with p = σ(z) and z = w·x + b.
The Bernoulli probability mass function can be written compactly as:
P(y | x) = pʸ (1 − p)^(1−y)
because when y = 1 it evaluates to p, and when y = 0 it evaluates to 1 − p.
Given N i.i.d. examples {(xᵢ, yᵢ)}:
L(w, b) = ∏ᵢ pᵢ^(yᵢ) (1 − pᵢ)^(1−yᵢ)
where pᵢ = σ(w·xᵢ + b).
Maximizing a product is awkward, so take logs:
log L = ∑ᵢ [ yᵢ log pᵢ + (1 − yᵢ) log(1 − pᵢ) ]
Maximum likelihood is equivalent to minimizing negative log-likelihood:
J(w, b) = − log L = − ∑ᵢ [ yᵢ log pᵢ + (1 − yᵢ) log(1 − pᵢ) ]
Often we average over N:
J = (1/N) ∑ᵢ ℓᵢ
with per-example loss:
ℓ = −[ y log p + (1 − y) log(1 − p) ]
That is the binary cross-entropy loss.
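The per-example loss translates directly to code. A minimal sketch assuming NumPy; `bce` is an illustrative name:

```python
import numpy as np

def bce(y, p):
    """Per-example binary cross-entropy; y in {0, 1}, p in (0, 1)."""
    return -(y * np.log(p) + (1 - y) * np.log(1 - p))
```

For example, bce(1, 0.9) ≈ 0.105 (confidently correct), while bce(1, 0.01) ≈ 4.605 (confidently wrong).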
Consider one example. If y = 1, the loss is −log p: near zero when p ≈ 1, but unbounded as p → 0. Symmetrically, if y = 0, the loss is −log(1 − p).
So BCE strongly penalizes confident mistakes.
This is the workhorse result that makes training simple.
For one example:
p = σ(z), z = w·x + b
ℓ = −[ y log p + (1 − y) log(1 − p) ]
We compute ∂ℓ/∂z.
Step 1: derivative of ℓ with respect to p:
∂ℓ/∂p = −[ y · (1/p) + (1 − y) · (−1/(1 − p)) ]
∂ℓ/∂p = − y/p + (1 − y)/(1 − p)
Step 2: chain rule with ∂p/∂z = p(1 − p):
∂ℓ/∂z = (∂ℓ/∂p)(∂p/∂z)
∂ℓ/∂z = ( − y/p + (1 − y)/(1 − p) ) · p(1 − p)
Distribute p(1 − p):
∂ℓ/∂z = −y(1 − p) + (1 − y)p
∂ℓ/∂z = −y + yp + p − yp
∂ℓ/∂z = p − y
So the derivative w.r.t. the score is:
∂ℓ/∂z = (p − y) = (ŷ − y)
Now apply z = w·x + b:
∂z/∂w = x
∂z/∂b = 1
Thus:
∇_w ℓ = (ŷ − y)x
∂ℓ/∂b = (ŷ − y)
For the full dataset (averaged):
∇_w J = (1/N) ∑ᵢ (ŷᵢ − yᵢ) xᵢ
∂J/∂b = (1/N) ∑ᵢ (ŷᵢ − yᵢ)
This is the key computational loop: predict p, compute error (p − y), accumulate gradients.
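That loop can be vectorized over the whole dataset. A minimal sketch assuming NumPy, with `X` an N×d matrix of examples and `y` a length-N label vector; `gradients` is an illustrative name:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def gradients(w, b, X, y):
    """Average BCE gradients over N examples: predict, take error, accumulate."""
    p = sigmoid(X @ w + b)       # predictions, shape (N,)
    err = p - y                  # (y_hat - y), shape (N,)
    grad_w = X.T @ err / len(y)  # (1/N) * sum_i (y_hat_i - y_i) x_i
    grad_b = err.mean()          # (1/N) * sum_i (y_hat_i - y_i)
    return grad_w, grad_b
```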
For standard logistic regression (no hidden layers), the BCE objective is convex in (w, b). That means every local minimum is a global minimum, so gradient descent reliably converges to an optimal solution.
This is a major difference from neural networks, where the objective is non-convex.
A typical training step: compute the gradients g_w = ∇_w J and g_b = ∂J/∂b on the current batch, then update:
w ← w − η g_w
b ← b − η g_b
Because you already know gradient descent, the main learning here is: BCE + sigmoid makes the gradient become “prediction minus label.”
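Putting the pieces together, a full training loop is just the gradient step repeated. A minimal sketch assuming NumPy; `train`, the learning rate, and the step count are illustrative choices, not tuned values:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def train(X, y, lr=0.5, steps=500):
    """Batch gradient descent on the averaged BCE objective."""
    w = np.zeros(X.shape[1])
    b = 0.0
    for _ in range(steps):
        p = sigmoid(X @ w + b)        # predict
        err = p - y                   # prediction minus label
        w -= lr * (X.T @ err) / len(y)
        b -= lr * err.mean()
    return w, b
```

On a tiny separable 1-D dataset such as x ∈ {0, 1, 2, 3} with labels {0, 0, 1, 1}, the learned boundary lands between the classes.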
The model outputs a probability ŷ. Turning it into a label requires a threshold t; the default t = 0.5 treats both kinds of errors as equally costly.
But in many real applications, false positives and false negatives have different costs.
Examples: in medical screening, a missed disease (false negative) is usually costlier than an extra follow-up test, so you lower the threshold; in spam filtering, hiding a legitimate email is costly, so you raise it.
So logistic regression naturally supports probability-based decision-making.
Accuracy is not always enough, especially with class imbalance.
Common choices: precision, recall, F1 score, and ROC AUC.
BCE is a natural metric because it evaluates probability quality, not just hard labels.
To reduce overfitting, add a penalty on w.
L2 regularization (ridge):
J_reg = J + (λ/2)‖w‖²
Gradient adds:
∇_w J_reg = ∇_w J + λw
(Usually the bias b is not regularized.)
L1 regularization (lasso) encourages sparsity, but its gradient uses subgradients and optimization needs more care.
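For L2, the change to the gradient code is one extra term. A minimal sketch assuming NumPy; `gradients_l2` and the default λ are illustrative:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def gradients_l2(w, b, X, y, lam=0.1):
    """BCE gradients plus the L2 penalty term lambda * w (bias not regularized)."""
    p = sigmoid(X @ w + b)
    err = p - y
    grad_w = X.T @ err / len(y) + lam * w  # data term + penalty term
    grad_b = err.mean()                    # bias left unpenalized
    return grad_w, grad_b
```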
Directly computing log(σ(z)) can cause issues when z is large in magnitude: e^(−z) overflows for very negative z, and σ(z) rounds to exactly 0 or 1, making log(σ(z)) or log(1 − σ(z)) evaluate to −∞.
In practice, libraries use a stable form often called binary cross-entropy with logits, where you pass z (the logit) directly.
This is a practical detail, but it matters for robust training.
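One common stable formulation rewrites the loss directly in terms of the logit as max(z, 0) − y·z + log(1 + e^(−|z|)), which is algebraically equal to BCE but never exponentiates a large positive number. A minimal sketch assuming NumPy; `bce_with_logits` is an illustrative name (libraries use similar names for their own versions):

```python
import numpy as np

def bce_with_logits(y, z):
    """Numerically stable per-example BCE computed from the logit z.
    Equal to -[y log sigma(z) + (1-y) log(1 - sigma(z))], but e^z never overflows."""
    return np.maximum(z, 0) - y * z + np.log1p(np.exp(-np.abs(z)))
```

For moderate z it matches the naive formula; for extreme z (say z = 1000 with y = 1) it returns a finite, near-zero loss instead of NaN.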
Logistic regression is a 1-layer neural network: one linear layer (a single neuron), a sigmoid activation, and a cross-entropy loss.
When you later learn neural networks, you’ll generalize the linear layer to multiple layers and nonlinearities. The final layer for binary classification often remains a sigmoid (or a 2-class softmax), and the loss remains cross-entropy.
So mastering logistic regression means you already understand the core building blocks: linear layers, sigmoid activations, cross-entropy loss, and gradient-based training.
You’re standing right at the entrance to Neural Networks.
Let w = (0.8, −0.4), b = −0.2. For input x = (2, 1), compute z, ŷ = σ(z), and the predicted class with threshold 0.5.
Compute the linear score:
z = w·x + b
= (0.8)(2) + (−0.4)(1) + (−0.2)
= 1.6 − 0.4 − 0.2
= 1.0
Map score to probability:
ŷ = σ(z) = 1 / (1 + e^(−1.0))
≈ 1 / (1 + 0.3679)
≈ 0.7311
Apply threshold t = 0.5:
ŷ ≈ 0.7311 ≥ 0.5 ⇒ predict class 1
Insight: The decision boundary is z = 0. Here z = 1 is on the positive side, and sigmoid turns that margin into a probability (about 73%).
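The arithmetic above can be double-checked in a few lines of plain Python:

```python
import math

# Verify the worked example: w = (0.8, -0.4), b = -0.2, x = (2, 1).
z = 0.8 * 2 + (-0.4) * 1 + (-0.2)      # linear score
y_hat = 1 / (1 + math.exp(-z))         # sigmoid
label = int(y_hat >= 0.5)              # threshold at 0.5
```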
Suppose the model predicts ŷ = 0.9 for an example whose true label is y = 1. Then compute the per-example BCE loss. Repeat for a confident wrong prediction ŷ = 0.01 when y = 1.
If y = 1, BCE loss is:
ℓ = −[ y log ŷ + (1 − y) log(1 − ŷ) ]
= −[ 1 · log(0.9) + 0 · log(0.1) ]
= −log(0.9)
≈ 0.1053
For ŷ = 0.01 with y = 1:
ℓ = −log(0.01)
≈ 4.6052
Insight: BCE is gentle when you’re confidently correct, but extremely harsh when you’re confidently wrong—exactly what you want for probabilistic classification.
Single training example: x = (3, −1), y = 0. Start with w = (0, 0), b = 0. Use learning rate η = 0.1. Do one gradient descent update.
Compute score:
z = w·x + b = 0
So ŷ = σ(0) = 0.5
Compute gradients for one example:
∇_w ℓ = (ŷ − y)x
Here (ŷ − y) = 0.5 − 0 = 0.5
So:
∇_w ℓ = 0.5 · (3, −1) = (1.5, −0.5)
Bias gradient:
∂ℓ/∂b = (ŷ − y) = 0.5
Update parameters:
w ← w − η ∇_w ℓ
= (0, 0) − 0.1(1.5, −0.5)
= (−0.15, 0.05)
b ← b − η(∂ℓ/∂b)
= 0 − 0.1(0.5)
= −0.05
Insight: Because y = 0 but ŷ = 0.5 is too high, (ŷ − y) is positive, so the update moves w and b in a direction that reduces the score z on this example next time.
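The single update above can likewise be checked step by step in plain Python:

```python
import math

# One gradient step on x = (3, -1), y = 0, starting from w = (0, 0), b = 0, eta = 0.1.
x, y, eta = (3.0, -1.0), 0.0, 0.1
w, b = [0.0, 0.0], 0.0

z = w[0] * x[0] + w[1] * x[1] + b
y_hat = 1 / (1 + math.exp(-z))   # 0.5 at z = 0
err = y_hat - y                  # (y_hat - y) = 0.5

# Gradient descent: w <- w - eta * err * x, b <- b - eta * err
w = [w[0] - eta * err * x[0], w[1] - eta * err * x[1]]
b = b - eta * err
```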
Logistic regression uses a linear score z = w·x + b and converts it to a probability with the sigmoid σ(z).
The decision boundary ŷ = 0.5 corresponds to z = 0, a hyperplane with normal vector w.
The Bernoulli likelihood leads directly to binary cross-entropy: ℓ = −[ y log ŷ + (1 − y) log(1 − ŷ) ].
A crucial simplification: ∂ℓ/∂z = ŷ − y, giving ∇_w ℓ = (ŷ − y)x and ∂ℓ/∂b = ŷ − y.
Logistic regression’s objective is convex, making optimization more reliable than many non-convex models.
Thresholds can be adjusted away from 0.5 to reflect unequal error costs; probabilities enable this flexibility.
Regularization like (λ/2)‖w‖² is commonly added to reduce overfitting and improve generalization.
Logistic regression is effectively a single-neuron neural network: linear layer + sigmoid + cross-entropy.
Common pitfalls:
Using mean squared error instead of binary cross-entropy, which usually yields worse probabilistic behavior and gradients for classification.
Forgetting the bias term b (or forgetting to include x₀ = 1 when folding bias into w), which can severely limit the decision boundary.
Interpreting weights without considering feature scaling; weights are only comparable when features are on comparable scales.
Computing log(σ(z)) and log(1−σ(z)) naively for large |z|, leading to numerical instability instead of using a stable “BCE with logits” formulation.
Given w = (1, −2), b = 0.5, and x = (1, 2), compute z, ŷ = σ(z), and the predicted label using threshold 0.5.
Hint: Compute z = 1·1 + (−2)·2 + 0.5, then apply σ(z). If z < 0 then ŷ < 0.5.
z = (1)(1) + (−2)(2) + 0.5 = 1 − 4 + 0.5 = −2.5.
ŷ = σ(−2.5) = 1/(1+e^(2.5)) ≈ 1/(1+12.182) ≈ 0.0759.
Since ŷ < 0.5, predict label 0.
Show that for BCE with sigmoid output, the derivative with respect to the logit z is ∂ℓ/∂z = ŷ − y.
Hint: Use ℓ = −[ y log ŷ + (1 − y) log(1 − ŷ) ], then chain rule: (∂ℓ/∂ŷ)(∂ŷ/∂z). Recall ∂σ/∂z = ŷ(1 − ŷ).
Let ŷ = σ(z).
ℓ = −[ y log ŷ + (1 − y) log(1 − ŷ) ].
Compute:
∂ℓ/∂ŷ = −[ y(1/ŷ) + (1 − y)(−1/(1 − ŷ)) ] = −y/ŷ + (1 − y)/(1 − ŷ).
Also ∂ŷ/∂z = ŷ(1 − ŷ).
So:
∂ℓ/∂z = (−y/ŷ + (1 − y)/(1 − ŷ))·ŷ(1 − ŷ)
= −y(1 − ŷ) + (1 − y)ŷ
= −y + yŷ + ŷ − yŷ
= ŷ − y.
One-step update with two examples (mini-batch): Start w = (0, 0), b = 0, η = 0.2. Examples: (x₁=(1,0), y₁=1) and (x₂=(0,1), y₂=0). Use the average gradient over the two examples to update w and b once.
Hint: With w = 0 and b = 0, both logits are 0 so both predictions are 0.5. Compute (ŷ − y) for each example, then average gradients: (1/N)∑(ŷ − y)x.
Initial: z₁ = 0, z₂ = 0 ⇒ ŷ₁ = ŷ₂ = 0.5.
Errors:
(ŷ₁ − y₁) = 0.5 − 1 = −0.5
(ŷ₂ − y₂) = 0.5 − 0 = 0.5
Average weight gradient:
∇_w J = (1/2)[(−0.5)(1,0) + (0.5)(0,1)]
= (1/2)[(−0.5, 0) + (0, 0.5)]
= (−0.25, 0.25)
Average bias gradient:
∂J/∂b = (1/2)[(−0.5) + (0.5)] = 0
Update:
w ← w − η∇_wJ = (0,0) − 0.2(−0.25, 0.25) = (0.05, −0.05)
b ← 0 − 0.2(0) = 0.
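The mini-batch exercise can be verified with a short NumPy check of the averaged-gradient update:

```python
import numpy as np

# Check the mini-batch update: average the per-example gradients, then step once.
X = np.array([[1.0, 0.0], [0.0, 1.0]])
y = np.array([1.0, 0.0])
w, b, eta = np.zeros(2), 0.0, 0.2

p = 1 / (1 + np.exp(-(X @ w + b)))   # both predictions are 0.5 initially
err = p - y                          # (-0.5, 0.5)
w = w - eta * (X.T @ err) / len(y)   # average weight gradient
b = b - eta * err.mean()             # average bias gradient
```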