Confusion Matrix
Count where a binary spam filter was right, and which kinds of mistakes it made.
Hook problem: evaluate a spam filter
You run a simple classifier that labels each email as spam or not-spam.
You need to know not only how many predictions were right, but also what kinds of errors were made.
This fixed fixture has 12 messages:
| Id | Subject | What happened |
|---|---|---|
| e1 | Prize claim now | Obvious prize bait caught by the filter. |
| e2 | Project notes | A normal work message left in the inbox. |
| e3 | Receipt attached | A real receipt incorrectly flagged as spam. |
| e4 | Account alert | A fake alert slipped into the inbox. |
| e5 | Limited offer | Promotional spam correctly blocked. |
| e6 | Team lunch | A casual team email correctly kept. |
| e7 | Password reset | A requested reset email reached the user. |
| e8 | Urgent transfer | A scam message was missed by the filter. |
| e9 | Flight update | A useful travel update became a false alarm. |
| e10 | Crypto bonus | Suspicious bonus spam correctly caught. |
| e11 | Invoice approved | A business invoice correctly accepted. |
| e12 | Verify wallet | A phishing-style wallet email was missed. |
First naive idea
One score is the total number of correct predictions:
correct / total.
For these 12 examples, that number is 7 / 12 = 58.3%.
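As a minimal sketch of that arithmetic, the snippet below encodes the 12 emails as (actual, predicted) pairs and computes correct / total. The names `fixture` and `accuracy` are illustrative assumptions, not part of this page's demo code.

```javascript
// Hypothetical encoding of the 12-email fixture: [actual, predicted] per email, e1..e12.
const S = "spam", N = "not-spam";
const fixture = [
  [S, S], [N, N], [N, S], [S, N], [S, S], [N, N],
  [N, N], [S, N], [N, S], [S, S], [N, N], [S, N],
].map(([actual, predicted], i) => ({ id: `e${i + 1}`, actual, predicted }));

// Accuracy = number of correct predictions / number of predictions.
function accuracy(examples) {
  const correct = examples.filter((e) => e.actual === e.predicted).length;
  return correct / examples.length;
}

const acc = accuracy(fixture);
console.log(`${(acc * 100).toFixed(1)}%`); // 58.3%
```

The single number says nothing about which emails were wrong or in which direction.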
But it hides a critical split:
- True Positives (TP), count: 3 (e1, e5, e10)
- True Negatives (TN), count: 4 (e2, e6, e7, e11)
Correct = TP + TN = 7
Wrong = FP + FN = 5
Where it hurts
e3 and e4 are both wrong, but they do not cost the same:
- e3: not-spam predicted as spam (false alarm).
- e4: spam predicted as not-spam (missed message).
e3
Receipt attached
A real receipt incorrectly flagged as spam.
Actual: not-spam, Predicted: spam (FP: False Positive)
e4
Account alert
A fake alert slipped into the inbox.
Actual: spam, Predicted: not-spam (FN: False Negative)
Core invention
Separate the questions:
- Is the truth (actual) yes or no, relative to one chosen positive class?
- Is the model output (predicted) yes or no, for the same positive class?
That gives four buckets.
Seeded examples: e1->TP, e3->FP, e2->TN, e4->FN.
| actual | pred=spam | pred=not-spam |
|---|---|---|
| actual=spam | 1 (e1) | 1 (e4) |
| actual=not-spam | 1 (e3) | 1 (e2) |
The two rows partition the examples by actual label, and the two columns partition them by predicted label.
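The two yes/no questions can be sketched as a small function that maps one (actual, predicted) pair to its bucket. The name `cellFor` is a hypothetical helper introduced here for illustration.

```javascript
// Map one example to its bucket, relative to a chosen positive label.
function cellFor(actual, predicted, positiveLabel) {
  const actualPositive = actual === positiveLabel;       // question 1: is the truth "yes"?
  const predictedPositive = predicted === positiveLabel; // question 2: is the output "yes"?
  if (predictedPositive) return actualPositive ? "TP" : "FP";
  return actualPositive ? "FN" : "TN";
}

// The four seeded examples, with positive = spam.
const seeded = {
  e1: cellFor("spam", "spam", "spam"),
  e3: cellFor("not-spam", "spam", "spam"),
  e2: cellFor("not-spam", "not-spam", "spam"),
  e4: cellFor("spam", "not-spam", "spam"),
};
console.log(seeded);
```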
Interactive trace
Step through the same 12 examples and watch one cell increment each time.
Step 1 of 12: add e1 to True Positive (TP).
Actual: spam; Predicted: spam.
e1: actual is positive under positive=spam, prediction is positive; this is True Positive (TP).
TP=1, FP=0, TN=0, FN=0
total = 1
TP changed: 0 → 1.
After all 12 steps:
TP + FP + TN + FN = 12
right = TP + TN = 7
wrong = FP + FN = 5
| actual \ predicted | positive 1 | negative 0 |
|---|---|---|
| actual=positive | TP: 1 | FN: 0 |
| actual=negative | FP: 0 | TN: 0 |
| Current | Step | Id | Subject | Actual | Predicted | Cell |
|---|---|---|---|---|---|---|
| active | 1 | e1 | Prize claim now | spam | spam | TP: True Positive |
| | 2 | e2 | Project notes | not-spam | not-spam | TN: True Negative |
| | 3 | e3 | Receipt attached | not-spam | spam | FP: False Positive |
| | 4 | e4 | Account alert | spam | not-spam | FN: False Negative |
| | 5 | e5 | Limited offer | spam | spam | TP: True Positive |
| | 6 | e6 | Team lunch | not-spam | not-spam | TN: True Negative |
| | 7 | e7 | Password reset | not-spam | not-spam | TN: True Negative |
| | 8 | e8 | Urgent transfer | spam | not-spam | FN: False Negative |
| | 9 | e9 | Flight update | not-spam | spam | FP: False Positive |
| | 10 | e10 | Crypto bonus | spam | spam | TP: True Positive |
| | 11 | e11 | Invoice approved | not-spam | not-spam | TN: True Negative |
| | 12 | e12 | Verify wallet | spam | not-spam | FN: False Negative |
Swap-check preview (positive = not-spam): the same step now increments TN instead of TP:
TP=0, FP=0, TN=1, FN=0
The fixture is deterministic, so the same email always drives the same step.
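The step-by-step trace can be approximated outside the interactive demo with a generator that yields the running counts after each email. This is a sketch under assumptions: `trace` and the fixture encoding are hypothetical names, not the demo's actual code.

```javascript
// Hypothetical encoding of the 12-email fixture: [actual, predicted] per email, e1..e12.
const S = "spam", N = "not-spam";
const fixture = [
  [S, S], [N, N], [N, S], [S, N], [S, S], [N, N],
  [N, N], [S, N], [N, S], [S, S], [N, N], [S, N],
].map(([actual, predicted], i) => ({ id: `e${i + 1}`, actual, predicted }));

// Yield one snapshot per step: which cell changed and the counts so far.
function* trace(examples, positiveLabel) {
  const counts = { TP: 0, FP: 0, TN: 0, FN: 0 };
  let step = 0;
  for (const { id, actual, predicted } of examples) {
    const a = actual === positiveLabel;
    const p = predicted === positiveLabel;
    const cell = p ? (a ? "TP" : "FP") : (a ? "FN" : "TN");
    counts[cell] += 1; // exactly one cell increments per step
    step += 1;
    yield { step, id, cell, ...counts };
  }
}

const steps = [...trace(fixture, "spam")];
console.log(steps[0]);  // step 1: e1 lands in TP
console.log(steps[11]); // step 12: final counts TP=3, FP=2, TN=4, FN=3

// Swap-check: with positive = not-spam, the same first email lands in TN.
const swappedFirst = [...trace(fixture, "not-spam")][0];
console.log(swappedFirst.cell); // TN
```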
Formal version
Let y be the true label and \hat{y} be the predicted label.
In this node, the positive class is explicitly declared:
1 = spam, 0 = not-spam.
Here 1 is a flag for the chosen positive class, not an intrinsic “goodness” label.
Use rows for actual and columns for predicted. TP, FP, TN, and FN are the counts in these four cells:
| | predicted=positive | predicted=negative |
|---|---|---|
| actual=positive | TP | FN |
| actual=negative | FP | TN |
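One way to write the four counts formally is as sums of indicator functions over the evaluated examples. The index i and the indicator notation are assumptions introduced here; y_i and \hat{y}_i are the true and predicted labels of example i, with 1 as the positive class.

```latex
\begin{aligned}
\mathrm{TP} &= \sum_{i=1}^{n} \mathbf{1}\,[\,y_i = 1 \land \hat{y}_i = 1\,] \\
\mathrm{FP} &= \sum_{i=1}^{n} \mathbf{1}\,[\,y_i = 0 \land \hat{y}_i = 1\,] \\
\mathrm{TN} &= \sum_{i=1}^{n} \mathbf{1}\,[\,y_i = 0 \land \hat{y}_i = 0\,] \\
\mathrm{FN} &= \sum_{i=1}^{n} \mathbf{1}\,[\,y_i = 1 \land \hat{y}_i = 0\,]
\end{aligned}
```

For each i exactly one of the four indicators equals 1, which is why the four counts always add up to the number of evaluated examples.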
The one-cell invariant is:
TP + FP + TN + FN = n
where n is the number of evaluated examples.
The shared example fixture has:
- TP = 3
- FP = 2
- TN = 4
- FN = 3
- correct = TP + TN = 7
- wrong = FP + FN = 5
- total = 12
Implementation sketch
Each email updates exactly one bucket:
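    // Runs once per email; tp, fp, tn, and fn are counters that start at 0.
    if (actual === positiveLabel && predicted === positiveLabel) {
      tp += 1; // actual positive, predicted positive
    } else if (actual !== positiveLabel && predicted === positiveLabel) {
      fp += 1; // actual negative, predicted positive
    } else if (actual !== positiveLabel && predicted !== positiveLabel) {
      tn += 1; // actual negative, predicted negative
    } else {
      fn += 1; // remaining case: actual positive, predicted negative
    }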
| actual | predicted | update | interpretation |
|---|---|---|---|
| positive | positive | TP++ | True Positive |
| negative | positive | FP++ | False Positive |
| negative | negative | TN++ | True Negative |
| positive | negative | FN++ | False Negative |
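Wrapping the branch above in a loop gives a complete counter. The following is a minimal runnable sketch; the function name `confusionCounts` and the fixture encoding are assumptions made for illustration.

```javascript
// Hypothetical encoding of the 12-email fixture: [actual, predicted] per email, e1..e12.
const S = "spam", N = "not-spam";
const fixture = [
  [S, S], [N, N], [N, S], [S, N], [S, S], [N, N],
  [N, N], [S, N], [N, S], [S, S], [N, N], [S, N],
].map(([actual, predicted], i) => ({ id: `e${i + 1}`, actual, predicted }));

// Count all four cells in one pass, relative to positiveLabel.
function confusionCounts(examples, positiveLabel) {
  let tp = 0, fp = 0, tn = 0, fn = 0;
  for (const { actual, predicted } of examples) {
    if (actual === positiveLabel && predicted === positiveLabel) {
      tp += 1;
    } else if (actual !== positiveLabel && predicted === positiveLabel) {
      fp += 1;
    } else if (actual !== positiveLabel && predicted !== positiveLabel) {
      tn += 1;
    } else {
      fn += 1;
    }
  }
  return { tp, fp, tn, fn };
}

const counts = confusionCounts(fixture, "spam");
console.log(counts); // tp=3, fp=2, tn=4, fn=3
```

Passing `positiveLabel` as a parameter, rather than hard-coding "spam", keeps the choice of positive class explicit at the call site.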
Correctness intuition
The two boolean questions are exhaustive and mutually exclusive.
For a fixed positive label, each example is exactly one of these four combinations:
- actual positive + predicted positive
- actual negative + predicted positive
- actual negative + predicted negative
- actual positive + predicted negative
So each example is assigned to exactly one matrix cell. After 9 emails:
TP = 2, FP = 2, TN = 3, FN = 2, total = 9.
At the end:
TP = 3, FP = 2, TN = 4, FN = 3, total = 12.
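Invariant: the four cells are exhaustive and disjoint, so every example is counted exactly once.
You can read this as the invariant check for every trace prefix: after k processed examples,
TP + FP + TN + FN = k
For example, after 9 emails 2 + 2 + 3 + 2 = 9, and for the full fixture, processed examples = 12.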
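The prefix invariant can be checked mechanically. The sketch below re-counts the fixture and throws if the four cells ever fail to sum to the number of emails processed so far; `checkPrefixInvariant` is a hypothetical helper name.

```javascript
// Hypothetical encoding of the 12-email fixture: [actual, predicted] per email, e1..e12.
const S = "spam", N = "not-spam";
const fixture = [
  [S, S], [N, N], [N, S], [S, N], [S, S], [N, N],
  [N, N], [S, N], [N, S], [S, S], [N, N], [S, N],
].map(([actual, predicted], i) => ({ id: `e${i + 1}`, actual, predicted }));

// After each email, verify that the cell counts sum to the prefix length.
function checkPrefixInvariant(examples, positiveLabel) {
  const c = { TP: 0, FP: 0, TN: 0, FN: 0 };
  examples.forEach(({ actual, predicted }, index) => {
    const a = actual === positiveLabel;
    const p = predicted === positiveLabel;
    c[p ? (a ? "TP" : "FP") : (a ? "FN" : "TN")] += 1;
    const processed = index + 1;
    if (c.TP + c.FP + c.TN + c.FN !== processed) {
      throw new Error(`invariant broken after ${processed} examples`);
    }
  });
  return c;
}

const finalCounts = checkPrefixInvariant(fixture, "spam");
console.log(finalCounts); // TP=3, FP=2, TN=4, FN=3
```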
Complexity
The scan is one pass over the examples: O(n) time for n examples, plus constant extra space for the four counters.
Each example contributes a constant number of comparisons and one counter increment.
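To make the constant work per example concrete, here is an alternative sketch that replaces the if/else chain with one array index computed from the two boolean answers. The slot layout and the name `countBySlot` are illustrative choices, not something this page prescribes.

```javascript
// Hypothetical encoding of the 12-email fixture: [actual, predicted] per email, e1..e12.
const S = "spam", N = "not-spam";
const fixture = [
  [S, S], [N, N], [N, S], [S, N], [S, S], [N, N],
  [N, N], [S, N], [N, S], [S, S], [N, N], [S, N],
].map(([actual, predicted], i) => ({ id: `e${i + 1}`, actual, predicted }));

// Slot index = 2 * actualPositive + predictedPositive:
// 0 = TN, 1 = FP, 2 = FN, 3 = TP.
function countBySlot(examples, positiveLabel) {
  const slots = [0, 0, 0, 0];
  for (const { actual, predicted } of examples) {
    const a = actual === positiveLabel ? 1 : 0;
    const p = predicted === positiveLabel ? 1 : 0;
    slots[2 * a + p] += 1; // two comparisons and one increment per example
  }
  const [tn, fp, fn, tp] = slots;
  return { tp, fp, tn, fn };
}

const slotCounts = countBySlot(fixture, "spam");
console.log(slotCounts); // tp=3, fp=2, tn=4, fn=3
```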
Common confusions
Current table: positive means spam (1 means spam).
| | predicted=spam | predicted=not-spam |
|---|---|---|
| actual=spam | 3 (TP) | 3 (FN) |
| actual=not-spam | 2 (FP) | 4 (TN) |
If positive = not-spam, the same counts take on different TP/FP/TN/FN roles.
| | predicted=not-spam | predicted=spam |
|---|---|---|
| actual=not-spam | 4 (TP) | 2 (FN) |
| actual=spam | 3 (FP) | 3 (TN) |
A useful quick exercise is a positive-class swap. Keep the same underlying predictions but set positiveLabel = "not-spam":
TP = 4, FP = 3, TN = 3, FN = 2
This does not make the dataset better or worse; it only changes what “positive” means.
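The swap exercise can be run directly: count the same predictions twice, once per positive label, and compare. This is a sketch with hypothetical names (`confusionCounts`, `fixture`); it shows that TP trades places with TN and FP trades places with FN.

```javascript
// Hypothetical encoding of the 12-email fixture: [actual, predicted] per email, e1..e12.
const S = "spam", N = "not-spam";
const fixture = [
  [S, S], [N, N], [N, S], [S, N], [S, S], [N, N],
  [N, N], [S, N], [N, S], [S, S], [N, N], [S, N],
].map(([actual, predicted], i) => ({ id: `e${i + 1}`, actual, predicted }));

function confusionCounts(examples, positiveLabel) {
  const c = { tp: 0, fp: 0, tn: 0, fn: 0 };
  for (const { actual, predicted } of examples) {
    const a = actual === positiveLabel;
    const p = predicted === positiveLabel;
    c[p ? (a ? "tp" : "fp") : (a ? "fn" : "tn")] += 1;
  }
  return c;
}

const asSpam = confusionCounts(fixture, "spam");        // tp=3, fp=2, tn=4, fn=3
const asNotSpam = confusionCounts(fixture, "not-spam"); // tp=4, fp=3, tn=3, fn=2

// Same predictions, relabelled roles: TP <-> TN and FP <-> FN.
console.log(asSpam.tp === asNotSpam.tn && asSpam.tn === asNotSpam.tp); // true
console.log(asSpam.fp === asNotSpam.fn && asSpam.fn === asNotSpam.fp); // true
```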
Connections
This concept is the basis for precision, recall, and F1 score. Those nodes are planned but not yet in the graph.
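As a preview only, the standard definitions of those three metrics can be sketched from the four counts. The function name `metrics` is an assumption, and the sketch deliberately ignores zero denominators, which a real implementation would need to handle.

```javascript
// Standard definitions, computed from confusion-matrix counts.
// Assumes tp + fp > 0, tp + fn > 0, and precision + recall > 0.
function metrics({ tp, fp, fn }) {
  const precision = tp / (tp + fp); // of everything flagged positive, how much was right
  const recall = tp / (tp + fn);    // of everything actually positive, how much was found
  const f1 = (2 * precision * recall) / (precision + recall); // harmonic mean of the two
  return { precision, recall, f1 };
}

// Counts for this page's fixture with positive = spam.
const m = metrics({ tp: 3, fp: 2, tn: 4, fn: 3 });
console.log(m.precision.toFixed(3), m.recall.toFixed(3), m.f1.toFixed(3)); // 0.600 0.500 0.545
```

Note that TN does not appear in any of the three formulas, which is one reason the full four-cell table carries more information than any single derived score.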
Static trace ledger (no-JS fallback)
The deterministic ledger below is visible without JavaScript and stays aligned with the interactive demo:
| Step | Id | Subject | Actual | Predicted | Cell | TP | FP | TN | FN |
|---|---|---|---|---|---|---|---|---|---|
| 1 | e1 | Prize claim now | spam | spam | TP (True Positive) | 1 | 0 | 0 | 0 |
| 2 | e2 | Project notes | not-spam | not-spam | TN (True Negative) | 1 | 0 | 1 | 0 |
| 3 | e3 | Receipt attached | not-spam | spam | FP (False Positive) | 1 | 1 | 1 | 0 |
| 4 | e4 | Account alert | spam | not-spam | FN (False Negative) | 1 | 1 | 1 | 1 |
| 5 | e5 | Limited offer | spam | spam | TP (True Positive) | 2 | 1 | 1 | 1 |
| 6 | e6 | Team lunch | not-spam | not-spam | TN (True Negative) | 2 | 1 | 2 | 1 |
| 7 | e7 | Password reset | not-spam | not-spam | TN (True Negative) | 2 | 1 | 3 | 1 |
| 8 | e8 | Urgent transfer | spam | not-spam | FN (False Negative) | 2 | 1 | 3 | 2 |
| 9 | e9 | Flight update | not-spam | spam | FP (False Positive) | 2 | 2 | 3 | 2 |
| 10 | e10 | Crypto bonus | spam | spam | TP (True Positive) | 3 | 2 | 3 | 2 |
| 11 | e11 | Invoice approved | not-spam | not-spam | TN (True Negative) | 3 | 2 | 4 | 2 |
| 12 | e12 | Verify wallet | spam | not-spam | FN (False Negative) | 3 | 2 | 4 | 3 |
Invariant check: TP + FP + TN + FN = 3 + 2 + 4 + 3 = 12.
Exercises
- Which cell does e9 reach?
- Which cell changes if positiveLabel becomes "not-spam"?
- In this node, if an email is predicted as spam but is actually not-spam, what should happen to precision later?
- Why is a single wrong-rate number less informative than the four-cell table?