Confusion Matrix | ReConcept Lab

Hook problem: evaluate a spam filter

You run a simple classifier that labels each email as spam or not-spam. You need to know not only how many predictions were right, but also what kinds of errors were made.

This fixed fixture has 12 messages:

Spam fixture: each evaluated email has an actual label and a model prediction.

Id	Subject	Note	Actual	Predicted	Cell
e1	Prize claim now	Obvious prize bait caught by the filter.	spam	spam	TP: True Positive
e2	Project notes	A normal work message left in the inbox.	not-spam	not-spam	TN: True Negative
e3	Receipt attached	A real receipt incorrectly flagged as spam.	not-spam	spam	FP: False Positive
e4	Account alert	A fake alert slipped into the inbox.	spam	not-spam	FN: False Negative
e5	Limited offer	Promotional spam correctly blocked.	spam	spam	TP: True Positive
e6	Team lunch	A casual team email correctly kept.	not-spam	not-spam	TN: True Negative
e7	Password reset	A requested reset email reached the user.	not-spam	not-spam	TN: True Negative
e8	Urgent transfer	A scam message was missed by the filter.	spam	not-spam	FN: False Negative
e9	Flight update	A useful travel update became a false alarm.	not-spam	spam	FP: False Positive
e10	Crypto bonus	Suspicious bonus spam correctly caught.	spam	spam	TP: True Positive
e11	Invoice approved	A business invoice correctly accepted.	not-spam	not-spam	TN: True Negative
e12	Verify wallet	A phishing-style wallet email was missed.	spam	not-spam	FN: False Negative

First naive idea

One score is the total number of correct predictions:

correct / total.

For these 12 examples, that number is 7 / 12 = 58.3%.

But it hides a critical split:

Right versus wrong split: total correctness ignores which type of mistake is made.

TP: True Positive + correct

Count: 3

e1, e5, e10

TN: True Negative + correct

Count: 4

e2, e6, e7, e11

Right vs wrong total

Correct = TP + TN = 7

Wrong = FP + FN = 5
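The naive score can be reproduced with a minimal sketch. The `fixture` array and the `naiveScore` helper are illustrative names assumed here, not part of the lab's code; the labels are copied from the 12 emails listed earlier.

```javascript
// Hypothetical encoding of the 12-email fixture (labels copied from the list above).
const S = "spam", N = "not-spam";
const fixture = [
  { id: "e1", actual: S, predicted: S },
  { id: "e2", actual: N, predicted: N },
  { id: "e3", actual: N, predicted: S },
  { id: "e4", actual: S, predicted: N },
  { id: "e5", actual: S, predicted: S },
  { id: "e6", actual: N, predicted: N },
  { id: "e7", actual: N, predicted: N },
  { id: "e8", actual: S, predicted: N },
  { id: "e9", actual: N, predicted: S },
  { id: "e10", actual: S, predicted: S },
  { id: "e11", actual: N, predicted: N },
  { id: "e12", actual: S, predicted: N },
];

// A prediction is "correct" when it equals the actual label; nothing else is recorded.
function naiveScore(examples) {
  const correct = examples.filter((e) => e.actual === e.predicted).length;
  const total = examples.length;
  return { correct, wrong: total - correct, total, rate: correct / total };
}

const score = naiveScore(fixture);
console.log(`${score.correct} / ${score.total} = ${(score.rate * 100).toFixed(1)}%`);
// prints "7 / 12 = 58.3%"
console.log(`wrong = ${score.wrong}`);
// prints "wrong = 5"
```

Note that `naiveScore` never looks at which label was wrong, which is exactly the information the single number discards.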

Where it hurts

e3 and e4 are both wrong, but they do not cost the same:

e3: not-spam predicted as spam (false alarm),
e4: spam predicted as not-spam (missed message).

A false alarm and a miss: two wrong labels can have opposite user impact.

e3

Receipt attached

A real receipt incorrectly flagged as spam.

Actual: not-spam, Predicted: spam (FP: False Positive)

e4

Account alert

A fake alert slipped into the inbox.

Actual: spam, Predicted: not-spam (FN: False Negative)
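To see the two kinds of mistake side by side, the wrong predictions can be split by direction. This is a sketch under the assumption that spam is the positive class; `fixture`, `falseAlarms`, and `misses` are illustrative names, not identifiers from the lab.

```javascript
// Hypothetical encoding of the 12-email fixture (labels copied from the list above).
const S = "spam", N = "not-spam";
const fixture = [
  { id: "e1", actual: S, predicted: S },
  { id: "e2", actual: N, predicted: N },
  { id: "e3", actual: N, predicted: S },
  { id: "e4", actual: S, predicted: N },
  { id: "e5", actual: S, predicted: S },
  { id: "e6", actual: N, predicted: N },
  { id: "e7", actual: N, predicted: N },
  { id: "e8", actual: S, predicted: N },
  { id: "e9", actual: N, predicted: S },
  { id: "e10", actual: S, predicted: S },
  { id: "e11", actual: N, predicted: N },
  { id: "e12", actual: S, predicted: N },
];

// False alarm (FP): a legitimate email flagged as spam.
const falseAlarms = fixture
  .filter((e) => e.actual === N && e.predicted === S)
  .map((e) => e.id);

// Miss (FN): a spam email that reached the inbox.
const misses = fixture
  .filter((e) => e.actual === S && e.predicted === N)
  .map((e) => e.id);

console.log("false alarms:", falseAlarms.join(", ")); // false alarms: e3, e9
console.log("misses:", misses.join(", "));             // misses: e4, e8, e12
```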

Core invention

Separate the questions:

Is the actual label positive or negative (relative to one chosen positive class)?
Is the predicted label positive or negative (for the same positive class)?

That gives four buckets.

Core 2×2 idea: two binary questions define four buckets.

Seeded examples: e1->TP, e3->FP, e2->TN, e4->FN.

Seeded matrix counts (not full fixture)
actual	pred=spam	pred=not-spam
actual=spam	1 (e1)	1 (e4)
actual=not-spam	1 (e3)	1 (e2)

The two rows partition the examples by actual label, and the two columns partition them by predicted label.

Interactive trace

Step through the same 12 examples and watch one cell increment each time.

Confusion matrix trace

Step 1 of 12: add e1 to True Positive (TP).

Actual: spam; Predicted: spam.

e1: actual is positive under positive=spam, prediction is positive; this is True Positive (TP).

Current counts

TP=1, FP=0, TN=0, FN=0

total = 1

TP changed: 0 → 1.

Final invariants

TP + FP + TN + FN = 12

right = TP + TN = 7

wrong = FP + FN = 5

Current matrix counts
actual \ predicted	positive 1	negative 0
actual=positive	TP: 1	FN: 0
actual=negative	FP: 0	TN: 0

Step ledger (active row marked)
Current	Step	Id	Subject	Actual	Predicted	Cell
active	1	e1	Prize claim now	spam	spam	TP: True Positive
	2	e2	Project notes	not-spam	not-spam	TN: True Negative
	3	e3	Receipt attached	not-spam	spam	FP: False Positive
	4	e4	Account alert	spam	not-spam	FN: False Negative
	5	e5	Limited offer	spam	spam	TP: True Positive
	6	e6	Team lunch	not-spam	not-spam	TN: True Negative
	7	e7	Password reset	not-spam	not-spam	TN: True Negative
	8	e8	Urgent transfer	spam	not-spam	FN: False Negative
	9	e9	Flight update	not-spam	spam	FP: False Positive
	10	e10	Crypto bonus	spam	spam	TP: True Positive
	11	e11	Invoice approved	not-spam	not-spam	TN: True Negative
	12	e12	Verify wallet	spam	not-spam	FN: False Negative

Swap-check preview (positive = not-spam): after the current step, the counts are:

TP=0, FP=0, TN=1, FN=0

The fixture is deterministic, so the same email always drives the same step.

Formal version

Let y be the true label and \hat{y} be the predicted label. In this node, the positive class is explicitly declared:

1 = spam, 0 = not spam.

Use rows for actual, columns for predicted:

\begin{array}{c|cc} & \hat{y}=1 & \hat{y}=0 \\ \hline y=1 & TP & FN \\ y=0 & FP & TN \end{array}

Here 1 is a chosen positive class flag, not an intrinsic “goodness” label.

TP, FP, TN, and FN are counts in these four cells.

Matrix orientation: rows are reality (actual); columns are model output (prediction).

Orientation for this node
	predicted=positive	predicted=negative
actual=positive	TP	FN
actual=negative	FP	TN

Because each example lands in exactly one cell, the counts satisfy the one-cell invariant:

TP + FP + TN + FN = n.

n is the number of evaluated examples.

The shared example fixture has:

TP = 3
FP = 2
TN = 4
FN = 3
correct = TP + TN = 7
wrong = FP + FN = 5
total = 12
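As a quick check of the orientation, the four counts can be laid out as a 2×2 array with rows for actual and columns for predicted, positive first. This is only a sketch; `toMatrix`, `rowSums`, and `colSums` are assumed names introduced for illustration.

```javascript
// Counts taken from the shared fixture (positive = spam).
const counts = { tp: 3, fp: 2, tn: 4, fn: 3 };

// Rows are actual (positive first); columns are predicted (positive first).
function toMatrix({ tp, fp, tn, fn }) {
  return [
    [tp, fn], // actual = positive
    [fp, tn], // actual = negative
  ];
}

const matrix = toMatrix(counts);

// Row sums: how many emails are actually positive / actually negative.
const rowSums = matrix.map((row) => row[0] + row[1]);
// Column sums: how many emails were predicted positive / predicted negative.
const colSums = [matrix[0][0] + matrix[1][0], matrix[0][1] + matrix[1][1]];
// Either set of sums adds up to n.
const n = rowSums[0] + rowSums[1];

console.log(rowSums); // [ 6, 6 ]
console.log(colSums); // [ 5, 7 ]
console.log(n);       // 12
```

Reading the sums is a cheap way to catch a transposed table: the row sums depend only on the actual labels, so they should not move when the model changes.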

Implementation sketch

Each email updates exactly one bucket:

if (actual === positiveLabel && predicted === positiveLabel) {
  tp += 1;
} else if (actual !== positiveLabel && predicted === positiveLabel) {
  fp += 1;
} else if (actual !== positiveLabel && predicted !== positiveLabel) {
  tn += 1;
} else {
  fn += 1;
}

Implementation branches: four conditions map each case to one counter update.

actual	predicted	update	interpretation
positive	positive	`TP++`	True Positive
negative	positive	`FP++`	False Positive
negative	negative	`TN++`	True Negative
positive	negative	`FN++`	False Negative
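The branch logic can be wrapped in a complete, runnable function. The following is a minimal sketch, assuming each example is an object with `actual` and `predicted` fields; the function name `countConfusion` and the `fixture` array are illustrative, not the lab's actual code.

```javascript
// Hypothetical encoding of the 12-email fixture (labels copied from the list above).
const S = "spam", N = "not-spam";
const fixture = [
  { id: "e1", actual: S, predicted: S },
  { id: "e2", actual: N, predicted: N },
  { id: "e3", actual: N, predicted: S },
  { id: "e4", actual: S, predicted: N },
  { id: "e5", actual: S, predicted: S },
  { id: "e6", actual: N, predicted: N },
  { id: "e7", actual: N, predicted: N },
  { id: "e8", actual: S, predicted: N },
  { id: "e9", actual: N, predicted: S },
  { id: "e10", actual: S, predicted: S },
  { id: "e11", actual: N, predicted: N },
  { id: "e12", actual: S, predicted: N },
];

// One pass over the examples; each example increments exactly one counter.
function countConfusion(examples, positiveLabel) {
  const counts = { tp: 0, fp: 0, tn: 0, fn: 0 };
  for (const { actual, predicted } of examples) {
    const actualPositive = actual === positiveLabel;
    const predictedPositive = predicted === positiveLabel;
    if (actualPositive && predictedPositive) counts.tp += 1;
    else if (!actualPositive && predictedPositive) counts.fp += 1;
    else if (!actualPositive && !predictedPositive) counts.tn += 1;
    else counts.fn += 1; // actual positive, predicted negative
  }
  return counts;
}

const counts = countConfusion(fixture, "spam");
console.log(counts); // { tp: 3, fp: 2, tn: 4, fn: 3 }
```

Passing `positiveLabel` as a parameter keeps the choice of positive class explicit at every call site instead of hiding it inside the function.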

Correctness intuition

The two boolean questions are exhaustive and mutually exclusive.

For a fixed positive label, each example is exactly one of these four combinations:

actual positive + predicted positive
actual negative + predicted positive
actual negative + predicted negative
actual positive + predicted negative

So each example is assigned to exactly one matrix cell. After 9 emails:

TP = 2, FP = 2, TN = 3, FN = 2, total = 9.

At the end:

TP = 3, FP = 2, TN = 4, FN = 3, total = 12.

Trace invariant: every example enters exactly one cell.

After e9

TP=2, FP=2, TN=3, FN=2

total = 9

formula: TP + FP + TN + FN = 9

Final

TP=3, FP=2, TN=4, FN=3

total = 12

Invariant: the four cells are exhaustive and disjoint.

You can read this as the invariant check for every trace prefix:

TP + FP + TN + FN = \text{processed examples}

And for the full fixture, processed examples = 12.
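The prefix invariant can be checked mechanically by taking a snapshot of the counts after every step. This is a sketch only; `cellFor`, `tracePrefixes`, and the `fixture` encoding are names assumed for illustration.

```javascript
// Hypothetical encoding of the 12-email fixture (labels copied from the list above).
const S = "spam", N = "not-spam";
const fixture = [
  { id: "e1", actual: S, predicted: S },
  { id: "e2", actual: N, predicted: N },
  { id: "e3", actual: N, predicted: S },
  { id: "e4", actual: S, predicted: N },
  { id: "e5", actual: S, predicted: S },
  { id: "e6", actual: N, predicted: N },
  { id: "e7", actual: N, predicted: N },
  { id: "e8", actual: S, predicted: N },
  { id: "e9", actual: N, predicted: S },
  { id: "e10", actual: S, predicted: S },
  { id: "e11", actual: N, predicted: N },
  { id: "e12", actual: S, predicted: N },
];

// Map one example to the name of the single cell it belongs to.
function cellFor(actual, predicted, positiveLabel) {
  if (predicted === positiveLabel) return actual === positiveLabel ? "tp" : "fp";
  return actual === positiveLabel ? "fn" : "tn";
}

// Return the counts after each prefix, throwing if the invariant ever fails.
function tracePrefixes(examples, positiveLabel) {
  const counts = { tp: 0, fp: 0, tn: 0, fn: 0 };
  const snapshots = [];
  examples.forEach((e, i) => {
    counts[cellFor(e.actual, e.predicted, positiveLabel)] += 1;
    const sum = counts.tp + counts.fp + counts.tn + counts.fn;
    if (sum !== i + 1) throw new Error(`invariant broken at step ${i + 1}`);
    snapshots.push({ ...counts }); // copy, so later steps do not mutate it
  });
  return snapshots;
}

const snapshots = tracePrefixes(fixture, "spam");
console.log(snapshots[8]);  // after e9:  { tp: 2, fp: 2, tn: 3, fn: 2 }
console.log(snapshots[11]); // after e12: { tp: 3, fp: 2, tn: 4, fn: 3 }
```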

Complexity

The scan is one pass over the examples:

\text{time} = O(n), \qquad \text{extra space} = O(1).

Each example contributes one constant-time set of comparisons and one counter increment.

Common confusions

Naming and orientation errors are the top sources of confusion.

Positive-class definition

Current table: positive means spam (1 means spam).

positive=spam
	predicted=spam	predicted=not-spam
actual=spam	3 (TP)	3 (FN)
actual=not-spam	2 (FP)	4 (TN)

Positive-class swap

If positive=not-spam, the meaning of TP, FP, TN, and FN changes.

positive=not-spam
	predicted=not-spam	predicted=spam
actual=not-spam	4 (TP)	2 (FN)
actual=spam	3 (FP)	3 (TN)

A useful quick exercise is a positive-class swap. Keep the same underlying predictions but set positiveLabel = "not-spam":

TP = 4
FP = 3
TN = 3
FN = 2

This does not make the dataset better or worse; it only changes what “positive” means.
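For a two-class problem, the swap amounts to exchanging TP with TN and FP with FN, because "actual positive" and "predicted positive" both flip. The sketch below illustrates that relabeling; `swapPositiveClass` is a hypothetical helper name.

```javascript
// Counts from the shared fixture with positive = spam.
const spamPositive = { tp: 3, fp: 2, tn: 4, fn: 3 };

// Swapping the positive class in a binary problem relabels the cells:
// old TN become TP, old FN become FP, and vice versa.
function swapPositiveClass({ tp, fp, tn, fn }) {
  return { tp: tn, fp: fn, tn: tp, fn: fp };
}

const notSpamPositive = swapPositiveClass(spamPositive);
console.log(notSpamPositive); // { tp: 4, fp: 3, tn: 3, fn: 2 }

// The number of correct predictions (TP + TN) does not depend on the choice.
const correctBefore = spamPositive.tp + spamPositive.tn;
const correctAfter = notSpamPositive.tp + notSpamPositive.tn;
console.log(correctBefore, correctAfter); // 7 7
```

Applying the swap twice returns the original counts, which is a convenient sanity check when a pipeline relabels classes.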

Connections

This concept is the base for precision, recall, and f1-score. Those nodes are planned but not yet in the graph.

Graph strip: current follow-up nodes are planned but not yet implemented.

confusion-matrix (implemented) → precision (planned follow-up) → recall (planned follow-up) → f1-score (planned follow-up)

Static trace ledger (no-JS fallback)

The deterministic ledger below is visible without JavaScript and stays aligned with the interactive demo:

Confusion trace over 12 emails
Step	Email	Subject	Actual	Predicted	Cell	TP	FP	TN	FN
1	e1	Prize claim now	spam	spam	TP (True Positive)	1	0	0	0
2	e2	Project notes	not-spam	not-spam	TN (True Negative)	1	0	1	0
3	e3	Receipt attached	not-spam	spam	FP (False Positive)	1	1	1	0
4	e4	Account alert	spam	not-spam	FN (False Negative)	1	1	1	1
5	e5	Limited offer	spam	spam	TP (True Positive)	2	1	1	1
6	e6	Team lunch	not-spam	not-spam	TN (True Negative)	2	1	2	1
7	e7	Password reset	not-spam	not-spam	TN (True Negative)	2	1	3	1
8	e8	Urgent transfer	spam	not-spam	FN (False Negative)	2	1	3	2
9	e9	Flight update	not-spam	spam	FP (False Positive)	2	2	3	2
10	e10	Crypto bonus	spam	spam	TP (True Positive)	3	2	3	2
11	e11	Invoice approved	not-spam	not-spam	TN (True Negative)	3	2	4	2
12	e12	Verify wallet	spam	not-spam	FN (False Negative)	3	2	4	3

Invariant check: TP + FP + TN + FN = 12.

Exercises

Which cell does e9 reach?
Which cell changes if positiveLabel becomes "not-spam"?
In this node, if an email is predicted as spam but is actually not-spam, what should happen to precision later?
Why is a single wrong-rate number less informative than the four-cell table?
