Draft

Confusion Matrix

Count where a binary spam filter was right, and which kinds of mistakes it made.

concept · beginner · machine-learning · metrics · classification

Hook problem: evaluate a spam filter

You run a simple classifier that labels each email as spam or not-spam. You need to know not only how many predictions were right, but also what kinds of errors were made.

This fixed fixture has 12 messages:

Spam fixture: each evaluated email has an actual label and a model prediction.

| Id | Subject | Note | Actual | Predicted | Cell |
|---|---|---|---|---|---|
| e1 | Prize claim now | Obvious prize bait caught by the filter. | spam | spam | TP: True Positive |
| e2 | Project notes | A normal work message left in the inbox. | not-spam | not-spam | TN: True Negative |
| e3 | Receipt attached | A real receipt incorrectly flagged as spam. | not-spam | spam | FP: False Positive |
| e4 | Account alert | A fake alert slipped into the inbox. | spam | not-spam | FN: False Negative |
| e5 | Limited offer | Promotional spam correctly blocked. | spam | spam | TP: True Positive |
| e6 | Team lunch | A casual team email correctly kept. | not-spam | not-spam | TN: True Negative |
| e7 | Password reset | A requested reset email reached the user. | not-spam | not-spam | TN: True Negative |
| e8 | Urgent transfer | A scam message was missed by the filter. | spam | not-spam | FN: False Negative |
| e9 | Flight update | A useful travel update became a false alarm. | not-spam | spam | FP: False Positive |
| e10 | Crypto bonus | Suspicious bonus spam correctly caught. | spam | spam | TP: True Positive |
| e11 | Invoice approved | A business invoice correctly accepted. | not-spam | not-spam | TN: True Negative |
| e12 | Verify wallet | A phishing-style wallet email was missed. | spam | not-spam | FN: False Negative |

First naive idea

One score is the fraction of predictions that are correct, usually called accuracy:

correct / total.

For these 12 examples, that number is 7 / 12 = 58.3%.
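As a minimal sketch of that calculation, the snippet below copies the 12 actual/predicted pairs from the fixture and counts the matches. The array name `emails` and the pair encoding are illustrative assumptions, not part of the page's own code.

```javascript
// [actual, predicted] pairs for e1..e12, copied from the fixture above.
// The variable name and the pair layout are assumptions for illustration.
const emails = [
  ["spam", "spam"],         // e1
  ["not-spam", "not-spam"], // e2
  ["not-spam", "spam"],     // e3
  ["spam", "not-spam"],     // e4
  ["spam", "spam"],         // e5
  ["not-spam", "not-spam"], // e6
  ["not-spam", "not-spam"], // e7
  ["spam", "not-spam"],     // e8
  ["not-spam", "spam"],     // e9
  ["spam", "spam"],         // e10
  ["not-spam", "not-spam"], // e11
  ["spam", "not-spam"],     // e12
];

// A prediction is correct when it matches the actual label.
const correct = emails.filter(([actual, predicted]) => actual === predicted).length;
const accuracy = correct / emails.length;

console.log(`${correct} / ${emails.length} = ${(accuracy * 100).toFixed(1)}%`);
// prints "7 / 12 = 58.3%"
```

Note that this score only compares labels for equality, so it cannot tell a false alarm from a miss.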

But it hides a critical split:

Right versus wrong split: total correctness ignores which type of mistake is made.

TP: True Positive (correct)

Count: 3

e1, e5, e10

TN: True Negative (correct)

Count: 4

e2, e6, e7, e11

Right vs wrong total

Correct = TP + TN = 7

Wrong = FP + FN = 5
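To see how much a single score can hide, the sketch below compares the fixture's counts with a second, invented filter that has the same accuracy but a very different error mix. The `hypotheticalFilter` counts are made up purely for contrast; only `fixtureFilter` comes from this page.

```javascript
// Counts for the 12-email fixture on this page.
const fixtureFilter = { TP: 3, TN: 4, FP: 2, FN: 3 };

// Hypothetical filter, invented for contrast: it never misses spam
// but flags almost every legitimate email.
const hypotheticalFilter = { TP: 6, TN: 1, FP: 5, FN: 0 };

// Accuracy = correct / total = (TP + TN) / (TP + TN + FP + FN).
const accuracy = ({ TP, TN, FP, FN }) => (TP + TN) / (TP + TN + FP + FN);

for (const [name, counts] of Object.entries({ fixtureFilter, hypotheticalFilter })) {
  const pct = (accuracy(counts) * 100).toFixed(1);
  console.log(`${name}: accuracy ${pct}%, FP=${counts.FP}, FN=${counts.FN}`);
}
// prints:
// fixtureFilter: accuracy 58.3%, FP=2, FN=3
// hypotheticalFilter: accuracy 58.3%, FP=5, FN=0
```

Both filters score 7 / 12, yet one loses five legitimate emails and the other lets three spam messages through.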

Where it hurts

e3 and e4 are both wrong, but they do not cost the same:

  • e3: not-spam predicted as spam (a false alarm).
  • e4: spam predicted as not-spam (a missed spam message).

A false alarm and a miss: two wrong labels can have opposite user impact.

e3

Receipt attached

A real receipt incorrectly flagged as spam.

Actual: not-spam, Predicted: spam (FP: False Positive)

e4

Account alert

A fake alert slipped into the inbox.

Actual: spam, Predicted: not-spam (FN: False Negative)

Core invention

Separate the questions:

  • Is the actual label positive or negative (relative to one chosen positive class)?
  • Is the predicted label positive or negative (for the same positive class)?

That gives four buckets.

Core 2×2 idea: two binary questions define four buckets.

Seeded examples: e1->TP, e3->FP, e2->TN, e4->FN.

Seeded matrix counts (not full fixture)

| actual | pred=spam | pred=not-spam |
|---|---|---|
| actual=spam | 1 (e1) | 1 (e4) |
| actual=not-spam | 1 (e3) | 1 (e2) |

The rows partition the examples by actual label, and the columns partition them by predicted label.
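The two yes/no questions can be sketched as a small helper that names the bucket for one example. `bucketOf` is a hypothetical name introduced here for illustration; the four seeded examples are taken from the text above.

```javascript
// Hypothetical helper: answers both yes/no questions, then names the bucket.
function bucketOf(actual, predicted, positiveLabel) {
  const actualPositive = actual === positiveLabel;       // question 1
  const predictedPositive = predicted === positiveLabel; // question 2
  if (actualPositive && predictedPositive) return "TP";
  if (!actualPositive && predictedPositive) return "FP";
  if (!actualPositive && !predictedPositive) return "TN";
  return "FN"; // actual positive, predicted negative
}

// The four seeded examples as [actual, predicted], with positive = spam.
const seeded = {
  e1: ["spam", "spam"],
  e3: ["not-spam", "spam"],
  e2: ["not-spam", "not-spam"],
  e4: ["spam", "not-spam"],
};

for (const [id, [actual, predicted]] of Object.entries(seeded)) {
  console.log(`${id} -> ${bucketOf(actual, predicted, "spam")}`);
}
// prints, one per line: e1 -> TP, e3 -> FP, e2 -> TN, e4 -> FN
```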

Interactive trace

Step through the same 12 examples and watch one cell increment each time.

Confusion matrix trace

Step 1 of 12: add e1 to True Positive (TP).

Actual: spam; Predicted: spam.

e1: actual is positive under positive=spam, prediction is positive; this is True Positive (TP).

Current counts

TP=1, FP=0, TN=0, FN=0

total = 1

TP changed: 0 → 1.

Final invariants

TP + FP + TN + FN = 12

right = TP + TN = 7

wrong = FP + FN = 5

Current matrix counts

| actual \ predicted | positive 1 | negative 0 |
|---|---|---|
| actual=positive | TP: 1 | FN: 0 |
| actual=negative | FP: 0 | TN: 0 |

Step ledger (active row marked)

| Current | Step | Id | Subject | Actual | Predicted | Cell |
|---|---|---|---|---|---|---|
| active | 1 | e1 | Prize claim now | spam | spam | TP: True Positive |
| | 2 | e2 | Project notes | not-spam | not-spam | TN: True Negative |
| | 3 | e3 | Receipt attached | not-spam | spam | FP: False Positive |
| | 4 | e4 | Account alert | spam | not-spam | FN: False Negative |
| | 5 | e5 | Limited offer | spam | spam | TP: True Positive |
| | 6 | e6 | Team lunch | not-spam | not-spam | TN: True Negative |
| | 7 | e7 | Password reset | not-spam | not-spam | TN: True Negative |
| | 8 | e8 | Urgent transfer | spam | not-spam | FN: False Negative |
| | 9 | e9 | Flight update | not-spam | spam | FP: False Positive |
| | 10 | e10 | Crypto bonus | spam | spam | TP: True Positive |
| | 11 | e11 | Invoice approved | not-spam | not-spam | TN: True Negative |
| | 12 | e12 | Verify wallet | spam | not-spam | FN: False Negative |

Swap-check preview (positive = not-spam): under the swapped label, the current step would produce these counts instead:

TP=0, FP=0, TN=1, FN=0

The fixture is deterministic, so the same email always drives the same step.

Formal version

Let y be the true label and \hat{y} be the predicted label. In this node, the positive class is explicitly declared:

1 = spam, 0 = not spam.

Use rows for actual, columns for predicted:

\[
\begin{array}{c|cc} & \hat{y}=1 & \hat{y}=0 \\ \hline y=1 & TP & FN \\ y=0 & FP & TN \end{array}
\]

Here 1 is a chosen positive class flag, not an intrinsic “goodness” label.

TP, FP, TN, and FN are counts in these four cells.

Matrix orientation: rows are reality (actual); columns are model output (prediction).

Orientation for this node

| | predicted=positive | predicted=negative |
|---|---|---|
| actual=positive | TP | FN |
| actual=negative | FP | TN |

The one-cell invariant is:

\[
TP + FP + TN + FN = n.
\]

n is the number of evaluated examples.

The shared example fixture has:

  • TP = 3
  • FP = 2
  • TN = 4
  • FN = 3
  • correct = TP + TN = 7
  • wrong = FP + FN = 5
  • total = 12

Implementation sketch

Each email updates exactly one bucket:

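// Runs once per email. tp, fp, tn, and fn are counters that start at 0;
// positiveLabel is the declared positive class ("spam" in this node).
if (actual === positiveLabel && predicted === positiveLabel) {
  tp += 1; // actual positive, predicted positive
} else if (actual !== positiveLabel && predicted === positiveLabel) {
  fp += 1; // actual negative, predicted positive
} else if (actual !== positiveLabel && predicted !== positiveLabel) {
  tn += 1; // actual negative, predicted negative
} else {
  fn += 1; // actual positive, predicted negative
}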
Implementation branches: four conditions map each case to one counter update.

| actual == positive? | prediction == positive? | update | interpretation |
|---|---|---|---|
| positive | positive | TP++ | True Positive |
| negative | positive | FP++ | False Positive |
| negative | negative | TN++ | True Negative |
| positive | negative | FN++ | False Negative |
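For a runnable version, the branch logic above can be wrapped in a function and applied to the full fixture. This is a sketch under assumptions: the function name `confusionCounts`, the `{ id, actual, predicted }` record shape, and the short aliases `S` and `N` are all introduced here for illustration.

```javascript
const S = "spam";
const N = "not-spam";

// e1..e12 as [actual, predicted], expanded into records (assumed shape).
const fixture = [
  [S, S], [N, N], [N, S], [S, N], [S, S], [N, N],
  [N, N], [S, N], [N, S], [S, S], [N, N], [S, N],
].map(([actual, predicted], i) => ({ id: `e${i + 1}`, actual, predicted }));

// Hypothetical wrapper around the four-branch sketch: one pass, four counters.
function confusionCounts(examples, positiveLabel) {
  let tp = 0, fp = 0, tn = 0, fn = 0;
  for (const { actual, predicted } of examples) {
    if (actual === positiveLabel && predicted === positiveLabel) {
      tp += 1;
    } else if (actual !== positiveLabel && predicted === positiveLabel) {
      fp += 1;
    } else if (actual !== positiveLabel && predicted !== positiveLabel) {
      tn += 1;
    } else {
      fn += 1;
    }
  }
  return { tp, fp, tn, fn };
}

const result = confusionCounts(fixture, "spam");
console.log(result); // { tp: 3, fp: 2, tn: 4, fn: 3 }
```

Passing `positiveLabel` as a parameter, rather than hard-coding "spam", keeps the positive-class choice explicit at every call site.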

Correctness intuition

The four possible answer pairs to the two boolean questions are exhaustive and mutually exclusive.

For a fixed positive label, each example is exactly one of these four combinations:

  • actual positive + predicted positive
  • actual negative + predicted positive
  • actual negative + predicted negative
  • actual positive + predicted negative

So each example is assigned to exactly one matrix cell. After 9 emails:

TP = 2, FP = 2, TN = 3, FN = 2, total = 9.

At the end:

TP = 3, FP = 2, TN = 4, FN = 3, total = 12.
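One way to check this for every prefix, not only after e9 and e12, is to keep a snapshot of the counters after each step. The sketch below assumes positive = spam; the names `pairs`, `counts`, and `snapshots` are illustrative.

```javascript
const S = "spam";
const N = "not-spam";

// e1..e12 as [actual, predicted], in fixture order.
const pairs = [
  [S, S], [N, N], [N, S], [S, N], [S, S], [N, N],
  [N, N], [S, N], [N, S], [S, S], [N, N], [S, N],
];

const counts = { TP: 0, FP: 0, TN: 0, FN: 0 };
const snapshots = []; // snapshots[i] holds the counts after i + 1 emails

pairs.forEach(([actual, predicted], i) => {
  // Exactly one cell is chosen per example (positive = spam).
  const cell = actual === S
    ? (predicted === S ? "TP" : "FN")
    : (predicted === S ? "FP" : "TN");
  counts[cell] += 1;

  // Prefix invariant: the four counts sum to the number processed so far.
  const processed = i + 1;
  const sum = counts.TP + counts.FP + counts.TN + counts.FN;
  if (sum !== processed) throw new Error(`invariant broken at step ${processed}`);

  snapshots.push({ ...counts });
});

console.log(snapshots[8]);  // after e9:  { TP: 2, FP: 2, TN: 3, FN: 2 }
console.log(snapshots[11]); // after e12: { TP: 3, FP: 2, TN: 4, FN: 3 }
```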

Trace invariant: every example enters exactly one cell.
After e9

TP=2, FP=2, TN=3, FN=2

total = 9

formula: TP + FP + TN + FN = 9

Final

TP=3, FP=2, TN=4, FN=3

total = 12

Invariant: the table is complete and disjoint.

You can read this as an invariant that holds for every trace prefix:

\[
TP + FP + TN + FN = \text{processed examples}
\]

And for the full fixture, processed examples = 12.

Complexity

The scan is one pass over the examples:

\[
\text{time} = O(n), \qquad \text{extra space} = O(1).
\]

Each example contributes one constant-time set of comparisons and one counter increment.

Common confusions

Common confusions: naming and orientation errors are the top sources of confusion.

Positive-class definition

Current table: positive means spam (1 means spam).

positive=spam

| | predicted=spam | predicted=not-spam |
|---|---|---|
| actual=spam | 3 (TP) | 3 (FN) |
| actual=not-spam | 2 (FP) | 4 (TN) |

Positive-class swap

If positive=not-spam, the meaning of TP, FP, TN, and FN changes.

positive=not-spam

| | predicted=not-spam | predicted=spam |
|---|---|---|
| actual=not-spam | 4 (TP) | 2 (FN) |
| actual=spam | 3 (FP) | 3 (TN) |

A useful quick exercise is a positive-class swap. Keep the same underlying predictions but set positiveLabel = "not-spam":

  • TP = 4
  • FP = 3
  • TN = 3
  • FN = 2

This does not make the dataset better or worse; it only changes what “positive” means.
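The swap can be checked with a short sketch that counts the same 12 pairs twice, once per positive label. `countCells` is a hypothetical helper written for this illustration.

```javascript
const S = "spam";
const N = "not-spam";

// e1..e12 as [actual, predicted]; the predictions never change.
const pairs = [
  [S, S], [N, N], [N, S], [S, N], [S, S], [N, N],
  [N, N], [S, N], [N, S], [S, S], [N, N], [S, N],
];

// Hypothetical helper: counts the four cells for a chosen positive label.
function countCells(examples, positiveLabel) {
  const cells = { TP: 0, FP: 0, TN: 0, FN: 0 };
  for (const [actual, predicted] of examples) {
    const actualPositive = actual === positiveLabel;
    const predictedPositive = predicted === positiveLabel;
    const cell = actualPositive
      ? (predictedPositive ? "TP" : "FN")
      : (predictedPositive ? "FP" : "TN");
    cells[cell] += 1;
  }
  return cells;
}

const asSpam = countCells(pairs, S);    // positive = spam
const asNotSpam = countCells(pairs, N); // positive = not-spam

console.log(asSpam);    // { TP: 3, FP: 2, TN: 4, FN: 3 }
console.log(asNotSpam); // { TP: 4, FP: 3, TN: 3, FN: 2 }
```

With only two labels, swapping the positive class exchanges TP with TN and FP with FN, so the number of correct predictions (7) and wrong predictions (5) stays the same.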

Connections

This concept is the base for precision, recall, and f1-score. Those nodes are planned but not yet in the graph.

Graph strip: current follow-up nodes are planned but not yet implemented.

  • confusion-matrix: implemented
  • precision: planned follow-up
  • recall: planned follow-up
  • f1-score: planned follow-up

Static trace ledger (no-JS fallback)

The deterministic ledger below is visible without JavaScript and stays aligned with the interactive demo:

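Confusion trace over 12 emails

| Step | Email | Subject | Actual | Predicted | Cell | TP | FP | TN | FN |
|---|---|---|---|---|---|---|---|---|---|
| 1 | e1 | Prize claim now | spam | spam | TP (True Positive) | 1 | 0 | 0 | 0 |
| 2 | e2 | Project notes | not-spam | not-spam | TN (True Negative) | 1 | 0 | 1 | 0 |
| 3 | e3 | Receipt attached | not-spam | spam | FP (False Positive) | 1 | 1 | 1 | 0 |
| 4 | e4 | Account alert | spam | not-spam | FN (False Negative) | 1 | 1 | 1 | 1 |
| 5 | e5 | Limited offer | spam | spam | TP (True Positive) | 2 | 1 | 1 | 1 |
| 6 | e6 | Team lunch | not-spam | not-spam | TN (True Negative) | 2 | 1 | 2 | 1 |
| 7 | e7 | Password reset | not-spam | not-spam | TN (True Negative) | 2 | 1 | 3 | 1 |
| 8 | e8 | Urgent transfer | spam | not-spam | FN (False Negative) | 2 | 1 | 3 | 2 |
| 9 | e9 | Flight update | not-spam | spam | FP (False Positive) | 2 | 2 | 3 | 2 |
| 10 | e10 | Crypto bonus | spam | spam | TP (True Positive) | 3 | 2 | 3 | 2 |
| 11 | e11 | Invoice approved | not-spam | not-spam | TN (True Negative) | 3 | 2 | 4 | 2 |
| 12 | e12 | Verify wallet | spam | not-spam | FN (False Negative) | 3 | 2 | 4 | 3 |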

Invariant check: TP + FP + TN + FN = 3 + 2 + 4 + 3 = 12.

Exercises

  1. Which cell does e9 reach?
  2. Which cell changes if positiveLabel becomes "not-spam"?
  3. In this node, if an email is predicted as spam but is actually not-spam, what should happen to precision later?
  4. Why is a single error-rate number less informative than the four-cell table?
