Principal Component Analysis

The problem: a table that is wider than its pattern

Imagine a tiny table of body measurements. height and arm span are two different columns, but in this fixture they mostly rise together. The table has two numbers per row, while the visible pattern is close to one tilted direction.

That is the pain PCA repairs: keep the important movement without keeping every original column.

One pattern, two raw columnsThe table looks two-dimensional, but the scatter plot shows height and arm span mostly moving together.

Fixture measurements in original units
person	height	arm span
A	156	153
B	160	160
C	164	164
D	168	173
E	172	176
F	176	182

First naive idea: drop one raw column

A tempting compression rule is “keep the column with the larger spread” or “drop the column that looks redundant.” That rule is cheap, but it can throw away the shared diagonal movement. If height and arm span move together, neither raw axis is exactly the best one-dimensional summary.

Dropping a raw column wastes the diagonalA rotated PC1 coordinate keeps the shared movement better than either raw column alone.

keep raw height590

Sum squared reconstruction error. Arm span is replaced by its feature mean.

keep raw arm span280

Sum squared reconstruction error. Height is replaced by its feature mean.

keep PC12.3

Sum squared reconstruction error. One rotated coordinate follows the shared movement.

The bar comparison uses the same fixture as the table. Keeping PC1 has lower squared reconstruction error because the kept coordinate is a rotated mixture, not just height or just arm span.

The pain: the signal is tilted

The useful line cuts diagonally through the centered cloud. A raw column can only measure horizontal or vertical shadows; PCA is allowed to rotate the measuring line first.

A rotated coordinate is a measured shadowEach centered row casts a coordinate onto the PC1 line; the line is computed from covariance, not assumed.

Computed PC155.5°

This is a unit direction chosen by maximum projected variance, visibly diagonal but not forced to 45 degrees.

The core invention: center, rotate, keep

PCA, or Principal Component Analysis, does three small things:

Center every feature column by subtracting its mean.
Find unit directions where the centered points have the largest projected variance.
Keep the first few rotated coordinates and discard the quieter directions.

PCA is unsupervised: it ignores labels and targets. It is also linear: each principal component is a weighted mix of the original features. Because it measures numeric spread, PCA is sensitive to feature scaling.

Centering toggle

mean vector: (166, 168); current mean: (166, 168); mean after centering: (0, 0)

Centering moves the cloud so PCA measures spread around the middle instead of distance from the origin.

Changing units can rotate PC1PCA is sensitive to feature scaling because larger numeric spread can dominate the covariance.

Vocabulary bridge before the formulas

Variance means how spread out the projected dots are along a line.
A unit direction is an arrow of length 1, so longer arrows do not win by cheating.
A projection is the shadow a point makes on a chosen line.
Orthogonal means at a right angle; later components must look in a new right-angle direction.
Covariance summarizes which feature values rise and fall together.
An eigenvector is a direction that covariance stretches without turning.
A linear mixture is a new feature made by adding weighted old features, such as some height plus some arm span.

Variance sweep

candidate direction55°

projected variance: 144.61 / 144.62 PC1 maximum

PCA tests unit directions and chooses the one whose projected coordinates are most spread out.

PC1 maximum144.61

projected variance

Formal notation

Let n be the number of rows, d the number of original features, and k the number of components we keep. Let X be the raw n x d table, mu the feature-mean vector, X_c the centered table, W_k the first k principal directions, and Z the compressed coordinates.

PCA notation mapEach symbol is one object in the same pipeline: raw table, centered table, directions, compressed code, reconstruction.

raw dataX

n x d

centered dataX_c = X - 1mu^T

subtract feature means

kept directionsW_k

d x k

compressed codeZ = X_c W_k

project onto directions

reconstructionX_hat = ZW_k^T + mu

expand and add mean

Center the table:

X_c = X - \mathbf{1}\mu^T

Plain meaning: subtract each feature average so PCA studies spread around the cloud center.

Summarize how features move together:

C = \frac{1}{n}X_c^T X_c

Plain meaning: covariance summarizes shared movement. Some libraries use 1/(n-1) instead of 1/n; the direction story on this page is unchanged.

Choose the first direction:

\max_{\lVert w\rVert=1}\operatorname{Var}(X_cw) = \max_{\lVert w\rVert=1} w^T Cw

Plain meaning: try unit directions and choose the one where projected points are most spread out.

Principal directions satisfy:

Cw_j = \lambda_j w_j

Plain meaning: each principal direction is a stable covariance direction, and lambda_j is the variance carried by that direction.

Compress and reconstruct:

Z = X_c W_k

\hat X = ZW_k^T + \mu

Plain meaning: replace each row with coordinates along kept directions, then expand those coordinates and add the mean back for an approximate table.

The share of total spread carried by component j is:

\text{explained variance ratio}_j = \frac{\lambda_j}{\sum_i \lambda_i}

Trace lab: the algorithm state

PCA trace lab

Step 1/6: Raw table. The two measurements mostly rise together, so the table is wider than the real pattern.

Raw table

2 raw columns

state table

A	156, 153
B	160, 160
C	164, 164
D	168, 173

Implementation sketch

const mu = columnMeans(X);
const Xc = subtractColumnMeans(X, mu);
const C = multiply(transpose(Xc), Xc).scale(1 / X.length);
const components = eigenvectorsSortedByEigenvalue(C);
const Wk = components.slice(0, k);
const Z = multiply(Xc, Wk);
const Xhat = addColumnMeans(multiply(Z, transpose(Wk)), mu);

The code mirrors the trace: raw table, centered table, covariance, components, projected codes, and reconstruction. Practical libraries often compute PCA with SVD directly on X_c, truncated SVD, or randomized methods when only a few components are needed.

Correctness intuition

After centering, PC1 is the unit direction with maximum projected variance. PC2 is orthogonal to PC1 and captures the largest remaining variance. Continuing this way keeps mutually right-angle directions ordered from most spread to least spread.

For a fixed number of kept components, this same ordering minimizes squared reconstruction error among linear projections. Intuitively, if a direction has large centered spread and you throw it away, many points need long reconstruction corrections.

Reconstruction comparison

keep PC1: sum squared error 2.29. One rotated coordinate follows the shared movement.

The bars show total squared reconstruction error in the original unstandardized two-feature space.

keep raw height590

Arm span is replaced by its feature mean.

keep raw arm span280

Height is replaced by its feature mean.

keep PC12.29

One rotated coordinate follows the shared movement.

keep PC1 + PC20

Both PCA coordinates rebuild the original two-feature table.

Complexity

Where PCA spends workUse n for rows, d for original features, and k for kept components.

CenteringO(nd)

touch every table entry once

CovarianceO(nd^2)

compare every feature pair across rows

Full eigendecompositionO(d^3)

often expensive when feature count grows

ProjectionO(ndk)

multiply each row by k directions

More rows mainly make scanning and projection longer. More features can make covariance and full eigendecomposition grow much faster because feature pairs and the d x d matrix dominate.

Common confusions

PCA common confusions

PCA is not a curved-manifold method. If the data lies on a bent surface, PCA can only choose a flat linear view; methods such as Isomap or neighbor-embedding methods ask a different question.

Connections in the graph

Dimensionality reduction pathLinear projections, distance-preserving maps, supervised discriminants, and neighbor embeddings solve different pains.

PCA

Keep the directions where centered data varies most.

MDS

Place points so low-dimensional distances imitate the original distance table.

Isomap

Use neighbor-graph shortest paths before applying an MDS-style layout.

LDA

Use labels to find projections that separate class means while keeping classes tight.

QDA

Let each class keep its own covariance, creating quadratic boundaries rather than one shared projection.

SNE

Match neighbor probabilities between high and low dimensions.

t-SNE

Repair SNE's crowding problem with a heavy-tailed low-dimensional similarity.

UMAP

Build a fuzzy neighbor graph, then optimize a low-dimensional graph with similar membership strengths.

feature-map motivates PCA because both rewrite inputs into a representation; PCA learns a linear representation from the data itself.
mds contrasts with PCA because MDS starts from pairwise distances, while PCA starts from coordinate variance.
lda contrasts with PCA because LDA uses labels to find separating directions, while PCA ignores labels.

Prediction questions

If one feature is measured in meters and another in millimeters, what might happen to PC1 before scaling?
Why does PCA subtract the mean before measuring variance?
In the fixture above, why does keeping PC1 beat keeping only raw height?
Why would a curved data manifold be a poor match for plain PCA?

Graph connections : Principal Component Analysis