Principal Component Analysis
Compress a data table by rotating centered data toward the directions where it varies most.
The problem: a table that is wider than its pattern
Imagine a tiny table of body measurements. Height and arm span are two different columns, but in this fixture they mostly rise together. The table has two numbers per row, while the visible pattern is close to one tilted direction.
That is the pain PCA repairs: keep the important movement without keeping every original column.
| person | height | arm span |
|---|---|---|
| A | 156 | 153 |
| B | 160 | 160 |
| C | 164 | 164 |
| D | 168 | 173 |
| E | 172 | 176 |
| F | 176 | 182 |
First naive idea: drop one raw column
A tempting compression rule is “keep the column with the larger spread” or “drop the column that looks redundant.” That rule is cheap, but it can throw away the shared diagonal movement. If height and arm span move together, neither raw axis is exactly the best one-dimensional summary.
Bar chart: sum squared reconstruction error for three one-coordinate summaries.
- Keep height only: arm span is replaced by its feature mean.
- Keep arm span only: height is replaced by its feature mean.
- Keep PC1: one rotated coordinate follows the shared movement.
The bar comparison uses the same fixture as the table. Keeping PC1 has lower squared reconstruction error because the kept coordinate is a rotated mixture, not just height or just arm span.
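The cost of the raw-column rule can be checked by hand on the fixture. The sketch below is illustrative only: `columnMean` and `errorWhenDropping` are hypothetical helper names, and "dropping" a column is modeled as replacing it with its feature mean, as in the bar descriptions.

```javascript
// Fixture from the table above: [height, arm span] per person.
const X = [
  [156, 153], [160, 160], [164, 164],
  [168, 173], [172, 176], [176, 182],
];

// Mean of column j.
function columnMean(rows, j) {
  return rows.reduce((sum, row) => sum + row[j], 0) / rows.length;
}

// Sum of squared errors when column j is replaced by its mean
// and the other column is kept exactly.
function errorWhenDropping(rows, j) {
  const mean = columnMean(rows, j);
  return rows.reduce((sum, row) => sum + (row[j] - mean) ** 2, 0);
}

const dropArmSpan = errorWhenDropping(X, 1); // keep height only
const dropHeight = errorWhenDropping(X, 0);  // keep arm span only
console.log({ dropArmSpan, dropHeight });
```

On this fixture the two raw-column errors come out to 590 and 280, both far above the error of a single rotated coordinate.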
The pain: the signal is tilted
The useful line cuts diagonally through the centered cloud. A raw column can only measure horizontal or vertical shadows; PCA is allowed to rotate the measuring line first.
PC1 is a unit direction chosen by maximum projected variance. It is visibly diagonal but not forced to 45 degrees.
The core invention: center, rotate, keep
PCA, or Principal Component Analysis, does three small things:
- Center every feature column by subtracting its mean.
- Find unit directions where the centered points have the largest projected variance.
- Keep the first few rotated coordinates and discard the quieter directions.
PCA is unsupervised: it ignores labels and targets. It is also linear: each principal component is a weighted mix of the original features. Because it measures numeric spread, PCA is sensitive to feature scaling.
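The scaling sensitivity can be seen on the fixture itself. This is a minimal sketch, not part of the original page: `pc1AngleDegrees` is a hypothetical helper that uses the closed-form principal-axis angle of a symmetric 2 x 2 covariance matrix, and the factor of 10 is an assumed stand-in for recording height in a unit ten times smaller.

```javascript
// Fixture: [height, arm span] per person.
const X = [
  [156, 153], [160, 160], [164, 164],
  [168, 173], [172, 176], [176, 182],
];

// Angle of PC1 in degrees, measured from the first-feature axis.
// Only valid for two-column tables (closed form for a 2 x 2 covariance).
function pc1AngleDegrees(rows) {
  const n = rows.length;
  const mx = rows.reduce((s, r) => s + r[0], 0) / n;
  const my = rows.reduce((s, r) => s + r[1], 0) / n;
  let cxx = 0, cxy = 0, cyy = 0;
  for (const [x, y] of rows) {
    cxx += (x - mx) ** 2 / n;
    cxy += ((x - mx) * (y - my)) / n;
    cyy += (y - my) ** 2 / n;
  }
  // Major-axis angle of the symmetric matrix [[cxx, cxy], [cxy, cyy]].
  return (0.5 * Math.atan2(2 * cxy, cxx - cyy) * 180) / Math.PI;
}

// Same people, but height multiplied by 10 (assumed unit change).
const rescaled = X.map(([h, a]) => [h * 10, a]);

const originalAngle = pc1AngleDegrees(X);
const rescaledAngle = pc1AngleDegrees(rescaled);
console.log(originalAngle.toFixed(1), rescaledAngle.toFixed(1));
```

With the original numbers PC1 sits near 55 degrees from the height axis. After the rescale it swings to roughly 8 degrees, so the column with the larger numeric spread dominates the direction.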
Centering readout: the mean vector before centering is (166, 168); the mean after centering is (0, 0).
Centering moves the cloud so PCA measures spread around the middle instead of distance from the origin.
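Centering on the fixture can be sketched as follows. The helper names `columnMeans` and `subtractColumnMeans` are borrowed from the implementation sketch on this page; the bodies here are an assumed, minimal version for plain arrays of rows.

```javascript
// Fixture: [height, arm span] per person.
const X = [
  [156, 153], [160, 160], [164, 164],
  [168, 173], [172, 176], [176, 182],
];

// Mean of every column of a row-major table.
function columnMeans(rows) {
  const sums = new Array(rows[0].length).fill(0);
  for (const row of rows) {
    row.forEach((value, j) => { sums[j] += value; });
  }
  return sums.map((s) => s / rows.length);
}

// Subtract the matching column mean from every entry.
function subtractColumnMeans(rows, mu) {
  return rows.map((row) => row.map((value, j) => value - mu[j]));
}

const mu = columnMeans(X);               // mean vector of the raw table
const Xc = subtractColumnMeans(X, mu);   // centered table
const muAfter = columnMeans(Xc);         // mean vector after centering
console.log(mu, Xc[0], muAfter);
```

The first centered row is (-10, -15): person A is 10 below the mean height and 15 below the mean arm span.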
Vocabulary bridge before the formulas
- Variance means how spread out the projected dots are along a line.
- A unit direction is an arrow of length 1, so longer arrows do not win by cheating.
- A projection is the shadow a point makes on a chosen line.
- Orthogonal means at a right angle; later components must look in a new right-angle direction.
- Covariance summarizes which feature values rise and fall together.
- An eigenvector is a direction that covariance stretches without turning.
- A linear mixture is a new feature made by adding weighted old features, such as some height plus some arm span.
Variance sweep
Readout: projected variance along the current direction is 144.61; the PC1 maximum is 144.62.
PCA tests unit directions and chooses the one whose projected coordinates are most spread out.
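A brute-force version of that sweep is sketched below. It is an illustration rather than how PCA is normally computed: the 0.1 degree step and the helper name `projectedVariance` are assumptions, and the variance uses the 1/n convention.

```javascript
// Centered fixture: [height - 166, arm span - 168] per person.
const Xc = [
  [-10, -15], [-6, -8], [-2, -4],
  [2, 5], [6, 8], [10, 14],
];

// Variance of the shadows of the centered points on the unit
// direction at the given angle (degrees from the height axis).
function projectedVariance(rows, angleDeg) {
  const t = (angleDeg * Math.PI) / 180;
  const w = [Math.cos(t), Math.sin(t)]; // unit length by construction
  const sumSq = rows.reduce((s, [x, y]) => s + (x * w[0] + y * w[1]) ** 2, 0);
  return sumSq / rows.length; // rows are centered, so the mean shadow is 0
}

// Try directions from 0 to 179.9 degrees in 0.1 degree steps.
let best = { angle: 0, variance: -Infinity };
for (let i = 0; i < 1800; i++) {
  const angle = i / 10;
  const variance = projectedVariance(Xc, angle);
  if (variance > best.variance) best = { angle, variance };
}
console.log(best.angle, best.variance.toFixed(2));
```

The sweep settles near 55.5 degrees with a projected variance of about 144.62, matching the PC1 maximum in the readout.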
Formal notation
Let n be the number of rows, d the number of original features, and k the number of components we keep. Let X be the raw n x d table, mu the feature-mean vector, X_c the centered table, W_k the first k principal directions, and Z the compressed coordinates.
Pipeline diagram: raw table X (n x d), subtract feature means, project onto directions W_k (d x k), then expand and add the mean.
Center the table:
$$X_c = X - \mathbf{1}\mu^\top$$
Plain meaning: subtract each feature average so PCA studies spread around the cloud center. Here $\mathbf{1}$ is the length-n column of ones.
Summarize how features move together:
$$C = \frac{1}{n} X_c^\top X_c$$
Plain meaning: covariance summarizes shared movement. Some libraries use 1/(n-1) instead of 1/n; the direction story on this page is unchanged.
Choose the first direction:
$$w_1 = \arg\max_{\lVert w \rVert = 1} \; w^\top C w$$
Plain meaning: try unit directions and choose the one where projected points are most spread out.
Principal directions satisfy:
$$C w_j = \lambda_j w_j, \qquad \lambda_1 \ge \lambda_2 \ge \dots \ge \lambda_d \ge 0$$
Plain meaning: each principal direction is a stable covariance direction, and lambda_j is the variance carried by that direction.
Compress and reconstruct:
$$Z = X_c W_k, \qquad \hat{X} = Z W_k^\top + \mathbf{1}\mu^\top$$
Plain meaning: replace each row with coordinates along kept directions, then expand those coordinates and add the mean back for an approximate table.
The share of total spread carried by component j is:
$$\frac{\lambda_j}{\sum_{i=1}^{d} \lambda_i}$$
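For the two-feature fixture, the covariance entries, both eigenvalues, and the variance shares can be worked out directly. This sketch assumes the 1/n convention and uses the closed-form eigenvalues of a symmetric 2 x 2 matrix, which does not generalize to more features.

```javascript
// Centered fixture: [height - 166, arm span - 168] per person.
const Xc = [
  [-10, -15], [-6, -8], [-2, -4],
  [2, 5], [6, 8], [10, 14],
];
const n = Xc.length;

// Covariance matrix [[a, b], [b, c]] with the 1/n convention.
let a = 0, b = 0, c = 0;
for (const [x, y] of Xc) {
  a += (x * x) / n; // variance of height
  b += (x * y) / n; // covariance of height and arm span
  c += (y * y) / n; // variance of arm span
}

// Eigenvalues of a symmetric 2 x 2 matrix in closed form.
const half = (a + c) / 2;
const root = Math.sqrt(((a - c) / 2) ** 2 + b * b);
const lambda = [half + root, half - root]; // largest first

// Share of total spread carried by each component.
const total = lambda[0] + lambda[1];
const share = lambda.map((l) => l / total);
console.log(lambda.map((l) => l.toFixed(2)), share.map((s) => s.toFixed(4)));
```

On this fixture the eigenvalues are about 144.62 and 0.38, so the first component carries more than 99.7% of the total spread.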
Trace lab: the algorithm state
Step 1/6: Raw table. The two measurements mostly rise together, so the table is wider than the real pattern.
2 raw columns (first four rows shown):
| person | height, arm span |
|---|---|
| A | 156, 153 |
| B | 160, 160 |
| C | 164, 164 |
| D | 168, 173 |
Implementation sketch
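// Helper names (columnMeans, multiply, ...) are illustrative, not a specific library.
const mu = columnMeans(X);                                   // d feature means
const Xc = subtractColumnMeans(X, mu);                       // n x d centered table
const C = multiply(transpose(Xc), Xc).scale(1 / X.length);   // d x d covariance, 1/n convention
// Assumed to return unit eigenvectors as rows, largest eigenvalue first.
const components = eigenvectorsSortedByEigenvalue(C);
const Wk = transpose(components.slice(0, k));                // d x k: kept directions as columns
const Z = multiply(Xc, Wk);                                  // n x k compressed coordinates
const Xhat = addColumnMeans(multiply(Z, transpose(Wk)), mu); // n x d reconstruction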
The code mirrors the trace: raw table, centered table, covariance, components, projected codes, and reconstruction. Practical libraries often compute PCA with SVD directly on X_c, truncated SVD, or randomized methods when only a few components are needed.
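A self-contained version of the sketch, restricted to two features, might look like the following. The helper bodies are assumptions written for plain arrays, and `eigenvectorsSortedByEigenvalue2x2` is a hypothetical stand-in that only handles a symmetric 2 x 2 covariance; a general implementation would call an eigen or SVD routine instead.

```javascript
// Fixture: [height, arm span] per person.
const X = [
  [156, 153], [160, 160], [164, 164],
  [168, 173], [172, 176], [176, 182],
];

const columnMeans = (rows) =>
  rows[0].map((_, j) => rows.reduce((s, r) => s + r[j], 0) / rows.length);
const transpose = (m) => m[0].map((_, j) => m.map((row) => row[j]));
const multiply = (A, B) =>
  A.map((row) => B[0].map((_, j) => row.reduce((s, v, i) => s + v * B[i][j], 0)));

// Unit eigenvectors of a symmetric 2 x 2 matrix, returned as rows,
// largest eigenvalue first. Specific to d = 2.
function eigenvectorsSortedByEigenvalue2x2(C) {
  const theta = 0.5 * Math.atan2(2 * C[0][1], C[0][0] - C[1][1]);
  return [
    [Math.cos(theta), Math.sin(theta)],  // PC1
    [-Math.sin(theta), Math.cos(theta)], // PC2, at a right angle to PC1
  ];
}

function pca(X, k) {
  const mu = columnMeans(X);
  const Xc = X.map((row) => row.map((v, j) => v - mu[j]));
  const C = multiply(transpose(Xc), Xc).map((row) => row.map((v) => v / X.length));
  const Wk = transpose(eigenvectorsSortedByEigenvalue2x2(C).slice(0, k)); // d x k
  const Z = multiply(Xc, Wk);                                             // n x k
  const Xhat = multiply(Z, transpose(Wk)).map((row) => row.map((v, j) => v + mu[j]));
  return { mu, Wk, Z, Xhat };
}

// Total squared difference between two tables of the same shape.
const sumSquaredError = (A, B) =>
  A.reduce((s, row, i) => s + row.reduce((t, v, j) => t + (v - B[i][j]) ** 2, 0), 0);

const keepOne = pca(X, 1);
const keepTwo = pca(X, 2);
console.log(sumSquaredError(X, keepOne.Xhat).toFixed(2)); // about 2.29 on this fixture
console.log(sumSquaredError(X, keepTwo.Xhat).toFixed(2)); // about 0.00: nothing was discarded
```

Keeping both components only rotates the centered table, so the reconstruction returns the original rows up to floating-point rounding.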
Correctness intuition
After centering, PC1 is the unit direction with maximum projected variance. PC2 is orthogonal to PC1 and captures the largest remaining variance. Continuing this way keeps mutually right-angle directions ordered from most spread to least spread.
For a fixed number of kept components, this same ordering minimizes squared reconstruction error among linear projections. Intuitively, if a direction has large centered spread and you throw it away, many points need long reconstruction corrections.
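One way to see why maximum variance and minimum reconstruction error pick the same direction: for any unit direction, the squared length kept along the line plus the squared length lost off the line adds up to the same total, so making one larger makes the other smaller. The sketch below checks this on the centered fixture; `split` is a hypothetical helper and the sample angles are arbitrary.

```javascript
// Centered fixture: [height - 166, arm span - 168] per person.
const Xc = [
  [-10, -15], [-6, -8], [-2, -4],
  [2, 5], [6, 8], [10, 14],
];

// Total squared distance of the centered points from the origin.
const totalSumSquares = Xc.reduce((s, [x, y]) => s + x * x + y * y, 0);

// For one unit direction, split the total into the part kept by
// projecting onto the line and the part lost in reconstruction.
function split(angleDeg) {
  const t = (angleDeg * Math.PI) / 180;
  const w = [Math.cos(t), Math.sin(t)];
  let kept = 0, lost = 0;
  for (const [x, y] of Xc) {
    const z = x * w[0] + y * w[1]; // coordinate along w
    const rx = x - z * w[0];       // residual after expanding z back
    const ry = y - z * w[1];
    kept += z * z;
    lost += rx * rx + ry * ry;
  }
  return { kept, lost };
}

for (const angle of [0, 30, 55.5, 90]) {
  const { kept, lost } = split(angle);
  console.log(angle, kept.toFixed(2), lost.toFixed(2), (kept + lost).toFixed(2));
}
```

Angle 0 is the "keep height only" case and angle 90 is "keep arm span only"; the direction near 55.5 degrees keeps the most and therefore loses the least.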
Reconstruction comparison
Selected bar: keep PC1, sum squared error 2.29.
The bars show total squared reconstruction error in the original unstandardized two-feature space:
- Keep height only (sum squared error 590): arm span is replaced by its feature mean.
- Keep arm span only (sum squared error 280): height is replaced by its feature mean.
- Keep PC1 (sum squared error 2.29): one rotated coordinate follows the shared movement.
- Keep PC1 and PC2 (sum squared error 0): both PCA coordinates rebuild the original two-feature table.
Complexity
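- Centering, O(nd): touch every table entry once.
- Covariance, O(nd^2): compare every feature pair across rows.
- Full eigendecomposition of the d x d covariance, O(d^3): often expensive when feature count grows.
- Projection, O(ndk): multiply each row by k directions.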
More rows mainly make scanning and projection longer. More features can make covariance and full eigendecomposition grow much faster because feature pairs and the d x d matrix dominate.
Common confusions
PCA is not a curved-manifold method. If the data lies on a bent surface, PCA can only choose a flat linear view; methods such as Isomap or neighbor-embedding methods ask a different question.
Connections in the graph
- PCA: keep the directions where centered data varies most.
- MDS: place points so low-dimensional distances imitate the original distance table.
- Isomap: use neighbor-graph shortest paths before applying an MDS-style layout.
- LDA: use labels to find projections that separate class means while keeping classes tight.
- QDA: let each class keep its own covariance, creating quadratic boundaries rather than one shared projection.
- SNE: match neighbor probabilities between high and low dimensions.
- t-SNE: repair SNE's crowding problem with a heavy-tailed low-dimensional similarity.
- UMAP: build a fuzzy neighbor graph, then optimize a low-dimensional graph with similar membership strengths.
- feature-map motivates PCA because both rewrite inputs into a representation; PCA learns a linear representation from the data itself.
- mds contrasts with PCA because MDS starts from pairwise distances, while PCA starts from coordinate variance.
- lda contrasts with PCA because LDA uses labels to find separating directions, while PCA ignores labels.
Prediction questions
- If one feature is measured in meters and another in millimeters, what might happen to PC1 before scaling?
- Why does PCA subtract the mean before measuring variance?
- In the fixture above, why does keeping PC1 beat keeping only raw height?
- Why would a curved data manifold be a poor match for plain PCA?