APR 25, 2026
Why loss functions and visual hierarchy are the same problem
Essay · ML + Design · 6 min read
I've been sitting with this idea for a while, and I think it is true: loss functions and visual hierarchy are solving the exact same problem. The math looks completely different. The intuition is identical.
Both are asking: what matters most, and by how much?
When you train a neural network, you define a loss function, a mathematical expression that tells the model how wrong it is. Mean squared error, cross-entropy, KL divergence: each one makes a different claim about what kind of wrongness matters. MSE punishes large errors more than small ones (squaring amplifies distance). Cross-entropy cares deeply about confident wrong predictions. If your model says "90% sure this is a cat" and it is a dog, cross-entropy is furious in a way that MSE is not.
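To make that asymmetry concrete, here is a minimal sketch in plain Python. The binary case and the probabilities are made up for illustration:

```python
import math

def mse(p, y):
    # Squared error between the predicted probability and the 0/1 label.
    return (p - y) ** 2

def cross_entropy(p, y):
    # Binary cross-entropy: negative log of the probability given to the truth.
    return -(y * math.log(p) + (1 - y) * math.log(1 - p))

# Model says "90% sure this is a cat" (p = 0.9), but the label is dog (y = 0).
p, y = 0.9, 0
print(f"MSE:           {mse(p, y):.2f}")            # 0.81
print(f"cross-entropy: {cross_entropy(p, y):.2f}")  # 2.30

# Push the confidence to 99%: MSE barely moves, cross-entropy doubles.
p = 0.99
print(f"MSE:           {mse(p, y):.2f}")            # 0.98
print(f"cross-entropy: {cross_entropy(p, y):.2f}")  # 4.61
```

MSE is bounded at 1 here; cross-entropy grows without bound as the wrong answer gets more confident. That is the fury.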
The loss function is a statement of values. It says: this type of mistake costs more than that type. It is not neutral.
Designers do exactly this. They just do not call it a loss function.
When you establish visual hierarchy, deciding that the headline gets 48px, the subhead 24px, and the body 16px, you are defining a weighting scheme. You are saying: the headline carries the most signal. The caption is nearly noise. Someone scanning this page for three seconds should extract the headline and maybe the subhead. Everything else is acceptable loss.
Every typographic decision is a gradient step. You adjust the weight of an element, observe the result (does the eye go where I want?), and update. You are minimizing a loss that you have not written down but absolutely have internalized. The wrong thing is attracting attention. The important thing is being missed.
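If you did write that loss down, it might look something like this toy sketch. Everything here is hypothetical: the importance weights, the observed attention numbers, and the premise that attention is a single measurable fraction ("attention" stands in for whatever you would actually measure, eye-tracking, five-second tests):

```python
# Hypothetical importance weights: how much attention each element *should* get.
IMPORTANCE = {"headline": 1.0, "subhead": 0.5, "body": 0.3, "caption": 0.1}

def design_loss(attention):
    # Penalize the gap between how much attention each element should get
    # and how much it actually got, in both directions.
    return sum((IMPORTANCE[el] - attention[el]) ** 2 for el in IMPORTANCE)

# One "gradient step": observe where the eye went, nudge sizes, re-measure.
observed = {"headline": 0.4, "subhead": 0.2, "body": 0.2, "caption": 0.2}
print(f"loss: {design_loss(observed):.2f}")  # 0.47 -> bump the headline, re-test
```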
The deeper connection is tradeoffs
In ML, you rarely optimize for one thing cleanly. Precision and recall trade against each other. Push one up and the other tends to fall. A fraud detection model that is too aggressive catches all the fraud and also flags half your legitimate customers. You tune the threshold; you decide what the acceptable cost of each error type is. This is called the precision-recall tradeoff, and there is no objectively correct answer. There is only the answer that fits your situation.
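Here is what tuning that threshold looks like in miniature. The scores and labels are invented, but the shape of the tradeoff is the point:

```python
# Invented scores from a toy fraud model (1 = fraud).
scores = [0.95, 0.90, 0.80, 0.70, 0.60, 0.40, 0.30, 0.20]
labels = [1,    1,    0,    1,    0,    0,    1,    0]

def precision_recall(threshold):
    flagged = [(s >= threshold, y) for s, y in zip(scores, labels)]
    tp = sum(1 for f, y in flagged if f and y == 1)      # fraud caught
    fp = sum(1 for f, y in flagged if f and y == 0)      # customers wrongly flagged
    fn = sum(1 for f, y in flagged if not f and y == 1)  # fraud missed
    precision = tp / (tp + fp) if (tp + fp) else 1.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall

for t in (0.85, 0.65, 0.25):
    p, r = precision_recall(t)
    print(f"threshold {t:.2f}: precision {p:.2f}, recall {r:.2f}")
# threshold 0.85: precision 1.00, recall 0.50  (cautious: misses fraud)
# threshold 0.65: precision 0.75, recall 0.75  (a compromise)
# threshold 0.25: precision 0.57, recall 1.00  (aggressive: flags customers)
```

No threshold wins on both axes. Picking one is picking which error you would rather live with.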
Designers live in the same tradeoff space. Clarity and density trade against each other. A dashboard that shows everything is showing nothing; too much signal collapses into noise. A dashboard stripped to the minimum might hide the thing you needed. Every whitespace decision, every color hierarchy choice, every decision to include or exclude an element: it is all precision-recall negotiation. What is the cost of missing important information vs. the cost of being overwhelmed by unimportant information?
Neither field has a clean answer. Both fields have to decide an answer and live with it.
The regularization parallel
There is also what I'd call the regularization parallel.
In ML, regularization is the practice of penalizing model complexity. L2 regularization (weight decay) adds a penalty term to the loss function that grows with the magnitude of the model's weights. It says: all else being equal, simpler is better. The model that explains the data with smaller weights is preferred over the model that explains it with huge compensating weights, because the latter is probably memorizing rather than learning.
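A minimal sketch of what that penalty does, assuming two hypothetical models that fit the data equally well (the numbers are illustrative, not from any real model):

```python
import numpy as np

def l2_loss(y_true, y_pred, weights, lam=0.1):
    # Data-fit term (MSE) plus a penalty that grows with the squared weights.
    return np.mean((y_true - y_pred) ** 2) + lam * np.sum(weights ** 2)

y_true = np.array([1.0, 2.0, 3.0])
y_pred = np.array([1.1, 1.9, 3.0])   # both models produce this same fit

modest = np.array([0.5, -0.3])       # explains the data with small weights
huge = np.array([40.0, -39.8])       # same fit via large compensating weights

print(f"{l2_loss(y_true, y_pred, modest):.3f}")  # 0.041   -> preferred
print(f"{l2_loss(y_true, y_pred, huge):.3f}")    # 318.411 -> penalized
```

The fit is identical, so the penalty breaks the tie toward the simpler explanation.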
Design has Occam's razor built into its culture as a principle: remove everything that does not earn its place. Every element should be doing work. Decoration that does not carry meaning is complexity without justification. It is overfitting to aesthetics at the expense of function. "Perfection is achieved not when there is nothing more to add, but when there is nothing left to take away." Saint-Exupéry said that, and it is a description of regularization.
Dieter Rams said it differently with his ten principles of good design, most famously: good design is as little design as possible. Same regularization impulse. Penalize unnecessary parameters.
Why this matters
I do not think this is a coincidence.
Both fields are trying to extract signal from noise. Both are making claims about what matters. Both are constructing systems that direct attention. A neural network learns where to look in the input space; a designed page teaches a human where to look in the visual space.
The tools differ. The underlying epistemology is the same: assign weights, minimize error, respect tradeoffs, penalize unnecessary complexity.
Next time you're tuning a loss function, try thinking of it as a design decision. Next time you're adjusting a typographic scale, try thinking of it as defining a cost function. The metaphor is not just cute. I think it is actually load-bearing.
Still thinking about this one; notes accumulating.