Often we want to measure complex things, like health care performance or bank safety. In a vacuum, a series of simple metrics may not be bad -- but once you make them performance targets, they get gamed. Ambulances rated on reaching patients within 8 minutes may give up on a call once it crosses the 8-minute threshold and go to someone else, or lie about their times. Complex metrics (lots of simple ones all weighted together) may not be any less gameable, are more expensive to measure, and, as rules, may overfit, a la machine learning -- modern bank regulation may be more complex than the data about bank failures can possibly justify.
But an idea that's starting to go around is this: have lots of simple metrics, and measure a random subset of them at surprise times. The analogy is to important examinations: there isn't time or energy to test you on everything you know, but a wide range of possible questions forces you to study all the material if you want to reliably do well. Similarly, a good medical service or a good bank should look good whatever you choose to suddenly measure, without costing a lot to measure.
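As a minimal sketch of the mechanism (the metric names and the function are hypothetical; the book describes the idea only in prose), an auditor might draw a random subset of a hospital's metrics at an unannounced time:

```python
import random

# Hypothetical pool of simple metrics a health-care auditor could check.
METRICS = [
    "ambulance_response_time",
    "patient_readmission_rate",
    "surgical_infection_rate",
    "staff_turnover",
    "complaint_resolution_time",
]

def surprise_audit(metrics, k=2, rng=random):
    """Pick k metrics uniformly at random for an unannounced audit.

    Because any metric might be drawn, the only reliable strategy for
    the audited organization is to perform well on all of them --
    gaming one target no longer pays.
    """
    return rng.sample(metrics, k)

# Each audit checks a different random subset at a surprise time.
print(surprise_audit(METRICS))
```

The point of drawing uniformly is that no single metric is worth gaming in advance: the expected payoff of teaching to any one test shrinks as the question pool grows.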
You do seem to lose something: if you're asking different questions at different times or places, you can't as readily compare performance across organizations or over time, especially with nice graphs. But if those graphs aren't meaningful anyway, because the results are gamed, then you're not really losing anything of substance.
Source: Messy, by Tim Harford, though the idea isn't original to him. The insight about exams goes back to Jeremy Bentham in 1830 -- which suggests a different approach to utilitarianism than people usually associate with him.