Occasionally, when I’m speaking to a group unfamiliar with measurement concepts, the audience audibly gasps at some of the things I say. What can be so shocking about anything as mundane as measurement? A lot of things, in fact, since we are in the strange situation of having valid and rigorous intuitions about what measures ought to be, while we simultaneously have entire domains of life in which our measures almost never live up to those intuitions in practice.

So today I’d like to spell out a few things about measurement, graphically. First, I’m going to draw a picture of what good measurement looks like. This picture will illustrate why we value numbers and want to use them for managing what’s important. Then I’m going to draw a picture of what scores, ratings, and percentages look like. Here we’ll see that numbers do not automatically stand for something that adds up the way the numbers themselves do, and why we don’t want to use these funny numbers for managing anything we really care about. What we will see here, in effect, is why high-stakes graduation, admissions, and professional certification and licensure testing agencies have long since abandoned scores, ratings, and percentages as their primary basis for making decisions.

After contrasting those pictures, a third picture will illustrate how to blend the valid intuitions informing what we expect from measures with the equally valid intuitions informing the observations expressed in scores, ratings, and percentages.

Imagine measuring everything in the room you’re in twice, once with a yardstick and once with a meterstick. You record every measure in inches and in centimeters. Then you plot these pairs of measures against each other, with inches on the vertical axis and centimeters on the horizontal. You would come up with a picture like Figure 1, below.

The key thing to appreciate about this plot is that the amounts of length measured by the two different instruments stay the same no matter which number line they are mapped onto. You would get a plot like this even if you sawed a yardstick in half and plotted the inches read off the two halves. You’d also get the same kind of plot (obviously) if you paired up measures of the same things from two different manufacturers’ inch rulers, or from two different brands of metersticks. And you could do the same kind of thing with ounces and grams, or degrees Fahrenheit and Celsius.
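The straight-line relation between the two rulers can be checked with a little arithmetic. What follows is a minimal Python sketch, not a record of real measurements: the list of lengths is made up for illustration, and the only fixed fact is the conversion factor of 2.54 centimeters per inch.

```python
CM_PER_INCH = 2.54  # exact by definition of the international inch

# Hypothetical lengths of objects in a room, in inches (illustrative values only).
lengths_in = [1.0, 12.0, 36.0, 100.0]

# Pair each object's centimeter reading (horizontal axis) with its
# inch reading (vertical axis), as in the plot described above.
pairs = [(inches * CM_PER_INCH, inches) for inches in lengths_in]


def slope(p, q):
    """Rise over run between two (centimeters, inches) points."""
    return (q[1] - p[1]) / (q[0] - p[0])


# Slope between each pair of neighboring points.
slopes = [slope(pairs[i], pairs[i + 1]) for i in range(len(pairs) - 1)]
print(slopes)
```

Every slope comes out the same (one inch per 2.54 centimeters, up to floating-point rounding), which is just another way of saying that the points fall on a single straight line.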

So here we are immersed in the boring-to-the-point-of-being-banal details of measurement. We take these alignments completely for granted, but they are not given to us for nothing. They are products of the huge investments we make in metrological standards. Metrology came of age in the early nineteenth century. Until then, weights and measures varied from market to market. Units with the same name might be different sizes, and units with different names might be the same size. As was so rightly celebrated on World Metrology Day (May 20), metric uniformity contributes hugely to the world economy by reducing transaction costs and by structuring representations of fair value.

We are in dire need of similar metrological systems for human, social, and natural capital. Health care reform, improved education systems, and environmental management will not come anywhere near realizing their full potentials until we establish, implement, and monitor metrological standards that bring intangible forms of capital to economic life.

But can we construct plots like Figure 1 from the numeric scores, ratings, and percentages we commonly assume to be measures? Figure 2 shows the kind of picture we get when we plot percentages against each other (scores and ratings behave in the same way, for reasons given below). These data might be from easy and hard halves of the same reading or math test, from agreeable and disagreeable ends of the same rating scale survey, or from different tests or surveys that happen to vary in their difficulty or agreeability. The Figure 2 data might also come from different situations in which some event or outcome occurs more frequently in one place than it does in another (we’ll go more into this in Part Two of this report).

In contrast with the linear relation obtained in the comparison of inches and centimeters, here we have a curve. Why must this relation necessarily be curved? It cannot be linear because both instruments limit their measurement ranges, and they set different limits. So, if someone scores a 0 on the easy instrument, they are highly likely to also score 0 on the instrument posing more difficult or disagreeable questions. Conversely, if someone scores 100 on the hard instrument, they are highly likely to also score 100 on the easy one.

But what is going to happen in the rest of the measurement range? By the definition of easy and hard, scores on the easy instrument will be higher than those on the hard one. And because the same measured amount is associated with different ranges in the easy and hard score distributions, the scores vary at different rates (Part Two will explore this phenomenon in more detail).
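To see the curvature in numbers, here is a small Python sketch. It rests on an assumption that is mine and not part of the argument above: that the expected percentage follows a simple logistic (Rasch-type) function of the difference between a person's ability and the instrument's difficulty. The function name `expected_percent`, the difficulty values, and the ability values are all hypothetical, chosen only for illustration.

```python
import math


def expected_percent(ability, difficulty):
    """Expected percent of the maximum score under an assumed logistic model."""
    return 100.0 / (1.0 + math.exp(-(ability - difficulty)))


# Hypothetical difficulties for an easy and a hard instrument.
EASY, HARD = -1.0, 1.0

# Hypothetical persons, evenly spaced from low to high ability.
abilities = [-4.0, -2.0, 0.0, 2.0, 4.0]

# (hard percent, easy percent) for each person.
pairs = [(expected_percent(a, HARD), expected_percent(a, EASY)) for a in abilities]
for hard_pct, easy_pct in pairs:
    print(f"hard {hard_pct:5.1f}%   easy {easy_pct:5.1f}%")

# Change in easy percent per unit change in hard percent, between neighbors.
slopes = [
    (pairs[i + 1][1] - pairs[i][1]) / (pairs[i + 1][0] - pairs[i][0])
    for i in range(len(pairs) - 1)
]
print(slopes)
```

Under these assumptions the easy percentage is always the higher of the two, and the slope shrinks steadily from the low end to the high end. A straight line would have one constant slope, so equal steps in the amount measured are clearly not showing up as equal steps in the percentages.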

These kinds of numbers are called ordinal because they meaningfully convey information about rank order. They do not, however, stand for amounts that add up. We are, of course, completely free to treat these ordinal numbers however we want, in any kind of arithmetical or statistical comparison. Whether such comparisons are meaningful and useful is a completely different issue.

Figure 3 shows the Figure 2 data transformed. The mathematical transformation of the percentages produces what is known as a logit, so called because it is a log-odds unit, obtained as the natural logarithm of the response odds. (The response odds are the response probabilities, that is, the original percentages of the maximum possible score, divided by one minus themselves.) This is the simplest possible way of estimating linear measures, and virtually no computer program providing these kinds of estimates would rely on an algorithm this simple and potentially fallible.
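The transformation itself takes only a few lines. This Python sketch implements the log-odds calculation exactly as described in the parentheses above; the function name `logit` and the choice to raise an error at 0 and 100 percent are my own. Those two extreme scores have no finite logit, which is one reason so simple an approach is fallible.

```python
import math


def logit(percent):
    """Natural log of the odds for a percentage of the maximum possible score."""
    p = percent / 100.0
    if not 0.0 < p < 1.0:
        # Odds are zero or undefined at the extremes, so the log is not finite.
        raise ValueError("logit is undefined for scores of 0 or 100 percent")
    return math.log(p / (1.0 - p))


for pct in (10, 25, 50, 75, 90):
    print(pct, round(logit(pct), 3))
```

A score of 50 percent sits at zero logits, and percentages equally far above and below 50 mirror each other in sign.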

Although the relationship shown in Figure 3 is not as precise as that shown in Figure 1, especially at the extremes, the values plotted fall far closer to the identity line than the values in Figure 2 do. Like Figure 1, Figure 3 shows that constant amounts of the thing measured exist irrespective of the particular number line they happen to be mapped onto.
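The same point can be made with idealized numbers. In the Python sketch below, the percentages are generated from an assumed logistic (Rasch-type) model with hypothetical easy and hard difficulties; they are not the data behind the figures. Because the generated values contain no noise, the result is exact here, whereas real data scatter around the line.

```python
import math


def expected_percent(ability, difficulty):
    """Expected percent of the maximum score under an assumed logistic model."""
    return 100.0 / (1.0 + math.exp(-(ability - difficulty)))


def logit(percent):
    """Natural log of the odds for a percentage strictly between 0 and 100."""
    p = percent / 100.0
    return math.log(p / (1.0 - p))


# Hypothetical difficulties and abilities, for illustration only.
EASY, HARD = -1.0, 1.0
abilities = [-4.0, -2.0, 0.0, 2.0, 4.0]

# For each person, the gap between the easy and hard instruments in logits.
gaps = [
    logit(expected_percent(a, EASY)) - logit(expected_percent(a, HARD))
    for a in abilities
]
print(gaps)
```

The gap between the two instruments is the same for every person, so the transformed points fall on a straight line parallel to the identity line. Subtracting that constant gap from one instrument's values would put both on the same number line.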

What this means is that the two instruments could be designed so that the same numbers are read off them when the same amounts are measured. We value numbers as much as we do because they are so completely transparent: 2+2=4 no matter what. But this transparency can be a liability when we assume that every unit amount is the same as all the others when in fact they vary substantially. When different units stand for different amounts, confusion reigns. But we can reasonably hope and strive for great things as we bring human, social, and natural capital to life via universally uniform metrics traceable to reference standards.

A large literature on these methods exists and ought to be more widely read. For more information, see http://www.rasch.org, http://www.livingcapitalmetrics.com, etc.

LivingCapitalMetrics Blog by William P. Fisher, Jr., Ph.D. is licensed under a Creative Commons Attribution-Noncommercial-No Derivative Works 3.0 United States License.

Based on a work at livingcapitalmetrics.wordpress.com.

Permissions beyond the scope of this license may be available at http://www.livingcapitalmetrics.com.