Graphic Illustrations of Why Scores, Ratings, and Percentages Are Not Measures, Part Two

Part One of this two-part blog offered pictures illustrating the difference between numbers that stand for something that adds up and those that do not. The uncontrolled variation in the numbers that pass for measures in health care, education, satisfaction surveys, performance assessments, etc. is analogous to the variation in weights and measures found in Medieval European markets. It is well established that metric uniformity played a vital role in the industrial and scientific revolutions of the nineteenth century. Metrology will inevitably play a similarly central role in the economic and scientific revolutions taking place today.

Clients and students often express their need for measures that are manageable, understandable, and relevant. But sometimes it turns out that we do not understand what we think we understand. New understandings can make what previously seemed manageable and relevant appear unmanageable and irrelevant. Perhaps our misunderstandings about measurement will one day explain why we have failed to innovate and improve as much as we could have.

Of course, there are statistical methods for standardizing scores and proportions so that they are comparable across different normal distributions, but I’ve never once seen them applied to employee, customer, or patient survey results reported to business or hospital managers. They certainly are not used in determining comparable proficiency levels of students under No Child Left Behind. Perhaps there are consultants and reporting systems that make standardized z-scores a routine part of their practices, but even if there are, why should anyone willingly base their decisions on the assumption that normal distributions have been obtained? Why not use methods that give the same result no matter how scores are distributed?

To bring the point home, if statistical standardization is a form of measurement, why don’t we use the z-scores for height distributions instead of the direct measures of how tall we each are? Plainly, the two kinds of numbers have different applications. Somehow, though, we try to make do without the measures in many applications involving tests and surveys, with the unfortunate consequence of much lost information and many lost opportunities for better communication.
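As a quick illustration of how the two kinds of numbers differ (the population mean and standard deviation here are hypothetical, chosen only for the example), a direct measure and its z-score answer different questions:

```python
# Hypothetical population values for adult height.
mean_cm, sd_cm = 170.0, 10.0

height_cm = 185.0                  # a direct measure: means the same thing anywhere
z = (height_cm - mean_cm) / sd_cm  # a z-score: meaningful only relative
                                   # to this particular distribution
print(height_cm, z)
```

The 185 cm reading keeps its meaning in any context; the z-score of 1.5 changes the moment the reference distribution changes.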

Sometimes I wonder: if we gave managers, executives, and entrepreneurs a test on the meaning of the scores, percentages, and logits discussed in Part One, would many do any better on the parts they think they understand than on the parts they find unfamiliar? I suspect not. Some executives whose pay-for-performance bonuses are inflated by statistical accidents are going to be unhappy with what I’m going to say here, but, as I’ve been saying for years, clarifying the financial implications will go a long way toward motivating the needed changes.

How could that be true? Well, consider the way we treat percentages. Imagine that three different hospitals see their patients’ percents agreement with a key survey item change as follows. Which one changed the most?


A. from 30.85% to 50.00%: a 19.15% change

B. from 6.68% to 15.87%: a 9.18% change

C. from 69.15% to 84.13%: a 14.99% change

As is illustrated in Figure 1 below, if all three pairs of survey administrations are situated in the same measure distribution, the three changes could all be exactly the same size.

In this scenario, all the survey administrations shared the same standard deviation in the underlying measure distribution that the key item’s percentage was drawn from, and they started from different initial measures. Different ranges in the measures are associated with different parts of the sample’s distribution, and so different numbers and percentages of patients are associated with the same amount of measured change. It is easy to see that 100-unit measured gains in the range of 50-150 or 1000-1100 on the horizontal axis would scarcely amount to 1% changes, but the same measured gain in the middle of the distribution could be as much as 25%.

Figure 1. Different Percentages, Same Measures

Figure 1 shows how the same measured gain can look wildly different when expressed as a percentage, depending on where the initial measure is positioned in the distribution. But what happens when percentage gains are situated in different distributions that have different patterns of variation?
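In fact, the three pairs of percentages above are exactly what one gets from three identical half-standard-deviation gains starting at different points of a normal distribution. A minimal sketch, assuming a standard normal measure scale (the 0.5 SD gain is my reconstruction for illustration, not a figure stated in the post):

```python
from math import erf, sqrt

def phi(z):
    """Standard normal cumulative distribution function."""
    return 0.5 * (1.0 + erf(z / sqrt(2.0)))

gain = 0.5  # every hospital improves by the same half standard deviation
for name, z0 in [("A", -0.5), ("B", -1.5), ("C", 0.5)]:
    p0, p1 = phi(z0), phi(z0 + gain)
    print(f"{name}: {100*p0:.2f}% -> {100*p1:.2f}% "
          f"({100*(p1 - p0):.2f} point change)")
```

Running this reproduces the A, B, and C percentages above: the same measured gain looks like a 19-point, 9-point, or 15-point change depending only on where it starts in the distribution.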

More specifically, consider a situation in which three different hospitals see their percents agreement with a key survey item change as follows.

A. from 30.85% to 50.00%: a 19.15% change

B. from 30.85% to 50.00%: a 19.15% change

C. from 30.85% to 50.00%: a 19.15% change

Did one change more than the others? Of course, the three percentages are all the same, so we would naturally think that the three increases are all the same. But what if the standard deviations characterizing the three different hospitals’ score distributions are different?

Figure 2, below, shows that the three 19.15% changes could be associated with quite different measured gains. When the distribution is wider and the standard deviation is larger, any given percentage change will be associated with a larger measured change than in cases with narrower distributions and smaller standard deviations.

Figure 2. Same Percentage Gains, Different Measured Gains
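The Figure 2 situation is easy to sketch with Python’s standard-library `statistics.NormalDist`; the mean and standard deviations below are hypothetical, chosen only to show the pattern:

```python
from statistics import NormalDist

def measured_gain(p0, p1, mean, sd):
    """Measured change implied by moving from proportion p0 to p1
    on a normal distribution with the given mean and SD."""
    d = NormalDist(mean, sd)
    return d.inv_cdf(p1) - d.inv_cdf(p0)

# The same 19.15-point percentage gain (30.85% -> 50.00%) at three
# hospitals whose score distributions have different spreads:
for name, sd in [("A", 50), ("B", 100), ("C", 200)]:
    g = measured_gain(0.3085, 0.5000, 500, sd)
    print(f"{name}: measured gain of about {g:.0f} units (SD = {sd})")
```

Doubling the standard deviation doubles the measured gain implied by the very same percentage change.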

And if this is not enough evidence as to the foolhardiness of treating percentages as measures, bear with me through one more example. Imagine another situation in which three different hospitals see their percents agreement with a key survey item change as follows.

A. from 30.85% to 50.00%: a 19.15% change

B. from 36.96% to 50.00%: a 13.04% change

C. from 36.96% to 50.00%: a 13.04% change

Did one change more than the others? Plainly A obtains the largest percentage gain. But Figure 3 shows that, depending on the underlying distribution, A’s 19.15% gain might be a smaller measured change than either B’s or C’s. Further, B’s and C’s measures might not be identical, contrary to what would be expected from the percentages alone.

Figure 3. Percentages Completely at Odds with Measures
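A sketch of the Figure 3 reversal, again using `statistics.NormalDist` with hypothetical standard deviations: A’s bigger percentage gain occurs in a narrow distribution, while B’s and C’s identical percentage gains occur in wider, and different, ones.

```python
from statistics import NormalDist

def measured_gain(p0, p1, sd):
    """Measured change implied by moving from proportion p0 to p1
    on a normal distribution with the given SD (the mean cancels out)."""
    d = NormalDist(0.0, sd)
    return d.inv_cdf(p1) - d.inv_cdf(p0)

a = measured_gain(0.3085, 0.5000, 40)   # 19.15-point gain, narrow distribution
b = measured_gain(0.3696, 0.5000, 100)  # 13.04-point gain, wider distribution
c = measured_gain(0.3696, 0.5000, 150)  # same 13.04-point gain, wider still
print(a, b, c)  # a is the smallest measured change, and b and c differ
```

With these spreads, A’s larger percentage gain corresponds to the smallest measured change, and B’s and C’s identical percentage gains correspond to different measured changes.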

Now we have a fuller appreciation of the scope of the problems associated with the changing unit size illustrated in Part One. Though we think we understand percentages and insist on using them as something familiar and routine, the world that they present to us is as crazily distorted as a carnival funhouse. And we won’t even begin to consider how things look in the context of distributions skewed toward one end of the continuum or the other! There is similarly no point at all in going to bimodal or multimodal distributions (ones that have more than one peak). The vast majority of business applications employing scores, ratings, and percentages as measures do not take the underlying distribution into account. Given the problems that arise in optimal conditions (i.e., with a normal distribution), there is no need to belabor the issue with an enumeration of all the possible things that could be going wrong. Far better to simply move on and construct measurement systems that remain invariant across the different shapes of local data sets’ particular distributions.

How could we have gone so far in making these nonsensical numbers the focus of our attention? To put things back in perspective, we need to keep in mind the evolving magnitude of the problems we face. When Florence Nightingale was deploring the lack of any available indications of the effectiveness of her efforts, a little bit of flawed information was a significant improvement over no information. Ordinal, situation-specific numbers provided highly useful information when problems emerged in local contexts on a scale that could be comprehended and addressed by individuals and small groups.

We no longer live in that world. Today’s problems require kinds of information that must be more meaningful, precise, and actionable than ever before. And not only that, this information cannot remain accessible only to managers, executives, researchers, and data managers. It must be brought to bear in every transaction and information exchange in the industry.

Information has to be formatted in the common currency of uniform metrics to make it as fluid and empowering as possible. Would the auto industry have been able to bring off a quality revolution if every worker’s toolkit was calibrated in a different unit? Could we expect to coordinate schedules easily if we each had clocks scaled in different time units? Obviously not; why should we expect quality revolutions in health care and education when nearly all of our relevant metrics are incommensurable?

Management consultants realized decades ago that information creates a sense of responsibility in the person who possesses it. We cannot expect clinicians and teachers to take full responsibility for the outcomes they produce until they have the information they need to evaluate and improve them. Existing data and systems plainly are not up to the task.

The problem is far less a matter of complex or difficult issues than it is one of culture and priorities. It often takes less effort to remain in a dysfunctional rut and deal with massive inefficiencies than it does to get out of the rut and invent a new system with new potentials. Big changes tend to take place only when systems become so bogged down by their problems that new systems emerge simply out of the need to find some way to keep things in motion. These blogs are written in the hope that we might be able to find our way to new methods without suffering the catastrophes of total system failure. One might well imagine an entrepreneurially-minded consortium of providers, researchers, payors, accreditors, and patient advocates joining forces in small pilot projects testing out new experimental systems.

To know how much of something we’re getting for our money and whether it’s a fair bargain, we need to be able to compare amounts across providers, vendors, treatment options, teaching methods, etc. Scores summed from tests, surveys, or assessments, individual ratings, and percentages of a maximum possible score or frequency do not provide this information because they are not measures. Their unit sizes vary across individuals, collections of indicators (instruments), time, and space. The consequences of treating scores and percentages as measures are not trivial. We will eventually come to see that measurement quality is the primary source of the differences between the current health care and education systems’ regional variations and endlessly spiraling costs, on the one hand, and the geographically uniform quality, costs, and improvements of the systems we will create in the future, on the other.

Markets are dysfunctional when quality and costs cannot be evaluated in common terms by consumers, providers’ quality improvement specialists, researchers, accreditors, and payers. There are widespread calls for greater transparency in purchasing decisions, but transparency is not being defined and operationalized meaningfully or usefully. As currently employed, transparency refers to making key data available for public scrutiny. But these data are almost always expressed as scores, ratings, or percentages that are anything but transparent. In addition to not adding up, these data are also usually presented in indigestibly large volumes, and are not quality assessed.

All things considered, we’re doing amazingly well with our health care and education systems given the way we’ve hobbled ourselves with dysfunctional, incommensurable measures. And that gives us real cause for hope! What will we be able to accomplish when we really put our minds to measuring what we want to manage? How much better will we be able to do when entrepreneurs have the tools they need to innovate new efficiencies? Who knows what we’ll be capable of when we have meaningful measures that stand for amounts that really add up, when data volumes are dramatically reduced to manageable levels, and when data quality is effectively assessed and improved?

For more on the problems associated with these kinds of percentages in the context of NCLB, see Andrew Dean Ho’s article in the August/September, 2008 issue of Educational Researcher, and Charles Murray’s “By the Numbers” column in the July 25, 2006 Wall Street Journal.

This is not the end of the story as to what the new measurement paradigm brings to bear. Next, I’ll post a table contrasting the features of scores, ratings, and percentages with those of measures. Until then, check out the latest issue of the Journal of Applied Measurement, see what’s new in measurement software, or look into what’s up in the way of measurement research projects with the BEAR group at UC Berkeley.

Finally, keep in mind that we are what we measure. It’s time we measured what we want to be.

LivingCapitalMetrics Blog by William P. Fisher, Jr., Ph.D. is licensed under a Creative Commons Attribution-Noncommercial-No Derivative Works 3.0 United States License.



2 Responses to “Graphic Illustrations of Why Scores, Ratings, and Percentages Are Not Measures, Part Two”

  1. Matt Barney Says:

    How did you draw your nice distributions?

    • livingcapitalmetrics Says:

      Using SPSS for one of them, and a little Java applet I designed for the others. The applet is available here, with documentation here.

      It’s fun to play with, as you can watch the percentages change as you slide the mean, standard deviation, standard cutoff, and sample size back and forth.

      The data can be simulated in Winsteps using the SIFILE= command, analyzed in Winsteps, and exported to an SPSS file. A histogram of the measures can then be produced, and the actual or expected response probabilities or percentages agreement for various measures can be found in the distribution and marked.

      It is amazing that these problems with percentages are so widely unrecognized. If two groups with different average measures near the mean improve at the same rate, their measures will remain the same distance apart as they get higher on the scale (further to the right). But as the equidistant average measures move, the percentage difference between them will decrease. A 12% difference in the middle of the scale will progressively drop to 10%, 7%, 4%, etc. only because there is a smaller and smaller proportion of people available to be included as you move further to the right on the scale. Lots of people see this and think the difference is shrinking, but it isn’t. Until we can line up policy and decision making with numbers that really add up, progress in many areas of life will continue to elude us.
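The shrinking percentage difference between two equidistant measures is easy to verify with the standard normal CDF (a sketch; the 0.3 SD gap and the checkpoints are arbitrary illustrations):

```python
from math import erf, sqrt

def phi(z):
    """Standard normal cumulative distribution function."""
    return 0.5 * (1.0 + erf(z / sqrt(2.0)))

gap = 0.3  # two groups stay a constant 0.3 SD apart on the measure scale
for z in (0.0, 0.5, 1.0, 1.5):
    diff = phi(z + gap) - phi(z)
    print(f"at z = {z:.1f}: {100*diff:.1f} point difference")
```

The measured gap never changes, but the percentage difference falls from roughly 12 points toward 3 as both groups move up the scale.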
