Posts Tagged ‘comparable effectiveness’

Build it and they will come

February 8, 2011

“It” in the popular Kevin Costner movie, “Field of Dreams,” was a baseball diamond. He put it in a corn field. Not only did a ghost team conjure itself from the corn, but a line of headlights also appeared on the road. There would seem to have been a stunning lack of preparation for crowds of fans, as parking, food, and toilet facilities were nowhere in sight.

Those things would be taken care of in due course, but that’s another story. The point has nothing to do with being realistic and everything to do with making dreams come true. Believing in yourself and your dreams is hard. Dreams are inherently unrealistic. As George Bernard Shaw said, reasonable people adapt to life and the world. It’s unreasonable people who think the world should adapt to them. And, accordingly, change comes about only because unreasonable and unrealistic people act to make things different.

I dream of a playing field, too. I can’t just go clear a few acres in a field to build it, though. The kind of clearing I’m dreaming of is more abstract. But the same idea applies. I, too, am certain that, if we build it, they will come.

What is it? Who are they? “It” is a better way for each of us to represent who we are to the world, and to see where we stand in it. It is a new language for speaking the truth of what we are each capable of. It is a way of tuning the instruments of a new science that will enable us to harmonize relationships of all kinds: personal, occupational, social, and economic.

Which brings us to who “they” are. They are us. Humanity. We are the players on this field that we will clear. We are the ones who care and who desire meaning. We are the ones who have been robbed of the trust, loyalty, and commitment we’ve invested in governments, corporations, and decades of failed institutions. We are the ones who know what has been lost, and what could still be gained. We are the ones who possess our individual skills, motivations, and health, yet have no easy, transparent way to represent how much of any one of them we have, what quality it is, or how much it can be traded for. We are the ones who all share in the bounty of the earth’s fecund capacity for self-renewal, but who among us can show exactly how much the work we do every day adds or subtracts from the quality of the environment?

So why do I say, build it and they will come? Because this sort of thing is not something that can be created piecemeal. What if Costner’s character in the movie had not just built the field but had instead tried to find venture capital, recruit his dream team, set up a ticket sales vendor, hire management and staff, order uniforms and equipment, etc.? It never would have happened. It doesn’t work that way.

And so, finally, just what do we need to build? Just this: a new metric system. The task is to construct a system of measures for managing what’s most important in life: our relationships, our health, our capacity for productive and creative employment. We need a system that enables us to track our investments in intangible assets like education, health care, community, and quality of life. We need instruments tuned to the same scales, ones that take advantage of recently developed technical capacities for qualitatively meaningful quantification; for information synthesis across indicators/items/questions; for networked, collective thinking; for adaptive innovation support; and for creating fungible currencies in which human, social, and natural capital can be traded in efficient markets.

But this is not a system that can be built piecemeal. Infrastructure on this scale is too complex and too costly for any single individual, firm, or industry to create by itself. And building one part of it at a time will not work. We need to create the environment in which these new forms of life, these new species, these new markets for living capital, can take root and grow, organically. If we create that environment, with incentives and rewards capable of functioning like fertile soil, warm sun, and replenishing rain, it will be impossible to stop the growth.

You see, there are thousands of people around the world using new measurement methods to calibrate tests, surveys and assessments as valid and reliable instruments. But they are operating in an environment in which the fully viable seeds they have to plant are wasted. There’s no place for them to take root. There’s no sun, no water.

Why is the environment for the meaningful, uniform measurement of intangible assets so inhospitable? The primary answer to this question is cultural. We have ingrained and highly counterproductive attitudes toward what are often supposed to be the inherent properties of numbers. One very important attitude of this kind is the common assumption that all numbers are quantitative. But lots of scoring systems and percentage reporting schemes involve numbers that do not stand for something that adds up. There is nothing automatic or simple about the way any given unit of calibrated measurement remains the same all up and down a scale. Arriving at a way to construct and maintain such a unit requires as much intensive research and imaginative investigation in the social sciences as it does in the natural sciences. But where the natural sciences and engineering have grown up around a focus on meaningful measurement, the social sciences have not.

One result of mistaken preconceptions about number is that even when tests, surveys, and assessments measure the same thing, they are disconnected from one another, tuned to different scales. There is no natural environment, no shared ecology, in which the growth of learning can take place in field-wide terms. There’s no common language in which to share what’s been learned. Even when research results are exactly the same, they look different.

But if there were a system of consensus-based reference standard metrics, one for each major construct (reading, writing, and math abilities; health status; physical and psychosocial functioning; quality of life; social and natural capital), there would be the expectation that instruments measuring the same thing should measure in the same unit. Researchers could be contributing to building larger systems when they calibrate new instruments and recalibrate old ones. They would more obviously be adding to the stock of human knowledge, understanding, and wisdom. Divergent results would demand explanations, and convergent ones would give us more confidence as we move forward.

Most importantly, quality improvement and consumer purchasing decisions and behaviors would be fluidly coordinated with no need for communicating and negotiating the details of each individual comparison. Education and health care lack common product definitions because their outcomes are measured in fragmented, incommensurable metrics. But if we had consensus-based reference standard metrics for every major form of capital employed in the economy, we could develop reasonable expectations expressed in a common language for how much change should typically be obtained in fifth-grade mathematics or from a hip replacement.

As is well-known in the business world, innovation is highly dependent on standards. We cannot empower the front line with the authority to make changes when decisions have to be based on information that is unavailable or impossible to interpret. Most of the previous entries in this blog take up various aspects of this situation.

All of this demands a very different way of thinking about what’s possible in the realm of measurement. The issues are complex. They are usually presented in difficult mathematical terms within specialized research reports. But the biggest problem has to do with thinking laterally, with moving ideas out of the vertical hierarchies of the silos where they are trapped and into a new field we can dream in. And the first seeds to be planted in such a field are the ones that say the dream is worth dreaming. When we hear that message, we are already on the way not just to building this dream, but to creating a world in which everyone can dream and envision more specific possibilities for their lives, their families, their creativity.

Creative Commons License
LivingCapitalMetrics Blog by William P. Fisher, Jr., Ph.D. is licensed under a Creative Commons Attribution-Noncommercial-No Derivative Works 3.0 United States License.
Based on a work at livingcapitalmetrics.wordpress.com.
Permissions beyond the scope of this license may be available at http://www.livingcapitalmetrics.com.


Three demands of meaningful measurement

September 28, 2009

The core issue in measurement is meaningfulness. There are three major aspects of meaningfulness to take into account in measurement. These have to do with the constancy of the unit, interpreting the size of differences in measures, and evaluating the coherence of the units and differences.

First, raw scores (counts of right answers or other events, sums of ratings, or rankings) do not stand for anything that adds up the way they do (see previous blogs for more on this). Any given raw score unit can be 4-5 times larger than another, depending on where the scores fall in the range. Meaningful measurement demands a constant unit. Instrument scaling methods provide it.
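To see how much the size of a raw-score unit can vary, consider the familiar log-odds transformation of a percent-correct score. The sketch below is only an illustration (a 20-item test is assumed, and the simple log-odds conversion stands in for a full instrument calibration), but it shows that one additional right answer represents a much larger change near the extremes of the score range than in the middle.

```python
# Illustrative sketch: how much underlying change does "one more right answer"
# represent at different points in the raw-score range? A 20-item test is assumed,
# and the simple log-odds conversion stands in for a full instrument calibration.
import math

n_items = 20

def log_odds(raw_score, n_items):
    """Log-odds of the proportion correct: a rough stand-in for a scaled measure."""
    p = raw_score / n_items
    return math.log(p / (1 - p))

for r in [1, 10, 18]:
    step = log_odds(r + 1, n_items) - log_odds(r, n_items)
    print(f"raw score {r} -> {r + 1}: a change of {step:.2f} log-odds units")

# The same one-point raw gain is several times larger in log-odds terms near the
# extremes (1 -> 2, 18 -> 19) than in the middle of the range (10 -> 11), and the
# disparity grows as scores approach the floor or ceiling of the instrument.
```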

Second, meaningful measurement requires that we be able to say just what any quantitative amount of difference is supposed to represent. What does a difference between two measures stand for in the way of what is and isn’t done at those two levels? Is the difference within the range of error, and so random? Is the difference many times more than the error, and so repeatedly reproducible and constant? Meaningful measurement demands that we be able to make reliable distinctions.
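One common way of making that judgment, sketched below with purely hypothetical numbers, is to compare the difference between two measures to their standard errors: a difference within about two joint standard errors is indistinguishable from random variation, while a much larger difference marks a reliably reproducible distinction.

```python
# Hypothetical example: is the difference between two measures large relative to
# the measurement error, or is it within the range of random variation?
import math

def standardized_difference(measure_a, se_a, measure_b, se_b):
    """Difference between two measures, in units of their joint standard error."""
    joint_se = math.sqrt(se_a**2 + se_b**2)
    return (measure_b - measure_a) / joint_se

# Hypothetical measures and standard errors, in scaled (e.g., log-odds) units
print(standardized_difference(1.20, 0.25, 1.45, 0.25))  # about 0.7: within error
print(standardized_difference(1.20, 0.25, 2.60, 0.25))  # about 4.0: a reliable distinction
```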

Third, meaningful measurement demands that the items work together to measure the same thing. If reliable distinctions can be made between measures, what is the one thing that all of the items tap into? If the data exhibit a consistency that is shared across items and across persons, what is the nature of that consistency? Meaningful measurement posits a model of what data must look like to be interpretable and coherent, and then it evaluates data in light of that model.

When a constant unit is in hand, when the limits of randomness relative to stable differences are known, and when individual responses are consistent with one another, then, and only then, is measurement meaningful. Inconstant units, unknown amounts of random variation, and inconsistent data can never amount to the science we need for understanding and managing skills, abilities, health, motivations, social bonds, and environmental quality.

Managing our investments in human, social, and natural capital for positive returns demands that meaningful measurement be universalized in uniformly calibrated and accessible metrics. Scientifically rigorous, practical, and convenient methods for setting reference standards and making instruments traceable to them are readily available.

We have the means in hand for effecting order-of-magnitude improvements in the meaningfulness of the measures used in education, health care, human and environmental resource management, etc. It’s time we got to work on it.

We are what we measure. It’s time we measured what we want to be.

Creative Commons License
LivingCapitalMetrics Blog by William P. Fisher, Jr., Ph.D. is licensed under a Creative Commons Attribution-Noncommercial-No Derivative Works 3.0 United States License.
Based on a work at livingcapitalmetrics.wordpress.com.
Permissions beyond the scope of this license may be available at http://www.livingcapitalmetrics.com.

Contesting the Claim, Part II: Are Rasch Measures Really as Objective as Physical Measures?

July 22, 2009

When a raw score is sufficient to the task of measurement, the model is the Rasch model; the parameters can then be estimated consistently, and the fit of the data to the model can be evaluated. The invariance properties that follow from a sufficient statistic include virtually the entire class of invariant rules (Hall, Wijsman, & Ghosh, 1965; Arnold, 1985), and similar relationships with other key measurement properties follow from there (Fischer, 1981, 1995; Newby, Conner, Grant, & Bunderson, 2009; Wright, 1977, 1997).

What does this all actually mean? Imagine we were able to ask an infinite number of people an infinite number of questions that all work together to measure the same thing. Because (1) the scores are sufficient statistics, (2) the ruler is not affected by what is measured, (3) the parameters separate, and (4) the data fit the model, any subset of the questions asked would give the same measure. This means that any subscore for any person measured would be a function of any and all other subscores. When a sufficient statistic is a function of all other sufficient statistics, it is not only sufficient, it is necessary, and is referred to as a minimally sufficient statistic. Thus, if separable, independent model parameters can be estimated, the model must be the Rasch model, and the raw score is both sufficient and necessary (Andersen, 1977; Dynkin, 1951; van der Linden, 1992).
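A small simulation can convey the flavor of this subset-invariance claim, though it proves nothing by itself. In the sketch below, the item difficulties, the true ability, and the simple maximum-likelihood routine are all assumptions made for illustration; it simulates dichotomous Rasch responses and then estimates the same person’s measure twice, once from the odd-numbered items and once from the even-numbered items.

```python
# Illustrative simulation (not a proof): when data fit the dichotomous Rasch model,
# different subsets of items yield statistically equivalent person measures.
import numpy as np

rng = np.random.default_rng(0)

def rasch_prob(ability, difficulty):
    """Probability of a correct response under the dichotomous Rasch model."""
    return 1.0 / (1.0 + np.exp(-(ability - difficulty)))

def estimate_ability(responses, difficulties, n_iter=25):
    """Maximum-likelihood ability estimate via Newton-Raphson (illustrative only)."""
    theta = 0.0
    for _ in range(n_iter):
        p = rasch_prob(theta, difficulties)
        theta += np.sum(responses - p) / np.sum(p * (1 - p))
    return theta

# Assumed item difficulties and true ability, chosen only for illustration
difficulties = np.linspace(-2.0, 2.0, 60)
true_ability = 0.7
responses = (rng.random(60) < rasch_prob(true_ability, difficulties)).astype(float)

odd_estimate = estimate_ability(responses[0::2], difficulties[0::2])
even_estimate = estimate_ability(responses[1::2], difficulties[1::2])
print(f"measure from odd-numbered items:  {odd_estimate:.2f}")
print(f"measure from even-numbered items: {even_estimate:.2f}")
# The two subset estimates typically agree to within measurement error
# (roughly half a logit for 30 dichotomous items).
```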

This means that scores, ratings, and percentages actually stand for something measurable only when they fit a Rasch model.  After all, what actually would be the point of using data that do not support the estimation of independent parameters? If the meaning of the results is tied in unknown ways to the specific particulars of a given situation, then those results are meaningless, by definition (Roberts & Rosenbaum, 1986; Falmagne & Narens, 1983; Mundy, 1986; Narens, 2002; also see Embretson, 1996; Romanoski and Douglas, 2002). There would be no point in trying to learn anything from them, as whatever happened was a one-time unique event that tells us nothing we can use in any future event (Wright, 1977, 1997).

What we’ve done here is akin to taking a narrative stroll through a garden of mathematical proofs. These conceptual analyses can be very convincing, but actual demonstrations of them are essential. Demonstrations would be especially persuasive if there were some way of showing three things. First, shouldn’t there be some way of constructing ordinal ratings or scores for one or another physical variable that, when scaled, give us measures that are the same as the usual measures we are accustomed to?

This would show that we can use the type of instrument usually found in the social sciences to construct physical measures with the characteristics we expect. There are four available examples, in fact, involving paired comparisons of weights (Choi, 1998), measures of short lengths (Fisher, 1988), ratings of medium-range distances (Moulton, 1993), and a recovery of the density scale (Pelton & Bunderson, 2003). In each case, the Rasch-calibrated experimental instruments produced measures equivalent to the controls, as shown in linear plots of the pairs of measures.

A second way to build out from the mathematical proofs is to run experiments in which we check the purported stability of measures and calibrations. We can do this by splitting large data sets, using different groups of items to produce two or more measures for each person, or using different groups of respondents/examinees to provide data for two or more sets of item calibrations. This is a routine experimental procedure in many psychometric labs, and results tend to conform with theory, with strong associations found between increasing sample sizes and increasing reliability coefficients for the respective measures or calibrations. These associations can be plotted (Fisher, 2008), as can the pairs of calibrations estimated from different samples (Fisher, 1999), and the pairs of measures estimated from different instruments (Fisher, Harvey, Kilgore, et al., 1995; Smith & Taylor, 2004). The theoretical expectation of tighter plots for better designed instruments, larger sample sizes, and longer tests is confirmed so regularly that it should itself have the status of a law of nature (Linacre, 1993).
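The splitting experiment can be sketched in a few lines of code. The example below is only a toy version of the procedure (simulated Rasch data, a crude log-odds item calibration standing in for a full estimation routine, and arbitrary sample sizes), but it conveys the basic design: calibrate the same items from two independent halves of the respondents and see how tightly the two sets of calibrations agree.

```python
# Toy version of a calibration-stability experiment: calibrate the same items from
# two independent halves of a simulated sample and compare the two sets of results.
import numpy as np

rng = np.random.default_rng(1)
n_persons, n_items = 500, 25

abilities = rng.normal(0.0, 1.0, size=n_persons)        # assumed person distribution
difficulties = np.linspace(-1.5, 1.5, n_items)           # assumed item difficulties

# Simulate dichotomous Rasch responses
p = 1.0 / (1.0 + np.exp(-(abilities[:, None] - difficulties[None, :])))
data = (rng.random((n_persons, n_items)) < p).astype(float)

def rough_calibration(responses):
    """Centered log-odds of item failure: a crude stand-in for a Rasch calibration."""
    p_correct = responses.mean(axis=0)
    d = np.log((1 - p_correct) / p_correct)
    return d - d.mean()

first_half = rough_calibration(data[: n_persons // 2])
second_half = rough_calibration(data[n_persons // 2 :])

r = np.corrcoef(first_half, second_half)[0, 1]
print(f"correlation between the two sets of item calibrations: {r:.3f}")
# With well-targeted items and 250 respondents per half, the two calibrations
# typically correlate above 0.95, as the stability argument predicts.
```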

A third convincing demonstration is to compare studies of the same thing conducted at different times and places by different researchers using different instruments on different samples. If the instruments really measure the same thing, there will not only be obvious similarities in their item contents, but similar items will calibrate in similar positions on the metric across samples. Results of this kind have been obtained in at least three published studies (Fisher, 1997a, 1997b; Beltyukova, Stone, & Fox, 2004).

All of these arguments are spelled out in greater length and detail, with illustrations, in a forthcoming article (Fisher, 2009). I learned all of this from Benjamin Wright, who worked directly with Rasch himself, and who, perhaps more importantly, was prepared for what he could learn from Rasch in his previous career as a physicist. Before encountering Rasch in 1960, Wright had worked with Feynman at Cornell, Townes at Bell Labs, and Mulliken at the University of Chicago. Taught and influenced not just by three of the great minds of twentieth-century physics, but also by Townes’ philosophical perspectives on meaning and beauty, Wright had left physics in search of life. He was happy to transfer his experience with computers into his new field of educational research, but he was dissatisfied with the quality of the data and how it was treated.

Rasch’s ideas gave Wright the conceptual tools he needed to integrate his scientific values with the demands of the field he was in. Over the course of his 40-year career in measurement, Wright wrote the first software for estimating Rasch model parameters and continuously improved it; he adapted new estimation algorithms for Rasch’s models and was involved in the articulation of new models; he applied the models to hundreds of data sets using his software; he vigorously invested himself in students and colleagues; he founded new professional societies, meetings, and journals;  and he never stopped learning how to think anew about measurement and the meaning of numbers. Through it all, there was always a yardstick handy as a simple way of conveying the basic requirements of measurement as we intuitively understand it in physical terms.

Those of us who spend a lot of time working with these ideas and trying them out on lots of different kinds of data forget or never realize how skewed our experience is relative to everyone else’s. I guess you live in a different world when you have the sustained luxury of working with very large databases, as I have had, and you see the constancy and stability of well-designed measures and calibrations over time, across instruments, and over repeated samples ranging from 30 to several million.

When you have that experience, it becomes a basic description of reasonable expectation to read the work of a colleague and see him say that “when the key features of a statistical model relevant to the analysis of social science data are the same as those of the laws of physics, then those features are difficult to ignore” (Andrich, 1988, p. 22). After calibrating dozens of instruments over 25 years, some of them many times over, it just seems like the plainest statement of the obvious to see the same guy say “Our measurement principles should be the same for properties of rocks as for the properties of people. What we say has to be consistent with physical measurement” (Andrich, 1998, p. 3).

And I find myself wishing more people held the opinion expressed by two other colleagues, that “scientific measures in the social sciences must hold to the same standards as do measures in the physical sciences if they are going to lead to the same quality of generalizations” (Bond & Fox, 2001, p. 2). When these sentiments are taken to their logical conclusion in a practical application, the real value of “attempting for reading comprehension what Newtonian mechanics achieved for astronomy” (Burdick & Stenner, 1996) becomes apparent. Rasch’s analogy of the structure of his model for reading tests and Newton’s Second Law can be restated relative to any physical law expressed as universal conditionals among variable triplets; a theory of the variable measured capable of predicting item calibrations provides the causal story for the observed variation (Burdick, Stone, & Stenner, 2006; DeBoeck & Wilson, 2004).

Knowing what I know, from the mathematical principles I’ve been trained in and from the extensive experimental work I’ve done, it seems amazing that so little attention is actually paid to tools and concepts that receive daily lip service as to their central importance in every facet of life, from health care to education to economics to business. Measurement technology rose up decades ago in preparation for the demands of today’s challenges. It is just plain weird the way we’re not using it at anything anywhere near its potential.

I’m convinced, though, that the problem is not a matter of persuasive rhetoric applied to the minds of the right people. Rather, someone, hopefully me, has got to configure the right combination of players in the right situation at the right time and at the right place to create a new form of real value that can’t be created any other way. Like they say, money talks. Persuasion is all well and good, but things will really take off only when people see that better measurement can aid in removing inefficiencies from the management of human, social, and natural capital, that better measurement is essential to creating sustainable and socially responsible policies and practices, and that better measurement means new sources of profitability.  I’m convinced that advanced measurement techniques are really nothing more than a new form of IT or communications technology. They will fit right into the existing networks and multiply their efficiencies many times over.

And when they do, we may be in a position to finally

“confront the remarkable fact that throughout the gigantic range of physical knowledge numerical laws assume a remarkably simple form provided fundamental measurement has taken place. Although the authors cannot explain this fact to their own satisfaction, the extension to behavioral science is obvious: we may have to await fundamental measurement before we will see any real progress in quantitative laws of behavior. In short, ordinal scales (even continuous ordinal scales) are perhaps not good enough and it may not be possible to live forever with a dozen different procedures for quantifying the same piece of behavior, each making strong but untestable and basically unlikely assumptions which result in nonlinear plots of one scale against another. Progress in physics would have been impossibly difficult without fundamental measurement and the reader who believes that all that is at stake in the axiomatic treatment of measurement is a possible criterion for canonizing one scaling procedure at the expense of others is missing the point” (Ramsay, Bloxom, and Cramer, 1975, p. 262).

Creative Commons License
LivingCapitalMetrics Blog by William P. Fisher, Jr., Ph.D. is licensed under a Creative Commons Attribution-Noncommercial-No Derivative Works 3.0 United States License.
Based on a work at livingcapitalmetrics.wordpress.com.
Permissions beyond the scope of this license may be available at http://www.livingcapitalmetrics.com.

A Tale of Two Industries: Contrasting Quality Assessment and Improvement Frameworks

July 8, 2009

Imagine the chaos that would result if industrial engineers each had their own tool sets calibrated in idiosyncratic metrics with unit sizes that changed depending on the size of what they measured, and they conducted quality improvement studies focusing on statistical significance tests of effect sizes. Furthermore, imagine that these engineers ignore the statistical power of their designs, so they don’t know when they are finding statistically significant results by pure chance and when they are not. And finally, they also ignore the substantive meaning of the numbers, never considering the differences they’re studying in terms of varying probabilities of response to the questions they ask.

So when one engineer tries to generalize a result across applications, what happens is that it kind of works sometimes, doesn’t work at all other times, is often ignored, and does not command a compelling response from anyone because they are invested in their own metrics, samples, and results, which are different from everyone else’s. If there is any discussion of the relative merits of the research done, it is easy to fall into acrimonious and heated arguments that cannot be resolved because of the lack of consensus on what constitutes valid data, instrumentation, and theory.

Thus, the engineers put up the appearance of polite decorum. They smile and nod at each other’s local, sample-dependent, and irreproducible results, while they build mini-empires of funding, students, quoting circles, and professional associations on the basis of their personal authority and charisma. As they do so, costs in their industry spiral out of control, profits are almost nonexistent, fewer and fewer people can afford their products, smart people are going into other fields, and overall product quality is declining.

Of course, this is the state of affairs in education and health care, not in industrial engineering. In the latter field, the situation is much different. Here, everyone everywhere is very concerned to be sure they are always measuring the same thing as everyone else and in the same unit. Unexpected results of individual measures pop out instantly and are immediately redone. Innovations are more easily generated and disseminated because everyone is thinking together in the same language and seeing effects expressed in the same images. Someone else’s ideas and results can be easily fitted into anyone else’s experience, and the viability of a new way of doing things can be evaluated on the basis of one’s own experience and skills.

Arguments can be quite productive, as consensus on basic values drives the demand for evidence. Associations and successes are defined more in terms of merit earned from productivity and creativity demonstrated through the accumulation of generalized results. Costs in these industries are constantly dropping, profits are steady or increasing, more and more people can afford their products, smart people are coming into the field, and overall product quality is improving.

There is absolutely no reason why education and health care cannot thrive and grow like other industries. It is up to us to show how.

Creative Commons License
LivingCapitalMetrics Blog by William P. Fisher, Jr., Ph.D. is licensed under a Creative Commons Attribution-Noncommercial-No Derivative Works 3.0 United States License.
Based on a work at livingcapitalmetrics.wordpress.com.
Permissions beyond the scope of this license may be available at http://www.livingcapitalmetrics.com.

Publications Documenting Score, Rating, Percentage Contrasts with Real Measures

July 7, 2009

A few brief and easy introductions to the contrast between scores, ratings, and percentages vs measures include:

Linacre, J. M. (1992, Autumn). Why fuss about statistical sufficiency? Rasch Measurement Transactions, 6(3), 230 [http://www.rasch.org/rmt/rmt63c.htm].

Linacre, J. M. (1994, Summer). Likert or Rasch? Rasch Measurement Transactions, 8(2), 356 [http://www.rasch.org/rmt/rmt82d.htm].

Wright, B. D. (1992, Summer). Scores are not measures. Rasch Measurement Transactions, 6(1), 208 [http://www.rasch.org/rmt/rmt61n.htm].

Wright, B. D. (1989). Rasch model from counting right answers: Raw scores as sufficient statistics. Rasch Measurement Transactions, 3(2), 62 [http://www.rasch.org/rmt/rmt32e.htm].

Wright, B. D. (1993). Thinking with raw scores. Rasch Measurement Transactions, 7(2), 299-300 [http://www.rasch.org/rmt/rmt72r.htm].

Wright, B. D. (1999). Common sense for measurement. Rasch Measurement Transactions, 13(3), 704-5  [http://www.rasch.org/rmt/rmt133h.htm].

Longer and more technical comparisons include:

Andrich, D. (1989). Distinctions between assumptions and requirements in measurement in the social sciences. In J. A. Keats, R. Taft, R. A. Heath & S. H. Lovibond (Eds.), Mathematical and Theoretical Systems: Proceedings of the 24th International Congress of Psychology of the International Union of Psychological Science, Vol. 4 (pp. 7-16). North-Holland: Elsevier Science Publishers.

van Alphen, A., Halfens, R., Hasman, A., & Imbos, T. (1994). Likert or Rasch? Nothing is more applicable than good theory. Journal of Advanced Nursing, 20, 196-201.

Wright, B. D., & Linacre, J. M. (1989). Observations are always ordinal; measurements, however, must be interval. Archives of Physical Medicine and Rehabilitation, 70(12), 857-867 [http://www.rasch.org/memo44.htm].

Zhu, W. (1996). Should total scores from a rating scale be used directly? Research Quarterly for Exercise and Sport, 67(3), 363-372.

The following lists provide some key resources. The lists are intended to be representative, not comprehensive.  There are many works in addition to these that document the claims in yesterday’s table. Many of these books and articles are highly technical.  Good introductions can be found in Bezruczko (2005), Bond and Fox (2007), Smith and Smith (2004), Wilson (2005), Wright and Stone (1979), Wright and Masters (1982), Wright and Linacre (1989), and elsewhere. The www.rasch.org web site has comprehensive and current information on seminars, consultants, software, full text articles, professional association meetings, etc.

Books and Journal Issues

Andrich, D. (1988). Rasch models for measurement. Sage University Paper Series on Quantitative Applications in the Social Sciences, vol. series no. 07-068. Beverly Hills, California: Sage Publications.

Andrich, D., & Douglas, G. A. (Eds.). (1982). Rasch models for measurement in educational and psychological research [Special issue]. Education Research and Perspectives, 9(1), 5-118. [Full text available at www.rasch.org.]

Bezruczko, N. (Ed.). (2005). Rasch measurement in health sciences. Maple Grove, MN: JAM Press.

Bond, T., & Fox, C. (2007). Applying the Rasch model: Fundamental measurement in the human sciences, 2d edition. Mahwah, New Jersey: Lawrence Erlbaum Associates.

Choppin, B. (1985). In Memoriam: Bruce Choppin (T. N. Postlethwaite ed.) [Special issue]. Evaluation in Education: An International Review Series, 9(1).

DeBoeck, P., & Wilson, M. (Eds.). (2004). Explanatory item response models: A generalized linear and nonlinear approach (Statistics for Social and Behavioral Sciences). New York: Springer-Verlag.

Embretson, S. E., & Hershberger, S. L. (Eds.). (1999). The new rules of measurement: What every psychologist and educator should know. Hillsdale, New Jersey: Lawrence Erlbaum Associates.

Engelhard, G., Jr., & Wilson, M. (1996). Objective measurement: Theory into practice, Vol. 3. Norwood, New Jersey: Ablex.

Fischer, G. H., & Molenaar, I. (1995). Rasch models: Foundations, recent developments, and applications. New York: Springer-Verlag.

Fisher, W. P., Jr., & Wright, B. D. (Eds.). (1994). Applications of Probabilistic Conjoint Measurement [Special Issue]. International Journal of Educational Research, 21(6), 557-664.

Garner, M., Draney, K., Wilson, M., Engelhard, G., Jr., & Fisher, W. P., Jr. (Eds.). (2009). Advances in Rasch measurement, Vol. One. Maple Grove, MN: JAM Press.

Granger, C. V., & Gresham, G. E. (Eds). (1993, August). New Developments in Functional Assessment [Special Issue]. Physical Medicine and Rehabilitation Clinics of North America, 4(3), 417-611.

Linacre, J. M. (1989). Many-facet Rasch measurement. Chicago, Illinois: MESA Press.

Liu, X., & Boone, W. (2006). Applications of Rasch measurement in science education. Maple Grove, MN: JAM Press.

Masters, G. N. (2007). Special issue: Programme for International Student Assessment (PISA). Journal of Applied Measurement, 8(3), 235-335.

Masters, G. N., & Keeves, J. P. (Eds.). (1999). Advances in measurement in educational research and assessment. New York: Pergamon.

Osborne, J. W. (Ed.). (2007). Best practices in quantitative methods. Thousand Oaks, CA: Sage.

Rasch, G. (1960). Probabilistic models for some intelligence and attainment tests (Reprint, with Foreword and Afterword by B. D. Wright, Chicago: University of Chicago Press, 1980). Copenhagen, Denmark: Danmarks Paedogogiske Institut.

Smith, E. V., Jr., & Smith, R. M. (Eds.) (2004). Introduction to Rasch measurement. Maple Grove, MN: JAM Press.

Smith, E. V., Jr., & Smith, R. M. (2007). Rasch measurement: Advanced and specialized applications. Maple Grove, MN: JAM Press.

Smith, R. M. (Ed.). (1997, June). Outcome Measurement [Special Issue]. Physical Medicine & Rehabilitation State of the Art Reviews, 11(2), 261-428.

Smith, R. M. (1999). Rasch measurement models. Maple Grove, MN: JAM Press.

von Davier, M. (2006). Multivariate and mixture distribution Rasch models. New York: Springer.

Wilson, M. (1992). Objective measurement: Theory into practice, Vol. 1. Norwood, New Jersey: Ablex.

Wilson, M. (1994). Objective measurement: Theory into practice, Vol. 2. Norwood, New Jersey: Ablex.

Wilson, M. (2005). Constructing measures: An item response modeling approach. Mahwah, New Jersey: Lawrence Erlbaum Associates.

Wilson, M., Draney, K., Brown, N., & Duckor, B. (Eds.). (2009). Advances in Rasch measurement, Vol. Two (in press). Maple Grove, MN: JAM Press.

Wilson, M., & Engelhard, G. (2000). Objective measurement: Theory into practice, Vol. 5. Westport, Connecticut: Ablex Publishing.

Wilson, M., Engelhard, G., & Draney, K. (Eds.). (1997). Objective measurement: Theory into practice, Vol. 4. Norwood, New Jersey: Ablex.

Wright, B. D., & Masters, G. N. (1982). Rating scale analysis: Rasch measurement. Chicago, Illinois: MESA Press.

Wright, B. D., & Stone, M. H. (1979). Best test design: Rasch measurement. Chicago, Illinois: MESA Press.

Wright, B. D., & Stone, M. H. (1999). Measurement essentials. Wilmington, DE: Wide Range, Inc. [http://www.rasch.org/memos.htm#measess].

Key Articles

Andersen, E. B. (1977). Sufficient statistics and latent trait models. Psychometrika, 42(1), 69-81.

Andrich, D. (1978). A rating formulation for ordered response categories. Psychometrika, 43, 561-73.

Andrich, D. (2002). Understanding resistance to the data-model relationship in Rasch’s paradigm: A reflection for the next generation. Journal of Applied Measurement, 3(3), 325-59.

Andrich, D. (2004, January). Controversy and the Rasch model: A characteristic of incompatible paradigms? Medical Care, 42(1), I-7–I-16.

Beltyukova, S. A., Stone, G. E., & Fox, C. M. (2008). Magnitude estimation and categorical rating scaling in social sciences: A theoretical and psychometric controversy. Journal of Applied Measurement, 9(2), 151-159.

Choppin, B. (1968). An item bank using sample-free calibration. Nature, 219, 870-872.

Embretson, S. E. (1996, September). Item Response Theory models and spurious interaction effects in factorial ANOVA designs. Applied Psychological Measurement, 20(3), 201-212.

Engelhard, G. (2008, July). Historical perspectives on invariant measurement: Guttman, Rasch, and Mokken. Measurement: Interdisciplinary Research & Perspectives, 6(3), 155-189.

Fischer, G. H. (1973). The linear logistic test model as an instrument in educational research. Acta Psychologica, 37, 359-374.

Fischer, G. H. (1981, March). On the existence and uniqueness of maximum-likelihood estimates in the Rasch model. Psychometrika, 46(1), 59-77.

Fischer, G. H. (1989). Applying the principles of specific objectivity and of generalizability to the measurement of change. Psychometrika, 52(4), 565-587.

Fisher, W. P., Jr. (1997). Physical disability construct convergence across instruments: Towards a universal metric. Journal of Outcome Measurement, 1(2), 87-113.

Fisher, W. P., Jr. (2004, October). Meaning and method in the social sciences. Human Studies: A Journal for Philosophy and the Social Sciences, 27(4), 429-54.

Fisher, W. P., Jr. (2009, July). Invariance and traceability for measures of human, social, and natural capital: Theory and application. Measurement (Elsevier), in press.

Grosse, M. E., & Wright, B. D. (1986, Sep). Setting, evaluating, and maintaining certification standards with the Rasch model. Evaluation & the Health Professions, 9(3), 267-285.

Hall, W. J., Wijsman, R. A., & Ghosh, J. K. (1965). The relationship between sufficiency and invariance with applications in sequential analysis. Annals of Mathematical Statistics, 36, 575-614.

Kamata, A. (2001, March). Item analysis by the Hierarchical Generalized Linear Model. Journal of Educational Measurement, 38(1), 79-93.

Karabatsos, G., & Ullrich, J. R. (2002). Enumerating and testing conjoint measurement models. Mathematical Social Sciences, 43, 487-505.

Linacre, J. M. (1997). Instantaneous measurement and diagnosis. Physical Medicine and Rehabilitation State of the Art Reviews, 11(2), 315-324.

Linacre, J. M. (2002). Optimizing rating scale category effectiveness. Journal of Applied Measurement, 3(1), 85-106.

Lunz, M. E., & Bergstrom, B. A. (1991). Comparability of decision for computer adaptive and written examinations. Journal of Allied Health, 20(1), 15-23.

Lunz, M. E., Wright, B. D., & Linacre, J. M. (1990). Measuring the impact of judge severity on examination scores. Applied Measurement in Education, 3/4, 331-345.

Masters, G. N. (1985, March). Common-person equating with the Rasch model. Applied Psychological Measurement, 9(1), 73-82.

Mislevy, R. J., Steinberg, L. S., & Almond, R. G. (2003). On the structure of educational assessments. Measurement: Interdisciplinary Research and Perspectives, 1(1), 3-62.

Pelton, T., & Bunderson, V. (2003). The recovery of the density scale using a stochastic quasi-realization of additive conjoint measurement. Journal of Applied Measurement, 4(3), 269-81.

Rasch, G. (1961). On general laws and the meaning of measurement in psychology. In Proceedings of the fourth Berkeley symposium on mathematical statistics and probability (pp. 321-333 [http://www.rasch.org/memo1960.pdf]). Berkeley, California: University of California Press.

Rasch, G. (1966). An individualistic approach to item analysis. In P. F. Lazarsfeld & N. W. Henry (Eds.), Readings in mathematical social science (pp. 89-108). Chicago, Illinois: Science Research Associates.

Rasch, G. (1966, July). An informal report on the present state of a theory of objectivity in comparisons. Unpublished paper [http://www.rasch.org/memo1966.pdf].

Rasch, G. (1966). An item analysis which takes individual differences into account. British Journal of Mathematical and Statistical Psychology, 19, 49-57.

Rasch, G. (1968, September 6). A mathematical theory of objectivity and its consequences for model construction. [Unpublished paper [http://www.rasch.org/memo1968.pdf]], Amsterdam, the Netherlands: Institute of Mathematical Statistics, European Branch.

Rasch, G. (1977). On specific objectivity: An attempt at formalizing the request for generality and validity of scientific statements. Danish Yearbook of Philosophy, 14, 58-94.

Romanoski, J. T., & Douglas, G. (2002). Rasch-transformed raw scores and two-way ANOVA: A simulation analysis. Journal of Applied Measurement, 3(4), 421-430.

Smith, R. M. (1996). A comparison of methods for determining dimensionality in Rasch measurement. Structural Equation Modeling, 3(1), 25-40.

Smith, R. M. (2000). Fit analysis in latent trait measurement models. Journal of Applied Measurement, 1(2), 199-218.

Stenner, A. J., & Smith III, M. (1982). Testing construct theories. Perceptual and Motor Skills, 55, 415-426.

Stenner, A. J. (1994). Specific objectivity – local and general. Rasch Measurement Transactions, 8(3), 374 [http://www.rasch.org/rmt/rmt83e.htm].

Stone, G. E., Beltyukova, S. A., & Fox, C. M. (2008). Objective standard setting for judge-mediated examinations. International Journal of Testing, 8(2), 180-196.

Stone, M. H. (2003). Substantive scale construction. Journal of Applied Measurement, 4(3), 282-97.

Wilson, M., & Sloane, K. (2000). From principles to practice: An embedded assessment system. Applied Measurement in Education, 13(2), 181-208.

Wright, B. D. (1968). Sample-free test calibration and person measurement. In Proceedings of the 1967 invitational conference on testing problems (pp. 85-101 [http://www.rasch.org/memo1.htm]). Princeton, New Jersey: Educational Testing Service.

Wright, B. D. (1977). Solving measurement problems with the Rasch model. Journal of Educational Measurement, 14(2), 97-116 [http://www.rasch.org/memo42.htm].

Wright, B. D. (1980). Foreword, Afterword. In Probabilistic models for some intelligence and attainment tests, by Georg Rasch (pp. ix-xix, 185-199. http://www.rasch.org/memo63.htm). Chicago, Illinois: University of Chicago Press.

Wright, B. D. (1984). Despair and hope for educational measurement. Contemporary Education Review, 3(1), 281-288 [http://www.rasch.org/memo41.htm].

Wright, B. D. (1985). Additivity in psychological measurement. In E. Roskam (Ed.), Measurement and personality assessment. North Holland: Elsevier Science Ltd.

Wright, B. D. (1996). Comparing Rasch measurement and factor analysis. Structural Equation Modeling, 3(1), 3-24.

Wright, B. D. (1997, June). Fundamental measurement for outcome evaluation. Physical Medicine & Rehabilitation State of the Art Reviews, 11(2), 261-88.

Wright, B. D. (1997, Winter). A history of social science measurement. Educational Measurement: Issues and Practice, 16(4), 33-45, 52 [http://www.rasch.org/memo62.htm].

Wright, B. D. (1999). Fundamental measurement for psychology. In S. E. Embretson & S. L. Hershberger (Eds.), The new rules of measurement: What every educator and psychologist should know (pp. 65-104 [http://www.rasch.org/memo64.htm]). Hillsdale, New Jersey: Lawrence Erlbaum Associates.

Wright, B. D., & Bell, S. R. (1984, Winter). Item banks: What, why, how. Journal of Educational Measurement, 21(4), 331-345 [http://www.rasch.org/memo43.htm].

Wright, B. D., & Linacre, J. M. (1989). Observations are always ordinal; measurements, however, must be interval. Archives of Physical Medicine and Rehabilitation, 70(12), 857-867 [http://www.rasch.org/memo44.htm].

Wright, B. D., & Mok, M. (2000). Understanding Rasch measurement: Rasch models overview. Journal of Applied Measurement, 1(1), 83-106.

Model Applications

Adams, R. J., Wu, M. L., & Macaskill, G. (1997). Scaling methodology and procedures for the mathematics and science scales. In M. O. Martin & D. L. Kelly (Eds.), Third International Mathematics and Science Study Technical Report: Vol. 2: Implementation and Analysis – Primary and Middle School Years. Boston: Center for the Study of Testing, Evaluation, and Educational Policy.

Andrich, D., & Van Schoubroeck, L. (1989, May). The General Health Questionnaire: A psychometric analysis using latent trait theory. Psychological Medicine, 19(2), 469-485.

Beltyukova, S. A., Stone, G. E., & Fox, C. M. (2004). Equating student satisfaction measures. Journal of Applied Measurement, 5(1), 62-9.

Bergstrom, B. A., & Lunz, M. E. (1999). CAT for certification and licensure. In F. Drasgow & J. B. Olson-Buchanan (Eds.), Innovations in computerized assessment (pp. 67-91). Mahwah, New Jersey: Lawrence Erlbaum Associates, Inc., Publishers.

Bond, T. G. (1994). Piaget and measurement II: Empirical validation of the Piagetian model. Archives de Psychologie, 63, 155-185.

Bunderson, C. V., & Newby, V. A. (2009). The relationships among design experiments, invariant measurement scales, and domain theories. Journal of Applied Measurement, 10(2), 117-137.

Cavanagh, R. F., & Romanoski, J. T. (2006, October). Rating scale instruments and measurement. Learning Environments Research, 9(3), 273-289.

Cipriani, D., Fox, C., Khuder, S., & Boudreau, N. (2005). Comparing Rasch analyses probability estimates to sensitivity, specificity and likelihood ratios when examining the utility of medical diagnostic tests. Journal of Applied Measurement, 6(2), 180-201.

Dawson, T. L. (2004, April). Assessing intellectual development: Three approaches, one sequence. Journal of Adult Development, 11(2), 71-85.

DeSalvo, K., Fisher, W. P. Jr., Tran, K., Bloser, N., Merrill, W., & Peabody, J. W. (2006, March). Assessing measurement properties of two single-item general health measures. Quality of Life Research, 15(2), 191-201.

Engelhard, G., Jr. (1992). The measurement of writing ability with a many-faceted Rasch model. Applied Measurement in Education, 5(3), 171-191.

Engelhard, G., Jr. (1997). Constructing rater and task banks for performance assessment. Journal of Outcome Measurement, 1(1), 19-33.

Fisher, W. P., Jr. (1998). A research program for accountable and patient-centered health status measures. Journal of Outcome Measurement, 2(3), 222-239.

Fisher, W. P., Jr., Harvey, R. F., Taylor, P., Kilgore, K. M., & Kelly, C. K. (1995, February). Rehabits: A common language of functional assessment. Archives of Physical Medicine and Rehabilitation, 76(2), 113-122.

Heinemann, A. W., Gershon, R., & Fisher, W. P., Jr. (2006). Development and application of the Orthotics and Prosthetics User Survey: Applications and opportunities for health care quality improvement. Journal of Prosthetics and Orthotics, 18(1), 80-85 [http://www.oandp.org/jpo/library/2006_01S_080.asp].

Heinemann, A. W., Linacre, J. M., Wright, B. D., Hamilton, B. B., & Granger, C. V. (1994). Prediction of rehabilitation outcomes with disability measures. Archives of Physical Medicine and Rehabilitation, 75(2), 133-143.

Hobart, J. C., Cano, S. J., O’Connor, R. J., Kinos, S., Heinzlef, O., Roullet, E., et al. (2003). Multiple Sclerosis Impact Scale-29 (MSIS-29): Measurement stability across eight European countries. Multiple Sclerosis, 9, S23.

Hobart, J. C., Cano, S. J., Zajicek, J. P., & Thompson, A. J. (2007, December). Rating scales as outcome measures for clinical trials in neurology: Problems, solutions, and recommendations. Lancet Neurology, 6, 1094-1105.

Lai, J., Fisher, A., Magalhaes, L., & Bundy, A. C. (1996). Construct validity of the sensory integration and praxis tests. Occupational Therapy Journal of Research, 16(2), 75-97.

Lee, N. P., & Fisher, W. P., Jr. (2005). Evaluation of the Diabetes Self Care Scale. Journal of Applied Measurement, 6(4), 366-81.

Ludlow, L. H., & Haley, S. M. (1995, December). Rasch model logits: Interpretation, use, and transformation. Educational and Psychological Measurement, 55(6), 967-975.

Markward, N. J., & Fisher, W. P., Jr. (2004). Calibrating the genome. Journal of Applied Measurement, 5(2), 129-41.

Massof, R. W. (2007, August). An interval-scaled scoring algorithm for visual function questionnaires. Optometry & Vision Science, 84(8), E690-E705.

Massof, R. W. (2008, July-August). Editorial: Moving toward scientific measurements of quality of life. Ophthalmic Epidemiology, 15, 209-211.

Masters, G. N., Adams, R. J., & Lokan, J. (1994). Mapping student achievement. International Journal of Educational Research, 21(6), 595-610.

Mead, R. J. (2009). The ISR: Intelligent Student Reports. Journal of Applied Measurement, 10(2), 208-224.

Pelton, T., & Bunderson, V. (2003). The recovery of the density scale using a stochastic quasi-realization of additive conjoint measurement. Journal of Applied Measurement, 4(3), 269-81.

Smith, E. V., Jr. (2000). Metric development and score reporting in Rasch measurement. Journal of Applied Measurement, 1(3), 303-26.

Smith, R. M., & Taylor, P. (2004). Equating rehabilitation outcome scales: Developing common metrics. Journal of Applied Measurement, 5(3), 229-42.

Solloway, S., & Fisher, W. P., Jr. (2007). Mindfulness in measurement: Reconsidering the measurable in mindfulness. International Journal of Transpersonal Studies, 26, 58-81 [http://www.transpersonalstudies.org/volume_26_2007.html].

Stenner, A. J. (2001). The Lexile Framework: A common metric for matching readers and texts. California School Library Journal, 25(1), 41-2.

Wendt, A., & Tatum, D. S. (2005). Credentialing health care professionals. In N. Bezruczko (Ed.), Rasch measurement in health sciences (pp. 161-75). Maple Grove, MN: JAM Press.

Wolfe, E. W., Ray, L. M., & Harris, D. C. (2004, October). A Rasch analysis of three measures of teacher perception generated from the School and Staffing Survey. Educational and Psychological Measurement, 64(5), 842-860.

Wolfe, F., Hawley, D., Goldenberg, D., Russell, I., Buskila, D., & Neumann, L. (2000, Aug). The assessment of functional impairment in fibromyalgia (FM): Rasch analyses of 5 functional scales and the development of the FM Health Assessment Questionnaire. Journal of Rheumatology, 27(8), 1989-99.

Creative Commons License
LivingCapitalMetrics Blog by William P. Fisher, Jr., Ph.D. is licensed under a Creative Commons Attribution-Noncommercial-No Derivative Works 3.0 United States License.
Based on a work at livingcapitalmetrics.wordpress.com.
Permissions beyond the scope of this license may be available at http://www.livingcapitalmetrics.com.

Graphic Illustrations of Why Scores, Ratings, and Percentages Are Not Measures, Part Two

July 2, 2009

Part One of this two-part blog offered pictures illustrating the difference between numbers that stand for something that adds up and those that do not. The uncontrolled variation in the numbers that pass for measures in health care, education, satisfaction surveys, performance assessments, etc. is analogous to the variation in weights and measures found in Medieval European markets. It is well established that metric uniformity played a vital role in the industrial and scientific revolutions of the nineteenth century. Metrology will inevitably play a similarly central role in the economic and scientific revolutions taking place today.

Clients and students often express their need for measures that are manageable, understandable, and relevant. But sometimes it turns out that we do not understand what we think we understand. New understandings can make what previously seemed manageable and relevant appear unmanageable and irrelevant. Perhaps our misunderstandings about measurement will one day explain why we have failed to innovate and improve as much as we could have.

Of course, there are statistical methods for standardizing scores and proportions that make them comparable across different normal distributions, but I’ve never once seen them applied to employee, customer, or patient survey results reported to business or hospital managers. They certainly are not used in determining comparable proficiency levels of students under No Child Left Behind. Perhaps there are consultants and reporting systems that make standardized z-scores a routine part of their practices, but even if they are, why should anyone willingly base their decisions on the assumption that normal distributions have been obtained? Why not use methods that give the same result no matter how scores are distributed?

To bring the point home, if statistical standardization is a form of measurement, why don’t we use the z-scores for height distributions instead of the direct measures of how tall we each are? Plainly, the two kinds of numbers have different applications. Somehow, though, we try to make do without the measures in many applications involving tests and surveys, with the unfortunate consequence of much lost information and many lost opportunities for better communication.

Sometimes I wonder: if we gave a test on the meaning of the scores, percentages, and logits discussed in Part One to managers, executives, and entrepreneurs, would many do any better on the parts they think they understand than on the parts they find unfamiliar? I suspect not. Some executives whose pay-for-performance bonuses are inflated by statistical accidents are going to be unhappy with what I’m going to say here, but, as I’ve been saying for years, clarifying financial implications will go a long way toward motivating the needed changes.

How could that be true? Well, consider the way we treat percentages. Imagine that three different hospitals see their patients’ percents agreement with a key survey item change as follows. Which one changed the most?

 

A. from 30.85% to 50.00%: a 19.15% change

B. from 6.68% to 15.87%: a 9.18% change

C. from 69.15% to 84.13%: a 14.99% change

As is illustrated in Figure 1 below, given that all three pairs of administrations of the survey are included together in the same measure distribution, it is likely that the three changes were all the same size.

In this scenario, all the survey administrations shared the same standard deviation in the underlying measure distribution that the key item’s percentage was drawn from, and they started from different initial measures. Different ranges in the measures are associated with different parts of the sample’s distribution, and so different numbers and percentages of patients are associated with the same amount of measured change. It is easy to see that 100-unit measured gains in the range of 50-150 or 1000-1100 on the horizontal axis would scarcely amount to 1% changes, but the same measured gain in the middle of the distribution could be as much as 25%.

Figure 1. Different Percentages, Same Measures
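The arithmetic behind Figure 1 is easy to reproduce. The sketch below assumes, as in the scenario above, that the measures follow a common standard normal distribution; each hospital’s measure increases by exactly half a standard deviation, yet the resulting percentage changes range from about 9 to about 19 points depending on where the change starts.

```python
# Reproducing the Figure 1 scenario: the same half-standard-deviation measured gain
# produces very different percentage changes, depending on where in the (assumed
# standard normal) distribution the change starts.
from scipy.stats import norm

starting_measures = {"A": -0.5, "B": -1.5, "C": 0.5}   # in standard-deviation units
gain = 0.5                                              # identical measured gain for all three

for hospital, start in starting_measures.items():
    pct_before = norm.cdf(start) * 100
    pct_after = norm.cdf(start + gain) * 100
    print(f"{hospital}: {pct_before:5.2f}% -> {pct_after:5.2f}%  "
          f"(a {pct_after - pct_before:5.2f}-point change)")

# Output matches the example: A changes by 19.15 points, B by 9.18, and C by 14.99,
# even though all three measured gains are exactly the same size.
```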

Figure 1 shows how the same measured gain can look wildly different when expressed as a percentage, depending on where the initial measure is positioned in the distribution. But what happens when percentage gains are situated in different distributions that have different patterns of variation?

More specifically, consider a situation in which three different hospitals see their percents agreement with a key survey item change as follows.

A. from 30.85% to 50.00%: a 19.15% change

B. from 30.85% to 50.00%: a 19.15% change

C. from 30.85% to 50.00%: a 19.15% change

Did one change more than the others? Of course, the three percentages are all the same, so we would naturally think that the three increases are all the same. But what if the standard deviations characterizing the three different hospitals’ score distributions are different?

Figure 2, below, shows that the three 19.15% changes could be associated with quite different measured gains. When the distribution is wider and the standard deviation is larger, any given percentage change will be associated with a larger measured change than in cases with narrower distributions and smaller standard deviations.

Figure 2. Same Percentage Gains, Different Measured Gains
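The converse calculation is just as simple. Assuming, as above, that each hospital’s measures are normally distributed, the same 19.15-point percentage change implies a measured gain that is directly proportional to the width of the distribution; the sketch below works this out for three hypothetical standard deviations.

```python
# Reproducing the Figure 2 scenario: identical percentage changes imply different
# measured gains when the underlying (assumed normal) distributions differ in width.
from scipy.stats import norm

pct_before, pct_after = 0.3085, 0.5000                   # the same 19.15-point change
standard_deviations = {"A": 1.0, "B": 2.0, "C": 3.0}     # hypothetical distribution widths

for hospital, sd in standard_deviations.items():
    gain = sd * (norm.ppf(pct_after) - norm.ppf(pct_before))
    print(f"{hospital}: measured gain of {gain:.2f} units (sd = {sd})")

# The wider the distribution, the larger the measured change hiding behind the same
# 19.15-point percentage gain: here roughly 0.50, 1.00, and 1.50 units.
```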

And if this is not enough evidence as to the foolhardiness of treating percentages as measures, bear with me through one more example. Imagine another situation in which three different hospitals see their percents agreement with a key survey item change as follows.

A. from 30.85% to 50.00%: a 19.15% change

B. from 36.96% to 50.00%: a 13.04% change

C. from 36.96% to 50.00%: a 13.04% change

Did one change more than the others? Plainly A obtains the largest percentage gain. But Figure 3 shows that, depending on the underlying distribution, A’s 19.15% gain might be a smaller measured change than either B’s or C’s. Further, B’s and C’s measures might not be identical, contrary to what would be expected from the percentages alone.

Figure 3. Percentages Completely at Odds with Measures
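The following sketch, again with made-up standard deviations, shows how A's larger percentage gain can nonetheless be the smaller measured gain, and how B's and C's identical percentage gains can differ as measures.

from statistics import NormalDist

std_norm = NormalDist()
scenarios = {"A": (0.3085, 0.5000, 6.0),    # (percent before, percent after, hypothetical raw-score SD)
             "B": (0.3696, 0.5000, 12.0),
             "C": (0.3696, 0.5000, 15.0)}
for label, (p0, p1, sd) in scenarios.items():
    z_gain = std_norm.inv_cdf(p1) - std_norm.inv_cdf(p0)
    print(f"{label}: {(p1 - p0) * 100:5.2f}% gain = {z_gain:.2f} SD = {z_gain * sd:.1f} raw units")

With these illustrative spreads, A's 19.15% gain amounts to 3 raw units, while B's and C's 13.04% gains amount to 4 and 5.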

Now we have a fuller appreciation of the scope of the problems associated with the changing unit size illustrated in Part One. Though we think we understand percentages and insist on using them as something familiar and routine, the world that they present to us is as crazily distorted as a carnival funhouse. And we won’t even begin to consider how things look in the context of distributions skewed toward one end of the continuum or the other! There is similarly no point at all in going to bimodal or multimodal distributions (ones that have more than one peak). The vast majority of business applications employing scores, ratings, and percentages as measures do not take the underlying distribution into account. Given the problems that arise in optimal conditions (i.e., with a normal distribution), there is no need to belabor the issue with an enumeration of all the possible things that could be going wrong. Far better to simply move on and construct measurement systems that remain invariant across the different shapes of local data sets’ particular distributions.

How could we have gone so far in making these nonsensical numbers the focus of our attention? To put things back in perspective, we need to keep in mind the evolving magnitude of the problems we face. When Florence Nightingale was deploring the lack of any available indications of the effectiveness of her efforts, a little bit of flawed information was a significant improvement over no information. Ordinal, situation-specific numbers provided highly useful information when problems emerged in local contexts on a scale that could be comprehended and addressed by individuals and small groups.

We no longer live in that world. Today's problems require information that is more meaningful, precise, and actionable than ever before. And that information cannot remain accessible only to managers, executives, researchers, and data managers. It must be brought to bear in every transaction and information exchange in the industry.

Information has to be formatted in the common currency of uniform metrics to make it as fluid and empowering as possible. Would the auto industry have been able to bring off a quality revolution if every worker's toolkit were calibrated in a different unit? Could we expect to coordinate schedules easily if we each had clocks scaled in different time units? Obviously not. So why should we expect quality revolutions in health care and education when nearly all of our relevant metrics are incommensurable?

Management consultants realized decades ago that information creates a sense of responsibility in the person who possesses it. We cannot expect clinicians and teachers to take full responsibility for the outcomes they produce until they have the information they need to evaluate and improve them. Existing data and systems plainly are not up to the task.

The problem is far less a matter of complex or difficult issues than it is one of culture and priorities. It often takes less effort to remain in a dysfunctional rut and deal with massive inefficiencies than it does to get out of the rut and invent a new system with new potentials. Big changes tend to take place only when systems become so bogged down by their problems that new systems emerge simply out of the need to find some way to keep things in motion. These blogs are written in the hope that we might find our way to new methods without suffering the catastrophes of total system failure. One might well imagine an entrepreneurially minded consortium of providers, researchers, payers, accreditors, and patient advocates joining forces in small pilot projects testing out new experimental systems.

To know how much of something we're getting for our money, and whether it's a fair bargain, we need to be able to compare amounts across providers, vendors, treatment options, teaching methods, etc. Scores summed from tests, surveys, or assessments, individual ratings, and percentages of a maximum possible score or frequency do not provide this information because they are not measures. Their unit sizes vary across individuals, collections of indicators (instruments), time, and space. The consequences of treating scores and percentages as measures are not trivial. We will eventually come to see that measurement quality is the primary source of the difference between the current health care and education systems, with their regional variations and endlessly spiraling costs, on the one hand, and, on the other, the geographically uniform quality, costs, and improvements of the systems we will create in the future.

Markets are dysfunctional when quality and costs cannot be evaluated in common terms by consumers, providers’ quality improvement specialists, researchers, accreditors, and payers. There are widespread calls for greater transparency in purchasing decisions, but transparency is not being defined and operationalized meaningfully or usefully. As currently employed, transparency refers to making key data available for public scrutiny. But these data are almost always expressed as scores, ratings, or percentages that are anything but transparent. In addition to not adding up, these data are also usually presented in indigestibly large volumes, and are not quality assessed.

All things considered, we're doing amazingly well with our health care and education systems given the way we've hobbled ourselves with dysfunctional, incommensurable measures. And that gives us real cause for hope! What will we be able to accomplish when we really put our minds to measuring what we want to manage? How much better will we be able to do when entrepreneurs have the tools they need to innovate new efficiencies? Who knows what we'll be capable of when we have meaningful measures that stand for amounts that really add up, when data volumes are dramatically reduced to manageable levels, and when data quality is effectively assessed and improved?

For more on the problems associated with these kinds of percentages in the context of NCLB, see Andrew Dean Ho’s article in the August/September, 2008 issue of Educational Researcher, and Charles Murray’s “By the Numbers” column in the July 25, 2006 Wall Street Journal.

This is not the end of the story as to what the new measurement paradigm brings to bear. Next, I’ll post a table contrasting the features of scores, ratings, and percentages with those of measures. Until then, check out the latest issue of the Journal of Applied Measurement at http://www.jampress.org, see what’s new in measurement software at http://www.winsteps.com or http://www.rummlab.com.au, or look into what’s up in the way of measurement research projects with the BEAR group at UC Berkeley (http://gse.berkeley.edu/research/BEAR/research.html).

Finally, keep in mind that we are what we measure. It’s time we measured what we want to be.

Creative Commons License
LivingCapitalMetrics Blog by William P. Fisher, Jr., Ph.D. is licensed under a Creative Commons Attribution-Noncommercial-No Derivative Works 3.0 United States License.
Based on a work at livingcapitalmetrics.wordpress.com.
Permissions beyond the scope of this license may be available at http://www.livingcapitalmetrics.com.

Graphic Illustrations of Why Scores, Ratings, and Percentages Are Not Measures, Part One

July 1, 2009

It happens occasionally, when I'm speaking to a group unfamiliar with measurement concepts, that the audience audibly gasps at some of the things I say. What can be so shocking about anything as mundane as measurement? A lot of things, in fact, since we are in the strange situation of having valid and rigorous intuitions about what measures ought to be, while we simultaneously have entire domains of life in which our measures almost never live up to those intuitions in practice.

So today I’d like to spell out a few things about measurement, graphically. First, I’m going to draw a picture of what good measurement looks like. This picture will illustrate why we value numbers and want to use them for managing what’s important. Then I’m going to draw a picture of what scores, ratings, and percentages look like. Here we’ll see how numbers do not automatically stand for something that adds up the way they do, and why we don’t want to use these funny numbers for managing anything we really care about. What we will see here, in effect, is why high stakes graduation, admissions, and professional certification and licensure testing agencies have long since abandoned scores, ratings, and percentages as their primary basis for making decisions.

After contrasting those pictures, a third picture will illustrate how to blend the valid intuitions informing what we expect from measures with the equally valid intuitions informing the observations expressed in scores, ratings, and percentages.

Imagine measuring everything in the room you’re in twice, once with a yardstick and once with a meterstick. You record every measure in inches and in centimeters. Then you plot these pairs of measures against each other, with inches on the vertical axis and centimeters on the horizontal. You would come up with a picture like Figure 1, below.

Figure 1. How We Expect Measures to Work

The key thing to appreciate about this plot is that the amounts of length measured by the two different instruments stay the same no matter which number line they are mapped onto. You would get a plot like this even if you sawed a yardstick in half and plotted the inches read off the two halves. You'd also get the same kind of plot (obviously) if you paired up measures of the same things from two different manufacturers' inch rulers, or from two different brands of metersticks. And you could do the same kind of thing with ounces and grams, or degrees Fahrenheit and Celsius.
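Anyone can run the experiment on paper, or simulate it. Here is a minimal sketch of the plot just described; the lengths are invented, and the only thing being assumed is the fixed conversion of 2.54 centimeters to the inch.

import random
import matplotlib.pyplot as plt

random.seed(0)
lengths_cm = [random.uniform(5, 300) for _ in range(40)]   # lengths of things in the room
lengths_in = [cm / 2.54 for cm in lengths_cm]              # the same lengths, read in inches
plt.scatter(lengths_cm, lengths_in)
plt.xlabel("centimeters")
plt.ylabel("inches")
plt.title("How we expect measures to work")
plt.show()

Whatever objects you pick, the points fall on one straight line, because the same amounts of length are simply being mapped onto two different number lines.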

So here we are immersed in the boring-to-the-point-of-being-banal details of measurement. We take these alignments completely for granted, but they are not given to us for nothing. They are products of the huge investments we make in metrological standards. Metrology came of age in the early nineteenth century. Until then, weights and measures varied from market to market. Units with the same name might be different sizes, and units with different names might be the same size. As was so rightly celebrated on World Metrology Day (May 20), metric uniformity contributes hugely to the world economy by reducing transaction costs and by structuring representations of fair value.

We are in dire need of similar metrological systems for human, social, and natural capital. Health care reform, improved education systems, and environmental management will not come anywhere near realizing their full potentials until we establish, implement, and monitor metrological standards that bring intangible forms of capital to economic life.

But can we construct plots like Figure 1 from the numeric scores, ratings, and percentages we commonly assume to be measures? Figure 2 shows the kind of picture we get when we plot percentages against each other (scores and ratings behave in the same way, for reasons given below). These data might be from easy and hard halves of the same reading or math test, from agreeable and disagreeable ends of the same rating scale survey, or from different tests or surveys that happen to vary in their difficulty or agreeability. The Figure 2 data might also come from different situations in which some event or outcome occurs more frequently in one place than it does in another (we’ll go more into this in Part Two of this report).

Figure 2. Percents Correct or Agreement from Different Tests or Surveys

In contrast with the linear relation obtained in the comparison of inches and centimeters, here we have a curve. Why must this relation necessarily be curved? It cannot be linear because both instruments bound their scores between 0 and 100 percent, and those bounds sit at different points on the underlying variable. So, if someone scores 0 on the easy instrument, they are highly likely to also score 0 on the instrument posing more difficult or disagreeable questions. Conversely, if someone scores 100 on the hard instrument, they are highly likely to also score 100 on the easy one. The floor and ceiling effects squeeze the scores together at the extremes, bending the relation.

But what is going to happen in the rest of the measurement range? By the definition of easy and hard, scores on the easy instrument will be higher than those on the hard one. And because the same measured amount is associated with different ranges in the easy and hard score distributions, the scores vary at different rates (Part Two will explore this phenomenon in more detail).
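To make the source of the curve concrete, here is a minimal sketch assuming a simple logistic (Rasch-type) model in which the expected percent correct depends only on the difference between a person's measure and the instrument's difficulty. The difficulties of -1 and +1 logits are illustrative, not taken from any actual test.

import math

def expected_percent(measure, difficulty):
    # expected percent correct under a simple logistic model
    return 100.0 / (1.0 + math.exp(-(measure - difficulty)))

easy, hard = -1.0, 1.0                     # hypothetical instrument difficulties, in logits
for measure in [-3, -2, -1, 0, 1, 2, 3]:   # a range of person measures
    print(f"measure {measure:+d}: easy {expected_percent(measure, easy):5.1f}%"
          f"  hard {expected_percent(measure, hard):5.1f}%")

Plot the easy percentages against the hard ones and you trace the bowed curve of Figure 2; the two instruments agree only near 0% and 100%.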

These kinds of numbers are called ordinal because they meaningfully convey information about rank order. They do not, however, stand for amounts that add up. We are, of course, completely free to treat these ordinal numbers however we want, in any kind of arithmetical or statistical comparison. Whether such comparisons are meaningful and useful is a completely different issue.

Figure 3 shows the Figure 2 data transformed. The mathematical transformation of the percentages produces what is known as a logit, so called because it is a log-odds unit, obtained as the natural logarithm of the response odds. (The response odds are the response probabilities–the original percentages of the maximum possible score–divided by one minus themselves.) This is the simplest possible way of estimating linear measures. Virtually no computer program providing these kinds of estimates would employ an algorithm this simple and potentially fallible.

Figure 3. Logit (Log-Odds Units) Estimates of the Figure 2 Data
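Here is a minimal sketch of the transformation, applied to the easy and hard percentages generated under the illustrative logistic model above (not real survey data).

import math

def logit(percent):
    # the natural log of the odds implied by a percent of the maximum possible score
    p = percent / 100.0
    return math.log(p / (1.0 - p))

pairs = [(11.9, 1.8), (26.9, 4.7), (50.0, 11.9), (73.1, 26.9), (88.1, 50.0)]
for easy_pct, hard_pct in pairs:
    print(f"easy {easy_pct:4.1f}% -> {logit(easy_pct):+5.2f} logits"
          f"   hard {hard_pct:4.1f}% -> {logit(hard_pct):+5.2f} logits")

On the percentage scale the two columns differ by wildly varying amounts; on the logit scale they differ by a roughly constant two logits, the difference between the two difficulties. A plot of one set of logits against the other is therefore a straight line, and removing that constant offset puts the points on the identity line, as in Figure 3.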

Although the relationship shown in Figure 3 is not as precise as that shown in Figure 1, especially at the extremes, the values plotted fall far closer to the identity line than the values in Figure 2 do. Like Figure 1, Figure 3 shows that constant amounts of the thing measured exist irrespective of the particular number line they happen to be mapped onto.

What this means is that the two instruments could be designed so that the same numbers are read off of them when the same amounts are measured. We value numbers as much as we do because they are so completely transparent: 2+2=4 no matter what. But this transparency becomes a liability when we assume that every unit amount is the same as all the others when, in fact, they vary substantially. When different units stand for different amounts, confusion reigns. But we can reasonably hope and strive for great things as we bring human, social, and natural capital to life via universally uniform metrics traceable to reference standards.

A large literature on these methods exists and ought to be more widely read. For more information, see http://www.rasch.org, http://www.livingcapitalmetrics.com, etc.

Creative Commons License
LivingCapitalMetrics Blog by William P. Fisher, Jr., Ph.D. is licensed under a Creative Commons Attribution-Noncommercial-No Derivative Works 3.0 United States License.
Based on a work at livingcapitalmetrics.wordpress.com.
Permissions beyond the scope of this license may be available at http://www.livingcapitalmetrics.com.

Infrastructure and Health Care Reform

June 25, 2009

As an educator and researcher involved in the theory and application of advanced measurement methods, I am both encouraged by the (June 14) New York Times Sunday magazine’s focus on infrastructure, and chagrined at the uninformed level at which ongoing health care and economic reform discussions and analyses are taking place (as evident in the Sunday, June 21, Times editorial and business pages).

Socialistic solutions to problems in education, health care, and the economy at large are the inevitable outcome of our incomplete implementation and understanding of market capitalism. Take, for instance, the rancorous debate as to whether we should create a new public health insurance plan to compete with private plans. None of the proposals or counter proposals amount to anything more than alternate ways of manhandling health care resources toward one or another politically predetermined end. Accordingly, we find ourselves in the dilemma of choosing between equally real dangers. On the one hand, reduced payments and cost-cutting might do nothing but lower the quality and quantity of the available services, and, on the other hand, maintaining quality and quantity will eventually make health care completely unaffordable.

And here is what really gets me: apart from blind faith in the power of reduced payments to promote innovation, there is nary a word about how to set up a market infrastructure that will allow the invisible hand to do its work in bringing supply and demand efficiently into balance. Far from seeking ways in which costs can be reduced and profits enhanced at the same time, as they are in other industries, the automatic assumption in health care always seems to be that lower costs mean lower profits. We have always thought socialistically about health care, with economists since Arrow widely holding that health care is constitutionally incapable of sustaining a market economy. Hope that the economists are wrong appears to spring eternal, but who is doing the work to find a new way?

A new direction shows itself when we listen more closely to ourselves, and follow through on our basically valid intuitions. For instance, issues of sustainability, justice, and responsibility in the economic conversation employ the word “capital” to refer to a wide variety of resources essential to productivity, such as health, literacy, numeracy, community, and the air, water, and food services provided by nature.

The problem is that there seems to be little or no interest in figuring out how to transform this usage from an empty metaphor into a powerful tool. We similarly repeat ad nauseam the mantra, "you manage what you measure," but almost nothing is being done to employ the highly advantageous features of advanced measurement theory and practice in the management of intangible forms of capital.

Better measurement of living capital is, however, absolutely essential to health care reform, to entrepreneurial innovation in education, and to reinventing capitalism. Instead of continuing to rely on highly variable local efforts at measuring and managing human, social, and natural capital, we need a broad program of capacity building focused on a metrological infrastructure for living capital and its implementations. If there is any single blind spot that prevents us from fully learning the lessons of our recent economic disasters, it is the potential that new measurement technologies offer for reduced frictions and lower transaction costs in the intangible capital markets.

We know where to start, from two basic principles of market economics. First, we know that transaction costs are the most important costs in any market. High transaction costs can strangle a market as the flow of capital is stifled. Second, we know that innovation, essential to product development, improvements, marketing, and enhanced profitability, is almost never accomplished by an individual working in isolation. Innovation requires an environment in which it is safe to play and to make mistakes, and through which new value can be immediately and decisively recognized for what it is.

How can living capital market frictions be reduced? For starters, we could focus on effecting order-of-magnitude improvements in the meaningfulness of the metrics we use for screening, diagnosis, research, and accountability. We can do whatever arithmetic we want with the numbers we have at hand, but most of the numbers that pass for measures of health, functionality, quality of life and care, etc. do not actually stand for something that adds up. The good news is that, again, the intuitions informing our efforts so far are largely valid, and have the ball rolling in the right direction.

How can better measurement advance the cause of innovation in health care? By providing a common language that all stakeholders can think and act in together, harmoniously. Research over the last 80 years has repeatedly proven the viability of a kind of a metric system for the things we measure with surveys, assessments, and tests. Such a system of universally uniform metrics would provide the common currency unifying the health care economy and establishing the basis for market self-organization. But contrary to our predominant metaphysical faith, scientifically proven results do not magically propagate themselves into the world. We have to invent and construct the systems we need.

Our efforts in this direction are stymied, as Tom Vanderbilt put it in the Times Sunday magazine on infrastructure, to the extent that we have “an inimical incuriosity” about the banal fundamentals of the systems that shape our world. We simply take dry technicalities for granted, and notice them only when they fail us. Our problem with intangibles measurement, then, is compounded by the fact that the infrastructure we are taking for granted is not just invisible or broken, it is nonexistent. Until we make the effort to build our capacity for managing health and other forms of living capital by creating reference standard common currencies for expressing, managing, and trading on their value, all of our efforts at health care reform–and at reinventing capitalism–will fall far short of what is possible.
William P. Fisher, Jr., Ph.D.
william@livingcapitalmetrics.com
http://www.LivingCapitalMetrics.com

We are what we measure.
It’s time we measured what we want to be.

Creative Commons License
LivingCapitalMetrics Blog by William P. Fisher, Jr., Ph.D. is licensed under a Creative Commons Attribution-Noncommercial-No Derivative Works 3.0 United States License.
Based on a work at livingcapitalmetrics.wordpress.com.
Permissions beyond the scope of this license may be available at http://www.livingcapitalmetrics.com.