Archive for July, 2009

Posted today at HealthReform.gov

July 26, 2009

Any bill serious about health care reform needs to demand that the industry take advantage of readily available and dramatically improved measurement methods. We manage what we measure, and 99% of existing outcome measures are measures in name only. A kind of metric system for outcomes could provide standard product definitions, could effect huge reductions in information transaction costs, and could bring about market efficiencies of a whole new order of magnitude. Far from being a drag on the system, the profit motive is the best source of energy we have for driving innovation and resetting the cost-quality equation. But the disastrously low quality of our measures corrupts the data and prevents informed decision making by consumers and quality improvement experts. Any health care reform effort that does not demand improved measurement is doomed to fall far short of the potential that is within our reach. For more information, see www.Rasch.org, www.livingcapitalmetrics.com, http://dx.doi.org/10.1016/j.measurement.2009.03.014, or http://home.att.net/~rsmith.arm/RMHS_flyer.pdf.

Creative Commons License
LivingCapitalMetrics Blog by William P. Fisher, Jr., Ph.D. is licensed under a Creative Commons Attribution-Noncommercial-No Derivative Works 3.0 United States License.
Based on a work at livingcapitalmetrics.wordpress.com.
Permissions beyond the scope of this license may be available at http://www.livingcapitalmetrics.com.


Contesting the Claim, Part III: References

July 24, 2009

References

Andersen, E. B. (1977). Sufficient statistics and latent trait models. Psychometrika, 42(1), 69-81.

Andersen, E. B. (1995). What George Rasch would have thought about this book. In G. H. Fischer & I. W. Molenaar (Eds.), Rasch models: Foundations, recent developments, and applications (pp. 383-390). New York: Springer-Verlag.

Andrich, D. (1988). Rasch models for measurement (Sage University Paper Series on Quantitative Applications in the Social Sciences, series no. 07-068). Beverly Hills, California: Sage Publications.

Andrich, D. (1998). Thresholds, steps and rating scale conceptualization. Rasch Measurement Transactions, 12(3), 648-9 [http://209.238.26.90/rmt/rmt1239.htm].

Arnold, S. F. (1985, September). Sufficiency and invariance. Statistics & Probability Letters, 3, 275-279.

Bond, T., & Fox, C. (2001). Applying the Rasch model: Fundamental measurement in the human sciences. Mahwah, New Jersey: Lawrence Erlbaum Associates.

Burdick, D. S., Stone, M. H., & Stenner, A. J. (2006). The Combined Gas Law and a Rasch Reading Law. Rasch Measurement Transactions, 20(2), 1059-60 [http://www.rasch.org/rmt/rmt202.pdf].

Burdick, H., & Stenner, A. J. (1996). Theoretical prediction of test items. Rasch Measurement Transactions, 10(1), 475 [http://www.rasch.org/rmt/rmt101b.htm].

Choi, E. (1998, Spring). Rasch invents “Ounces.” Popular Measurement, 1(1), 29 [http://www.rasch.org/pm/pm1-29.pdf].

Cohen, J. (1994). The earth is round (p < 0.05). American Psychologist, 49, 997-1003.

DeBoeck, P., & Wilson, M. (Eds.). (2004). Explanatory item response models: A generalized linear and nonlinear approach. (Statistics for Social and Behavioral Sciences). New York: Springer-Verlag.

Dynkin, E. B. (1951). Necessary and sufficient statistics for a family of probability distributions. Selected Translations in Mathematical Statistics and Probability, 1, 23-41.

Embretson, S. E. (1996, September). Item Response Theory models and spurious interaction effects in factorial ANOVA designs. Applied Psychological Measurement, pp. 201-212.

Falmagne, J.-C., & Narens, L. (1983). Scales and meaningfulness of quantitative laws. Synthese, 55, 287-325.

Fischer, G. H. (1981, March). On the existence and uniqueness of maximum-likelihood estimates in the Rasch model. Psychometrika, 46(1), 59-77.

Fischer, G. H. (1995). Derivations of the Rasch model. In G. Fischer & I. Molenaar (Eds.), Rasch models: Foundations, recent developments, and applications (pp. 15-38). New York: Springer-Verlag.

Fisher, W. P., Jr. (1988). Truth, method, and measurement: The hermeneutic of instrumentation and the Rasch model [diss]. Dissertation Abstracts International, 49, 0778A, Dept. of Education, Division of the Social Sciences: University of Chicago (376 pages, 23 figures, 31 tables).

Fisher, W. P., Jr. (1997). Physical disability construct convergence across instruments: Towards a universal metric. Journal of Outcome Measurement, 1(2), 87-113.

Fisher, W. P., Jr. (1997, June). What scale-free measurement means to health outcomes research. Physical Medicine & Rehabilitation State of the Art Reviews, 11(2), 357-373.

Fisher, W. P., Jr. (1999). Foundations for health status metrology: The stability of MOS SF-36 PF-10 calibrations across samples. Journal of the Louisiana State Medical Society, 151(11), 566-578.

Fisher, W. P., Jr. (2000). Objectivity in psychosocial measurement: What, why, how. Journal of Outcome Measurement, 4(2), 527-563.

Fisher, W. P., Jr. (2004, October). Meaning and method in the social sciences. Human Studies: A Journal for Philosophy and the Social Sciences, 27(4), 429-54.

Fisher, W. P., Jr. (2008, Summer). The cash value of reliability. Rasch Measurement Transactions, 22(1), 1160-3 [http://www.rasch.org/rmt/rmt221.pdf].

Fisher, W. P., Jr. (2009, July). Invariance and traceability for measures of human, social, and natural capital: Theory and application. Measurement (Elsevier), in press.

Goodman, S. N. (1999, 15 June). Toward evidence-based medical statistics. 1: The p-value fallacy. Annals of Internal Medicine, 130(12), 995-1004.

Guttman, L. (1985). The illogic of statistical inference for cumulative science. Applied Stochastic Models and Data Analysis, 1, 3-10.

Hall, W. J., Wijsman, R. A., & Ghosh, J. K. (1965). The relationship between sufficiency and invariance with applications in sequential analysis. Annals of Mathematical Statistics, 36, 575-614.

Linacre, J. M. (1993). Rasch-based generalizability theory. Rasch Measurement Transactions, 7(1), 283-284 [http://www.rasch.org/rmt/rmt71h.htm].

Luce, R. D., & Tukey, J. W. (1964). Simultaneous conjoint measurement: A new kind of fundamental measurement. Journal of Mathematical Psychology, 1(1), 1-27.

Meehl, P. E. (1967). Theory-testing in psychology and physics: A methodological paradox. Philosophy of Science, 34(2), 103-115.

Meehl, P. E. (1978). Theoretical risks and tabular asterisks: Sir Karl, Sir Ronald, and the slow progress of soft psychology. Journal of Consulting and Clinical Psychology, 46, 806-34.

Michell, J. (1999). Measurement in psychology: A critical history of a methodological concept. Cambridge: Cambridge University Press.

Moulton, M. (1993). Probabilistic mapping. Rasch Measurement Transactions, 7(1), 268 [http://www.rasch.org/rmt/rmt71b.htm].

Mundy, B. (1986, June). On the general theory of meaningful representation. Synthese, 67(3), 391-437.

Narens, L. (2002). Theories of meaningfulness (S. W. Link & J. T. Townsend, Eds.). Scientific Psychology Series. Mahwah, New Jersey: Lawrence Erlbaum Associates.

Newby, V. A., Conner, G. R., Grant, C. P., & Bunderson, C. V. (2009). The Rasch model and additive conjoint measurement. Journal of Applied Measurement, 10(4), 348-354.

Pelton, T., & Bunderson, V. (2003). The recovery of the density scale using a stochastic quasi-realization of additive conjoint measurement. Journal of Applied Measurement, 4(3), 269-81.

Ramsay, J. O., Bloxom, B., & Cramer, E. M. (1975, June). Review of Foundations of Measurement, Vol. 1, by D. H. Krantz et al. Psychometrika, 40(2), 257-262.

Rasch, G. (1960). Probabilistic models for some intelligence and attainment tests (Reprint, with Foreword and Afterword by B. D. Wright, Chicago: University of Chicago Press, 1980). Copenhagen, Denmark: Danmarks Paedagogiske Institut.

Roberts, F. S., & Rosenbaum, Z. (1986). Scale type, meaningfulness, and the possible psychophysical laws. Mathematical Social Sciences, 12, 77-95.

Romanoski, J. T., & Douglas, G. (2002). Rasch-transformed raw scores and two-way ANOVA: A simulation analysis. Journal of Applied Measurement, 3(4), 421-430.

Rozeboom, W. W. (1960). The fallacy of the null-hypothesis significance test. Psychological Bulletin, 57(5), 416-428.

Smith, R. M., & Taylor, P. (2004). Equating rehabilitation outcome scales: Developing common metrics. Journal of Applied Measurement, 5(3), 229-42.

Thurstone, L. L. (1928). Attitudes can be measured. American Journal of Sociology, XXXIII, 529-544. Reprinted in L. L. Thurstone, The Measurement of Values. Midway Reprint Series. Chicago, Illinois: University of Chicago Press, 1959, pp. 215-233.

van der Linden, W. J. (1992). Sufficient and necessary statistics. Rasch Measurement Transactions, 6(3), 231 [http://www.rasch.org/rmt/rmt63d.htm].

Velleman, P. F., & Wilkinson, L. (1993). Nominal, ordinal, interval, and ratio typologies are misleading. The American Statistician, 47(1), 65-72.

Wright, B. D. (1977). Solving measurement problems with the Rasch model. Journal of Educational Measurement, 14(2), 97-116 [http://www.rasch.org/memo42.htm].

Wright, B. D. (1997, Winter). A history of social science measurement. Educational Measurement: Issues and Practice, pp. 33-45, 52 [http://www.rasch.org/memo62.htm].


Contesting the Claim, Part II: Are Rasch Measures Really as Objective as Physical Measures?

July 22, 2009

When a raw score is sufficient to the task of measurement, the model is the Rasch model, we can estimate the parameters consistently, and we can evaluate the fit of the data to the model. The invariance properties that follow from a sufficient statistic include virtually the entire class of invariant rules (Hall, Wijsman, & Ghosh, 1965; Arnold, 1985), and similar relationships with other key measurement properties follow from there (Fischer, 1981, 1995; Newby, Conner, Grant, & Bunderson, 2009; Wright, 1977, 1997).

What does this all actually mean? Imagine we were able to ask an infinite number of people an infinite number of questions that all work together to measure the same thing. Because (1) the scores are sufficient statistics, (2) the ruler is not affected by what is measured, (3) the parameters separate, and (4) the data fit the model, any subset of the questions asked would give the same measure. This means that any subscore for any person measured would be a function of any and all other subscores. When a sufficient statistic is a function of all other sufficient statistics, it is not only sufficient, it is necessary, and is referred to as a minimal sufficient statistic. Thus, if separable, independent model parameters can be estimated, the model must be the Rasch model, and the raw score is both sufficient and necessary (Andersen, 1977; Dynkin, 1951; van der Linden, 1992).
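
The claim can be made tangible with a minimal simulation sketch of my own (it is not taken from any of the cited sources and assumes only numpy; the sample sizes, item calibrations, and the crude Newton-Raphson estimator are arbitrary choices for illustration). It generates dichotomous responses from a Rasch model and then estimates each person’s measure twice, once from the odd-numbered items and once from the even-numbered items, with the item calibrations anchored:

import numpy as np

rng = np.random.default_rng(42)
n_persons, n_items = 500, 40
theta = rng.normal(0.0, 1.0, n_persons)   # true person measures, in logits
delta = np.linspace(-2.0, 2.0, n_items)   # true item calibrations, in logits

# Simulate dichotomous responses from the Rasch model
p_true = 1.0 / (1.0 + np.exp(-(theta[:, None] - delta[None, :])))
data = (rng.random((n_persons, n_items)) < p_true).astype(int)

def person_measure(responses, difficulties, n_iter=50):
    # Crude Newton-Raphson maximum-likelihood estimate of one person's
    # measure, with the item calibrations anchored at their known values.
    r = responses.sum()
    if r == 0 or r == len(responses):      # extreme scores have no finite estimate
        return np.nan
    est = 0.0
    for _ in range(n_iter):
        p = 1.0 / (1.0 + np.exp(-(est - difficulties)))
        est += (r - p.sum()) / (p * (1.0 - p)).sum()
    return est

odd, even = np.arange(0, n_items, 2), np.arange(1, n_items, 2)
m_odd = np.array([person_measure(row[odd], delta[odd]) for row in data])
m_even = np.array([person_measure(row[even], delta[even]) for row in data])

ok = ~np.isnan(m_odd) & ~np.isnan(m_even)
print("Correlation of measures from disjoint item subsets:",
      round(float(np.corrcoef(m_odd[ok], m_even[ok])[0, 1]), 2))

Because each subset here contains only 20 items, the agreement between the two sets of measures is attenuated by measurement error; lengthening the subtests tightens it, which is the same pattern discussed below in connection with Linacre (1993).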

This means that scores, ratings, and percentages actually stand for something measurable only when they fit a Rasch model.  After all, what actually would be the point of using data that do not support the estimation of independent parameters? If the meaning of the results is tied in unknown ways to the specific particulars of a given situation, then those results are meaningless, by definition (Roberts & Rosenbaum, 1986; Falmagne & Narens, 1983; Mundy, 1986; Narens, 2002; also see Embretson, 1996; Romanoski and Douglas, 2002). There would be no point in trying to learn anything from them, as whatever happened was a one-time unique event that tells us nothing we can use in any future event (Wright, 1977, 1997).

What we’ve done here is akin to taking a narrative stroll through a garden of mathematical proofs. These conceptual analyses can be very convincing, but actual demonstrations of them are essential. Demonstrations would be especially persuasive if there were some way of showing three things. First, shouldn’t it be possible to construct ordinal ratings or scores for one or another physical variable that, when scaled, give us measures that are the same as the usual measures we are accustomed to?

This would show that we can use the type of instrument usually found in the social sciences to construct physical measures with the characteristics we expect. There are four available examples, in fact, involving paired comparisons of weights (Choi, 1998), measures of short lengths (Fisher, 1988), ratings of medium-range distances (Moulton, 1993), and a recovery of the density scale (Pelton & Bunderson, 2003). In each case, the Rasch-calibrated experimental instruments produced measures equivalent to the controls, as shown in linear plots of the pairs of measures.
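
The flavor of those demonstrations can be conveyed with a toy analog, again a sketch of my own rather than a reproduction of any of the cited procedures: it assumes, purely for illustration, that the probability of one object being judged heavier than another follows the Bradley-Terry form w_i / (w_i + w_j), simulates a paired-comparison experiment, and then recovers scale values that are a linear function of the logarithms of the true weights.

import numpy as np

rng = np.random.default_rng(7)
weights = np.array([50.0, 70.0, 100.0, 140.0, 200.0, 280.0, 400.0])  # true magnitudes (grams)
n_obj, n_reps = len(weights), 200

# Simulate paired comparisons: P(i judged heavier than j) = w_i / (w_i + w_j)
wins = np.zeros((n_obj, n_obj))
for i in range(n_obj):
    for j in range(i + 1, n_obj):
        w = rng.binomial(n_reps, weights[i] / (weights[i] + weights[j]))
        wins[i, j], wins[j, i] = w, n_reps - w

# Estimate scale values with the classic Bradley-Terry minorization-maximization updates
v = np.ones(n_obj)
for _ in range(200):
    for i in range(n_obj):
        total_wins = wins[i, :].sum()
        denom = sum((wins[i, j] + wins[j, i]) / (v[i] + v[j])
                    for j in range(n_obj) if j != i)
        v[i] = total_wins / denom
    v /= v.prod() ** (1.0 / n_obj)            # fix the unit (geometric mean = 1)

estimated = np.log(v)
true_log = np.log(weights) - np.log(weights).mean()
print("Correlation of recovered scale with log of true weights:",
      round(float(np.corrcoef(estimated, true_log)[0, 1]), 3))

The Bradley-Terry model used here is essentially the Rasch model written for paired comparisons, so the exercise mirrors, in miniature, what the weight, length, distance, and density studies did with real observations.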

A second way to build out from the mathematical proofs is to run experiments in which we check the purported stability of measures and calibrations. We can do this by splitting large data sets, using different groups of items to produce two or more measures for each person, or using different groups of respondents/examinees to provide data for two or more sets of item calibrations. This is a routine experimental procedure in many psychometric labs, and results tend to conform with theory, with strong associations found between increasing sample sizes and increasing reliability coefficients for the respective measures or calibrations. These associations can be plotted (Fisher, 2008), as can the pairs of calibrations estimated from different samples (Fisher, 1999), and the pairs of measures estimated from different instruments (Fisher, Harvey, Kilgore, et al., 1995; Smith & Taylor, 2004). The theoretical expectation of tighter plots for better designed instruments, larger sample sizes, and longer tests is confirmed so regularly that it should itself have the status of a law of nature (Linacre, 1993).
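
The first simulation sketch above can be turned around to illustrate this second point as well (again my own illustration, and deliberately rough: the centered log-odds used below are only an approximation to a real Rasch calibration, but they are adequate for showing invariance). Splitting the simulated persons into two random halves and calibrating the items separately in each half yields two sets of calibrations that line up along the identity line, and the alignment tightens as the samples grow:

import numpy as np

rng = np.random.default_rng(11)
n_persons, n_items = 1000, 25
theta = rng.normal(0.0, 1.0, n_persons)
delta = np.linspace(-1.5, 1.5, n_items)
p = 1.0 / (1.0 + np.exp(-(theta[:, None] - delta[None, :])))
data = (rng.random((n_persons, n_items)) < p).astype(int)

def rough_calibrations(block):
    # Rough stand-in for a Rasch item calibration: centered log-odds of
    # failure versus success in the block of persons supplied.
    share_correct = block.mean(axis=0)
    logits = np.log((1.0 - share_correct) / share_correct)
    return logits - logits.mean()

shuffled = rng.permutation(n_persons)
cal_a = rough_calibrations(data[shuffled[: n_persons // 2]])
cal_b = rough_calibrations(data[shuffled[n_persons // 2 :]])
print("Correlation of item calibrations across person samples:",
      round(float(np.corrcoef(cal_a, cal_b)[0, 1]), 3))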

A third convincing demonstration is to compare studies of the same thing conducted in different times and places by different researchers using different instruments on different samples. If the instruments really measure the same thing, there will not only be obvious similarities in their item contents, but similar items will calibrate in similar positions on the metric across samples. Results of this kind have been obtained in at least three published studies (Fisher, 1997a, 1997b; Belyukova, Stone, & Fox, 2004).

All of these arguments are spelled out in greater length and detail, with illustrations, in a forthcoming article (Fisher, 2009). I learned all of this from Benjamin Wright, who worked directly with Rasch himself, and who, perhaps more importantly, was prepared for what he could learn from Rasch in his previous career as a physicist. Before encountering Rasch in 1960, Wright had worked with Feynman at Cornell, Townes at Bell Labs, and Mulliken at the University of Chicago. Taught and influenced not just by three of the great minds of twentieth-century physics, but also by Townes’ philosophical perspectives on meaning and beauty, Wright had left physics in search of life. He was happy to transfer his experience with computers into his new field of educational research, but he was dissatisfied with the quality of the data and how it was treated.

Rasch’s ideas gave Wright the conceptual tools he needed to integrate his scientific values with the demands of the field he was in. Over the course of his 40-year career in measurement, Wright wrote the first software for estimating Rasch model parameters and continuously improved it; he adapted new estimation algorithms for Rasch’s models and was involved in the articulation of new models; he applied the models to hundreds of data sets using his software; he vigorously invested himself in students and colleagues; he founded new professional societies, meetings, and journals;  and he never stopped learning how to think anew about measurement and the meaning of numbers. Through it all, there was always a yardstick handy as a simple way of conveying the basic requirements of measurement as we intuitively understand it in physical terms.

Those of us who spend a lot of time working with these ideas and trying them out on lots of different kinds of data forget or never realize how skewed our experience is relative to everyone else’s. I suppose you live in a different world when you have the sustained luxury of working with very large databases, as I have, and can see the constancy and stability of well-designed measures and calibrations over time, across instruments, and over repeated samples ranging from 30 to several million.

When you have that experience, it becomes a basic description of reasonable expectation to read the work of a colleague and see him say that “when the key features of a statistical model relevant to the analysis of social science data are the same as those of the laws of physics, then those features are difficult to ignore” (Andrich, 1988, p. 22). After calibrating dozens of instruments over 25 years, some of them many times over, it just seems like the plainest statement of the obvious to see the same guy say “Our measurement principles should be the same for properties of rocks as for the properties of people. What we say has to be consistent with physical measurement” (Andrich, 1998, p. 3).

And I find myself wishing more people held the opinion expressed by two other colleagues, that “scientific measures in the social sciences must hold to the same standards as do measures in the physical sciences if they are going to lead to the same quality of generalizations” (Bond & Fox, 2001, p. 2). When these sentiments are taken to their logical conclusion in a practical application, the real value of “attempting for reading comprehension what Newtonian mechanics achieved for astronomy” (Burdick & Stenner, 1996) becomes apparent. Rasch’s analogy between the structure of his model for reading tests and the structure of Newton’s Second Law can be restated relative to any physical law expressed as universal conditionals among variable triplets; a theory of the variable measured capable of predicting item calibrations provides the causal story for the observed variation (Burdick, Stone, & Stenner, 2006; DeBoeck & Wilson, 2004).

Knowing what I know, from the mathematical principles I’ve been trained in and from the extensive experimental work I’ve done, it seems amazing that so little attention is actually paid to tools and concepts that receive daily lip service as to their central importance in every facet of life, from health care to education to economics to business. Measurement technology rose up decades ago in preparation for the demands of today’s challenges. It is just plain weird that we are not using it at anything anywhere near its potential.

I’m convinced, though, that the solution is not a matter of applying persuasive rhetoric to the minds of the right people. Rather, someone, hopefully me, has got to configure the right combination of players in the right situation at the right time and at the right place to create a new form of real value that can’t be created any other way. Like they say, money talks. Persuasion is all well and good, but things will really take off only when people see that better measurement can aid in removing inefficiencies from the management of human, social, and natural capital, that better measurement is essential to creating sustainable and socially responsible policies and practices, and that better measurement means new sources of profitability. I’m convinced that advanced measurement techniques are really nothing more than a new form of IT or communications technology. They will fit right into the existing networks and multiply their efficiencies many times over.

And when they do, we may be in a position to finally

“confront the remarkable fact that throughout the gigantic range of physical knowledge numerical laws assume a remarkably simple form provided fundamental measurement has taken place. Although the authors cannot explain this fact to their own satisfaction, the extension to behavioral science is obvious: we may have to await fundamental measurement before we will see any real progress in quantitative laws of behavior. In short, ordinal scales (even continuous ordinal scales) are perhaps not good enough and it may not be possible to live forever with a dozen different procedures for quantifying the same piece of behavior, each making strong but untestable and basically unlikely assumptions which result in nonlinear plots of one scale against another. Progress in physics would have been impossibly difficult without fundamental measurement and the reader who believes that all that is at stake in the axiomatic treatment of measurement is a possible criterion for canonizing one scaling procedure at the expense of others is missing the point” (Ramsay, Bloxom, and Cramer, 1975, p. 262).


Contesting the Claim, Part I: Are Rasch Measures Really as Objective as Physical Measures?

July 21, 2009

Psychometricians, statisticians, metrologists, and measurement theoreticians tend to be pretty unassuming kinds of people. They’re unobtrusive and retiring, by and large. But there is one thing some of them are prone to say that will raise the ire of others in a flash, and the poor innocent geek will suddenly be subjected to previously unknown forms and degrees of social exclusion.

What is that one thing? “Instruments calibrated by fitting data to a Rasch model measure with the same kind of objectivity as is obtained with physical measures.” That’s one version. Another could be along these lines: “When data fit a Rasch model, we’ve discovered a pattern in human attitudes or behaviors so regular that it is conceptually equivalent to a law of nature.”

Maybe it is the implication of objectivity as something that must be politically incorrect that causes the looks of horror and recoiling retreats in the nonmetrically inclined when they hear things like this. Maybe it is the ingrained cultural predisposition to thinking such claims outrageously preposterous that makes those unfamiliar with 80 years of developments and applications so dismissive. Maybe it’s just fear of the unknown, or a desire not to have to be responsible for knowing something important that hardly anyone else knows.

Of course, it could just be a simple misunderstanding. When people hear the word “objective,” do most of them have an image of an object in mind? Does objectivity connote physical concreteness to most people? That doesn’t hold up well for me, since we can be objective about events and about things people do without any confusion over whether we can touch and feel what’s at issue.

No, I think something else is going on. I think it has to do with the persistent idea that objectivity requires a disconnected, alienated point of view, one that ignores the mutual implication of subject and object in favor of analytically tractable formulations of problems that, though solvable, are irrelevant to anything important or real. But that is hardly the only available meaning of objectivity, and it isn’t anywhere near the best. It certainly is not what is meant in the world of measurement theory and practice.

It’s better to think of objectivity as something having to do with things like the object of a conversation, or an object of linguistic reference: “chair” as referring to the entire class of all forms of seating technology, for instance. In these cases, we know right away that we’re dealing with what might be considered a heuristic ideal, an abstraction. It also helps to think of objectivity in terms of fairness and justice. After all, don’t we want our educational, health care, and social services systems to respect the equality of all individuals and their rights?

That is not, of course, how measurement theoreticians in psychology have always thought about objectivity. In fact, it was only 70-80 years ago that most psychologists gave up on objective measurement because they couldn’t find enough evidence of concrete phenomena to support the claims to objectivity they wanted to make (Michell, 1999). The focus on the reflex arc led a lot of psychologists into psychophysics, and the effects of operant conditioning led others to behaviorism. But a lot of the problems studied in these fields, though solvable, turned out to be uninteresting and unrelated to the larger issues of life demanding attention.

And so, with no physical entity that could be laid end-to-end and concatenated in the way weights are in a balance scale, psychologists just redefined measurement to suit what they perceived to be the inherent limits of their subject matter. Measurement didn’t have to be just ratio or interval, it could also be ordinal and even nominal. The important thing was to get numbers that could be statistically manipulated. That would provide more than enough credibility, or obfuscation, to create the appearance of legitimate science.

But while mainstream psychology was focused on hunting for statistically significant p-values, there were others trying to figure out if attitudes, abilities, and behaviors could be measured in a rigorously meaningful way.

Louis Thurstone, a former electrical engineer turned psychologist, was among the first to formulate the problem. Writing in 1928, Thurstone rightly made the instrument itself the focus of attention:

“The scale must transcend the group measured.–One crucial experimental test must be applied to our method of measuring attitudes before it can be accepted as valid. A measuring instrument must not be seriously affected in its measuring function by the object of measurement. To the extent that its measuring function is so affected, the validity of the instrument is impaired or limited. If a yardstick measured differently because of the fact that it was a rug, a picture, or a piece of paper that was being measured, then to that extent the trustworthiness of that yardstick as a measuring device would be impaired. Within the range of objects for which the measuring instrument is intended, its function must be independent of the object of measurement” (Thurstone, 1959, p. 228).

Thurstone aptly captures what is meant when it is said that attitudes, abilities, or behaviors can be measured with the same kind of objectivity as is obtained in the natural sciences. Objectivity is realized when a test, survey, or assessment functions the same way no matter who is being measured, and, conversely (Thurstone took this up, too), an attitude, ability, or behavior exhibits the same amount of what is measured no matter which instrument is used.

This claim, too, may seem to some to be so outrageously improbable as to be worthy of rejecting out of hand. After all, hasn’t everyone learned how the fact of being measured changes the measure? Thing is, this is just as true in physics and ecology as it is in psychiatry or sociology, and the natural sciences haven’t abandoned their claims to objectivity. So what’s up?

What’s up is that all sciences now have participant observers. The old Cartesian duality of the subject-object split still lives on in our rhetorical habits and affects our choices and behaviors, but, in actual practice, scientific methods have always had to deal with the way questions imply particular answers.

And there’s more. Qualitative methods have grown out of some of the deep philosophical introspections of the twentieth century, such as phenomenology, hermeneutics, deconstruction, postmodernism, etc. But most researchers who are adopting qualitative methods over quantitative ones don’t know that the philosophers legitimating the new focuses on narrative, interpretation, and the construction of meaning did quite a lot of very good thinking about mathematics and quantitative reasoning. Much of my own published work engages with these philosophers to find new ways of thinking about measurement (Fisher, 2004, for instance). And there are some very interesting connections to be made that show quantification does not necessarily have to involve a positivist, subject-object split.

So where does that leave us? Well, with probability. Not in the sense of statistical hypothesis testing, but in the sense of calibrating instruments with known probabilistic characteristics. If the social sciences are ever to be scientific, null hypothesis significance tests are going to have to be replaced with universally uniform metrics embodying and deploying the regularities of natural laws, as is the case in the physical sciences. Various arguments on this issue have been offered for decades (Cohen, 1994; Meehl, 1967, 1978; Goodman, 1999; Guttman, 1985; Rozeboom, 1960). The point is not to prescribe or proscribe particular statistics on the basis of scale type (Velleman & Wilkinson, 1993). Rather, we need to shift and simplify the focus of inference from the statistical analysis of data to the calibration and distribution of instruments that support distributed cognition, unify networks, lubricate markets, and coordinate collective thinking and acting (Fisher, 2000, 2009). Persuasion will likely matter far less in resolving the issue than an ability to create new value, efficiencies, and profits.

In 1964, Luce and Tukey gave us another way of stating what Thurstone was getting at:

“The axioms of conjoint measurement apply naturally to problems of classical physics and permit the measurement of conventional physical quantities on ratio scales…. In the various fields, including the behavioral and biological sciences, where factors producing orderable effects and responses deserve both more useful and more fundamental measurement, the moral seems clear: when no natural concatenation operation exists, one should try to discover a way to measure factors and responses such that the ‘effects’ of different factors are additive.”

In other words, if we cannot find some physical thing that we can make add up the way numbers do, as we did with length, weight, volts, temperature, time, etc., then we ought to ask questions in a way that allows the answers to reveal the kind of patterns we expect to see when things do concatenate. What Thurstone and others working in his wake have done is to see that we could possibly do some things virtually in terms of abstract relations that we cannot do actually in terms of concrete relations.

The concept is no more difficult to comprehend than understanding the difference between playing solitaire with actual cards and writing a computer program to play solitaire with virtual cards. Either way, the same relationships hold.

A Danish mathematician, Georg Rasch, understood this. Working in the 1950s with data from psychological and reading tests, Rasch drew on his training in the natural sciences and mathematics to arrive at a conception of measurement that would apply equally well in the natural and human sciences. He realized that

“…the acceleration of a body cannot be determined; the observation of it is admittedly liable to … ‘errors of measurement’, but … this admittance is paramount to defining the acceleration per se as a parameter in a probability distribution — e.g., the mean value of a Gaussian distribution — and it is such parameters, not the observed estimates, which are assumed to follow the multiplicative law [acceleration = force / mass, or mass * acceleration = force].

“Thus, in any case an actual observation can be taken as nothing more than an accidental response, as it were, of an object — a person, a solid body, etc. — to a stimulus — a test, an item, a push, etc. — taking place in accordance with a potential distribution of responses — the qualification ‘potential’ referring to experimental situations which cannot possibly be [exactly] reproduced.

“In the cases considered [earlier in the book] this distribution depended on one relevant parameter only, which could be chosen such as to follow the multiplicative law.

“Where this law can be applied it provides a principle of measurement on a ratio scale of both stimulus parameters and object parameters, the conceptual status of which is comparable to that of measuring mass and force. Thus, … the reading accuracy of a child … can be measured with the same kind of objectivity as we may tell its weight …” (Rasch, 1960, p. 115).

Rasch’s model not only sets the parameters for data sufficient to the task of measurement, it lays out the relationships that must be found in data for objective results to be possible. Rasch studied with Ronald Fisher in London in 1935, deepened his understanding of statistical sufficiency with him, and then applied it in his measurement work, but not in the way that most statisticians understand it. Yes, in the context of group-level statistics, sufficiency is usually illustrated by the fact that the sample mean and standard deviation carry all of the information a batch of normally distributed data provides about the distribution that produced it. But sufficiency does something quite different in the context of individual-level measurement. Here, a count of correct answers or a sum of ratings is a sufficient statistic when it contains all of the information the data provide about the person measured, which is what makes it possible to estimate the person and item parameters independently of one another instead of leaving them tied together and interacting. So despite his respect for Ronald Fisher and the concept of sufficiency, Rasch’s work with models and methods that worked equally well with many different kinds of distributions led him to jokingly suggest (Andersen, 1995, p. 385) that all textbooks mentioning the normal distribution should be burned!
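
For readers who want the formal core of the sufficiency claim, the standard derivation takes only two lines (the notation is generic and mine, not Rasch’s own). For dichotomous items, the model gives the probability of a whole response pattern x = (x_1, …, x_L) for a person with measure \theta and items with calibrations \delta_i as

P(x \mid \theta) = \frac{\exp\left( r\theta - \sum_i x_i \delta_i \right)}{\prod_i \left( 1 + \exp(\theta - \delta_i) \right)}, \qquad r = \sum_i x_i .

Dividing by the probability of obtaining raw score r at all, which is the same expression summed over every pattern y with that score, the person parameter cancels:

P(x \mid r) = \frac{\exp\left( -\sum_i x_i \delta_i \right)}{\sum_{y:\, \sum_i y_i = r} \exp\left( -\sum_i y_i \delta_i \right)} .

Once the raw score is known, the particular pattern of responses carries no further information about \theta: that is exactly what it means for the count of correct answers to be sufficient, and it is what licenses estimating the item calibrations free of whichever persons happened to respond. Note, too, that the model’s log-odds form, \ln[P/(1 - P)] = \theta - \delta_i, is additive in the person and item parameters, which is precisely the kind of additivity Luce and Tukey’s conjoint measurement axioms ask for.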

In plain English, all that we’re talking about here is what Thurstone said: the ruler has to work the same way no matter what or who it is measuring, and we have to get the same results for what or who we are measuring no matter which ruler we use. When parameters are not separable, when they stick together because some measures change depending on which questions are asked or because some calibrations change depending on who answers them, we have encountered a “failure of invariance” that tells us something is wrong. If we are to persist in our efforts to determine if something objective exists and can be measured, we need to investigate these interactions and explain them. Maybe there was a data entry error. Maybe a form was misprinted. Maybe a question was poorly phrased. Maybe we have questions that address different constructs all mixed together. Maybe math word problems work like reading test items for students who can’t read the language they’re written in.  Standard statistical modeling ignores these potential violations of construct validity in favor of adding more parameters to the model.

But that’s another story for another time. Tomorrow we’ll take a closer look at sufficiency, in both conceptual and practical terms. Cited references are always available on request, but I’ll post them in a couple of days.

The “Standard Model,” Part II: Natural Law, Economics, Measurement, and Capital

July 15, 2009

At Tjalling Koopmans’ invitation, Rasch became involved with the Cowles Commission, working at the University of Chicago in the 1947 academic year, and giving presentations in the same seminar series as Milton Friedman, Kenneth Arrow, and Jimmie Savage (Linacre, 1998; Cowles Foundation, 1947, 1952; Rasch, 1953). Savage would later be instrumental in bringing Rasch back to Chicago in 1960.

Rasch was prompted to approach Savage about giving a course at Chicago after receiving a particularly strong response to some of his ideas from his old mentor, Frisch, who had come to Copenhagen to receive an honorary doctorate in 1959. Frisch shared the first Nobel Prize in economics with Tinbergen, was a co-founder, with Irving Fisher, of the Econometric Society, invented words such as “econometrics” and “macro-economics,” and was the editor of Econometrica for many years. As recounted by Rasch (1977, pp. 63-66; also see Andrich, 1997; Wright, 1980, 1998), Frisch was struck by the disappearance of the person parameter from the comparisons of item calibrations in the series of equations Rasch presented. In response to Frisch’s reaction, Rasch formalized his mathematical ideas in a Separability Theorem.

Why were the separable parameters  significant to Frisch? Because they addressed the problem that was at the center of Frisch’s network of concepts: autonomy, better known today as structural invariance (Aldrich, 1989, p. 15; Boumans, 2005, pp. 51 ff.; Haavelmo, 1948). Autonomy concerns the capacity of data to represent a pattern of relationships that holds up across the local particulars. It is, in effect, Frisch’s own particular way of extending the Standard Model. Irving Fisher (1930) had similarly stated what he termed a Separation Theorem, which, in the manner of previous work by Walras, Jevons, and others, was also presented in terms of a multiplicative relation between three variables. Frisch (1930) complemented Irving Fisher’s focus on an instrumental approach with a mathematical, axiomatic approach (Boumans, 2005) offering necessary and sufficient conditions for tests of Irving Fisher’s theorem.

When Rasch left Frisch, he went directly to London to work with Ronald Fisher, where he remained for a year. In the following decades, Rasch became known as the foremost advocate of Ronald Fisher’s ideas in Denmark. In particular, he stressed the value of statistical sufficiency, calling it the “high mark” of Fisher’s work (Fisher, 1922). Rasch’s student, Erling Andersen, later showed that when raw scores are both necessary and sufficient statistics for autonomous, separable parameters, the model employed is Rasch’s (Andersen, 1977; Fischer, 1981; van der Linden, 1992).

Whether or not Rasch’s conditions exactly reproduce Frisch’s, and whether or not his Separability Theorem is identical with Irving Fisher’s Separation Theorem, it would seem that the time with Frisch exerted a significant influence on Rasch, likely focusing his attention on statistical sufficiency, the autonomy implied by separable parameters, and the multiplicative relations of variable triplets.

These developments, and those documented in my previous posts here, suggest the existence of powerful and untapped potentials hidden within psychometrics and econometrics. The story told thus far remains incomplete. However compelling the logic and personal histories may be, central questions remain unanswered. To provide a more well-rounded assessment of the situation, we must take up several unresolved philosophical issues (Fisher, 2003a, 2003b, 2004).

It is my contention that, for better measurement to become more mainstream, a certain kind of cultural shift is going to have to happen. This shift has already been underway for decades, and has precedents that go back centuries. Its features are becoming more apparent as long-term economic sustainability comes to be understood as involving significant investments in humanly, socially, and environmentally responsible practices. For such practices to be more than superficial expressions of intentions more interested in selfish gain than in the greater good, they have to emerge organically from cultural roots that are already alive and thriving.

It is not difficult to see how such an organic emergence might happen, though describing it appropriately requires an ability to keep the relationship of the local individual to the global universal always in mind. And even if and when that description might be provided, having it in hand in no way shows how it could be brought about. All we can do is to persist in preparing ourselves for the opportunities that arise, reading, thinking, discussing, and practicing. Then, and only then, might we start to plant the seeds, nurture them, and see them grow.

References

Aldrich, J. (1989). Autonomy. Oxford Economic Papers, 41, 15-34.

Andersen, E. B. (1977). Sufficient statistics and latent trait models. Psychometrika, 42(1), 69-81.

Andrich, D. (1997). Georg Rasch in his own words [excerpt from a 1979 interview]. Rasch Measurement Transactions, 11(1), 542-3. [http://www.rasch.org/rmt/rmt111.htm#Georg].

Bjerkholt, O. (2001). Tracing Haavelmo’s steps from Confluence Analysis to the Probability Approach (Tech. Rep. No. 25). Oslo, Norway: Department of Economics, University of Oslo, in cooperation with The Frisch Centre for Economic Research.

Boumans, M. (1993). Paul Ehrenfest and Jan Tinbergen: A case of limited physics transfer. In N. De Marchi (Ed.), Non-natural social science: Reflecting on the enterprise of “More Heat than Light” (pp. 131-156). Durham, North Carolina: Duke University Press.

Boumans, M. (2001). Fisher’s instrumental approach to index numbers. In M. S. Morgan & J. Klein (Eds.), The age of economic measurement (pp. 313-44). Durham, North Carolina: Duke University Press.

Boumans, M. (2005). How economists model the world into numbers. New York: Routledge.

Burdick, D. S., Stone, M. H., & Stenner, A. J. (2006). The Combined Gas Law and a Rasch Reading Law. Rasch Measurement Transactions, 20(2), 1059-60 [http://www.rasch.org/rmt/rmt202.pdf].

Cowles Foundation for Research in Economics. (1947). Report for period 1947, Cowles Commission for Research in Economics. Retrieved 7 July 2009, from Yale University Dept. of Economics: http://cowles.econ.yale.edu/P/reports/1947.htm.

Cowles Foundation for Research in Economics. (1952). Biographies of Staff, Fellows, and Guests, 1932-1952. Retrieved 7 July 2009 from Yale University Dept. of Economics: http://cowles.econ.yale.edu/P/reports/1932-52d.htm#Biographies.

Fischer, G. H. (1981, March). On the existence and uniqueness of maximum-likelihood estimates in the Rasch model. Psychometrika, 46(1), 59-77.

Fisher, I. (1930). The theory of interest. New York: Macmillan.

Fisher, R. A. (1922). On the mathematical foundations of theoretical statistics. Philosophical Transactions of the Royal Society of London, A, 222, 309-368.

Fisher, W. P., Jr. (1992). Objectivity in measurement: A philosophical history of Rasch’s separability theorem. In M. Wilson (Ed.), Objective measurement: Theory into practice. Vol. I (pp. 29-58). Norwood, New Jersey: Ablex Publishing Corporation.

Fisher, W. P., Jr. (2003a, December). Mathematics, measurement, metaphor, metaphysics: Part I. Implications for method in postmodern science. Theory & Psychology, 13(6), 753-90.

Fisher, W. P., Jr. (2003b, December). Mathematics, measurement, metaphor, metaphysics: Part II. Accounting for Galileo’s “fateful omission.” Theory & Psychology, 13(6), 791-828.

Fisher, W. P., Jr. (2004, October). Meaning and method in the social sciences. Human Studies: A Journal for Philosophy and the Social Sciences, 27(4), 429-54.

Fisher, W. P., Jr. (2007, Summer). Living capital metrics. Rasch Measurement Transactions, 21(1), 1092-3 [http://www.rasch.org/rmt/rmt211.pdf].

Fisher, W. P., Jr. (2008, March 28). Rasch, Frisch, two Fishers and the prehistory of the Separability Theorem. In Session 67.056. Reading Rasch Closely: The History and Future of Measurement. American Educational Research Association, Rasch Measurement SIG, New York University, New York City.

Frisch, R. (1930). Necessary and sufficient conditions regarding the form of an index number which shall meet certain of Fisher’s tests. Journal of the American Statistical Association, 25, 397-406.

Haavelmo, T. (1948). The autonomy of an economic relation. In R. Frisch et al. (Eds.), Autonomy of economic relations (pp. 25-38). Oslo, Norway: Memo DE-UO.

Heilbron, J. L. (1993). Weighing imponderables and other quantitative science around 1800. Historical Studies in the Physical and Biological Sciences, 24(Supplement), Part I, 1-337.

Jammer, M. (1999). Concepts of mass in contemporary physics and philosophy. Princeton, NJ: Princeton University Press.

Linacre, J. M. (1998). Rasch at the Cowles Commission. Rasch Measurement Transactions, 11(4), 603.

Maas, H. (2001). An instrument can make a science: Jevons’s balancing acts in economics. In M. S. Morgan & J. Klein (Eds.), The age of economic measurement (pp. 277-302). Durham, North Carolina: Duke University Press.

Mirowski, P. (1988). Against mechanism. Lanham, MD: Rowman & Littlefield.

Rasch, G. (1953, March 17-19). On simultaneous factor analysis in several populations. From the Uppsala Symposium on Psychological Factor Analysis. Nordisk Psykologi’s Monograph Series, 3, 65-71, 76-79, 82-88, 90.

Rasch, G. (1960). Probabilistic models for some intelligence and attainment tests (Reprint, with Foreword and Afterword by B. D. Wright, Chicago: University of Chicago Press, 1980). Copenhagen, Denmark: Danmarks Paedagogiske Institut.

Rasch, G. (1977). On specific objectivity: An attempt at formalizing the request for generality and validity of scientific statements. Danish Yearbook of Philosophy,  14, 58-94.

van der Linden, W. J. (1992). Sufficient and necessary statistics. Rasch Measurement Transactions, 6(3), 231 [http://www.rasch.org/rmt/rmt63d.htm].

Wright, B. D. (1980). Foreword, Afterword. In Probabilistic models for some intelligence and attainment tests, by Georg Rasch (pp. ix-xix, 185-199) [Reprint; original work published in 1960 by the Danish Institute for Educational Research]. Chicago, Illinois: University of Chicago Press. [http://www.rasch.org/memo63.htm]

Wright, B. D. (1994, Summer). Theory construction from empirical observations. Rasch Measurement Transactions, 8(2), 362 [http://www.rasch.org/rmt/rmt82h.htm].

Wright, B. D. (1998, Spring). Georg Rasch: The man behind the model. Popular Measurement, 1, 15-22 [http://www.rasch.org/pm/pm1-15.pdf].

The “Standard Model”, Part I: Natural Law, Economics, Measurement, and Capital

July 14, 2009

In the late 18th and early 19th centuries, scientists took Newton’s successful study of gravitation and the laws of motion as a model for the conduct of any other field of investigation that would purport to be a science. Heilbron (1993) documents how the “Standard Model” evolved and eventually informed the quantitative study of areas of physical nature that had previously been studied only qualitatively, such as cohesion, affinity, heat, light, electricity, and magnetism, collectively referred to as the “six imponderables.” Scientists were widely influenced in experimental practice by the idea that satisfactory understandings of these fundamental forces would be obtained only when they could be treated mathematically in a manner analogous, for instance, with the relations of force, mass, and acceleration in Newton’s Second Law of Motion.

The basic concept is that each parameter in the model has to be measurable independently of the other two, and that any combination of two parameters has to predict the third.  These relationships are demonstrably causal, not just unexplained associations. So force has to be the product of mass and acceleration; mass has to be force divided by acceleration; and acceleration has to be force divided by mass.

The ideal of a mathematical model incorporating these kinds of relations not only guided much of 19th-century science; the effects of acceleration and force on mass were also a vital consideration for Einstein in his formulation of the relation of mass and energy relative to the speed of light, with the result that energy is now separated from mass in the context of relativity theory (Jammer, 1999, pp. 41-42). He realized that, just as humans experience nothing unpleasant or destructive as body mass (or, as is now held, its energy) increases when accelerated to the relatively high speeds of trains, so, too, might we experience similar changes in the relation of mass and energy at speeds approaching that of light. The basic intellectual accomplishment, however, was one in a still-growing history of analogies from the Standard Model, which is itself deeply indebted to the insights of Plato and Euclid in geometry and arithmetic (Fisher, 1992).

Working along an independent line of research, historians of economics and econometrics have documented another extension of the Standard Model. The analogies to the new field of energetics made in the period of 1850-1880, and the use of the balance scale as a model by early economists, such as Stanley Jevons and Irving Fisher, are too widespread to ignore. Mirowski (1988, p. 2) says that, in Walras’ first effort at formulating a mathematical expression of economic relations, he “attempted to implement a Newtonian model of market relations, postulating that ‘the price of things is in inverse ratio to the quantity offered and in direct ratio to the quantity demanded.’”

Jevons similarly studied energetics, in his case, with Michael Faraday, in the 1850s. Pareto also trained as an engineer; he made “a direct extrapolation of the path-independence of equilibrium energy states in rational mechanics and thermodynamics” to “the path-independence of the realization of utility” (Mirowski, 1988, p. 21).

The concept of equilibrium models stems from this work, and was also extensively elaborated in the analogies Jan Tinbergen was well known for drawing between economic phenomena and James Clerk Maxwell’s encapsulation of Newton’s second law. In making these analogies, Tinbergen was deliberately employing Maxwell’s own method of analogy for guiding his thinking (Boumans, 2005, p. 24).

In his 1934-35 studies with Frisch in Oslo and with Ronald Fisher in London, the Danish mathematician Georg Rasch (Andrich, 1997; Wright, 1980) made the acquaintance of a number of Tinbergen’s students, such as Tjalling Koopmans (Bjerkholt 2001, p. 9), from whom he may have heard of Tinbergen’s use of Maxwell’s method of analogy (Fisher, 2008). Rasch employs such an analogy in the presentation of his measurement model (1960, p. 115), pointing out

“…the acceleration of a body cannot be determined; the observation of it is admittedly liable to … ‘errors of measurement’, but … this admittance is paramount to defining the acceleration per se as a parameter in a probability distribution — e.g., the mean value of a Gaussian distribution — and it is such parameters, not the observed estimates, which are assumed to follow the multiplicative law [acceleration = force / mass].
Thus, in any case an actual observation can be taken as nothing more than an accidental response, as it were, of an object — a person, a solid body, etc. — to a stimulus — a test, an item, a push, etc. — taking place in accordance with a potential distribution of responses — the qualification ‘potential’ referring to experimental situations which cannot possibly be [exactly] reproduced.
In the cases considered [earlier in the book] this distribution depended on one relevant parameter only, which could be chosen such as to follow the multiplicative law.
Where this law can be applied it provides a principle of measurement on a ratio scale of both stimulus parameters and object parameters, the conceptual status of which is comparable to that of measuring mass and force. Thus, … the reading accuracy of a child … can be measured with the same kind of objectivity as we may tell its weight ….”

What Rasch provides in the models that incorporate this structure is a portable way of applying Maxwell’s method of analogy from the Standard Model. Data fitting a Rasch model show a pattern of associations suggesting that richer causal explanatory processes may be at work, but model fit alone cannot, of course, provide a construct theory in and of itself (Burdick, Stone, & Stenner, 2006; Wright, 1994). This echoes Tinbergen’s repeated emphasis on the difference between the mathematical model and the substantive meaning of the relationships it represents.
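
The parallel can be written out in symbols (a paraphrase of the structure Rasch describes, in generic notation rather than his original symbols). Newton’s Second Law relates a triplet of quantities multiplicatively, and the multiplicative form of Rasch’s model does the same, with the odds of success standing in the place of the observable outcome:

A = \frac{F}{M} \qquad \text{corresponds to} \qquad \frac{P_{vi}}{1 - P_{vi}} = \frac{\theta_v}{\delta_i} ,

where P_{vi} is the probability that person v succeeds on item i, \theta_v is the person’s ability expressed in ratio form, and \delta_i is the item’s difficulty. Just as any two of force, mass, and acceleration determine the third, any two of the odds, the ability, and the difficulty determine the third; and, as in the physical case, the claim that the parameters actually behave this way is testable against the data rather than simply assumed.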

It also shows appreciation for the reason why Ludwig Boltzmann was so enamored of Maxwell’s method of analogy. As Boumans (1993, p. 136; also see Boumans, 2005, p. 28) explains, “it allowed him to continue to develop mechanical explanations without having to assert, for example, that a gas ‘really’ consists of molecules that ‘really’ interact with one another according to a particular force law. If a scientific theory is only an image or a picture of nature, one need not worry about developing ‘the only true theory,’ and one can be content to portray the phenomena as simply and clearly as possible.” Rasch (1980, pp. 37-38) similarly held that a model is meant to be useful, not true.

Part II continues soon with more on Rasch’s extrapolation of the Standard Model, and references cited.


A Tale of Two Industries: Contrasting Quality Assessment and Improvement Frameworks

July 8, 2009

Imagine the chaos that would result if industrial engineers each had their own tool sets calibrated in idiosyncratic metrics with unit sizes that changed depending on the size of what they measured, and if they conducted quality improvement studies focusing on statistical significance tests of effect sizes. Imagine, furthermore, that these engineers ignored the statistical power of their designs, so that they did not know when they were finding statistically significant results by pure chance and when they were not. And imagine, finally, that they also ignored the substantive meaning of the numbers, never considering the differences they were studying in terms of varying probabilities of response to the questions they asked.

So when one engineer tries to generalize a result across applications, what happens is that it kind of works sometimes, doesn’t work at all other times, is often ignored, and does not command a compelling response from anyone because they are invested in their own metrics, samples, and results, which are different from everyone else’s. If there is any discussion of the relative merits of the research done, it is easy to fall into acrimonious and heated arguments that cannot be resolved because of the lack of consensus on what constitutes valid data, instrumentation, and theory.

Thus, the engineers put up the appearance of polite decorum. They smile and nod at each other’s local, sample-dependent, and irreproducible results, while they build mini-empires of funding, students, quoting circles, and professional associations on the basis of their personal authority and charisma. As they do so, costs in their industry go spiralling out of control, profits are almost nonexistent, fewer and fewer people can afford their products, smart people are going into other fields, and overall product quality is declining.

Of course, this is the state of affairs in education and health care, not in industrial engineering. In the latter field, the situation is much different. Here, everyone everywhere is very concerned to be sure they are always measuring the same thing as everyone else and in the same unit. Unexpected results of individual measures pop out instantly and are immediately redone. Innovations are more easily generated and disseminated because everyone is thinking together in the same language and seeing effects expressed in the same images. Someone else’s ideas and results can be easily fitted into anyone else’s experience, and the viability of a new way of doing things can be evaluated on the basis of one’s own experience and skills.

Arguments can be quite productive, as consensus on basic values drives the demand for evidence. Associations and successes are defined more in terms of merit earned from productivity and creativity demonstrated through the accumulation of generalized results. Costs in these industries are constantly dropping, profits are steady or increasing, more and more people can afford their products, smart people are coming into the field, and overall product quality is improving.

There is absolutely no reason why education and health care cannot thrive and grow like other industries. It is up to us to show how.


Publications Documenting Score, Rating, Percentage Contrasts with Real Measures

July 7, 2009

A few brief and easy introductions to the contrast between scores, ratings, and percentages, on the one hand, and measures, on the other, include:

Linacre, J. M. (1992, Autumn). Why fuss about statistical sufficiency? Rasch Measurement Transactions, 6(3), 230 [http://www.rasch.org/rmt/rmt63c.htm].

Linacre, J. M. (1994, Summer). Likert or Rasch? Rasch Measurement Transactions, 8(2), 356 [http://www.rasch.org/rmt/rmt82d.htm].

Wright, B. D. (1992, Summer). Scores are not measures. Rasch Measurement Transactions, 6(1), 208 [http://www.rasch.org/rmt/rmt61n.htm].

Wright, B. D. (1989). Rasch model from counting right answers: Raw scores as sufficient statistics. Rasch Measurement Transactions, 3(2), 62 [http://www.rasch.org/rmt/rmt32e.htm].

Wright, B. D. (1993). Thinking with raw scores. Rasch Measurement Transactions, 7(2), 299-300 [http://www.rasch.org/rmt/rmt72r.htm].

Wright, B. D. (1999). Common sense for measurement. Rasch Measurement Transactions, 13(3), 704-5  [http://www.rasch.org/rmt/rmt133h.htm].

Longer and more technical comparisons include:

Andrich, D. (1989). Distinctions between assumptions and requirements in measurement in the social sciences. In J. A. Keats, R. Taft, R. A. Heath & S. H. Lovibond (Eds.), Mathematical and Theoretical Systems: Proceedings of the 24th International Congress of Psychology of the International Union of Psychological Science, Vol. 4 (pp. 7-16). North-Holland: Elsevier Science Publishers.

van Alphen, A., Halfens, R., Hasman, A., & Imbos, T. (1994). Likert or Rasch? Nothing is more applicable than good theory. Journal of Advanced Nursing, 20, 196-201.

Wright, B. D., & Linacre, J. M. (1989). Observations are always ordinal; measurements, however, must be interval. Archives of Physical Medicine and Rehabilitation, 70(12), 857-867 [http://www.rasch.org/memo44.htm].

Zhu, W. (1996). Should total scores from a rating scale be used directly? Research Quarterly for Exercise and Sport, 67(3), 363-372.

The following lists provide some key resources. The lists are intended to be representative, not comprehensive.  There are many works in addition to these that document the claims in yesterday’s table. Many of these books and articles are highly technical.  Good introductions can be found in Bezruczko (2005), Bond and Fox (2007), Smith and Smith (2004), Wilson (2005), Wright and Stone (1979), Wright and Masters (1982), Wright and Linacre (1989), and elsewhere. The www.rasch.org web site has comprehensive and current information on seminars, consultants, software, full text articles, professional association meetings, etc.

Books and Journal Issues

Andrich, D. (1988). Rasch models for measurement. Sage University Paper Series on Quantitative Applications in the Social Sciences, vol. series no. 07-068. Beverly Hills, California: Sage Publications.

Andrich, D., & Douglas, G. A. (Eds.). (1982). Rasch models for measurement in educational and psychological research [Special issue]. Education Research and Perspectives, 9(1), 5-118. [Full text available at www.rasch.org.]

Bezruczko, N. (Ed.). (2005). Rasch measurement in health sciences. Maple Grove, MN: JAM Press.

Bond, T., & Fox, C. (2007). Applying the Rasch model: Fundamental measurement in the human sciences, 2d edition. Mahwah, New Jersey: Lawrence Erlbaum Associates.

Choppin, B. (1985). In Memoriam: Bruce Choppin (T. N. Postlethwaite ed.) [Special issue]. Evaluation in Education: An International Review Series, 9(1).

DeBoeck, P., & Wilson, M. (Eds.). (2004). Explanatory item response models: A generalized linear and nonlinear approach. (Statistics for Social and Behavioral Sciences). New York: Springer-Verlag.

Embretson, S. E., & Hershberger, S. L. (Eds.). (1999). The new rules of measurement: What every psychologist and educator should know. Hillsdale, New Jersey: Lawrence Erlbaum Associates.

Engelhard, G., Jr., & Wilson, M. (1996). Objective measurement: Theory into practice, Vol. 3. Norwood, New Jersey: Ablex.

Fischer, G. H., & Molenaar, I. (1995). Rasch models: Foundations, recent developments, and applications. New York: Springer-Verlag.

Fisher, W. P., Jr., & Wright, B. D. (Eds.). (1994). Applications of Probabilistic Conjoint Measurement [Special Issue]. International Journal of Educational Research, 21(6), 557-664.

Garner, M., Draney, K., Wilson, M., Engelhard, G., Jr., & Fisher, W. P., Jr. (Eds.). (2009). Advances in Rasch measurement, Vol. One. Maple Grove, MN: JAM Press.

Granger, C. V., & Gresham, G. E. (Eds). (1993, August). New Developments in Functional Assessment [Special Issue]. Physical Medicine and Rehabilitation Clinics of North America, 4(3), 417-611.

Linacre, J. M. (1989). Many-facet Rasch measurement. Chicago, Illinois: MESA Press.

Liu, X., & Boone, W. (2006). Applications of Rasch measurement in science education. Maple Grove, MN: JAM Press.

Masters, G. N. (2007). Special issue: Programme for International Student Assessment (PISA). Journal of Applied Measurement, 8(3), 235-335.

Masters, G. N., & Keeves, J. P. (Eds.). (1999). Advances in measurement in educational research and assessment. New York: Pergamon.

Osborne, J. W. (Ed.). (2007). Best practices in quantitative methods. Thousand Oaks, CA: Sage.

Rasch, G. (1960). Probabilistic models for some intelligence and attainment tests (Reprint, with Foreword and Afterword by B. D. Wright, Chicago: University of Chicago Press, 1980). Copenhagen, Denmark: Danmarks Paedagogiske Institut.

Smith, E. V., Jr., & Smith, R. M. (Eds.) (2004). Introduction to Rasch measurement. Maple Grove, MN: JAM Press.

Smith, E. V., Jr., & Smith, R. M. (2007). Rasch measurement: Advanced and specialized applications. Maple Grove, MN: JAM Press.

Smith, R. M. (Ed.). (1997, June). Outcome Measurement [Special Issue]. Physical Medicine & Rehabilitation State of the Art Reviews, 11(2), 261-428.

Smith, R. M. (1999). Rasch measurement models. Maple Grove, MN: JAM Press.

von Davier, M. (2006). Multivariate and mixture distribution Rasch models. New York: Springer.

Wilson, M. (1992). Objective measurement: Theory into practice, Vol. 1. Norwood, New Jersey: Ablex.

Wilson, M. (1994). Objective measurement: Theory into practice, Vol. 2. Norwood, New Jersey: Ablex.

Wilson, M. (2005). Constructing measures: An item response modeling approach. Mahwah, New Jersey: Lawrence Erlbaum Associates.

Wilson, M., Draney, K., Brown, N., & Duckor, B. (Eds.). (2009). Advances in Rasch measurement, Vol. Two (in press). Maple Grove, MN: JAM Press.

Wilson, M., & Engelhard, G. (2000). Objective measurement: Theory into practice, Vol. 5. Westport, Connecticut: Ablex Publishing.

Wilson, M., Engelhard, G., & Draney, K. (Eds.). (1997). Objective measurement: Theory into practice, Vol. 4. Norwood, New Jersey: Ablex.

Wright, B. D., & Masters, G. N. (1982). Rating scale analysis: Rasch measurement. Chicago, Illinois: MESA Press.

Wright, B. D., & Stone, M. H. (1979). Best test design: Rasch measurement. Chicago, Illinois: MESA Press.

Wright, B. D., & Stone, M. H. (1999). Measurement essentials. Wilmington, DE: Wide Range, Inc. [http://www.rasch.org/memos.htm#measess].

Key Articles

Andersen, E. B. (1977). Sufficient statistics and latent trait models. Psychometrika, 42(1), 69-81.

Andrich, D. (1978). A rating formulation for ordered response categories. Psychometrika, 43, 561-73.

Andrich, D. (2002). Understanding resistance to the data-model relationship in Rasch’s paradigm: A reflection for the next generation. Journal of Applied Measurement, 3(3), 325-59.

Andrich, D. (2004, January). Controversy and the Rasch model: A characteristic of incompatible paradigms? Medical Care, 42(1), I-7–I-16.

Beltyukova, S. A., Stone, G. E., & Fox, C. M. (2008). Magnitude estimation and categorical rating scaling in social sciences: A theoretical and psychometric controversy. Journal of Applied Measurement, 9(2), 151-159.

Choppin, B. (1968). An item bank using sample-free calibration. Nature, 219, 870-872.

Embretson, S. E. (1996, September). Item Response Theory models and spurious interaction effects in factorial ANOVA designs. Applied Psychological Measurement, 20(3), 201-212.

Engelhard, G. (2008, July). Historical perspectives on invariant measurement: Guttman, Rasch, and Mokken. Measurement: Interdisciplinary Research & Perspectives, 6(3), 155-189.

Fischer, G. H. (1973). The linear logistic test model as an instrument in educational research. Acta Psychologica, 37, 359-374.

Fischer, G. H. (1981, March). On the existence and uniqueness of maximum-likelihood estimates in the Rasch model. Psychometrika, 46(1), 59-77.

Fischer, G. H. (1989). Applying the principles of specific objectivity and of generalizability to the measurement of change. Psychometrika, 52(4), 565-587.

Fisher, W. P., Jr. (1997). Physical disability construct convergence across instruments: Towards a universal metric. Journal of Outcome Measurement, 1(2), 87-113.

Fisher, W. P., Jr. (2004, October). Meaning and method in the social sciences. Human Studies: A Journal for Philosophy and the Social Sciences, 27(4), 429-54.

Fisher, W. P., Jr. (2009, July). Invariance and traceability for measures of human, social, and natural capital: Theory and application. Measurement (Elsevier), in press.

Grosse, M. E., & Wright, B. D. (1986, Sep). Setting, evaluating, and maintaining certification standards with the Rasch model. Evaluation & the Health Professions, 9(3), 267-285.

Hall, W. J., Wijsman, R. A., & Ghosh, J. K. (1965). The relationship between sufficiency and invariance with applications in sequential analysis. Annals of Mathematical Statistics, 36, 575-614.

Kamata, A. (2001, March). Item analysis by the Hierarchical Generalized Linear Model. Journal of Educational Measurement, 38(1), 79-93.

Karabatsos, G., & Ullrich, J. R. (2002). Enumerating and testing conjoint measurement models. Mathematical Social Sciences, 43, 487-505.

Linacre, J. M. (1997). Instantaneous measurement and diagnosis. Physical Medicine and Rehabilitation State of the Art Reviews, 11(2), 315-324.

Linacre, J. M. (2002). Optimizing rating scale category effectiveness. Journal of Applied Measurement, 3(1), 85-106.

Lunz, M. E., & Bergstrom, B. A. (1991). Comparability of decisions for computer adaptive and written examinations. Journal of Allied Health, 20(1), 15-23.

Lunz, M. E., Wright, B. D., & Linacre, J. M. (1990). Measuring the impact of judge severity on examination scores. Applied Measurement in Education, 3/4, 331-345.

Masters, G. N. (1985, March). Common-person equating with the Rasch model. Applied Psychological Measurement, 9(1), 73-82.

Mislevy, R. J., Steinberg, L. S., & Almond, R. G. (2003). On the structure of educational assessments. Measurement: Interdisciplinary Research and Perspectives, 1(1), 3-62.

Pelton, T., & Bunderson, V. (2003). The recovery of the density scale using a stochastic quasi-realization of additive conjoint measurement. Journal of Applied Measurement, 4(3), 269-81.

Rasch, G. (1961). On general laws and the meaning of measurement in psychology. In Proceedings of the fourth Berkeley symposium on mathematical statistics and probability (pp. 321-333 [http://www.rasch.org/memo1960.pdf]). Berkeley, California: University of California Press.

Rasch, G. (1966). An individualistic approach to item analysis. In P. F. Lazarsfeld & N. W. Henry (Eds.), Readings in mathematical social science (pp. 89-108). Chicago, Illinois: Science Research Associates.

Rasch, G. (1966, July). An informal report on the present state of a theory of objectivity in comparisons. Unpublished paper [http://www.rasch.org/memo1966.pdf].

Rasch, G. (1966). An item analysis which takes individual differences into account. British Journal of Mathematical and Statistical Psychology, 19, 49-57.

Rasch, G. (1968, September 6). A mathematical theory of objectivity and its consequences for model construction. [Unpublished paper [http://www.rasch.org/memo1968.pdf]], Amsterdam, the Netherlands: Institute of Mathematical Statistics, European Branch.

Rasch, G. (1977). On specific objectivity: An attempt at formalizing the request for generality and validity of scientific statements. Danish Yearbook of Philosophy, 14, 58-94.

Romanoski, J. T., & Douglas, G. (2002). Rasch-transformed raw scores and two-way ANOVA: A simulation analysis. Journal of Applied Measurement, 3(4), 421-430.

Smith, R. M. (1996). A comparison of methods for determining dimensionality in Rasch measurement. Structural Equation Modeling, 3(1), 25-40.

Smith, R. M. (2000). Fit analysis in latent trait measurement models. Journal of Applied Measurement, 1(2), 199-218.

Stenner, A. J., & Smith III, M. (1982). Testing construct theories. Perceptual and Motor Skills, 55, 415-426.

Stenner, A. J. (1994). Specific objectivity – local and general. Rasch Measurement Transactions, 8(3), 374 [http://www.rasch.org/rmt/rmt83e.htm].

Stone, G. E., Beltyukova, S. A., & Fox, C. M. (2008). Objective standard setting for judge-mediated examinations. International Journal of Testing, 8(2), 180-196.

Stone, M. H. (2003). Substantive scale construction. Journal of Applied Measurement, 4(3), 282-97.

Wilson, M., & Sloane, K. (2000). From principles to practice: An embedded assessment system. Applied Measurement in Education, 13(2), 181-208.

Wright, B. D. (1968). Sample-free test calibration and person measurement. In Proceedings of the 1967 invitational conference on testing problems (pp. 85-101 [http://www.rasch.org/memo1.htm]). Princeton, New Jersey: Educational Testing Service.

Wright, B. D. (1977). Solving measurement problems with the Rasch model. Journal of Educational Measurement, 14(2), 97-116 [http://www.rasch.org/memo42.htm].

Wright, B. D. (1980). Foreword, Afterword. In Probabilistic models for some intelligence and attainment tests, by Georg Rasch (pp. ix-xix, 185-199. http://www.rasch.org/memo63.htm). Chicago, Illinois: University of Chicago Press.

Wright, B. D. (1984). Despair and hope for educational measurement. Contemporary Education Review, 3(1), 281-288 [http://www.rasch.org/memo41.htm].

Wright, B. D. (1985). Additivity in psychological measurement. In E. Roskam (Ed.), Measurement and personality assessment. North Holland: Elsevier Science Ltd.

Wright, B. D. (1996). Comparing Rasch measurement and factor analysis. Structural Equation Modeling, 3(1), 3-24.

Wright, B. D. (1997, June). Fundamental measurement for outcome evaluation. Physical Medicine & Rehabilitation State of the Art Reviews, 11(2), 261-88.

Wright, B. D. (1997, Winter). A history of social science measurement. Educational Measurement: Issues and Practice, 16(4), 33-45, 52 [http://www.rasch.org/memo62.htm].

Wright, B. D. (1999). Fundamental measurement for psychology. In S. E. Embretson & S. L. Hershberger (Eds.), The new rules of measurement: What every educator and psychologist should know (pp. 65-104 [http://www.rasch.org/memo64.htm]). Hillsdale, New Jersey: Lawrence Erlbaum Associates.

Wright, B. D., & Bell, S. R. (1984, Winter). Item banks: What, why, how. Journal of Educational Measurement, 21(4), 331-345 [http://www.rasch.org/memo43.htm].

Wright, B. D., & Linacre, J. M. (1989). Observations are always ordinal; measurements, however, must be interval. Archives of Physical Medicine and Rehabilitation, 70(12), 857-867 [http://www.rasch.org/memo44.htm].

Wright, B. D., & Mok, M. (2000). Understanding Rasch measurement: Rasch models overview. Journal of Applied Measurement, 1(1), 83-106.

Model Applications

Adams, R. J., Wu, M. L., & Macaskill, G. (1997). Scaling methodology and procedures for the mathematics and science scales. In M. O. Martin & D. L. Kelly (Eds.), Third International Mathematics and Science Study Technical Report: Vol. 2: Implementation and Analysis – Primary and Middle School Years. Boston: Center for the Study of Testing, Evaluation, and Educational Policy.

Andrich, D., & Van Schoubroeck, L. (1989, May). The General Health Questionnaire: A psychometric analysis using latent trait theory. Psychological Medicine, 19(2), 469-485.

Beltyukova, S. A., Stone, G. E., & Fox, C. M. (2004). Equating student satisfaction measures. Journal of Applied Measurement, 5(1), 62-9.

Bergstrom, B. A., & Lunz, M. E. (1999). CAT for certification and licensure. In F. Drasgow & J. B. Olson-Buchanan (Eds.), Innovations in computerized assessment (pp. 67-91). Mahwah, New Jersey: Lawrence Erlbaum Associates, Inc., Publishers.

Bond, T. G. (1994). Piaget and measurement II: Empirical validation of the Piagetian model. Archives de Psychologie, 63, 155-185.

Bunderson, C. V., & Newby, V. A. (2009). The relationships among design experiments, invariant measurement scales, and domain theories. Journal of Applied Measurement, 10(2), 117-137.

Cavanagh, R. F., & Romanoski, J. T. (2006, October). Rating scale instruments and measurement. Learning Environments Research, 9(3), 273-289.

Cipriani, D., Fox, C., Khuder, S., & Boudreau, N. (2005). Comparing Rasch analyses probability estimates to sensitivity, specificity and likelihood ratios when examining the utility of medical diagnostic tests. Journal of Applied Measurement, 6(2), 180-201.

Dawson, T. L. (2004, April). Assessing intellectual development: Three approaches, one sequence. Journal of Adult Development, 11(2), 71-85.

DeSalvo, K., Fisher, W. P. Jr., Tran, K., Bloser, N., Merrill, W., & Peabody, J. W. (2006, March). Assessing measurement properties of two single-item general health measures. Quality of Life Research, 15(2), 191-201.

Engelhard, G., Jr. (1992). The measurement of writing ability with a many-faceted Rasch model. Applied Measurement in Education, 5(3), 171-191.

Engelhard, G., Jr. (1997). Constructing rater and task banks for performance assessment. Journal of Outcome Measurement, 1(1), 19-33.

Fisher, W. P., Jr. (1998). A research program for accountable and patient-centered health status measures. Journal of Outcome Measurement, 2(3), 222-239.

Fisher, W. P., Jr., Harvey, R. F., Taylor, P., Kilgore, K. M., & Kelly, C. K. (1995, February). Rehabits: A common language of functional assessment. Archives of Physical Medicine and Rehabilitation, 76(2), 113-122.

Heinemann, A. W., Gershon, R., & Fisher, W. P., Jr. (2006). Development and application of the Orthotics and Prosthetics User Survey: Applications and opportunities for health care quality improvement. Journal of Prosthetics and Orthotics, 18(1), 80-85 [http://www.oandp.org/jpo/library/2006_01S_080.asp].

Heinemann, A. W., Linacre, J. M., Wright, B. D., Hamilton, B. B., & Granger, C. V. (1994). Prediction of rehabilitation outcomes with disability measures. Archives of Physical Medicine and Rehabilitation, 75(2), 133-143.

Hobart, J. C., Cano, S. J., O’Connor, R. J., Kinos, S., Heinzlef, O., Roullet, E. P., C., et al. (2003). Multiple Sclerosis Impact Scale-29 (MSIS-29):  Measurement stability across eight European countries. Multiple Sclerosis, 9, S23.

Hobart, J. C., Cano, S. J., Zajicek, J. P., & Thompson, A. J. (2007, December). Rating scales as outcome measures for clinical trials in neurology: Problems, solutions, and recommendations. Lancet Neurology, 6, 1094-1105.

Lai, J., Fisher, A., Magalhaes, L., & Bundy, A. C. (1996). Construct validity of the sensory integration and praxis tests. Occupational Therapy Journal of Research, 16(2), 75-97.

Lee, N. P., & Fisher, W. P., Jr. (2005). Evaluation of the Diabetes Self Care Scale. Journal of Applied Measurement, 6(4), 366-81.

Ludlow, L. H., & Haley, S. M. (1995, December). Rasch model logits: Interpretation, use, and transformation. Educational and Psychological Measurement, 55(6), 967-975.

Markward, N. J., & Fisher, W. P., Jr. (2004). Calibrating the genome. Journal of Applied Measurement, 5(2), 129-41.

Massof, R. W. (2007, August). An interval-scaled scoring algorithm for visual function questionnaires. Optometry & Vision Science, 84(8), E690-E705.

Massof, R. W. (2008, July-August). Editorial: Moving toward scientific measurements of quality of life. Ophthalmic Epidemiology, 15, 209-211.

Masters, G. N., Adams, R. J., & Lokan, J. (1994). Mapping student achievement. International Journal of Educational Research, 21(6), 595-610.

Mead, R. J. (2009). The ISR: Intelligent Student Reports. Journal of Applied Measurement, 10(2), 208-224.

Pelton, T., & Bunderson, V. (2003). The recovery of the density scale using a stochastic quasi-realization of additive conjoint measurement. Journal of Applied Measurement, 4(3), 269-81.

Smith, E. V., Jr. (2000). Metric development and score reporting in Rasch measurement. Journal of Applied Measurement, 1(3), 303-26.

Smith, R. M., & Taylor, P. (2004). Equating rehabilitation outcome scales: Developing common metrics. Journal of Applied Measurement, 5(3), 229-42.

Solloway, S., & Fisher, W. P., Jr. (2007). Mindfulness in measurement: Reconsidering the measurable in mindfulness. International Journal of Transpersonal Studies, 26, 58-81 [http://www.transpersonalstudies.org/volume_26_2007.html].

Stenner, A. J. (2001). The Lexile Framework: A common metric for matching readers and texts. California School Library Journal, 25(1), 41-2.

Wolfe, E. W., Ray, L. M., & Harris, D. C. (2004, October). A Rasch analysis of three measures of teacher perception generated from the School and Staffing Survey. Educational and Psychological Measurement, 64(5), 842-860.

Wolfe, F., Hawley, D., Goldenberg, D., Russell, I., Buskila, D., & Neumann, L. (2000, Aug). The assessment of functional impairment in fibromyalgia (FM): Rasch analyses of 5 functional scales and the development of the FM Health Assessment Questionnaire. Journal of Rheumatology, 27(8), 1989-99.

Wendt, A., & Tatum, D. S. (2005). Credentialing health care professionals. In N. Bezruczko (Ed.), Rasch measurement in health sciences (pp. 161-75). Maple Grove, MN: JAM Press.

Creative Commons License
LivingCapitalMetrics Blog by William P. Fisher, Jr., Ph.D. is licensed under a Creative Commons Attribution-Noncommercial-No Derivative Works 3.0 United States License.
Based on a work at livingcapitalmetrics.wordpress.com.
Permissions beyond the scope of this license may be available at http://www.livingcapitalmetrics.com.

Table Comparing Scores, Ratings, and Percentages with Real Measures

July 6, 2009

(Documentation to be posted tomorrow.)

Characteristics | Raw Scores and/or Percentages | Rasch Measurement

Quantitative hypothesis | Neither formulated nor tested | Formulated and tested
Criteria for falsifying quantitative hypothesis | None | Additivity, conjoint transitivity, parameter separation, unidimensionality, invariance, statistical sufficiency, monotonicity, homogeneity, infinite divisibility, etc.
Relation to sample distribution | Dependent | Independent
Paradigm | Descriptive statistics | Prescriptive measurement
Model-data relation | Models describe data, models fit to data, model with best statistics chosen | Models prescribe data quality needed for objective inference, data fit to models, GIGO principle
Relation to structure of natural laws | None | Identical
Statistical tests of quantitative hypothesis | None | Information-weighted and outlier-sensitive model fit, Principal Components Analysis, many other fit statistics available
Reliability coefficients | Cronbach's alpha, KR-20, etc. | Cronbach's alpha, KR-20, etc., and Separation, Strata
Reliability error source | Unexplained portion of variance | Mean square of individual error estimates
Range of measurement | Arbitrary, from minimum to maximum score | Nonarbitrary, infinite
Unit status | Ordinal, nonlinear | Interval, linear
Unit status assumed in statistical comparisons | Interval, linear | Interval, linear
Proofs of unit status | Correlational | Axiomatic; reproduced physical metrics; graphical plots; independent cross-sample recalibrations; etc.
Error theory for individual scores/measures | None | Derived from sampling theory
Architecture (capacity to add/delete items) | Closed | Open
Supports adaptive administration and mass customization | No (changes to items change meaning of scores) | Yes (changes to items do not change meaning of measure)
Supports traceability to metrological reference standard | No | Yes
Domains scored | Either persons or items but rarely both | All facets in model (persons, items, rating categories, judges, tasks, etc.)
Comparability of domains scored | Would be incomparable if scored | Comparable; each interpreted in terms of the other
Unscored domain characteristics | Assumed all same score or random (though probably not) | No unscored domain
Relation with other measures of same construct | Incommensurable | Commensurable and equatable
Construct definition | None | Consistency, meaningfulness, interpretability, and predictability of calibration/measure hierarchies
Focus of interpretation | Mean scores or percentages relative to demographics or experimental groups | Measures relative to calibrations and vice versa; measures relative to demographics or experimental groups
Relation to qualitative methods | Stark difference in philosophical commitments | Rooted in same philosophical commitments
Quality of research dialogue | Researchers' expertise elevated relative to research subjects | Research subjects voice individual and collective perspectives on coherence of construct as defined by researchers' questions
Source of narrative theme | Researcher | Object of unfolding dialogue
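For readers unfamiliar with the model named in the table's right-hand column, here is a minimal sketch in Python (my illustration, not part of the original table) of the dichotomous Rasch model: the probability of a successful or agreeable response depends only on the difference between a person's measure and an item's calibration, both expressed in logits (log-odds units). The numbers below are purely illustrative.

```python
from math import exp

def rasch_probability(person_measure, item_calibration):
    """P(response = 1) under the dichotomous Rasch model."""
    return 1.0 / (1.0 + exp(-(person_measure - item_calibration)))

# Equal logit differences yield equal probabilities anywhere on the scale,
# which is one way of seeing what the table means by an interval, sample-independent unit.
print(round(rasch_probability(1.0, 0.0), 2))    # ~0.73
print(round(rasch_probability(-1.0, -2.0), 2))  # ~0.73 (same 1-logit difference)
```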

Creative Commons License
LivingCapitalMetrics Blog by William P. Fisher, Jr., Ph.D. is licensed under a Creative Commons Attribution-Noncommercial-No Derivative Works 3.0 United States License.
Based on a work at livingcapitalmetrics.wordpress.com.
Permissions beyond the scope of this license may be available at http://www.livingcapitalmetrics.com.

Graphic Illustrations of Why Scores, Ratings, and Percentages Are Not Measures, Part Two

July 2, 2009

Part One of this two-part blog offered pictures illustrating the difference between numbers that stand for something that adds up and those that do not. The uncontrolled variation in the numbers that pass for measures in health care, education, satisfaction surveys, performance assessments, etc. is analogous to the variation in weights and measures found in Medieval European markets. It is well established that metric uniformity played a vital role in the industrial and scientific revolutions of the nineteenth century. Metrology will inevitably play a similarly central role in the economic and scientific revolutions taking place today.

Clients and students often express their need for measures that are manageable, understandable, and relevant. But sometimes it turns out that we do not understand what we think we understand. New understandings can make what previously seemed manageable and relevant appear unmanageable and irrelevant. Perhaps our misunderstandings about measurement will one day explain why we have failed to innovate and improve as much as we could have.

Of course, there are statistical methods for standardizing scores and proportions that make them comparable across different normal distributions, but I’ve never once seen them applied to employee, customer, or patient survey results reported to business or hospital managers. They certainly are not used in determining comparable proficiency levels of students under No Child Left Behind. Perhaps there are consultants and reporting systems that make standardized z-scores a routine part of their practices, but even if there are, why should anyone willingly base their decisions on the assumption that normal distributions have been obtained? Why not use methods that give the same result no matter how scores are distributed?
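To make that sample dependence concrete, here is a minimal sketch in Python (my example, not from the original post; the score lists are hypothetical): the same raw score earns a different z-score as soon as the local sample's spread changes, which is exactly the distribution dependence the question above objects to.

```python
import statistics

def z_score(x, sample):
    # Standardize x against the mean and standard deviation of this particular sample.
    mu = statistics.mean(sample)
    sigma = statistics.stdev(sample)
    return (x - mu) / sigma

clinic_a = [55, 60, 65, 70, 75]   # hypothetical raw survey scores, narrow spread
clinic_b = [35, 50, 65, 80, 95]   # same mean, wider spread

print(round(z_score(70, clinic_a), 2))   # ~0.63
print(round(z_score(70, clinic_b), 2))   # ~0.21: same raw score, different z-score
```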

To bring the point home, if statistical standardization is a form of measurement, why don’t we use the z-scores for height distributions instead of the direct measures of how tall we each are? Plainly, the two kinds of numbers have different applications. Somehow, though, we try to make do without the measures in many applications involving tests and surveys, with the unfortunate consequence of much lost information and many lost opportunities for better communication.

Sometimes I wonder: if we gave managers, executives, and entrepreneurs a test on the meaning of the scores, percentages, and logits discussed in Part One, would many do any better on the parts they think they understand than on the parts they find unfamiliar? I suspect not. Some executives whose pay-for-performance bonuses are inflated by statistical accidents are going to be unhappy with what I’m going to say here, but, as I’ve been saying for years, clarifying the financial implications will go a long way toward motivating the needed changes.

How could that be true? Well, consider the way we treat percentages. Imagine that three different hospitals see their patients’ percent agreement with a key survey item change as follows. Which one changed the most?

A. from 30.85% to 50.00%: a 19.15% change

B. from 6.68% to 15.87%: a 9.18% change

C. from 69.15% to 84.13%: a 14.99% change

As is illustrated in Figure 1 below, given that all three pairs of administrations of the survey are included together in the same measure distribution, it is likely that the three changes were all the same size.

In this scenario, all the survey administrations shared the same standard deviation in the underlying measure distribution that the key item’s percentage was drawn from, and they started from different initial measures. Different ranges in the measures are associated with different parts of the sample’s distribution, and so different numbers and percentages of patients are associated with the same amount of measured change. It is easy to see that 100-unit measured gains in the range of 50-150 or 1000-1100 on the horizontal axis would scarcely amount to 1% changes, but the same measured gain in the middle of the distribution could be as much as 25%.
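A quick computational check of this point, assuming (as in Figure 1) that the percentages come from normal distributions with a common standard deviation: this is a sketch of my own using scipy, not part of the original post. Converting each percentage to its position on the underlying continuum shows that all three hospitals gained the same half standard deviation.

```python
from scipy.stats import norm

changes = [(0.3085, 0.5000),   # Hospital A: 30.85% -> 50.00%
           (0.0668, 0.1587),   # Hospital B:  6.68% -> 15.87%
           (0.6915, 0.8413)]   # Hospital C: 69.15% -> 84.13%

for before, after in changes:
    # Inverse normal CDF converts each percentage to a position on the measure continuum.
    gain = norm.ppf(after) - norm.ppf(before)   # gain in standard-deviation units
    print(round(gain, 2))                       # each prints ~0.5
```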

Figure 1. Different Percentages, Same Measures

Figure 1 shows how the same measured gain can look wildly different when expressed as a percentage, depending on where the initial measure is positioned in the distribution. But what happens when percentage gains are situated in different distributions that have different patterns of variation?

More specifically, consider a situation in which three different hospitals see their percent agreement with a key survey item change as follows.

A. from 30.85% to 50.00%: a 19.15% change

B. from 30.85% to 50.00%: a 19.15% change

C. from 30.85% to 50.00%: a 19.15% change

Did one change more than the others? Of course, the three percentages are all the same, so we would naturally think that the three increases are all the same. But what if the standard deviations characterizing the three different hospitals’ score distributions are different?

Figure 2, below, shows that the three 19.15% changes could be associated with quite different measured gains. When the distribution is wider and the standard deviation is larger, any given percentage change will be associated with a larger measured change than in cases with narrower distributions and smaller standard deviations.
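Here is a minimal sketch of that point (again my own, with hypothetical standard deviations of 10, 20, and 40 measure units): the identical percentage change translates into measured gains that differ in direct proportion to the width of each hospital's distribution.

```python
from scipy.stats import norm

before, after = 0.3085, 0.5000
gain_in_sd_units = norm.ppf(after) - norm.ppf(before)   # ~0.5 SD in every case

# Hypothetical standard deviations for the three hospitals' measure distributions.
for hospital, sd in [("A", 10), ("B", 20), ("C", 40)]:
    print(hospital, round(gain_in_sd_units * sd, 1))    # ~5.0, ~10.0, ~20.0 measure units
```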

Figure 2. Same Percentage Gains, Different Measured Gains

And if this is not enough evidence as to the foolhardiness of treating percentages as measures, bear with me through one more example. Imagine another situation in which three different hospitals see their percent agreement with a key survey item change as follows.

A. from 30.85% to 50.00%: a 19.15% change

B. from 36.96% to 50.00%: a 13.04% change

C. from 36.96% to 50.00%: a 13.04% change

Did one change more than the others? Plainly A obtains the largest percentage gain. But Figure 3 shows that, depending on the underlying distribution, A’s 19.15% gain might be a smaller measured change than either B’s or C’s. Further, B’s and C’s measures might not be identical, contrary to what would be expected from the percentages alone.
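One last sketch, with hypothetical standard deviations chosen only for illustration, shows how this can happen: A's larger percentage gain maps to the smallest measured gain, and B's and C's identical percentage gains map to different measured gains.

```python
from scipy.stats import norm

# (name, percent before, percent after, assumed standard deviation in measure units)
cases = [("A", 0.3085, 0.5000, 10),
         ("B", 0.3696, 0.5000, 20),
         ("C", 0.3696, 0.5000, 30)]

for name, before, after, sd in cases:
    gain = (norm.ppf(after) - norm.ppf(before)) * sd
    print(name, round(gain, 1))    # A ~5.0, B ~6.7, C ~10.0 measure units
```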

Figure 3. Percentages Completely at Odds with Measures

Now we have a fuller appreciation of the scope of the problems associated with the changing unit size illustrated in Part One. Though we think we understand percentages and insist on using them as something familiar and routine, the world that they present to us is as crazily distorted as a carnival funhouse. And we won’t even begin to consider how things look in the context of distributions skewed toward one end of the continuum or the other! There is similarly no point at all in going to bimodal or multimodal distributions (ones that have more than one peak). The vast majority of business applications employing scores, ratings, and percentages as measures do not take the underlying distribution into account. Given the problems that arise in optimal conditions (i.e., with a normal distribution), there is no need to belabor the issue with an enumeration of all the possible things that could be going wrong. Far better to simply move on and construct measurement systems that remain invariant across the different shapes of local data sets’ particular distributions.

How could we have gone so far in making these nonsensical numbers the focus of our attention? To put things back in perspective, we need to keep in mind the evolving magnitude of the problems we face. When Florence Nightingale was deploring the lack of any available indications of the effectiveness of her efforts, a little bit of flawed information was a significant improvement over no information. Ordinal, situation-specific numbers provided highly useful information when problems emerged in local contexts on a scale that could be comprehended and addressed by individuals and small groups.

We no longer live in that world. Today’s problems require kinds of information that must be more meaningful, precise, and actionable than ever before. And not only that, this information cannot remain accessible only to managers, executives, researchers, and data managers. It must be brought to bear in every transaction and information exchange in the industry.

Information has to be formatted in the common currency of uniform metrics to make it as fluid and empowering as possible. Would the auto industry have been able to bring off a quality revolution if every worker’s toolkit were calibrated in a different unit? Could we expect to coordinate schedules easily if we each had clocks scaled in different time units? Obviously not; why should we expect quality revolutions in health care and education when nearly all of our relevant metrics are incommensurable?

Management consultants realized decades ago that information creates a sense of responsibility in the person who possesses it. We cannot expect clinicians and teachers to take full responsibility for the outcomes they produce until they have the information they need to evaluate and improve them. Existing data and systems plainly are not up to the task.

The problem is far less a matter of complex or difficult issues than it is one of culture and priorities. It often takes less effort to remain in a dysfunctional rut and deal with massive inefficiencies than it does to get out of the rut and invent a new system with new potentials. Big changes tend to take place only when systems become so bogged down by their problems that new systems emerge simply out of the need to find some way to keep things in motion. These blogs are written in the hope that we might be able to find our way to new methods without suffering the catastrophes of total system failure. One might well imagine an entrepreneurially minded consortium of providers, researchers, payers, accreditors, and patient advocates joining forces in small pilot projects testing out new experimental systems.

To know how much of something we’re getting for our money and whether it’s a fair bargain, we need to be able to compare amounts across providers, vendors, treatment options, teaching methods, etc. Scores summed from tests, surveys, or assessments, individual ratings, and percentages of a maximum possible score or frequency do not provide this information because they are not measures. Their unit sizes vary across individuals, collections of indicators (instruments), time, and space. The consequences of treating scores and percentages as measures are not trivial. We will eventually come to see that measurement quality is the primary source of the differences between the current health care and education systems’ regional variations and endlessly spiralling costs, on the one hand, and, on the other, the geographically uniform quality, costs, and improvements of the systems we will create in the future.

Markets are dysfunctional when quality and costs cannot be evaluated in common terms by consumers, providers’ quality improvement specialists, researchers, accreditors, and payers. There are widespread calls for greater transparency in purchasing decisions, but transparency is not being defined and operationalized meaningfully or usefully. As currently employed, transparency refers to making key data available for public scrutiny. But these data are almost always expressed as scores, ratings, or percentages that are anything but transparent. In addition to not adding up, these data are also usually presented in indigestibly large volumes, and are not quality assessed.

All things considered, we’re doing amazingly well with our health care and education systems given the way we’ve hobbled ourselves with dysfunctional, incommensurable measures. And that gives us real cause for hope! What will we be able to accomplish when we really put our minds to measuring what we want to manage? How much better will we be able to do when entrepreneurs have the tools they need to innovate new efficiencies? Who knows what we’ll be capable of when we have meaningful measures that stand for amounts that really add up, when data volumes are dramatically reduced to manageable levels, and when data quality is effectively assessed and improved?

For more on the problems associated with these kinds of percentages in the context of NCLB, see Andrew Dean Ho’s article in the August/September, 2008 issue of Educational Researcher, and Charles Murray’s “By the Numbers” column in the July 25, 2006 Wall Street Journal.

This is not the end of the story as to what the new measurement paradigm brings to bear. Next, I’ll post a table contrasting the features of scores, ratings, and percentages with those of measures. Until then, check out the latest issue of the Journal of Applied Measurement at http://www.jampress.org, see what’s new in measurement software at http://www.winsteps.com or http://www.rummlab.com.au, or look into what’s up in the way of measurement research projects with the BEAR group at UC Berkeley (http://gse.berkeley.edu/research/BEAR/research.html).

Finally, keep in mind that we are what we measure. It’s time we measured what we want to be.

Creative Commons License
LivingCapitalMetrics Blog by William P. Fisher, Jr., Ph.D. is licensed under a Creative Commons Attribution-Noncommercial-No Derivative Works 3.0 United States License.
Based on a work at livingcapitalmetrics.wordpress.com.
Permissions beyond the scope of this license may be available at http://www.livingcapitalmetrics.com.