Posts Tagged ‘Statistics’

The Counterproductive Consequences of Common Study Designs and Statistical Methods

May 21, 2015

Because of the ways studies are designed and the ways data are analyzed, research results in psychology and the social sciences often appear to be nonlinear, sample- and instrument-dependent, and incommensurable, even when they need not be. In contrast with what are common assumptions about the nature of the constructs involved, invariant relations may be more obscured than clarified by typically employed research designs and statistical methods.

To take a particularly salient example, the number of small factors with eigenvalues greater than 1.0 identified via factor analysis increases as the number of modes in a multimodal distribution increases, and interpretation of the results is further complicated by the fact that the number of factors identified decreases as sample size increases (Smith, 1996).
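For readers who want to see how such a tally is made, here is a minimal sketch (my own illustration, not drawn from Smith, 1996) of counting eigenvalues above the conventional 1.0 cutoff for simulated item responses; the sample sizes, the ten-item structure, and the two-mode mixture generating the data are assumptions made purely for illustration.

  import numpy as np

  def count_kaiser_factors(data):
      # Count eigenvalues of the inter-item correlation matrix greater than 1.0.
      eigenvalues = np.linalg.eigvalsh(np.corrcoef(data, rowvar=False))
      return int(np.sum(eigenvalues > 1.0))

  rng = np.random.default_rng(0)
  for n in (100, 400, 1600):
      # Hypothetical two-mode sample on one latent dimension, expressed through ten items.
      latent = np.concatenate([rng.normal(-2, 1, n // 2), rng.normal(2, 1, n // 2)])
      items = latent[:, None] + rng.normal(0, 1, (n, 10))
      print(n, count_kaiser_factors(items))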

Similarly, variation in employment test validity across settings was established as a basic assumption by the 1970s, after 50 years of studies observing the situational specificity of results. But then Schmidt and Hunter (1977) identified sampling error, measurement error, and range restriction as major sources of what was only the appearance of incommensurable variation in employment test validity. In other words, for most of the 20th century, the identification of constructs and comparisons of results across studies were pointlessly confused by mixed populations, uncontrolled variation in reliability, and unnoted floor and/or ceiling effects. Though they do nothing to establish information systems deploying common languages structured by standard units of measurement (Feinstein, 1995), meta-analysis techniques are a step forward in equating effect sizes (Hunter & Schmidt, 2004).
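As a concrete point of reference, here is a minimal sketch (my own, not code from Schmidt and Hunter, 1977, or Hunter and Schmidt, 2004) of two of the classic artifact corrections at issue: disattenuation for measurement error and Thorndike's Case II correction for direct range restriction. All numerical values are hypothetical.

  import math

  def disattenuate(r_xy, r_xx, r_yy):
      # Correct an observed validity coefficient for unreliability in x and y.
      return r_xy / math.sqrt(r_xx * r_yy)

  def correct_range_restriction(r, u):
      # u = SD in the unrestricted population / SD in the restricted sample.
      return (u * r) / math.sqrt(1.0 + (u * u - 1.0) * r * r)

  observed = 0.25                                   # hypothetical observed validity
  corrected = correct_range_restriction(disattenuate(observed, 0.80, 0.70), 1.5)
  print(round(corrected, 2))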

Wright and Stone’s (1979) Best Test Design, in contrast, takes up each of these problems in an explicit way. Sampling error is addressed in that both the sample’s and the items’ representations of the same populations of persons and expressions of a construct are evaluated. The evaluation of reliability is foregrounded and clarified by taking advantage of the availability of individualized measurement uncertainty (error) estimates (following Andrich, 1982, presented at AERA in 1977). And range restriction becomes manageable in terms of equating and linking instruments measuring in different ranges of the same construct. As was demonstrated by Duncan (1985; Allerup, Bech, Loldrup, et al., 1994; Andrich & Styles, 1998), for instance, the restricted ranges of various studies assessing relationships between measures of attitudes and behaviors led to the mistaken conclusion that these were separate constructs. When the entire range of variation was explicitly modeled and studied, a consistent relationship was found.
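To make the role of individualized error estimates concrete, here is a minimal sketch (my own, in the spirit of the separation statistics discussed in Best Test Design, not a quotation of its formulas) showing how person measures and their standard errors combine into separation and reliability coefficients; all of the numbers are hypothetical logit values.

  import numpy as np

  measures = np.array([-1.2, -0.4, 0.1, 0.8, 1.5, 2.3])      # person measures (logits)
  errors = np.array([0.45, 0.40, 0.38, 0.39, 0.42, 0.50])    # individual standard errors

  observed_var = measures.var(ddof=1)
  error_var = np.mean(errors ** 2)                 # average error variance
  true_var = max(observed_var - error_var, 0.0)    # variance not attributable to error
  separation = np.sqrt(true_var / error_var)
  reliability = true_var / observed_var
  print(round(separation, 2), round(reliability, 2))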

Statistical and correlational methods have long histories of preventing the discovery, assessment, and practical application of invariant relations because they fail to test for invariant units of measurement, do not define standard metrics, never calibrate all instruments measuring the same thing in common units, and have no concept of formal measurement systems of interconnected instruments. Wider appreciation of the distinction between statistics and measurement (Duncan & Stenbeck, 1988; Fisher, 2010; Wilson, 2013a), and of the potential for metrological traceability we have within our reach (Fisher, 2009, 2012; Fisher & Stenner, 2013; Mari & Wilson, 2013; Pendrill, 2014; Pendrill & Fisher, 2015; Wilson, 2013b; Wilson, Mari, Maul, & Torres Irribarra, 2015), are demonstrably fundamental to the advancement of a wide range of fields.

References

Allerup, P., Bech, P., Loldrup, D., Alvarez, P., Banegil, T., Styles, I., & Tenenbaum, G. (1994). Psychiatric, business, and psychological applications of fundamental measurement models. International Journal of Educational Research, 21(6), 611-622.

Andrich, D. (1982). An index of person separation in Latent Trait Theory, the traditional KR-20 index, and the Guttman scale response pattern. Education Research and Perspectives, 9(1), 95-104 [http://www.rasch.org/erp7.htm].

Andrich, D., & Styles, I. M. (1998). The structural relationship between attitude and behavior statements from the unfolding perspective. Psychological Methods, 3(4), 454-469.

Duncan, O. D. (1985). Probability, disposition and the inconsistency of attitudes and behaviour. Synthese, 42, 21-34.

Duncan, O. D., & Stenbeck, M. (1988). Panels and cohorts: Design and model in the study of voting turnout. In C. C. Clogg (Ed.), Sociological Methodology 1988 (pp. 1-35). Washington, DC: American Sociological Association.

Feinstein, A. R. (1995). Meta-analysis: Statistical alchemy for the 21st century. Journal of Clinical Epidemiology, 48(1), 71-79.

Fisher, W. P., Jr. (2009). Invariance and traceability for measures of human, social, and natural capital: Theory and application. Measurement, 42(9), 1278-1287.

Fisher, W. P., Jr. (2010). Statistics and measurement: Clarifying the differences. Rasch Measurement Transactions, 23(4), 1229-1230.

Fisher, W. P., Jr. (2012, May/June). What the world needs now: A bold plan for new standards [Third place, 2011 NIST/SES World Standards Day paper competition]. Standards Engineering, 64(3), 1 & 3-5.

Fisher, W. P., Jr., & Stenner, A. J. (2013). Overcoming the invisibility of metrology: A reading measurement network for education and the social sciences. Journal of Physics: Conference Series, 459(012024), http://iopscience.iop.org/1742-6596/459/1/012024.

Hunter, J. E., & Schmidt, F. L. (Eds.). (2004). Methods of meta-analysis: Correcting error and bias in research findings. Thousand Oaks, CA: Sage.

Mari, L., & Wilson, M. (2013). A gentle introduction to Rasch measurement models for metrologists. Journal of Physics: Conference Series, 459(012002), http://iopscience.iop.org/1742-6596/459/1/012002/pdf/1742-6596_459_1_012002.pdf.

Pendrill, L. (2014). Man as a measurement instrument [Special Feature]. NCSLi Measure: The Journal of Measurement Science, 9(4), 22-33.

Pendrill, L., & Fisher, W. P., Jr. (2015). Counting and quantification: Comparing psychometric and metrological perspectives on visual perceptions of number. Measurement, 71, 46-55. doi: http://dx.doi.org/10.1016/j.measurement.2015.04.010

Schmidt, F. L., & Hunter, J. E. (1977). Development of a general solution to the problem of validity generalization. Journal of Applied Psychology, 62(5), 529-540.

Smith, R. M. (1996). A comparison of methods for determining dimensionality in Rasch measurement. Structural Equation Modeling, 3(1), 25-40.

Wilson, M. R. (2013a). Seeking a balance between the statistical and scientific elements in psychometrics. Psychometrika, 78(2), 211-236.

Wilson, M. R. (2013b). Using the concept of a measurement system to characterize measurement models used in psychometrics. Measurement, 46, 3766-3774.

Wilson, M., Mari, L., Maul, A., & Torres Irribarra, D. (2015). A comparison of measurement concepts across physical science and social science domains: Instrument design, calibration, and measurement. Journal of Physics: Conference Series, 588(012034), http://iopscience.iop.org/1742-6596/588/1/012034.

Wright, B. D., & Stone, M. H. (1979). Best test design: Rasch measurement. Chicago, Illinois: MESA Press.


Number lines, counting, and measuring in arithmetic education

July 29, 2011

Over the course of two days spent at a meeting on mathematics education, a question started to form in my mind, one I don’t know how to answer, and to which there may be no answer. I’d like to try to formulate what’s on my mind in writing, and see if it’s just nonsense, a curiosity, some old debate that’s been long since resolved, issues too complex to try to use in elementary education, or something we might actually want to try to do something about.

The question stems from my long experience in measurement. It is one of the basic principles of the field that counting and measuring are different things (see the list of publications on this, below). Counts don’t behave like measures unless the things being counted are units of measurement established as equal ratios or intervals that remain invariant independent of the local particulars of the sample and instrument.

Plainly, if you count two groups of small and large rocks or oranges, the two groups can contain the same number of things, yet the group with the larger things will have more rock or orange than the group with the smaller things. But the association of counting numbers and arithmetic operations with number lines insinuates and reinforces, to the point of automatic intuition, the false idea that numbers always represent quantity. I know that number lines are supposed to represent an abstract continuum, but I think it must be nearly impossible for children not to assume that the number line is basically a kind of ruler, a real physical thing that behaves much like a row of same-size wooden blocks laid end to end.

This could be completely irrelevant if the distinction between “How many?” and “How much?” is intensively taught and drilled into kids. Somehow I think it isn’t, though. And here’s where I get to the first part of my real question. Might not the universal, early, and continuous reinforcement of this simplistic equating of number and quantity have a lot to do with the equally simplistic assumption that all numeric data and statistical analyses are somehow quantitative? We count rocks or fish or sticks and call the resulting numbers quantities, and so we do the same thing when we count correct answers or ratings of “Strongly Agree.”

Though counting is a natural and obvious point from which to begin studying whether something is quantitatively measurable, there are no defined units of measurement in the ordinal data gathered from tests and surveys. The difference between any two adjacent scores varies depending on which two adjacent scores are compared. This has profound implications for the inferences we make and for our ability to think together as a field about our objects of investigation.
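A minimal sketch of that unevenness (my own illustration, using a hypothetical 20-item test): converting raw proportions correct to log-odds shows that the same one-point difference spans very different amounts at the extremes than in the middle of the score range.

  import math

  def logit(p):
      return math.log(p / (1.0 - p))

  for score in (2, 3, 10, 11, 18, 19):
      p = score / 20.0
      print(score, round(logit(p), 2))
  # The 2-to-3 and 18-to-19 steps span far more log-odds than the 10-to-11 step.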

Over the last 30 years and more, we have become increasingly sensitized to the way our words prefigure our expectations and color our perceptions. This struggle to say what we mean and to not prejudicially exclude others from recognition as full human beings is admirable and good. But if that is so, why is it then that we nonetheless go on unjustifiably reducing the real characteristics of people’s abilities, health, performances, etc. to numbers that do not and cannot stand for quantitative amounts? Why do we keep on referring to counts as quantities? Why do we insist on referring to inconstant and locally dependent scores as measures? And why do we refuse to use the readily available methods we have at our disposal to create universally uniform measures that consistently represent the same unit amount always and everywhere?

It seems to me that the image of the number line as a kind of ruler is so indelibly impressed on us as a habit of thought that it is very difficult to relinquish it in favor of a more abstract model of number. Might it be important for us to begin to plant the seeds for more sophisticated understandings of number early in mathematics education? I’m going to wonder out loud about this to some of my math education colleagues…

Cooper, G., & Humphry, S. M. (2010). The ontological distinction between units and entities. Synthese, DOI 10.1007/s11229-010-9832-1.

Wright, B. D. (1989). Rasch model from counting right answers: Raw scores as sufficient statistics. Rasch Measurement Transactions, 3(2), 62 [http://www.rasch.org/rmt/rmt32e.htm].

Wright, B. D. (1993). Thinking with raw scores. Rasch Measurement Transactions, 7(2), 299-300 [http://www.rasch.org/rmt/rmt72r.htm].

Wright, B. D. (1994, Autumn). Measuring and counting. Rasch Measurement Transactions, 8(3), 371 [http://www.rasch.org/rmt/rmt83c.htm].

Wright, B. D., & Linacre, J. M. (1989). Observations are always ordinal; measurements, however, must be interval. Archives of Physical Medicine and Rehabilitation, 70(12), 857-867 [http://www.rasch.org/memo44.htm].

Creative Commons License
LivingCapitalMetrics Blog by William P. Fisher, Jr., Ph.D. is licensed under a Creative Commons Attribution-Noncommercial-No Derivative Works 3.0 United States License.
Based on a work at livingcapitalmetrics.wordpress.com.
Permissions beyond the scope of this license may be available at http://www.livingcapitalmetrics.com.

A New Agenda for Measurement Theory and Practice in Education and Health Care

April 15, 2011

Two key issues on my agenda offer different answers to the question “Why do you do things the way you do in measurement theory and practice?”

First, we can take up the “Because of…” answer to this question. We need to articulate an historical account of measurement that does three things:

  1. that builds on Rasch’s use of Maxwell’s method of analogy by employing it and expanding on it in new applications;
  2. that unifies the vocabulary and concepts of measurement across the sciences into a single framework so far as possible by situating probabilistic models of invariant individual-level within-variable phenomena in the context of measurement’s GIGO principle and data-to-model fit, as distinct from the interactions of group-level between-variable phenomena in the context of statistics’ model-to-data fit; and
  3. that stresses the social, collective cognition facilitated by networks of individuals whose point-of-use measurement-informed decisions and behaviors are coordinated and harmonized virtually, at a distance, with no need for communication or negotiation.

We need multiple publications in leading journals on these issues, as well as one or more books that people can cite as a way of making this real and true history of measurement, properly speaking, credible and accepted in the mainstream. A draft article of my own in this vein is available at http://ssrn.com/abstract=1698919; I offer it for critique, and other material is available on request. Anyone who works on this paper with me and makes a substantial contribution to its publication will be added as co-author.

Second, we can take up the “In order that…” answer to the question “Why do you do things the way you do?” From this point of view, we need to broaden the scope of the measurement research agenda beyond data analysis, estimation, models, and fit assessment in three ways:

  1. by emphasizing predictive construct theories that exhibit the fullest possible understanding of what is measured and so enable the routine reproduction of desired proportionate effects efficiently, with no need to analyze data to obtain an estimate;
  2. by defining the standard units to which all calibrated instruments measuring given constructs are traceable; and
  3. by disseminating to front line users on mass scales instruments measuring in publicly available standard units and giving immediate feedback at the point of use.

These two sets of issues define a series of talking points that together constitute a new narrative for measurement in education, psychology, health care, and many other fields. We and others may see our way to organizing new professional societies, new journals, new university-based programs of study, etc. around these principles.


The Moral Implications of the Concept of Human Capital: More on How to Create Living Capital Markets

March 22, 2011

The moral reprehensibility of the concept of human capital hinges on its use in rationalizing impersonal business decisions in the name of profits. Even when the viability of the organization is at stake, the discarding of people (referred to in some human resource departments as “taking out the trash”) entails degrees of psychological and economic injury no one should have to suffer, or inflict.

There certainly is a justified need for a general concept naming the productive capacity of labor. But labor is far more than a capacity for work. No one’s working life should be reduced to a job description. Labor involves a wide range of different combinations of skills, abilities, motivations, health, and trustworthiness. Human capital has then come to be broken down into a wide variety of forms, such as literacy capital, health capital, social capital, etc.

The metaphoric use of the word “capital” in the phrase “human capital” referring to stocks of available human resources rings hollow. The traditional concept of labor as a form of capital is an unjustified reduction of diverse capacities in itself. But the problem goes deeper. Intangible resources like labor are not represented and managed in the forms that make markets for tangible resources efficient. Transferable representations, like titles and deeds, give property a legal status as owned and an economic status as financially fungible. And in those legal and economic terms, tangible forms of capital give capitalism its hallmark signification as the lifeblood of the cycle of investment, profits, and reinvestment.

Intangible forms of capital, in contrast, are managed without the benefit of any standardized way of proving what is owned, what quantity or quality of it exists, and what it costs. Human, social, and natural forms of capital are therefore managed directly, by acting in an unmediated way on whomever or whatever embodies them. Such management requires, even in capitalist economies, the use of what are inherently socialistic methods, as these are the only methods available for dealing with the concrete individual people, communities, and ecologies involved (Fisher, 2002, 2011; drawing from Hayek, 1948, 1988; De Soto, 2000).

The assumption that transferable representations of intangible assets are inconceivable or inherently reductionist is, however, completely mistaken. All economic capital is ultimately brought to life (conceived, gestated, midwifed, and nurtured to maturity) as scientific capital. Scientific measurability is what makes it possible to add up the value of shares of stock across holdings, to divide something owned into shares, and to represent something in a court or a bank in a portable form (Latour, 1987; Fisher, 2002, 2011).

Only when you appreciate this distinction between dead and living capital, between capital represented on transferable instruments and capital that is not, can you see that the real tragedy is not in the treatment of labor as capital. No, the real tragedy is in the way everyone is denied the full exercise of their rights over the skills, abilities, health, motivations, trustworthiness, and environmental resources that are rightly their own personal, private property.

Being homogenized at the population level into an interchangeable statistic is tragic enough. But when we leave the matter here, we fail to see and to grasp the meaning of the opportunities that are lost in that myopic world view. As I have been at pains in this blog to show, statistics are not measures. Statistical models of interactions between several variables at the group level are not the same thing as measurement models of interactions within a single variable at the individual level. When statistical models are used in place of measurement models, the result is inevitably numbers without a soul. When measurement models of individual response processes are used to produce meaningful estimates of how much of something someone possesses, a whole different world of possibilities opens up.

In the same way that the Pythagorean Theorem applies to any right triangle, so, too, do the coordinates from the international geodetic survey make it possible to know everything that needs to be known about the location and disposition of a piece of real estate. Advanced measurement models in the psychosocial sciences are making it possible to arrive at similarly convenient and objective ways of representing the quality and quantity of intangible assets. Instead of being just one number among many others, real measures tell a story that situates each of us relative to everyone else in a meaningful way.

The practical meaning of the maxim “you manage what you measure” stems from those instances in which measures embody the fullness of the very thing that is the object of management interest. An engine’s fuel efficiency, or the volume of commodities produced, for instance, are things that can be managed less or more efficiently because there are measures of them that directly represent just what we want to control. Lean thinking enables the removal of resources that do not contribute to the production of the desired end result.

Many metrics, however, tend to obscure and distract from what needs to be managed. The objects of measurement may seem to be obviously related to what needs to be managed, but dealing with each of them piecemeal results in inefficient and ineffective management. In these instances, instead of the characteristic cycle of investment, profit, and reinvestment, there seems to be only a bottomless pit absorbing ever more investment and never producing a profit. Why?

The economic dysfunctionality of intangible asset markets is intimately tied up with the moral dysfunctionality of those markets. Drawing an analogy from a recent analysis of political freedom (Shirky, 2010), economic freedom has to be accompanied by a market society economically literate enough, economically empowered enough, and interconnected enough to trade on the capital stocks issued. Western society, and increasingly the entire global society, is arguably economically literate and sufficiently interconnected to exercise economic freedom.

Economic empowerment is another matter entirely. There is no economic power without fungible capital, without ways of representing resources of all kinds, tangible and intangible, that transparently show what is available, how much of it there is, and what quality it is. A form of currency expressing the value of that capital is essential, but money is wildly insufficient to the task of determining the quality and quantity of the available capital stocks.

Today’s education, health care, human resource, and environmental quality markets are the diametric opposite of the markets in which investors, producers, and consumers are empowered. Only when dead human, social, and natural capital is brought to life in efficient markets (Fisher, 2011) will we empower ourselves with fuller degrees of creative control over our economic lives.

The crux of the economic empowerment issue is this: in the current context of inefficient intangibles markets, everyone is personally commodified. Everything that makes me valuable to an employer or investor or customer, my skills, motivations, health, and trustworthiness, is unjustifiably reduced to a homogenized unit of labor. And in the social and environmental quality markets, voting our shares is cumbersome, expensive, and often ineffective because of the immense amount of work that has to be done to defend each particular living manifestation of the value we want to protect.

Concentrated economic power is exercised in the mass markets of dead, socialized intangible assets in ways that we are taught to think of as impersonal and indifferent to each of us as individuals, but which are actually experienced by us as intensely personal.

So what is the difference between being treated personally as a commodity and being treated impersonally as a commodity? This is the same as asking what it would mean to be empowered economically with creative control over the stocks of human, social, and natural capital that are rightfully our private property. This difference is the difference between dead and living capital (Fisher, 2002, 2011).

Freedom of economic communication, realized in the trade of privately owned stocks of any form of capital, ought to be the highest priority in the way we think about the infrastructure of a sustainable and socially responsible economy. For maximum efficiency, that freedom requires a common, meaningful, and rigorous quantitative language enabling determinations of what exactly is for sale, and its quality, quantity, and unit price. As I have repeated ad nauseam in this blog, measurement based in scientifically calibrated instrumentation traceable to consensus standards is absolutely essential to meeting this need.

Coming in at a very close second to the highest priority is securing the ability to trade. A strong market society, where people can exercise the right to control their own private property—their personal stocks of human, social, and natural capital—in highly efficient markets, is more important than policies, regulations, and five-year plans dictating how masses of supposedly homogenous labor, social, and environmental commodities are priced and managed.

So instead of reacting to the downside of the business cycle with a socialistic safety net, how might a capitalistic one prove more humane, moral, and economically profitable? Instead of guaranteeing a limited amount of unemployment insurance funded through taxes, what we should have are requirements for minimum investments in social capital. Instead of employment in the usual sense of the term, with its implications of hiring and firing, we should have an open market for fungible human capital, in which everyone can track the price of their stock, attract and make new investments, take profits and income, upgrade the quality and/or quantity of their stock, etc.

In this context, instead of receiving unemployment compensation, workers not currently engaged in remunerated use of their skills would cash in some of their accumulated stock of social capital. The cost of social capital would go up in periods of high demand, as during the recent economic downturns caused by betrayals of trust and commitment (which are, in effect, involuntary expenditures of social capital). Conversely, the cost of human capital would also fluctuate with supply and demand, with the profits (currently referred to as wages) turned by individual workers rising and falling with the price of their stocks. These ups and downs, being absorbed by everyone in proportion to their investments, would reduce the distorted proportions we see today in the shares of the rewards and punishments allotted.

Though no one would have a guaranteed wage, everyone would have the opportunity to manage their capital to the fullest, by upgrading it, keeping it current, and selling it to the highest bidder. Ebbing and flowing tides would more truly lift and drop all boats together, with the drops backed up by the social capital markets’ tangible reassurance that we are all in this together. This kind of social capitalism transforms the supposedly impersonal but actually highly personal indifference of flows in human capital into a more fully impersonal indifference in which individuals have the potential to maximize the realization of their personal goals.

What we need is to create a visible alternative to the bankrupt economic system in a kind of reverse shock doctrine. Eleanor Roosevelt often said that the thing we are most afraid of is the thing we most need to confront if we are to grow. The more we struggle against what we fear, the further we are carried away from what we want. Only when we relax into the binding constraints do we find them loosened. Only when we channel overwhelming force against itself or in a productive direction can we withstand attack. When we find the courage to go where the wild things are and look the monsters in the eye will we have the opportunity to see if their fearful aspect is transformed to playfulness. What is left is often a more mundane set of challenges, the residuals of a developmental transition to a new level of hierarchical complexity.

And this is the case with the moral implications of the concept of human capital. Treating individuals as fungible commodities is a way that some use to protect themselves from feeling like monsters and from being discarded as well. Those who find themselves removed from the satisfactions of working life can blame the shortsightedness of their former colleagues, or the ugliness of the unfeeling system. But neither defensive nor offensive rationalizations do anything to address the actual problem, and the problem has nothing to do with the morality or the immorality of the concept of human capital.

The problem is the problem. That is, the way we approach and define the problem delimits the sphere of the creative options we have for solving it. As Henry Ford is supposed to have said, whether you think you can or you think you cannot, you’re probably right. It is up to us to decide whether we can create an economic system that justifies its reductions and actually lives up to its billing as impersonal and unbiased, or if we cannot. Either way, we’ll have to accept and live with the consequences.

References

De Soto, H. (2000). The mystery of capital: Why capitalism triumphs in the West and fails everywhere else. New York: Basic Books.

Fisher, W. P., Jr. (2002, Spring). “The Mystery of Capital” and the human sciences. Rasch Measurement Transactions, 15(4), 854 [http://www.rasch.org/rmt/rmt154j.htm].

Fisher, W. P., Jr. (2011, Spring). Bringing human, social, and natural capital to life: Practical consequences and opportunities. Journal of Applied Measurement, 12(1), in press.

Hayek, F. A. (1948). Individualism and economic order. Chicago: University of Chicago Press.

Hayek, F. A. (1988). The fatal conceit: The errors of socialism (W. W. Bartley, III, Ed.) The Collected Works of F. A. Hayek. Chicago: University of Chicago Press.

Latour, B. (1987). Science in action: How to follow scientists and engineers through society. New York: Cambridge University Press.

Shirky, C. (2010, December 20). The political power of social media: Technology, the public sphere, and political change. Foreign Affairs, 90(1), http://www.foreignaffairs.com/articles/67038/clay-shirky/the-political-power-of-social-media.


A Second Simple Example of Measurement’s Role in Reducing Transaction Costs, Enhancing Market Efficiency, and Enabling the Pricing of Intangible Assets

March 9, 2011

The prior post here showed why we should not confuse counts of things with measures of amounts, though counts are the natural starting place to begin constructing measures. That first simple example focused on an analogy between counting oranges and measuring the weight of oranges, versus counting correct answers on tests and measuring amounts of ability. This second example extends the first by showing, in effect, what happens when we want to aggregate value not just across different counts of some one thing but across different counts of different things. The point will be to show how the relative values of apples, oranges, grapes, and bananas can be put into a common frame of reference and compared in a practical and convenient way.

For instance, you may go into a grocery store to buy raspberries and blackberries, and I go in to buy cantaloupe and watermelon. Your cost per individual fruit will be very low, and mine will be very high, but neither of us will find this annoying, confusing, or inconvenient because your fruits are very small, and mine, very large. Conversely, your cost per kilogram will be much higher than mine, but this won’t cause either of us any distress because we both recognize the differences in the labor, handling, nutritional, and culinary value of our purchases.

But what happens when we try to purchase something as complex as a unit of socioeconomic development? The eight UN Millennium Development Goals (MDGs) represent a start at a systematic effort to bring human, social, and natural capital together into the same economic and accountability framework as liquid and manufactured capital, and property. But that effort is stymied by the inefficiency and cost of making and using measures of the goals achieved. The existing MDG databases (http://data.un.org/Browse.aspx?d=MDG) and summary reports present an overwhelming mass of numbers. Individual indicators are presented for each year, each country, each region, and each program, goal by goal, target by target, indicator by indicator, and series by series, in an indigestible volume of data.

Though there are no doubt complex mathematical methods by which a philanthropic, governmental, or NGO investor might determine how much development is gained per million dollars invested, the cost of obtaining impact measures is so high that most funding decisions are made with little information concerning expected returns (Goldberg, 2009). Further, the percentages of various needs met by leading social enterprises typically range from 0.07% to 3.30%, and needs are growing, not diminishing. Progress at current rates means that it would take thousands of years to solve today’s problems of human suffering, social disparity, and environmental quality. The inefficiency of human, social, and natural capital markets is so overwhelming that there is little hope for significant improvements without the introduction of fundamental infrastructural supports, such as an Intangible Assets Metric System.

A basic question that needs to be asked of the MDG system is, how can anyone make any sense out of so much data? Most of the indicators are evaluated in terms of counts of the number of times something happens, the number of people affected, or the number of things observed to be present. These counts are usually then divided by the maximum possible (the count of the total population) and are expressed as percentages or rates.

As previously explained in various posts in this blog, counts and percentages are not measures in any meaningful sense. They are notoriously difficult to interpret, since the quantitative meaning of any given unit difference varies depending on the size of what is counted, or where the percentage falls in the 0-100 continuum. And because counts and percentages are interpreted one at a time, it is very difficult to know if and when any number included in the sheer mass of data is reasonable, all else considered, or if it is inconsistent with other available facts.

A study of the MDG data must focus on these three potential areas of data quality improvement: consistency evaluation, volume reduction, and interpretability. Each builds on the others. With consistent data lending themselves to summarization in sufficient statistics, data volume can be drastically reduced with no loss of information (Andersen, 1977, 1999; Wright, 1977, 1997), data quality can be readily assessed in terms of sufficiency violations (Smith, 2000; Smith & Plackner, 2009), and quantitative measures can be made interpretable in terms of a calibrated ruler’s repeatedly reproducible hierarchy of indicators (Bond & Fox, 2007; Masters, Lokan, & Doig, 1994).

The primary data quality criteria are qualitative relevance and meaningfulness, on the one hand, and mathematical rigor, on the other. The point here is one of following through on the maxim that we manage what we measure, with the goal of measuring in such a way that management is better focused on the program mission and not distracted by accounting irrelevancies.

Method

As written and deployed, each of the MDG indicators has the face and content validity of providing information on each respective substantive area of interest. But, as has been the focus of repeated emphases in this blog, counting something is not the same thing as measuring it.

Counts or rates of literacy or unemployment are not, in and of themselves, measures of development. Their capacity to serve as contributing indications of developmental progress is an empirical question that must be evaluated experimentally against the observable evidence. The measurement of progress toward an overarching developmental goal requires inferences made from a conceptual order of magnitude above and beyond that provided in the individual indicators. The calibration of an instrument for assessing progress toward the realization of the Millennium Development Goals requires, first, a reorganization of the existing data, and then an analysis that tests explicitly the relevant hypotheses as to the potential for quantification, before inferences supporting the comparison of measures can be scientifically supported.

A subset of the MDG data was selected from the MDG database available at http://data.un.org/Browse.aspx?d=MDG, recoded, and analyzed using Winsteps (Linacre, 2011). At least one indicator was selected from each of the eight goals, with 22 in total. All available data from these 22 indicators were recorded for each of 64 countries.

The reorganization of the data is nothing but a way of making the interpretation of the percentages explicit. The meaning of any one country’s percentage or rate of youth unemployment, cell phone users, or literacy has to be kept in context relative to expectations formed from other countries’ experiences. It would be nonsense to interpret any single indicator as good or bad in isolation. Sometimes 30% represents an excellent state of affairs, other times, a terrible one.

Therefore, the distributions of each indicator’s percentages across the 64 countries were divided into ranges and converted to ratings. A lower rating uniformly indicates a status further away from the goal than a higher rating. The ratings were devised by dividing the frequency distribution of each indicator roughly into thirds.

For instance, the youth unemployment rate was found to vary such that the countries furthest from the desired goal had rates of 25% or more (rated 1), and those closest to or exceeding the goal had rates of 0-10% (rated 3), leaving the middle range (10-25%) rated 2. In contrast, percentages of the population that are undernourished were rated 1 for 35% or more, 2 for 15-35%, and 3 for less than 15%.

Thirds of the distributions were decided upon only on the basis of the investigator’s prior experience with data of this kind. A more thorough approach to the data would begin from a finer-grained rating system, like that structuring the MDG table at http://mdgs.un.org/unsd/mdg/Resources/Static/Products/Progress2008/MDG_Report_2008_Progress_Chart_En.pdf. This greater detail would be sought in order to determine empirically just how many distinctions each indicator can support and contribute to the overall measurement system.
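A minimal sketch of this kind of thirds-based recoding (my own reconstruction, not the script actually used for the study), applying the youth unemployment cut points given above to hypothetical country rates:

  rates = [4.2, 8.9, 12.5, 19.0, 27.3, 41.6]   # hypothetical country rates (%)

  def rate_to_rating(rate):
      # Higher rating = closer to the goal (lower youth unemployment).
      if rate >= 25.0:
          return 1
      if rate > 10.0:
          return 2
      return 3

  print([rate_to_rating(r) for r in rates])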

Sixty-four of the 336 available data points on the first indicator were selected for their representativeness, with no duplicate values and with a proportionate distribution along the entire continuum of observed values.

Data from the same 64 countries and the same years were then sought for the subsequent indicators. It turned out that the years in which data were available varied across data sets. Data within one or two years of the target year were sometimes substituted for missing data.

The data were analyzed twice, first with each indicator allowed its own rating scale, parameterizing each of the category difficulties separately for each item, and then with the full rating scale model, as the results of the first analysis showed all indicators shared strong consistency in the rating structure.
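For reference, here is a minimal sketch (my own, not Winsteps code) of the rating scale model used in the second analysis: with a person measure B, an item difficulty D, and category thresholds tau shared across items, the probability of each rating follows from cumulative sums of (B - D - tau). All values below are hypothetical logits.

  import numpy as np

  def category_probabilities(B, D, taus):
      # taus are the thresholds between adjacent categories (here, 1-2 and 2-3).
      numerators = [1.0]
      cumulative = 0.0
      for tau in taus:
          cumulative += (B - D - tau)
          numerators.append(np.exp(cumulative))
      numerators = np.array(numerators)
      return numerators / numerators.sum()     # probabilities of ratings 1, 2, 3

  print(np.round(category_probabilities(B=0.5, D=0.0, taus=[-1.0, 1.0]), 2))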

Results

Data were 65.2% complete. Countries were assessed on an average of 14.3 of the 22 indicators, and each indicator was applied on average to 41.7 of the 64 country cases. Measurement reliability was .89-.90, depending on how measurement error is estimated. Cronbach’s alpha for the by-country scores was .94. Calibration reliability was .93-.95. The rating scale worked well (see Linacre, 2002, for criteria). The data fit the measurement model reasonably well, with satisfactory data consistency, meaning that the hypothesis of a measurable developmental construct was not falsified.

The main result for our purposes here concerns how satisfactory data consistency makes it possible to dramatically reduce data volume and improve data interpretability. The figure below illustrates how. What does it mean for data volume to be drastically reduced with no loss of information? Let’s see exactly how much the data volume is reduced for the ten-item data subset shown in the figure below.

The horizontal continuum from -100 to 1300 in the figure is the metric, the ruler or yardstick. The number of countries at various locations along that ruler is shown across the bottom of the figure. The mean (M), first standard deviation (S), and second standard deviation (T) are shown beneath the numbers of countries. There are ten countries with a measure of just below 400, just to the left of the mean (M).

The MDG indicators are listed on the right of the figure, with the indicator most often achieved relative to the goals at the bottom and the indicator least often achieved at the top. The ratings in the middle of the figure increase from 1 to 3, left to right, as the probability of goal achievement rises with the measures going from low to high. The position of the ratings in the middle of the figure shifts from left to right as one reads up the list of indicators because the difficulty of achieving the goals is increasing.

Because the ratings of the 64 countries relative to these ten goals are internally consistent, nothing but the developmental level of the country and the developmental challenge of the indicator affects the probability that a given rating will be attained. It is this relation that defines fit to a measurement model, the sufficiency of the summed ratings, and the interpretability of the scores. Given sufficient fit and consistency, any country’s measure implies a given rating on each of the ten indicators.

For instance, imagine a vertical line drawn through the figure at a measure of 500, just above the mean (M). This measure is interpreted relative to the places at which the vertical line crosses the ratings in each row associated with each of the ten items. A measure of 500 is read as implying, within a given range of error, uncertainty, or confidence, a rating of

  • 3 on debt service and female-to-male parity in literacy,
  • 2 or 3 on how much of the population is undernourished and how many children under five years of age are moderately or severely underweight,
  • 2 on infant mortality, the percent of the population aged 15 to 49 with HIV, and the youth unemployment rate,
  • 1 or 2 on the poor’s share of the national income, and
  • 1 on CO2 emissions and the rate of personal computers per 100 inhabitants.

For any one country with a measure of 500 on this scale, ten percentages or rates that appear completely incommensurable and incomparable are found to contribute consistently to a single valued function, developmental goal achievement. Instead of managing each separate indicator as a universe unto itself, this scale makes it possible to manage development itself at its own level of complexity. This ten-to-one ratio of reduced data volume is more than doubled when the total of 22 items included in the scale is taken into account.
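As a minimal sketch of that reading (my own, with boundary locations only loosely and hypothetically approximated from the figure rather than taken from the calibration output), the implied rating on each indicator is simply the zone of the ruler into which the measure falls; within measurement error, measures near a boundary imply the “1 or 2” and “2 or 3” readings listed above.

  # Hypothetical (1-to-2, 2-to-3) boundary locations on the 0-1300 scale.
  indicator_boundaries = {
      "DebtServExpInc": (50, 250),
      "F2MParityLit": (150, 350),
      "PopUndernourished": (350, 550),
      "YouthUnempRatMF": (450, 650),
      "CO2Emissions": (650, 850),
      "PcsPer100": (800, 1000),
  }

  def implied_rating(measure, low, high):
      if measure < low:
          return 1
      if measure < high:
          return 2
      return 3

  measure = 500
  for name, (low, high) in indicator_boundaries.items():
      print(name, implied_rating(measure, low, high))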

This reduction is conceptually and practically important because it focuses attention on the actual object of management, development. When the individual indicators are the focus of attention, the forest is lost for the trees. Those who disparage the validity of the maxim, you manage what you measure, are often discouraged by the feeling of being pulled in too many directions at once. But a measure of the HIV infection rate is not in itself a measure of anything but the HIV infection rate. Interpreting it in terms of broader developmental goals requires evidence that it in fact takes a place in that larger context.

And once a connection with that larger context is established, the consistency of individual data points remains a matter of interest. As the world turns, the order of things may change, but, more likely, data entry errors, temporary data blips, and other factors will alter data quality. Such changes cannot be detected outside of the context defined by an explicit interpretive framework that requires consistent observations.

-100  100     300     500     700     900    1100    1300
|-------+-------+-------+-------+-------+-------+-------|  NUM   INDCTR
1                                 1  :    2    :  3     3    9  PcsPer100
1                         1   :   2    :   3            3    8  CO2Emissions
1                    1  :    2    :   3                 3   10  PoorShareNatInc
1                 1  :    2    :  3                     3   19  YouthUnempRatMF
1              1   :    2   :   3                       3    1  %HIV15-49
1            1   :   2    :   3                         3    7  InfantMortality
1          1  :    2    :  3                            3    4  ChildrenUnder5ModSevUndWgt
1         1   :    2    :  3                            3   12  PopUndernourished
1    1   :    2   :   3                                 3    6  F2MParityLit
1   :    2    :  3                                      3    5  DebtServExpInc
|-------+-------+-------+-------+-------+-------+-------|  NUM   INDCTR
-100  100     300     500     700     900    1100    1300
                   1
       1   1 13445403312323 41 221    2   1   1            COUNTRIES
       T      S       M      S       T

Discussion

A key element in the results obtained here concerns the fact that the data were about 35% missing. Whether or not any given indicator was actually rated for any given country, the measure can still be interpreted as implying the expected rating. This capacity to take missing data into account can be taken advantage of systematically by calibrating a large bank of indicators. With this in hand, it becomes possible to gather only the amount of data needed to make a specific determination, or to adaptively administer the indicators so as to obtain the lowest-error (most reliable) measure at the lowest cost (with the fewest indicators administered). Perhaps most importantly, different collections of indicators can then be equated to measure in the same unit, so that impacts may be compared more efficiently.
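A minimal sketch of the adaptive idea (my own, dichotomous for simplicity rather than matching the polytomous indicators above): from a calibrated bank, administer whichever remaining indicator is most informative at the current provisional measure. The bank entries are hypothetical logit difficulties.

  import numpy as np

  bank = {"indicatorA": -1.5, "indicatorB": -0.3, "indicatorC": 0.4, "indicatorD": 1.8}

  def information(theta, difficulty):
      # Fisher information for a dichotomous Rasch item: p * (1 - p).
      p = 1.0 / (1.0 + np.exp(-(theta - difficulty)))
      return p * (1.0 - p)

  current_measure = 0.2
  best = max(bank, key=lambda name: information(current_measure, bank[name]))
  print(best)   # the indicator whose difficulty lies closest to the current measure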

Instead of an international developmental aid market that is so inefficient as to preclude any expectation of measured returns on investment, setting up a calibrated bank of indicators to which all measures are traceable opens up numerous desirable possibilities. The cost of assessing and interpreting the data informing aid transactions could be reduced to negligible amounts, and the management of the processes and outcomes in which that aid is invested would be made much more efficient by reduced data volume and enhanced information content. Because capital would flow more efficiently to where supply is meeting demand, nonproducers would be cut out of the market, and the effectiveness of the aid provided would be multiplied many times over.

The capacity to harmonize counts of different but related events into a single measurement system presents the possibility that there may be a bright future for outcomes-based budgeting in education, health care, human resource management, environmental management, housing, corrections, social services, philanthropy, and international development. It may seem wildly unrealistic to imagine such a thing, but the return on the investment would be so monumental that not checking it out would be even crazier.

A full report on the MDG data, with the other references cited, is available on my SSRN page at http://papers.ssrn.com/sol3/papers.cfm?abstract_id=1739386.

Goldberg, S. H. (2009). Billions of drops in millions of buckets: Why philanthropy doesn’t advance social progress. New York: Wiley.


Guttman on sufficiency, statistics, and cumulative science

April 29, 2010

“R. A. Fisher employed maximum likelihood…as a way of finding sufficient statistics if they exist. Now, sufficient statistics rarely exist, and even when they do, their use need not be optimal for estimation problems. As enlarged on in Reference 7 [an unpublished 1984 paper of Guttman’s], for each best unbiased sufficient statistic there generally is a better–and not necessarily sufficient–biased one. To use maximum likelihood requires knowledge of the complete sampling distribution, but biased estimation is proved to be better in a distribution-free fashion.” (Guttman, 1985, pp. 7-8; 1994, pp. 345-6)

In this passage, Guttman may be addressing issues related to the kind of biases that can affect extreme scores in Joint Maximum Likelihood Estimation (JMLE, formerly UCON) (Jansen, van den Wollenberg, Wierda, 1988; Wright, 1988; Wright & Panchapakesan, 1969). But what’s more interesting is the combination of the awareness of sufficiency and estimation issues revealed in this remark with the context in which it is made.
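For readers unfamiliar with that literature, here is a minimal sketch (my own note; I am assuming, without certainty, that this is the correction at issue in those papers) of the simple first-order bias adjustment commonly cited for JMLE/UCON estimates from a dichotomous test of L items.

  # Commonly cited adjustment (an assumption on my part that this is the one at issue):
  # multiply JMLE/UCON item difficulty estimates by (L - 1) / L. Values are hypothetical.
  L = 20                                     # number of items on the test
  uncorrected = [-1.10, -0.25, 0.40, 0.95]   # hypothetical JMLE item difficulties
  corrected = [round(d * (L - 1) / L, 3) for d in uncorrected]
  print(corrected)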

Guttman targets and rightly skewers a good number of inconsistencies and internal contradictions in statistical inference. But he shares in many of them himself. That is, Guttman’s valuable insights as to measurement are limited by his failure to consider at all the importance of the instrument in science, and by his limited appreciation of the value of theory. This is so despite his realization that “There can be no solution [to the problem of sampling items from one or more indefinitely large universes of content] without a structural theory” (1994, p. 329, in his Psychometrika review of Gulliksen’s Theory of Mental Tests), which is fully in tune with his emphasis on the central role of substantive replication in science (1994, p. 343).

But in the way he articulates his concern with replication, we see that, for Guttman, as for so many others, measurement is a matter of data analysis and not one of calibrating instruments. Measurement is not primarily a statistical process performed on computers, but is an individual event performed with an instrument. Calibrated instruments remove the necessity for data analysis (though other kinds of analysis may, of course, be continued or commenced).

In reading Guttman, it is difficult to follow through on his pithy and rich observations on the inconsistencies and illogic of statistical inference because he does not offer a clear alternative path, a measurement path structured by instrumentation. In his review of Lord and Novick (1968), for instance, Guttman remarks on the authors’ failure to provide their promised synthetic theory of tests and measurement, but does not offer or point toward one himself, even after noting the inclusion of Rasch’s Poisson models in the Lord and Novick classification system. Though much has been done to connect Guttman with Rasch (Andrich, 1982, 1985; Douglas & Wright, 1989; Engelhard, 2008; Linacre, 1991, 2000; Linacre & Wright, 1996; Tenenbaum, 1999; Wilson, 1989), and to advance in the direction of point-of-use measurement (Bode, 1999; Bode, Heinemann, & Semik, 2000; Connolly, Nachtman, & Pritchett, 1971; Davis, Perruccio, Canizares, Tennant, Hawker, et al., 2008; Linacre, 1997; many others), much more remains to be done.

Andrich, D. (1982, June). An index of person separation in Latent Trait Theory, the traditional KR-20 index, and the Guttman scale response pattern. Education Research and Perspectives, 9(1), 95-104 [http://www.rasch.org/erp7.htm].

Andrich, D. (1985). An elaboration of Guttman scaling with Rasch models for measurement. In N. B. Tuma (Ed.), Sociological methodology 1985 (pp. 33-80). San Francisco, California: Jossey-Bass.

Bode, R. K. (1999). Self-scoring key for Galveston Orientation and Amnesia Test. Rasch Measurement Transactions, 13(1), 680 [http://www.rasch.org/rmt/rmt131c.htm].

Bode, R. K., Heinemann, A. W., & Semik, P. (2000, Feb). Measurement properties of the Galveston Orientation and Amnesia Test (GOAT) and improvement patterns during inpatient rehabilitation. Journal of Head Trauma Rehabilitation, 15(1), 637-55.

Connolly, A. J., Nachtman, W., & Pritchett, E. M. (1971). Keymath: Diagnostic Arithmetic Test. Circle Pines, Minnesota: American Guidance Service.

Davis, A. M., Perruccio, A. V., Canizares, M., Tennant, A., Hawker, G. A., Conaghan, P. G., et al. (2008, May). The development of a short measure of physical function for hip OA HOOS-Physical Function Shortform (HOOS-PS): An OARSI/OMERACT initiative. Osteoarthritis Cartilage, 16(5), 551-9.

Douglas, G. A., & Wright, B. D. (1989). Response patterns and their probabilities. Rasch Measurement Transactions, 3(4), 75-77 [http://www.rasch.org/rmt/rmt34.htm].

Engelhard, G. (2008, July). Historical perspectives on invariant measurement: Guttman, Rasch, and Mokken. Measurement: Interdisciplinary Research & Perspectives, 6(3), 155-189.

Guttman, L. (1985). The illogic of statistical inference for cumulative science. Applied Stochastic Models and Data Analysis, 1, 3-10. (Reprinted in Guttman 1994, pp. 341-348)

Guttman, L. (1994). Louis Guttman on theory and methodology: Selected writings (S. Levy, Ed.). Dartmouth Benchmark Series. Brookfield, VT: Dartmouth Publishing Company.

Jansen, P., Van den Wollenberg, A., & Wierda, F. (1988). Correcting unconditional parameter estimates in the Rasch model for inconsistency. Applied Psychological Measurement, 12(3), 297-306.

Linacre, J. M. (1991, Spring). Stochastic Guttman order. Rasch Measurement Transactions, 5(4), 189 [http://www.rasch.org/rmt/rmt54p.htm].

Linacre, J. M. (1997). Instantaneous measurement and diagnosis. Physical Medicine and Rehabilitation State of the Art Reviews, 11(2), 315-324 [http://www.rasch.org/memo60.htm].

Linacre, J. M. (2000, Autumn). Guttman coefficients and Rasch data. Rasch Measurement Transactions, 14(2), 746-7 [http://www.rasch.org/rmt/rmt142e.htm].

Linacre, J. M., & Wright, B. D. (1996, Autumn). Guttman-style item location maps. Rasch Measurement Transactions, 10(2), 492-3 [http://www.rasch.org/rmt/rmt102h.htm].

Lord, F. M., & Novick, M. R. (Eds.). (1968). Statistical theories of mental test scores. Reading, Massachusetts: Addison-Wesley.

Tenenbaum, G. (1999, Jan-Mar). The implementation of Thurstone’s and Guttman’s measurement ideas in Rasch analysis. International Journal of Sport Psychology, 30(1), 3-16.

Wilson, M. (1989). A comparison of deterministic and probabilistic approaches to learning structures. Australian Journal of Education, 33(2), 127-140.

Wright, B. D. (1988, Sep). The efficacy of unconditional maximum likelihood bias correction: Comment on Jansen, Van den Wollenberg, and Wierda. Applied Psychological Measurement, 12(3), 315-318.

Wright, B. D., & Panchapakesan, N. (1969). A procedure for sample-free item analysis. Educational and Psychological Measurement, 29(1), 23-48.


Reasoning by analogy in social science education: On the need for a new curriculum

April 12, 2010

I’d like to revisit the distinction between measurement models and statistical models. Rasch was well known for joking about burning all books containing the words “normal distribution” (Andersen, 1995, p. 385). Rasch’s book and 1961 article both start on their first pages with a distinction between statistical models describing intervariable relations at the group level and measurement models prescribing intravariable relations at the individual level. I think confusion between these kinds of models has caused huge problems.

We typically assume all statistical analyses are quantitative. We refer to any research that uses numbers as quantitative even when nothing is done to map a substantive and invariant unit on a number line. We distinguish between qualitative and quantitative data and methods as though quantification has ever been achieved in the history of science without substantive qualitative understandings of the constructs.

Quantification in fact predates the emergence of statistics by millennia. It seems to me that there is a great deal to be gained from maintaining a careful distinction between statistics and measurement. Measurement is not primarily performed by someone sitting at a computer analyzing data. Measurement is done by individuals using calibrated instruments to obtain immediately useful quantitative information expressed in a universally uniform unit.

Rasch was correct in his assertion that we can measure the reading ability of a child with the same kind of objectivity with which we measure his or her weight or height. But we don’t commonly express individual height and weight measures in statistical terms. 

Information overload is one of the big topics of the day. Which will contribute more to reducing that overload in efficient and meaningful ways: calibrated instruments measuring in common units giving individual users immediate feedback that summarizes responses to dozens of questions, or ordinal group-level item-by-item statistics reported six months too late to do anything about them?

Instrument calibration certainly makes use of statistics, and statistical models usually assume measurement has taken place, but much stands to be gained from a clear distinction between inter- and intra-variable models. And so I respectfully disagree with those who assert that “the Rasch model is first of all a statistical model.” Maxwell’s method of making analogies from well known physical laws (Nersessian, 2002; Turner, 1955) was adopted by Rasch (1960, pp. 110-115) so that his model would have the same structure as the laws of physics.

Statistical models are a different class of models from the laws of physics (Meehl, 1967), since they allow cross-variable interactions in ways that compromise and defeat the possibility of testing the hypotheses of constant unit size, parameter separation, sufficiency, etc.

I’d like to suggest a paraphrase of the first sentence of the abstract from a recent paper (Silva, 2007) on using analogies in science education: Despite its great importance, many students and even their teachers still cannot recognize the relevance of measurement models for building up psychosocial knowledge and are unable to develop qualitative explanations for mathematical expressions of the lawful structural invariances that exist within the social sciences.

And so, here’s a challenge: we need to make an analogy from Silva’s (2007) work in physics education and develop a curriculum for social science education that follows a parallel track. We could trace the development of reading measurement from Rasch (1960) through the Anchor Test Study (Jaeger, 1973; Rentz & Bashaw, 1977) to the introduction of the Lexile Framework for Reading (Stenner, 2001) and its explicit continuity with Rasch’s use of Maxwell’s method of analogy (Burdick, Stone, & Stenner, 2006) and full-blown predictive theory (Stenner & Stone, 2003).

With the example of the Rasch Reading Law in hand, we could then train students and teachers to think about structural invariance in the context of psychosocial constructs. It may be that, without the development and dissemination of at least a college-level curriculum of this kind, we will never overcome the confusion between statistical and measurement models.

References

Andersen, E. B. (1995). What George Rasch would have thought about this book. In G. H. Fischer & I. W. Molenaar (Eds.), Rasch models: Foundations, recent developments, and applications (pp. 383-390). New York: Springer-Verlag.

Burdick, D. S., Stone, M. H., & Stenner, A. J. (2006). The Combined Gas Law and a Rasch Reading Law. Rasch Measurement Transactions, 20(2), 1059-60 [http://www.rasch.org/rmt/rmt202.pdf].

Jaeger, R. M. (1973). The national test equating study in reading (The Anchor Test Study). Measurement in Education, 4, 1-8.

Meehl, P. E. (1967). Theory-testing in psychology and physics: A methodological paradox. Philosophy of Science, 34(2), 103-115.

Nersessian, N. J. (2002). Maxwell and “the Method of Physical Analogy”: Model-based reasoning, generic abstraction, and conceptual change. In D. Malament (Ed.), Essays in the history and philosophy of science and mathematics (pp. 129-166). Lasalle, Illinois: Open Court.

Rasch, G. (1960). Probabilistic models for some intelligence and attainment tests (Reprint, with Foreword and Afterword by B. D. Wright, Chicago: University of Chicago Press, 1980). Copenhagen, Denmark: Danmarks Paedogogiske Institut.

Rasch, G. (1961). On general laws and the meaning of measurement in psychology. In Proceedings of the fourth Berkeley symposium on mathematical statistics and probability (pp. 321-333 [http://www.rasch.org/memo1960.pdf]). Berkeley, California: University of California Press.

Rentz, R. R., & Bashaw, W. L. (1977, Summer). The National Reference Scale for Reading: An application of the Rasch model. Journal of Educational Measurement, 14(2), 161-179.

Silva, C. C. (2007, August). The role of models and analogies in the electromagnetic theory: A historical case study. Science & Education, 16(7-8), 835-848.

Stenner, A. J. (2001). The Lexile Framework: A common metric for matching readers and texts. California School Library Journal, 25(1), 41-2.

Stenner, A. J., & Stone, M. (2003). Item specification vs. item banking. Rasch Measurement Transactions, 17(3), 929-30 [http://www.rasch.org/rmt/rmt173a.htm].

Turner, J. (1955, November). Maxwell on the method of physical analogy. British Journal for the Philosophy of Science, 6, 226-238.


How Evidence-Based Decision Making Suffers in the Absence of Theory and Instrument: The Power of a More Balanced Approach

January 28, 2010

The Basis of Evidence in Theory and Instrument

The ostensible point of basing decisions in evidence is to have reasons for proceeding in one direction versus any other. We want to be able to say why we are proceeding as we are. When we give evidence-based reasons for our decisions, we typically couch them in terms of what worked in past experience. That experience might have been accrued over time in practical applications, or it might have been deliberately arranged in one or more experimental comparisons and tests of concisely stated hypotheses.

At its best, generalizing from past experience to as yet unmet future experiences enables us to navigate life and succeed in ways that would not be possible if we could not learn and had no memories. The application of a lesson learned from particular past events to particular future events involves a very specific inferential process. To be able to recognize repeated iterations of the same things requires the accumulation of patterns of evidence. Experience in observing such patterns allows us to develop confidence in our understanding of what that pattern represents in terms of pleasant or painful consequences. When we can conceptualize and articulate a pattern, and then recognize a new occurrence of it, we have formed an idea we can put to work.

Evidence-based decision making is then a matter of formulating expectations from repeatedly demonstrated and routinely reproducible patterns of observations that lend themselves to conceptual representations, as ideas expressed in words. Linguistic and cultural frameworks selectively focus attention by projecting expectations and filtering observations into meaningful patterns represented by words, numbers, and other symbols. The point of efforts aimed at basing decisions in evidence is to try to go with the flow of this inferential process more deliberately and effectively than might otherwise be the case.

None of this is new or controversial. However, the inferential step from evidence to decision always involves unexamined and unjustified assumptions. That is, there is always an element of metaphysical faith behind the expectation that any given symbol or word is going to work as a representation of something in the same way that it has in the past. We can never completely eliminate this leap of faith, since we cannot predict the future with 100% confidence. We can, however, do a lot to reduce the size of the leap, and the risks that go with it, by questioning our assumptions in experimental research that tests hypotheses as to the invariant stability and predictive utility of the representations we make.

Theoretical and Instrumental Assumptions Hidden Behind the Evidence

For instance, evidence as to the effectiveness of an intervention or treatment is often expressed in terms of measures commonly described as quantitative. But it is unusual for any evidence to be produced justifying that description in terms of something that really adds up in the way numbers do. So we often find ourselves in situations in which our evidence is much less meaningful, reliable, and valid than we suppose it to be.

Quantitative measures are often valued as the hallmark of rational science. But their capacity to live up to this billing depends on the quality of the inferences that can be supported. Very few researchers thoroughly investigate the quality of their measures and justify the inferences they make relative to that quality.

Measurement presumes a reproducible pattern of evidence that can serve as the basis for a decision concerning how much of something has been observed. It naturally follows that we often base measurement in counts of some kind—successes, failures, ratings, frequencies, etc. The counts, scores, or sums are then often transformed into percentages by dividing them by the maximum possible that could be obtained. Sometimes the scores are averaged for each person measured, and/or for each item or question on the test, assessment, or survey. These scores and percentages are then almost universally fed directly into decision processes or statistical analyses with no further consideration.
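To make the stakes concrete, here is a minimal numeric sketch (in Python, using made-up raw scores) contrasting the familiar percent-correct scale with the log-odds (logit) transformation that measurement models build on. The raw-score log-odds computed here are only a stand-in for properly calibrated measures, but they are enough to show that equal raw-score differences do not correspond to equal differences in log-odds, especially near the floor and ceiling of an instrument.

```python
import math

# Hypothetical raw scores on a 50-item test (illustrative values only).
raw_scores = [5, 10, 25, 30, 45]
max_score = 50

for score in raw_scores:
    p = score / max_score          # percent correct, as a proportion
    logit = math.log(p / (1 - p))  # log-odds of that proportion
    print(f"{score:>2}/{max_score}  {100 * p:5.1f}%   log-odds = {logit:+.2f}")

# A "5-point gain" means different things in different parts of the scale:
# moving from 5 to 10 shifts the log-odds by about +0.81, while moving
# from 25 to 30 shifts it by only about +0.41.
```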

The reproducible pattern of evidence on which decisions are based is presumed to exist between the measures, not within them. In other words, the focus is on the group or population statistics, not on the individual measures. Attention is typically focused on the tip of the iceberg, the score or percentage, not on the much larger, but hidden, mass of information beneath it. Evidence is presumed to be sufficient to the task when the differences between groups of scores are of a consistent size or magnitude, but is this sufficient?

Going Past Assumptions to Testable Hypotheses

In other words, does not science require that evidence be explained by theory, and embodied in instrumentation that provides a shared medium of observation? As shown in the blue lines in the Figure below,

  • theory, whether or not it is explicitly articulated, inevitably influences both what counts as valid data and the configuration of the medium of its representation, the instrument;
  • data, whether or not it is systematically gathered and evaluated, inevitably influences both the medium of its representation, the instrument, and the implicit or explicit theory that explains its properties and justifies its applications; and
  • instruments, whether or not they are actually calibrated from a mapping of symbols and substantive amounts, inevitably influence data gathering and the image of the object explained by theory.

The rhetoric of evidence-based decision making skips over the roles of theory and instrumentation, drawing a direct line from data to decision. In leaving theory laxly formulated, we allow any story that makes a bit of sense and is communicated by someone with a bit of charm or power to carry the day. In not requiring calibrated instrumentation, we allow any data that cross the threshold into our awareness to serve as an acceptable basis for decisions.

What we want, however, is to require meaningful measures that really provide the evidence needed for instruments that exhibit invariant calibrations and for theories that provide predictive explanatory control over the variable. As shown in the Figure, we want data that push theory away from the instrument, theory that separates the data and instrument, and instruments that get in between the theory and data.

We all know to distrust too close a correspondence between theory and data, but we too rarely understand or capitalize on the role of the instrument in mediating the theory-data relation. Similarly, when the questions used as a medium for making observations are obviously biased to produce responses conforming overly closely with a predetermined result, we see that the theory and the instrument are too close for the data to serve as an effective mediator.

Finally, the situation predominating in the social sciences is one in which both construct and measurement theories are nearly nonexistent, which leaves data completely dependent on the instrument it came from. In other words, because counts of correct answers or sums of ratings are mistakenly treated as measures, instruments fully determine and restrict the range of measurement to that defined by the numbers of items and rating categories. Once the instrument is put in play, changes to it would make new data incommensurable with old, so, to retain at least the appearance of comparability, the data structure then fully determines and restricts the instrument.

What we want, though, is a situation in which construct and measurement theories work together to make the data autonomous of the particular instrument it came from. We want a theory that explains what is measured well enough for us to be able to modify existing instruments, or create entirely new ones, that give the same measures for the same amounts as the old instruments. We want to be able to predict item calibrations from the properties of the items, we want to obtain the same item calibrations across data sets, and we want to be able to predict measures on the basis of the observed responses (data) no matter which items or instrument was used to produce them.

Most importantly, we want a theory and practice of measurement that allows us to take missing data into account by providing us with the structural invariances we need as media for predicting the future from the past. As Ben Wright (1997, p. 34) said, any data analysis method that requires complete data to produce results disqualifies itself automatically as a viable basis for inference because we never have complete data—any practical system of measurement has to be positioned so as to be ready to receive, process, and incorporate all of the data we have yet to gather. This goal is accomplished to varying degrees in Rasch measurement (Rasch, 1960; Burdick, Stone, & Stenner, 2006; Dawson, 2004). Stenner and colleagues (Stenner, Burdick, Sanford, & Burdick, 2006) provide a trajectory of increasing degrees to which predictive theory is employed in contemporary measurement practice.
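To illustrate the practical force of Wright's point, here is a minimal sketch (in Python, with hypothetical item calibrations and responses) of how a person measure can be estimated under the dichotomous Rasch model from whatever responses happen to be available: items the person never encountered are simply absent from the response record rather than being treated as wrong or imputed. Production Rasch software uses more refined estimators and extensive fit checking; this only shows the structure of the inference.

```python
import math

def rasch_prob(theta, difficulty):
    """Probability of a correct response given person measure theta and item difficulty."""
    return 1.0 / (1.0 + math.exp(-(theta - difficulty)))

def estimate_measure(responses, difficulties, tol=1e-6, max_iter=50):
    """Newton-Raphson maximum-likelihood estimate of a person measure.

    responses: dict of item index -> 0/1; items the person never saw are simply absent.
    difficulties: previously calibrated item difficulties, in logits.
    Returns the estimated measure and its approximate standard error.
    """
    theta = 0.0
    information = 0.0
    for _ in range(max_iter):
        probs = {i: rasch_prob(theta, difficulties[i]) for i in responses}
        gradient = sum(responses[i] - probs[i] for i in probs)    # residual score
        information = sum(p * (1.0 - p) for p in probs.values())  # Fisher information
        step = gradient / information
        theta += step
        if abs(step) < tol:
            break
    return theta, 1.0 / math.sqrt(information)

# Hypothetical calibrations for ten items, in logits (illustrative only).
difficulties = [-2.0, -1.5, -1.0, -0.5, 0.0, 0.5, 1.0, 1.5, 2.0, 2.5]

# The same person responds to two different, incomplete subsets of the items.
subset_a = {0: 1, 2: 1, 4: 1, 6: 0, 8: 0}
subset_b = {1: 1, 3: 1, 5: 1, 7: 0, 9: 0}

for name, subset in [("subset A", subset_a), ("subset B", subset_b)]:
    measure, se = estimate_measure(subset, difficulties)
    print(f"{name}: measure = {measure:+.2f} logits (SE {se:.2f})")

# The two estimates are expressed in the same unit and differ by much less
# than their standard errors, even though no item is shared between subsets.
```

Because the item difficulties are expressed in a common unit, estimates from different subsets of items remain directly comparable, which is what makes incomplete data usable rather than disqualifying.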

The explanatory and predictive power of theory is embodied in instruments that focus attention on recording observations of salient phenomena. These observations become data that inform the calibration of instruments, which then are used to gather further data that can be used in practical applications and in checks on the calibrations and the theory.

“Nothing is so practical as a good theory” (Lewin, 1951, p. 169). Good theory makes it possible to create symbolic representations of things that are easy to think with. To facilitate clear thinking, our words, numbers, and instruments must be transparent. We have to be able to look right through them at the thing itself, with no concern as to distortions introduced by the instrument, the sample, the observer, the time, the place, etc. This happens only when the structure of the instrument corresponds with invariant features of the world. And where words effect this transparency to an extent, it is realized most completely when we can measure in ways that repeatedly give the same results for the same amounts in the same conditions no matter which instrument, sample, operator, etc. is involved.

Where Might Full Mathematization Lead?

The attainment of mathematical transparency in measurement is remarkable for the way it focuses attention and constrains the imagination. It is essential to appreciate the context in which this focusing occurs, as popular opinion is at odds with historical research in this regard. Over the last 60 years, historians of science have come to vigorously challenge the widespread assumption that technology is a product of experimentation and/or theory (Kuhn, 1961/1977; Latour, 1987, 2005; Maas, 2001; Mendelsohn, 1992; Rabkin, 1992; Schaffer, 1992; Heilbron, 1993; Hankins & Silverman, 1999; Baird, 2002). Neither theory nor experiment typically advances until a key technology is widely available to end users in applied and/or research contexts. Rabkin (1992) documents multiple roles played by instruments in the professionalization of scientific fields. Thus, “it is not just a clever historical aphorism, but a general truth, that ‘thermodynamics owes much more to the steam engine than ever the steam engine owed to thermodynamics’” (Price, 1986, p. 240).

The prior existence of the relevant technology comes to bear on theory and experiment again in the common, but mistaken, assumption that measures are made and experimentally compared in order to discover scientific laws. History shows that measures are rarely made until the relevant law is effectively embodied in an instrument (Kuhn, 1961/1977, pp. 218-9): “…historically the arrow of causality is largely from the technology to the science” (Price, 1986, p. 240). Instruments do not provide just measures; rather they produce the phenomenon itself in a way that can be controlled, varied, played with, and learned from (Heilbron, 1993, p. 3; Hankins & Silverman, 1999; Rabkin, 1992). The term “technoscience” has emerged as an expression denoting recognition of this priority of the instrument (Baird, 1997; Ihde & Selinger, 2003; Latour, 1987).

Because technology often dictates what, if any, phenomena can be consistently produced, it constrains experimentation and theorizing by focusing attention selectively on reproducible, potentially interpretable effects, even when those effects are not well understood (Ackermann, 1985; Daston & Galison, 1992; Ihde, 1998; Hankins & Silverman, 1999; Maasen & Weingart, 2001). Criteria for theory choice in this context stem from competing explanatory frameworks’ experimental capacities to facilitate instrument improvements, prediction of experimental results, and gains in the efficiency with which a phenomenon is produced.

In this context, the relatively recent introduction of measurement models requiring additive, invariant parameterizations (Rasch, 1960) provokes speculation as to the effect on the human sciences that might be wrought by the widespread availability of consistently reproducible effects expressed in common quantitative languages. Paraphrasing Price’s comment on steam engines and thermodynamics, might it one day be said that as yet unforeseeable advances in reading theory will owe far more to the Lexile analyzer (Stenner, et al., 2006) than ever the Lexile analyzer owed to reading theory?

Kuhn (1961/1977) speculated that the second scientific revolution of the early- to mid-nineteenth century followed in large part from the full mathematization of physics, i.e., the emergence of metrology as a professional discipline focused on providing universally accessible, theoretically predictable, and evidence-supported uniform units of measurement (Roche, 1998). Kuhn (1961/1977, p. 220) specifically suggests that a number of vitally important developments converged about 1840 (also see Hacking, 1983, p. 234). This was the year in which the metric system was formally instituted in France after 50 years of development (it had already been obligatory in other nations for 20 years at that point), and metrology emerged as a professional discipline (Alder, 2002, p. 328, 330; Heilbron, 1993, p. 274; Kula, 1986, p. 263). Daston (1992) independently suggests that the concept of objectivity came of age in the period from 1821 to 1856, and gives examples illustrating the way in which the emergence of strong theory, shared metric standards, and experimental data converged in a context of particular social mores to winnow out unsubstantiated and unsupportable ideas and contentions.

Might a similar revolution and new advances in the human sciences follow from the introduction of evidence-based, theoretically predictive, instrumentally mediated, and mathematical uniform measures? We won’t know until we try.

Figure. The Dialectical Interactions and Mutual Mediations of Theory, Data, and Instruments

Acknowledgment. These ideas have been drawn in part from long consideration of many works in the history and philosophy of science, primarily Ackermann (1985), Ihde (1991), and various works of Martin Heidegger, as well as key works in measurement theory and practice. A few obvious points of departure are listed in the references.

References

Ackermann, J. R. (1985). Data, instruments, and theory: A dialectical approach to understanding science. Princeton, New Jersey: Princeton University Press.

Alder, K. (2002). The measure of all things: The seven-year odyssey and hidden error that transformed the world. New York: The Free Press.

Aldrich, J. (1989). Autonomy. Oxford Economic Papers, 41, 15-34.

Andrich, D. (2004, January). Controversy and the Rasch model: A characteristic of incompatible paradigms? Medical Care, 42(1), I-7–I-16.

Baird, D. (1997, Spring-Summer). Scientific instrument making, epistemology, and the conflict between gift and commodity economics. Techné: Journal of the Society for Philosophy and Technology, 3-4, 25-46. Retrieved 08/28/2009, from http://scholar.lib.vt.edu/ejournals/SPT/v2n3n4/baird.html.

Baird, D. (2002, Winter). Thing knowledge – function and truth. Techné: Journal of the Society for Philosophy and Technology, 6(2). Retrieved 19/08/2003, from http://scholar.lib.vt.edu/ejournals/SPT/v6n2/baird.html.

Burdick, D. S., Stone, M. H., & Stenner, A. J. (2006). The Combined Gas Law and a Rasch Reading Law. Rasch Measurement Transactions, 20(2), 1059-60 [http://www.rasch.org/rmt/rmt202.pdf].

Carroll-Burke, P. (2001). Tools, instruments and engines: Getting a handle on the specificity of engine science. Social Studies of Science, 31(4), 593-625.

Daston, L. (1992). Baconian facts, academic civility, and the prehistory of objectivity. Annals of Scholarship, 8, 337-363. (Rpt. in L. Daston, (Ed.). (1994). Rethinking objectivity (pp. 37-64). Durham, North Carolina: Duke University Press.)

Daston, L., & Galison, P. (1992, Fall). The image of objectivity. Representations, 40, 81-128.

Dawson, T. L. (2004, April). Assessing intellectual development: Three approaches, one sequence. Journal of Adult Development, 11(2), 71-85.

Galison, P. (1999). Trading zone: Coordinating action and belief. In M. Biagioli (Ed.), The science studies reader (pp. 137-160). New York, New York: Routledge.

Hacking, I. (1983). Representing and intervening: Introductory topics in the philosophy of natural science. Cambridge: Cambridge University Press.

Hankins, T. L., & Silverman, R. J. (1999). Instruments and the imagination. Princeton, New Jersey: Princeton University Press.

Heelan, P. A. (1983, June). Natural science as a hermeneutic of instrumentation. Philosophy of Science, 50, 181-204.

Heelan, P. A. (1998, June). The scope of hermeneutics in natural science. Studies in History and Philosophy of Science Part A, 29(2), 273-98.

Heidegger, M. (1977). Modern science, metaphysics, and mathematics. In D. F. Krell (Ed.), Basic writings [reprinted from M. Heidegger, What is a thing? South Bend, Regnery, 1967, pp. 66-108] (pp. 243-282). New York: Harper & Row.

Heidegger, M. (1977). The question concerning technology. In D. F. Krell (Ed.), Basic writings (pp. 283-317). New York: Harper & Row.

Heilbron, J. L. (1993). Weighing imponderables and other quantitative science around 1800. Historical Studies in the Physical and Biological Sciences, 24(Supplement), Part I, pp. 1-337.

Hessenbruch, A. (2000). Calibration and work in the X-ray economy, 1896-1928. Social Studies of Science, 30(3), 397-420.

Ihde, D. (1983). The historical and ontological priority of technology over science. In D. Ihde, Existential technics (pp. 25-46). Albany, New York: State University of New York Press.

Ihde, D. (1991). Instrumental realism: The interface between philosophy of science and philosophy of technology. (The Indiana Series in the Philosophy of Technology). Bloomington, Indiana: Indiana University Press.

Ihde, D. (1998). Expanding hermeneutics: Visualism in science. (Northwestern University Studies in Phenomenology and Existential Philosophy). Evanston, Illinois: Northwestern University Press.

Ihde, D., & Selinger, E. (Eds.). (2003). Chasing technoscience: Matrix for materiality. (Indiana Series in Philosophy of Technology). Bloomington, Indiana: Indiana University Press.

Kuhn, T. S. (1961/1977). The function of measurement in modern physical science. Isis, 52(168), 161-193. (Rpt. In T. S. Kuhn, The essential tension: Selected studies in scientific tradition and change (pp. 178-224). Chicago: University of Chicago Press, 1977).

Kula, W. (1986). Measures and men (R. Szreter, Trans.). Princeton, New Jersey: Princeton University Press (Original work published 1970).

Lapre, M. A., & Van Wassenhove, L. N. (2002, October). Learning across lines: The secret to more efficient factories. Harvard Business Review, 80(10), 107-11.

Latour, B. (1987). Science in action: How to follow scientists and engineers through society. New York, New York: Cambridge University Press.

Latour, B. (2005). Reassembling the social: An introduction to Actor-Network-Theory. (Clarendon Lectures in Management Studies). Oxford, England: Oxford University Press.

Lewin, K. (1951). Field theory in social science: Selected theoretical papers (D. Cartwright, Ed.). New York: Harper & Row.

Maas, H. (2001). An instrument can make a science: Jevons’s balancing acts in economics. In M. S. Morgan & J. Klein (Eds.), The age of economic measurement (pp. 277-302). Durham, North Carolina: Duke University Press.

Maasen, S., & Weingart, P. (2001). Metaphors and the dynamics of knowledge. (Vol. 26. Routledge Studies in Social and Political Thought). London: Routledge.

Mendelsohn, E. (1992). The social locus of scientific instruments. In R. Bud & S. E. Cozzens (Eds.), Invisible connections: Instruments, institutions, and science (pp. 5-22). Bellingham, WA: SPIE Optical Engineering Press.

Polanyi, M. (1964/1946). Science, faith and society. Chicago: University of Chicago Press.

Price, D. J. d. S. (1986). Of sealing wax and string. In Little Science, Big Science–and Beyond (pp. 237-253). New York, New York: Columbia University Press.

Rabkin, Y. M. (1992). Rediscovering the instrument: Research, industry, and education. In R. Bud & S. E. Cozzens (Eds.), Invisible connections: Instruments, institutions, and science (pp. 57-82). Bellingham, Washington: SPIE Optical Engineering Press.

Rasch, G. (1960). Probabilistic models for some intelligence and attainment tests (Reprint, with Foreword and Afterword by B. D. Wright, Chicago: University of Chicago Press, 1980). Copenhagen, Denmark: Danmarks Paedogogiske Institut.

Roche, J. (1998). The mathematics of measurement: A critical history. London: The Athlone Press.

Schaffer, S. (1992). Late Victorian metrology and its instrumentation: A manufactory of Ohms. In R. Bud & S. E. Cozzens (Eds.), Invisible connections: Instruments, institutions, and science (pp. 23-56). Bellingham, WA: SPIE Optical Engineering Press.

Stenner, A. J., Burdick, H., Sanford, E. E., & Burdick, D. S. (2006). How accurate are Lexile text measures? Journal of Applied Measurement, 7(3), 307-22.

Thurstone, L. L. (1959). The measurement of values. Chicago: University of Chicago Press, Midway Reprint Series.

Wright, B. D. (1997, Winter). A history of social science measurement. Educational Measurement: Issues and Practice, 16(4), 33-45, 52 [http://www.rasch.org/memo62.htm].


Information and Leadership: New Opportunities for Advancing Strategy, Engaging Customers, and Motivating Employees

December 9, 2009

Or, What’s a Mathematical Model a Model Of, After All?
Or, How to Build Scale Models of Organizations and Use Them to Learn About Organizational Identity, Purpose, and Mission

William P. Fisher, Jr., Ph.D.

The greatest opportunity and most significant challenge to leadership in every area of life today is the management of information. So says Carol Bartz, CEO of Yahoo! in her entry in The Economist’s annual overview of world events, “The World in 2010.” Information can be both a blessing and a curse. The right information in the right hands at the right time is essential to effectiveness and efficiency. But unorganized and incoherent information can be worse than none at all. Too often leaders and managers are faced with deciding between gut instincts based in unaccountable intuitions and facts that are potentially seriously flawed, or that are merely presented in such overwhelming volumes as to be useless.

This situation is only going to get worse as information volumes continue to increase. The upside is that solutions exist, solutions that not only reduce data volume by factors as high as hundreds to one with no loss of information, but which also distinguish between merely apparent and really reliable information. What we have in these solutions are the means of following through on Carol Bartz’s information leadership warnings and recommendations.

Clearly communicating what matters, for instance, requires leaders to find meaning in new facts and the changing scene. They have to be able to use their vision of the organization, its mission, and its place in the world to tell what’s important and what isn’t, to put each event or opportunity in perspective. What’s more, the vision of the organization has to be dynamic. It, too, has to be able to change with the changing circumstances.

And this is where a whole new class of useful information solutions comes to bear. It may seem odd to say so, but leadership is fundamentally mathematical. You can begin to get a sense of what I mean in the ambiguity of the way leaders can be calculating. Making use of people’s skills and talents is a challenge that requires being able to assess facts and potentials in a way that intuitively gauges likelihoods of success. It is possible to lead, of course, without being manipulative; the point is that leadership requires an ability to envision and project an abstract heuristic ideal as a fundamental principle for focusing attention and separating the wheat from the chaff. A leader who dithers and wastes time and resources on irrelevancies is a contradiction in terms. An organization is supposed to have an identity, a purpose, and a mission in life independent of the local particulars of who its actual employees, customers, and suppliers are, and independent of the opportunities and challenges that arise in different times and places.

Of course, every organization is colored and shaped to some extent by every different person that comes into contact with it, and by the times and places it finds itself in. No one wants to feel like an interchangeable part in a machine, but neither does anyone want to feel completely out of place, with no role to play. If an organization were entirely dependent on the particulars of who, what, when, and where, its status as a coherent organization with an identifiable presence would be compromised. So what we need is to find the right balance between the ideal and the real, the abstract and the concrete, and, as the philosopher Paul Ricoeur put it, between belonging and distanciation.

And indeed, scientists often note that no mathematical model ever holds in every detail in the real world. That isn’t what models are intended to do, in fact. Mathematical models serve the purpose of being guides to creating meaningful, useful relationships. One of the leading lights of measurement theory, Georg Rasch, said it well some 50 years ago: models aren’t meant to be true, but to be useful.

Rasch accordingly also pointed out that, if we measure mass, force, and acceleration with enough precision, we see that even Newton’s laws of motion are not perfectly true. Measured to the nth decimal place, what we find is that observed amounts of mass, force, and acceleration form probability distributions that do indeed satisfy Newton’s laws. Even in classical physics, then, measurement models are best conceived probabilistically.
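One way to picture this point is to simulate it. The sketch below (in Python, with arbitrary illustrative values and noise levels) treats each measurement of mass and acceleration as subject to small random errors, so that the force computed from F = ma forms a distribution centered on the value the law predicts rather than reproducing it exactly.

```python
import random
import statistics

random.seed(0)

true_mass = 2.0    # kg (illustrative value)
true_accel = 9.8   # m/s^2 (illustrative value)
true_force = true_mass * true_accel

# Repeated measurements, each subject to a small random error.
observed_forces = []
for _ in range(10_000):
    measured_mass = random.gauss(true_mass, 0.01)
    measured_accel = random.gauss(true_accel, 0.05)
    observed_forces.append(measured_mass * measured_accel)

print(f"law predicts   F = {true_force:.3f} N")
print(f"observed mean  F = {statistics.mean(observed_forces):.3f} N")
print(f"observed SD      = {statistics.stdev(observed_forces):.3f} N")
# No single observation reproduces the law exactly, but the distribution
# of observations is centered on the value the law predicts.
```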

Over the last several decades, use of Rasch’s probabilistic measurement models in scaling tests, surveys, and assessments has grown exponentially. As has been explored at length in previous posts in this blog, most applications of Rasch’s models mistakenly treat them as statistical models, and so their real value and importance are missed. But even those actively engaged in using the models appropriately often do not engage with the basic question concerning what the model is a model of, in their particular application of it. The basic assumption seems to be that the model is a mathematical representation of relations between observations recorded in a data set, but this is an extremely narrow and unproductive point of view.

Let’s ask ourselves, instead, how we would model an organization. Why would we want to do that? We would want to do that for the same reasons we model anything, such as creating a safe and efficient way of experimenting with different configurations, and of coming to new understandings of basic principles. If we had a standard model of organizations of a certain type, or of organizations in a particular industry, we could use it to see how different variations on the basic structure and processes cause or are associated with different outcomes. Further, given that such models could be used to calibrate scales meaningfully measuring organizational development, industry-wide standards could be brought to bear in policy, decision making, and education, effecting new degrees of efficiency and effectiveness.

So, we’d previously said that the extent to which an organization finds its identity, realizes its purpose, and advances its mission (i.e., develops) is, within certain limits, a function of its capacity to be independent from local particulars. What we mean by this is that we expect employees to be able to perform their jobs no matter what day of the week it is, no matter who the customer is, no matter which particular instance of a product is involved, etc. Though no amount of skill, training, or experience can prepare someone for every possible contingency, people working in a given job description prepare themselves for a certain set of tasks, and are chosen by the organization for their capacities in that regard.

Similarly, we expect policies, job descriptions, work flows, etc. to function in similar fashions. Though the exact specifics of each employee’s abilities and each situation’s demands cannot be known in advance, enough is known that the defined aims will be achieved with high degrees of success. Of course, this is the point at which the interchangeability of employee ability and task difficulty can become demeaning and alienating. It will be important that we allow room for some creative play, and situate each level of ability along a continuum that allows everyone to see a developmental trajectory personalized to their particular strengths and needs.

So, how do we mathematically model the independence of the organization from its employees, policies, customers, and challenges, and scientifically evaluate that independence?

One way to begin is to posit that organizational development is equal to the differences between the abilities of the people employed; the efficiencies of the policies, alignments, and linkages implemented; and the challenges presented by the market. If we observe the abilities, efficiencies, and challenges by means of a rating scale, the resulting model could be written as:

ln(P_moas / (1 - P_moas)) = b_m - f_o - c_a - r_s

which hypothesizes that the natural logarithm of the response odds (the response probabilities divided by one minus themselves) is equal to the ability b_m of employee m minus the efficiency f_o of policy o minus the challenge c_a of market a minus the difficulty r_s of obtaining a rating in category s. This model has the form of a multifaceted Rasch model (Linacre, 1989, among others), used in academic research, rehabilitative functional assessments, and medical licensure testing.
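Worked out concretely, the model assigns a probability to each rating category once numbers are plugged in for its parameters. The sketch below (in Python, with entirely hypothetical parameter values) uses the adjacent-category formulation commonly associated with rating-scale and many-facet Rasch models; it illustrates the structure of the model, not any particular organization's data.

```python
import math

def category_probabilities(b_m, f_o, c_a, thresholds):
    """Category probabilities for a many-facet Rasch rating model.

    b_m: employee ability, f_o: policy efficiency, c_a: market challenge,
    thresholds: difficulties r_s of moving into rating category s (s = 1..K).
    Uses the adjacent-category formulation: the log-odds of category s
    versus category s-1 equals b_m - f_o - c_a - r_s.
    """
    # Cumulative sums of the adjacent-category logits, with category 0 as baseline.
    logits = [0.0]
    for r_s in thresholds:
        logits.append(logits[-1] + (b_m - f_o - c_a - r_s))
    denom = sum(math.exp(x) for x in logits)
    return [math.exp(x) / denom for x in logits]

# Hypothetical parameter values, in logits (illustrative only).
ability = 1.2                  # b_m: a fairly capable employee
efficiency = 0.3               # f_o: a moderately efficient policy
challenge = 0.5                # c_a: a moderately demanding market
thresholds = [-1.0, 0.0, 1.0]  # r_s for a four-category rating scale (0-3)

probs = category_probabilities(ability, efficiency, challenge, thresholds)
for category, p in enumerate(probs):
    print(f"P(rating = {category}) = {p:.2f}")

# The probabilities sum to 1.0; raising the ability term relative to the
# policy, market, and threshold terms shifts probability toward higher categories.
```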

What does it take for each of these model parameters to be independent of the others in the manner that we take for granted in actual practice? Can we frame our observations of the members of each facet in the model in ways that will clearly show us when we have failed to obtain the desired independence? Can we do that in a way that simultaneously provides us with a means for communicating information about individual employees, policies, and challenges efficiently in a common language?

Can that common language be expressed in words and numbers that capitalize on the independence of the model parameters and so mean the same thing across local particulars? Can we set up a system for checking and maintaining the meaning of the parameters over time? Can we build measures of employee abilities, policy efficiencies, and market challenges into our information systems in useful ways? Can we improve the overall quality, efficiency, and meaningfulness of our industry by collaborating with other firms, schools, non-profits, and government agencies in the development of reference standard metrics?

These questions all have the same answer: Yes, we can. These questions set the stage for understanding how effective leadership depends on effective information management. If, as Yahoo! CEO Bartz says, leadership has become more difficult in the age of blogospherical second-guessing and “opposition research,” why not tap all of that critical energy as a resource and put it to work figuring out what differences make a difference? If critics think they have important questions that need to be answered, the independence and consistency, or lack thereof, of their and others’ responses gives real heft to a “put-up-or-shut-up” criterion for distinguishing signal from noise.

This kind of a BS-detector supports leadership in two ways, by focusing attention on meaningful information, and by highlighting significant divergences from accepted opinion. The latter might turn out to be nothing more than exceptionally loud noise, but it might also signal something very important, a contrary opinion sensitive to special information available only from a particular perspective.

Bartz is right on, then, in saying that the central role of information in leadership has made listening and mentoring more important than ever. Modeling the organization and experimenting with it makes it possible to listen and mentor in completely new ways. Testing data for independent model parameters is akin to tuning the organization like an instrument. When independence is achieved, everything harmonizes. The path forward is clear, since the ratings delineate the range in which organizational performance consistently varies.

Variation in the measures is illustrated by the hierarchy of the policy and market items rated, which take positions in their distributions showing what consistently comes first, and what precedents have to be set for later challenges to be met successfully. By demanding that the model parameters be independent of one another, we have set ourselves up to learn something from the past that can be used to predict the future.

Further and quite importantly, as experience is repeatedly related to these quantitatively-scaled hierarchies, the factors that make policies and challenges take particular positions on the ruler come to be understood, theory is refined, and leadership gains an edge. Now, it is becoming possible to predict where new policies and challenges will fall on the measurement continuum, making it possible for more rapid responses and earlier anticipations of previously unseen opportunities.

It’s a different story, though, when dependencies emerge, as when one or more employees in a particular area unexpectedly disagree with otherwise broadly accepted policy efficiencies or market challenges, or when a particular policy provokes anomalous evaluations relative to some market challenges but not others. There’s a qualitatively different kind of learning that takes place when expectations are refuted. Instead of getting an answer to the question we asked, we got an answer to one we didn’t ask.

It might just be noise or error, but it is imperative to ask and find out what question the unexpected answer responds to. Routine management thrives on learning how to ever more efficiently predict quantitative results; its polar opposite, innovation, lives on the mystery of unexpected anomalies. If someone hadn’t been able to wonder what value hardened rubber left on a stove might have, what might have killed bacteria in a petri dish, or why an experimental effect disappeared when a lead plate was moved, vulcanized tires, penicillin, and X-ray devices might never have come about.

We are on the cusp of the information analogues of these ground-breaking innovations. Methods of integrating rigorously scientific quantities with qualitative creative grist clarify information in previously unimagined ways, and in so doing make it more leverageable than ever before for advancing strategy, engaging customers, and motivating employees.

The only thing in Carol Bartz’s article that I might take issue with comes in the first line, with the words “will be.” The truth is that information already is our greatest opportunity.


Statistics and Measurement: Clarifying the Differences

August 26, 2009

Measurement is qualitatively and paradigmatically quite different from statistics, even though statistics obviously play important roles in measurement, and vice versa. The perception of measurement as conceptually difficult stems in part from its rearrangement of most of the concepts that we take for granted in the statistical paradigm as landmarks of quantitative thinking. When we recognize and accept the qualitative differences between statistics and measurement, they both become easier to understand.

Statistical analyses are commonly referred to as quantitative, even though the numbers analyzed most usually have not been derived from the mapping of an invariant substantive unit onto a number line. Measurement takes such mapping as its primary concern, focusing on the quantitative meaningfulness of numbers (Falmagne & Narens, 1983; Luce, 1978; Marcus-Roberts & Roberts, 1987; Mundy, 1986; Narens, 2002; Roberts, 1999). Statistical models focus on group processes and relations among variables, while measurement models focus on individual processes and relations within variables (Duncan, 1992; Duncan & Stenbeck, 1988; Rogosa, 1987). Statistics makes assumptions about factors beyond its control, while measurement sets requirements for objective inference (Andrich, 1989). Statistics primarily involves data analysis, while measurement primarily calibrates instruments in common metrics for interpretation at the point of use (Cohen, 1994; Fisher, 2000; Guttman, 1985; Goodman, 1999a-c; Rasch, 1960).

Statistics focuses on making the most of the data in hand, while measurement focuses on using the data in hand to inform (a) instrument calibration and improvement, and (b) the prediction and efficient gathering of meaningful new data on individuals in practical applications. Where statistical “measures” are defined inherently by a particular analytic method, measures read from calibrated instruments—and the raw observations informing these measures—need not be computerized for further analysis.

Because statistical “measures” are usually derived from ordinal raw scores, changes to the instrument change their meaning, resulting in a strong inclination to avoid improving the instrument. Measures, in contrast, take missing data into account, so their meaning remains invariant over instrument configurations, resulting in a firm basis for the emergence of a measurement quality improvement culture. So statistical “measurement” begins and ends with data analysis, whereas measurement from calibrated instruments is in a constant cycle of application, new item calibrations, and critical recalibrations that require only intermittent resampling.

The vast majority of statistical methods and models make strong assumptions about the nature of the unit of measurement, but provide either very limited ways of checking those assumptions, or no checks at all. Statistical models are descriptive in nature, meaning that models are fit to data, that the validity of the data is beyond the immediate scope of interest, and that the model accounting for the most variation is regarded as best. Finally, and perhaps most importantly, statistical models are inherently oriented toward the relations among variables at the level of samples and populations.

Measurement models, however, impose strong requirements on data quality in order to achieve the unit of measurement that is easiest to think with, one that stays constant and remains invariant across the local particulars of instrument and sample. Measurement methods and models, then, provide extensive and varied ways of checking the quality of the unit, and so must be prescriptive rather than descriptive. That is, measurement models define the data quality that must be obtained for objective inference. In the measurement paradigm, data are fit to models, data quality is of paramount interest, and data quality evaluation must be informed as much by qualitative criteria as by quantitative.

To repeat the most fundamental point, measurement models are oriented toward individual-level response processes, not group-level aggregate processes. Herbert Blumer pointed out as early as 1930 that quantitative method is not equivalent to statistical method, and that the natural sciences had conspicuous degrees of success long before the emergence of statistical techniques (Hammersley, 1989, pp. 113-4). Both the initial scientific revolution in the 16th-17th centuries and the second scientific revolution of the 19th century found a basis in measurement for publicly objective and reproducible results, but statistics played little or no role in the major discoveries of the times.

The scientific value of statistics resides largely in the reproducibility of cross-variable data relations, and statisticians widely agree that statistical analyses should depend only on sufficient statistics (Arnold, 1982, p. 79). Measurement theoreticians and practitioners also agree, but the sufficiency of the mean and standard deviation relative to a normal distribution is one thing, and the sufficiency of individual responses relative to an invariant construct is quite another (Andersen, 1977; Arnold, 1985; Dynkin, 1951; Fischer, 1981; Hall, Wijsman, & Ghosh, 1965; van der Linden, 1992).

It is of historical interest, though, to point out that Rasch, foremost proponent of the latter, attributes credit for the general value of the concept of sufficiency to Ronald Fisher, foremost proponent of the former. Rasch’s strong statements concerning the fundamental inferential value of sufficiency (Andrich, 1997; Rasch, 1977; Wright, 1980) would seem to contradict his repeated joke about burning all the statistics texts making use of the normal distribution (Andersen, 1995, p. 385) were it not for the paradigmatic distinction between statistical models of group-level relations among variables, and measurement models of individual processes. Indeed, this distinction is made on the first page of Rasch’s (1980) book.

Now we are in a position to appreciate a comment by Ernest Rutherford, the winner of the 1908 Nobel Prize in Chemistry, who held that, if you need statistics to understand the results of your experiment, then you should have designed a better experiment (Wise, 1995, p. 11). A similar point was made by Feinstein (1995) concerning meta-analysis. The rarely appreciated point is that the generalizable replication and application of results depends heavily on the existence of a portable and universally uniform observational framework. The inferences, judgments, and adjustments that can be made at the point of use by clinicians, teachers, managers, etc. provided with additive measures expressed in a substantively meaningful common metric far outstrip those that can be made using ordinal measures expressed in instrument- and sample-dependent scores. See Andrich (1989, 2002, 2004), Cohen (1994), Davidoff (1999), Duncan (1992), Embretson (1996), Goodman (1999a, 1999b, 1999c), Guttman (1981, 1985), Meehl (1967), Michell (1986), Rogosa (1987), Romanowski and Douglas (2002), and others for more on this distinction between statistics and measurement.

These contrasts show that the confounding of statistics and measurement is a problem of vast significance that persists in spite of repeated efforts to clarify the distinction. For a wide variety of reasons ranging from cultural presuppositions about the nature of number to the popular notion that quantification is as easy as assigning numbers to observations, measurement is not generally well understood by the public (or even by statisticians!). And so statistics textbooks rarely, if ever, include even passing mention of instrument calibration methods, metric equating processes, the evaluation of data quality relative to the requirements of objective inference, traceability to metrological reference standards, or the integration of qualitative and quantitative methods in the interpretation of measures.

Similarly, in business, marketing, health care, and quality improvement circles, we find near-universal repetition of the mantra, “You manage what you measure,” with very little or no attention paid to the quality of the numbers treated as measures. And so, we find ourselves stuck with so-called measurement systems where,

• instead of linear measures defined by a unit that remains constant across samples and instruments we saddle ourselves with nonlinear scores and percentages defined by units that vary in unknown ways across samples and instruments;
• instead of availing ourselves of the capacity to take missing data into account, we hobble ourselves with the need for complete data;
• instead of dramatically reducing data volume with no loss of information, we insist on constantly re-enacting the meaningless ritual of poring over indigestible masses of numbers;
• instead of adjusting measures for the severity or leniency of judges assigning ratings, we allow measures to depend unfairly on which rater happens to make the observations;
• instead of using methods that give the same result across different distributions, we restrict ourselves to ones that give different results when assumptions of normality are not met and/or standard deviations differ;
• instead of calibrating instruments in an experimental test of the hypothesis that the intended construct is in fact structured in such a way as to make its mapping onto a number line meaningful, we assign numbers and make quantitative inferences with no idea as to whether they relate at all to anything real;
• instead of checking to see whether rating scales work as intended, with higher ratings consistently representing more of the variable, we make assumptions that may be contradicted by the order and spacing of the way rating scales actually work in practice (a simple version of this check is sketched just after this list);
• instead of defining a comprehensive framework for interpreting measures relative to a construct, we accept the narrow limits of frameworks defined by the local sample and items;
• instead of capitalizing on the practicality and convenience of theories capable of accurately predicting item calibrations and measures apart from data, we counterproductively define measurement empirically in terms of data analysis;
• instead of putting calibrated tools into the hands of front-line managers, service representatives, teachers and clinicians, we require them to submit to cumbersome data entry, analysis, and reporting processes that defeat the purpose of measurement by ensuring the information provided is obsolete by the time it gets back to the person who could act on it; and
• instead of setting up efficient systems for communicating meaningful measures in common languages with shared points of reference, we settle for inefficient systems for communicating meaningless scores in local incommensurable languages.
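As one small illustration of the kind of checking called for in the list above, here is a minimal sketch (in Python, with made-up person measures and ratings) of the "average measure by category" diagnostic used in Rasch practice: if higher rating categories do not attract progressively higher average measures, the rating scale is not functioning as assumed.

```python
from statistics import mean

def average_measure_by_category(measures, ratings):
    """Mean person measure observed in each rating category.

    measures: person measures (logits) from a calibrated instrument.
    ratings: the rating category each person selected on one item.
    A working rating scale should show these averages increasing with category.
    """
    by_category = {}
    for measure, rating in zip(measures, ratings):
        by_category.setdefault(rating, []).append(measure)
    return {category: mean(vals) for category, vals in sorted(by_category.items())}

# Hypothetical person measures and their ratings on a single survey item.
measures = [-1.8, -1.2, -0.9, -0.3, 0.1, 0.4, 0.8, 1.1, 1.6, 2.0]
ratings  = [0, 0, 1, 1, 2, 1, 2, 3, 3, 3]

for category, avg in average_measure_by_category(measures, ratings).items():
    print(f"category {category}: average measure = {avg:+.2f}")

# Disordered averages (a higher category attracting lower measures on average)
# would flag a scale that is not working as intended.
```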

Because measurement is simultaneously ubiquitous and rarely well understood, we find ourselves in a world that gives near-constant lip service to the importance of measurement while it does almost nothing to provide measures that behave the way we assume they do. This state of affairs seems to have emerged in large part due to our failure to distinguish between the group-level orientation of statistics and the individual-level orientation of measurement. We seem to have been seduced by a variation on what Whitehead (1925, pp. 52-8) called the fallacy of misplaced concreteness. That is, we have assumed that the power of lawful regularities in thought and behavior would be best revealed and acted on via statistical analyses of data that themselves embody the aggregate mass of the patterns involved.

It now appears, however, in light of studies in the history of science (Latour, 1987, 2005; Wise, 1995), that an alternative and likely more successful approach will be to capitalize on the “wisdom of crowds” (Surowiecki, 2004) phenomenon of collective, distributed cognition (Akkerman, et al., 2007; Douglas, 1986; Hutchins, 1995; Magnus, 2007). This will be done by embodying lawful regularities in instruments calibrated in ideal, abstract, and portable metrics put to work by front-line actors on mass scales (Fisher, 2000, 2005, 2009a, 2009b). In this way, we will inform individual decision processes and structure communicative transactions with efficiencies, meaningfulness, substantive effectiveness, and power that go far beyond anything that could be accomplished by trying to make inferences about individuals from group-level statistics.

We ought not accept the factuality of data as the sole criterion of objectivity, with all theory and instruments constrained by and focused on the passing ephemera of individual sets of local particularities. Properly defined and operationalized via a balanced interrelation of theory, data, and instrument, advanced measurement is not a mere mathematical exercise but offers a wealth of advantages and conveniences that cannot otherwise be obtained. We ignore its potentials at our peril.

References
Akkerman, S., Van den Bossche, P., Admiraal, W., Gijselaers, W., Segers, M., Simons, R.-J., et al. (2007, February). Reconsidering group cognition: From conceptual confusion to a boundary area between cognitive and socio-cultural perspectives? Educational Research Review, 2, 39-63.

Andersen, E. B. (1977). Sufficient statistics and latent trait models. Psychometrika, 42(1), 69-81.

Andersen, E. B. (1995). What George Rasch would have thought about this book. In G. H. Fischer & I. W. Molenaar (Eds.), Rasch models: Foundations, recent developments, and applications (pp. 383-390). New York: Springer-Verlag.

Andrich, D. (1989). Distinctions between assumptions and requirements in measurement in the social sciences. In J. A. Keats, R. Taft, R. A. Heath & S. H. Lovibond (Eds.), Mathematical and Theoretical Systems: Proceedings of the 24th International Congress of Psychology of the International Union of Psychological Science, Vol. 4 (pp. 7-16). North-Holland: Elsevier Science Publishers.

Andrich, D. (1997). Georg Rasch in his own words [excerpt from a 1979 interview]. Rasch Measurement Transactions, 11(1), 542-3. [http://www.rasch.org/rmt/rmt111.htm#Georg].

Andrich, D. (2002). Understanding resistance to the data-model relationship in Rasch’s paradigm: A reflection for the next generation. Journal of Applied Measurement, 3(3), 325-59.

Andrich, D. (2004, January). Controversy and the Rasch model: A characteristic of incompatible paradigms? Medical Care, 42(1), I-7–I-16.

Arnold, S. F. (1982-1988). Sufficient statistics. In S. Kotz, N. L. Johnson & C. B. Read (Eds.), Encyclopedia of Statistical Sciences (pp. 72-80). New York: John Wiley & Sons.

Arnold, S. F. (1985, September). Sufficiency and invariance. Statistics & Probability Letters, 3, 275-279.

Cohen, J. (1994). The earth is round (p < 0.05). American Psychologist, 49, 997-1003.

Davidoff, F. (1999, 15 June). Standing statistics right side up (Editorial). Annals of Internal Medicine, 130(12), 1019-1021.

Douglas, M. (1986). How institutions think. Syracuse, New York: Syracuse University Press.

Duncan, O. D. (1992, September). What if? Contemporary Sociology, 21(5), 667-668.

Duncan, O. D., & Stenbeck, M. (1988). Panels and cohorts: Design and model in the study of voting turnout. In C. C. Clogg (Ed.), Sociological Methodology 1988 (pp. 1-35). Washington, DC: American Sociological Association.

Dynkin, E. B. (1951). Necessary and sufficient statistics for a family of probability distributions. Selected Translations in Mathematical Statistics and Probability, 1, 23-41.

Embretson, S. E. (1996, September). Item Response Theory models and spurious interaction effects in factorial ANOVA designs. Applied Psychological Measurement, 20(3), 201-212.

Falmagne, J.-C., & Narens, L. (1983). Scales and meaningfulness of quantitative laws. Synthese, 55, 287-325.

Feinstein, A. R. (1995, January). Meta-analysis: Statistical alchemy for the 21st century. Journal of Clinical Epidemiology, 48(1), 71-79.

Fischer, G. H. (1981, March). On the existence and uniqueness of maximum-likelihood estimates in the Rasch model. Psychometrika, 46(1), 59-77.

Fisher, W. P., Jr. (2000). Objectivity in psychosocial measurement: What, why, how. Journal of Outcome Measurement, 4(2), 527-563.

Fisher, W. P., Jr. (2005). Daredevil barnstorming to the tipping point: New aspirations for the human sciences. Journal of Applied Measurement, 6(3), 173-9.

Fisher, W. P., Jr. (2009a). Bringing human, social, and natural capital to life: Practical consequences and opportunities. In M. Wilson, K. Draney, N. Brown & B. Duckor (Eds.), Advances in Rasch Measurement, Vol. Two (in press). Maple Grove, MN: JAM Press.

Fisher, W. P., Jr. (2009b, July). Invariance and traceability for measures of human, social, and natural capital: Theory and application. Measurement (Elsevier), in press.

Goodman, S. N. (1999a, 6 April). Probability at the bedside: The knowing of chances or the chances of knowing? (Editorial). Annals of Internal Medicine, 130(7), 604-6.

Goodman, S. N. (1999b, 15 June). Toward evidence-based medical statistics. 1: The p-value fallacy. Annals of Internal Medicine, 130(12), 995-1004.

Goodman, S. N. (1999c, 15 June). Toward evidence-based medical statistics. 2: The Bayes factor. Annals of Internal Medicine, 130(12), 1005-1013.

Guttman, L. (1981). What is not what in theory construction. In I. Borg (Ed.), Multidimensional data representations: When & why. Ann Arbor, MI: Mathesis Press.

Guttman, L. (1985). The illogic of statistical inference for cumulative science. Applied Stochastic Models and Data Analysis, 1, 3-10.

Hall, W. J., Wijsman, R. A., & Ghosh, J. K. (1965). The relationship between sufficiency and invariance with applications in sequential analysis. Annals of Mathematical Statistics, 36, 575-614.

Hammersley, M. (1989). The dilemma of qualitative method: Herbert Blumer and the Chicago Tradition. New York: Routledge.

Hutchins, E. (1995). Cognition in the wild. Cambridge, Massachusetts: MIT Press.

Latour, B. (1987). Science in action: How to follow scientists and engineers through society. New York: Cambridge University Press.

Latour, B. (1995). Cogito ergo sumus! Or psychology swept inside out by the fresh air of the upper deck: Review of Hutchins’ Cognition in the Wild, MIT Press, 1995. Mind, Culture, and Activity: An International Journal, 3(192), 54-63.

Latour, B. (2005). Reassembling the social: An introduction to Actor-Network-Theory. (Clarendon Lectures in Management Studies). Oxford, England: Oxford University Press.

Luce, R. D. (1978, March). Dimensionally invariant numerical laws correspond to meaningful qualitative relations. Philosophy of Science, 45, 1-16.

Magnus, P. D. (2007). Distributed cognition and the task of science. Social Studies of Science, 37(2), 297-310.

Marcus-Roberts, H., & Roberts, F. S. (1987). Meaningless statistics. Journal of Educational Statistics, 12(4), 383-394.

Meehl, P. E. (1967). Theory-testing in psychology and physics: A methodological paradox. Philosophy of Science, 34(2), 103-115.

Michell, J. (1986). Measurement scales and statistics: A clash of paradigms. Psychological Bulletin, 100, 398-407.

Mundy, B. (1986, June). On the general theory of meaningful representation. Synthese, 67(3), 391-437.

Narens, L. (2002, December). A meaningful justification for the representational theory of measurement. Journal of Mathematical Psychology, 46(6), 746-68.

Rasch, G. (1960). Probabilistic models for some intelligence and attainment tests (Reprint, with Foreword and Afterword by B. D. Wright, Chicago: University of Chicago Press, 1980). Copenhagen, Denmark: Danmarks Paedogogiske Institut.

Rasch, G. (1977). On specific objectivity: An attempt at formalizing the request for generality and validity of scientific statements. Danish Yearbook of Philosophy, 14, 58-94.

Roberts, F. S. (1999). Meaningless statements. In R. Graham, J. Kratochvil, J. Nesetril & F. Roberts (Eds.), Contemporary trends in discrete mathematics, DIMACS Series, Volume 49 (pp. 257-274). Providence, RI: American Mathematical Society.

Rogosa, D. (1987). Casual [sic] models do not support scientific conclusions: A comment in support of Freedman. Journal of Educational Statistics, 12(2), 185-95.

Romanoski, J. T., & Douglas, G. (2002). Rasch-transformed raw scores and two-way ANOVA: A simulation analysis. Journal of Applied Measurement, 3(4), 421-430.

Stevens, S. S. (1951). Mathematics, measurement, and psychophysics. In S. S. Stevens (Ed.), Handbook of experimental psychology (pp. 1-49). New York: John Wiley & Sons.

Surowiecki, J. (2004). The wisdom of crowds: Why the many are smarter than the few and how collective wisdom shapes business, economies, societies and nations. New York: Doubleday.

van der Linden, W. J. (1992). Sufficient and necessary statistics. Rasch Measurement Transactions, 6(3), 231 [http://www.rasch.org/rmt/rmt63d.htm].

Whitehead, A. N. (1925). Science and the modern world. New York: Macmillan.

Wise, M. N. (Ed.). (1995). The values of precision. Princeton, New Jersey: Princeton University Press.

Wright, B. D. (1980). Foreword, Afterword. In G. Rasch, Probabilistic models for some intelligence and attainment tests (pp. ix-xix, 185-199). Chicago, Illinois: University of Chicago Press. [Reprint; original work published in 1960 by the Danish Institute for Educational Research; http://www.rasch.org/memo63.htm]
