Archive for the ‘Probabilistic models’ Category

2011 IMEKO Conference Papers Published Online

January 13, 2012

Papers from the Joint International IMEKO TC1+ TC7+ TC13 Symposium, held August 31st to September 2nd, 2011, in Jena, Germany, are now available online at http://www.db-thueringen.de/servlets/DerivateServlet/Derivate-24575/IMEKO2011_TOC.pdf. The following will be of particular interest to those working on measurement applications in the social sciences, education, health care, and psychology:

Nikolaus Bezruczko
Foundational Imperatives for Measurement with Mathematical Models
http://www.db-thueringen.de/servlets/DerivateServlet/Derivate-24419/ilm1-2011imeko-030.pdf

Nikolaus Bezruczko, Shu-Pi C. Chen, Connie Hill, Joyce M. Chesniak
A Clinical Scale for Measuring Functional Caregiving of Children Assisted with Medical Technologies
http://www.db-thueringen.de/servlets/DerivateServlet/Derivate-24507/ilm1-2011imeko-032.pdf

Stefan Cano, Anne F. Klassen, Andrea L. Pusic
From Breast-Q © to Q-Score ©: Using Rasch Measurement to Better Capture Breast Surgery Outcomes
http://www.db-thueringen.de/servlets/DerivateServlet/Derivate-24429/ilm1-2011imeko-039.pdf

Gordon A. Cooper, William P. Fisher, Jr.
Continuous Quantity and Unit; Their Centrality to Measurement
http://www.db-thueringen.de/servlets/DerivateServlet/Derivate-24494/ilm1-2011imeko-019.pdf

William P. Fisher, Jr.
Measurement, Metrology and the Coordination of Sociotechnical Networks
http://www.db-thueringen.de/servlets/DerivateServlet/Derivate-24491/ilm1-2011imeko-017.pdf

William P. Fisher, Jr., A. Jackson Stenner
A Technology Roadmap for Intangible Assets Metrology
http://www.db-thueringen.de/servlets/DerivateServlet/Derivate-24493/ilm1-2011imeko-018.pdf

Carl V. Granger, Nikolaus Bezruczko
Body, Mind, and Spirit are Instrumental to Functional Health: A Case Study
http://www.db-thueringen.de/servlets/DerivateServlet/Derivate-24494/ilm1-2011imeko-019.pdf

Thomas Salzberger
The Quantification of Latent Variables in the Social Sciences: Requirements for Scientific Measurement and Shortcomings of Current Procedures
http://www.db-thueringen.de/servlets/DerivateServlet/Derivate-24417/ilm1-2011imeko-029.pdf

A. Jackson Stenner, Mark Stone, Donald Burdick
How to Model and Test for the Mechanisms that Make Measurement Systems Tick
http://www.db-thueringen.de/servlets/DerivateServlet/Derivate-24416/ilm1-2011imeko-027.pdf

Mark Wilson
The Role of Mathematical Models in Measurement: A Perspective from Psychometrics
http://www.db-thueringen.de/servlets/DerivateServlet/Derivate-24178/ilm1-2011imeko-005.pdf

Also of interest will be Karl Ruhm’s plenary lecture and papers from the Fundamentals of Measurement Science session and the Special Session on the Role of Mathematical Models in Measurement:

Karl H. Ruhm
From Verbal Models to Mathematical Models – A Didactical Concept not just in Metrology
http://www.db-thueringen.de/servlets/DerivateServlet/Derivate-24167/ilm1-2011imeko-002.pdf

Alessandro Giordani, Luca Mari
Quantity and Quantity Value
http://www.db-thueringen.de/servlets/DerivateServlet/Derivate-24414/ilm1-2011imeko-025.pdf

Eric Benoit
Uncertainty in Fuzzy Scales Based Measurements
http://www.db-thueringen.de/servlets/DerivateServlet/Derivate-24415/ilm1-2011imeko-020.pdf

Susanne C.N. Töpfer
Application of Mathematical Models in Optical Coordinate Metrology
http://www.db-thueringen.de/servlets/DerivateServlet/Derivate-24445/ilm1-2011imeko-008.pdf

Giovanni Battista Rossi
Measurement Modelling: Foundations and Probabilistic Approach
http://www.db-thueringen.de/servlets/DerivateServlet/Derivate-24446/ilm1-2011imeko-009.pdf

Sanowar H. Khan, Ludwik Finkelstein
The Role of Mathematical Modelling in the Analysis and Design of Measurement Systems
http://www.db-thueringen.de/servlets/DerivateServlet/Derivate-24448/ilm1-2011imeko-010.pdf

Roman Z. Morawski
Application-Oriented Approach to Mathematical Modelling of Measurement Processes
http://www.db-thueringen.de/servlets/DerivateServlet/Derivate-24449/ilm1-2011imeko-011.pdf

Creative Commons License
LivingCapitalMetrics Blog by William P. Fisher, Jr., Ph.D. is licensed under a Creative Commons Attribution-Noncommercial-No Derivative Works 3.0 United States License.
Based on a work at livingcapitalmetrics.wordpress.com.
Permissions beyond the scope of this license may be available at http://www.livingcapitalmetrics.com.


A New Agenda for Measurement Theory and Practice in Education and Health Care

April 15, 2011

Two key issues on my agenda offer different answers to the question “Why do you do things the way you do in measurement theory and practice?”

First, we can take up the “Because of…” answer to this question. We need to articulate an historical account of measurement that does three things:

  1. builds on Rasch’s use of Maxwell’s method of analogy, employing and extending it in new applications;
  2. unifies the vocabulary and concepts of measurement across the sciences into a single framework, so far as possible, by situating probabilistic models of invariant individual-level, within-variable phenomena in the context of measurement’s GIGO principle and data-to-model fit, as distinct from the interactions of group-level, between-variable phenomena in the context of statistics’ model-to-data fit; and
  3. stresses the social, collective cognition facilitated by networks of individuals whose point-of-use, measurement-informed decisions and behaviors are coordinated and harmonized virtually, at a distance, with no need for communication or negotiation.

We need multiple publications in leading journals on these issues, as well as one or more books that people can cite as a way of making this real and true history of measurement, properly speaking, credible and accepted in the mainstream. A draft article of my own in this vein is available at http://ssrn.com/abstract=1698919; I offer it for critique, and other material is available on request. Anyone who works on this paper with me and makes a substantial contribution to its publication will be added as co-author.

Second, we can take up the “In order that…” answer to the question “Why do you do things the way you do?” From this point of view, we need to broaden the scope of the measurement research agenda beyond data analysis, estimation, models, and fit assessment in three ways:

  1. by emphasizing predictive construct theories that exhibit the fullest possible understanding of what is measured and so enable the routine reproduction of desired proportionate effects efficiently, with no need to analyze data to obtain an estimate;
  2. by defining the standard units to which all calibrated instruments measuring given constructs are traceable; and
  3. by disseminating to front line users on mass scales instruments measuring in publicly available standard units and giving immediate feedback at the point of use.

These two sets of issues define a series of talking points that together constitute a new narrative for measurement in education, psychology, health care, and many other fields. We and others may see our way to organizing new professional societies, new journals, new university-based programs of study, etc. around these principles.


A Second Simple Example of Measurement’s Role in Reducing Transaction Costs, Enhancing Market Efficiency, and Enabling the Pricing of Intangible Assets

March 9, 2011

The prior post here showed why we should not confuse counts of things with measures of amounts, though counts are the natural starting place from which to begin constructing measures. That first simple example focused on an analogy between counting oranges and measuring the weight of oranges, versus counting correct answers on tests and measuring amounts of ability. This second example extends the first by showing what happens when we want to aggregate value not just across different counts of some one thing but across different counts of different things. The point will be, in effect, to show how the relative values of apples, oranges, grapes, and bananas can be put into a common frame of reference and compared in a practical and convenient way.

For instance, you may go into a grocery store to buy raspberries and blackberries, and I go in to buy cantaloupe and watermelon. Your cost per individual fruit will be very low, and mine will be very high, but neither of us will find this annoying, confusing, or inconvenient because your fruits are very small, and mine, very large. Conversely, your cost per kilogram will be much higher than mine, but this won’t cause either of us any distress because we both recognize the differences in the labor, handling, nutritional, and culinary value of our purchases.

But what happens when we try to purchase something as complex as a unit of socioeconomic development? The eight UN Millennium Development Goals (MDGs) represent a start at a systematic effort to bring human, social, and natural capital together into the same economic and accountability framework as liquid and manufactured capital and property. But that effort is stymied by the inefficiency and cost of making and using measures of the goals achieved. The existing MDG databases (http://data.un.org/Browse.aspx?d=MDG) and summary reports present overwhelming numbers of numbers. Individual indicators are presented for each year, each country, each region, and each program, goal by goal, target by target, indicator by indicator, and series by series, in an indigestible volume of data.

Though there are no doubt complex mathematical methods by which a philanthropic, governmental, or NGO investor might determine how much development is gained per million dollars invested, the cost of obtaining impact measures is so high that most funding decisions are made with little information concerning expected returns (Goldberg, 2009). Further, the percentages of various needs met by leading social enterprises typically range from 0.07% to 3.30%, and needs are growing, not diminishing. Progress at current rates means that it would take thousands of years to solve today’s problems of human suffering, social disparity, and environmental quality. The inefficiency of human, social, and natural capital markets is so overwhelming that there is little hope for significant improvements without the introduction of fundamental infrastructural supports, such as an Intangible Assets Metric System.

A basic question that needs to be asked of the MDG system is, how can anyone make any sense out of so much data? Most of the indicators are evaluated in terms of counts of the number of times something happens, the number of people affected, or the number of things observed to be present. These counts are usually then divided by the maximum possible (the count of the total population) and are expressed as percentages or rates.

As previously explained in various posts in this blog, counts and percentages are not measures in any meaningful sense. They are notoriously difficult to interpret, since the quantitative meaning of any given unit difference varies depending on the size of what is counted, or where the percentage falls in the 0-100 continuum. And because counts and percentages are interpreted one at a time, it is very difficult to know if and when any number included in the sheer mass of data is reasonable, all else considered, or if it is inconsistent with other available facts.
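
To make the point concrete, here is a small illustration of my own (not part of the original post), using the log-odds transformation discussed elsewhere in this blog. The same five-point difference in a percentage corresponds to very different amounts of change on a linear scale depending on where it occurs:

import math

def logit(p):
    """Convert a proportion (0 < p < 1) into log-odds (logits)."""
    return math.log(p / (1 - p))

# The same 5-point percentage difference implies different amounts of
# change on the log-odds scale, depending on where it falls in the range.
for low, high in [(0.50, 0.55), (0.90, 0.95)]:
    print(f"{low:.0%} to {high:.0%}: {logit(high) - logit(low):.2f} logits")

# Prints 0.20 logits for 50% to 55%, but 0.75 logits for 90% to 95%.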

A study of the MDG data must focus on these three potential areas of data quality improvement: consistency evaluation, volume reduction, and interpretability. Each builds on the others. With consistent data lending themselves to summarization in sufficient statistics, data volume can be drastically reduced with no loss of information (Andersen, 1977, 1999; Wright, 1977, 1997), data quality can be readily assessed in terms of sufficiency violations (Smith, 2000; Smith & Plackner, 2009), and quantitative measures can be made interpretable in terms of a calibrated ruler’s repeatedly reproducible hierarchy of indicators (Bond & Fox, 2007; Masters, Lokan, & Doig, 1994).

The primary data quality criteria are qualitative relevance and meaningfulness, on the one hand, and mathematical rigor, on the other. The point here is one of following through on the maxim that we manage what we measure, with the goal of measuring in such a way that management is better focused on the program mission and not distracted by accounting irrelevancies.

Method

As written and deployed, each of the MDG indicators has the face and content validity of providing information on its respective substantive area of interest. But, as has been repeatedly emphasized in this blog, counting something is not the same thing as measuring it.

Counts or rates of literacy or unemployment are not, in and of themselves, measures of development. Their capacity to serve as contributing indications of developmental progress is an empirical question that must be evaluated experimentally against the observable evidence. The measurement of progress toward an overarching developmental goal requires inferences made from a conceptual order of magnitude above and beyond that provided in the individual indicators. The calibration of an instrument for assessing progress toward the realization of the Millennium Development Goals requires, first, a reorganization of the existing data, and then an analysis that tests explicitly the relevant hypotheses as to the potential for quantification, before inferences supporting the comparison of measures can be scientifically supported.

A subset of the MDG data was selected from the MDG database available at http://data.un.org/Browse.aspx?d=MDG, recoded, and analyzed using Winsteps (Linacre, 2011). At least one indicator was selected from each of the eight goals, with 22 in total. All available data from these 22 indicators were recorded for each of 64 countries.

The reorganization of the data is nothing but a way of making the interpretation of the percentages explicit. The meaning of any one country’s percentage or rate of youth unemployment, cell phone users, or literacy has to be kept in context relative to expectations formed from other countries’ experiences. It would be nonsense to interpret any single indicator as good or bad in isolation. Sometimes 30% represents an excellent state of affairs, other times, a terrible one.

Therefore, the distributions of each indicator’s percentages across the 64 countries were divided into ranges and converted to ratings. A lower rating uniformly indicates a status further away from the goal than a higher rating. The ratings were devised by dividing the frequency distribution of each indicator roughly into thirds.

For instance, the youth unemployment rate was found to vary such that the countries furthest from the desired goal had rates of 25% or more (rated 1), and those closest to or exceeding the goal had rates of 0-10% (rated 3), leaving the middle range (10-25%) rated 2. In contrast, percentages of the population that are undernourished were rated 1 for 35% or more, 2 for 15-35%, and 3 for less than 15%.

The division of each distribution into thirds was based only on the investigator’s prior experience with data of this kind. A more thorough approach would begin from a finer-grained rating system, like that structuring the MDG table at http://mdgs.un.org/unsd/mdg/Resources/Static/Products/Progress2008/MDG_Report_2008_Progress_Chart_En.pdf. This greater detail would be sought in order to determine empirically just how many distinctions each indicator can support and contribute to the overall measurement system.
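
By way of illustration only, the recoding just described amounts to something like the following sketch. The cut points are the youth unemployment and undernourishment examples from the text; the function itself is hypothetical and is not code from the actual analysis, which was carried out with Winsteps:

def rate(value, low_cut, high_cut):
    """Convert a raw percentage into a 1-3 rating for an indicator on
    which higher values lie further from the goal. 1 = furthest from
    the goal, 3 = closest to or exceeding it. Indicators scored in the
    opposite direction would swap the 1 and 3 assignments."""
    if value >= high_cut:
        return 1
    if value >= low_cut:
        return 2
    return 3

# Youth unemployment: 25%+ rated 1, 10-25% rated 2, under 10% rated 3.
print(rate(30.0, 10, 25))   # prints 1
# Undernourished population: 35%+ rated 1, 15-35% rated 2, under 15% rated 3.
print(rate(12.0, 15, 35))   # prints 3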

Sixty-four of the available 336 data points were selected for their representativeness, with no duplications of values and with a proportionate distribution along the entire continuum of observed values.

Data from the same 64 countries and the same years were then sought for the subsequent indicators. It turned out that the years in which data were available varied across data sets. Data within one or two years of the target year were sometimes substituted for missing data.

The data were analyzed twice: first with each indicator allowed its own rating scale, parameterizing the category difficulties separately for each item, and then with a single rating scale structure shared by all items, since the results of the first analysis showed that all indicators shared a strongly consistent rating structure.
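
For readers who want the models spelled out, the two analyses correspond to the partial credit and rating scale formulations of the polytomous Rasch model as they are usually written in the literature; this is the standard general form, not a formula reproduced from the report. With country measure B_n, indicator calibration D_i, and rating thresholds F_k (estimated separately for each indicator in the first analysis, shared across indicators in the second), the probability that country n receives rating x on indicator i is

P(X_{ni} = x) = \frac{\exp\left(\sum_{k=0}^{x} (B_n - D_i - F_k)\right)}{\sum_{j=0}^{m} \exp\left(\sum_{k=0}^{j} (B_n - D_i - F_k)\right)}, \qquad F_0 \equiv 0,

where the ratings 1-3 are coded x = 0, 1, 2 and m = 2 is the top category.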

Results

Data were 65.2% complete. Countries were assessed on an average of 14.3 of the 22 indicators, and each indicator was applied on average to 41.7 of the 64 country cases. Measurement reliability was .89-.90, depending on how measurement error is estimated. Cronbach’s alpha for the by-country scores was .94. Calibration reliability was .93-.95. The rating scale worked well (see Linacre, 2002, for criteria). The data fit the measurement model reasonably well, with satisfactory data consistency, meaning that the hypothesis of a measurable developmental construct was not falsified.

The main result for our purposes here concerns how satisfactory data consistency makes it possible to dramatically reduce data volume and improve data interpretability. The figure below illustrates how. What does it mean for data volume to be drastically reduced with no loss of information? Let’s see exactly how much the data volume is reduced for the ten-item subset shown in the figure.

The horizontal continuum from -100 to 1300 in the figure is the metric, the ruler or yardstick. The number of countries at various locations along that ruler is shown across the bottom of the figure. The mean (M), first standard deviation (S), and second standard deviation (T) are shown beneath the numbers of countries. There are ten countries with a measure of just below 400, just to the left of the mean (M).

The MDG indicators are listed on the right of the figure, with the indicator most often found being achieved relative to the goals at the bottom, and the indicator least often being achieved at the top. The ratings in the middle of the figure increase from 1 to 3 left to right as the probability of goal achievement increases as the measures go from low to high. The position of the ratings in the middle of the figure shifts from left to right as one reads up the list of indicators because the difficulty of achieving the goals is increasing.

Because the ratings of the 64 countries relative to these ten goals are internally consistent, nothing but the developmental level of the country and the developmental challenge of the indicator affects the probability that a given rating will be attained. It is this relation that defines fit to a measurement model, the sufficiency of the summed ratings, and the interpretability of the scores. Given sufficient fit and consistency, any country’s measure implies a given rating on each of the ten indicators.

For instance, imagine a vertical line drawn through the figure at a measure of 500, just above the mean (M). This measure is interpreted relative to the places at which the vertical line crosses the ratings in each row associated with each of the ten items. A measure of 500 is read as implying, within a given range of error, uncertainty, or confidence, a rating of

  • 3 on debt service and female-to-male parity in literacy,
  • 2 or 3 on how much of the population is undernourished and how many children under five years of age are moderately or severely underweight,
  • 2 on infant mortality, the percent of the population aged 15 to 49 with HIV, and the youth unemployment rate,
  • 1 or 2 on the poor’s share of the national income, and
  • 1 on CO2 emissions and the rate of personal computers per 100 inhabitants.

For any one country with a measure of 500 on this scale, ten percentages or rates that appear completely incommensurable and incomparable are found to contribute consistently to a single valued function, developmental goal achievement. Instead of managing each separate indicator as a universe unto itself, this scale makes it possible to manage development itself at its own level of complexity. This ten-to-one ratio of reduced data volume is more than doubled when the total of 22 items included in the scale is taken into account.
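
As a rough sketch of how a single measure implies an expected rating on every calibrated indicator, consider the following; the calibrations and thresholds are invented for illustration and are not the actual MDG estimates:

import math

def category_probs(measure, difficulty, thresholds):
    """Rating scale model probabilities for each rating category, given
    a country measure, an indicator calibration, and shared rating
    thresholds, all expressed in logits."""
    numerators, total = [1.0], 0.0          # bottom category; F0 = 0
    for f in thresholds:
        total += measure - difficulty - f
        numerators.append(math.exp(total))
    denom = sum(numerators)
    return [n / denom for n in numerators]

def expected_rating(measure, difficulty, thresholds):
    """Expected rating (1 to 3 here) implied by a measure."""
    probs = category_probs(measure, difficulty, thresholds)
    return sum((k + 1) * p for k, p in enumerate(probs))

# Invented logit-scale calibrations: one easy and one hard indicator.
for name, difficulty in [("easy indicator", -2.0), ("hard indicator", 2.5)]:
    print(name, round(expected_rating(0.5, difficulty, [-1.0, 1.0]), 2))

# A country at measure 0.5 is expected to rate near 3 on the easy
# indicator and near 1 on the hard one, mirroring the logic of the figure.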

This reduction is conceptually and practically important because it focuses attention on the actual object of management: development. When the individual indicators are the focus of attention, the forest is lost for the trees. Those who disparage the validity of the maxim, you manage what you measure, are often discouraged by the feeling of being pulled in too many directions at once. But a measure of the HIV infection rate is not in itself a measure of anything but the HIV infection rate. Interpreting it in terms of broader developmental goals requires evidence that it in fact takes a place in that larger context.

And once a connection with that larger context is established, the consistency of individual data points remains a matter of interest. As the world turns, the order of things may change, but, more likely, data entry errors, temporary data blips, and other factors will alter data quality. Such changes cannot be detected outside of the context defined by an explicit interpretive framework that requires consistent observations.

-100  100     300     500     700     900    1100    1300
|-------+-------+-------+-------+-------+-------+-------|  NUM   INDCTR
1                                 1  :    2    :  3     3    9  PcsPer100
1                         1   :   2    :   3            3    8  CO2Emissions
1                    1  :    2    :   3                 3   10  PoorShareNatInc
1                 1  :    2    :  3                     3   19  YouthUnempRatMF
1              1   :    2   :   3                       3    1  %HIV15-49
1            1   :   2    :   3                         3    7  InfantMortality
1          1  :    2    :  3                            3    4  ChildrenUnder5ModSevUndWgt
1         1   :    2    :  3                            3   12  PopUndernourished
1    1   :    2   :   3                                 3    6  F2MParityLit
1   :    2    :  3                                      3    5  DebtServExpInc
|-------+-------+-------+-------+-------+-------+-------|  NUM   INDCTR
-100  100     300     500     700     900    1100    1300
                   1
       1   1 13445403312323 41 221    2   1   1            COUNTRIES
       T      S       M      S       T

Discussion

A key element in the results obtained here concerns the fact that the data were about 35% missing. Whether or not any given indicator was actually rated for any given country, the measure can still be interpreted as implying the expected rating. This capacity to take missing data into account can be taken advantage of systematically by calibrating a large bank of indicators. With this in hand, it becomes possible to gather only the amount of data needed to make a specific determination, or to adaptively administer the indicators so as to obtain the lowest-error (most reliable) measure at the lowest cost (with the fewest indicators administered). Perhaps most importantly, different collections of indicators can then be equated to measure in the same unit, so that impacts may be compared more efficiently.

Instead of an international developmental aid market that is so inefficient as to preclude any expectation of measured returns on investment, setting up a calibrated bank of indicators to which all measures are traceable opens up numerous desirable possibilities. The cost of assessing and interpreting the data informing aid transactions could be reduced to negligible amounts, and the management of the processes and outcomes in which that aid is invested would be made much more efficient by reduced data volume and enhanced information content. Because capital would flow more efficiently to where supply is meeting demand, nonproducers would be cut out of the market, and the effectiveness of the aid provided would be multiplied many times over.

The capacity to harmonize counts of different but related events into a single measurement system presents the possibility that there may be a bright future for outcomes-based budgeting in education, health care, human resource management, environmental management, housing, corrections, social services, philanthropy, and international development. It may seem wildly unrealistic to imagine such a thing, but the return on the investment would be so monumental that not checking it out would be even crazier.

A full report on the MDG data, with the other references cited, is available on my SSRN page at http://papers.ssrn.com/sol3/papers.cfm?abstract_id=1739386.

Goldberg, S. H. (2009). Billions of drops in millions of buckets: Why philanthropy doesn’t advance social progress. New York: Wiley.


Reasoning by analogy in social science education: On the need for a new curriculum

April 12, 2010

I’d like to revisit the distinction between measurement models and statistical models. Rasch was well known for joking about burning all books containing the words “normal distribution” (Andersen, 1995, p. 385). Rasch’s book and 1961 article both start on their first pages with a distinction between statistical models describing intervariable relations at the group level and measurement models prescribing intravariable relations at the individual level. I think confusion between these kinds of models has caused huge problems.

We typically assume all statistical analyses are quantitative. We refer to any research that uses numbers as quantitative, even when nothing is done to map a substantive and invariant unit onto a number line. We distinguish between qualitative and quantitative data and methods as though quantification had ever been achieved in the history of science without substantive qualitative understandings of the constructs involved.

Quantification in fact predates the emergence of statistics by millennia. It seems to me that there is a great deal to be gained from maintaining a careful distinction between statistics and measurement. Measurement is not primarily performed by someone sitting at a computer analyzing data. Measurement is done by individuals using calibrated instruments to obtain immediately useful quantitative information expressed in a universally uniform unit.

Rasch was correct in his assertion that we can measure the reading ability of a child with the same kind of objectivity with which we measure his or her weight or height. But we don’t commonly express individual height and weight measures in statistical terms. 

Information overload is one of the big topics of the day. Which will contribute more to reducing that overload in efficient and meaningful ways: calibrated instruments measuring in common units giving individual users immediate feedback that summarizes responses to dozens of questions, or ordinal group-level item-by-item statistics reported six months too late to do anything about them?

Instrument calibration certainly makes use of statistics, and statistical models usually assume measurement has taken place, but much stands to be gained from a clear distinction between inter- and intra-variable models. And so I respectfully disagree with those who assert that “the Rasch model is first of all a statistical model.” Maxwell’s method of making analogies from well known physical laws (Nersessian, 2002; Turner, 1955) was adopted by Rasch (1960, pp. 110-115) so that his model would have the same structure as the laws of physics.
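
For readers to whom the analogy is unfamiliar, the dichotomous model Rasch arrived at can be written in its standard textbook forms (offered here as background, not as a quotation from Rasch) as

\ln\left(\frac{P_{ni}}{1 - P_{ni}}\right) = \beta_n - \delta_i \qquad \text{or, multiplicatively,} \qquad P_{ni} = \frac{\theta_n / \sigma_i}{1 + \theta_n / \sigma_i},

where \beta_n = \ln \theta_n is the person’s ability and \delta_i = \ln \sigma_i is the item’s difficulty. The ratio \theta_n / \sigma_i has the same structural form as a simple physical law, such as acceleration expressed as the ratio of force to mass, which is the kind of parallel Maxwell’s method of analogy exploits.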

Statistical models are a different class of models from the laws of physics (Meehl, 1967), since they allow cross-variable interactions in ways that compromise and defeat the possibility of testing the hypotheses of constant unit size, parameter separation, sufficiency, etc.

I’d like to suggest a paraphrase of the first sentence of the abstract from a recent paper (Silva, 2007) on using analogies in science education: Despite its great importance, many students and even their teachers still cannot recognize the relevance of measurement models to build up psychosocial knowledge and are unable to develop qualitative explanations for mathematical expressions of the lawful structural invariances that exist within the social sciences.

And so, here’s a challenge: we need to make an analogy from Silva’s (2007) work in physics education and develop a curriculum for social science education that follows a parallel track. We could trace the development of reading measurement from Rasch (1960) through the Anchor Test Study (Jaeger, 1973; Rentz & Bashaw, 1977) to the introduction of the Lexile Framework for Reading (Stenner, 2001), its explicit continuity with Rasch’s use of Maxwell’s method of analogy (Burdick, Stone, & Stenner, 2006), and full-blown predictive theory (Stenner & Stone, 2003).

With the example of the Rasch Reading Law in hand, we could then train students and teachers to think about structural invariance in the context of psychosocial constructs. It may be that, without the development and dissemination of at least a college-level curriculum of this kind, we will never overcome the confusion between statistical and measurement models.

References

Andersen, E. B. (1995). What George Rasch would have thought about this book. In G. H. Fischer & I. W. Molenaar (Eds.), Rasch models: Foundations, recent developments, and applications (pp. 383-390). New York: Springer-Verlag.

Burdick, D. S., Stone, M. H., & Stenner, A. J. (2006). The Combined Gas Law and a Rasch Reading Law. Rasch Measurement Transactions, 20(2), 1059-60 [http://www.rasch.org/rmt/rmt202.pdf].

Jaeger, R. M. (1973). The national test equating study in reading (The Anchor Test Study). Measurement in Education, 4, 1-8.

Meehl, P. E. (1967). Theory-testing in psychology and physics: A methodological paradox. Philosophy of Science, 34(2), 103-115.

Nersessian, N. J. (2002). Maxwell and “the Method of Physical Analogy”: Model-based reasoning, generic abstraction, and conceptual change. In D. Malament (Ed.), Essays in the history and philosophy of science and mathematics (pp. 129-166). Lasalle, Illinois: Open Court.

Rasch, G. (1960). Probabilistic models for some intelligence and attainment tests (Reprint, with Foreword and Afterword by B. D. Wright, Chicago: University of Chicago Press, 1980). Copenhagen, Denmark: Danmarks Paedagogiske Institut.

Rasch, G. (1961). On general laws and the meaning of measurement in psychology. In Proceedings of the fourth Berkeley symposium on mathematical statistics and probability (pp. 321-333 [http://www.rasch.org/memo1960.pdf]). Berkeley, California: University of California Press.

Rentz, R. R., & Bashaw, W. L. (1977, Summer). The National Reference Scale for Reading: An application of the Rasch model. Journal of Educational Measurement, 14(2), 161-179.

Silva, C. C. (2007, August). The role of models and analogies in the electromagnetic theory: A historical case study. Science & Education, 16(7-8), 835-848.

Stenner, A. J. (2001). The Lexile Framework: A common metric for matching readers and texts. California School Library Journal, 25(1), 41-2.

Stenner, A. J., & Stone, M. (2003). Item specification vs. item banking. Rasch Measurement Transactions, 17(3), 929-30 [http://www.rasch.org/rmt/rmt173a.htm].

Turner, J. (1955, November). Maxwell on the method of physical analogy. British Journal for the Philosophy of Science, 6, 226-238.


Parameterizing Perfection: Practical Applications of a Mathematical Model of the Lean Ideal

April 2, 2010

To properly pursue perfection, we need to parameterize it. That is, taking perfection as the ideal, unattainable standard against which we judge our performance is equivalent to thinking of it as a mathematical model. Organizations are intended to realize their missions independent of the particular employees, customers, suppliers, challenges, products, etc. they happen to engage with at any particular time. Organizational performance measurement (Spitzer, 2007) ought to then be designed in terms of a model that posits, tests for, and capitalizes on the always imperfectly realized independence of those parameters.

Lean thinking (Womack & Jones, 1996) focuses on minimizing waste and maximizing value. At every point at which resources are invested in processes, services, or products, the question is asked, “What value is added here?” Resources are wasted when no value is added, when they can be removed with no detrimental effect on the value of the end product. In their book, Natural Capitalism: Creating the Next Industrial Revolution, Hawken, Lovins, and Lovins (1999, p. 133) say

“Lean thinking … changes the standard for measuring corporate success. … As they [Womack and Jones] express it: ‘Our earnest advice to lean firms today is simple. To hell with your competitors; compete against perfection by identifying all activities that are muda [the Japanese term for waste used in Toyota’s landmark quality programs] and eliminating them. This is an absolute rather than a relative standard which can provide the essential North Star for any organization.’”

Further, every input should “be presumed waste until shown otherwise.” A constant, ongoing, persistent pressure for removing waste is the basic characteristic of lean thinking. Perfection is never achieved, but it aptly serves as the ideal against which progress is measured.

Lean thinking sounds a lot like a mathematical model, though it does not seem to have been written out in mathematical form, or used as the basis for calibrating instruments, estimating measures, evaluating data quality, or for practical assessments of lean organizational performance. The closest anyone seems to have come to parameterizing perfection is in the work of Genichi Taguchi (Ealey, 1988), which has several close parallels with Rasch measurement (Linacre, 1993). But meaningful and objective quantification, as required and achieved in the theory and practice of fundamental measurement (Andrich, 2004; Bezruczko, 2005; Bond & Fox, 2007; Smith & Smith, 2004; Wilson, 2005; Wright, 1999), in fact asserts abstract ideals of perfection as models of organizational, social, and psychological processes in education, health care, marketing, etc. These models test the extent to which outcomes remain invariant across examination or survey questions, across teachers, students, schools, and curricula, or across treatment methods, business processes, or policies.

Though as yet implemented only to a limited extent in business (Drehmer, Belohlav, & Coye, 2000; Drehmer & Deklava, 2001; Lunz & Linacre, 1998; Salzberger, 2009), advanced measurement’s potential rewards are great. Fundamental measurement theory has been successfully applied in research and practice thousands of times over the last 40 years and more, including in very large-scale assessments and licensure/certification applications (Adams, Wu, & Macaskill, 1997; Masters, 2007; Smith, Julian, Lunz, et al., 1994). These successes speak to an opportunity for broad improvements in outcome measurement that could provide more coherent product definitions, along with significant associated opportunities for improving product quality and the efficiency with which it is produced, in the manner that has followed from the use of fundamental measures in other industries.

Of course, processes and outcomes are never implemented or obtained with perfect consistency; that would happen only in a perfect world. But to pursue perfection, we need to parameterize it. In other words, to raise the bar in any area of performance assessment, we have to know not only which direction is up but also when the bar has been raised far enough. Without a way of locating measures on a number line that embodies the lean ideal, however, we cannot tell up from down, we do not know how much to raise the bar, and we cannot properly evaluate the effects of lean experiments.

To think together collectively in ways that lead to significant new innovations, to rise above what Jaron Lanier calls the “global mush” of confused and self-confirming hive thinking, we need the common languages of widely accepted fundamental measures of the relevant processes and outcomes, measures that remain constant across samples of customers, patients, employees, students, etc., and across products, sales techniques, curricula, treatment processes, assessment methods, and brands of instrument.

We are all well aware that the consequences of not knowing where the bar is, of not having product definitions, can be disastrous. In many respects, as I’ve said previously in this blog, the success or failure of health care reform hinges on getting measurement right. The Institute of Medicine report, To Err is Human, of several years ago stresses the fact that system failures pose the greatest threat to safety in health care because they lead to human errors. When a system as complex as health care lacks a standard product definition, and product delivery is fragmented across multiple providers with different amounts and kinds of information in different settings, the system becomes dangerously cumbersome and over-complicated, with unacceptably wide variations and errors in its processes and outcomes, not to even speak of its economic inefficiency.

In contrast with the widespread use of fundamental measures in the product definitions of other industries, health care researchers typically implement neither the longstanding, repeatedly proven, and mathematically rigorous models of fundamental measurement theory nor the metrological networks through which reference standard metrics are engineered. Most industries carefully define, isolate, and estimate the parameters of their products, doing so in ways 1) that ensure industry-wide comparability and standardization, and 2) that facilitate continuous product improvement by revealing multiple opportunities for enhancement. Where organizations in other industries manage by metrics and thereby keep their eyes on the ball of product quality, health care organizations often manage only their own internal processes and cannot in fact bring the product quality ball into view.

In his message concerning the Institute for Healthcare Improvement’s Pursuing Perfection project a few years ago, Don Berwick, like others (Coye, 2001; Coye & Detmer, 1998), observed that health care does not yet have an organization setting new standards in the way that Toyota did for the auto industry in the 1970s. It still doesn’t, of course. Given the differences between the auto and health care industries’ uses of fundamental measures of product quality, and their associated abilities to keep their eyes on the quality ball, is it any wonder, then, that no one in health care has yet hit a home run? It may well be that no one will hit a home run in health care until reference standard measures of product quality are devised.

The need for reference standard measures in uniform data systems is crucial, and the methods for obtaining them are widely available and well-known. So what is preventing the health care industry from adopting and deploying them? Part of the answer is the cost of the initial investment required. In 1980, metrology comprised about six percent of the U.S. gross national product (Hunter, 1980). In the period from 1981 to 1994, annual expenditures on research and development in the U.S. were less than three percent of the GNP, and non-defense R&D was about two percent (NIST Subcommittee on Research, National Science and Technology Council, 1996). These costs, however, must be viewed as investments from which high rates of return can be obtained (Barber, 1987; Gallaher, Rowe, Rogozhin, et al., 2007; Swann, 2005).

For instance, the U.S. National Institute of Standards and Technology estimated the economic impact of 12 areas of research in metrology, in four broad areas including semiconductors, electrical calibration and testing, optical industries, and computer systems (NIST, 1996, Appendix C; also see NIST, 2003). The median rate of return in these 12 areas was 147 percent, and returns ranged from 41 to 428 percent. The report notes that these results compare favorably with those obtained in similar studies of return rates from other public and private research and development efforts. Even if health care metrology produces only a small fraction of the return rate produced in physical metrology, its economic impact could still amount to billions of dollars annually. The proposed pilot projects therefore focus on determining what an effective health care outcomes metrology system should look like. What should its primary functions be? What should it cost? What rates of return could be expected from it?

Metrology, the science of measurement (Pennella, 1997), requires 1) that instruments be calibrated within individual laboratories so as to isolate and estimate the values of the required parameters (Wernimont, 1978); and 2) that individual instruments’ capacities to provide the same measure for the same amount, and so be traceable to a reference standard, be established and monitored via interlaboratory round-robin trials (Mandel, 1978).

Fundamental measurement has already succeeded in demonstrating the viability of reference standard measures of health outcomes, measures whose meaningfulness does not depend on the particular samples of items employed or patients measured. Though this work succeeds as far as it goes, it has been done in a context that lacks any sense of the need for metrological infrastructure. Health care needs networks of scientists and technicians collaborating not only in the first, intralaboratory phase of metrological work, but also in the interlaboratory trials through which different brands or configurations of instruments intended to measure the same variable would be tuned to harmoniously produce the same measure for the same amount.

Implementation of the two phases of metrological innovation in health care would then begin with the intralaboratory calibration of existing and new instruments for measuring overall organizational performance, quality of care, and patients’ health status, quality of life, functionality, etc.  The second phase takes up the interlaboratory equating of these instruments, and the concomitant deployment of reference standard units of measurement throughout a health care system and the industry as a whole. To answer questions concerning health care metrology’s potential returns on investment, the costs for, and the savings accrued from, accomplishing each phase of each pilot will be tracked or estimated.

When instruments measuring in universally uniform, meaningful units are put in the hands of clinicians, a new scientific revolution will occur in medicine. It will be analogous to previous ones associated with the introduction of the thermometer and the instruments of optometry and the clinical laboratory. Such tools will multiply many times over the quality improvement methods used by Brent James, touted as holding the key to health care reform in a recent New York Times profile. Instead of implicitly hypothesizing models of perfection and assessing performance relative to them informally, what we need is a new science that systematically implements the lean ideal on industry-wide scales. The future belongs to those who master these techniques.

References

Adams, R. J., Wu, M. L., & Macaskill, G. (1997). Scaling methodology and procedures for the mathematics and science scales. In M. O. Martin & D. L. Kelly (Eds.), Third International Mathematics and Science Study Technical Report: Vol. 2: Implementation and Analysis – Primary and Middle School Years (pp. 111-145). Chestnut Hill, MA: Boston College.

Andrich, D. (2004, January). Controversy and the Rasch model: A characteristic of incompatible paradigms? Medical Care, 42(1), I-7–I-16.

Barber, J. M. (1987). Economic rationale for government funding of work on measurement standards. In R. Dobbie, J. Darrell, K. Poulter & R. Hobbs (Eds.), Review of DTI work on measurement standards (p. Annex 5). London: Department of Trade and Industry.

Berwick, D. M., James, B., & Coye, M. J. (2003, January). Connections between quality measurement and improvement. Medical Care, 41(1 (Suppl)), I30-38.

Bezruczko, N. (Ed.). (2005). Rasch measurement in health sciences. Maple Grove, MN: JAM Press.

Bond, T., & Fox, C. (2007). Applying the Rasch model: Fundamental measurement in the human sciences, 2d edition. Mahwah, New Jersey: Lawrence Erlbaum Associates.

Coye, M. J. (2001, November/December). No Toyotas in health care: Why medical care has not evolved to meet patients’ needs. Health Affairs, 20(6), 44-56.

Coye, M. J., & Detmer, D. E. (1998). Quality at a crossroads. The Milbank Quarterly, 76(4), 759-68.

Drehmer, D. E., Belohlav, J. A., & Coye, R. W. (2000, Dec). An exploration of employee participation using a scaling approach. Group & Organization Management, 25(4), 397-418.

Drehmer, D. E., & Deklava, S. M. (2001, April). A note on the evolution of software engineering practices. Journal of Systems and Software, 57(1), 1-7.

Ealey, L. A. (1988). Quality by design: Taguchi methods and U.S. industry. Dearborn MI: ASI Press.

Gallaher, M. P., Rowe, B. R., Rogozhin, A. V., Houghton, S. A., Davis, J. L., Lamvik, M. K., et al. (2007). Economic impact of measurement in the semiconductor industry (Tech. Rep. No. 07-2). Gaithersburg, MD: National Institute of Standards and Technology.

Hawken, P., Lovins, A., & Lovins, H. L. (1999). Natural capitalism: Creating the next industrial revolution. New York: Little, Brown, and Co.

Hunter, J. S. (1980, November). The national system of scientific measurement. Science, 210(21), 869-874.

Linacre, J. M. (1993). Quality by design: Taguchi and Rasch. Rasch Measurement Transactions, 7(2), 292.

Lunz, M. E., & Linacre, J. M. (1998). Measurement designs using multifacet Rasch modeling. In G. A. Marcoulides (Ed.), Modern methods for business research. Methodology for business and management (pp. 47-77). Mahwah, New Jersey: Lawrence Erlbaum Associates, Inc.

Mandel, J. (1978, December). Interlaboratory testing. ASTM Standardization News, 6, 11-12.

Masters, G. N. (2007). Special issue: Programme for International Student Assessment (PISA). Journal of Applied Measurement, 8(3), 235-335.

National Institute of Standards and Technology (NIST). (1996). Appendix C: Assessment examples. Economic impacts of research in metrology. In C. o. F. S. Subcommittee on Research (Ed.), Assessing fundamental science: A report from the Subcommittee on Research, Committee on Fundamental Science. Washington, DC: National Science and Technology Council [http://www.nsf.gov/statistics/ostp/assess/nstcafsk.htm#Topic%207; last accessed 18 February 2008].

National Institute of Standards and Technology (NIST). (2003, 15 January). Outputs and outcomes of NIST laboratory research. Retrieved 12 July 2009, from http://www.nist.gov/director/planning/studies.htm#measures.

Pennella, C. R. (1997). Managing the metrology system. Milwaukee, WI: ASQ Quality Press.

Salzberger, T. (2009). Measurement in marketing research: An alternative framework. Northampton, MA: Edward Elgar.

Smith, R. M., Julian, E., Lunz, M., Stahl, J., Schulz, M., & Wright, B. D. (1994). Applications of conjoint measurement in admission and professional certification programs. International Journal of Educational Research, 21(6), 653-664.

Smith, E. V., Jr., & Smith, R. M. (2004). Introduction to Rasch measurement. Maple Grove, MN: JAM Press.

Spitzer, D. (2007). Transforming performance measurement: Rethinking the way we measure and drive organizational success. New York: AMACOM.

Swann, G. M. P. (2005, 2 December). John Barber’s pioneering work on the economics of measurement standards [Electronic version]. Retrieved http://www.cric.ac.uk/cric/events/jbarber/swann.pdf from Notes for Workshop in Honor of John Barber held at University of Manchester.

Wernimont, G. (1978, December). Careful intralaboratory study must come first. ASTM Standardization News, 6, 11-12.

Wilson, M. (2005). Constructing measures: An item response modeling approach. Mahwah, New Jersey: Lawrence Erlbaum Associates.

Womack, J. P., & Jones, D. T. (1996, Sept./Oct.). Beyond Toyota: How to root out waste and pursue perfection. Harvard Business Review, 74, 140-58.

Wright, B. D. (1999). Fundamental measurement for psychology. In S. E. Embretson & S. L. Hershberger (Eds.), The new rules of measurement: What every educator and psychologist should know (pp. 65-104 [http://www.rasch.org/memo64.htm]). Hillsdale, New Jersey: Lawrence Erlbaum Associates.


Protocols for Living Capital

December 23, 2009

David Brooks’ December 22, 2009 column, “The Protocol Society,” hits some really great notes. There are several things worth commenting on. The first point concerns the protection of intellectual property and the encouragement of a free flow of ideas within the overarching operating system of laws, regulations, and property rights. What Brooks is getting at here is the concept of living capital.

A diverse group of writers (Hayek, De Soto, Latour, many others) contrast what they variously term socialist, centralized, and prescientific efforts to control capital’s concrete forms, on the one hand, with capitalist, decentralized, and scientific methods that focus on liberating the flow of capital defined abstractly in terms of the rule of law and transferable representations (titles, deeds, calibrated instruments, etc.). These two senses of capital also apply in the context of intangibles like human, social, and natural capital (Fisher, 2002, 2005, 2009a, 2010).

Second, the movement in economics away from mathematical modeling echoes the broadening appreciation for qualitative methods across the social sciences that has been underway since the 1960s. The issue is one of learning how to integrate substantive concerns for meaningfulness and understanding in the ways we think about economics. The idealized rational consumer typically assumed in traditional mathematical models demands the imposition of a logic not actually often observed in practice.

But just because people may not behave in accord with one sense of rationality does not mean there is not a systematic logic employed in the ways they make decisions that are meaningful to them. Further, though few are yet much aware of this, mathematical models are not inherently irreconcilable with qualitative methods (Fisher, 2003a, 2003b; Heelan, 1998; Kisiel, 1973). Scientifically efficacious mathematical thinking has always had deep roots in qualitative, substantive meaning (Heilbron, 1993; Kuhn, 1961; Roche, 1998). Analogous integrations of qualitative and quantitative methods have been used in psychology, sociology, and education for decades (Bond & Fox, 2007; Fisher, 2004; Wilson, 2005; Wright, 1997, 2000).

Third, yes, those societies and subcultures that have the capacities for increasing the velocity of new recipes have measurably greater amounts of social capital than others. The identification of invariant patterns in social capital will eventually lead to the calibration of precision measures and the deployment of universally uniform metrics as common currencies for the exchange of social value (Fisher, 2002, 2005, 2009a, 2009b).

Fourth, though I haven’t read “Smart World,” the book by Richard Ogle that Brooks refers to, the theory of the extended mind embodied in social networks sounds highly indebted to the work of Bruno Latour (1987, 1995, 2005) and others working in the social studies of science (O’Connell, 1993) and in social psychology (Hutchins, 1995; Magnus, 2007). Brooks and Ogle are exactly right in their assertions about the kinds of collective cognition that are needed for real innovation. The devilish details are embedded in the infrastructure of metrological standards and uniform metrics that coordinate and harmonize thought and behavior. We won’t realize our potential for creativity in the domains of the intangible forms of capital and intellectual property until we get our act together and create a new metric system for them (Fisher, 2009a, 2009b, 2010). Every time someone iterates through the protocol exemplified in Brooks’ column, we get a step closer to this goal.

References

Bond, T., & Fox, C. (2007). Applying the Rasch model: Fundamental measurement in the human sciences, 2d edition. Mahwah, New Jersey: Lawrence Erlbaum Associates.

Fisher, W. P., Jr. (2000). Objectivity in psychosocial measurement: What, why, how. Journal of Outcome Measurement, 4(2), 527-563 [http://www.livingcapitalmetrics.com/images/WP_Fisher_Jr_2000.pdf].

Fisher, W. P., Jr. (2002, Spring). “The Mystery of Capital” and the human sciences. Rasch Measurement Transactions, 15(4), 854 [http://www.rasch.org/rmt/rmt154j.htm].

Fisher, W. P., Jr. (2003a, December). Mathematics, measurement, metaphor, metaphysics: Part I. Implications for method in postmodern science. Theory & Psychology, 13(6), 753-90.

Fisher, W. P., Jr. (2003b, December). Mathematics, measurement, metaphor, metaphysics: Part II. Accounting for Galileo’s “fateful omission.” Theory & Psychology, 13(6), 791-828.

Fisher, W. P., Jr. (2004, October). Meaning and method in the social sciences. Human Studies: A Journal for Philosophy and the Social Sciences, 27(4), 429-54.

Fisher, W. P., Jr. (2005). Daredevil barnstorming to the tipping point: New aspirations for the human sciences. Journal of Applied Measurement, 6(3), 173-9 [http://www.livingcapitalmetrics.com/images/FisherJAM05.pdf].

Fisher, W. P., Jr. (2009a, November). Invariance and traceability for measures of human, social, and natural capital: Theory and application. Measurement (Elsevier), 42(9), 1278-1287.

Fisher, W. P., Jr. (2009b). NIST critical national need idea white paper: Metrological infrastructure for human, social, and natural capital (Tech. Rep., http://www.livingcapitalmetrics.com/images/FisherNISTWhitePaper2.pdf). New Orleans: LivingCapitalMetrics.com.

Fisher, W. P., Jr. (2010). Bringing human, social, and natural capital to life: Practical consequences and opportunities. Journal of Applied Measurement, in press [http://www.livingcapitalmetrics.com/images/BringingHSN_FisherARMII.pdf].

Heelan, P. A. (1998, June). The scope of hermeneutics in natural science. Studies in History and Philosophy of Science Part A, 29(2), 273-98.

Heilbron, J. L. (1993). Weighing imponderables and other quantitative science around 1800 (Historical Studies in the Physical and Biological Sciences, Vol. 24, Supplement, Part I, pp. 1-337). Berkeley, California: University of California Press.

Hutchins, E. (1995). Cognition in the wild. Cambridge, Massachusetts: MIT Press.

Kisiel, T. (1973). The mathematical and the hermeneutical: On Heidegger’s notion of the apriori. In E. G. Ballard & C. E. Scott (Eds.), Martin Heidegger: In Europe and America (pp. 109-20). The Hague: Martinus Nijhoff.

Kuhn, T. S. (1961). The function of measurement in modern physical science. Isis, 52(168), 161-193. (Rpt. in T. S. Kuhn, (Ed.). (1977). The essential tension: Selected studies in scientific tradition and change (pp. 178-224). Chicago: University of Chicago Press.)

Latour, B. (1987). Science in action: How to follow scientists and engineers through society. New York: Cambridge University Press.

Latour, B. (1995). Cogito ergo sumus! Or psychology swept inside out by the fresh air of the upper deck: Review of Hutchins’ Cognition in the Wild, MIT Press, 1995. Mind, Culture, and Activity: An International Journal, 3(192), 54-63.

Latour, B. (2005). Reassembling the social: An introduction to Actor-Network-Theory. (Clarendon Lectures in Management Studies). Oxford, England: Oxford University Press.

Magnus, P. D. (2007). Distributed cognition and the task of science. Social Studies of Science, 37(2), 297-310.

O’Connell, J. (1993). Metrology: The creation of universality by the circulation of particulars. Social Studies of Science, 23, 129-173.

Roche, J. (1998). The mathematics of measurement: A critical history. London: The Athlone Press.

Wilson, M. (2005). Constructing measures: An item response modeling approach. Mahwah, New Jersey: Lawrence Erlbaum Associates.

Wright, B. D. (1997, Winter). A history of social science measurement. Educational Measurement: Issues and Practice, 16(4), 33-45, 52 [http://www.rasch.org/memo62.htm].

Wright, B. D., Stone, M., & Enos, M. (2000). The evolution of meaning in practice. Rasch Measurement Transactions, 14(1), 736 [http://www.rasch.org/rmt/rmt141g.htm].

Creative Commons License
LivingCapitalMetrics Blog by William P. Fisher, Jr., Ph.D. is licensed under a Creative Commons Attribution-Noncommercial-No Derivative Works 3.0 United States License.
Based on a work at livingcapitalmetrics.wordpress.com.
Permissions beyond the scope of this license may be available at http://www.livingcapitalmetrics.com.

Reliability Coefficients: Starting from the Beginning

August 31, 2009

[This posting was prompted by questions concerning a previous blog entry, Reliability Revisited, and provides background on reliability that only Rasch measurement practitioners are likely to possess.] Most measurement applications based on ordinal data do not implement rigorous checks of the internal consistency of the observations, nor do they typically use the log-odds transformation to convert the nonlinear scores into linear measures. Measurement is usually defined in statistical terms, applying population-level models to obtain group-level summary scores, means, and percentages. Measurement, however, ought to involve individual-level models and case-specific location estimates. (See one of my earlier blogs for more on this distinction between statistics and measurement.)
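
As a concrete illustration of that log-odds transformation, consider the following sketch (mine, not from the original post). It converts raw counts of right and wrong answers into logits; a full Rasch analysis would also adjust for the difficulties of the particular items taken, which this shortcut ignores.

import math

def score_to_logit(right, wrong):
    # Zero-order log-odds transformation of a raw count into a linear metric.
    return math.log(right / wrong)

print(round(score_to_logit(15, 5), 2))   # 75% correct is about +1.10 logits
print(round(score_to_logit(10, 10), 2))  # 50% correct is 0.00 logits

Equal raw-score differences correspond to very different logit differences near the middle and near the extremes of the score range, which is the nonlinearity at issue.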

Given the appropriate measurement focus on the individual, the instrument is initially calibrated and measures are estimated in a simultaneous conjoint process. Once the instrument is calibrated, the item estimates can be anchored, measures can be routinely produced from them, and new items can be calibrated into the system and others dropped over time. This method has been the norm in admissions, certification, licensure, and high-stakes testing for decades (Fisher & Wright, 1994; Bezruczko, 2005).
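
To make the anchoring idea concrete, here is a sketch of routine measure production from fixed item calibrations (my illustration, with invented difficulty values; operational software also handles extreme scores, polytomous items, and fit diagnostics). The person measure is simply the point on the scale at which the expected raw score equals the observed raw score.

import math

def measure_from_anchors(raw_score, anchored_difficulties, tol=1e-6):
    # Maximum likelihood person measure given anchored (fixed) item difficulties.
    if raw_score <= 0 or raw_score >= len(anchored_difficulties):
        raise ValueError("extreme scores need special treatment")
    b = 0.0
    while True:
        probs = [1 / (1 + math.exp(-(b - d))) for d in anchored_difficulties]
        expected = sum(probs)                   # expected raw score at measure b
        info = sum(p * (1 - p) for p in probs)  # test information at measure b
        step = (raw_score - expected) / info    # Newton-Raphson update
        b += step
        if abs(step) < tol:
            return b

anchors = [-1.5, -1.0, -0.5, 0.0, 0.0, 0.5, 1.0, 1.5, 2.0, 2.5]
print(round(measure_from_anchors(7, anchors), 2))  # roughly 1.6 logits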

Measurement modelling of individual response processes has to be stochastic, or else we run into the attenuation paradox (Engelhard, 1993, 1994). This is the situation in which a deterministic progression of observations from one end of the instrument to the other produces apparently error-free data strings that look like this (1 being a correct answer, a higher rating, or the presence of an attribute, and 0 being incorrect, a lower rating, or the absence of the attribute):

00000000000

10000000000

11000000000

11100000000

11110000000

11111000000

11111100000

11111110000

11111111000

11111111100

11111111110

11111111111

In this situation, strings with all 0s and all 1s give no information useful for estimating measures (rows) or calibrations (columns). It is as though some of the people are shorter than the first unit on the ruler, and others are taller than the top unit. We don’t really have any way of knowing how short or tall they are, so their rows drop out. But eliminating the top and bottom rows makes the leftmost and rightmost columns all 0s and 1s, and eliminating those then gives new rows with all 0s and 1s, and so on, until there’s no data left. (See my Reliability Revisited blog entry for evaluations of five different probabilistically structured data sets of this kind, simulated to contrast various approaches to assessing reliability and internal consistency.)
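
The collapse just described is easy to verify directly. The following toy example (mine, not from the post) repeatedly removes the perfect rows and columns from a deterministic Guttman matrix and ends with no data at all:

n = 11
data = [[1 if col < row else 0 for col in range(n)] for row in range(n + 1)]

while data and data[0]:
    # Drop person rows that are all 0s or all 1s (no information about location).
    data = [r for r in data if 0 < sum(r) < len(r)]
    if not data:
        break
    # Drop item columns that are all 0s or all 1s (no information about calibration).
    keep = [c for c in range(len(data[0]))
            if 0 < sum(row[c] for row in data) < len(data)]
    data = [[row[c] for c in keep] for row in data]

print(data)  # [] -- nothing left from which to estimate measures or calibrations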

The problem for estimation (Linacre, 1991, 1999, 2000) in data like those shown above is that the lack of informational overlap between the columns, on the one hand, and between the rows, on the other, gives us no basis for knowing how much more of the variable is represented by any one item relative to any other, or by any one person measured relative to any other. In addition, whenever we actually construct measures of abilities, attitudes, or behaviors that conform to this kind of Guttman (1950) structure (Andrich, 1985; Douglas & Wright, 1989; Engelhard, 2008), the items have to be of such markedly different difficulties or agreeabilities that the results tend to involve large numbers of indistinguishable groups of respondents. But when the needed informational overlap is present in a probabilistically consistent way, we have an example of the phenomenon of stochastic resonance (Fisher, 1992b), so called because of the way noise amplifies weak deterministic signals (Andò & Graziani, 2000; Benzi, Sutera, & Vulpiani, 1981; Bulsara & Gammaitoni, 1996; Dykman & McClintock, 1998; Schimansky-Geier, Freund, Neiman, & Shulgin, 1998).

We need the noise, but we can’t let it overwhelm the system. We have to be able to know how much error there is relative to the actual signal. Reliability is traditionally defined (Guilford, 1965, pp. 439-40) as an estimate of this relation of signal and noise:

“The reliability of any set of measurements is logically defined as the proportion of their variance that is true variance…. We think of the total variance of a set of measures as being made up of two sources of variance: true variance and error variance… The true measure is assumed to be the genuine value of whatever is being measured… The error components occur independently and at random.”

Traditional reliability coefficients, like Cronbach’s alpha, are correlational, implementing a statistical model of group-level information. Error is taken to be the unexplained portion of the variance:

“In his description of alpha Cronbach (1951) proved (1) that alpha is the mean of all possible split-half coefficients, (2) that alpha is the value expected when two random samples of items from a pool like those in the given test are correlated, and (3) that alpha is a lower bound to the proportion of test variance attributable to common factors among the items” (Hattie, 1985, pp. 143-4).

But measurement models of individual-level response processes (Rasch, 1960; Andrich, 1988; Wright, 1977; Fisher & Wright, 1994; Bond & Fox, 2007; Wilson, 2005; Bezruczko, 2005) employ individual-level error estimates (Wright, 1977; Wright & Stone, 1979; Wright & Masters, 1982), not correlational group-level variance estimates. The individual measurement errors are statistically equivalent to sampling confidence intervals, as is evident in both Wright’s equations and in plots of errors and confidence intervals (see Figure 4 in Fisher, 2008). That is, error and confidence intervals both decline at the same rate with larger numbers of item responses per person, or larger numbers of person responses per item.
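
That common rate of decline can be made explicit. In the dichotomous Rasch model the standard error of an individual measure is the inverse square root of the test information, the sum of the binomial variances of the modeled responses, so it shrinks at the same one-over-root-n rate as a sampling confidence interval. A brief illustration (my own, with made-up values):

import math

def rasch_se(person_measure, item_difficulties):
    info = 0.0
    for d in item_difficulties:
        p = 1 / (1 + math.exp(-(person_measure - d)))  # modeled probability of success
        info += p * (1 - p)                            # binomial variance contributed by this item
    return 1 / math.sqrt(info)

print(round(rasch_se(0.0, [0.0] * 25), 2))   # 0.4 logits from 25 well-targeted items
print(round(rasch_se(0.0, [0.0] * 100), 2))  # 0.2 logits from 100 such items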

This phenomenon has a constructive application in instrument design. If a reasonable expectation for the measurement standard deviation can be formulated and related to the error expected on the basis of the number of items and response categories, a good estimate of the measurement reliability can be read off a nomograph (Linacre, 1993).
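
A back-of-the-envelope version of the calculation such a nomograph automates might look like the following. The rule of thumb used here, 2.5 divided by the square root of the test length as the root mean square error of a reasonably targeted dichotomous test, is an assumption supplied for illustration, not a value read from Linacre's nomograph.

import math

def expected_reliability(measure_sd, n_items):
    rmse = 2.5 / math.sqrt(n_items)                # assumed error for a targeted dichotomous test
    return max(0.0, 1 - (rmse / measure_sd) ** 2)  # proportion of the variance that is not error

print(round(expected_reliability(1.0, 25), 2))   # about 0.75 for 25 items and a 1-logit SD
print(round(expected_reliability(1.0, 100), 2))  # about 0.94 for 100 items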

Wright (Wright & Masters, 1982, pp. 92, 106; Wright, 1996) introduced several vitally important measurement precision concepts and tools that follow from access to individual person and item error estimates. They improve on the traditional KR-20 or Cronbach reliability coefficients because the individualized error estimates better account for the imprecisions of mistargeted instruments, and for missing data, and so more accurately and conservatively estimate reliability.

Wright and Masters introduce a new reliability statistic, G, the measurement separation reliability index. The availability of individual error estimates makes it possible to estimate the true variance of the measures more directly, by subtracting the mean square error from the total variance. The standard deviation based on this estimate of true variance is then made the numerator of a ratio, G, having the root mean square error as its denominator.

Each unit increase in this G index then represents another multiple of the error unit in the amount of quantitative variation present in the measures. This multiple is nonlinearly represented in the traditional reliability coefficients expressed in the 0.00 – 1.00 range, such that the same separation index unit difference is found in the 0.00 to 0.50, 0.50 to 0.80, 0.80 to 0.90, 0.90 to 0.94, 0.94 to 0.96, and 0.96 to 0.97 reliability ranges (see Fisher, 1992a, for a table of values; available online: see references).
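
This nonlinear correspondence follows directly from expressing reliability as the ratio of true variance to total variance, which works out to G squared over one plus G squared. A quick check reproduces the ranges listed above:

for G in range(1, 7):
    print(G, round(G**2 / (1 + G**2), 2))
# 1 0.5, 2 0.8, 3 0.9, 4 0.94, 5 0.96, 6 0.97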

G can also be estimated as the square root of the reliability divided by one minus the reliability. Conversely, a reliability coefficient roughly equivalent to Cronbach’s alpha can be estimated as the true variance divided by the total variance, which works out to G squared divided by G squared plus one. Because individual error estimates are inflated in the presence of missing data and when an instrument is mistargeted so that measures tend toward the extremes, the Rasch-based reliability coefficients tend to be more conservative than Cronbach’s alpha, which hides these sources of error within its variances and correlations. For a comparison of the G separation index, the G reliability coefficient, and Cronbach’s alpha over five simulated data sets, see the Reliability Revisited blog entry.
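
A small worked example of those computations, using a handful of person measures and individual standard errors invented for the purpose (not data from any of the studies cited), looks like this:

measures = [-1.2, -0.4, 0.1, 0.7, 1.5, 2.3]      # person measures in logits
errors   = [0.45, 0.38, 0.36, 0.37, 0.41, 0.52]  # their individual standard errors

n = len(measures)
mean = sum(measures) / n
observed_var = sum((m - mean) ** 2 for m in measures) / (n - 1)
mse = sum(e ** 2 for e in errors) / n        # mean square error
true_var = max(0.0, observed_var - mse)      # total variance minus error variance

G = (true_var ** 0.5) / (mse ** 0.5)         # separation index: true SD over root mean square error
reliability = true_var / (true_var + mse)    # equivalently, G squared over one plus G squared

print(round(G, 2), round(reliability, 2))    # roughly 2.9 and 0.89 for these values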

Error estimates can be made more conservative yet by multiplying each individual error term by the larger of either 1.0 or the square root of the associated individual mean square fit statistic for that case (Wright, 1995). (The mean square fit statistics are chi-squares divided by their degrees of freedom, and so have an expected value of 1.00; see Smith (2000) for more on fit, and see my recent blog entry, Reliability Revisited, for more on the conceptualization and evaluation of reliability relative to fit.)
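
In code, the inflation rule just described is a one-liner (my sketch):

def inflated_error(se, mean_square_fit):
    # Inflate the standard error when the case misfits (mean square above 1.0); never deflate it.
    return se * max(1.0, mean_square_fit ** 0.5)

print(round(inflated_error(0.40, 1.44), 2))  # 0.48 -- noisier than expected, so the error is inflated
print(round(inflated_error(0.40, 0.81), 2))  # 0.4 -- better than expected fit leaves the error alone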

Wright and Masters (1982, pp. 92, 105-6) also introduce the concept of strata, ranges on the measurement continuum with centers separated by three errors. Strata are in effect a more forgiving expression of the separation reliability index, G, since an estimate of strata defined as having centers separated by four errors is very nearly identical with the separation index. If three errors define a 95% confidence interval, four are equivalent to 99% confidence.
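
The conversion implied here, stated the way it is usually given in the Rasch literature (my restatement, not a quotation from Wright and Masters), is:

def strata(G):
    # Number of measurement ranges with centers three errors apart, given separation G.
    return (4 * G + 1) / 3

print(strata(2.0))  # 3.0 -- a separation of 2 supports three statistically distinct strata
print(strata(3.0))  # about 4.3 -- a separation of 3 supports slightly more than four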

There is a particular relevance in all of this for practical applications involving the combination or aggregation of physical, chemical, and other previously calibrated measures. This is illustrated in, for instance, the use of chemical indicators in assessing disease severity, environmental pollution, etc. Though any individual measure of the amount of a chemical or compound is valid within the limits of its intended purpose, to arrive at measures delineating disease severity, overall pollution levels, etc., the relevant instruments must be designed, tested, calibrated, and maintained, just as any instruments are (Alvarez, 2005; Cipriani, Fox, Khuder, et al., 2005; Fisher, Bernstein, et al., 2002; Fisher, Priest, Gilder, et al., 2008; Hughes, Perkins, Wright, et al., 2003; Perkins, Wright, & Dorsey, 2005; Wright, 2000).

The same methodology that is applied in this work, involving the rating or assessment of the quality of the outcomes or impacts counted, expressed as percentages, or given in an indicator’s native metric (parts per million, acres, number served, etc.), is needed in the management of all forms of human, social, and natural capital. (Watch this space for a forthcoming blog applying this methodology to the scaling of the UN Millennium Development Goals data.) The practical advantages of working from calibrated instrumentation in these contexts include data quality evaluations, the replacement of nonlinear percentages with linear measures, data volume reduction with no loss of information, and the integration of meaningful and substantive qualities with additive quantities on annotated metrics.

References

Alvarez, P. (2005). Several noncategorical measures define air pollution. In N. Bezruczko (Ed.), Rasch measurement in health sciences (pp. 277-93). Maple Grove, MN: JAM Press.

Andò, B., & Graziani, S. (2000). Stochastic resonance theory and applications. New York: Kluwer Academic Publishers.

Andrich, D. (1985). An elaboration of Guttman scaling with Rasch models for measurement. In N. B. Tuma (Ed.), Sociological methodology 1985 (pp. 33-80). San Francisco, California: Jossey-Bass.

Andrich, D. (1988). Rasch models for measurement (Sage University Paper Series on Quantitative Applications in the Social Sciences, series no. 07-068). Beverly Hills, California: Sage Publications.

Benzi, R., Sutera, A., & Vulpiani, A. (1981). The mechanism of stochastic resonance. Journal of Physics. A. Mathematical and General, 14, L453-L457.

Bezruczko, N. (Ed.). (2005). Rasch measurement in health sciences. Maple Grove, MN: JAM Press.

Bond, T., & Fox, C. (2007). Applying the Rasch model: Fundamental measurement in the human sciences, 2d edition. Mahwah, New Jersey: Lawrence Erlbaum Associates.

Bulsara, A. R., & Gammaitoni, L. (1996, March). Tuning in to noise. Physics Today, 49, 39-45.

Cipriani, D., Fox, C., Khuder, S., & Boudreau, N. (2005). Comparing Rasch analyses probability estimates to sensitivity, specificity and likelihood ratios when examining the utility of medical diagnostic tests. Journal of Applied Measurement, 6(2), 180-201.

Cronbach, L. J. (1951). Coefficient alpha and the internal structure of tests. Psychometrika, 16(3), 297-334.

Douglas, G. A., & Wright, B. D. (1989). Response patterns and their probabilities. Rasch Measurement Transactions, 3(4), 75-77 [http://www.rasch.org/rmt/rmt34.htm].

Dykman, M. I., & McClintock, P. V. E. (1998, January 22). What can stochastic resonance do? Nature, 391(6665), 344.

Engelhard, G., Jr. (1993). What is the attenuation paradox? Rasch Measurement Transactions, 6(4), 257 [http://www.rasch.org/rmt/rmt64.htm].

Engelhard, G., Jr. (1994). Resolving the attenuation paradox. Rasch Measurement Transactions, 8(3), 379.

Engelhard, G. (2008, July). Historical perspectives on invariant measurement: Guttman, Rasch, and Mokken. Measurement: Interdisciplinary Research & Perspectives, 6(3), 155-189.

Fisher, W. P., Jr. (1992a). Reliability statistics. Rasch Measurement Transactions, 6(3), 238 [http://www.rasch.org/rmt/rmt63i.htm].

Fisher, W. P., Jr. (1992b, Spring). Stochastic resonance and Rasch measurement. Rasch Measurement Transactions, 5(4), 186-187 [http://www.rasch.org/rmt/rmt54k.htm].

Fisher, W. P., Jr. (2008, Summer). The cash value of reliability. Rasch Measurement Transactions, 22(1), 1160-3 [http://www.rasch.org/rmt/rmt221.pdf].

Fisher, W. P., Jr., Bernstein, L. H., Qamar, A., Babb, J., Rypka, E. W., & Yasick, D. (2002, February). At the bedside: Measuring patient outcomes. Advance for Administrators of the Laboratory, 11(2), 8, 10 [http://laboratory-manager.advanceweb.com/Article/At-the-Bedside-7.aspx].

Fisher, W. P., Jr., Priest, E., Gilder, R., Blankenship, D., & Burton, E. C. (2008, July 3-6). Development of a novel heart failure measure to identify hospitalized patients at risk for intensive care unit admission. Presented at the World Congress on Controversies in Cardiovascular Diseases [http://www.comtecmed.com/ccare/2008/authors_abstract.aspx#Author15], Intercontinental Hotel, Berlin, Germany.

Fisher, W. P., Jr., & Wright, B. D. (Eds.). (1994). Applications of probabilistic conjoint measurement. International Journal of Educational Research, 21(6), 557-664.

Guilford, J. P. (1965). Fundamental statistics in psychology and education. 4th Edn. New York: McGraw-Hill.

Guttman, L. (1950). The basis for scalogram analysis. In S. A. Stouffer & et al. (Eds.), Studies in social psychology in World War II. volume 4: Measurement and prediction (pp. 60-90). New York: Wiley.

Hattie, J. (1985, June). Methodology review: Assessing unidimensionality of tests and items. Applied Psychological Measurement, 9(2), 139-64.

Hughes, L., Perkins, K., Wright, B. D., & Westrick, H. (2003). Using a Rasch scale to characterize the clinical features of patients with a clinical diagnosis of uncertain, probable or possible Alzheimer disease at intake. Journal of Alzheimer’s Disease, 5(5), 367-373.

Linacre, J. M. (1991, Spring). Stochastic Guttman order. Rasch Measurement Transactions, 5(4), 189 [http://www.rasch.org/rmt/rmt54p.htm].

Linacre, J. M. (1993). Rasch-based generalizability theory. Rasch Measurement Transactions, 7(1), 283-284; [http://www.rasch.org/rmt/rmt71h.htm].

Linacre, J. M. (1999). Understanding Rasch measurement: Estimation methods for Rasch measures. Journal of Outcome Measurement, 3(4), 382-405.

Linacre, J. M. (2000, Autumn). Guttman coefficients and Rasch data. Rasch Measurement Transactions, 14(2), 746-7 [http://www.rasch.org/rmt/rmt142e.htm].

Perkins, K., Wright, B. D., & Dorsey, J. K. (2005). Using Rasch measurement with medical data. In N. Bezruczko (Ed.), Rasch measurement in health sciences (pp. 221-34). Maple Grove, MN: JAM Press.

Rasch, G. (1960). Probabilistic models for some intelligence and attainment tests (Reprint, with Foreword and Afterword by B. D. Wright, Chicago: University of Chicago Press, 1980). Copenhagen, Denmark: Danmarks Paedagogiske Institut.

Schimansky-Geier, L., Freund, J. A., Neiman, A. B., & Shulgin, B. (1998). Noise induced order: Stochastic resonance. International Journal of Bifurcation and Chaos, 8(5), 869-79.

Smith, R. M. (2000). Fit analysis in latent trait measurement models. Journal of Applied Measurement, 1(2), 199-218.

Wilson, M. (2005). Constructing measures: An item response modeling approach. Mahwah, New Jersey: Lawrence Erlbaum Associates.

Wright, B. D. (1977). Solving measurement problems with the Rasch model. Journal of Educational Measurement, 14(2), 97-116 [http://www.rasch.org/memo42.htm].

Wright, B. D. (1995, Summer). Which standard error? Rasch Measurement Transactions, 9(2), 436-437 [http://www.rasch.org/rmt/rmt92n.htm].

Wright, B. D. (1996, Winter). Reliability and separation. Rasch Measurement Transactions, 9(4), 472 [http://www.rasch.org/rmt/rmt94n.htm].

Wright, B. D. (2000). Rasch regression: My recipe. Rasch Measurement Transactions, 14(3), 758-9 [http://www.rasch.org/rmt/rmt143u.htm].

Wright, B. D., & Masters, G. N. (1982). Rating scale analysis: Rasch measurement. Chicago, Illinois: MESA Press.

Wright, B. D., & Stone, M. H. (1979). Best test design: Rasch measurement. Chicago, Illinois: MESA Press.

Statistics and Measurement: Clarifying the Differences

August 26, 2009

Measurement is qualitatively and paradigmatically quite different from statistics, even though statistics obviously play important roles in measurement, and vice versa. The perception of measurement as conceptually difficult stems in part from its rearrangement of most of the concepts that we take for granted in the statistical paradigm as landmarks of quantitative thinking. When we recognize and accept the qualitative differences between statistics and measurement, they both become easier to understand.

Statistical analyses are commonly referred to as quantitative, even though the numbers analyzed most usually have not been derived from the mapping of an invariant substantive unit onto a number line. Measurement takes such mapping as its primary concern, focusing on the quantitative meaningfulness of numbers (Falmagne & Narens, 1983; Luce, 1978; Marcus-Roberts & Roberts, 1987; Mundy, 1986; Narens, 2002; Roberts, 1999). Statistical models focus on group processes and relations among variables, while measurement models focus on individual processes and relations within variables (Duncan, 1992; Duncan & Stenbeck, 1988; Rogosa, 1987). Statistics makes assumptions about factors beyond its control, while measurement sets requirements for objective inference (Andrich, 1989). Statistics primarily involves data analysis, while measurement primarily calibrates instruments in common metrics for interpretation at the point of use (Cohen, 1994; Fisher, 2000; Guttman, 1985; Goodman, 1999a-c; Rasch, 1960).

Statistics focuses on making the most of the data in hand, while measurement focuses on using the data in hand to inform (a) instrument calibration and improvement, and (b) the prediction and efficient gathering of meaningful new data on individuals in practical applications. Where statistical “measures” are defined inherently by a particular analytic method, measures read from calibrated instruments—and the raw observations informing these measures—need not be computerized for further analysis.

Because statistical “measures” are usually derived from ordinal raw scores, changes to the instrument change their meaning, resulting in a strong inclination to avoid improving the instrument. Measures, in contrast, take missing data into account, so their meaning remains invariant over instrument configurations, providing a firm basis for the emergence of a measurement quality improvement culture. So statistical “measurement” begins and ends with data analysis, whereas measurement from calibrated instruments is in a constant cycle of application, new item calibrations, and critical recalibrations that require only intermittent resampling.

The vast majority of statistical methods and models make strong assumptions about the nature of the unit of measurement, but provide either very limited ways of checking those assumptions, or no checks at all. Statistical models are descriptive in nature, meaning that models are fit to data, that the validity of the data is beyond the immediate scope of interest, and that the model accounting for the most variation is regarded as best. Finally, and perhaps most importantly, statistical models are inherently oriented toward the relations among variables at the level of samples and populations.

Measurement models, however, impose strong requirements on data quality in order to achieve the unit of measurement that is easiest to think with, one that stays constant and remains invariant across the local particulars of instrument and sample. Measurement methods and models, then, provide extensive and varied ways of checking the quality of the unit, and so must be prescriptive rather than descriptive. That is, measurement models define the data quality that must be obtained for objective inference. In the measurement paradigm, data are fit to models, data quality is of paramount interest, and data quality evaluation must be informed as much by qualitative criteria as by quantitative.

To repeat the most fundamental point, measurement models are oriented toward individual-level response processes, not group-level aggregate processes. Herbert Blumer pointed out as early as 1930 that quantitative method is not equivalent to statistical method, and that the natural sciences had conspicuous degrees of success long before the emergence of statistical techniques (Hammersley, 1989, pp. 113-4). Both the initial scientific revolution of the 16th and 17th centuries and the second scientific revolution of the 19th century found a basis in measurement for publicly objective and reproducible results, but statistics played little or no role in the major discoveries of the times.

The scientific value of statistics resides largely in the reproducibility of cross-variable data relations, and statisticians widely agree that statistical analyses should depend only on sufficient statistics (Arnold, 1982, p. 79). Measurement theoreticians and practitioners also agree, but the sufficiency of the mean and standard deviation relative to a normal distribution is one thing, and the sufficiency of individual responses relative to an invariant construct is quite another (Andersen, 1977; Arnold, 1985; Dynkin, 1951; Fischer, 1981; Hall, Wijsman, & Ghosh, 1965; van der Linden, 1992).

It is of historical interest, though, to point out that Rasch, foremost proponent of the latter, attributes credit for the general value of the concept of sufficiency to Ronald Fisher, foremost proponent of the former. Rasch’s strong statements concerning the fundamental inferential value of sufficiency (Andrich, 1997; Rasch, 1977; Wright, 1980) would seem to contradict his repeated joke about burning all the statistics texts making use of the normal distribution (Andersen, 1995, p. 385) were it not for the paradigmatic distinction between statistical models of group-level relations among variables, and measurement models of individual processes. Indeed, this distinction is made on the first page of Rasch’s (1980) book.

Now we are in a position to appreciate a comment by Ernest Rutherford, the winner of the 1908 Nobel Prize in Chemistry, who held that, if you need statistics to understand the results of your experiment, then you should have designed a better experiment (Wise, 1995, p. 11). A similar point was made by Feinstein (1995) concerning meta-analysis. The rarely appreciated point is that the generalizable replication and application of results depend heavily on the existence of a portable and universally uniform observational framework. The inferences, judgments, and adjustments that can be made at the point of use by clinicians, teachers, managers, etc., provided with additive measures expressed in a substantively meaningful common metric far outstrip those that can be made using ordinal measures expressed in instrument- and sample-dependent scores. See Andrich (1989, 2002, 2004), Cohen (1994), Davidoff (1999), Duncan (1992), Embretson (1996), Goodman (1999a, 1999b, 1999c), Guttman (1981, 1985), Meehl (1967), Michell (1986), Rogosa (1987), Romanoski and Douglas (2002), and others for more on this distinction between statistics and measurement.

These contrasts show that the confounding of statistics and measurement is a problem of vast significance that persists in spite of repeated efforts to clarify the distinction. For a wide variety of reasons ranging from cultural presuppositions about the nature of number to the popular notion that quantification is as easy as assigning numbers to observations, measurement is not generally well understood by the public (or even by statisticians!). And so statistics textbooks rarely, if ever, include even passing mention of instrument calibration methods, metric equating processes, the evaluation of data quality relative to the requirements of objective inference, traceability to metrological reference standards, or the integration of qualitative and quantitative methods in the interpretation of measures.

Similarly, in business, marketing, health care, and quality improvement circles, we find near-universal repetition of the mantra, “You manage what you measure,” with very little or no attention paid to the quality of the numbers treated as measures. And so, we find ourselves stuck with so-called measurement systems where,

• instead of linear measures defined by a unit that remains constant across samples and instruments we saddle ourselves with nonlinear scores and percentages defined by units that vary in unknown ways across samples and instruments;
• instead of availing ourselves of the capacity to take missing data into account, we hobble ourselves with the need for complete data;
• instead of dramatically reducing data volume with no loss of information, we insist on constantly re-enacting the meaningless ritual of poring over indigestible masses of numbers;
• instead of adjusting measures for the severity or leniency of judges assigning ratings, we allow measures to depend unfairly on which rater happens to make the observations;
• instead of using methods that give the same result across different distributions, we restrict ourselves to ones that give different results when assumptions of normality are not met and/or standard deviations differ;
• instead of calibrating instruments in an experimental test of the hypothesis that the intended construct is in fact structured in such a way as to make its mapping onto a number line meaningful, we assign numbers and make quantitative inferences with no idea as to whether they relate at all to anything real;
• instead of checking to see whether rating scales work as intended, with higher ratings consistently representing more of the variable, we make assumptions that may be contradicted by the order and spacing of the way rating scales actually work in practice;
• instead of defining a comprehensive framework for interpreting measures relative to a construct, we accept the narrow limits of frameworks defined by the local sample and items;
• instead of capitalizing on the practicality and convenience of theories capable of accurately predicting item calibrations and measures apart from data, we counterproductively define measurement empirically in terms of data analysis;
• instead of putting calibrated tools into the hands of front-line managers, service representatives, teachers and clinicians, we require them to submit to cumbersome data entry, analysis, and reporting processes that defeat the purpose of measurement by ensuring the information provided is obsolete by the time it gets back to the person who could act on it; and
• instead of setting up efficient systems for communicating meaningful measures in common languages with shared points of reference, we settle for inefficient systems for communicating meaningless scores in local incommensurable languages.

Because measurement is simultaneously ubiquitous and rarely well understood, we find ourselves in a world that gives near-constant lip service to the importance of measurement while it does almost nothing to provide measures that behave the way we assume they do. This state of affairs seems to have emerged in large part due to our failure to distinguish between the group-level orientation of statistics and the individual-level orientation of measurement. We seem to have been seduced by a variation on what Whitehead (1925, pp. 52-8) called the fallacy of misplaced concreteness. That is, we have assumed that the power of lawful regularities in thought and behavior would be best revealed and acted on via statistical analyses of data that themselves embody the aggregate mass of the patterns involved.

It now appears, however, in light of studies in the history of science (Latour, 1987, 2005; Wise, 1995), that an alternative and likely more successful approach will be to capitalize on the “wisdom of crowds” (Surowiecki, 2004) phenomenon of collective, distributed cognition (Akkerman, et al., 2007; Douglas, 1986; Hutchins, 1995; Magnus, 2007). This will be done by embodying lawful regularities in instruments calibrated in ideal, abstract, and portable metrics put to work by front-line actors on mass scales (Fisher, 2000, 2005, 2009a, 2009b). In this way, we will inform individual decision processes and structure communicative transactions with efficiencies, meaningfulness, substantive effectiveness, and power that go far beyond anything that could be accomplished by trying to make inferences about individuals from group-level statistics.

We ought not accept the factuality of data as the sole criterion of objectivity, with all theory and instruments constrained by and focused on the passing ephemera of individual sets of local particularities. Properly defined and operationalized via a balanced interrelation of theory, data, and instrument, advanced measurement is not a mere mathematical exercise but offers a wealth of advantages and conveniences that cannot otherwise be obtained. We ignore its potentials at our peril.

References
Akkerman, S., Van den Bossche, P., Admiraal, W., Gijselaers, W., Segers, M., Simons, R.-J., et al. (2007, February). Reconsidering group cognition: From conceptual confusion to a boundary area between cognitive and socio-cultural perspectives? Educational Research Review, 2, 39-63.

Andersen, E. B. (1977). Sufficient statistics and latent trait models. Psychometrika, 42(1), 69-81.

Andersen, E. B. (1995). What George Rasch would have thought about this book. In G. H. Fischer & I. W. Molenaar (Eds.), Rasch models: Foundations, recent developments, and applications (pp. 383-390). New York: Springer-Verlag.

Andrich, D. (1989). Distinctions between assumptions and requirements in measurement in the social sciences. In J. A. Keats, R. Taft, R. A. Heath & S. H. Lovibond (Eds.), Mathematical and Theoretical Systems: Proceedings of the 24th International Congress of Psychology of the International Union of Psychological Science, Vol. 4 (pp. 7-16). North-Holland: Elsevier Science Publishers.

Andrich, D. (1997). Georg Rasch in his own words [excerpt from a 1979 interview]. Rasch Measurement Transactions, 11(1), 542-3. [http://www.rasch.org/rmt/rmt111.htm#Georg].

Andrich, D. (2002). Understanding resistance to the data-model relationship in Rasch’s paradigm: A reflection for the next generation. Journal of Applied Measurement, 3(3), 325-59.

Andrich, D. (2004, January). Controversy and the Rasch model: A characteristic of incompatible paradigms? Medical Care, 42(1), I-7–I-16.

Arnold, S. F. (1982-1988). Sufficient statistics. In S. Kotz, N. L. Johnson & C. B. Read (Eds.), Encyclopedia of Statistical Sciences (pp. 72-80). New York: John Wiley & Sons.

Arnold, S. F. (1985, September). Sufficiency and invariance. Statistics & Probability Letters, 3, 275-279.

Cohen, J. (1994). The earth is round (p < 0.05). American Psychologist, 49, 997-1003.

Davidoff, F. (1999, 15 June). Standing statistics right side up (Editorial). Annals of Internal Medicine, 130(12), 1019-1021.

Douglas, M. (1986). How institutions think. Syracuse, New York: Syracuse University Press.

Dynkin, E. B. (1951). Necessary and sufficient statistics for a family of probability distributions. Selected Translations in Mathematical Statistics and Probability, 1, 23-41.

Duncan, O. D. (1992, September). What if? Contemporary Sociology, 21(5), 667-668.

Duncan, O. D., & Stenbeck, M. (1988). Panels and cohorts: Design and model in the study of voting turnout. In C. C. Clogg (Ed.), Sociological Methodology 1988 (pp. 1-35). Washington, DC: American Sociological Association.

Embretson, S. E. (1996, September). Item Response Theory models and spurious interaction effects in factorial ANOVA designs. Applied Psychological Measurement, 20(3), 201-212.

Falmagne, J.-C., & Narens, L. (1983). Scales and meaningfulness of quantitative laws. Synthese, 55, 287-325.

Feinstein, A. R. (1995, January). Meta-analysis: Statistical alchemy for the 21st century. Journal of Clinical Epidemiology, 48(1), 71-79.

Fischer, G. H. (1981, March). On the existence and uniqueness of maximum-likelihood estimates in the Rasch model. Psychometrika, 46(1), 59-77.

Fisher, W. P., Jr. (2000). Objectivity in psychosocial measurement: What, why, how. Journal of Outcome Measurement, 4(2), 527-563.

Fisher, W. P., Jr. (2005). Daredevil barnstorming to the tipping point: New aspirations for the human sciences. Journal of Applied Measurement, 6(3), 173-9.

Fisher, W. P., Jr. (2009a). Bringing human, social, and natural capital to life: Practical consequences and opportunities. In M. Wilson, K. Draney, N. Brown & B. Duckor (Eds.), Advances in Rasch Measurement, Vol. Two (p. in press). Maple Grove, MN: JAM Press.

Fisher, W. P., Jr. (2009b, July). Invariance and traceability for measures of human, social, and natural capital: Theory and application. Measurement (Elsevier), in press.

Goodman, S. N. (1999a, 6 April). Probability at the bedside: The knowing of chances or the chances of knowing? (Editorial). Annals of Internal Medicine, 130(7), 604-6.

Goodman, S. N. (1999b, 15 June). Toward evidence-based medical statistics. 1: The p-value fallacy. Annals of Internal Medicine, 130(12), 995-1004.

Goodman, S. N. (1999c, 15 June). Toward evidence-based medical statistics. 2: The Bayes factor. Annals of Internal Medicine, 130(12), 1005-1013.

Guttman, L. (1981). What is not what in theory construction. In I. Borg (Ed.), Multidimensional data representations: When & why. Ann Arbor, MI: Mathesis Press.

Guttman, L. (1985). The illogic of statistical inference for cumulative science. Applied Stochastic Models and Data Analysis, 1, 3-10.

Hall, W. J., Wijsman, R. A., & Ghosh, J. K. (1965). The relationship between sufficiency and invariance with applications in sequential analysis. Annals of Mathematical Statistics, 36, 575-614.

Hammersley, M. (1989). The dilemma of qualitative method: Herbert Blumer and the Chicago Tradition. New York: Routledge.

Hutchins, E. (1995). Cognition in the wild. Cambridge, Massachusetts: MIT Press.

Latour, B. (1987). Science in action: How to follow scientists and engineers through society. New York: Cambridge University Press.

Latour, B. (1995). Cogito ergo sumus! Or psychology swept inside out by the fresh air of the upper deck: Review of Hutchins’ Cognition in the Wild, MIT Press, 1995. Mind, Culture, and Activity: An International Journal, 3(192), 54-63.

Latour, B. (2005). Reassembling the social: An introduction to Actor-Network-Theory. (Clarendon Lectures in Management Studies). Oxford, England: Oxford University Press.

Luce, R. D. (1978, March). Dimensionally invariant numerical laws correspond to meaningful qualitative relations. Philosophy of Science, 45, 1-16.

Magnus, P. D. (2007). Distributed cognition and the task of science. Social Studies of Science, 37(2), 297-310.

Marcus-Roberts, H., & Roberts, F. S. (1987). Meaningless statistics. Journal of Educational Statistics, 12(4), 383-394.

Meehl, P. E. (1967). Theory-testing in psychology and physics: A methodological paradox. Philosophy of Science, 34(2), 103-115.

Michell, J. (1986). Measurement scales and statistics: A clash of paradigms. Psychological Bulletin, 100, 398-407.

Mundy, B. (1986, June). On the general theory of meaningful representation. Synthese, 67(3), 391-437.

Narens, L. (2002, December). A meaningful justification for the representational theory of measurement. Journal of Mathematical Psychology, 46(6), 746-68.

Rasch, G. (1960). Probabilistic models for some intelligence and attainment tests (Reprint, with Foreword and Afterword by B. D. Wright, Chicago: University of Chicago Press, 1980). Copenhagen, Denmark: Danmarks Paedagogiske Institut.

Rasch, G. (1977). On specific objectivity: An attempt at formalizing the request for generality and validity of scientific statements. Danish Yearbook of Philosophy, 14, 58-94.

Roberts, F. S. (1999). Meaningless statements. In R. Graham, J. Kratochvil, J. Nesetril & F. Roberts (Eds.), Contemporary trends in discrete mathematics, DIMACS Series, Volume 49 (pp. 257-274). Providence, RI: American Mathematical Society.

Rogosa, D. (1987). Casual [sic] models do not support scientific conclusions: A comment in support of Freedman. Journal of Educational Statistics, 12(2), 185-95.

Romanoski, J. T., & Douglas, G. (2002). Rasch-transformed raw scores and two-way ANOVA: A simulation analysis. Journal of Applied Measurement, 3(4), 421-430.

Stevens, S. S. (1951). Mathematics, measurement, and psychophysics. In S. S. Stevens (Ed.), Handbook of experimental psychology (pp. 1-49). New York: John Wiley & Sons.

Surowiecki, J. (2004). The wisdom of crowds: Why the many are smarter than the few and how collective wisdom shapes business, economies, societies and nations. New York: Doubleday.

van der Linden, W. J. (1992). Sufficient and necessary statistics. Rasch Measurement Transactions, 6(3), 231 [http://www.rasch.org/rmt/rmt63d.htm].

Whitehead, A. N. (1925). Science and the modern world. New York: Macmillan.

Wise, M. N. (Ed.). (1995). The values of precision. Princeton, New Jersey: Princeton University Press.

Wright, B. D. (1980). Foreword, Afterword. In Probabilistic models for some intelligence and attainment tests, by Georg Rasch (pp. ix-xix, 185-199. http://www.rasch.org/memo63.htm) [Reprint; original work published in 1960 by the Danish Institute for Educational Research]. Chicago, Illinois: University of Chicago Press.

Contesting the Claim, Part III: References

July 24, 2009

References

Andersen, E. B. (1977). Sufficient statistics and latent trait models. Psychometrika, 42(1), 69-81.

Andersen, E. B. (1995). What George Rasch would have thought about this book. In G. H. Fischer & I. W. Molenaar (Eds.), Rasch models: Foundations, recent developments, and applications (pp. 383-390). New York: Springer-Verlag.

Andrich, D. (1988). Rasch models for measurement (Sage University Paper Series on Quantitative Applications in the Social Sciences, series no. 07-068). Beverly Hills, California: Sage Publications.

Andrich, D. (1998). Thresholds, steps and rating scale conceptualization. Rasch Measurement Transactions, 12(3), 648-9 [http://209.238.26.90/rmt/rmt1239.htm].

Arnold, S. F. (1985, September). Sufficiency and invariance. Statistics & Probability Letters, 3, 275-279.

Bond, T., & Fox, C. (2001). Applying the Rasch model: Fundamental measurement in the human sciences. Mahwah, New Jersey: Lawrence Erlbaum Associates.

Burdick, D. S., Stone, M. H., & Stenner, A. J. (2006). The Combined Gas Law and a Rasch Reading Law. Rasch Measurement Transactions, 20(2), 1059-60 [http://www.rasch.org/rmt/rmt202.pdf].

Burdick, H., & Stenner, A. J. (1996). Theoretical prediction of test items. Rasch Measurement Transactions, 10(1), 475 [http://www.rasch.org/rmt/rmt101b.htm].

Choi, E. (1998, Spring). Rasch invents “Ounces.” Popular Measurement, 1(1), 29 [http://www.rasch.org/pm/pm1-29.pdf].

Cohen, J. (1994). The earth is round (p < 0.05). American Psychologist, 49, 997-1003.

DeBoeck, P., & Wilson, M. (Eds.). (2004). Explanatory item response models: A generalized linear and nonlinear approach. (Statistics for Social and Behavioral Sciences). New York: Springer-Verlag.

Dynkin, E. B. (1951). Necessary and sufficient statistics for a family of probability distributions. Selected Translations in Mathematical Statistics and Probability, 1, 23-41.

Embretson, S. E. (1996, September). Item Response Theory models and spurious interaction effects in factorial ANOVA designs. Applied Psychological Measurement, 20(3), 201-212.

Falmagne, J.-C., & Narens, L. (1983). Scales and meaningfulness of quantitative laws. Synthese, 55, 287-325.

Fischer, G. H. (1981, March). On the existence and uniqueness of maximum-likelihood estimates in the Rasch model. Psychometrika, 46(1), 59-77.

Fischer, G. H. (1995). Derivations of the Rasch model. In G. Fischer & I. Molenaar (Eds.), Rasch models: Foundations, recent developments, and applications (pp. 15-38). New York: Springer-Verlag.

Fisher, W. P., Jr. (1988). Truth, method, and measurement: The hermeneutic of instrumentation and the Rasch model [diss]. Dissertation Abstracts International, 49, 0778A, Dept. of Education, Division of the Social Sciences: University of Chicago (376 pages, 23 figures, 31 tables).

Fisher, W. P., Jr. (1997). Physical disability construct convergence across instruments: Towards a universal metric. Journal of Outcome Measurement, 1(2), 87-113.

Fisher, W. P., Jr. (1997, June). What scale-free measurement means to health outcomes research. Physical Medicine & Rehabilitation State of the Art Reviews, 11(2), 357-373.

Fisher, W. P., Jr. (1999). Foundations for health status metrology: The stability of MOS SF-36 PF-10 calibrations across samples. Journal of the Louisiana State Medical Society, 151(11), 566-578.

Fisher, W. P., Jr. (2000). Objectivity in psychosocial measurement: What, why, how. Journal of Outcome Measurement, 4(2), 527-563.

Fisher, W. P., Jr. (2004, October). Meaning and method in the social sciences. Human Studies: A Journal for Philosophy and the Social Sciences, 27(4), 429-54.

Fisher, W. P., Jr. (2008, Summer). The cash value of reliability. Rasch Measurement Transactions, 22(1), 1160-3 [http://www.rasch.org/rmt/rmt221.pdf].

Fisher, W. P., Jr. (2009, July). Invariance and traceability for measures of human, social, and natural capital: Theory and application. Measurement (Elsevier), in press.

Goodman, S. N. (1999, 15 June). Toward evidence-based medical statistics. 1: The p-value fallacy. Annals of Internal Medicine, 130(12), 995-1004.

Guttman, L. (1985). The illogic of statistical inference for cumulative science. Applied Stochastic Models and Data Analysis, 1, 3-10.

Hall, W. J., Wijsman, R. A., & Ghosh, J. K. (1965). The relationship between sufficiency and invariance with applications in sequential analysis. Annals of Mathematical Statistics, 36, 575-614.

Linacre, J. M. (1993). Rasch-based generalizability theory. Rasch Measurement Transactions, 7(1), 283-284 [http://www.rasch.org/rmt/rmt71h.htm].

Luce, R. D., & Tukey, J. W. (1964). Simultaneous conjoint measurement: A new kind of fundamental measurement. Journal of Mathematical Psychology, 1(1), 1-27.

Meehl, P. E. (1967). Theory-testing in psychology and physics: A methodological paradox. Philosophy of Science, 34(2), 103-115.

Meehl, P. E. (1978). Theoretical risks and tabular asterisks: Sir Karl, Sir Ronald, and the slow progress of soft psychology. Journal of Consulting and Clinical Psychology, 46, 806-34.

Michell, J. (1999). Measurement in psychology: A critical history of a methodological concept. Cambridge: Cambridge University Press.

Moulton, M. (1993). Probabilistic mapping. Rasch Measurement Transactions, 7(1), 268 [http://www.rasch.org/rmt/rmt71b.htm].

Mundy, B. (1986, June). On the general theory of meaningful representation. Synthese, 67(3), 391-437.

Narens, L. (2002). Theories of meaningfulness (S. W. Link & J. T. Townsend, Eds.). Scientific Psychology Series. Mahwah, New Jersey: Lawrence Erlbaum Associates.

Newby, V. A., Conner, G. R., Grant, C. P., & Bunderson, C. V. (2009). The Rasch model and additive conjoint measurement. Journal of Applied Measurement, 10(4), 348-354.

Pelton, T., & Bunderson, V. (2003). The recovery of the density scale using a stochastic quasi-realization of additive conjoint measurement. Journal of Applied Measurement, 4(3), 269-81.

Ramsay, J. O., Bloxom, B., & Cramer, E. M. (1975, June). Review of Foundations of Measurement, Vol. 1, by D. H. Krantz et al. Psychometrika, 40(2), 257-262.

Rasch, G. (1960). Probabilistic models for some intelligence and attainment tests (Reprint, with Foreword and Afterword by B. D. Wright, Chicago: University of Chicago Press, 1980). Copenhagen, Denmark: Danmarks Paedagogiske Institut.

Roberts, F. S., & Rosenbaum, Z. (1986). Scale type, meaningfulness, and the possible psychophysical laws. Mathematical Social Sciences, 12, 77-95.

Romanoski, J. T., & Douglas, G. (2002). Rasch-transformed raw scores and two-way ANOVA: A simulation analysis. Journal of Applied Measurement, 3(4), 421-430.

Rozeboom, W. W. (1960). The fallacy of the null-hypothesis significance test. Psychological Bulletin, 57(5), 416-428.

Smith, R. M., & Taylor, P. (2004). Equating rehabilitation outcome scales: Developing common metrics. Journal of Applied Measurement, 5(3), 229-42.

Thurstone, L. L. (1928). Attitudes can be measured. American Journal of Sociology, XXXIII, 529-544. Reprinted in L. L. Thurstone, The Measurement of Values. Midway Reprint Series. Chicago, Illinois: University of Chicago Press, 1959, pp. 215-233.

van der Linden, W. J. (1992). Sufficient and necessary statistics. Rasch Measurement Transactions, 6(3), 231 [http://www.rasch.org/rmt/rmt63d.htm].

Velleman, P. F., & Wilkinson, L. (1993). Nominal, ordinal, interval, and ratio typologies are misleading. The American Statistician, 47(1), 65-72.

Wright, B. D. (1977). Solving measurement problems with the Rasch model. Journal of Educational Measurement, 14(2), 97-116 [http://www.rasch.org/memo42.htm].

Wright, B. D. (1997, Winter). A history of social science measurement. Educational Measurement: Issues and Practice, 16(4), 33-45, 52 [http://www.rasch.org/memo62.htm].

Contesting the Claim, Part II: Are Rasch Measures Really as Objective as Physical Measures?

July 22, 2009

When a raw score is sufficient to the task of measurement, the model is the Rasch model, we can estimate the parameters consistently, and we can evaluate the fit of the data to the model. The invariance properties that follow from a sufficient statistic include virtually the entire class of invariant rules (Hall, Wijsman, & Ghosh, 1965; Arnold, 1985), and similar relationships with other key measurement properties follow from there (Fischer, 1981, 1995; Newby, Conner, Grant, & Bunderson, 2009; Wright, 1977, 1997).

What does this all actually mean? Imagine we were able to ask an infinite number of people an infinite number of questions that all work together to measure the same thing. Because (1) the scores are sufficient statistics, (2) the ruler is not affected by what is measured, (3) the parameters separate, and (4) the data fit the model, any subset of the questions asked would give the same measure. This means that any subscore for any person measured would be a function of any and all other subscores. When a sufficient statistic is a function of all other sufficient statistics, it is not only sufficient, it is necessary, and is referred to as a minimally sufficient statistic. Thus, if separable, independent model parameters can be estimated, the model must be the Rasch model, and the raw score is both sufficient and necessary (Andersen, 1977; Dynkin, 1951; van der Linden, 1992).
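
For readers who want the sufficiency claim in symbols, one standard way of writing it for dichotomous responses (my notation: β for the person measure, δ for the item calibration, r for the raw score over L items) is

\[
P(x_{vi}=1 \mid \beta_v, \delta_i) = \frac{\exp(\beta_v - \delta_i)}{1 + \exp(\beta_v - \delta_i)},
\qquad
P\bigl(x_{v1},\dots,x_{vL} \mid r_v\bigr) = \frac{\exp\bigl(-\sum_i x_{vi}\,\delta_i\bigr)}{\sum_{\{y \,:\, \sum_i y_i = r_v\}} \exp\bigl(-\sum_i y_i\,\delta_i\bigr)} .
\]

Because β_v cancels out of the conditional probability of the response pattern given the raw score, the raw score carries all of the information the data contain about the person’s measure. That is what sufficiency means here, and it is what licenses estimating the item calibrations independently of the particular persons measured.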

This means that scores, ratings, and percentages actually stand for something measurable only when they fit a Rasch model. After all, what would be the point of using data that do not support the estimation of independent parameters? If the meaning of the results is tied in unknown ways to the specific particulars of a given situation, then those results are meaningless, by definition (Roberts & Rosenbaum, 1986; Falmagne & Narens, 1983; Mundy, 1986; Narens, 2002; also see Embretson, 1996; Romanoski & Douglas, 2002). There would be no point in trying to learn anything from them, as whatever happened was a one-time unique event that tells us nothing we can use in any future event (Wright, 1977, 1997).

What we’ve done here is akin to taking a narrative stroll through a garden of mathematical proofs. These conceptual analyses can be very convincing, but actual demonstrations of them are essential. Demonstrations would be especially persuasive if there were some way of showing three things. First, shouldn’t there be some way of constructing ordinal ratings or scores for one or another physical variable that, when scaled, give us measures that are the same as the usual measures we are accustomed to?

This would show that we can use the type of instrument usually found in the social sciences to construct physical measures with the characteristics we expect. There are four available examples, in fact, involving paired comparisons of weights (Choi, 1998), measures of short lengths (Fisher, 1988), ratings of medium-range distances (Moulton, 1993), and a recovery of the density scale (Pelton & Bunderson, 2003). In each case, the Rasch-calibrated experimental instruments produced measures equivalent to the controls, as shown in linear plots of the pairs of measures.

A second way to build out from the mathematical proofs is with experiments in which we check the purported stability of measures and calibrations. We can do this by splitting large data sets, using different groups of items to produce two or more measures for each person, or using different groups of respondents/examinees to provide data for two or more sets of item calibrations. This is a routine experimental procedure in many psychometric labs, and results tend to conform with theory, with strong associations found between increasing sample sizes and increasing reliability coefficients for the respective measures or calibrations. These associations can be plotted (Fisher, 2008), as can the pairs of calibrations estimated from different samples (Fisher, 1999), and the pairs of measures estimated from different instruments (Fisher, Harvey, Kilgore, et al., 1995; Smith & Taylor, 2004). The theoretical expectation of tighter plots for better designed instruments, larger sample sizes, and longer tests is confirmed so regularly that it should itself have the status of a law of nature (Linacre, 1993).
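
A toy version of that split-sample check can be run in a few lines (my illustration; it uses crude centered log-odds item calibrations rather than full Rasch estimation, so it only shows the stability of the item ordering and spacing across samples):

import math, random

random.seed(1)
true_difficulties = [-2 + 4 * i / 9 for i in range(10)]   # ten items spread from -2 to +2 logits

def crude_calibrations(sample_size):
    people = [random.gauss(0, 1) for _ in range(sample_size)]
    calibrations = []
    for d in true_difficulties:
        right = sum(random.random() < 1 / (1 + math.exp(-(b - d))) for b in people)
        calibrations.append(math.log((sample_size - right + 0.5) / (right + 0.5)))
    mean = sum(calibrations) / len(calibrations)
    return [c - mean for c in calibrations]               # center the estimates

first, second = crude_calibrations(500), crude_calibrations(500)
for d, x, y in zip(true_difficulties, first, second):
    print(f"{d:+.2f}  {x:+.2f}  {y:+.2f}")  # the two estimated columns track each other closely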

A third convincing demonstration is to compare studies of the same thing conducted in different times and places by different researchers using different instruments on different samples. If the instruments really measure the same thing, there will not only be obvious similarities in their item contents, but similar items will calibrate in similar positions on the metric across samples. Results of this kind have been obtained in at least three published studies (Fisher, 1997a, 1997b; Belyukova, Stone, & Fox, 2004).

All of these arguments are spelled out at greater length and in more detail, with illustrations, in a forthcoming article (Fisher, 2009). I learned all of this from Benjamin Wright, who worked directly with Rasch himself, and who, perhaps more importantly, was prepared for what he could learn from Rasch by his previous career as a physicist. Before encountering Rasch in 1960, Wright had worked with Feynman at Cornell, Townes at Bell Labs, and Mulliken at the University of Chicago. Taught and influenced not just by three of the great minds of twentieth-century physics, but also by Townes’ philosophical perspectives on meaning and beauty, Wright had left physics in search of life. He was happy to transfer his experience with computers into his new field of educational research, but he was dissatisfied with the quality of the data and how it was treated.

Rasch’s ideas gave Wright the conceptual tools he needed to integrate his scientific values with the demands of the field he was in. Over the course of his 40-year career in measurement, Wright wrote the first software for estimating Rasch model parameters and continuously improved it; he adapted new estimation algorithms for Rasch’s models and was involved in the articulation of new models; he applied the models to hundreds of data sets using his software; he vigorously invested himself in students and colleagues; he founded new professional societies, meetings, and journals;  and he never stopped learning how to think anew about measurement and the meaning of numbers. Through it all, there was always a yardstick handy as a simple way of conveying the basic requirements of measurement as we intuitively understand it in physical terms.

Those of us who spend a lot of time working with these ideas and trying them out on lots of different kinds of data forget, or never realize, how skewed our experience is relative to everyone else’s. I guess you live in a different world when you have the sustained luxury of working with very large databases, as I have had, and you see the constancy and stability of well-designed measures and calibrations over time, across instruments, and over repeated samples ranging from 30 to several million.

When you have that experience, it becomes a basic description of reasonable expectation to read the work of a colleague and see him say that “when the key features of a statistical model relevant to the analysis of social science data are the same as those of the laws of physics, then those features are difficult to ignore” (Andrich, 1988, p. 22). After calibrating dozens of instruments over 25 years, some of them many times over, it just seems like the plainest statement of the obvious to see the same guy say “Our measurement principles should be the same for properties of rocks as for the properties of people. What we say has to be consistent with physical measurement” (Andrich, 1998, p. 3).

And I find myself wishing more people held the opinion expressed by two other colleagues, that “scientific measures in the social sciences must hold to the same standards as do measures in the physical sciences if they are going to lead to the same quality of generalizations” (Bond & Fox, 2001, p. 2). When these sentiments are taken to their logical conclusion in a practical application, the real value of “attempting for reading comprehension what Newtonian mechanics achieved for astronomy” (Burdick & Stenner, 1996) becomes apparent. Rasch’s analogy between the structure of his model for reading tests and Newton’s Second Law can be restated relative to any physical law expressed as a universal conditional among variable triplets; a theory of the variable measured that is capable of predicting item calibrations provides the causal story for the observed variation (Burdick, Stone, & Stenner, 2006; De Boeck & Wilson, 2004).
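Restating the analogy in standard notation (my own gloss, not a quotation from Rasch), both laws share the same three-part structure: a parameter for the object measured, a parameter for the agent it is brought into contact with, and an observed outcome given by their ratio:

\[
a_{vj} = \frac{F_j}{M_v}
\qquad\text{and}\qquad
\frac{P\{X_{vi}=1\}}{P\{X_{vi}=0\}} = \frac{\xi_v}{\epsilon_i},
\qquad \xi_v = e^{\beta_v},\ \epsilon_i = e^{\delta_i},
\]

\[
\text{so that}\qquad
P\{X_{vi}=1\} = \frac{\exp(\beta_v - \delta_i)}{1 + \exp(\beta_v - \delta_i)}.
\]

Just as the acceleration of a body is fixed by the ratio of the force applied to its mass, the odds of a correct response are fixed by the ratio of a person parameter to an item parameter; a substantive theory strong enough to predict the item difficulties in advance is what supplies the causal story referred to above.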

Knowing what I know, from the mathematical principles I’ve been trained in and from the extensive experimental work I’ve done, it seems amazing that so little attention is actually paid to tools and concepts that receive daily lip service as to their central importance in every facet of life, from health care to education to economics to business. Measurement technology rose up decades ago in preparation for the demands of today’s challenges. It is just plain weird that we are not putting it to use at anything near its potential.

I’m convinced, though, that the solution is not a matter of persuasive rhetoric applied to the minds of the right people. Rather, someone, hopefully me, has got to configure the right combination of players in the right situation, at the right time and place, to create a new form of real value that can’t be created any other way. Like they say, money talks. Persuasion is all well and good, but things will really take off only when people see that better measurement can help remove inefficiencies from the management of human, social, and natural capital, that better measurement is essential to creating sustainable and socially responsible policies and practices, and that better measurement means new sources of profitability. I’m convinced that advanced measurement techniques are really nothing more than a new form of information and communications technology. They will fit right into the existing networks and multiply their efficiencies many times over.

And when they do, we may be in a position to finally

“confront the remarkable fact that throughout the gigantic range of physical knowledge numerical laws assume a remarkably simple form provided fundamental measurement has taken place. Although the authors cannot explain this fact to their own satisfaction, the extension to behavioral science is obvious: we may have to await fundamental measurement before we will see any real progress in quantitative laws of behavior. In short, ordinal scales (even continuous ordinal scales) are perhaps not good enough and it may not be possible to live forever with a dozen different procedures for quantifying the same piece of behavior, each making strong but untestable and basically unlikely assumptions which result in nonlinear plots of one scale against another. Progress in physics would have been impossibly difficult without fundamental measurement and the reader who believes that all that is at stake in the axiomatic treatment of measurement is a possible criterion for canonizing one scaling procedure at the expense of others is missing the point” (Ramsay, Bloxom, and Cramer, 1975, p. 262).

LivingCapitalMetrics Blog by William P. Fisher, Jr., Ph.D. is licensed under a Creative Commons Attribution-Noncommercial-No Derivative Works 3.0 United States License.
Based on a work at livingcapitalmetrics.wordpress.com.
Permissions beyond the scope of this license may be available at http://www.livingcapitalmetrics.com.