Archive for the ‘Ordinal vs Interval’ Category

Dispelling Myths about Measurement in Psychology and the Social Sciences

August 27, 2013

Seven common assumptions about measurement and method in psychology and the social sciences stand as anomalies, inconsistent with the experience of those who have taken the trouble to challenge them. As evidence, theory, and instrumentation accumulate, will we see a revolutionary break and disruptive change across multiple social and economic levels and areas as a result? Will there be a slower, more gradual transition to a new paradigm? Or will the status quo simply roll on, oblivious to the potential for new questions and new directions? We shall see.

1. Myth: Qualitative data and methods cannot really be integrated with quantitative data and methods because of opposing philosophical assumptions.

Fact: Qualitative methods incorporate a critique of quantitative methods that leads to a more scientific theory and practice of measurement.

2. Myth: Statistics is the logic of measurement.

Fact: Statistics did not emerge as a discipline until the 19th century, while measurement, of course, has been around for millennia. Measurement models operate at the individual level, within a single variable, whereas statistical models operate at the population level, between variables. Data are fit to prescriptive measurement models, which flag misfitting observations in accord with the Garbage-In, Garbage-Out (GIGO) principle, while descriptive statistical models are fit to data.

3. Myth: Linear measurement from ordinal test and survey data is impossible.

Fact: Ordinal data have been used as a basis for invariant linear measures for decades.

4. Myth: Scientific laws like Newton’s laws of motion cannot be successfully formulated, tested, or validated in psychology and the social sciences.

Fact: Mathematical laws of human behavior and cognition in the same form as Newton’s laws are formulated, tested, and validated in numerous Rasch model applications.

5. Myth: Experimental manipulations of psychological and social phenomena are inherently impossible or unethical.

Fact: Decades of research across multiple fields have successfully shown how theory-informed interventions on items/indicators/questions can result in predictable, consistent, and substantively meaningful quantitative changes.

6. Myth: “Real” measurement is impossible in psychology and the social sciences.

Fact: Success in predictive theory, instrument calibration, and in maintaining stable units of comparison over time are all evidence supporting the viability of meaningful uniform units of measurement in psychology and the social sciences.

7. Myth: Efficient economic markets can incorporate only manufactured capital, liquid capital, and property. Human, social, and natural capital, being intangible, have permanent status as market externalities, since they cannot be measured well enough to enable accountability, pricing, or transferable representations (common-currency instruments).

Fact: The theory and methods necessary for establishing an Intangible Assets Metric System are in hand. What’s missing is the awareness of the scientific, human, social, and economic value that would be returned from the admittedly very large investments that would be required.

References and examples are available in other posts in this blog, in my publications, or on request.

Number lines, counting, and measuring in arithmetic education

July 29, 2011

Over the course of two days spent at a meeting on mathematics education, a question started to form in my mind, one I don’t know how to answer, and to which there may be no answer. I’d like to try to formulate what’s on my mind in writing, and see whether it’s just nonsense, a curiosity, some old debate that’s been long since resolved, an issue too complex to address in elementary education, or something we might actually want to try to do something about.

The question stems from my long experience in measurement. It is one of the basic principles of the field that counting and measuring are different things (see the list of publications on this, below). Counts don’t behave like measures unless the things being counted are units of measurement established as equal ratios or intervals that remain invariant independent of the local particulars of the sample and instrument.

Plainly, if you count two groups of small and large rocks or oranges, the two groups can have the same number of things while the group with the larger things has more rock or orange than the group with the smaller things. But the association of counting numbers and arithmetic operations with number lines insinuates and reinforces, to the point of automatic intuition, the false idea that numbers always represent quantity. I know that number lines are supposed to represent an abstract continuum, but I think it must be nearly impossible for children not to assume that the number line is basically a kind of ruler, a real physical thing that behaves much like a row of same-size wooden blocks laid end to end.

This could be completely irrelevant if the distinction between “How many?” and “How much?” is intensively taught and drilled into kids. Somehow I think it isn’t, though. And here’s where I get to the first part of my real question. Might not the universal, early, and continuous reinforcement of this simplistic equating of number and quantity have a lot to do with the equally simplistic assumption that all numeric data and statistical analyses are somehow quantitative? We count rocks or fish or sticks and call the resulting numbers quantities, and so we do the same thing when we count correct answers or ratings of “Strongly Agree.”

Though counting is a natural and obvious point from which to begin studying whether something is quantitatively measurable, there are no defined units of measurement in the ordinal data gathered from tests and surveys. The difference between any two adjacent scores varies depending on which two adjacent scores are compared. This has profound implications for the inferences we make and for our ability to think together as a field about our objects of investigation.
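
To make the varying size of adjacent score differences concrete, here is a minimal sketch in Python. It assumes a hypothetical 20-item test on which, for simplicity, all items are equally difficult; under a dichotomous Rasch model, the measure corresponding to a raw score r on such a test is log(r/(20 − r)) logits. Equal one-point differences in raw scores then map to unequal differences in measures, especially toward the extremes.

import math

n_items = 20  # hypothetical test length
measures = {r: math.log(r / (n_items - r)) for r in range(1, n_items)}
for r in range(1, n_items - 1):
    gap = measures[r + 1] - measures[r]
    print(f"score {r:2d} -> {r + 1:2d}: gap of {gap:.2f} logits")

The one-point gap between scores 1 and 2 spans nearly four times the distance spanned by the one-point gap between scores 10 and 11, even though both count as “one more correct answer.”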

Over the last 30 years and more, we have become increasingly sensitized to the way our words prefigure our expectations and color our perceptions. This struggle to say what we mean and to not prejudicially exclude others from recognition as full human beings is admirable and good. But if that is so, why do we nonetheless go on unjustifiably reducing the real characteristics of people’s abilities, health, performances, etc. to numbers that do not and cannot stand for quantitative amounts? Why do we keep on referring to counts as quantities? Why do we insist on referring to inconstant and locally dependent scores as measures? And why do we refuse to use the readily available methods we have at our disposal to create universally uniform measures that consistently represent the same unit amount always and everywhere?

It seems to me that the image of the number line as a kind of ruler is so indelibly impressed on us as a habit of thought that it is very difficult to relinquish it in favor of a more abstract model of number. Might it be important for us to begin to plant the seeds for more sophisticated understandings of number early in mathematics education? I’m going to wonder out loud about this to some of my math education colleagues…

Cooper, G., & Humphry, S. M. (2010). The ontological distinction between units and entities. Synthese. DOI 10.1007/s11229-010-9832-1.

Wright, B. D. (1989). Rasch model from counting right answers: Raw scores as sufficient statistics. Rasch Measurement Transactions, 3(2), 62 [http://www.rasch.org/rmt/rmt32e.htm].

Wright, B. D. (1993). Thinking with raw scores. Rasch Measurement Transactions, 7(2), 299-300 [http://www.rasch.org/rmt/rmt72r.htm].

Wright, B. D. (1994, Autumn). Measuring and counting. Rasch Measurement Transactions, 8(3), 371 [http://www.rasch.org/rmt/rmt83c.htm].

Wright, B. D., & Linacre, J. M. (1989). Observations are always ordinal; measurements, however, must be interval. Archives of Physical Medicine and Rehabilitation, 70(12), 857-867 [http://www.rasch.org/memo44.htm].

Creative Commons License
LivingCapitalMetrics Blog by William P. Fisher, Jr., Ph.D. is licensed under a Creative Commons Attribution-Noncommercial-No Derivative Works 3.0 United States License.
Based on a work at livingcapitalmetrics.wordpress.com.
Permissions beyond the scope of this license may be available at http://www.livingcapitalmetrics.com.

Translating Gingrich’s Astute Observations on Health Care

June 30, 2011

“At the very heart of transforming health and healthcare is one simple fact: it will require a commitment by the federal government to invest in science and discovery. The period between investment and profit for basic research is too long for most companies to ever consider making the investment. Furthermore, truly basic research often produces new knowledge that everyone can use, so there is no advantage to a particular company to make the investment. The result is that truly fundamental research is almost always a function of government and foundations because the marketplace discourages focusing research in that direction” (p. 169 in Gingrich, 2003).

Gingrich says this while recognizing (p. 185) that:

“Money needs to be available for highly innovative ‘out of the box’ science. Peer review is ultimately a culturally conservative and risk-averse model. Each institution’s director should have a small amount of discretionary money, possibly 3% to 5% of their budget, to spend on outliers.”

He continues (p. 170), with some important elaborations on the theme:

“America’s economic future is a direct function of our ability to take new scientific research and translate it into entrepreneurial development.”

“The [Hart/Rudman] Commission’s second conclusion was that the failure to invest in scientific research and the failure to reform math and science education was the second largest threat to American security [behind terrorism].”

“Our goal [in the Hart/Rudman Commission] was to communicate the centrality of the scientific endeavor to American life and the depth of crisis we believe threatens the math and science education system. The United States’ ability to lead today is a function of past investments in scientific research and math and science education. There is no reason today to believe we will automatically maintain that lead especially given our current investments in scientific research and the staggering levels of our failures in math and science education.”

“Our ability to lead in 2025 will be a function of current decisions. Increasing our investment in science and discovery is a sound and responsible national security policy. No other federal expenditure will do more to create jobs, grow wealth, strengthen our world leadership, protect our environment, promote better education, or ensure better health for the country. We must make this increase now.”

On p. 171, this essential point is made:

“In health and healthcare, it is particularly important to increase our investment in research.”

This is all good. I agree completely. What NG says is probably more true than he realizes, in four ways.

First, the scientific capital created via metrology, controlled via theory, and embodied in technological instruments is the fundamental driver of any economy. The returns on investments in metrological improvements range from 40% to over 400% (NIST, 1996). We usually think of technology and technical standards in terms of computers, telecommunications, and electronics, but there actually is not anything at all in our lives untouched by metrology, since the air, water, food, clothing, roads, buildings, cars, appliances, etc. are all monitored, maintained, and/or manufactured relative to various kinds of universally uniform standards. NG is, as most people are, completely unaware that such standards are feasible and already under development for health, functionality, quality of life, quality of care, math and science education, etc. Given the huge ROIs associated with metrological improvements, there ought to be proportionately huge investments being made in metrology for human, social, and natural capital.

Second, NG’s point concerning national security is right on the mark, though for reasons that go beyond the ones he gives. There are very good reasons for thinking that investments in, and meaningful returns from, the basic science of human, social, and natural capital metrology could be expected to undercut the motivations for terrorism and the retreats into fundamentalisms of various kinds that emerge in the face of the failures of liberal democracy (Marty, 2001). Making all forms of capital measured, managed, and accountable within a common framework accessible to everyone everywhere could be an important contributing factor, emulating the property titling rationale of De Soto (1989, 2000) and the support for distributed cognition at the social level provided by metrological networks (Latour, 1987, 2005; Magnus, 2007). The costs of measurement can be so high as to stifle whole economies (Barzel, 1982), which is, broadly speaking, the primary problem with the economies of education, health care, social services, philanthropy, and environmental management (see, for instance, regarding philanthropy, Goldberg, 2009). Building the legal and financial infrastructure for low-friction titling and property exchange has become a basic feature of World Bank and IMF projects. My point, ever since I read De Soto, has been that we ought to be doing the same thing for human, social, and natural capital, facilitating explicit ownership of the skills, motivations, health, trust, and environmental resources that are rightfully the property of each of us, and that similar effects on national security ought to follow.

Third, NG makes an excellent point when he stresses the need for health and healthcare to be individual-centered, saying that, in contrast with the 20th-century healthcare system, “In the 21st Century System of Health and Healthcare, you will own your medical record, control your healthcare dollars, and be able to make informed choices about healthcare providers.” This is basically equivalent to saying that health capital needs to be fungible, and it can’t be fungible, of course, without a metrological infrastructure that makes every measure of outcomes, quality of life, etc. traceable to a reference standard. Individual-centeredness is also, of course, what distinguishes proper measurement from statistics. Measurement supports inductive inference, from the individual to the population, where statistics are deductive, going from the population to the individual (Fisher & Burton, 2010; Fisher, 2010). Individual-centered healthcare will never go anywhere without properly calibrated instrumentation and the traceability to reference standards that makes measures meaningful.

Fourth, NG repeatedly indicates how appalled he is at the slow pace of change in healthcare, citing research showing that it can take up to 17 years for doctors to adopt new procedures. I contend that this is an effect of our micromanagement of dead, concrete forms of capital. In a fluid living capital market, not only will consumers be able to reward quality in their purchasing decisions by having the information they need when they need it and in a form they can understand, but the quality improvements will be driven from the provider side in much the same way. As Brent James has shown, readily available, meaningful, and comparable information on natural variation in outcomes makes it much easier for providers to improve results and reduce the variation in them. Despite its central importance and the many years that have passed, however, the state of measurement in health care remains in dire need of dramatic improvement. Fryback (1993, p. 271; also see Kindig, 1999) succinctly put the point, observing that the U.S.

“health care industry is a $900+ billion [over $2.5 trillion in 2009 (CMS, 2011)] endeavor that does not know how to measure its main product: health. Without a good measure of output we cannot truly optimize efficiency across the many different demands on resources.”

Quantification in health care is almost universally approached using methods inadequate to the task, resulting in ordinal and scale-dependent scores that cannot take advantage of the objective comparisons provided by invariant, individual-level measures (Andrich, 2004). Though data-based statistical studies informing policy have their place, virtually no effort or resources have been invested in developing individual-level instruments traceable to universally uniform metrics that define the outcome products of health care. These metrics are key to efficiently harmonizing quality improvement, diagnostic, and purchasing decisions and behaviors in the manner described by Berwick, James, and Coye (2003) without having to cumbersomely communicate the concrete particulars of locally-dependent scores (Heinemann, Fisher, & Gershon, 2006). Metrologically-based common product definitions will finally make it possible for quality improvement experts to implement analogues of the Toyota Production System in healthcare, long presented as a model but never approached in practice (Coye, 2001).
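
For readers unfamiliar with the model invoked here, a minimal sketch of the dichotomous Rasch model, with hypothetical numbers, may help: the probability of a success depends only on the difference between the person’s measure and the item’s calibration, and it is this separability of person and item parameters that underwrites the invariant comparisons referred to above.

import math

def p_success(b, d):
    # b: person measure, d: item calibration, both in logits
    return 1 / (1 + math.exp(-(b - d)))

for b, d in [(1.0, -1.0), (1.0, 1.0), (1.0, 3.0)]:
    print(f"person {b:+.1f}, item {d:+.1f}: P(success) = {p_success(b, d):.2f}")

Because only the difference b − d enters the model, comparisons between persons do not depend on which items happen to be used, and comparisons between items do not depend on which persons respond.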

So, what does all of this add up to? A new division for human, social, and natural capital in NIST is in order, with extensive involvement from NIH, CMS, AHRQ, and other relevant agencies. Innovative measurement methods and standards are the “out of the box” science NG refers to. Providing these tools is the definitive embodiment of an appropriate role for government. These are the kinds of things that we could have a productive conversation with NG about, it seems to me….

References

Andrich, D. (2004, January). Controversy and the Rasch model: A characteristic of incompatible paradigms? Medical Care, 42(1), I-7–I-16.

Barzel, Y. (1982). Measurement costs and the organization of markets. Journal of Law and Economics, 25, 27-48.

Berwick, D. M., James, B., & Coye, M. J. (2003, January). Connections between quality measurement and improvement. Medical Care, 41(1 (Suppl)), I30-38.

Centers for Medicare and Medicaid Services. (2011). National health expenditure data: NHE fact sheet. Retrieved 30 June 2011, from https://www.cms.gov/NationalHealthExpendData/25_NHE_Fact_Sheet.asp.

Coye, M. J. (2001, November/December). No Toyotas in health care: Why medical care has not evolved to meet patients’ needs. Health Affairs, 20(6), 44-56.

De Soto, H. (1989). The other path: The economic answer to terrorism. New York: Basic Books.

De Soto, H. (2000). The mystery of capital: Why capitalism triumphs in the West and fails everywhere else. New York: Basic Books.

Fisher, W. P., Jr. (2010). Statistics and measurement: Clarifying the differences. Rasch Measurement Transactions, 23(4), 1229-1230 [http://www.rasch.org/rmt/rmt234.pdf].

Fisher, W. P., Jr., & Burton, E. (2010). Embedding measurement within existing computerized data systems: Scaling clinical laboratory and medical records heart failure data to predict ICU admission. Journal of Applied Measurement, 11(2), 271-287.

Fryback, D. (1993). QALYs, HYEs, and the loss of innocence. Medical Decision Making, 13(4), 271-2.

Gingrich, N. (2003). Saving lives and saving money: Transforming health and healthcare. Washington, DC: The Alexis de Tocqueville Institution.

Goldberg, S. H. (2009). Billions of drops in millions of buckets: Why philanthropy doesn’t advance social progress. New York: Wiley.

Heinemann, A. W., Fisher, W. P., Jr., & Gershon, R. (2006). Improving health care quality with outcomes management. Journal of Prosthetics and Orthotics, 18(1), 46-50 [http://www.oandp.org/jpo/library/2006_01S_046.asp].

Kindig, D. A. (1997). Purchasing population health. Ann Arbor, Michigan: University of Michigan Press.

Kindig, D. A. (1999). Purchasing population health: Aligning financial incentives to improve health outcomes. Nursing Outlook, 47, 15-22.

Latour, B. (1987). Science in action: How to follow scientists and engineers through society. New York: Cambridge University Press.

Latour, B. (2005). Reassembling the social: An introduction to Actor-Network-Theory. (Clarendon Lectures in Management Studies). Oxford, England: Oxford University Press.

Magnus, P. D. (2007). Distributed cognition and the task of science. Social Studies of Science, 37(2), 297-310.

Marty, M. (2001). Why the talk of spirituality today? Some partial answers. Second Opinion, 6, 53-64.

Marty, M., & Appleby, R. S. (Eds.). (1993). Fundamentalisms and society: Reclaiming the sciences, the family, and education. The fundamentalisms project, vol. 2. Chicago: University of Chicago Press.

National Institute of Standards and Technology. (1996). Appendix C: Assessment examples. Economic impacts of research in metrology. In Committee on Fundamental Science, Subcommittee on Research (Ed.), Assessing fundamental science: A report from the Subcommittee on Research, Committee on Fundamental Science. Washington, DC: National Science and Technology Council [http://www.nsf.gov/statistics/ostp/assess/nstcafsk.htm#Topic%207; last accessed 30 June 2011].


A Second Simple Example of Measurement’s Role in Reducing Transaction Costs, Enhancing Market Efficiency, and Enabling the Pricing of Intangible Assets

March 9, 2011

The prior post here showed why we should not confuse counts of things with measures of amounts, though counts are the natural starting place to begin constructing measures. That first simple example focused on an analogy between counting oranges and measuring the weight of oranges, versus counting correct answers on tests and measuring amounts of ability. This second example extends the first by showing, in effect, what happens when we want to aggregate value not just across different counts of some one thing but across different counts of different things. The point is to show how the relative values of apples, oranges, grapes, and bananas can be put into a common frame of reference and compared in a practical and convenient way.

For instance, you may go into a grocery store to buy raspberries and blackberries, and I go in to buy cantaloupe and watermelon. Your cost per individual fruit will be very low, and mine will be very high, but neither of us will find this annoying, confusing, or inconvenient because your fruits are very small, and mine, very large. Conversely, your cost per kilogram will be much higher than mine, but this won’t cause either of us any distress because we both recognize the differences in the labor, handling, nutritional, and culinary value of our purchases.

But what happens when we try to purchase something as complex as a unit of socioeconomic development? The eight UN Millennium Development Goals (MDGs) represent a start at a systematic effort to bring human, social, and natural capital into the same economic and accountability framework as liquid and manufactured capital and property. But that effort is stymied by the inefficiency and cost of making and using measures of the goals achieved. The existing MDG databases (http://data.un.org/Browse.aspx?d=MDG) and summary reports present overwhelming masses of numbers. Individual indicators are presented for each year, each country, each region, and each program, goal by goal, target by target, indicator by indicator, and series by series, in an indigestible volume of data.

Though there are no doubt complex mathematical methods by which a philanthropic, governmental, or NGO investor might determine how much development is gained per million dollars invested, the cost of obtaining impact measures is so high that most funding decisions are made with little information concerning expected returns (Goldberg, 2009). Further, the percentages of various needs met by leading social enterprises typically range from 0.07% to 3.30%, and needs are growing, not diminishing. Progress at current rates means that it would take thousands of years to solve today’s problems of human suffering, social disparity, and environmental quality. The inefficiency of human, social, and natural capital markets is so overwhelming that there is little hope for significant improvements without the introduction of fundamental infrastructural supports, such as an Intangible Assets Metric System.

A basic question that needs to be asked of the MDG system is, how can anyone make any sense out of so much data? Most of the indicators are evaluated in terms of counts of the number of times something happens, the number of people affected, or the number of things observed to be present. These counts are usually then divided by the maximum possible (the count of the total population) and are expressed as percentages or rates.

As previously explained in various posts in this blog, counts and percentages are not measures in any meaningful sense. They are notoriously difficult to interpret, since the quantitative meaning of any given unit difference varies depending on the size of what is counted, or where the percentage falls in the 0-100 continuum. And because counts and percentages are interpreted one at a time, it is very difficult to know if and when any number included in the sheer mass of data is reasonable, all else considered, or if it is inconsistent with other available facts.
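
A minimal sketch of the second point, about percentages: converted to the interval (logit) scale that differences of amount require, the same ten-point percentage gap spans very different distances depending on where it falls.

import math

def logit(p):
    return math.log(p / (1 - p))

for lo, hi in [(0.45, 0.55), (0.85, 0.95)]:
    print(f"{lo:.0%} -> {hi:.0%}: {logit(hi) - logit(lo):.2f} logits")

Both gaps read as “ten points,” but the one between 85% and 95% covers three times the distance of the one between 45% and 55%.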

A study of the MDG data must focus on these three potential areas of data quality improvement: consistency evaluation, volume reduction, and interpretability. Each builds on the others. With consistent data lending themselves to summarization in sufficient statistics, data volume can be drastically reduced with no loss of information (Andersen, 1977, 1999; Wright, 1977, 1997), data quality can be readily assessed in terms of sufficiency violations (Smith, 2000; Smith & Plackner, 2009), and quantitative measures can be made interpretable in terms of a calibrated ruler’s repeatedly reproducible hierarchy of indicators (Bond & Fox, 2007; Masters, Lokan, & Doig, 1994).

The primary data quality criteria are qualitative relevance and meaningfulness, on the one hand, and mathematical rigor, on the other. The point here is one of following through on the maxim that we manage what we measure, with the goal of measuring in such a way that management is better focused on the program mission and not distracted by accounting irrelevancies.

Method

As written and deployed, each of the MDG indicators has the face and content validity of providing information on each respective substantive area of interest. But, as has been the focus of repeated emphases in this blog, counting something is not the same thing as measuring it.

Counts or rates of literacy or unemployment are not, in and of themselves, measures of development. Their capacity to serve as contributing indications of developmental progress is an empirical question that must be evaluated experimentally against the observable evidence. The measurement of progress toward an overarching developmental goal requires inferences made from a conceptual order of magnitude above and beyond that provided in the individual indicators. The calibration of an instrument for assessing progress toward the realization of the Millennium Development Goals requires, first, a reorganization of the existing data, and then an analysis that tests explicitly the relevant hypotheses as to the potential for quantification, before inferences supporting the comparison of measures can be scientifically supported.

A subset of the MDG data was selected from the MDG database available at http://data.un.org/Browse.aspx?d=MDG, recoded, and analyzed using Winsteps (Linacre, 2011). At least one indicator was selected from each of the eight goals, with 22 in total. All available data from these 22 indicators were recorded for each of 64 countries.

The reorganization of the data is nothing but a way of making the interpretation of the percentages explicit. The meaning of any one country’s percentage or rate of youth unemployment, cell phone users, or literacy has to be kept in context relative to expectations formed from other countries’ experiences. It would be nonsense to interpret any single indicator as good or bad in isolation. Sometimes 30% represents an excellent state of affairs, other times, a terrible one.

Therefore, the distributions of each indicator’s percentages across the 64 countries were divided into ranges and converted to ratings. A lower rating uniformly indicates a status further away from the goal than a higher rating. The ratings were devised by dividing the frequency distribution of each indicator roughly into thirds.

For instance, the youth unemployment rate was found to vary such that the countries furthest from the desired goal had rates of 25% and more (rated 1), and those closest to or exceeding the goal had rates of 0-10% (rated 3), leaving the middle range (10-25%) rated 2. In contrast, percentages of the population that are undernourished were rated 1 for 35% or more, 2 for 15-35%, and 3 for less than 15%.

The divisions into thirds were based only on the investigator’s prior experience with data of this kind. A more thorough approach would begin from a finer-grained rating system, like the one structuring the MDG table at http://mdgs.un.org/unsd/mdg/Resources/Static/Products/Progress2008/MDG_Report_2008_Progress_Chart_En.pdf. This greater detail would make it possible to determine empirically just how many distinctions each indicator can support and contribute to the overall measurement system.
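
A minimal sketch of the recoding step just described, using hypothetical data rather than the actual MDG values: each indicator’s distribution is divided roughly into thirds, and the ratings are oriented so that 1 always marks the status furthest from the goal.

import statistics

def tertile_ratings(values, higher_is_worse):
    low_cut, high_cut = statistics.quantiles(values, n=3)  # two cut points
    ratings = []
    for v in values:
        third = 1 + (v > low_cut) + (v > high_cut)  # 1, 2, or 3 by thirds
        # orient the scale so that rating 1 is furthest from the goal
        ratings.append(4 - third if higher_is_worse else third)
    return ratings

youth_unemployment = [5, 8, 12, 18, 22, 27, 31, 9, 15, 26]  # hypothetical %
print(tertile_ratings(youth_unemployment, higher_is_worse=True))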

Sixty-four of the available 336 data points were selected for their representativeness, with no duplications of values and with a proportionate distribution along the entire continuum of observed values.

Data from the same 64 countries and the same years were then sought for the subsequent indicators. It turned out that the years in which data were available varied across data sets. Data within one or two years of the target year were sometimes substituted for missing data.

The data were analyzed twice, first with each indicator allowed its own rating scale, parameterizing each of the category difficulties separately for each item, and then with the full rating scale model, as the results of the first analysis showed all indicators shared strong consistency in the rating structure.
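
A minimal sketch of the second, shared-structure analysis: in Andrich’s rating scale model, every indicator shares one set of category thresholds, and only the indicator calibrations differ. The measure, calibration, and threshold values below are hypothetical.

import math

def category_probs(b, d, taus):
    # b: country measure, d: indicator calibration, taus: shared thresholds
    nums = [math.exp(sum(b - d - taus[k] for k in range(x)))
            for x in range(len(taus) + 1)]
    total = sum(nums)
    return [n / total for n in nums]  # P(rating 1), P(rating 2), P(rating 3)

print(category_probs(b=0.5, d=0.0, taus=[-1.0, 1.0]))

In the first analysis, each indicator would instead carry its own thresholds (the partial credit parameterization); the strong consistency in the estimated rating structures across indicators is what justified pooling them into the shared form sketched here.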

Results

Data were 65.2% complete. Countries were assessed on an average of 14.3 of the 22 indicators, and each indicator was applied on average to 41.7 of the 64 country cases. Measurement reliability was .89-.90, depending on how measurement error is estimated. Cronbach’s alpha for the by-country scores was .94. Calibration reliability was .93-.95. The rating scale worked well (see Linacre, 2002, for criteria). The data fit the measurement model reasonably well, with satisfactory data consistency, meaning that the hypothesis of a measurable developmental construct was not falsified.

The main result for our purposes here concerns how satisfactory data consistency makes it possible to dramatically reduce data volume and improve data interpretability. The figure below illustrates how. What does it mean for data volume to be drastically reduced with no loss of information? Let’s see exactly how much the data volume is reduced for the ten-item subset shown there.

The horizontal continuum from -100 to 1300 in the figure is the metric, the ruler or yardstick. The number of countries at various locations along that ruler is shown across the bottom of the figure. The mean (M), first standard deviation (S), and second standard deviation (T) are shown beneath the numbers of countries. There are ten countries with a measure of just below 400, just to the left of the mean (M).

The MDG indicators are listed on the right of the figure, with the indicator most often found being achieved relative to the goals at the bottom, and the indicator least often being achieved at the top. The ratings in the middle of the figure increase from 1 to 3 left to right as the probability of goal achievement increases as the measures go from low to high. The position of the ratings in the middle of the figure shifts from left to right as one reads up the list of indicators because the difficulty of achieving the goals is increasing.

Because the ratings of the 64 countries relative to these ten goals are internally consistent, nothing but the developmental level of the country and the developmental challenge of the indicator affects the probability that a given rating will be attained. It is this relation that defines fit to a measurement model, the sufficiency of the summed ratings, and the interpretability of the scores. Given sufficient fit and consistency, any country’s measure implies a given rating on each of the ten indicators.

For instance, imagine a vertical line drawn through the figure at a measure of 500, just above the mean (M). This measure is interpreted relative to the places at which the vertical line crosses the ratings in each row associated with each of the ten items. A measure of 500 is read as implying, within a given range of error, uncertainty, or confidence, a rating of

  • 3 on debt service and female-to-male parity in literacy,
  • 2 or 3 on how much of the population is undernourished and how many children under five years of age are moderately or severely underweight,
  • 2 on infant mortality, the percent of the population aged 15 to 49 with HIV, and the youth unemployment rate,
  • 1 or 2 on the poor’s share of the national income, and
  • 1 on CO2 emissions and the rate of personal computers per 100 inhabitants.

For any one country with a measure of 500 on this scale, ten percentages or rates that appear completely incommensurable and incomparable are found to contribute consistently to a single valued function, developmental goal achievement. Instead of managing each separate indicator as a universe unto itself, this scale makes it possible to manage development itself at its own level of complexity. This ten-to-one ratio of reduced data volume is more than doubled when the total of 22 items included in the scale is taken into account.
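
A minimal sketch of how such implied ratings can be computed: given a country measure and indicator calibrations expressed on the same user-scaled metric, the model yields an expected rating for every indicator. The calibrations, thresholds, and 100-points-per-logit scaling below are hypothetical illustrations, not the study’s actual values.

import math

def expected_rating(measure, calibration, taus, points_per_logit=100.0):
    # convert the user-friendly scale units back to logits
    b = measure / points_per_logit
    d = calibration / points_per_logit
    t = [x / points_per_logit for x in taus]
    nums = [math.exp(sum(b - d - t[k] for k in range(x)))
            for x in range(len(t) + 1)]
    total = sum(nums)
    return 1 + sum(x * n / total for x, n in enumerate(nums))  # ratings 1-3

for name, d in [("DebtServExpInc", 100), ("InfantMortality", 500),
                ("PcsPer100", 900)]:
    print(name, round(expected_rating(500, d, taus=[-100, 100]), 2))

For a measure of 500, the easy debt-service indicator yields an expected rating near 3, infant mortality near 2, and the hard personal-computers indicator near 1, mirroring the pattern in the list above.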

This reduction is conceptually and practically important because it focuses attention on the actual object of management: development. When the individual indicators are the focus of attention, the forest is lost for the trees. Those who disparage the validity of the maxim that you manage what you measure are often discouraged by the feeling of being pulled in too many directions at once. But a measure of the HIV infection rate is not in itself a measure of anything but the HIV infection rate. Interpreting it in terms of broader developmental goals requires evidence that it in fact takes a place in that larger context.

And once a connection with that larger context is established, the consistency of individual data points remains a matter of interest. As the world turns, the order of things may change, but, more likely, data entry errors, temporary data blips, and other factors will alter data quality. Such changes cannot be detected outside of the context defined by an explicit interpretive framework that requires consistent observations.

-100  100     300     500     700     900    1100    1300
|-------+-------+-------+-------+-------+-------+-------|  NUM   INDCTR
1                                 1  :    2    :  3     3    9  PcsPer100
1                         1   :   2    :   3            3    8  CO2Emissions
1                    1  :    2    :   3                 3   10  PoorShareNatInc
1                 1  :    2    :  3                     3   19  YouthUnempRatMF
1              1   :    2   :   3                       3    1  %HIV15-49
1            1   :   2    :   3                         3    7  InfantMortality
1          1  :    2    :  3                            3    4  ChildrenUnder5ModSevUndWgt
1         1   :    2    :  3                            3   12  PopUndernourished
1    1   :    2   :   3                                 3    6  F2MParityLit
1   :    2    :  3                                      3    5  DebtServExpInc
|-------+-------+-------+-------+-------+-------+-------|  NUM   INDCTR
-100  100     300     500     700     900    1100    1300
                   1
       1   1 13445403312323 41 221    2   1   1            COUNTRIES
       T      S       M      S       T

Discussion

A key element in the results obtained here concerns the fact that the data were about 35% missing. Whether or not any given indicator was actually rated for any given country, the measure can still be interpreted as implying the expected rating. This capacity to take missing data into account can be taken advantage of systematically by calibrating a large bank of indicators. With this in hand, it becomes possible to gather only the amount of data needed to make a specific determination, or to adaptively administer the indicators so as to obtain the lowest-error (most reliable) measure at the lowest cost (with the fewest indicators administered). Perhaps most importantly, different collections of indicators can then be equated to measure in the same unit, so that impacts may be compared more efficiently.
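
A minimal sketch of why missing data pose no special problem once a calibrated bank is in hand, using the simpler dichotomous case for brevity: the measure is estimated from whichever items were actually answered, and unanswered items simply drop out of the sums. The bank calibrations and responses are hypothetical.

import math

def estimate_measure(responses, bank, iterations=20):
    # responses: {item: 0 or 1} for answered items only; bank: {item: logits}
    b = 0.0
    for _ in range(iterations):  # Newton-Raphson on the log-likelihood
        probs = {i: 1 / (1 + math.exp(-(b - bank[i]))) for i in responses}
        gradient = sum(x - probs[i] for i, x in responses.items())
        information = sum(p * (1 - p) for p in probs.values())
        b += gradient / information
    return b, 1 / math.sqrt(information)  # measure and its standard error

bank = {"q1": -1.2, "q2": -0.4, "q3": 0.3, "q4": 1.1, "q5": 2.0}
observed = {"q1": 1, "q3": 1, "q5": 0}  # q2 and q4 unanswered: no problem
# (all-correct or all-incorrect records have no finite estimate and
# would need special handling in practice)
print(estimate_measure(observed, bank))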

Instead of an international developmental aid market that is so inefficient as to preclude any expectation of measured returns on investment, setting up a calibrated bank of indicators to which all measures are traceable opens up numerous desirable possibilities. The cost of assessing and interpreting the data informing aid transactions could be reduced to negligible amounts, and the management of the processes and outcomes in which that aid is invested would be made much more efficient by reduced data volume and enhanced information content. Because capital would flow more efficiently to where supply is meeting demand, nonproducers would be cut out of the market, and the effectiveness of the aid provided would be multiplied many times over.

The capacity to harmonize counts of different but related events into a single measurement system presents the possibility that there may be a bright future for outcomes-based budgeting in education, health care, human resource management, environmental management, housing, corrections, social services, philanthropy, and international development. It may seem wildly unrealistic to imagine such a thing, but the return on the investment would be so monumental that not checking it out would be even crazier.

A full report on the MDG data, with the other references cited, is available on my SSRN page at http://papers.ssrn.com/sol3/papers.cfm?abstract_id=1739386.

Goldberg, S. H. (2009). Billions of drops in millions of buckets: Why philanthropy doesn’t advance social progress. New York: Wiley.


Open Letter to the Impact Investment Community

May 4, 2010

It is very encouraging to discover your web sites (GIIN, IRIS, and GIIRS) and to see the work you’re doing in advancing the concept of impact investing. The defining issue of our time is figuring out how to harness the profit motive for socially responsible and environmentally sustainable prosperity. The economic, social, and environmental disasters of today might all have been prevented or significantly mitigated had social and environmental impacts been taken into account in all investing.

My contribution is to point out that, though the profit motive must be harnessed as the engine driving responsible and sustainable business practices, the force of that power is dissipated and negated by the lack of efficient human, social, and natural capital markets. If we cannot make these markets function more like financial markets, so that money naturally flows to those places where it produces the greatest returns, we will never succeed in the fundamental reorientation of the economy toward responsible sustainability. The goal has to be one of tying financial profits to growth in realized human potential, community, and environmental quality, but to do that we need measures of these intangible forms of capital that are as scientifically rigorous as they are eminently practical and convenient.

Better measurement is key to reducing the market frictions that inflate the cost of human, social, and natural capital transactions. A truly revolutionary paradigm shift has occurred in measurement theory and practice over the last fifty years and more. New methods make it possible

* to reduce data volume dramatically with no loss of information,
* to custom tailor measures by selectively adapting indicators to the entity rated, without compromising comparability,
* to remove rater leniency or severity effects from the measures,
* to design optimally efficient measurement systems that provide the level of precision needed to support decision making,
* to establish reference standard metrics that remain universally uniform across variations in local impact assessment indicator configurations, and
* to calibrate instruments that measure in metrics intuitively meaningful to stakeholders and end users.

Unfortunately, almost all the admirable energy and resources being poured into business intelligence measures skip over these “new” developments, defaulting to mistaken assumptions about numbers and the nature of measurement. Typical ratings, checklists, and scores provide units of measurement that

* change size depending on which question is asked, which rating category is assigned, and who or what is rated,
* increase data volume with every new question asked,
* push measures up and down in uncontrolled ways depending on who is judging the performance,
* are of unknown precision, and
* cannot be compared across different composite aggregations of ratings.

I have over 25 years’ experience in the use of advanced measurement and instrument calibration methods, backed up with MA and PhD degrees from the University of Chicago. The methods in which I am trained have been standard practice in educational testing for decades, and over the last 20 years they have become the methods of choice in health care outcomes assessment.

I am passionately committed to putting these methods to work in the domain of impact investing, business intelligence, and ecological economics. As is shown in my attached CV, I have dozens of peer-reviewed publications presenting technical and philosophical research in measurement theory and practice.

In the last few years, I have taken my work in the direction of documenting the ways in which measurement can and should reduce information overload and transaction costs; enhance human, social, and natural capital market efficiencies; provide the instruments embodying common currencies for the exchange of value; and inform a new kind of Genuine Progress Indicator or Happiness Index.

For more information, please see the attached 2009 article I published in Measurement on these topics, and the attached White Paper I produced last July in response to a call from NIST for critical national need ideas. Various entries in my blog (https://livingcapitalmetrics.wordpress.com) elaborate on measurement technicalities, history, and philosophy, as do my web site at http://www.livingcapitalmetrics.com and my profile at http://www.linkedin.com/in/livingcapitalmetrics.

For instance, the blog post at https://livingcapitalmetrics.wordpress.com/2009/11/22/al-gore-will-is-not-the-problem/ explores the idea with which I introduced myself to you here, that the profit motive embodies our collective will for responsible and sustainable business practices, but we hobble ourselves with self-defeating inattention to the ways in which capital is brought to life in efficient markets. We have the solutions to our problems at hand, though there are no panaceas, and the challenges are huge.

Please feel free to contact me at your convenience. Whether we are ultimately able to work together or not, I enthusiastically wish you all possible success in your endeavors.

Sincerely,

William P. Fisher, Jr., Ph.D.
LivingCapitalMetrics.com
919-599-7245

We are what we measure.
It’s time we measured what we want to be.


How bad will the financial crises have to get before…?

April 30, 2010

More and more states and nations around the world face the possibility of defaulting on their financial obligations. The financial crises are of epic historical proportions. This is a disaster of the first order. And yet, it is so odd: we have the solutions and preventative measures we need at our fingertips, but no one knows about them or is looking for them.

So, I am persuaded once again to wonder whether there might now be some real interest in the possibilities of capitalizing on

  • measurement’s well-known capacity for reducing transaction costs by improving information quality and reducing information volume;
  • instruments calibrated to measure in constant units (not ordinal ones) within known error ranges (not as though the measures are perfectly precise) with known data quality;
  • measures made meaningful by their association with invariant scales defined in terms of the questions asked;
  • adaptive instrument administration methods that make all measures equally precise by targeting the questions asked;
  • judge calibration methods that remove the person rating performances as a factor influencing the measures;
  • the metaphor of transparency embodied in calibrated instruments that we can look right through at the thing measured (risk, governance, abilities, health, performance, etc.);
  • efficient markets for human, social, and natural capital by means of the common currencies of uniform metrics, calibrated instrumentation, and metrological networks;
  • the means available for tuning the instruments of the human, social, and environmental sciences to well-tempered scales that enable us to more easily harmonize, orchestrate, arrange, and choreograph relationships;
  • our understandings that universal human rights require universal uniform measures, that fair dealing requires fair measures, and that our measures define who we are and what we value; and, last but very far from least,
  • the power of love–the back and forth of probing questions and honest answers in caring social intercourse plants seminal ideas in fertile minds that can be nurtured to maturity and Socratically midwifed as living meaning born into supportive ecologies of caring relations.

How bad do things have to get before we systematically and collectively implement the long-established and proven methods we have at our disposal? It is the most surreal kind of schizophrenia or passive-aggressive avoidance pathology to keep on tormenting ourselves with problems for which we have solutions.

For more information on these issues, see prior blogs posted here, the extensive documentation provided, and http://www.livingcapitalmetrics.com.


How Evidence-Based Decision Making Suffers in the Absence of Theory and Instrument: The Power of a More Balanced Approach

January 28, 2010

The Basis of Evidence in Theory and Instrument

The ostensible point of basing decisions in evidence is to have reasons for proceeding in one direction versus any other. We want to be able to say why we are proceeding as we are. When we give evidence-based reasons for our decisions, we typically couch them in terms of what worked in past experience. That experience might have been accrued over time in practical applications, or it might have been deliberately arranged in one or more experimental comparisons and tests of concisely stated hypotheses.

At its best, generalizing from past experience to as yet unmet future experiences enables us to navigate life and succeed in ways that would not be possible if we could not learn and had no memories. The application of a lesson learned from particular past events to particular future events involves a very specific inferential process. To be able to recognize repeated iterations of the same things requires the accumulation of patterns of evidence. Experience in observing such patterns allows us to develop confidence in our understanding of what that pattern represents in terms of pleasant or painful consequences. When we are able to conceptualize and articulate a pattern, and then to recognize new occurrences of it, we can properly be said to have an idea of it.

Evidence-based decision making is then a matter of formulating expectations from repeatedly demonstrated and routinely reproducible patterns of observations that lend themselves to conceptual representations, as ideas expressed in words. Linguistic and cultural frameworks selectively focus attention by projecting expectations and filtering observations into meaningful patterns represented by words, numbers, and other symbols. The point of efforts aimed at basing decisions in evidence is to try to go with the flow of this inferential process more deliberately and effectively than might otherwise be the case.

None of this is new or controversial. However, the inferential step from evidence to decision always involves unexamined and unjustified assumptions. That is, there is always an element of metaphysical faith behind the expectation that any given symbol or word is going to work as a representation of something in the same way that it has in the past. We can never completely eliminate this leap of faith, since we cannot predict the future with 100% confidence. We can, however, do a lot to reduce the size of the leap, and the risks that go with it, by questioning our assumptions in experimental research that tests hypotheses as to the invariant stability and predictive utility of the representations we make.

Theoretical and Instrumental Assumptions Hidden Behind the Evidence

For instance, evidence as to the effectiveness of an intervention or treatment is often expressed in terms of measures commonly described as quantitative. But it is unusual for any evidence to be produced justifying that description in terms of something that really adds up in the way numbers do. So we often find ourselves in situations in which our evidence is much less meaningful, reliable, and valid than we suppose it to be.

Quantitative measures are often valued as the hallmark of rational science. But their capacity to live up to this billing depends on the quality of the inferences that can be supported. Very few researchers thoroughly investigate the quality of their measures and justify the inferences they make relative to that quality.

Measurement presumes a reproducible pattern of evidence that can serve as the basis for a decision concerning how much of something has been observed. It naturally follows that we often base measurement in counts of some kind—successes, failures, ratings, frequencies, etc. The counts, scores, or sums are then often transformed into percentages by dividing them into the maximum possible that could be obtained. Sometimes the scores are averaged for each person measured, and/or for each item or question on the test, assessment, or survey. These scores and percentages are then almost universally fed directly into decision processes or statistical analyses with no further consideration.

The reproducible pattern of evidence on which decisions are based is presumed to exist between the measures, not within them. In other words, the focus is on the group or population statistics, not on the individual measures. Attention is typically focused on the tip of the iceberg, the score or percentage, not on the much larger, but hidden, mass of information beneath it. Evidence is presumed sufficient to the task when the differences between groups of scores are of a consistent size or magnitude. But is that presumption justified?

Going Past Assumptions to Testable Hypotheses

In other words, does not science require that evidence be explained by theory, and embodied in instrumentation that provides a shared medium of observation? As shown in the blue lines in the Figure below,

  • theory, whether or not it is explicitly articulated, inevitably influences both what counts as valid data and the configuration of the medium of its representation, the instrument;
  • data, whether or not it is systematically gathered and evaluated, inevitably influences both the medium of its representation, the instrument, and the implicit or explicit theory that explains its properties and justifies its applications; and
  • instruments, whether or not they are actually calibrated from a mapping of symbols and substantive amounts, inevitably influence data gathering and the image of the object explained by theory.

The rhetoric of evidence-based decision making skips over the roles of theory and instrumentation, drawing a direct line from data to decision. In leaving theory laxly formulated, we allow any story that makes a bit of sense and is communicated by someone with a bit of charm or power to carry the day. In not requiring calibrated instrumentation, we allow any data that cross the threshold into our awareness to serve as an acceptable basis for decisions.

What we want, however, is to require meaningful measures that really provide the evidence needed for instruments that exhibit invariant calibrations and for theories that provide predictive explanatory control over the variable. As shown in the Figure, we want data that push theory away from the instrument, theory that separates the data and instrument, and instruments that get in between the theory and data.

We all know to distrust too close a correspondence between theory and data, but we too rarely understand or capitalize on the role of the instrument in mediating the theory-data relation. Similarly, when the questions used as a medium for making observations are obviously biased to produce responses conforming overly closely with a predetermined result, we see that the theory and the instrument are too close for the data to serve as an effective mediator.

Finally, the situation predominating in the social sciences is one in which both construct and measurement theories are nearly nonexistent, which leaves data completely dependent on the instrument it came from. In other words, because counts of correct answers or sums of ratings are mistakenly treated as measures, instruments fully determine and restrict the range of measurement to that defined by the numbers of items and rating categories. Once the instrument is put in play, changes to it would make new data incommensurable with old, so, to retain at least the appearance of comparability, the data structure then fully determines and restricts the instrument.

What we want, though, is a situation in which construct and measurement theories work together to make the data autonomous of the particular instrument it came from. We want a theory that explains what is measured well enough for us to be able to modify existing instruments, or create entirely new ones, that give the same measures for the same amounts as the old instruments. We want to be able to predict item calibrations from the properties of the items, we want to obtain the same item calibrations across data sets, and we want to be able to predict measures on the basis of the observed responses (data) no matter which items or instrument was used to produce them.
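
A minimal sketch of what testing such predictions can look like in practice: regress observed item calibrations on a theory-based difficulty index and see how much of their variation the theory explains. The index and calibration values below are hypothetical illustrations in the spirit of this kind of construct-theory work, not results from any actual study.

import statistics

theory_index = [1.2, 2.5, 3.1, 4.0, 5.2, 6.3]  # theory-based difficulty index
observed_cal = [-1.5, -0.6, -0.1, 0.4, 1.1, 1.8]  # observed logit calibrations

fit = statistics.linear_regression(theory_index, observed_cal)
r = statistics.correlation(theory_index, observed_cal)
print(f"slope {fit.slope:.2f}, intercept {fit.intercept:.2f}, r = {r:.2f}")

High correlations of this kind, reproduced across independent data sets and instruments, are the evidence that the data have become autonomous of any particular instrument.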

Most importantly, we want a theory and practice of measurement that allows us to take missing data into account by providing us with the structural invariances we need as media for predicting the future from the past. As Ben Wright (1997, p. 34) said, any data analysis method that requires complete data to produce results disqualifies itself automatically as a viable basis for inference because we never have complete data—any practical system of measurement has to be positioned so as to be ready to receive, process, and incorporate all of the data we have yet to gather. This goal is accomplished to varying degrees in Rasch measurement (Rasch, 1960; Burdick, Stone, & Stenner, 2006; Dawson, 2004). Stenner and colleagues (Stenner, Burdick, Sanford, & Burdick, 2006) provide a trajectory of increasing degrees to which predictive theory is employed in contemporary measurement practice.
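To make this concrete, here is a minimal sketch of person measurement under the dichotomous Rasch model, written in Python for illustration. It is my own toy example, not code from any of the works cited: the item difficulties and responses are invented, the helper name person_measure is mine, and the item calibrations are assumed to be already in hand. The point is that the estimator uses only the responses actually observed, so any subset of a calibrated item bank, with any pattern of missing data, yields a measure on the same logit scale.

```python
import numpy as np

def person_measure(responses, difficulties, tol=1e-8):
    """Maximum-likelihood person measure (in logits) under the dichotomous
    Rasch model, given known item difficulties. Responses are coded 1/0;
    np.nan marks items not administered, which are simply skipped. Requires
    a non-extreme score (not all 0s or all 1s) on the observed items."""
    mask = ~np.isnan(responses)
    x, d = responses[mask], difficulties[mask]
    theta = 0.0
    for _ in range(100):                        # Newton-Raphson iterations
        p = 1 / (1 + np.exp(-(theta - d)))      # P(success) on each observed item
        step = (x.sum() - p.sum()) / (p * (1 - p)).sum()
        theta += step
        if abs(step) < tol:
            break
    return theta

# A 10-item bank calibrated from -2 to +2 logits, and one person's responses.
d = np.linspace(-2, 2, 10)
x = np.array([1, 1, 1, 1, 1, 1, 0, 0, 0, 0], dtype=float)

# Two different "instruments" drawn from the same bank; items not
# administered are recorded as missing, not as wrong.
form_a, form_b = x.copy(), x.copy()
form_a[1::2] = np.nan    # form A administers only the even-numbered items
form_b[0::2] = np.nan    # form B administers only the odd-numbered items

print(person_measure(form_a, d))   # measure from form A
print(person_measure(form_b, d))   # measure from form B, same logit scale
```

With real data, the agreement of two such estimates within their standard errors is precisely the kind of invariance hypothesis that, on the argument made here, should be tested rather than assumed.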

The explanatory and predictive power of theory is embodied in instruments that focus attention on recording observations of salient phenomena. These observations become data that inform the calibration of instruments, which then are used to gather further data that can be used in practical applications and in checks on the calibrations and the theory.

“Nothing is so practical as a good theory” (Lewin, 1951, p. 169). Good theory makes it possible to create symbolic representations of things that are easy to think with. To facilitate clear thinking, our words, numbers, and instruments must be transparent. We have to be able to look right through them at the thing itself, with no concern as to distortions introduced by the instrument, the sample, the observer, the time, the place, etc. This happens only when the structure of the instrument corresponds with invariant features of the world. And where words effect this transparency to an extent, it is realized most completely when we can measure in ways that repeatedly give the same results for the same amounts in the same conditions no matter which instrument, sample, operator, etc. is involved.

Where Might Full Mathematization Lead?

The attainment of mathematical transparency in measurement is remarkable for the way it focuses attention and constrains the imagination. It is essential to appreciate the context in which this focusing occurs, as popular opinion is at odds with historical research in this regard. Over the last 60 years, historians of science have come to vigorously challenge the widespread assumption that technology is a product of experimentation and/or theory (Kuhn, 1961/1977; Latour, 1987, 2005; Maas, 2001; Mendelsohn, 1992; Rabkin, 1992; Schaffer, 1992; Heilbron, 1993; Hankins & Silverman, 1999; Baird, 2002). Neither theory nor experiment typically advances until a key technology is widely available to end users in applied and/or research contexts. Rabkin (1992) documents multiple roles played by instruments in the professionalization of scientific fields. Thus, “it is not just a clever historical aphorism, but a general truth, that ‘thermodynamics owes much more to the steam engine than ever the steam engine owed to thermodynamics’” (Price, 1986, p. 240).

The prior existence of the relevant technology comes to bear on theory and experiment again in the common, but mistaken, assumption that measures are made and experimentally compared in order to discover scientific laws. History shows that measures are rarely made until the relevant law is effectively embodied in an instrument (Kuhn, 1961/1977, pp. 218-9): “…historically the arrow of causality is largely from the technology to the science” (Price, 1986, p. 240). Instruments do not just provide measures; rather, they produce the phenomenon itself in a way that can be controlled, varied, played with, and learned from (Heilbron, 1993, p. 3; Hankins & Silverman, 1999; Rabkin, 1992). The term “technoscience” has emerged as an expression denoting recognition of this priority of the instrument (Baird, 1997; Ihde & Selinger, 2003; Latour, 1987).

Because technology often dictates what, if any, phenomena can be consistently produced, it constrains experimentation and theorizing by focusing attention selectively on reproducible, potentially interpretable effects, even when those effects are not well understood (Ackermann, 1985; Daston & Galison, 1992; Ihde, 1998; Hankins & Silverman, 1999; Maasen & Weingart, 2001). Criteria for theory choice in this context stem from competing explanatory frameworks’ experimental capacities to facilitate instrument improvements, prediction of experimental results, and gains in the efficiency with which a phenomenon is produced.

In this context, the relatively recent introduction of measurement models requiring additive, invariant parameterizations (Rasch, 1960) provokes speculation as to the effect on the human sciences that might be wrought by the widespread availability of consistently reproducible effects expressed in common quantitative languages. Paraphrasing Price’s comment on steam engines and thermodynamics, might it one day be said that as yet unforeseeable advances in reading theory will owe far more to the Lexile analyzer (Stenner, et al., 2006) than ever the Lexile analyzer owed to reading theory?

Kuhn (1961/1977) speculated that the second scientific revolution of the early- to mid-nineteenth century followed in large part from the full mathematization of physics, i.e., the emergence of metrology as a professional discipline focused on providing universally accessible, theoretically predictable, and evidence-supported uniform units of measurement (Roche, 1998). Kuhn (1961/1977, p. 220) specifically suggests that a number of vitally important developments converged about 1840 (also see Hacking, 1983, p. 234). This was the year in which the metric system was formally instituted in France after 50 years of development (it had already been obligatory in other nations for 20 years at that point), and metrology emerged as a professional discipline (Alder, 2002, pp. 328, 330; Heilbron, 1993, p. 274; Kula, 1986, p. 263). Daston (1992) independently suggests that the concept of objectivity came of age in the period from 1821 to 1856, and gives examples illustrating the way in which the emergence of strong theory, shared metric standards, and experimental data converged in a context of particular social mores to winnow out unsubstantiated and unsupportable ideas and contentions.

Might a similar revolution and new advances in the human sciences follow from the introduction of evidence-based, theoretically predictive, instrumentally mediated, and mathematical uniform measures? We won’t know until we try.

Figure. The Dialectical Interactions and Mutual Mediations of Theory, Data, and Instruments

Acknowledgment. These ideas have been drawn in part from long consideration of many works in the history and philosophy of science, primarily Ackermann (1985), Ihde (1991), and various works of Martin Heidegger, as well as key works in measurement theory and practice. A few obvious points of departure are listed in the references.

References

Ackermann, J. R. (1985). Data, instruments, and theory: A dialectical approach to understanding science. Princeton, New Jersey: Princeton University Press.

Alder, K. (2002). The measure of all things: The seven-year odyssey and hidden error that transformed the world. New York: The Free Press.

Aldrich, J. (1989). Autonomy. Oxford Economic Papers, 41, 15-34.

Andrich, D. (2004, January). Controversy and the Rasch model: A characteristic of incompatible paradigms? Medical Care, 42(1), I-7–I-16.

Baird, D. (1997, Spring-Summer). Scientific instrument making, epistemology, and the conflict between gift and commodity economics. Techné: Journal of the Society for Philosophy and Technology, 3-4, 25-46. Retrieved August 28, 2009, from http://scholar.lib.vt.edu/ejournals/SPT/v2n3n4/baird.html.

Baird, D. (2002, Winter). Thing knowledge – function and truth. Techné: Journal of the Society for Philosophy and Technology, 6(2). Retrieved August 19, 2003, from http://scholar.lib.vt.edu/ejournals/SPT/v6n2/baird.html.

Burdick, D. S., Stone, M. H., & Stenner, A. J. (2006). The Combined Gas Law and a Rasch Reading Law. Rasch Measurement Transactions, 20(2), 1059-60 [http://www.rasch.org/rmt/rmt202.pdf].

Carroll-Burke, P. (2001). Tools, instruments and engines: Getting a handle on the specificity of engine science. Social Studies of Science, 31(4), 593-625.

Daston, L. (1992). Baconian facts, academic civility, and the prehistory of objectivity. Annals of Scholarship, 8, 337-363. (Rpt. in L. Daston, (Ed.). (1994). Rethinking objectivity (pp. 37-64). Durham, North Carolina: Duke University Press.)

Daston, L., & Galison, P. (1992, Fall). The image of objectivity. Representations, 40, 81-128.

Dawson, T. L. (2004, April). Assessing intellectual development: Three approaches, one sequence. Journal of Adult Development, 11(2), 71-85.

Galison, P. (1999). Trading zone: Coordinating action and belief. In M. Biagioli (Ed.), The science studies reader (pp. 137-160). New York, New York: Routledge.

Hacking, I. (1983). Representing and intervening: Introductory topics in the philosophy of natural science. Cambridge: Cambridge University Press.

Hankins, T. L., & Silverman, R. J. (1999). Instruments and the imagination. Princeton, New Jersey: Princeton University Press.

Heelan, P. A. (1983, June). Natural science as a hermeneutic of instrumentation. Philosophy of Science, 50, 181-204.

Heelan, P. A. (1998, June). The scope of hermeneutics in natural science. Studies in History and Philosophy of Science Part A, 29(2), 273-98.

Heidegger, M. (1977). Modern science, metaphysics, and mathematics. In D. F. Krell (Ed.), Basic writings [reprinted from M. Heidegger, What is a thing? South Bend, Regnery, 1967, pp. 66-108] (pp. 243-282). New York: Harper & Row.

Heidegger, M. (1977). The question concerning technology. In D. F. Krell (Ed.), Basic writings (pp. 283-317). New York: Harper & Row.

Heilbron, J. L. (1993). Weighing imponderables and other quantitative science around 1800. Historical Studies in the Physical and Biological Sciences, 24(Supplement), Part I, pp. 1-337.

Hessenbruch, A. (2000). Calibration and work in the X-ray economy, 1896-1928. Social Studies of Science, 30(3), 397-420.

Ihde, D. (1983). The historical and ontological priority of technology over science. In D. Ihde, Existential technics (pp. 25-46). Albany, New York: State University of New York Press.

Ihde, D. (1991). Instrumental realism: The interface between philosophy of science and philosophy of technology. (The Indiana Series in the Philosophy of Technology). Bloomington, Indiana: Indiana University Press.

Ihde, D. (1998). Expanding hermeneutics: Visualism in science. (Northwestern University Studies in Phenomenology and Existential Philosophy). Evanston, Illinois: Northwestern University Press.

Ihde, D., & Selinger, E. (Eds.). (2003). Chasing technoscience: Matrix for materiality. (Indiana Series in Philosophy of Technology). Bloomington, Indiana: Indiana University Press.

Kuhn, T. S. (1961/1977). The function of measurement in modern physical science. Isis, 52(168), 161-193. (Rpt. in T. S. Kuhn, The essential tension: Selected studies in scientific tradition and change (pp. 178-224). Chicago: University of Chicago Press, 1977).

Kula, W. (1986). Measures and men (R. Screter, Trans.). Princeton, New Jersey: Princeton University Press (Original work published 1970).

Lapre, M. A., & Van Wassenhove, L. N. (2002, October). Learning across lines: The secret to more efficient factories. Harvard Business Review, 80(10), 107-11.

Latour, B. (1987). Science in action: How to follow scientists and engineers through society. New York, New York: Cambridge University Press.

Latour, B. (2005). Reassembling the social: An introduction to Actor-Network-Theory. (Clarendon Lectures in Management Studies). Oxford, England: Oxford University Press.

Lewin, K. (1951). Field theory in social science: Selected theoretical papers (D. Cartwright, Ed.). New York: Harper & Row.

Maas, H. (2001). An instrument can make a science: Jevons’s balancing acts in economics. In M. S. Morgan & J. Klein (Eds.), The age of economic measurement (pp. 277-302). Durham, North Carolina: Duke University Press.

Maasen, S., & Weingart, P. (2001). Metaphors and the dynamics of knowledge. (Vol. 26. Routledge Studies in Social and Political Thought). London: Routledge.

Mendelsohn, E. (1992). The social locus of scientific instruments. In R. Bud & S. E. Cozzens (Eds.), Invisible connections: Instruments, institutions, and science (pp. 5-22). Bellingham, WA: SPIE Optical Engineering Press.

Polanyi, M. (1964/1946). Science, faith and society. Chicago: University of Chicago Press.

Price, D. J. d. S. (1986). Of sealing wax and string. In Little Science, Big Science–and Beyond (pp. 237-253). New York, New York: Columbia University Press.

Rabkin, Y. M. (1992). Rediscovering the instrument: Research, industry, and education. In R. Bud & S. E. Cozzens (Eds.), Invisible connections: Instruments, institutions, and science (pp. 57-82). Bellingham, Washington: SPIE Optical Engineering Press.

Rasch, G. (1960). Probabilistic models for some intelligence and attainment tests (Reprint, with Foreword and Afterword by B. D. Wright, Chicago: University of Chicago Press, 1980). Copenhagen, Denmark: Danmarks Paedagogiske Institut.

Roche, J. (1998). The mathematics of measurement: A critical history. London: The Athlone Press.

Schaffer, S. (1992). Late Victorian metrology and its instrumentation: A manufactory of Ohms. In R. Bud & S. E. Cozzens (Eds.), Invisible connections: Instruments, institutions, and science (pp. 23-56). Bellingham, WA: SPIE Optical Engineering Press.

Stenner, A. J., Burdick, H., Sanford, E. E., & Burdick, D. S. (2006). How accurate are Lexile text measures? Journal of Applied Measurement, 7(3), 307-22.

Thurstone, L. L. (1959). The measurement of values. Chicago: University of Chicago Press, Midway Reprint Series.

Wright, B. D. (1997, Winter). A history of social science measurement. Educational Measurement: Issues and Practice, 16(4), 33-45, 52 [http://www.rasch.org/memo62.htm].

Creative Commons License
LivingCapitalMetrics Blog by William P. Fisher, Jr., Ph.D. is licensed under a Creative Commons Attribution-Noncommercial-No Derivative Works 3.0 United States License.
Based on a work at livingcapitalmetrics.wordpress.com.
Permissions beyond the scope of this license may be available at http://www.livingcapitalmetrics.com.

Questions about measurement: If it is so important, why…?

January 28, 2010

If measurement is so important, why is measurement quality so uniformly low?

If we manage what we measure, why is measurement leadership virtually nonexistent?

If we can’t tell if things are getting better, staying the same, or getting worse without good metrics, why is measurement so rarely context-sensitive, focused, integrated, and interactive, as Dean Spitzer recommends it should be?

If quantification is valued for its rigor and convenience, why is no one demanding meaningful mappings of substantive, additive amounts of things measured on number lines?

If everyone is drowning in unmanageable floods of data, why isn’t measurement used to reduce data volumes dramatically, and not only with no loss of information but with the addition of otherwise unavailable forms of information?

If learning and improvement are the order of the day, why isn’t anyone interested in the organizational and individual learning trajectories that are defined by hierarchies of calibrated items?

If resilient lean thinking is the way to go, why aren’t more measures constructed to retain their meaning and values across changes in item content?

If flexibility is a core value, why aren’t we adapting instruments to people and organizations, instead of vice versa?

If fair, just, and meaningful measurement is often lacking in judge-assigned performance assessments, why isn’t anyone estimating the consistency, and the leniency or harshness, of ratings—and removing those effects from the measures made?

If efficiency is valued, why does no one at all seem to care about adjusting measurement precision to the needs of the task at hand, so that time and resources are not wasted in gathering too much or too little data?

If it’s common knowledge that we can do more together than we can as individuals, why isn’t anyone providing the high quality and uniform information needed for the networked collective thinking that is able to keep pace with the demand for innovation?

Since the metric system and uniform product standards are widely recognized as essential to science and commerce, why are longstanding capacities for common metrics for human, social, and natural capital not being used?

If efficient markets are such great things, why isn’t anyone at all concerned about lubricating the flow of human, social, and natural capital by investing in the highest quality measurement obtainable?

If everyone loves a good profit, why aren’t we setting up human, social, and natural capital metric systems to inform competitive pricing of intangible assets, products, and services?

If companies are supposed to be organic entities that mature in a manner akin to human development over the lifespan, why is so little being done to conceive, gestate, midwife, and nurture living capital?

In short, if measurement is really as essential to management as it is so often said to be, why doesn’t anyone seek out the state of the art technology, methods, and experts before going to the trouble of developing and implementing metrics?

I suspect the answers to these questions are all the same. These disconnects between word and deed happen because so few people are aware of the technical advances made in measurement theory and practice over the last several decades.

For the deep background, see previous entries in this blog, various web sites (www.rasch.org, www.rummlab.com, www.winsteps.com, http://bearcenter.berkeley.edu/, etc.), and an extensive body of published work (Rasch, 1960; Wright, 1977, 1997a, 1997b, 1999a, 1999b; Andrich, 1988, 2004, 2005; Bond & Fox, 2007; Fisher, 2009, 2010; Smith & Smith, 2004; Wilson, 2005; Wright & Stone, 1999, 2004).

There is a wealth of published applied research in education, psychology, and health care (Bezruczko, 2005; Fisher & Wright, 1994; Masters, 2007; Masters & Keeves, 1999). To find more, search on Rasch and the substantive area of interest.

For applications in business contexts, there is a more limited number of published resources (ATP, 2001; Drehmer, Belohlav, & Coye, 2000; Drehmer & Deklava, 2001; Ludlow & Lunz, 1998; Lunz & Linacre, 1998; Mohamed, et al., 2008; Salzberger, 2000; Salzberger & Sinkovics, 2006; Zakaria, et al., 2008). I have, however, just become aware of the November 2009 publication of what could be a landmark business measurement text (Salzberger, 2009). Hopefully, this book will be just one of many to come, and the questions I’ve raised will no longer need to be asked.

References

Andrich, D. (1988). Rasch models for measurement (Sage University Paper Series on Quantitative Applications in the Social Sciences, series no. 07-068). Beverly Hills, California: Sage Publications.

Andrich, D. (2004, January). Controversy and the Rasch model: A characteristic of incompatible paradigms? Medical Care, 42(1), I-7–I-16.

Andrich, D. (2005). Georg Rasch: Mathematician and statistician. In K. Kempf-Leonard (Ed.), Encyclopedia of Social Measurement (Vol. 3, pp. 299-306). Amsterdam: Academic Press, Inc.

Association of Test Publishers. (2001, Fall). Benjamin D. Wright, Ph.D. honored with the Career Achievement Award in Computer-Based Testing. Test Publisher, 8(2). Retrieved 20 May 2009, from http://www.testpublishers.org/newsletter7.htm#Wright.

Bezruczko, N. (Ed.). (2005). Rasch measurement in health sciences. Maple Grove, MN: JAM Press.

Bond, T., & Fox, C. (2007). Applying the Rasch model: Fundamental measurement in the human sciences, 2d edition. Mahwah, New Jersey: Lawrence Erlbaum Associates.

Dawson, T. L., & Gabrielian, S. (2003, June). Developing conceptions of authority and contract across the life-span: Two perspectives. Developmental Review, 23(2), 162-218.

Drehmer, D. E., Belohlav, J. A., & Coye, R. W. (2000, Dec). An exploration of employee participation using a scaling approach. Group & Organization Management, 25(4), 397-418.

Drehmer, D. E., & Deklava, S. M. (2001, April). A note on the evolution of software engineering practices. Journal of Systems and Software, 57(1), 1-7.

Fisher, W. P., Jr. (2009, November). Invariance and traceability for measures of human, social, and natural capital: Theory and application. Measurement (Elsevier), 42(9), 1278-1287.

Fisher, W. P., Jr. (2010). Bringing human, social, and natural capital to life: Practical consequences and opportunities. Journal of Applied Measurement, 11, in press [Pre-press version available at http://www.livingcapitalmetrics.com/images/BringingHSN_FisherARMII.pdf].

Ludlow, L. H., & Lunz, M. E. (1998). The Job Responsibilities Scale: Invariance in a longitudinal prospective study. Journal of Outcome Measurement, 2(4), 326-37.

Lunz, M. E., & Linacre, J. M. (1998). Measurement designs using multifacet Rasch modeling. In G. A. Marcoulides (Ed.), Modern methods for business research. Methodology for business and management (pp. 47-77). Mahwah, New Jersey: Lawrence Erlbaum Associates, Inc.

Masters, G. N. (2007). Special issue: Programme for International Student Assessment (PISA). Journal of Applied Measurement, 8(3), 235-335.

Masters, G. N., & Keeves, J. P. (Eds.). (1999). Advances in measurement in educational research and assessment. New York: Pergamon.

Mohamed, A., Aziz, A., Zakaria, S., & Masodi, M. S. (2008). Appraisal of course learning outcomes using Rasch measurement: A case study in information technology education. In L. Kazovsky, P. Borne, N. Mastorakis, A. Kuri-Morales & I. Sakellaris (Eds.), Proceedings of the 7th WSEAS International Conference on Software Engineering, Parallel and Distributed Systems (Electrical And Computer Engineering Series) (pp. 222-238). Cambridge, UK: WSEAS.

Rasch, G. (1960). Probabilistic models for some intelligence and attainment tests (Reprint, with Foreword and Afterword by B. D. Wright, Chicago: University of Chicago Press, 1980). Copenhagen, Denmark: Danmarks Paedagogiske Institut.

Salzberger, T. (2000). An extended Rasch analysis of the CETSCALE: Implications for scale development and data construction. Department of Marketing, University of Economics and Business Administration, Vienna (WU-Wien) (http://www2.wu-wien.ac.at/marketing/user/salzberger/research/wp_dataconstruction.pdf).

Salzberger, T. (2009). Measurement in marketing research: An alternative framework. Northampton, MA: Edward Elgar.

Salzberger, T., & Sinkovics, R. R. (2006). Reconsidering the problem of data equivalence in international marketing research: Contrasting approaches based on CFA and the Rasch model for measurement. International Marketing Review, 23(4), 390-417.

Smith, E. V., Jr., & Smith, R. M. (2004). Introduction to Rasch measurement. Maple Grove, MN: JAM Press.

Spitzer, D. (2007). Transforming performance measurement: Rethinking the way we measure and drive organizational success. New York: AMACOM.

Wilson, M. (2005). Constructing measures: An item response modeling approach. Mahwah, New Jersey: Lawrence Erlbaum Associates.

Wright, B. D. (1977). Solving measurement problems with the Rasch model. Journal of Educational Measurement, 14(2), 97-116 [http://www.rasch.org/memo42.htm].

Wright, B. D. (1997a, June). Fundamental measurement for outcome evaluation. Physical Medicine & Rehabilitation State of the Art Reviews, 11(2), 261-88.

Wright, B. D. (1997b, Winter). A history of social science measurement. Educational Measurement: Issues and Practice, 16(4), 33-45, 52 [http://www.rasch.org/memo62.htm].

Wright, B. D. (1999a). Fundamental measurement for psychology. In S. E. Embretson & S. L. Hershberger (Eds.), The new rules of measurement: What every educator and psychologist should know (pp. 65-104 [http://www.rasch.org/memo64.htm]). Hillsdale, New Jersey: Lawrence Erlbaum Associates.

Wright, B. D. (1999b). Rasch measurement models. In G. N. Masters & J. P. Keeves (Eds.), Advances in measurement in educational research and assessment (pp. 85-97). New York: Pergamon.

Wright, B. D., & Stone, M. H. (1999). Measurement essentials. Wilmington, DE: Wide Range, Inc. [http://www.rasch.org/memos.htm#measess].

Wright, B. D., & Stone, M. H. (2004). Making measures. Chicago: Phaneron Press.

Zakaria, S., Aziz, A. A., Mohamed, A., Arshad, N. H., Ghulman, H. A., & Masodi, M. S. (2008, November 11-13). Assessment of information managers’ competency using Rasch measurement. ICCIT: Third International Conference on Convergence and Hybrid Information Technology, 1, 190-196 [http://www.computer.org/portal/web/csdl/doi/10.1109/ICCIT.2008.387].


On the alleged difficulty of quantifying this or that

October 5, 2009

That this effect or that phenomenon is “difficult to quantify” is one of those phrases that people use from time to time. But, you know, building a computer is difficult, too. I couldn’t do it, and you probably couldn’t, either. Computers are, however, readily available for purchase and it doesn’t matter if you or I can make our own.

Same thing with measurement. Of course, instrument design and calibration are highly technical endeavors, and despite 80+ years of success, most people seem to think it is impossible to really quantify abstract things like abilities, attitudes, motivations, trust, outcomes and impacts, or maturational development. But real quantification, the kind that is commonly thought possible only for physical things, has been underway in psychology and the social sciences for a long time. More people need to know this.

As anyone who has read much of this blog knows, I’m not talking about some kind of simplistic survey or assessment process that takes measurement to be a mere assignment of numbers to observations. Instrument calibration takes a lot more thought and effort than is usually invested in it. But it isn’t impossible, not by a long shot.

Just as you would not despair of ever having your own computer just because you cannot make one yourself, those who throw up their hands at the supposed difficulty of quantifying something need to think again. Where there’s a will, there’s a way, and scientifically rigorous methods of determining whether something is measurable are a lot more ready to hand than most people realize.

For more information, see my survey design recommendations on pages 1072-1074 at http://www.rasch.org/rmt/rmt203.pdf and Ben Wright’s 15 steps to measurement at http://www.rasch.org/rmt/rmt141g.htm.


Comments on the National Accounts of Well-Being

October 4, 2009

Well-designed measures of human, social, and natural capital captured in genuine progress indicators and properly put to work on the front lines of education, health care, social services, human and environmental resource management, etc. will harness the profit motive as a driver of growth in human potential, community trust, and environmental quality. But it is a tragic shame that so many well-meaning efforts ignore the decisive advantages of readily available measurement methods. For instance, consider the National Accounts of Well-Being (available at http://www.nationalaccountsofwellbeing.org/learn/download-report.html).

This report’s authors admirably say that “Advances in the measurement of well-being mean that now we can reclaim the true purpose of national accounts as initially conceived and shift towards more meaningful measures of progress and policy effectiveness which capture the real wealth of people’s lived experience” (p. 2).

Of course, as is evident in so many of my posts here and in the focus of my scientific publications, I couldn’t agree more!

But look at p. 61, where the authors say “we acknowledge that we need to be careful about interpreting the distribution of transformed scores. The curvilinear transformation results in scores at one end of the distribution being stretched more than those at the other end. This means that standard deviations, for example, of countries with higher scores, are likely to be distorted upwards. As the results section shows, however, this pattern was not in fact found in our data, so it appears that this distortion does not have too much effect. Furthermore, being overly concerned with the distortion would imply absolute faith that the original scales used in the questions are linear. Such faith would be ill-founded. For example, it is not necessarily the case that the difference between ‘all or almost all of the time’ (a response scored as ‘4’ for some questions) and ‘most of the time’ (scored as ‘3’), is the same as the difference between ‘most of the time’ (‘3’) and ‘some of the time’ (‘2’).”

It is simply incredible that the authors so baldly admit their numbers don’t add up, even as they offer those very same numbers in voluminous masses to a global audience that largely takes them at face value. What exactly does it mean to most people “to be careful about interpreting the distribution of transformed scores”?

More to the point, what does it mean that faith in the linearity of the scales is ill-founded? They are doing arithmetic with those scores! That arithmetic cannot be done without assuming a constant difference between each number on the scale! Instead of offering cautions, the creators of anything as visible and important as the National Accounts of Well-Being ought to do the work needed to construct scales that measure in numbers that add up. Instead of saying they don’t know what the size of the unit of measurement is at different places on the ruler, why don’t they formulate a theory of the thing they want to measure, state testable hypotheses as to the constancy and invariance of the measuring unit, and conduct the experiments? It is not, after all, as though we lack a mature measurement science that has been doing this kind of thing for more than 80 years.

By its very nature, the act of adding up ratings into a sum, and dividing by the number of ratings included in that sum to produce an average, demands the assumption of a common unit of measurement. But practical science does not function or advance on the basis of untested assumptions. Different numbers that add up to the same sum have to mean the same thing: 1+3+4=8=2+3+3, etc. So the capacity of the measurement system to support meaningful inferences as to the invariance of the unit has to be established experimentally.
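That requirement has a precise technical form. In the dichotomous Rasch model the raw score is a sufficient statistic: given the total, the probability of any particular response pattern does not depend on the person’s ability at all, which is the formal sense in which different patterns adding to the same sum carry the same information about the measure. The toy check below, my own illustration with invented item difficulties, verifies this numerically.

```python
import numpy as np

def pattern_prob(pattern, theta, d):
    """Probability of a complete response pattern under the dichotomous Rasch model."""
    p = 1 / (1 + np.exp(-(theta - d)))
    return np.prod(np.where(np.asarray(pattern) == 1, p, 1 - p))

d = np.array([-1.0, 0.0, 1.5])                 # three calibrated items
patterns = [(1, 1, 0), (1, 0, 1), (0, 1, 1)]   # every pattern with total = 2

for theta in (-1.0, 0.5, 2.0):                 # three very different abilities
    probs = np.array([pattern_prob(pat, theta, d) for pat in patterns])
    print(theta, np.round(probs / probs.sum(), 4))   # conditional on the total
```

The same conditional distribution prints for every ability level, so the total carries all of the information the responses contain about the person. Whether observed data actually behave this way is exactly what model-fit analysis is for.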

There is no way to do arithmetic and compute statistics on ordinal rating data without assuming a constant, additive unit of measurement. Either unrealistic demands are being made on people’s cognitive abilities to stretch and shrink numeric units, or the value of the numbers as a basis for action is seriously and unnecessarily compromised.
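When the model does hold, the conversion from ordinal raw scores to linear measures turns out to be visibly nonlinear. The sketch below, once more my own illustration with invented difficulties, builds a raw-score-to-measure table for a ten-item test: each one-point raw step corresponds to a different number of logits, with the steps near the extremes spanning noticeably more of the scale than the steps in the middle.

```python
import numpy as np

def measure_for_raw_score(r, d, tol=1e-9):
    """Logit measure at which the expected raw score on items with
    difficulties d equals r (the ML estimate for that raw score)."""
    theta = 0.0
    for _ in range(200):                      # Newton-Raphson iterations
        p = 1 / (1 + np.exp(-(theta - d)))
        step = (r - p.sum()) / (p * (1 - p)).sum()
        theta += step
        if abs(step) < tol:
            break
    return theta

d = np.linspace(-2, 2, 10)   # a 10-item test calibrated in logits
measures = [measure_for_raw_score(r, d) for r in range(1, 10)]
for r, m_here, m_next in zip(range(1, 9), measures, measures[1:]):
    print(f"raw {r} -> raw {r + 1}: {m_next - m_here:.2f} logits")
```

Treating every one-point raw step as the same amount is the unexamined assumption at issue; constructing linear measures removes the need to stretch and shrink units in one’s head.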

A lot can be done to construct linear units of measurement that provide the meaningfulness desired by the developers of the National Accounts of Well-Being.

For explanations and illustrations of why scores and percentages are not measures, see https://livingcapitalmetrics.wordpress.com/2009/07/01/graphic-illustrations-of-why-scores-ratings-and-percentages-are-not-measures-part-one/.

The numerous advantages real measures have over raw ratings are listed at https://livingcapitalmetrics.wordpress.com/2009/07/06/table-comparing-scores-ratings-and-percentages-with-rasch-measures/.

To understand the contrast between dead and living capital as it applies to measures based in ordinal data from tests and rating scales, see http://www.rasch.org/rmt/rmt154j.htm.

For a peer-reviewed scientific paper on the theory and research supporting the viability of a metric system for human, social, and natural capital, see http://dx.doi.org/doi:10.1016/j.measurement.2009.03.014.
