Archive for the ‘Invariance’ Category

The Counterproductive Consequences of Common Study Designs and Statistical Methods

May 21, 2015

Because of the ways studies are designed and the ways data are analyzed, research results in psychology and the social sciences often appear to be nonlinear, sample- and instrument-dependent, and incommensurable, even when they need not be. In contrast with what are common assumptions about the nature of the constructs involved, invariant relations may be more obscured than clarified by typically employed research designs and statistical methods.

To take a particularly salient example, the number of small factors with Eigenvalues greater than 1.0 identified via factor analysis increases as the number of modes in a multi-modal distribution also increases, and the interpretation of results is further complicated by the fact that the number of factors identified decreases as sample size increases (Smith, 1996).

Similarly, variation in employment test validity across settings was established as a basic assumption by the 1970s, after 50 years of studies observing the situational specificity of results. But then Schmidt and Hunter (1977) identified sampling error, measurement error, and range restriction as major sources of what was only the appearance of incommensurable variation in employment test validity. In other words, for most of the 20th century, the identification of constructs and comparisons of results across studies were pointlessly confused by mixed populations, uncontrolled variation in reliability, and unnoted floor and/or ceiling effects. Though they do nothing to establish information systems deploying common languages structured by standard units of measurement (Feinstein, 1995), meta-analysis techniques are a step forward in equating effect sizes (Hunter & Schmidt, 2004).

Wright and Stone’s (1979) Best Test Design, in contrast, takes up each of these problems in an explicit way. Sampling error is addressed in that both the sample’s and the items’ representations of the same populations of persons and expressions of a construct are evaluated. The evaluation of reliability is foregrounded and clarified by taking advantage of the availability of individualized measurement uncertainty (error) estimates (following Andrich, 1982, presented at AERA in 1977). And range restriction becomes manageable in terms of equating and linking instruments measuring in different ranges of the same construct. As was demonstrated by Duncan (1985; Allerup, Bech, Loldrup, et al., 1994; Andrich & Styles, 1998), for instance, the restricted ranges of various studies assessing relationships between measures of attitudes and behaviors led to the mistaken conclusion that these were separate constructs. When the entire range of variation was explicitly modeled and studied, a consistent relationship was found.

Statistical and correlational methods have long histories of preventing the discovery, assessment, and practical application of invariant relations because they fail to test for invariant units of measurement, do not define standard metrics, never calibrate all instruments measuring the same thing in common units, and have no concept of formal measurement systems of interconnected instruments. Wider appreciation of the distinction between statistics and measurement (Duncan & Stenbeck, 1988; Fisher, 2010; Wilson, 2013a), and of the potential for metrological traceability we have within our reach (Fisher, 2009, 2012; Fisher & Stenner, 2013; Mari & Wilson, 2013; Pendrill, 2014; Pendrill & Fisher, 2015; Wilson, 2013b; Wilson, Mari, Maul, & Torres Irribarra, 2015), are demonstrably fundamental to the advancement of a wide range of fields.

References

Allerup, P., Bech, P., Loldrup, D., Alvarez, P., Banegil, T., Styles, I., & Tenenbaum, G. (1994). Psychiatric, business, and psychological applications of fundamental measurement models. International Journal of Educational Research, 21(6), 611-622.

Andrich, D. (1982). An index of person separation in Latent Trait Theory, the traditional KR-20 index, and the Guttman scale response pattern. Education Research and Perspectives, 9(1), 95-104 [http://www.rasch.org/erp7.htm].

Andrich, D., & Styles, I. M. (1998). The structural relationship between attitude and behavior statements from the unfolding perspective. Psychological Methods, 3(4), 454-469.

Duncan, O. D. (1985). Probability, disposition and the inconsistency of attitudes and behaviour. Synthese, 42, 21-34.

Duncan, O. D., & Stenbeck, M. (1988). Panels and cohorts: Design and model in the study of voting turnout. In C. C. Clogg (Ed.), Sociological Methodology 1988 (pp. 1-35). Washington, DC: American Sociological Association.

Feinstein, A. R. (1995). Meta-analysis: Statistical alchemy for the 21st century. Journal of Clinical Epidemiology, 48(1), 71-79.

Fisher, W. P., Jr. (2009). Invariance and traceability for measures of human, social, and natural capital: Theory and application. Measurement, 42(9), 1278-1287.

Fisher, W. P., Jr. (2010). Statistics and measurement: Clarifying the differences. Rasch Measurement Transactions, 23(4), 1229-1230.

Fisher, W. P., Jr. (2012, May/June). What the world needs now: A bold plan for new standards [Third place, 2011 NIST/SES World Standards Day paper competition]. Standards Engineering, 64(3), 1 & 3-5.

Fisher, W. P., Jr., & Stenner, A. J. (2013). Overcoming the invisibility of metrology: A reading measurement network for education and the social sciences. Journal of Physics: Conference Series, 459(012024), http://iopscience.iop.org/1742-6596/459/1/012024.

Hunter, J. E., & Schmidt, F. L. (Eds.). (2004). Methods of meta-analysis: Correcting error and bias in research findings. Thousand Oaks, CA: Sage.

Mari, L., & Wilson, M. (2013). A gentle introduction to Rasch measurement models for metrologists. Journal of Physics Conference Series, 459(1), http://iopscience.iop.org/1742-6596/459/1/012002/pdf/1742-6596_459_1_012002.pdf.

Pendrill, L. (2014). Man as a measurement instrument [Special Feature]. NCSLi Measure: The Journal of Measurement Science, 9(4), 22-33.

Pendrill, L., & Fisher, W. P., Jr. (2015). Counting and quantification: Comparing psychometric and metrological perspectives on visual perceptions of number. Measurement, 71, 46-55. doi: http://dx.doi.org/10.1016/j.measurement.2015.04.010

Schmidt, F. L., & Hunter, J. E. (1977). Development of a general solution to the problem of validity generalization. Journal of Applied Psychology, 62(5), 529-540.

Smith, R. M. (1996). A comparison of methods for determining dimensionality in Rasch measurement. Structural Equation Modeling, 3(1), 25-40.

Wilson, M. R. (2013a). Seeking a balance between the statistical and scientific elements in psychometrics. Psychometrika, 78(2), 211-236.

Wilson, M. R. (2013b). Using the concept of a measurement system to characterize measurement models used in psychometrics. Measurement, 46, 3766-3774.

Wilson, M., Mari, L., Maul, A., & Torres Irribarra, D. (2015). A comparison of measurement concepts across physical science and social science domains: Instrument design, calibration, and measurement. Journal of Physics: Conference Series, 588(012034), http://iopscience.iop.org/1742-6596/588/1/012034.

Wright, B. D., & Stone, M. H. (1979). Best test design: Rasch measurement. Chicago, Illinois: MESA Press.

Comments on the New ANSI Human Capital Investor Metrics Standard

April 16, 2012

The full text of the proposed standard is available here.

It’s good to see a document emerge in this area, especially one with such a broad base of support from a diverse range of stakeholders. As is stated in the standard, the metrics defined in it are a good place to start and in many instances will likely improve the quality and quantity of the information made available to investors.

There are several issues to keep in mind as the value of standards for human capital metrics becomes more widely appreciated. First, in the context of a comprehensively defined investment framework, human capital is just one of the four major forms of capital, the other three being social, natural, and manufactured (Ekins, 1992; Ekins, Dresden, and Dahlstrom, 2008). To ensure as far as possible the long term stability and sustainability of their profits, and of the economic system as a whole, investors will certainly want to expand the range of the available standards to include social and natural capital along with human capital.

Second, though we manage what we measure, investment management is seriously compromised by having high quality scientific measurement standards only for manufactured capital (length, weight, volume, temperature, energy, time, kilowatts, etc.). Over 80 years of research on ability tests, surveys, rating scales, and assessments has reached a place from which it is prepared to revolutionize the management of intangible forms of capital (Fisher, 2007, 2009a, 2009b, 2010, 2011a, 2011b; Fisher & Stenner, 2011a, 2011b; Wilson, 2011; Wright, 1999). The very large reductions in transaction costs effected by standardized metrics in the economy at large (Barzel, 1982; Benham and Benham, 2000) are likely to have a similarly profound effect on the economics of human, social, and natural capital (Fisher, 2011a, 2012a, 2012b).

The potential for dramatic change in the conceptualization of metrics is most evident in the proposed standard in the sections on leadership quality and employee engagement. For instance, in the section on leadership quality, it is stated that “Investors will be able to directly compare all organizations that are using the same vendor’s methodology.” This kind of dependency should not be allowed to stand as a significant factor in a measurement standard. Properly constructed and validated scientific measures, such as those that have been in wide use in education, psychology and health care for several decades (Andrich, 2010; Bezruzcko, 2005; Bond and Fox, 2007; Fisher and Wright, 1994; Rasch, 1960; Salzberger, 2009; Wright, 1999), are equated to a common unit. Comparability should never depend on which vendor is used. Rather, any instrument that actually measures the construct of interest (leadership quality or employee engagement) should do so in a common unit and within an acceptable range of error. “Normalizing” measures for comparability, as is suggested in the standard, means employing psychometric methods that are 50 years out of date and that are far less rigorous and practical than need be. Transparency in measurement means looking through the instrument to the thing itself. If particular instruments color or reshape what is measured, or merely change the meaning of the numbers reported, then the integrity of the standard as a standard should be re-examined.

Third, for investments in human capital to be effectively managed, each distinct aspect of it (motivations, skills and abilities, health) needs to be measured separately, just as height, weight, and temperature are. New technologies have already transformed measurement practices in ways that make the necessary processes precise and inexpensive. Of special interest are adaptively administered precalibrated instruments supporting mass customized—but globally comparable—measures (for instance, see the examples at http://blog.lexile.com/tag/oasis/ and that were presented at the recent Pearson Global Research Conference in Fremantle, Australia http://www.pearson.com.au/marketing/corporate/pearson_global/default.html; also see Wright and Bell 1984, Lunz, Bergstrom, and Gershon, 1994, Bejar, et al., 2003).

Fourth, the ownership of human capital needs clarification and legal status. If we consider each individual to own their abilities, health, and motivations, and to be solely responsible for decisions made concerning the disposition of those properties, then, in accord with their proven measured amounts of each type of human capital, everyone ought to have legal title to a specific number of shares or credits of each type. This may transform employment away from wage-based job classification compensation to an individualized investment-based continuous quality improvement platform. The same kind of legal titling system will, of course, need to be worked out for social and natural capital, as well.

Fifth, given scientific standards for each major form of capital, practical measurement technologies, and legal title to our shares of capital, we will need expanded financial accounting standards and tools for managing our individual and collective investments. Ongoing research and debates concerning these standards and tools (Siegel and Borgia, 2006; Young and Williams, 2010) have yet to connect with the larger scientific, economic, and legal issues raised here, but developments in this direction should be emerging in due course.

Sixth, a number of lingering moral, ethical and political questions are cast in a new light in this context. The significance of individual behaviors and decisions is informed and largely determined by the context of the culture and institutions in which those behaviors and decisions are executed. Many of the morally despicable but not illegal investment decisions leading to the recent economic downturn put individuals in the position of either setting themselves apart and threatening their careers or doing what was best for their portfolios within the limits of the law. Current efforts intended to devise new regulatory constraints are misguided in focusing on ever more microscopically defined particulars. What is needed is instead a system in which profits are contingent on the growth of human, social, and natural capital. In that framework, legal but ultimately unfair practices would drive down social capital stock values, counterbalancing ill-gotten gains and making them unprofitable.

Seventh, the International Vocabulary of Measurement, now in its third edition (VIM3), is a standard recognized by all eight international standards accrediting bodies (BIPM, etc.). The VIM3 (http://www.bipm.org/en/publications/guides/vim.html) and forthcoming VIM4 are intended to provide a uniform set of concepts and terms for all fields that employ measures across the natural and social sciences. A new dialogue on these issues has commenced in the context of the International Measurement Confederation (IMEKO), whose member organizations are the weights and standards measurement institutes from countries around the world (Conference note, 2011). The 2012 President of the Psychometric Society, Mark Wilson, gave an invited address at the September 2011 IMEKO meeting (Wilson, 2011), and a member of the VIM3 editorial board, Luca Mari, is invited to speak at the July, 2012 International Meeting of the Psychometric Society. I encourage all interested parties to become involved in efforts of these kinds in their own fields.

References

Andrich, D. (2010). Sufficiency and conditional estimation of person parameters in the polytomous Rasch model. Psychometrika, 75(2), 292-308.

Barzel, Y. (1982). Measurement costs and the organization of markets. Journal of Law and Economics, 25, 27-48.

Bejar, I., Lawless, R. R., Morley, M. E., Wagner, M. E., Bennett, R. E., & Revuelta, J. (2003, November). A feasibility study of on-the-fly item generation in adaptive testing. The Journal of Technology, Learning, and Assessment, 2(3), 1-29; http://ejournals.bc.edu/ojs/index.php/jtla/article/view/1663.

Benham, A., & Benham, L. (2000). Measuring the costs of exchange. In C. Ménard (Ed.), Institutions, contracts and organizations: Perspectives from new institutional economics (pp. 367-375). Cheltenham, UK: Edward Elgar.

Bezruczko, N. (Ed.). (2005). Rasch measurement in health sciences. Maple Grove, MN: JAM Press.

Bond, T., & Fox, C. (2007). Applying the Rasch model: Fundamental measurement in the human sciences, 2d edition. Mahwah, New Jersey: Lawrence Erlbaum Associates.

Conference note. (2011). IMEKO Symposium: August 31- September 2, 2011, Jena, Germany. Rasch Measurement Transactions, 25(1), 1318.

Ekins, P. (1992). A four-capital model of wealth creation. In P. Ekins & M. Max-Neef (Eds.), Real-life economics: Understanding wealth creation (pp. 147-155). London: Routledge.

Ekins, P., Dresner, S., & Dahlstrom, K. (2008). The four-capital method of sustainable development evaluation. European Environment, 18(2), 63-80.

Fisher, W. P., Jr. (2007). Living capital metrics. Rasch Measurement Transactions, 21(1), 1092-3 [http://www.rasch.org/rmt/rmt211.pdf].

Fisher, W. P., Jr. (2009a). Invariance and traceability for measures of human, social, and natural capital: Theory and application. Measurement, 42(9), 1278-1287.

Fisher, W. P.. Jr. (2009b). NIST Critical national need idea White Paper: metrological infrastructure for human, social, and natural capital (http://www.nist.gov/tip/wp/pswp/upload/202_metrological_infrastructure_for_human_social_natural.pdf). Washington, DC: National Institute for Standards and Technology.

Fisher, W. P.. Jr. (2010). Rasch, Maxwell’s method of analogy, and the Chicago tradition. In G. Cooper (Chair), https://conference.cbs.dk/index.php/rasch/Rasch2010/paper/view/824. Probabilistic models for measurement in education, psychology, social science and health: Celebrating 50 years since the publication of Rasch’s Probabilistic Models.., University of Copenhagen School of Business, FUHU Conference Centre, Copenhagen, Denmark.

Fisher, W. P., Jr. (2011a). Bringing human, social, and natural capital to life: Practical consequences and opportunities. In N. Brown, B. Duckor, K. Draney & M. Wilson (Eds.), Advances in Rasch Measurement, Vol. 2 (pp. 1-27). Maple Grove, MN: JAM Press.

Fisher, W. P., Jr. (2011b). Measurement, metrology and the coordination of sociotechnical networks. In  S. Bercea (Chair), New Education and Training Methods. International Measurement Confederation (IMEKO), http://www.db-thueringen.de/servlets/DerivateServlet/Derivate-24491/ilm1-2011imeko-017.pdf, Jena, Germany.

Fisher, W. P., Jr. (2012a). Measure local, manage global: Intangible assets metric standards for sustainability. In J. Marques, S. Dhiman & S. Holt (Eds.), Business administration education: Changes in management and leadership strategies (pp. in press). New York: Palgrave Macmillan.

Fisher, W. P., Jr. (2012b). What the world needs now: A bold plan for new standards. Standards Engineering, 64, in press.

Fisher, W. P., Jr., & Stenner, A. J. (2011a). Metrology for the social, behavioral, and economic sciences (Social, Behavioral, and Economic Sciences White Paper Series). Retrieved 25 October 2011, from National Science Foundation: http://www.nsf.gov/sbe/sbe_2020/submission_detail.cfm?upld_id=36.

Fisher, W. P., Jr., & Stenner, A. J. (2011b). A technology roadmap for intangible assets metrology. In Fundamentals of measurement science. International Measurement Confederation (IMEKO) TC1-TC7-TC13 Joint Symposium, http://www.db-thueringen.de/servlets/DerivateServlet/Derivate-24493/ilm1-2011imeko-018.pdf, Jena, Germany.

Fisher, W. P., Jr., & Wright, B. D. (Eds.). (1994). Applications of probabilistic conjoint measurement. International Journal of Educational Research, 21(6), 557-664.

Lunz, M. E., Bergstrom, B. A., & Gershon, R. C. (1994). Computer adaptive testing. International Journal of Educational Research, 21(6), 623-634.

Rasch, G. (1960). Probabilistic models for some intelligence and attainment tests (Reprint, with Foreword and Afterword by B. D. Wright, Chicago: University of Chicago Press, 1980). Copenhagen, Denmark: Danmarks Paedogogiske Institut.

Salzberger, T. (2009). Measurement in marketing research: An alternative framework. Northampton, MA: Edward Elgar.

Siegel, P., & Borgia, C. (2006). The measurement and recognition of intangible assets. Journal of Business and Public Affairs, 1(1).

Wilson, M. (2011). The role of mathematical models in measurement: A perspective from psychometrics. In L. Mari (Chair), Plenary lecture. International Measurement Confederation (IMEKO), http://www.db-thueringen.de/servlets/DerivateServlet/Derivate-24178/ilm1-2011imeko-005.pdf, Jena, Germany.

Wright, B. D. (1999). Fundamental measurement for psychology. In S. E. Embretson & S. L. Hershberger (Eds.), The new rules of measurement: What every educator and psychologist should know (pp. 65-104 [http://www.rasch.org/memo64.htm]). Hillsdale, New Jersey: Lawrence Erlbaum Associates.

Wright, B. D., & Bell, S. R. (1984, Winter). Item banks: What, why, how. Journal of Educational Measurement, 21(4), 331-345 [http://www.rasch.org/memo43.htm].

Young, J. J., & Williams, P. F. (2010, August). Sorting and comparing: Standard-setting and “ethical” categories. Critical Perspectives on Accounting, 21(6), 509-521.

Creative Commons License
LivingCapitalMetrics Blog by William P. Fisher, Jr., Ph.D. is licensed under a Creative Commons Attribution-Noncommercial-No Derivative Works 3.0 United States License.
Based on a work at livingcapitalmetrics.wordpress.com.
Permissions beyond the scope of this license may be available at http://www.livingcapitalmetrics.com.

A New Agenda for Measurement Theory and Practice in Education and Health Care

April 15, 2011

Two key issues on my agenda offer different answers to the question “Why do you do things the way you do in measurement theory and practice?”

First, we can take up the “Because of…” answer to this question. We need to articulate an historical account of measurement that does three things:

  1. that builds on Rasch’s use of Maxwell’s method of analogy by employing it and expanding on it in new applications;
  2. that unifies the vocabulary and concepts of measurement across the sciences into a single framework so far as possible by situating probabilistic models of invariant individual-level within-variable phenomena in the context of measurement’s GIGO principle and data-to-model fit, as distinct from the interactions of group-level between-variable phenomena in the context of statistics’ model-to-data fit; and
  3. that stresses the social, collective cognition facilitated by networks of individuals whose point-of-use measurement-informed decisions and behaviors are coordinated and harmonized virtually, at a distance, with no need for communication or negotiation.

We need multiple publications in leading journals on these issues, as well as one or more books that people can cite as a way of making this real and true history of measurement, properly speaking, credible and accepted in the mainstream. This web site http://ssrn.com/abstract=1698919 is a draft article of my own in this vein that I offer for critique; other material is available on request. Anyone who works on this paper with me and makes a substantial contribution to its publication will be added as co-author.

Second, we can take up the “In order that…” answer to the question “Why do you do things the way you do?” From this point of view, we need to broaden the scope of the measurement research agenda beyond data analysis, estimation, models, and fit assessment in three ways:

  1. by emphasizing predictive construct theories that exhibit the fullest possible understanding of what is measured and so enable the routine reproduction of desired proportionate effects efficiently, with no need to analyze data to obtain an estimate;
  2. by defining the standard units to which all calibrated instruments measuring given constructs are traceable; and
  3. by disseminating to front line users on mass scales instruments measuring in publicly available standard units and giving immediate feedback at the point of use.

These two sets of issues define a series of talking points that together constitute a new narrative for measurement in education, psychology, health care, and many other fields. We and others may see our way to organizing new professional societies, new journals, new university-based programs of study, etc. around these principles.

Creative Commons License
LivingCapitalMetrics Blog by William P. Fisher, Jr., Ph.D. is licensed under a Creative Commons Attribution-Noncommercial-No Derivative Works 3.0 United States License.
Based on a work at livingcapitalmetrics.wordpress.com.
Permissions beyond the scope of this license may be available at http://www.livingcapitalmetrics.com.

A Second Simple Example of Measurement’s Role in Reducing Transaction Costs, Enhancing Market Efficiency, and Enables the Pricing of Intangible Assets

March 9, 2011

The prior post here showed why we should not confuse counts of things with measures of amounts, though counts are the natural starting place to begin constructing measures. That first simple example focused on an analogy between counting oranges and measuring the weight of oranges, versus counting correct answers on tests and measuring amounts of ability. This second example extends the first by, in effect, showing what happens when we want to aggregate value not just across different counts of some one thing but across different counts of different things. The point will be, in effect, to show how the relative values of apples, oranges, grapes, and bananas can be put into a common frame of reference and compared in a practical and convenient way.

For instance, you may go into a grocery store to buy raspberries and blackberries, and I go in to buy cantaloupe and watermelon. Your cost per individual fruit will be very low, and mine will be very high, but neither of us will find this annoying, confusing, or inconvenient because your fruits are very small, and mine, very large. Conversely, your cost per kilogram will be much higher than mine, but this won’t cause either of us any distress because we both recognize the differences in the labor, handling, nutritional, and culinary value of our purchases.

But what happens when we try to purchase something as complex as a unit of socioeconomic development? The eight UN Millennium Development Goals (MDGs) represent a start at a systematic effort to bring human, social, and natural capital together into the same economic and accountability framework as liquid and manufactured capital, and property. But that effort is stymied by the inefficiency and cost of making and using measures of the goals achieved. The existing MDG databases (http://data.un.org/Browse.aspx?d=MDG), and summary reports present overwhelming numbers of numbers. Individual indicators are presented for each year, each country, each region, and each program, goal by goal, target by target, indicator by indicator, and series by series, in an indigestible volume of data.

Though there are no doubt complex mathematical methods by which a philanthropic, governmental, or NGO investor might determine how much development is gained per million dollars invested, the cost of obtaining impact measures is so high that most funding decisions are made with little information concerning expected returns (Goldberg, 2009). Further, the percentages of various needs met by leading social enterprises typically range from 0.07% to 3.30%, and needs are growing, not diminishing. Progress at current rates means that it would take thousands of years to solve today’s problems of human suffering, social disparity, and environmental quality. The inefficiency of human, social, and natural capital markets is so overwhelming that there is little hope for significant improvements without the introduction of fundamental infrastructural supports, such as an Intangible Assets Metric System.

A basic question that needs to be asked of the MDG system is, how can anyone make any sense out of so much data? Most of the indicators are evaluated in terms of counts of the number of times something happens, the number of people affected, or the number of things observed to be present. These counts are usually then divided by the maximum possible (the count of the total population) and are expressed as percentages or rates.

As previously explained in various posts in this blog, counts and percentages are not measures in any meaningful sense. They are notoriously difficult to interpret, since the quantitative meaning of any given unit difference varies depending on the size of what is counted, or where the percentage falls in the 0-100 continuum. And because counts and percentages are interpreted one at a time, it is very difficult to know if and when any number included in the sheer mass of data is reasonable, all else considered, or if it is inconsistent with other available facts.

A study of the MDG data must focus on these three potential areas of data quality improvement: consistency evaluation, volume reduction, and interpretability. Each builds on the others. With consistent data lending themselves to summarization in sufficient statistics, data volume can be drastically reduced with no loss of information (Andersen, 1977, 1999; Wright, 1977, 1997), data quality can be readily assessed in terms of sufficiency violations (Smith, 2000; Smith & Plackner, 2009), and quantitative measures can be made interpretable in terms of a calibrated ruler’s repeatedly reproducible hierarchy of indicators (Bond & Fox, 2007; Masters, Lokan, & Doig, 1994).

The primary data quality criteria are qualitative relevance and meaningfulness, on the one hand, and mathematical rigor, on the other. The point here is one of following through on the maxim that we manage what we measure, with the goal of measuring in such a way that management is better focused on the program mission and not distracted by accounting irrelevancies.

Method

As written and deployed, each of the MDG indicators has the face and content validity of providing information on each respective substantive area of interest. But, as has been the focus of repeated emphases in this blog, counting something is not the same thing as measuring it.

Counts or rates of literacy or unemployment are not, in and of themselves, measures of development. Their capacity to serve as contributing indications of developmental progress is an empirical question that must be evaluated experimentally against the observable evidence. The measurement of progress toward an overarching developmental goal requires inferences made from a conceptual order of magnitude above and beyond that provided in the individual indicators. The calibration of an instrument for assessing progress toward the realization of the Millennium Development Goals requires, first, a reorganization of the existing data, and then an analysis that tests explicitly the relevant hypotheses as to the potential for quantification, before inferences supporting the comparison of measures can be scientifically supported.

A subset of the MDG data was selected from the MDG database available at http://data.un.org/Browse.aspx?d=MDG, recoded, and analyzed using Winsteps (Linacre, 2011). At least one indicator was selected from each of the eight goals, with 22 in total. All available data from these 22 indicators were recorded for each of 64 countries.

The reorganization of the data is nothing but a way of making the interpretation of the percentages explicit. The meaning of any one country’s percentage or rate of youth unemployment, cell phone users, or literacy has to be kept in context relative to expectations formed from other countries’ experiences. It would be nonsense to interpret any single indicator as good or bad in isolation. Sometimes 30% represents an excellent state of affairs, other times, a terrible one.

Therefore, the distributions of each indicator’s percentages across the 64 countries were divided into ranges and converted to ratings. A lower rating uniformly indicates a status further away from the goal than a higher rating. The ratings were devised by dividing the frequency distribution of each indicator roughly into thirds.

For instance, the youth unemployment rate was found to vary such that the countries furthest from the desired goal had rates of 25% and more(rated 1), and those closest to or exceeding the goal had rates of 0-10% (rated 3), leaving the middle range (10-25%) rated 2. In contrast, percentages of the population that are undernourished were rated 1 for 35% or more, 2 for 15-35%, and 3 for less than 15%.

Thirds of the distributions were decided upon only on the basis of the investigator’s prior experience with data of this kind. A more thorough approach to the data would begin from a finer-grained rating system, like that structuring the MDG table at http://mdgs.un.org/unsd/mdg/Resources/Static/Products/Progress2008/MDG_Report_2008_Progress_Chart_En.pdf. This greater detail would be sought in order to determine empirically just how many distinctions each indicator can support and contribute to the overall measurement system.

Sixty-four of the available 336 data points were selected for their representativeness, with no duplications of values and with a proportionate distribution along the entire continuum of observed values.

Data from the same 64 countries and the same years were then sought for the subsequent indicators. It turned out that the years in which data were available varied across data sets. Data within one or two years of the target year were sometimes substituted for missing data.

The data were analyzed twice, first with each indicator allowed its own rating scale, parameterizing each of the category difficulties separately for each item, and then with the full rating scale model, as the results of the first analysis showed all indicators shared strong consistency in the rating structure.

Results

Data were 65.2% complete. Countries were assessed on an average of 14.3 of the 22 indicators, and each indicator was applied on average to 41.7 of the 64 country cases. Measurement reliability was .89-.90, depending on how measurement error is estimated. Cronbach’s alpha for the by-country scores was .94. Calibration reliability was .93-.95. The rating scale worked well (see Linacre, 2002, for criteria). The data fit the measurement model reasonably well, with satisfactory data consistency, meaning that the hypothesis of a measurable developmental construct was not falsified.

The main result for our purposes here concerns how satisfactory data consistency makes it possible to dramatically reduce data volume and improve data interpretability. The figure below illustrates how. What does it mean for data volume to be drastically reduced with no loss of information? Let’s see exactly how much the data volume is reduced for the ten item data subset shown in the figure below.

The horizontal continuum from -100 to 1300 in the figure is the metric, the ruler or yardstick. The number of countries at various locations along that ruler is shown across the bottom of the figure. The mean (M), first standard deviation (S), and second standard deviation (T) are shown beneath the numbers of countries. There are ten countries with a measure of just below 400, just to the left of the mean (M).

The MDG indicators are listed on the right of the figure, with the indicator most often found being achieved relative to the goals at the bottom, and the indicator least often being achieved at the top. The ratings in the middle of the figure increase from 1 to 3 left to right as the probability of goal achievement increases as the measures go from low to high. The position of the ratings in the middle of the figure shifts from left to right as one reads up the list of indicators because the difficulty of achieving the goals is increasing.

Because the ratings of the 64 countries relative to these ten goals are internally consistent, nothing but the developmental level of the country and the developmental challenge of the indicator affects the probability that a given rating will be attained. It is this relation that defines fit to a measurement model, the sufficiency of the summed ratings, and the interpretability of the scores. Given sufficient fit and consistency, any country’s measure implies a given rating on each of the ten indicators.

For instance, imagine a vertical line drawn through the figure at a measure of 500, just above the mean (M). This measure is interpreted relative to the places at which the vertical line crosses the ratings in each row associated with each of the ten items. A measure of 500 is read as implying, within a given range of error, uncertainty, or confidence, a rating of

  • 3 on debt service and female-to-male parity in literacy,
  • 2 or 3 on how much of the population is undernourished and how many children under five years of age are moderately or severely underweight,
  • 2 on infant mortality, the percent of the population aged 15 to 49 with HIV, and the youth unemployment rate,
  • 1 or 2 the poor’s share of the national income, and
  • 1 on CO2 emissions and the rate of personal computers per 100 inhabitants.

For any one country with a measure of 500 on this scale, ten percentages or rates that appear completely incommensurable and incomparable are found to contribute consistently to a single valued function, developmental goal achievement. Instead of managing each separate indicator as a universe unto itself, this scale makes it possible to manage development itself at its own level of complexity. This ten-to-one ratio of reduced data volume is more than doubled when the total of 22 items included in the scale is taken into account.

This reduction is conceptually and practically important because it focuses attention on the actual object of management, development. When the individual indicators are the focus of attention, the forest is lost for the trees. Those who disparage the validity of the maxim, you manage what you measure, are often discouraged by the the feeling of being pulled in too many directions at once. But a measure of the HIV infection rate is not in itself a measure of anything but the HIV infection rate. Interpreting it in terms of broader developmental goals requires evidence that it in fact takes a place in that larger context.

And once a connection with that larger context is established, the consistency of individual data points remains a matter of interest. As the world turns, the order of things may change, but, more likely, data entry errors, temporary data blips, and other factors will alter data quality. Such changes cannot be detected outside of the context defined by an explicit interpretive framework that requires consistent observations.

-100  100     300     500     700     900    1100    1300
|-------+-------+-------+-------+-------+-------+-------|  NUM   INDCTR
1                                 1  :    2    :  3     3    9  PcsPer100
1                         1   :   2    :   3            3    8  CO2Emissions
1                    1  :    2    :   3                 3   10  PoorShareNatInc
1                 1  :    2    :  3                     3   19  YouthUnempRatMF
1              1   :    2   :   3                       3    1  %HIV15-49
1            1   :   2    :   3                         3    7  InfantMortality
1          1  :    2    :  3                            3    4  ChildrenUnder5ModSevUndWgt
1         1   :    2    :  3                            3   12  PopUndernourished
1    1   :    2   :   3                                 3    6  F2MParityLit
1   :    2    :  3                                      3    5  DebtServExpInc
|-------+-------+-------+-------+-------+-------+-------|  NUM   INDCTR
-100  100     300     500     700     900    1100    1300
                   1
       1   1 13445403312323 41 221    2   1   1            COUNTRIES
       T      S       M      S       T

Discussion

A key element in the results obtained here concerns the fact that the data were about 35% missing. Whether or not any given indicator was actually rated for any given country, the measure can still be interpreted as implying the expected rating. This capacity to take missing data into account can be taken advantage of systematically by calibrating a large bank of indicators. With this in hand, it becomes possible to gather only the amount of data needed to make a specific determination, or to adaptively administer the indicators so as to obtain the lowest-error (most reliable) measure at the lowest cost (with the fewest indicators administered). Perhaps most importantly, different collections of indicators can then be equated to measure in the same unit, so that impacts may be compared more efficiently.

Instead of an international developmental aid market that is so inefficient as to preclude any expectation of measured returns on investment, setting up a calibrated bank of indicators to which all measures are traceable opens up numerous desirable possibilities. The cost of assessing and interpreting the data informing aid transactions could be reduced to negligible amounts, and the management of the processes and outcomes in which that aid is invested would be made much more efficient by reduced data volume and enhanced information content. Because capital would flow more efficiently to where supply is meeting demand, nonproducers would be cut out of the market, and the effectiveness of the aid provided would be multiplied many times over.

The capacity to harmonize counts of different but related events into a single measurement system presents the possibility that there may be a bright future for outcomes-based budgeting in education, health care, human resource management, environmental management, housing, corrections, social services, philanthropy, and international development. It may seem wildly unrealistic to imagine such a thing, but the return on the investment would be so monumental that not checking it out would be even crazier.

A full report on the MDG data, with the other references cited, is available on my SSRN page at http://papers.ssrn.com/sol3/papers.cfm?abstract_id=1739386.

Goldberg, S. H. (2009). Billions of drops in millions of buckets: Why philanthropy doesn’t advance social progress. New York: Wiley.

Creative Commons License
LivingCapitalMetrics Blog by William P. Fisher, Jr., Ph.D. is licensed under a Creative Commons Attribution-Noncommercial-No Derivative Works 3.0 United States License.
Based on a work at livingcapitalmetrics.wordpress.com.
Permissions beyond the scope of this license may be available at http://www.livingcapitalmetrics.com.

Consequences of Standardized Technical Effects for Scientific Advancement

January 24, 2011

Note. This is modified from:

Fisher, W. P., Jr. (2004, Wednesday, January 21). Consequences of standardized technical effects for scientific advancement. In  A. Leplège (Chair), Session 2.5A. Rasch Models: History and Philosophy. Second International Conference on Measurement in Health, Education, Psychology, and Marketing: Developments with Rasch Models, The International Laboratory for Measurement in the Social Sciences, School of Education, Murdoch University, Perth, Western Australia.

—————————

Over the last several decades, historians of science have repeatedly produced evidence contradicting the widespread assumption that technology is a product of experimentation and/or theory (Kuhn 1961; Latour 1987; Rabkin 1992; Schaffer 1992; Hankins & Silverman 1999; Baird 2002). Theory and experiment typically advance only within the constraints set by a key technology that is widely available to end users in applied and/or research contexts. Thus, “it is not just a clever historical aphorism, but a general truth, that ‘thermodynamics owes much more to the steam engine than ever the steam engine owed to thermodynamics’” (Price 1986, p. 240).

The prior existence of the relevant technology comes to bear on theory and experiment again in the common, but mistaken, assumption that measures are made and experimentally compared in order to discover scientific laws. History and the logic of measurement show that measures are rarely made until the relevant law is effectively embodied in an instrument (Kuhn 1961; Michell 1999). This points to the difficulty experienced in metrologically fusing (Schaffer 1992, p. 27; Lapré & van Wassenhove 2002) instrumentalists’ often inarticulate, but materially effective, knowledge (know-how) with theoreticians’ often immaterial, but well articulated, knowledge (know-why) (Galison 1999; Baird 2002).

Because technology often dictates what, if any, phenomena can be consistently produced, it constrains experimentation and theorizing by focusing attention selectively on reproducible, potentially interpretable effects, even when those effects are not well understood (Ackermann 1985; Daston & Galison 1992; Ihde 1998; Hankins & Silverman 1999; Maasen & Weingart 2001). Criteria for theory choice in this context stem from competing explanatory frameworks’ experimental capacities to facilitate instrument improvements, prediction of experimental results, and gains in the efficiency with which a phenomenon is produced.

In this context, the relatively recent introduction of measurement models requiring additive, invariant parameterizations (Rasch 1960) provokes speculation as to the effect on the human sciences that might be wrought by the widespread availability of consistently reproducible effects expressed in common quantitative languages. Paraphrasing Price’s comment on steam engines and thermodynamics, might it one day be said that as yet unforeseeable advances in reading theory will owe far more to the Lexile analyzer (Burdick & Stenner 1996) than ever the Lexile analyzer owed reading theory?

Kuhn (1961) speculated that the second scientific revolution of the mid-nineteenth century followed in large part from the full mathematization of physics, i.e., the emergence of metrology as a professional discipline focused on providing universally accessible uniform units of measurement (Roche 1998). Might a similar revolution and new advances in the human sciences follow from the introduction of rigorously mathematical uniform measures?

Measurement technologies capable of supporting the calibration of additive units that remain invariant over instruments and samples (Rasch 1960) have been introduced relatively recently in the human sciences. The invariances produced appear 1) very similar to those produced in the natural sciences (Fisher 1997) and 2) based in the same mathematical metaphysics as that informing the natural sciences (Fisher 2003). Might then it be possible that the human sciences are on the cusp of a revolution analogous to that of nineteenth century physics? Other factors involved in answering this question, such as the professional status of the field, the enculturation of students, and the scale of the relevant enterprises, define the structure of circumstances that might be capable of supporting the kind of theoretical consensus and research productivity that came to characterize, for instance, work in electrical resistance through the early 1880s (Schaffer 1992).

Much could be learned from Rasch’s use of Maxwell’s method of analogy (Nersessian, 2002; Turner, 1955), not just in the modeling of scientific laws but from the social and economic factors that made the regularities of natural phenomena function as scientific capital (Latour, 1987). Quantification must be understood in the fully mathematical sense of commanding a comprehensive grasp of the real root of mathematical thinking. Far from being simply a means of producing numbers, to be useful, quantification has to result in qualitatively transparent figure-meaning relations at any point of use for any one of every different kind of user. Connections between numbers and unit amounts of the variable must remain constant across samples, instruments, time, space, and measurers. Quantification that does not support invariant linear comparisons expressed in a uniform metric available universally to all end users at the point of need is inadequate and incomplete. Such standardization is widely respected in the natural sciences but is virtually unknown in the human sciences, largely due to untested hypotheses and unexamined prejudices concerning the viability of universal uniform measures for the variables measured via tests, surveys, and performance assessments.

Quantity is an effective medium for science to the extent that it comprises an instance of the kind of common language necessary for distributed, collective thinking; for widespread agreement on what makes research results compelling; and for the formation of social capital’s group-level effects. It may be that the primary relevant difference between the case of 19th century physics and today’s human sciences concerns the awareness, widespread among scientists in the 1800s and virtually nonexistent in today’s human sciences, that universal uniform metrics for the variables of interest are both feasible and of great human, scientific, and economic value.

In the creative dynamics of scientific instrument making, as in the making of art, the combination of inspiration and perspiration can sometimes result in cultural gifts of the first order. It nonetheless often happens that some of these superlative gifts, no matter how well executed, are unable to negotiate the conflict between commodity and gift economics characteristic of the marketplace (Baird, 1997; Hagstrom, 1965; Hyde, 1979), and so remain unknown, lost to the audiences they deserve, and unable to render their potential effects historically. Value is not an intrinsic characteristic of the gift; rather, value is ascribed as a function of interests. If interests are not cultivated via the clear definition of positive opportunities for self-advancement, common languages, socio-economic relations, and recruitment, gifts of even the greatest potential value may die with their creators. On the other hand, who has not seen mediocrity disproportionately rewarded merely as a result of intensive marketing?

A central problem is then how to strike a balance between individual or group interests and the public good. Society and individuals are interdependent in that children are enculturated into the specific forms of linguistic and behavioral competence that are valued in communities at the same time that those communities are created, maintained, and reproduced through communicative actions (Habermas, 1995, pp. 199-200). The identities of individuals and societies then co-evolve, as each defines itself through the other via the medium of language. Language is understood broadly in this context to include all perceptual reading of the environment, bodily gestures, social action, etc., as well as the use of spoken or written symbols and signs (Harman, 2005; Heelan, 1983; Ihde, 1998; Nicholson, 1984; Ricoeur, 1981).

Technologies extend language by providing media for the inscription of new kinds of signs (Heelan, 1983a, 1998; Ihde, 1991, 1998; Ihde & Selinger, 2003). Thus, mobility desires and practices are inscribed and projected into the world using the automobile; shelter and life style, via housing and clothing; and communications, via alphabets, scripts, phonemes, pens and paper, telephones, and computers. Similarly, technologies in the form of test, survey, and assessment instruments provide the devices on which we inscribe desires for social mobility, career advancement, health maintenance and improvement, etc.

References

Ackermann, J. R. (1985). Data, instruments, and theory: A dialectical approach to understanding science. Princeton, New Jersey: Princeton University Press.

Baird, D. (1997, Spring-Summer). Scientific instrument making, epistemology, and the conflict between gift and commodity economics. Techné: Journal of the Society for Philosophy and Technology, 2(3-4), 25-46. Retrieved 08/28/2009, from http://scholar.lib.vt.edu/ejournals/SPT/v2n3n4/baird.html.

Baird, D. (2002, Winter). Thing knowledge – function and truth. Techné: Journal of the Society for Philosophy and Technology, 6(2). Retrieved 19/08/2003, from http://scholar.lib.vt.edu/ejournals/SPT/v6n2/baird.html.

Burdick, H., & Stenner, A. J. (1996). Theoretical prediction of test items. Rasch Measurement Transactions, 10(1), 475 [http://www.rasch.org/rmt/rmt101b.htm].

Daston, L., & Galison, P. (1992, Fall). The image of objectivity. Representations, 40, 81-128.

Galison, P. (1999). Trading zone: Coordinating action and belief. In M. Biagioli (Ed.), The science studies reader (pp. 137-160). New York, New York: Routledge.

Habermas, J. (1995). Moral consciousness and communicative action. Cambridge, Massachusetts: MIT Press.

Hagstrom, W. O. (1965). Gift-giving as an organizing principle in science. The Scientific Community. New York: Basic Books, pp. 12-22. (Rpt. in B. Barnes, (Ed.). (1972). Sociology of science: Selected readings (pp. 105-20). Baltimore, Maryland: Penguin Books.

Hankins, T. L., & Silverman, R. J. (1999). Instruments and the imagination. Princeton, New Jersey: Princeton University Press.

Harman, G. (2005). Guerrilla metaphysics: Phenomenology and the carpentry of things. Chicago: Open Court.

Hyde, L. (1979). The gift: Imagination and the erotic life of property. New York: Vintage Books.

Ihde, D. (1998). Expanding hermeneutics: Visualism in science. Northwestern University Studies in Phenomenology and Existential Philosophy). Evanston, Illinois: Northwestern University Press.

Kuhn, T. S. (1961). The function of measurement in modern physical science. Isis, 52(168), 161-193. (Rpt. in The essential tension: Selected studies in scientific tradition and change (pp. 178-224). Chicago, Illinois: University of Chicago Press (Original work published 1977).

Lapré, M. A., & Van Wassenhove, L. N. (2002, October). Learning across lines: The secret to more efficient factories. Harvard Business Review, 80(10), 107-11.

Latour, B. (1987). Science in action: How to follow scientists and engineers through society. New York, New York: Cambridge University Press.

Maasen, S., & Weingart, P. (2001). Metaphors and the dynamics of knowledge. (Vol. 26. Routledge Studies in Social and Political Thought). London: Routledge.

Michell, J. (1999). Measurement in psychology: A critical history of a methodological concept. Cambridge: Cambridge University Press.

Nersessian, N. J. (2002). Maxwell and “the Method of Physical Analogy”: Model-based reasoning, generic abstraction, and conceptual change. In D. Malament (Ed.), Essays in the history and philosophy of science and mathematics (pp. 129-166). Lasalle, Illinois: Open Court.

Price, D. J. d. S. (1986). Of sealing wax and string. In Little Science, Big Science–and Beyond (pp. 237-253). New York, New York: Columbia University Press. p. 240:

Rabkin, Y. M. (1992). Rediscovering the instrument: Research, industry, and education. In R. Bud & S. E. Cozzens (Eds.), Invisible connections: Instruments, institutions, and science (pp. 57-82). Bellingham, Washington: SPIE Optical Engineering Press.

Rasch, G. (1960). Probabilistic models for some intelligence and attainment tests (Reprint, with Foreword and Afterword by B. D. Wright, Chicago: University of Chicago Press, 1980). Copenhagen, Denmark: Danmarks Paedogogiske Institut.

Roche, J. (1998). The mathematics of measurement: A critical history. London: The Athlone Press.

Schaffer, S. (1992). Late Victorian metrology and its instrumentation: A manufactory of Ohms. In R. Bud & S. E. Cozzens (Eds.), Invisible connections: Instruments, institutions, and science (pp. 23-56). Bellingham, WA: SPIE Optical Engineering Press.

Turner, J. (1955, November). Maxwell on the method of physical analogy. British Journal for the Philosophy of Science, 6, 226-238.

Creative Commons License
LivingCapitalMetrics Blog by William P. Fisher, Jr., Ph.D. is licensed under a Creative Commons Attribution-Noncommercial-No Derivative Works 3.0 United States License.
Based on a work at livingcapitalmetrics.wordpress.com.
Permissions beyond the scope of this license may be available at http://www.livingcapitalmetrics.com.

Newton, Metaphysics, and Measurement

January 20, 2011

Though Newton claimed to deduce quantitative propositions from phenomena, the record shows that he brought a whole cartload of presuppositions to bear on his observations (White, 1997), such as his belief that Pythagoras was the discoverer of the inverse square law, his knowledge of Galileo’s freefall experiments, and his theological and astrological beliefs in occult actions at a distance. Without his immersion in this intellectual environment, he likely would not have been able to then contrive the appearance of deducing quantity from phenomena.

The second edition of the Principia, in which appears the phrase “hypotheses non fingo,” was brought out in part to respond to the charge that Newton had not offered any explanation of what gravity is. De Morgan, in particular, felt that Newton seemed to know more than he could prove (Keynes, 1946). But in his response to the critics, and in asserting that he feigns no hypotheses, Newton was making an important distinction between explaining the causes or composition of gravity and describing how it works. Newton was saying he did not rely on or make or test any hypotheses as to what gravity is; his only concern was with how it behaves. In due course, gravity came to be accepted as a fundamental feature of the universe in no need of explanation.

Heidegger (1977, p. 121) contends that Newton was, as is implied in the translation “I do not feign hypotheses,” saying in effect that the ground plan he was offering as a basis for experiment and practical application was not something he just made up. Despite Newton’s rejection of metaphysical explanations, the charge of not explaining gravity for what it is was being answered with a metaphysics of how, first, to derive the foundation for a science of precise predictive control from nature, and then resituate that foundation back within nature as an experimental method incorporating a mathematical plan or model. This was, of course, quite astute of Newton, as far as he went, but he stopped far short of articulating the background assumptions informing his methods.

Newton’s desire for a logic of experimental science led him to reject anything “metaphysical or physical, or based on occult qualities, or mechanical” as a foundation for proceeding. Following in Descartes’ wake, Newton then was satisfied to solidify the subject-object duality and to move forward on the basis of objective results that seemed to make metaphysics a thing of the past. Unfortunately, as Burtt (1954/1932, pp. 225-230) observes in this context, the only thing that can possibly happen when you presume discourse to be devoid of metaphysical assumptions is that your metaphysics is more subtly insinuated and communicated to others because it is not overtly presented and defended. Thus we have the history of logical positivism as the dominant philosophy of science.

It is relevant to recall here that Newton was known for strong and accurate intuitions, and strong and unorthodox religious views (he held the Lucasian Chair at Cambridge only by royal dispensation, as he was not Anglican). It must be kept in mind that Newton’s combination of personal characteristics was situated in the social context of the emerging scientific culture’s increasing tendency to prioritize results that could be objectively detached from the particular people, equipment, samples, etc. involved in their production (Shapin, 1989). Newton then had insights that, while remarkably accurate, could not be entirely derived from the evidence he offered and that, moreover, could not acceptably be explained informally, psychologically, or theologically.

What is absolutely fascinating about this constellation of factors is that it became a model for the conduct of science. Of course, Newton’s laws of motion were adopted as the hallmark of successful scientific modeling in the form of the Standard Model applied throughout physics in the nineteenth century (Heilbron, 1993). But so was the metaphysical positivist logic of a pure objectivism detached from everything personal, intuitive, metaphorical, social, economic, or religious (Burtt, 1954/1932).

Kuhn (1970) made a major contribution to dismantling this logic when he contrasted textbook presentations of the methodical production of scientific effects with the actual processes of cobbled-together fits and starts that are lived out in the work of practicing scientists. But much earlier, James Clerk Maxwell (1879, pp. 162-163) had made exactly the same observation in a contrast of the work of Ampere with that of Faraday:

“The experimental investigation by which Ampere established the laws of the mechanical action between electric currents is one of the most brilliant achievements in science. The whole, theory and experiment, seems as if it had leaped, full grown and full armed, from the brain of the ‘Newton of electricity.’ It is perfect in form, and unassailable in accuracy, and it is summed up in a formula from which all the phenomena may be deduced, and which must always remain the cardinal formula of electro-dynamics.

“The method of Ampere, however, though cast into an inductive form, does not allow us to trace the formation of the ideas which guided it. We can scarcely believe that Ampere really discovered the law of action by means of the experiments which he describes. We are led to suspect, what, indeed, he tells us himself* [Ampere’s Theorie…, p. 9], that he discovered the law by some process which he has not shewn us, and that when he had afterwards built up a perfect demonstration he removed all traces of the scaffolding by which he had raised it.

“Faraday, on the other hand, shews us his unsuccessful as well as his successful experiments, and his crude ideas as well as his developed ones, and the reader, however inferior to him in inductive power, feels sympathy even more than admiration, and is tempted to believe that, if he had the opportunity, he too would be a discoverer. Every student therefore should read Ampere’s research as a splendid example of scientific style in the statement of a discovery, but he should also study Faraday for the cultivation of a scientific spirit, by means of the action and reaction which will take place between newly discovered facts and nascent ideas in his own mind.”

Where does this leave us? In sum, Rasch emulated Ampere in two ways. He did so first in wanting to become the “Newton of reading,” or even the “Newton of psychosocial constructs,” when he sought to show that data from reading test items and readers are structured with an invariance analogous to that of data from instruments applying a force to an object with mass (Rasch, 1960, pp. 110-115). Rasch emulated Ampere again when, like Ampere, after building up a perfect demonstration of a reading law structured in the form of Newton’s second law, he did not report the means by which he had constructed test items capable of producing the data fitting the model, effectively removing all traces of the scaffolding.

The scaffolding has been reconstructed for reading (Stenner, et al., 2006) and has also been left in plain view by others doing analogous work involving other constructs (cognitive and moral development, mathematics ability, short-term memory, etc.). Dawson (2002), for instance, compares developmental scoring systems of varying sophistication and predictive control. And it may turn out that the plethora of uncritically applied Rasch analyses may turn out to be a capital resource for researchers interested in focusing on possible universal laws, predictive theories, and uniform metrics.

That is, published reports of calibration, error, and fit estimates open up opportunities for “pseudo-equating” (Beltyukova, Stone, & Fox, 2004; Fisher 1997, 1999) in their documentation of the invariance, or lack thereof, of constructs over samples and instruments. The evidence will point to a need for theoretical and metric unification directly analogous to what happened in the study and use of electricity in the nineteenth century:

“…’the existence of quantitative correlations between the various forms of energy, imposes upon men of science the duty of bringing all kinds of physical quantity to one common scale of comparison.’” [Schaffer, 1992, p. 26; quoting Everett 1881; see Smith & Wise 1989, pp. 684-4]

Qualitative and quantitative correlations in scaling results converged on a common construct in the domain of reading measurement through the 1960s and 1970s, culminating in the Anchor Test Study and the calibration of the National Reference Scale for Reading (Jaeger, 1973; Rentz & Bashaw, 1977). The lack of a predictive theory and the entirely empirical nature of the scale estimates prevented the scale from wide application, as the items in the tests that were equated were soon replaced with new items.

But the broad scale of the invariance observed across tests and readers suggests that some mechanism must be at work (Stenner, Stone, & Burdick, 2009), or that some form of life must be at play (Fisher, 2003a, 2003b, 2004, 2010a), structuring the data. Eventually, some explanation accounting for the structure ought to become apparent, as it did for reading (Stenner, Smith, & Burdick, 1983; Stenner, et al., 2006). This emergence of self-organizing structures repeatedly asserting themselves as independently existing real things is the medium of the message we need to hear. That message is that instruments play a very large and widely unrecognized role in science. By facilitating the routine production of mutually consistent, regularly observable, and comparable results they set the stage for theorizing, the emergence of consensus on what’s what, and uniform metrics (Daston & Galison, 2007; Hankins & Silverman, 1999; Latour, 1987, 2005; Wise, 1988, 1995). The form of Rasch’s models as extensions of Maxwell’s method of analogy (Fisher, 2010b) makes them particularly productive as a means of providing self-organizing invariances with a medium for their self-inscription. But that’s a story for another day.

References

Beltyukova, S. A., Stone, G. E., & Fox, C. M. (2004). Equating student satisfaction measures. Journal of Applied Measurement, 5(1), 62-9.

Burtt, E. A. (1954/1932). The metaphysical foundations of modern physical science (Rev. ed.) [First edition published in 1924]. Garden City, New York: Doubleday Anchor.

Daston, L., & Galison, P. (2007). Objectivity. Cambridge, MA: MIT Press.

Dawson, T. L. (2002, Summer). A comparison of three developmental stage scoring systems. Journal of Applied Measurement, 3(2), 146-89.

Fisher, W. P., Jr. (1997). Physical disability construct convergence across instruments: Towards a universal metric. Journal of Outcome Measurement, 1(2), 87-113.

Fisher, W. P., Jr. (1999). Foundations for health status metrology: The stability of MOS SF-36 PF-10 calibrations across samples. Journal of the Louisiana State Medical Society, 151(11), 566-578.

Fisher, W. P., Jr. (2003a, December). Mathematics, measurement, metaphor, metaphysics: Part I. Implications for method in postmodern science. Theory & Psychology, 13(6), 753-90.

Fisher, W. P., Jr. (2003b, December). Mathematics, measurement, metaphor, metaphysics: Part II. Accounting for Galileo’s “fateful omission.” Theory & Psychology, 13(6), 791-828.

Fisher, W. P., Jr. (2004, October). Meaning and method in the social sciences. Human Studies: A Journal for Philosophy and the Social Sciences, 27(4), 429-54.

Fisher, W. P., Jr. (2010a). Reducible or irreducible? Mathematical reasoning and the ontological method. Journal of Applied Measurement, 11(1), 38-59.

Fisher, W. P., Jr. (2010b). The standard model in the history of the natural sciences, econometrics, and the social sciences. Journal of Physics: Conference Series, 238(1), http://iopscience.iop.org/1742-6596/238/1/012016/pdf/1742-6596_238_1_012016.pdf.

Hankins, T. L., & Silverman, R. J. (1999). Instruments and the imagination. Princeton, New Jersey: Princeton University Press.

Jaeger, R. M. (1973). The national test equating study in reading (The Anchor Test Study). Measurement in Education, 4, 1-8.

Keynes, J. M. (1946, July). Newton, the man. (Speech given at the Celebration of the Tercentenary of Newton’s birth in 1642.) MacMillan St. Martin’s Press (London, England), The Collected Writings of John Maynard Keynes Volume X, 363-364.

Kuhn, T. S. (1970). The structure of scientific revolutions. Chicago, Illinois: University of Chicago Press.

Latour, B. (1987). Science in action: How to follow scientists and engineers through society. New York: Cambridge University Press.

Latour, B. (2005). Reassembling the social: An introduction to Actor-Network-Theory. (Clarendon Lectures in Management Studies). Oxford, England: Oxford University Press.

Maxwell, J. C. (1879). Treatise on electricity and magnetism, Volumes I and II. London, England: Macmillan.

Rasch, G. (1960). Probabilistic models for some intelligence and attainment tests (Reprint, with Foreword and Afterword by B. D. Wright, Chicago: University of Chicago Press, 1980). Copenhagen, Denmark: Danmarks Paedogogiske Institut.

Rentz, R. R., & Bashaw, W. L. (1977, Summer). The National Reference Scale for Reading: An application of the Rasch model. Journal of Educational Measurement, 14(2), 161-179.

Schaffer, S. (1992). Late Victorian metrology and its instrumentation: A manufactory of Ohms. In R. Bud & S. E. Cozzens (Eds.), Invisible connections: Instruments, institutions, and science (pp. 23-56). Bellingham, WA: SPIE Optical Engineering Press.

Shapin, S. (1989, November-December). The invisible technician. American Scientist, 77, 554-563.

Stenner, A. J., Burdick, H., Sanford, E. E., & Burdick, D. S. (2006). How accurate are Lexile text measures? Journal of Applied Measurement, 7(3), 307-22.

Stenner, A. J., Smith, M., III, & Burdick, D. S. (1983, Winter). Toward a theory of construct definition. Journal of Educational Measurement, 20(4), 305-316.

Stenner, A. J., Stone, M., & Burdick, D. (2009, Autumn). The concept of a measurement mechanism. Rasch Measurement Transactions, 23(2), 1204-1206.

White, M. (1997). Isaac Newton: The last sorcerer. New York: Basic Books.

Wise, M. N. (1988). Mediating machines. Science in Context, 2(1), 77-113.

Wise, M. N. (Ed.). (1995). The values of precision. Princeton, New Jersey: Princeton University Press.

Creative Commons License
LivingCapitalMetrics Blog by William P. Fisher, Jr., Ph.D. is licensed under a Creative Commons Attribution-Noncommercial-No Derivative Works 3.0 United States License.
Based on a work at livingcapitalmetrics.wordpress.com.
Permissions beyond the scope of this license may be available at http://www.livingcapitalmetrics.com.

Open Letter to the Impact Investment Community

May 4, 2010

It is very encouraging to discover your web sites (GIIN, IRIS, and GIIRS) and to see the work you’re doing in advancing the concept of impact investing. The defining issue of our time is figuring out how to harness the profit motive for socially responsible and environmentally sustainable prosperity. The economic, social, and environmental disasters of today might all have been prevented or significantly mitigated had social and environmental impacts been taken into account in all investing.

My contribution is to point out that, though the profit motive must be harnessed as the engine driving responsible and sustainable business practices, the force of that power is dissipated and negated by the lack of efficient human, social, and natural capital markets. If we cannot make these markets function more like financial markets, so that money naturally flows to those places where it produces the greatest returns, we will never succeed in the fundamental reorientation of the economy toward responsible sustainability. The goal has to be one of tying financial profits to growth in realized human potential, community, and environmental quality, but to do that we need measures of these intangible forms of capital that are as scientifically rigorous as they are eminently practical and convenient.

Better measurement is key to reducing the market frictions that inflate the cost of human, social, and natural capital transactions. A truly revolutionary paradigm shift has occurred in measurement theory and practice over the last fifty years and more. New methods make it possible

* to reduce data volume dramatically with no loss of information,
* to custom tailor measures by selectively adapting indicators to the entity rated, without compromising comparability,
* to remove rater leniency or severity effects from the measures,
* to design optimally efficient measurement systems that provide the level of precision needed to support decision making,
* to establish reference standard metrics that remain universally uniform across variations in local impact assessment indicator configurations, and
* to calibrate instruments that measure in metrics intuitively meaningful to stakeholders and end users.

Unfortunately, almost all the admirable energy and resources being poured into business intelligence measures skip over these “new” developments, defaulting to mistaken assumptions about numbers and the nature of measurement. Typical ratings, checklists, and scores provide units of measurement that

* change size depending on which question is asked, which rating category is assigned, and who or what is rated,
* increase data volume with every new question asked,
* push measures up and down in uncontrolled ways depending on who is judging the performance,
* are of unknown precision, and
* cannot be compared across different composite aggregations of ratings.

I have over 25 years experience in the use of advanced measurement and instrument calibration methods, backed up with MA and PhD degrees from the University of Chicago. The methods in which I am trained have been standard practice in educational testing for decades, and in the last 20 years have become the methods of choice in health care outcomes assessment.

I am passionately committed to putting these methods to work in the domain of impact investing, business intelligence, and ecological economics. As is shown in my attached CV, I have dozens of peer-reviewed publications presenting technical and philosophical research in measurement theory and practice.

In the last few years, I have taken my work in the direction of documenting the ways in which measurement can and should reduce information overload and transaction costs; enhance human, social, and natural capital market efficiencies; provide the instruments embodying common currencies for the exchange of value; and inform a new kind of Genuine Progress Indicator or Happiness Index.

For more information, please see the attached 2009 article I published in Measurement on these topics, and the attached White Paper I produced last July in response to call from NIST for critical national need ideas. Various entries in my blog (https://livingcapitalmetrics.wordpress.com) elaborate on measurement technicalities, history, and philosophy, as do my web site at http://www.livingcapitalmetrics.com and my profile at http://www.linkedin.com/in/livingcapitalmetrics.

For instance, the blog post at https://livingcapitalmetrics.wordpress.com/2009/11/22/al-gore-will-is-not-the-problem/ explores the idea with which I introduced myself to you here, that the profit motive embodies our collective will for responsible and sustainable business practices, but we hobble ourselves with self-defeating inattention to the ways in which capital is brought to life in efficient markets. We have the solutions to our problems at hand, though there are no panaceas, and the challenges are huge.

Please feel free to contact me at your convenience. Whether we are ultimately able to work together or not, I enthusiastically wish you all possible success in your endeavors.

Sincerely,

William P. Fisher, Jr., Ph.D.
LivingCapitalMetrics.com
919-599-7245

We are what we measure.
It’s time we measured what we want to be.

Creative Commons License
LivingCapitalMetrics Blog by William P. Fisher, Jr., Ph.D. is licensed under a Creative Commons Attribution-Noncommercial-No Derivative Works 3.0 United States License.
Based on a work at livingcapitalmetrics.wordpress.com.
Permissions beyond the scope of this license may be available at http://www.livingcapitalmetrics.com.

How bad will the financial crises have to get before…?

April 30, 2010

More and more states and nations around the world face the possibility of defaulting on their financial obligations. The financial crises are of epic historical proportions. This is a disaster of the first order. And yet, it is so odd–we have the solutions and preventative measures we need at our finger tips, but no one knows about them or is looking for them.

So,  I am persuaded to once again wonder if there might now be some real interest in the possibilities of capitalizing on

  • measurement’s well-known capacity for reducing transaction costs by improving information quality and reducing information volume;
  • instruments calibrated to measure in constant units (not ordinal ones) within known error ranges (not as though the measures are perfectly precise) with known data quality;
  • measures made meaningful by their association with invariant scales defined in terms of the questions asked;
  • adaptive instrument administration methods that make all measures equally precise by targeting the questions asked;
  • judge calibration methods that remove the person rating performances as a factor influencing the measures;
  • the metaphor of transparency by calibrating instruments that we really look right through at the thing measured (risk, governance, abilities, health, performance, etc.);
  • efficient markets for human, social, and natural capital by means of the common currencies of uniform metrics, calibrated instrumentation, and metrological networks;
  • the means available for tuning the instruments of the human, social, and environmental sciences to well-tempered scales that enable us to more easily harmonize, orchestrate, arrange, and choreograph relationships;
  • our understandings that universal human rights require universal uniform measures, that fair dealing requires fair measures, and that our measures define who we are and what we value; and, last but very far from least,
  • the power of love–the back and forth of probing questions and honest answers in caring social intercourse plants seminal ideas in fertile minds that can be nurtured to maturity and Socratically midwifed as living meaning born into supportive ecologies of caring relations.

How bad do things have to get before we systematically and collectively implement the long-established and proven methods we have at our disposal? It is the most surreal kind of schizophrenia or passive-aggressive avoidance pathology to keep on tormenting ourselves with problems for which we have solutions.

For more information on these issues, see prior blogs posted here, the extensive documentation provided, and http://www.livingcapitalmetrics.com.

Creative Commons License
LivingCapitalMetrics Blog by William P. Fisher, Jr., Ph.D. is licensed under a Creative Commons Attribution-Noncommercial-No Derivative Works 3.0 United States License.
Based on a work at livingcapitalmetrics.wordpress.com.
Permissions beyond the scope of this license may be available at http://www.livingcapitalmetrics.com.

Reasoning by analogy in social science education: On the need for a new curriculum

April 12, 2010

I’d like to revisit the distinction between measurement models and statistical models. Rasch was well known for joking about burning all books containing the words “normal distribution” (Andersen, 1995, p. 385). Rasch’s book and 1961 article both start on their first pages with a distinction between statistical models describing intervariable relations at the group level and measurement models prescribing intravariable relations at the individual level. I think confusion between these kinds of models has caused huge problems.

We typically assume all statistical analyses are quantitative. We refer to any research that uses numbers as quantitative even when nothing is done to map a substantive and invariant unit on a number line. We distinguish between qualitative and quantitative data and methods as though quantification has ever been achieved in the history of science without substantive qualitative understandings of the constructs.

Quantification in fact predates the emergence of statistics by millennia. It seems to me that there is a great deal to be gained from maintaining a careful distinction between statistics and measurement. Measurement is not primarily performed by someone sitting at a computer analyzing data. Measurement is done by individuals using calibrated instruments to obtain immediately useful quantitative information expressed in a universally uniform unit.

Rasch was correct in his assertion that we can measure the reading ability of a child with the same kind of objectivity with which we measure his or her weight or height. But we don’t commonly express individual height and weight measures in statistical terms. 

Information overload is one of the big topics of the day. Which will contribute more to reducing that overload in efficient and meaningful ways: calibrated instruments measuring in common units giving individual users immediate feedback that summarizes responses to dozens of questions, or ordinal group-level item-by-item statistics reported six months too late to do anything about them?

Instrument calibration certainly makes use of statistics, and statistical models usually assume measurement has taken place, but much stands to be gained from a clear distinction between inter- and intra-variable models. And so I respectfully disagree with those who assert that “the Rasch model is first of all a statistical model.” Maxwell’s method of making analogies from well known physical laws (Nersessian, 2002; Turner, 1955) was adopted by Rasch (1960, pp. 110-115) so that his model would have the same structure as the laws of physics.

Statistical models are a different class of models from the laws of physics (Meehl, 1967), since they allow cross-variable interactions in ways that compromise and defeat the possibility of testing the hypotheses of constant unit size, parameter separation, sufficiency, etc.

I’d like to suggest a paraphrase of the first sentence of the abstract from a recent paper (Silva, 2007) on using analogies in science education: Despite its great importance, many students and even their teachers still cannot recognize the relevance of measurement models to build up psychosocial knowledge and are unable to develop qualitative explanations for mathematical expressions of the lawful structural invariances that exist within the social sciences.

And so, here’s a challenge: we need to make an analogy from Silva’s (2007) work in physics science education and develop a curriculum for social science education that follows a parallel track. We could trace the development of reading measurement from Rasch (1960) through the Anchor Test Study (Jaeger, 1973; Rentz & Bashaw, 1977) to the introduction of the Lexile Framework for Reading (Stenner, 2001) and its explicit continuity with Rasch’s use of Maxwell’s method of analogy (Burdick, Stone, & Stenner, 2006) and full blown predictive theory (Stenner & Stone, 2003).

With the example of the Rasch Reading Law in hand, we could then train students and teachers to think about structural invariance in the context of psychosocial constructs. It may be that, without the development and dissemination of at least a college-level curriculum of this kind, we will never overcome the confusion between statistical and measurement models.

References

Andersen, E. B. (1995). What George Rasch would have thought about this book. In G. H. Fischer & I. W. Molenaar (Eds.), Rasch models: Foundations, recent developments, and applications (pp. 383-390). New York: Springer-Verlag.

Burdick, D. S., Stone, M. H., & Stenner, A. J. (2006). The Combined Gas Law and a Rasch Reading Law. Rasch Measurement Transactions, 20(2), 1059-60 [http://www.rasch.org/rmt/rmt202.pdf].

Jaeger, R. M. (1973). The national test equating study in reading (The Anchor Test Study). Measurement in Education, 4, 1-8.

Meehl, P. E. (1967). Theory-testing in psychology and physics: A methodological paradox. Philosophy of Science, 34(2), 103-115.

Nersessian, N. J. (2002). Maxwell and “the Method of Physical Analogy”: Model-based reasoning, generic abstraction, and conceptual change. In D. Malament (Ed.), Essays in the history and philosophy of science and mathematics (pp. 129-166). Lasalle, Illinois: Open Court.

Rasch, G. (1960). Probabilistic models for some intelligence and attainment tests (Reprint, with Foreword and Afterword by B. D. Wright, Chicago: University of Chicago Press, 1980). Copenhagen, Denmark: Danmarks Paedogogiske Institut.

Rasch, G. (1961). On general laws and the meaning of measurement in psychology. In Proceedings of the fourth Berkeley symposium on mathematical statistics and probability (pp. 321-333 [http://www.rasch.org/memo1960.pdf]). Berkeley, California: University of California Press.

Rentz, R. R., & Bashaw, W. L. (1977, Summer). The National Reference Scale for Reading: An application of the Rasch model. Journal of Educational Measurement, 14(2), 161-179.

Silva, C. C. (2007, August). The role of models and analogies in the electromagnetic theory: A historical case study. Science & Education, 16(7-8), 835-848.

Stenner, A. J. (2001). The Lexile Framework: A common metric for matching readers and texts. California School Library Journal, 25(1), 41-2.

Stenner, A. J., & Stone, M. (2003). Item specification vs. item banking. Rasch Measurement Transactions, 17(3), 929-30 [http://www.rasch.org/rmt/rmt173a.htm].

Turner, J. (1955, November). Maxwell on the method of physical analogy. British Journal for the Philosophy of Science, 6, 226-238.

Creative Commons License
LivingCapitalMetrics Blog by William P. Fisher, Jr., Ph.D. is licensed under a Creative Commons Attribution-Noncommercial-No Derivative Works 3.0 United States License.
Based on a work at livingcapitalmetrics.wordpress.com.
Permissions beyond the scope of this license may be available at http://www.livingcapitalmetrics.com.

How Evidence-Based Decision Making Suffers in the Absence of Theory and Instrument: The Power of a More Balanced Approach

January 28, 2010

The Basis of Evidence in Theory and Instrument

The ostensible point of basing decisions in evidence is to have reasons for proceeding in one direction versus any other. We want to be able to say why we are proceeding as we are. When we give evidence-based reasons for our decisions, we typically couch them in terms of what worked in past experience. That experience might have been accrued over time in practical applications, or it might have been deliberately arranged in one or more experimental comparisons and tests of concisely stated hypotheses.

At its best, generalizing from past experience to as yet unmet future experiences enables us to navigate life and succeed in ways that would not be possible if we could not learn and had no memories. The application of a lesson learned from particular past events to particular future events involves a very specific inferential process. To be able to recognize repeated iterations of the same things requires the accumulation of patterns of evidence. Experience in observing such patterns allows us to develop confidence in our understanding of what that pattern represents in terms of pleasant or painful consequences. When we are able to conceptualize and articulate an idea of a pattern, and when we are then able to recognize a new occurrence of that pattern, we have an idea of it.

Evidence-based decision making is then a matter of formulating expectations from repeatedly demonstrated and routinely reproducible patterns of observations that lend themselves to conceptual representations, as ideas expressed in words. Linguistic and cultural frameworks selectively focus attention by projecting expectations and filtering observations into meaningful patterns represented by words, numbers, and other symbols. The point of efforts aimed at basing decisions in evidence is to try to go with the flow of this inferential process more deliberately and effectively than might otherwise be the case.

None of this is new or controversial. However, the inferential step from evidence to decision always involves unexamined and unjustified assumptions. That is, there is always an element of metaphysical faith behind the expectation that any given symbol or word is going to work as a representation of something in the same way that it has in the past. We can never completely eliminate this leap of faith, since we cannot predict the future with 100% confidence. We can, however, do a lot to reduce the size of the leap, and the risks that go with it, by questioning our assumptions in experimental research that tests hypotheses as to the invariant stability and predictive utility of the representations we make.

Theoretical and Instrumental Assumptions Hidden Behind the Evidence

For instance, evidence as to the effectiveness of an intervention or treatment is often expressed in terms of measures commonly described as quantitative. But it is unusual for any evidence to be produced justifying that description in terms of something that really adds up in the way numbers do. So we often find ourselves in situations in which our evidence is much less meaningful, reliable, and valid than we suppose it to be.

Quantitative measures are often valued as the hallmark of rational science. But their capacity to live up to this billing depends on the quality of the inferences that can be supported. Very few researchers thoroughly investigate the quality of their measures and justify the inferences they make relative to that quality.

Measurement presumes a reproducible pattern of evidence that can serve as the basis for a decision concerning how much of something has been observed. It naturally follows that we often base measurement in counts of some kind—successes, failures, ratings, frequencies, etc. The counts, scores, or sums are then often transformed into percentages by dividing them into the maximum possible that could be obtained. Sometimes the scores are averaged for each person measured, and/or for each item or question on the test, assessment, or survey. These scores and percentages are then almost universally fed directly into decision processes or statistical analyses with no further consideration.

The reproducible pattern of evidence on which decisions are based is presumed to exist between the measures, not within them. In other words, the focus is on the group or population statistics, not on the individual measures. Attention is typically focused on the tip of the iceberg, the score or percentage, not on the much larger, but hidden, mass of information beneath it. Evidence is presumed to be sufficient to the task when the differences between groups of scores are of a consistent size or magnitude, but is this sufficient?

Going Past Assumptions to Testable Hypotheses

In other words, does not science require that evidence be explained by theory, and embodied in instrumentation that provides a shared medium of observation? As shown in the blue lines in the Figure below,

  • theory, whether or not it is explicitly articulated, inevitably influences both what counts as valid data and the configuration of the medium of its representation, the instrument;
  • data, whether or not it is systematically gathered and evaluated, inevitably influences both the medium of its representation, the instrument, and the implicit or explicit theory that explains its properties and justifies its applications; and
  • instruments, whether or not they are actually calibrated from a mapping of symbols and substantive amounts, inevitably influence data gathering and the image of the object explained by theory.

The rhetoric of evidence-based decision making skips over the roles of theory and instrumentation, drawing a direct line from data to decision. In leaving theory laxly formulated, we allow any story that makes a bit of sense and is communicated by someone with a bit of charm or power to carry the day. In not requiring calibrated instrumentation, we allow any data that cross the threshold into our awareness to serve as an acceptable basis for decisions.

What we want, however, is to require meaningful measures that really provide the evidence needed for instruments that exhibit invariant calibrations and for theories that provide predictive explanatory control over the variable. As shown in the Figure, we want data that push theory away from the instrument, theory that separates the data and instrument, and instruments that get in between the theory and data.

We all know to distrust too close a correspondence between theory and data, but we too rarely understand or capitalize on the role of the instrument in mediating the theory-data relation. Similarly, when the questions used as a medium for making observations are obviously biased to produce responses conforming overly closely with a predetermined result, we see that the theory and the instrument are too close for the data to serve as an effective mediator.

Finally, the situation predominating in the social sciences is one in which both construct and measurement theories are nearly nonexistent, which leaves data completely dependent on the instrument it came from. In other words, because counts of correct answers or sums of ratings are mistakenly treated as measures, instruments fully determine and restrict the range of measurement to that defined by the numbers of items and rating categories. Once the instrument is put in play, changes to it would make new data incommensurable with old, so, to retain at least the appearance of comparability, the data structure then fully determines and restricts the instrument.

What we want, though, is a situation in which construct and measurement theories work together to make the data autonomous of the particular instrument it came from. We want a theory that explains what is measured well enough for us to be able to modify existing instruments, or create entirely new ones, that give the same measures for the same amounts as the old instruments. We want to be able to predict item calibrations from the properties of the items, we want to obtain the same item calibrations across data sets, and we want to be able to predict measures on the basis of the observed responses (data) no matter which items or instrument was used to produce them.

Most importantly, we want a theory and practice of measurement that allows us to take missing data into account by providing us with the structural invariances we need as media for predicting the future from the past. As Ben Wright (1997, p. 34) said, any data analysis method that requires complete data to produce results disqualifies itself automatically as a viable basis for inference because we never have complete data—any practical system of measurement has to be positioned so as to be ready to receive, process, and incorporate all of the data we have yet to gather. This goal is accomplished to varying degrees in Rasch measurement (Rasch, 1960; Burdick, Stone, & Stenner, 2006; Dawson, 2004). Stenner and colleagues (Stenner, Burdick, Sanford, & Burdick, 2006) provide a trajectory of increasing degrees to which predictive theory is employed in contemporary measurement practice.

The explanatory and predictive power of theory is embodied in instruments that focus attention on recording observations of salient phenomena. These observations become data that inform the calibration of instruments, which then are used to gather further data that can be used in practical applications and in checks on the calibrations and the theory.

“Nothing is so practical as a good theory” (Lewin, 1951, p. 169). Good theory makes it possible to create symbolic representations of things that are easy to think with. To facilitate clear thinking, our words, numbers, and instruments must be transparent. We have to be able to look right through them at the thing itself, with no concern as to distortions introduced by the instrument, the sample, the observer, the time, the place, etc. This happens only when the structure of the instrument corresponds with invariant features of the world. And where words effect this transparency to an extent, it is realized most completely when we can measure in ways that repeatedly give the same results for the same amounts in the same conditions no matter which instrument, sample, operator, etc. is involved.

Where Might Full Mathematization Lead?

The attainment of mathematical transparency in measurement is remarkable for the way it focuses attention and constrains the imagination. It is essential to appreciate the context in which this focusing occurs, as popular opinion is at odds with historical research in this regard. Over the last 60 years, historians of science have come to vigorously challenge the widespread assumption that technology is a product of experimentation and/or theory (Kuhn, 1961/1977; Latour, 1987, 2005; Maas, 2001; Mendelsohn, 1992; Rabkin, 1992; Schaffer, 1992; Heilbron, 1993; Hankins & Silverman, 1999; Baird, 2002). Neither theory nor experiment typically advances until a key technology is widely available to end users in applied and/or research contexts. Rabkin (1992) documents multiple roles played by instruments in the professionalization of scientific fields. Thus, “it is not just a clever historical aphorism, but a general truth, that ‘thermodynamics owes much more to the steam engine than ever the steam engine owed to thermodynamics’” (Price, 1986, p. 240).

The prior existence of the relevant technology comes to bear on theory and experiment again in the common, but mistaken, assumption that measures are made and experimentally compared in order to discover scientific laws. History shows that measures are rarely made until the relevant law is effectively embodied in an instrument (Kuhn, 1961/1977, pp. 218-9): “…historically the arrow of causality is largely from the technology to the science” (Price, 1986, p. 240). Instruments do not provide just measures; rather they produce the phenomenon itself in a way that can be controlled, varied, played with, and learned from (Heilbron, 1993, p. 3; Hankins & Silverman, 1999; Rabkin, 1992). The term “technoscience” has emerged as an expression denoting recognition of this priority of the instrument (Baird, 1997; Ihde & Selinger, 2003; Latour, 1987).

Because technology often dictates what, if any, phenomena can be consistently produced, it constrains experimentation and theorizing by focusing attention selectively on reproducible, potentially interpretable effects, even when those effects are not well understood (Ackermann, 1985; Daston & Galison, 1992; Ihde, 1998; Hankins & Silverman, 1999; Maasen & Weingart, 2001). Criteria for theory choice in this context stem from competing explanatory frameworks’ experimental capacities to facilitate instrument improvements, prediction of experimental results, and gains in the efficiency with which a phenomenon is produced.

In this context, the relatively recent introduction of measurement models requiring additive, invariant parameterizations (Rasch, 1960) provokes speculation as to the effect on the human sciences that might be wrought by the widespread availability of consistently reproducible effects expressed in common quantitative languages. Paraphrasing Price’s comment on steam engines and thermodynamics, might it one day be said that as yet unforeseeable advances in reading theory will owe far more to the Lexile analyzer (Stenner, et al., 2006) than ever the Lexile analyzer owed reading theory?

Kuhn (1961/1977) speculated that the second scientific revolution of the early- to mid-nineteenth century followed in large part from the full mathematization of physics, i.e., the emergence of metrology as a professional discipline focused on providing universally accessible, theoretically predictable, and evidence-supported uniform units of measurement (Roche, 1998). Kuhn (1961/1977, p. 220) specifically suggests that a number of vitally important developments converged about 1840 (also see Hacking, 1983, p. 234). This was the year in which the metric system was formally instituted in France after 50 years of development (it had already been obligatory in other nations for 20 years at that point), and metrology emerged as a professional discipline (Alder, 2002, p. 328, 330; Heilbron, 1993, p. 274; Kula, 1986, p. 263). Daston (1992) independently suggests that the concept of objectivity came of age in the period from 1821 to 1856, and gives examples illustrating the way in which the emergence of strong theory, shared metric standards, and experimental data converged in a context of particular social mores to winnow out unsubstantiated and unsupportable ideas and contentions.

Might a similar revolution and new advances in the human sciences follow from the introduction of evidence-based, theoretically predictive, instrumentally mediated, and mathematical uniform measures? We won’t know until we try.

Figure. The Dialectical Interactions and Mutual Mediations of Theory, Data, and Instruments

Figure. The Dialectical Interactions and Mutual Mediations of Theory, Data, and Instruments

Acknowledgment. These ideas have been drawn in part from long consideration of many works in the history and philosophy of science, primarily Ackermann (1985), Ihde (1991), and various works of Martin Heidegger, as well as key works in measurement theory and practice. A few obvious points of departure are listed in the references.

References

Ackermann, J. R. (1985). Data, instruments, and theory: A dialectical approach to understanding science. Princeton, New Jersey: Princeton University Press.

Alder, K. (2002). The measure of all things: The seven-year odyssey and hidden error that transformed the world. New York: The Free Press.

Aldrich, J. (1989). Autonomy. Oxford Economic Papers, 41, 15-34.

Andrich, D. (2004, January). Controversy and the Rasch model: A characteristic of incompatible paradigms? Medical Care, 42(1), I-7–I-16.

Baird, D. (1997, Spring-Summer). Scientific instrument making, epistemology, and the conflict between gift and commodity economics. Techné: Journal of the Society for Philosophy and Technology, 3-4, 25-46. Retrieved 08/28/2009, from http://scholar.lib.vt.edu/ejournals/SPT/v2n3n4/baird.html.

Baird, D. (2002, Winter). Thing knowledge – function and truth. Techné: Journal of the Society for Philosophy and Technology, 6(2). Retrieved 19/08/2003, from http://scholar.lib.vt.edu/ejournals/SPT/v6n2/baird.html.

Burdick, D. S., Stone, M. H., & Stenner, A. J. (2006). The Combined Gas Law and a Rasch Reading Law. Rasch Measurement Transactions, 20(2), 1059-60 [http://www.rasch.org/rmt/rmt202.pdf].

Carroll-Burke, P. (2001). Tools, instruments and engines: Getting a handle on the specificity of engine science. Social Studies of Science, 31(4), 593-625.

Daston, L. (1992). Baconian facts, academic civility, and the prehistory of objectivity. Annals of Scholarship, 8, 337-363. (Rpt. in L. Daston, (Ed.). (1994). Rethinking objectivity (pp. 37-64). Durham, North Carolina: Duke University Press.)

Daston, L., & Galison, P. (1992, Fall). The image of objectivity. Representations, 40, 81-128.

Dawson, T. L. (2004, April). Assessing intellectual development: Three approaches, one sequence. Journal of Adult Development, 11(2), 71-85.

Galison, P. (1999). Trading zone: Coordinating action and belief. In M. Biagioli (Ed.), The science studies reader (pp. 137-160). New York, New York: Routledge.

Hacking, I. (1983). Representing and intervening: Introductory topics in the philosophy of natural science. Cambridge: Cambridge University Press.

Hankins, T. L., & Silverman, R. J. (1999). Instruments and the imagination. Princeton, New Jersey: Princeton University Press.

Heelan, P. A. (1983, June). Natural science as a hermeneutic of instrumentation. Philosophy of Science, 50, 181-204.

Heelan, P. A. (1998, June). The scope of hermeneutics in natural science. Studies in History and Philosophy of Science Part A, 29(2), 273-98.

Heidegger, M. (1977). Modern science, metaphysics, and mathematics. In D. F. Krell (Ed.), Basic writings [reprinted from M. Heidegger, What is a thing? South Bend, Regnery, 1967, pp. 66-108] (pp. 243-282). New York: Harper & Row.

Heidegger, M. (1977). The question concerning technology. In D. F. Krell (Ed.), Basic writings (pp. 283-317). New York: Harper & Row.

Heilbron, J. L. (1993). Weighing imponderables and other quantitative science around 1800. Historical studies in the physical and biological sciences), 24(Supplement), Part I, pp. 1-337.

Hessenbruch, A. (2000). Calibration and work in the X-ray economy, 1896-1928. Social Studies of Science, 30(3), 397-420.

Ihde, D. (1983). The historical and ontological priority of technology over science. In D. Ihde, Existential technics (pp. 25-46). Albany, New York: State University of New York Press.

Ihde, D. (1991). Instrumental realism: The interface between philosophy of science and philosophy of technology. (The Indiana Series in the Philosophy of Technology). Bloomington, Indiana: Indiana University Press.

Ihde, D. (1998). Expanding hermeneutics: Visualism in science. Northwestern University Studies in Phenomenology and Existential Philosophy). Evanston, Illinois: Northwestern University Press.

Ihde, D., & Selinger, E. (Eds.). (2003). Chasing technoscience: Matrix for materiality. (Indiana Series in Philosophy of Technology). Bloomington, Indiana: Indiana University Press.

Kuhn, T. S. (1961/1977). The function of measurement in modern physical science. Isis, 52(168), 161-193. (Rpt. In T. S. Kuhn, The essential tension: Selected studies in scientific tradition and change (pp. 178-224). Chicago: University of Chicago Press, 1977).

Kula, W. (1986). Measures and men (R. Screter, Trans.). Princeton, New Jersey: Princeton University Press (Original work published 1970).

Lapre, M. A., & Van Wassenhove, L. N. (2002, October). Learning across lines: The secret to more efficient factories. Harvard Business Review, 80(10), 107-11.

Latour, B. (1987). Science in action: How to follow scientists and engineers through society. New York, New York: Cambridge University Press.

Latour, B. (2005). Reassembling the social: An introduction to Actor-Network-Theory. (Clarendon Lectures in Management Studies). Oxford, England: Oxford University Press.

Lewin, K. (1951). Field theory in social science: Selected theoretical papers (D. Cartwright, Ed.). New York: Harper & Row.

Maas, H. (2001). An instrument can make a science: Jevons’s balancing acts in economics. In M. S. Morgan & J. Klein (Eds.), The age of economic measurement (pp. 277-302). Durham, North Carolina: Duke University Press.

Maasen, S., & Weingart, P. (2001). Metaphors and the dynamics of knowledge. (Vol. 26. Routledge Studies in Social and Political Thought). London: Routledge.

Mendelsohn, E. (1992). The social locus of scientific instruments. In R. Bud & S. E. Cozzens (Eds.), Invisible connections: Instruments, institutions, and science (pp. 5-22). Bellingham, WA: SPIE Optical Engineering Press.

Polanyi, M. (1964/1946). Science, faith and society. Chicago: University of Chicago Press.

Price, D. J. d. S. (1986). Of sealing wax and string. In Little Science, Big Science–and Beyond (pp. 237-253). New York, New York: Columbia University Press.

Rabkin, Y. M. (1992). Rediscovering the instrument: Research, industry, and education. In R. Bud & S. E. Cozzens (Eds.), Invisible connections: Instruments, institutions, and science (pp. 57-82). Bellingham, Washington: SPIE Optical Engineering Press.

Rasch, G. (1960). Probabilistic models for some intelligence and attainment tests (Reprint, with Foreword and Afterword by B. D. Wright, Chicago: University of Chicago Press, 1980). Copenhagen, Denmark: Danmarks Paedogogiske Institut.

Roche, J. (1998). The mathematics of measurement: A critical history. London: The Athlone Press.

Schaffer, S. (1992). Late Victorian metrology and its instrumentation: A manufactory of Ohms. In R. Bud & S. E. Cozzens (Eds.), Invisible connections: Instruments, institutions, and science (pp. 23-56). Bellingham, WA: SPIE Optical Engineering Press.

Stenner, A. J., Burdick, H., Sanford, E. E., & Burdick, D. S. (2006). How accurate are Lexile text measures? Journal of Applied Measurement, 7(3), 307-22.

Thurstone, L. L. (1959). The measurement of values. Chicago: University of Chicago Press, Midway Reprint Series.

Wright, B. D. (1997, Winter). A history of social science measurement. Educational Measurement: Issues and Practice, 16(4), 33-45, 52 [http://www.rasch.org/memo62.htm].

Creative Commons License
LivingCapitalMetrics Blog by William P. Fisher, Jr., Ph.D. is licensed under a Creative Commons Attribution-Noncommercial-No Derivative Works 3.0 United States License.
Based on a work at livingcapitalmetrics.wordpress.com.
Permissions beyond the scope of this license may be available at http://www.livingcapitalmetrics.com.