Posts Tagged ‘Models’

Psychology and the social sciences: An atheoretical, scattered, and disconnected body of research

February 16, 2019

A new article in Nature Human Behaviour (NHB) points toward the need for better theory and more rigorous mathematical models in psychology and the social sciences (Muthukrishna & Henrich, 2019). The authors rightly say that the lack of an overarching cumulative theoretical framework makes it very difficult to see whether new results fit well with previous work, or if something surprising has come to light. Mathematical models are especially emphasized as being of value in specifying clear and precise expectations.

The point that the social sciences and psychology need better theories and models is painfully obvious. But there are in fact thousands of published studies and practical real-world applications that not only provide, but often surpass, the kinds of predictive theories and mathematical models called for in the NHB article. Not only does the article make no mention of any of this work; its argument is also framed entirely in a statistical context instead of the more appropriate context of measurement science.

The concept of reliability provides an excellent point of entry. Most behavioral scientists think of reliability statistically, as a coefficient with a positive numeric value usually between 0.00 and 1.00. The tangible sense of reliability as indicating exactly how predictable an outcome is does not usually figure in most researchers’ thinking. But that sense of the specific predictability of results has been the focus of attention in social and psychological measurement science for decades.

For instance, the measurement of time is reliable in the sense that the position of the sun relative to the earth can be precisely predicted from geographic location, the time of day, and the day of the year. The numbers and words assigned to noon are closely associated with the sun being at its high point in the sky (though there are political variations by season and location across time zones).
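
To make that sense of reliability concrete, here is a minimal sketch (my own illustration, not drawn from any of the sources cited here) that predicts the sun's elevation from exactly those three inputs, using the standard textbook approximation:

```python
import math

def solar_elevation(latitude_deg, day_of_year, solar_hour):
    """Approximate solar elevation angle (degrees) at a given place and time.

    Standard textbook approximation: declination from the day of the year,
    hour angle from the clock time, elevation from both plus latitude.
    Accurate to about a degree, which is all the point requires.
    """
    # Axial tilt projected onto the day of the year (degrees).
    declination = -23.44 * math.cos(math.radians(360.0 / 365.0 * (day_of_year + 10)))
    # The earth rotates 15 degrees per hour away from local solar noon.
    hour_angle = 15.0 * (solar_hour - 12.0)
    lat, dec, ha = map(math.radians, (latitude_deg, declination, hour_angle))
    return math.degrees(math.asin(
        math.sin(lat) * math.sin(dec) + math.cos(lat) * math.cos(dec) * math.cos(ha)))

# At latitude 41.9 N on the June solstice (day 172), at local solar noon:
print(round(solar_elevation(41.9, 172, 12.0), 1))  # roughly 71.5 degrees
```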

That kind of a reproducible association is rarely sought in psychology and the social sciences, but it is far from nonexistent. One can discern different degrees to which that kind of association is included in models of measured constructs. Though most behavioral research doesn’t mention the connection between linear amounts of a measured phenomenon and a reproducible numeric representation of it (level 0), quite a significant body of work focuses on that connection (level 1). The disappointing thing about that level 1 work is that the relentless obsession with statistical methods prevents most researchers from connecting a reproducible quantity with a single expression of it in a standard unit, and with an associated uncertainty term (level 2). That is, level 1 researchers conceive of measurement in statistical terms, as a product of data analysis. Even when results across data sets are highly correlated and could be equated to a common metric, level 1 researchers do not leverage that source of potential value for simplified communication and accumulated comparability.

And then, for their part, level 2 researchers usually do not articulate theories about the measured constructs, by augmenting the mathematical data model with an explanatory model predicting variation (level 3). Level 2 researchers are empirically grounded in data, and can expand their network of measures only by gathering more data and analyzing it in ways that bring it into their standard unit’s frame of reference.

Level 3 researchers, however, have come to see what makes their measures tick. They understand the mechanisms that make their questions vary. They can write new questions to their theoretical specifications, test those questions by asking them of a relevant sample, and produce the predicted calibrations. For instance, reading comprehension is well established to be a function of the difference between a person’s reading ability and the complexity of the text they encounter (see articles by Stenner in the list below). We have built our entire educational system around this idea, as we deliberately introduce children first to the alphabet, then to the most common words, then to short sentences, and then to ever longer and more complicated text. But stating the construct model, testing it against data, calibrating a unit to which all tests and measures can be traced, and connecting together all the books, articles, tests, curricula, and students is a process that began (in English and Spanish) only in the 1980s. The process is still far from finished, and most reading research still does not use the common metric.
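
A minimal sketch of that relationship, assuming the dichotomous Rasch model that underlies the work cited below (the specific numbers are illustrative only):

```python
import math

def comprehension_rate(reader_ability, text_complexity):
    """Dichotomous Rasch model: the probability of comprehension as a
    function of the difference between reader ability and text
    complexity, both expressed in logits."""
    return 1.0 / (1.0 + math.exp(-(reader_ability - text_complexity)))

# When ability matches complexity, expected comprehension is 50%;
# each logit of advantage raises it along the logistic curve.
for diff in (-1.0, 0.0, 1.0, 2.0):
    print(f"ability - complexity = {diff:+.1f} logits -> {comprehension_rate(diff, 0.0):.0%}")
```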

In this kind of theory-informed context, new items can be automatically generated on the fly at the point of measurement. Those items and inferences made from them are validated by the consistency of the responses and the associated expression of the expected probability of success, agreement, etc. The expense of constant data gathering and analysis can be cut to a very small fraction of what it is at levels 0-2.
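
A crude sketch of what such a validation check involves, assuming persons with known measures and an item generated to a theory-predicted calibration; the data and tolerance here are hypothetical, and operational systems rely on residual-based fit statistics rather than this simple comparison:

```python
import math

def expected_p(theta, b):
    """Rasch model probability of success."""
    return 1.0 / (1.0 + math.exp(-(theta - b)))

def check_item(predicted_difficulty, abilities, responses, tolerance=0.1):
    """Compare the observed success rate on a generated item against the
    rate implied by its theory-predicted calibration, given a sample of
    persons with known measures."""
    expected = sum(expected_p(t, predicted_difficulty) for t in abilities) / len(abilities)
    observed = sum(responses) / len(responses)
    return abs(observed - expected) <= tolerance, observed, expected

# Hypothetical: five persons measuring -1 to +1 logits answer an item
# generated to a predicted difficulty of 0.0 logits.
ok, obs, exp = check_item(0.0, [-1.0, -0.5, 0.0, 0.5, 1.0], [0, 1, 0, 1, 1])
print(ok, round(obs, 2), round(exp, 2))  # True 0.6 0.5
```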

Level 3 research methods are not widely known or used, but they are not new. They are gaining traction as their use by national metrology institutes around the world grows. As high-profile critiques of social and psychological research practices continue to emerge, perhaps more attention will be paid to this important body of work. A few key references are provided below, and virtually every post in this blog pertains to these issues.

References

Baghaei, P. (2008). The Rasch model as a construct validation tool. Rasch Measurement Transactions, 22(1), 1145-6 [http://www.rasch.org/rmt/rmt221a.htm].

Bergstrom, B. A., & Lunz, M. E. (1994). The equivalence of Rasch item calibrations and ability estimates across modes of administration. In M. Wilson (Ed.), Objective measurement: Theory into practice, Vol. 2 (pp. 122-128). Norwood, New Jersey: Ablex.

Cano, S., Pendrill, L., Barbic, S., & Fisher, W. P., Jr. (2018). Patient-centred outcome metrology for healthcare decision-making. Journal of Physics: Conference Series, 1044, 012057.

Dimitrov, D. M. (2010). Testing for factorial invariance in the context of construct validation. Measurement & Evaluation in Counseling & Development, 43(2), 121-149.

Embretson, S. E. (2010). Measuring psychological constructs: Advances in model-based approaches. Washington, DC: American Psychological Association.

Fischer, G. H. (1973). The linear logistic test model as an instrument in educational research. Acta Psychologica, 37, 359-374.

Fischer, G. H. (1983). Logistic latent trait models with linear constraints. Psychometrika, 48(1), 3-26.

Fisher, W. P., Jr. (1992). Reliability statistics. Rasch Measurement Transactions, 6(3), 238 [http://www.rasch.org/rmt/rmt63i.htm].

Fisher, W. P., Jr. (2008). The cash value of reliability. Rasch Measurement Transactions, 22(1), 1160-1163 [http://www.rasch.org/rmt/rmt221.pdf].

Fisher, W. P., Jr., & Stenner, A. J. (2016). Theory-based metrological traceability in education: A reading measurement network. Measurement, 92, 489-496.

Green, S. B., Lissitz, R. W., & Mulaik, S. A. (1977). Limitations of coefficient alpha as an index of test unidimensionality. Educational and Psychological Measurement, 37(4), 827-833.

Hattie, J. (1985). Methodology review: Assessing unidimensionality of tests and items. Applied Psychological Measurement, 9(2), 139-64.

Hobart, J. C., Cano, S. J., Zajicek, J. P., & Thompson, A. J. (2007). Rating scales as outcome measures for clinical trials in neurology: Problems, solutions, and recommendations. Lancet Neurology, 6, 1094-1105.

Irvine, S. H., Dunn, P. L., & Anderson, J. D. (1990). Towards a theory of algorithm-determined cognitive test construction. British Journal of Psychology, 81, 173-195.

Kline, T. L., Schmidt, K. M., & Bowles, R. P. (2006). Using LinLog and FACETS to model item components in the LLTM. Journal of Applied Measurement, 7(1), 74-91.

Lunz, M. E., & Linacre, J. M. (2010). Reliability of performance examinations: Revisited. In M. Garner, G. Engelhard, Jr., W. P. Fisher, Jr. & M. Wilson (Eds.), Advances in Rasch Measurement, Vol. 1 (pp. 328-341). Maple Grove, MN: JAM Press.

Mari, L., & Wilson, M. (2014). An introduction to the Rasch measurement approach for metrologists. Measurement, 51, 315-327.

Markward, N. J., & Fisher, W. P., Jr. (2004). Calibrating the genome. Journal of Applied Measurement, 5(2), 129-141.

Maul, A., Mari, L., Torres Irribarra, D., & Wilson, M. (2018). The quality of measurement results in terms of the structural features of the measurement process. Measurement, 116, 611-620.

Muthukrishna, M., & Henrich, J. (2019). A problem in theory. Nature Human Behaviour, 1-9.

Obiekwe, J. C. (1999, August 1). Application and validation of the linear logistic test model for item difficulty prediction in the context of mathematics problems. Dissertation Abstracts International: Section B: The Sciences & Engineering, 60(2-B), 0851.

Pendrill, L. (2014). Man as a measurement instrument [Special Feature]. NCSLi Measure: The Journal of Measurement Science, 9(4), 22-33.

Pendrill, L., & Fisher, W. P., Jr. (2015). Counting and quantification: Comparing psychometric and metrological perspectives on visual perceptions of number. Measurement, 71, 46-55.

Pendrill, L., & Petersson, N. (2016). Metrology of human-based and other qualitative measurements. Measurement Science and Technology, 27(9), 094003.

Sijtsma, K. (2009). Correcting fallacies in validity, reliability, and classification. International Journal of Testing, 8(3), 167-194.

Sijtsma, K. (2009). On the use, the misuse, and the very limited usefulness of Cronbach’s alpha. Psychometrika, 74(1), 107-120.

Stenner, A. J. (2001). The necessity of construct theory. Rasch Measurement Transactions, 15(1), 804-5 [http://www.rasch.org/rmt/rmt151q.htm].

Stenner, A. J., Fisher, W. P., Jr., Stone, M. H., & Burdick, D. S. (2013). Causal Rasch models. Frontiers in Psychology: Quantitative Psychology and Measurement, 4(536), 1-14.

Stenner, A. J., & Horabin, I. (1992). Three stages of construct definition. Rasch Measurement Transactions, 6(3), 229 [http://www.rasch.org/rmt/rmt63b.htm].

Stenner, A. J., Stone, M. H., & Fisher, W. P., Jr. (2018). The unreasonable effectiveness of theory based instrument calibration in the natural sciences: What can the social sciences learn? Journal of Physics: Conference Series, 1044, 012070.

Stone, M. H. (2003). Substantive scale construction. Journal of Applied Measurement, 4(3), 282-297.

Wilson, M. (2005). Constructing measures: An item response modeling approach. Mahwah, New Jersey: Lawrence Erlbaum Associates.

Wilson, M. R. (2013). Using the concept of a measurement system to characterize measurement models used in psychometrics. Measurement, 46, 3766-3774.

Wright, B. D., & Stone, M. H. (1979). Chapter 5: Constructing a variable. In Best test design: Rasch measurement (pp. 83-128). Chicago, Illinois: MESA Press.

Wright, B. D., & Stone, M. H. (1999). Measurement essentials. Wilmington, DE: Wide Range, Inc. [http://www.rasch.org/measess/me-all.pdf].

Wright, B. D., Stone, M., & Enos, M. (2000). The evolution of meaning in practice. Rasch Measurement Transactions, 14(1), 736 [http://www.rasch.org/rmt/rmt141g.htm].

Creative Commons License
LivingCapitalMetrics Blog by William P. Fisher, Jr., Ph.D. is licensed under a Creative Commons Attribution-Noncommercial-No Derivative Works 3.0 United States License.
Based on a work at livingcapitalmetrics.wordpress.com.
Permissions beyond the scope of this license may be available at http://www.livingcapitalmetrics.com.


Newton, Metaphysics, and Measurement

January 20, 2011

Though Newton claimed to deduce quantitative propositions from phenomena, the record shows that he brought a whole cartload of presuppositions to bear on his observations (White, 1997), such as his belief that Pythagoras was the discoverer of the inverse square law, his knowledge of Galileo’s freefall experiments, and his theological and astrological beliefs in occult actions at a distance. Without his immersion in this intellectual environment, he likely would not have been able to contrive the appearance of deducing quantity from phenomena.

The second edition of the Principia, in which appears the phrase “hypotheses non fingo,” was brought out in part to respond to the charge that Newton had not offered any explanation of what gravity is. De Morgan, in particular, felt that Newton seemed to know more than he could prove (Keynes, 1946). But in his response to the critics, and in asserting that he feigns no hypotheses, Newton was making an important distinction between explaining the causes or composition of gravity and describing how it works. Newton was saying he did not rely on or make or test any hypotheses as to what gravity is; his only concern was with how it behaves. In due course, gravity came to be accepted as a fundamental feature of the universe in no need of explanation.

Heidegger (1977, p. 121) contends that Newton was, as is implied in the translation “I do not feign hypotheses,” saying in effect that the ground plan he was offering as a basis for experiment and practical application was not something he just made up. Despite Newton’s rejection of metaphysical explanations, the charge of not explaining gravity for what it is was being answered with a metaphysics of how, first, to derive the foundation for a science of precise predictive control from nature, and then resituate that foundation back within nature as an experimental method incorporating a mathematical plan or model. This was, of course, quite astute of Newton, as far as he went, but he stopped far short of articulating the background assumptions informing his methods.

Newton’s desire for a logic of experimental science led him to reject anything “metaphysical or physical, or based on occult qualities, or mechanical” as a foundation for proceeding. Following in Descartes’ wake, Newton then was satisfied to solidify the subject-object duality and to move forward on the basis of objective results that seemed to make metaphysics a thing of the past. Unfortunately, as Burtt (1954/1932, pp. 225-230) observes in this context, the only thing that can possibly happen when you presume discourse to be devoid of metaphysical assumptions is that your metaphysics is more subtly insinuated and communicated to others because it is not overtly presented and defended. Thus we have the history of logical positivism as the dominant philosophy of science.

It is relevant to recall here that Newton was known for strong and accurate intuitions, and strong and unorthodox religious views (he held the Lucasian Chair at Cambridge only by royal dispensation, as his heterodox beliefs kept him from taking holy orders in the Anglican church). It must be kept in mind that Newton’s combination of personal characteristics was situated in the social context of the emerging scientific culture’s increasing tendency to prioritize results that could be objectively detached from the particular people, equipment, samples, etc. involved in their production (Shapin, 1989). Newton then had insights that, while remarkably accurate, could not be entirely derived from the evidence he offered and that, moreover, could not acceptably be explained informally, psychologically, or theologically.

What is absolutely fascinating about this constellation of factors is that it became a model for the conduct of science. Of course, Newton’s laws of motion were adopted as the hallmark of successful scientific modeling in the form of the Standard Model applied throughout physics in the nineteenth century (Heilbron, 1993). But so was the metaphysical positivist logic of a pure objectivism detached from everything personal, intuitive, metaphorical, social, economic, or religious (Burtt, 1954/1932).

Kuhn (1970) made a major contribution to dismantling this logic when he contrasted textbook presentations of the methodical production of scientific effects with the actual processes of cobbled-together fits and starts that are lived out in the work of practicing scientists. But much earlier, James Clerk Maxwell (1879, pp. 162-163) had made exactly the same observation in a contrast of the work of Ampere with that of Faraday:

“The experimental investigation by which Ampere established the laws of the mechanical action between electric currents is one of the most brilliant achievements in science. The whole, theory and experiment, seems as if it had leaped, full grown and full armed, from the brain of the ‘Newton of electricity.’ It is perfect in form, and unassailable in accuracy, and it is summed up in a formula from which all the phenomena may be deduced, and which must always remain the cardinal formula of electro-dynamics.

“The method of Ampere, however, though cast into an inductive form, does not allow us to trace the formation of the ideas which guided it. We can scarcely believe that Ampere really discovered the law of action by means of the experiments which he describes. We are led to suspect, what, indeed, he tells us himself* [Ampere’s Theorie…, p. 9], that he discovered the law by some process which he has not shewn us, and that when he had afterwards built up a perfect demonstration he removed all traces of the scaffolding by which he had raised it.

“Faraday, on the other hand, shews us his unsuccessful as well as his successful experiments, and his crude ideas as well as his developed ones, and the reader, however inferior to him in inductive power, feels sympathy even more than admiration, and is tempted to believe that, if he had the opportunity, he too would be a discoverer. Every student therefore should read Ampere’s research as a splendid example of scientific style in the statement of a discovery, but he should also study Faraday for the cultivation of a scientific spirit, by means of the action and reaction which will take place between newly discovered facts and nascent ideas in his own mind.”

Where does this leave us? In sum, Rasch emulated Ampere in two ways. He did so first in wanting to become the “Newton of reading,” or even the “Newton of psychosocial constructs,” when he sought to show that data from reading test items and readers are structured with an invariance analogous to that of data from instruments applying a force to an object with mass (Rasch, 1960, pp. 110-115). Rasch emulated Ampere again when, like Ampere, after building up a perfect demonstration of a reading law structured in the form of Newton’s second law, he did not report the means by which he had constructed test items capable of producing the data fitting the model, effectively removing all traces of the scaffolding.
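
For readers who want the analogy in symbols, the parallel is commonly drawn along the following lines in the Rasch literature (the notation is conventional rather than Rasch's own): person ability ξ and text difficulty δ stand to the odds of success as force and mass stand to acceleration.

```latex
% Newton's second law in ratio form, and Rasch's multiplicative model
% for reading, written side by side. Here \xi is the person's ability,
% \delta the text's difficulty, and P the probability of success.
\[
  a = \frac{F}{m}
  \qquad\longleftrightarrow\qquad
  \frac{P}{1 - P} = \frac{\xi}{\delta}
\]
\[
  \ln a = \ln F - \ln m
  \qquad\longleftrightarrow\qquad
  \ln\frac{P}{1 - P} = \ln\xi - \ln\delta
\]
```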

The scaffolding has been reconstructed for reading (Stenner, et al., 2006) and has also been left in plain view by others doing analogous work involving other constructs (cognitive and moral development, mathematics ability, short-term memory, etc.). Dawson (2002), for instance, compares developmental scoring systems of varying sophistication and predictive control. And the plethora of uncritically applied Rasch analyses may turn out to be a capital resource for researchers interested in focusing on possible universal laws, predictive theories, and uniform metrics.

That is, published reports of calibration, error, and fit estimates open up opportunities for “pseudo-equating” (Beltyukova, Stone, & Fox, 2004; Fisher, 1997, 1999) in their documentation of the invariance, or lack thereof, of constructs over samples and instruments. The evidence will point to a need for theoretical and metric unification directly analogous to what happened in the study and use of electricity in the nineteenth century:

“…’the existence of quantitative correlations between the various forms of energy, imposes upon men of science the duty of bringing all kinds of physical quantity to one common scale of comparison.’” [Schaffer, 1992, p. 26; quoting Everett 1881; see Smith & Wise 1989, pp. 684-4]

Qualitative and quantitative correlations in scaling results converged on a common construct in the domain of reading measurement through the 1960s and 1970s, culminating in the Anchor Test Study and the calibration of the National Reference Scale for Reading (Jaeger, 1973; Rentz & Bashaw, 1977). The lack of a predictive theory and the entirely empirical nature of the scale estimates prevented wide application of the scale, as the items in the equated tests were soon replaced with new items.
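
The simplest form such empirical equating can take is a mean-shift link through common items. The sketch below uses hypothetical calibrations and is only a stand-in for the methods actually used in the Anchor Test Study; the spread of the item-by-item differences doubles as a check on the invariance of the construct across instruments.

```python
def link_constant(calibrations_a, calibrations_b):
    """Mean-shift linking: the constant that places instrument B's item
    calibrations on instrument A's scale, estimated from items the two
    instruments share."""
    diffs = [a - b for a, b in zip(calibrations_a, calibrations_b)]
    return sum(diffs) / len(diffs)

# Hypothetical published calibrations (logits) for three shared items:
cal_a = [-0.8, 0.1, 1.2]
cal_b = [-1.1, -0.2, 0.9]
print(round(link_constant(cal_a, cal_b), 2))  # 0.3: add to B to read on A's scale
```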

But the broad scale of the invariance observed across tests and readers suggests that some mechanism must be at work (Stenner, Stone, & Burdick, 2009), or that some form of life must be at play (Fisher, 2003a, 2003b, 2004, 2010a), structuring the data. Eventually, some explanation accounting for the structure ought to become apparent, as it did for reading (Stenner, Smith, & Burdick, 1983; Stenner, et al., 2006). This emergence of self-organizing structures repeatedly asserting themselves as independently existing real things is the medium of the message we need to hear. That message is that instruments play a very large and widely unrecognized role in science. By facilitating the routine production of mutually consistent, regularly observable, and comparable results they set the stage for theorizing, the emergence of consensus on what’s what, and uniform metrics (Daston & Galison, 2007; Hankins & Silverman, 1999; Latour, 1987, 2005; Wise, 1988, 1995). The form of Rasch’s models as extensions of Maxwell’s method of analogy (Fisher, 2010b) makes them particularly productive as a means of providing self-organizing invariances with a medium for their self-inscription. But that’s a story for another day.

References

Beltyukova, S. A., Stone, G. E., & Fox, C. M. (2004). Equating student satisfaction measures. Journal of Applied Measurement, 5(1), 62-9.

Burtt, E. A. (1954/1932). The metaphysical foundations of modern physical science (Rev. ed.) [First edition published in 1924]. Garden City, New York: Doubleday Anchor.

Daston, L., & Galison, P. (2007). Objectivity. Cambridge, MA: MIT Press.

Dawson, T. L. (2002, Summer). A comparison of three developmental stage scoring systems. Journal of Applied Measurement, 3(2), 146-89.

Fisher, W. P., Jr. (1997). Physical disability construct convergence across instruments: Towards a universal metric. Journal of Outcome Measurement, 1(2), 87-113.

Fisher, W. P., Jr. (1999). Foundations for health status metrology: The stability of MOS SF-36 PF-10 calibrations across samples. Journal of the Louisiana State Medical Society, 151(11), 566-578.

Fisher, W. P., Jr. (2003a, December). Mathematics, measurement, metaphor, metaphysics: Part I. Implications for method in postmodern science. Theory & Psychology, 13(6), 753-90.

Fisher, W. P., Jr. (2003b, December). Mathematics, measurement, metaphor, metaphysics: Part II. Accounting for Galileo’s “fateful omission.” Theory & Psychology, 13(6), 791-828.

Fisher, W. P., Jr. (2004, October). Meaning and method in the social sciences. Human Studies: A Journal for Philosophy and the Social Sciences, 27(4), 429-54.

Fisher, W. P., Jr. (2010a). Reducible or irreducible? Mathematical reasoning and the ontological method. Journal of Applied Measurement, 11(1), 38-59.

Fisher, W. P., Jr. (2010b). The standard model in the history of the natural sciences, econometrics, and the social sciences. Journal of Physics: Conference Series, 238(1), http://iopscience.iop.org/1742-6596/238/1/012016/pdf/1742-6596_238_1_012016.pdf.

Hankins, T. L., & Silverman, R. J. (1999). Instruments and the imagination. Princeton, New Jersey: Princeton University Press.

Jaeger, R. M. (1973). The national test equating study in reading (The Anchor Test Study). Measurement in Education, 4, 1-8.

Keynes, J. M. (1946, July). Newton, the man (speech given at the celebration of the tercentenary of Newton’s birth in 1642). In The Collected Writings of John Maynard Keynes, Vol. X (pp. 363-364). London, England: Macmillan/St. Martin’s Press.

Kuhn, T. S. (1970). The structure of scientific revolutions. Chicago, Illinois: University of Chicago Press.

Latour, B. (1987). Science in action: How to follow scientists and engineers through society. New York: Cambridge University Press.

Latour, B. (2005). Reassembling the social: An introduction to Actor-Network-Theory. (Clarendon Lectures in Management Studies). Oxford, England: Oxford University Press.

Maxwell, J. C. (1879). Treatise on electricity and magnetism, Volumes I and II. London, England: Macmillan.

Rasch, G. (1960). Probabilistic models for some intelligence and attainment tests (Reprint, with Foreword and Afterword by B. D. Wright, Chicago: University of Chicago Press, 1980). Copenhagen, Denmark: Danmarks Paedagogiske Institut.

Rentz, R. R., & Bashaw, W. L. (1977, Summer). The National Reference Scale for Reading: An application of the Rasch model. Journal of Educational Measurement, 14(2), 161-179.

Schaffer, S. (1992). Late Victorian metrology and its instrumentation: A manufactory of Ohms. In R. Bud & S. E. Cozzens (Eds.), Invisible connections: Instruments, institutions, and science (pp. 23-56). Bellingham, WA: SPIE Optical Engineering Press.

Shapin, S. (1989, November-December). The invisible technician. American Scientist, 77, 554-563.

Stenner, A. J., Burdick, H., Sanford, E. E., & Burdick, D. S. (2006). How accurate are Lexile text measures? Journal of Applied Measurement, 7(3), 307-22.

Stenner, A. J., Smith, M., III, & Burdick, D. S. (1983, Winter). Toward a theory of construct definition. Journal of Educational Measurement, 20(4), 305-316.

Stenner, A. J., Stone, M., & Burdick, D. (2009, Autumn). The concept of a measurement mechanism. Rasch Measurement Transactions, 23(2), 1204-1206.

White, M. (1997). Isaac Newton: The last sorcerer. New York: Basic Books.

Wise, M. N. (1988). Mediating machines. Science in Context, 2(1), 77-113.

Wise, M. N. (Ed.). (1995). The values of precision. Princeton, New Jersey: Princeton University Press.


Statistics and Measurement: Clarifying the Differences

August 26, 2009

Measurement is qualitatively and paradigmatically quite different from statistics, even though statistics obviously play important roles in measurement, and vice versa. The perception of measurement as conceptually difficult stems in part from its rearrangement of most of the concepts that we take for granted in the statistical paradigm as landmarks of quantitative thinking. When we recognize and accept the qualitative differences between statistics and measurement, they both become easier to understand.

Statistical analyses are commonly referred to as quantitative, even though the numbers analyzed usually have not been derived from the mapping of an invariant substantive unit onto a number line. Measurement takes such mapping as its primary concern, focusing on the quantitative meaningfulness of numbers (Falmagne & Narens, 1983; Luce, 1978; Marcus-Roberts & Roberts, 1987; Mundy, 1986; Narens, 2002; Roberts, 1999). Statistical models focus on group processes and relations among variables, while measurement models focus on individual processes and relations within variables (Duncan, 1992; Duncan & Stenbeck, 1988; Rogosa, 1987). Statistics makes assumptions about factors beyond its control, while measurement sets requirements for objective inference (Andrich, 1989). Statistics primarily involves data analysis, while measurement primarily calibrates instruments in common metrics for interpretation at the point of use (Cohen, 1994; Fisher, 2000; Guttman, 1985; Goodman, 1999a-c; Rasch, 1960).

Statistics focuses on making the most of the data in hand, while measurement focuses on using the data in hand to inform (a) instrument calibration and improvement, and (b) the prediction and efficient gathering of meaningful new data on individuals in practical applications. Where statistical “measures” are defined inherently by a particular analytic method, measures read from calibrated instruments—and the raw observations informing these measures—need not be computerized for further analysis.

Because statistical “measures” are usually derived from ordinal raw scores, changes to the instrument change their meaning, resulting in a strong inclination to avoid improving the instrument. Measures, in contrast, take missing data into account, so their meaning remains invariant over instrument configurations, resulting in a firm basis for the emergence of a measurement quality improvement culture. So statistical “measurement” begins and ends with data analysis, whereas measurement from calibrated instruments is in a constant cycle of application, new item calibrations, and critical recalibrations that require only intermittent resampling.
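
As a minimal sketch of that invariance, assuming a bank of previously calibrated items and a simple maximum-likelihood routine (a stand-in for the estimation methods used in practice), the same person can be measured with different subsets of items and the estimates land on the same scale:

```python
import math

def estimate_measure(calibrations, responses, iterations=50):
    """Maximum-likelihood person measure from whatever calibrated items
    were actually administered; unasked items are simply absent, and the
    estimate still lands on the scale the calibrations define."""
    theta = 0.0
    for _ in range(iterations):  # Newton-Raphson on the log-likelihood
        ps = [1.0 / (1.0 + math.exp(-(theta - b))) for b in calibrations]
        information = sum(p * (1.0 - p) for p in ps)
        theta += (sum(responses) - sum(ps)) / information
    return theta

# One person, measured twice with different subsets of a calibrated bank:
bank = {"q1": -1.0, "q2": -0.5, "q3": 0.0, "q4": 0.5, "q5": 1.0}
print(round(estimate_measure([bank["q1"], bank["q3"], bank["q5"]], [1, 1, 0]), 2))
print(round(estimate_measure([bank["q2"], bank["q3"], bank["q4"]], [1, 1, 0]), 2))
# Both estimates (about 0.8 and 0.7 logits) refer to the same scale and
# agree within measurement error, despite the different items used.
```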

The vast majority of statistical methods and models make strong assumptions about the nature of the unit of measurement, but provide either very limited ways of checking those assumptions, or no checks at all. Statistical models are descriptive in nature, meaning that models are fit to data, that the validity of the data is beyond the immediate scope of interest, and that the model accounting for the most variation is regarded as best. Finally, and perhaps most importantly, statistical models are inherently oriented toward the relations among variables at the level of samples and populations.

Measurement models, however, impose strong requirements on data quality in order to achieve the unit of measurement that is easiest to think with, one that stays constant and remains invariant across the local particulars of instrument and sample. Measurement methods and models, then, provide extensive and varied ways of checking the quality of the unit, and so must be prescriptive rather than descriptive. That is, measurement models define the data quality that must be obtained for objective inference. In the measurement paradigm, data are fit to models, data quality is of paramount interest, and data quality evaluation must be informed as much by qualitative criteria as by quantitative.

To repeat the most fundamental point, measurement models are oriented toward individual-level response processes, not group-level aggregate processes. Herbert Blumer pointed out as early as 1930 that quantitative method is not equivalent to statistical method, and that the natural sciences had conspicuous degrees of success long before the emergence of statistical techniques (Hammersley, 1989, pp. 113-4). Both the initial scientific revolution in the 16th-17th centuries and the second scientific revolution of the 19th century found a basis in measurement for publicly objective and reproducible results, but statistics played little or no role in the major discoveries of the times.

The scientific value of statistics resides largely in the reproducibility of cross-variable data relations, and statisticians widely agree that statistical analyses should depend only on sufficient statistics (Arnold, 1982, p. 79). Measurement theoreticians and practitioners also agree, but the sufficiency of the mean and standard deviation relative to a normal distribution is one thing, and the sufficiency of individual responses relative to an invariant construct is quite another (Andersen, 1977; Arnold, 1985; Dynkin, 1951; Fischer, 1981; Hall, Wijsman, & Ghosh, 1965; van der Linden, 1992).

It is of historical interest, though, to point out that Rasch, foremost proponent of the latter, attributes credit for the general value of the concept of sufficiency to Ronald Fisher, foremost proponent of the former. Rasch’s strong statements concerning the fundamental inferential value of sufficiency (Andrich, 1997; Rasch, 1977; Wright, 1980) would seem to contradict his repeated joke about burning all the statistics texts making use of the normal distribution (Andersen, 1995, p. 385) were it not for the paradigmatic distinction between statistical models of group-level relations among variables, and measurement models of individual processes. Indeed, this distinction is made on the first page of Rasch’s (1980) book.
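
The sense in which the raw score is sufficient can be demonstrated numerically in a few lines: under the Rasch model, the probability of a response pattern conditional on its raw score is independent of the person parameter. The calibrations below are illustrative only.

```python
import math
from itertools import product

def pattern_prob(theta, calibrations, pattern):
    """Probability of a full response pattern under the Rasch model."""
    prob = 1.0
    for b, x in zip(calibrations, pattern):
        p = 1.0 / (1.0 + math.exp(-(theta - b)))
        prob *= p if x else 1.0 - p
    return prob

def prob_given_score(theta, calibrations, pattern):
    """P(pattern | raw score): the person parameter cancels out of this
    ratio, which is the sense in which the raw score is sufficient."""
    r = sum(pattern)
    same_score = [p for p in product((0, 1), repeat=len(calibrations)) if sum(p) == r]
    return pattern_prob(theta, calibrations, pattern) / sum(
        pattern_prob(theta, calibrations, p) for p in same_score)

calibrations = [-1.0, 0.0, 1.5]   # illustrative item difficulties (logits)
pattern = (1, 0, 1)               # one pattern with raw score 2
for theta in (-2.0, 0.0, 2.0):    # identical output for every ability
    print(round(prob_given_score(theta, calibrations, pattern), 4))
```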

Now we are in a position to appreciate a comment by Ernest Rutherford, winner of the 1908 Nobel Prize in Chemistry, who held that if you need statistics to understand the results of your experiment, then you should have designed a better experiment (Wise, 1995, p. 11). A similar point was made by Feinstein (1995) concerning meta-analysis. The rarely appreciated point is that the generalizable replication and application of results depends heavily on the existence of a portable and universally uniform observational framework. The inferences, judgments, and adjustments that can be made at the point of use by clinicians, teachers, managers, etc. provided with additive measures expressed in a substantively meaningful common metric far outstrip those that can be made using ordinal measures expressed in instrument- and sample-dependent scores. See Andrich (1989, 2002, 2004), Cohen (1994), Davidoff (1999), Duncan (1992), Embretson (1996), Goodman (1999a, 1999b, 1999c), Guttman (1981, 1985), Meehl (1967), Michell (1986), Rogosa (1987), Romanoski and Douglas (2002), and others for more on this distinction between statistics and measurement.

These contrasts show that the confounding of statistics and measurement is a problem of vast significance that persists in spite of repeated efforts to clarify the distinction. For a wide variety of reasons ranging from cultural presuppositions about the nature of number to the popular notion that quantification is as easy as assigning numbers to observations, measurement is not generally well understood by the public (or even by statisticians!). And so statistics textbooks rarely, if ever, include even passing mention of instrument calibration methods, metric equating processes, the evaluation of data quality relative to the requirements of objective inference, traceability to metrological reference standards, or the integration of qualitative and quantitative methods in the interpretation of measures.

Similarly, in business, marketing, health care, and quality improvement circles, we find near-universal repetition of the mantra, “You manage what you measure,” with little or no attention paid to the quality of the numbers treated as measures. And so, we find ourselves stuck with so-called measurement systems where,

• instead of linear measures defined by a unit that remains constant across samples and instruments, we saddle ourselves with nonlinear scores and percentages defined by units that vary in unknown ways across samples and instruments (see the sketch after this list);
• instead of availing ourselves of the capacity to take missing data into account, we hobble ourselves with the need for complete data;
• instead of dramatically reducing data volume with no loss of information, we insist on constantly re-enacting the meaningless ritual of poring over indigestible masses of numbers;
• instead of adjusting measures for the severity or leniency of judges assigning ratings, we allow measures to depend unfairly on which rater happens to make the observations;
• instead of using methods that give the same result across different distributions, we restrict ourselves to ones that give different results when assumptions of normality are not met and/or standard deviations differ;
• instead of calibrating instruments in an experimental test of the hypothesis that the intended construct is in fact structured in such a way as to make its mapping onto a number line meaningful, we assign numbers and make quantitative inferences with no idea as to whether they relate at all to anything real;
• instead of checking to see whether rating scales work as intended, with higher ratings consistently representing more of the variable, we make assumptions that may be contradicted by the order and spacing of the way rating scales actually work in practice;
• instead of defining a comprehensive framework for interpreting measures relative to a construct, we accept the narrow limits of frameworks defined by the local sample and items;
• instead of capitalizing on the practicality and convenience of theories capable of accurately predicting item calibrations and measures apart from data, we counterproductively define measurement empirically in terms of data analysis;
• instead of putting calibrated tools into the hands of front-line managers, service representatives, teachers and clinicians, we require them to submit to cumbersome data entry, analysis, and reporting processes that defeat the purpose of measurement by ensuring the information provided is obsolete by the time it gets back to the person who could act on it; and
• instead of setting up efficient systems for communicating meaningful measures in common languages with shared points of reference, we settle for inefficient systems for communicating meaningless scores in local incommensurable languages.
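
To illustrate the first item in the list, the following minimal sketch shows why percentages are nonlinear: the same raw gain corresponds to very different amounts of the underlying variable depending on where on the scale it occurs.

```python
import math

def to_logit(percent_correct):
    """Convert a percent-correct score to a logit (log-odds) value."""
    p = percent_correct / 100.0
    return math.log(p / (1.0 - p))

# The same five-point gain represents very different amounts of the
# underlying variable near the middle of the scale and near its ceiling:
for low, high in ((50, 55), (90, 95)):
    print(f"{low}% -> {high}%: {to_logit(high) - to_logit(low):.2f} logits")
# 50% -> 55%: 0.20 logits
# 90% -> 95%: 0.75 logits
```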

Because measurement is simultaneously ubiquitous and rarely well understood, we find ourselves in a world that gives near-constant lip service to the importance of measurement while it does almost nothing to provide measures that behave the way we assume they do. This state of affairs seems to have emerged in large part due to our failure to distinguish between the group-level orientation of statistics and the individual-level orientation of measurement. We seem to have been seduced by a variation on what Whitehead (1925, pp. 52-8) called the fallacy of misplaced concreteness. That is, we have assumed that the power of lawful regularities in thought and behavior would be best revealed and acted on via statistical analyses of data that themselves embody the aggregate mass of the patterns involved.

It now appears, however, in light of studies in the history of science (Latour, 1987, 2005; Wise, 1995), that an alternative and likely more successful approach will be to capitalize on the “wisdom of crowds” (Surowiecki, 2004) phenomenon of collective, distributed cognition (Akkerman, et al., 2007; Douglas, 1986; Hutchins, 1995; Magnus, 2007). This will be done by embodying lawful regularities in instruments calibrated in ideal, abstract, and portable metrics put to work by front-line actors on mass scales (Fisher, 2000, 2005, 2009a, 2009b). In this way, we will inform individual decision processes and structure communicative transactions with efficiencies, meaningfulness, substantive effectiveness, and power that go far beyond anything that could be accomplished by trying to make inferences about individuals from group-level statistics.

We ought not accept the factuality of data as the sole criterion of objectivity, with all theory and instruments constrained by and focused on the passing ephemera of individual sets of local particularities. Properly defined and operationalized via a balanced interrelation of theory, data, and instrument, advanced measurement is not a mere mathematical exercise but offers a wealth of advantages and conveniences that cannot otherwise be obtained. We ignore its potentials at our peril.

References
Akkerman, S., Van den Bossche, P., Admiraal, W., Gijselaers, W., Segers, M., Simons, R.-J., et al. (2007, February). Reconsidering group cognition: From conceptual confusion to a boundary area between cognitive and socio-cultural perspectives? Educational Research Review, 2, 39-63.

Andersen, E. B. (1977). Sufficient statistics and latent trait models. Psychometrika, 42(1), 69-81.

Andersen, E. B. (1995). What George Rasch would have thought about this book. In G. H. Fischer & I. W. Molenaar (Eds.), Rasch models: Foundations, recent developments, and applications (pp. 383-390). New York: Springer-Verlag.

Andrich, D. (1989). Distinctions between assumptions and requirements in measurement in the social sciences. In J. A. Keats, R. Taft, R. A. Heath & S. H. Lovibond (Eds.), Mathematical and Theoretical Systems: Proceedings of the 24th International Congress of Psychology of the International Union of Psychological Science, Vol. 4 (pp. 7-16). North-Holland: Elsevier Science Publishers.

Andrich, D. (1997). Georg Rasch in his own words [excerpt from a 1979 interview]. Rasch Measurement Transactions, 11(1), 542-3. [http://www.rasch.org/rmt/rmt111.htm#Georg].

Andrich, D. (2002). Understanding resistance to the data-model relationship in Rasch’s paradigm: A reflection for the next generation. Journal of Applied Measurement, 3(3), 325-59.

Andrich, D. (2004, January). Controversy and the Rasch model: A characteristic of incompatible paradigms? Medical Care, 42(1), I-7–I-16.

Arnold, S. F. (1982-1988). Sufficient statistics. In S. Kotz, N. L. Johnson & C. B. Read (Eds.), Encyclopedia of Statistical Sciences (pp. 72-80). New York: John Wiley & Sons.

Arnold, S. F. (1985, September). Sufficiency and invariance. Statistics & Probability Letters, 3, 275-279.

Cohen, J. (1994). The earth is round (p < 0.05). American Psychologist, 49, 997-1003.

Davidoff, F. (1999, 15 June). Standing statistics right side up (Editorial). Annals of Internal Medicine, 130(12), 1019-1021.

Douglas, M. (1986). How institutions think. Syracuse, New York: Syracuse University Press.

Dynkin, E. B. (1951). Necessary and sufficient statistics for a family of probability distributions. Selected Translations in Mathematical Statistics and Probability, 1, 23-41.

Duncan, O. D. (1992, September). What if? Contemporary Sociology, 21(5), 667-668.

Duncan, O. D., & Stenbeck, M. (1988). Panels and cohorts: Design and model in the study of voting turnout. In C. C. Clogg (Ed.), Sociological Methodology 1988 (pp. 1-35). Washington, DC: American Sociological Association.

Embretson, S. E. (1996, September). Item Response Theory models and spurious interaction effects in factorial ANOVA designs. Applied Psychological Measurement, 20(3), 201-212.

Falmagne, J.-C., & Narens, L. (1983). Scales and meaningfulness of quantitative laws. Synthese, 55, 287-325.

Feinstein, A. R. (1995, January). Meta-analysis: Statistical alchemy for the 21st century. Journal of Clinical Epidemiology, 48(1), 71-79.

Fischer, G. H. (1981, March). On the existence and uniqueness of maximum-likelihood estimates in the Rasch model. Psychometrika, 46(1), 59-77.

Fisher, W. P., Jr. (2000). Objectivity in psychosocial measurement: What, why, how. Journal of Outcome Measurement, 4(2), 527-563.

Fisher, W. P., Jr. (2005). Daredevil barnstorming to the tipping point: New aspirations for the human sciences. Journal of Applied Measurement, 6(3), 173-9.

Fisher, W. P., Jr. (2009a). Bringing human, social, and natural capital to life: Practical consequences and opportunities. In M. Wilson, K. Draney, N. Brown & B. Duckor (Eds.), Advances in Rasch Measurement, Vol. Two (p. in press). Maple Grove, MN: JAM Press.

Fisher, W. P., Jr. (2009b, July). Invariance and traceability for measures of human, social, and natural capital: Theory and application. Measurement (Elsevier), in press.

Goodman, S. N. (1999a, 6 April). Probability at the bedside: The knowing of chances or the chances of knowing? (Editorial). Annals of Internal Medicine, 130(7), 604-6.

Goodman, S. N. (1999b, 15 June). Toward evidence-based medical statistics. 1: The p-value fallacy. Annals of Internal Medicine, 130(12), 995-1004.

Goodman, S. N. (1999c, 15 June). Toward evidence-based medical statistics. 2: The Bayes factor. Annals of Internal Medicine, 130(12), 1005-1013.

Guttman, L. (1981). What is not what in theory construction. In I. Borg (Ed.), Multidimensional data representations: When & why. Ann Arbor, MI: Mathesis Press.

Guttman, L. (1985). The illogic of statistical inference for cumulative science. Applied Stochastic Models and Data Analysis, 1, 3-10.

Hall, W. J., Wijsman, R. A., & Ghosh, J. K. (1965). The relationship between sufficiency and invariance with applications in sequential analysis. Annals of Mathematical Statistics, 36, 575-614.

Hammersley, M. (1989). The dilemma of qualitative method: Herbert Blumer and the Chicago Tradition. New York: Routledge.

Hutchins, E. (1995). Cognition in the wild. Cambridge, Massachusetts: MIT Press.

Latour, B. (1987). Science in action: How to follow scientists and engineers through society. New York: Cambridge University Press.

Latour, B. (1995). Cogito ergo sumus! Or psychology swept inside out by the fresh air of the upper deck: Review of Hutchins’ Cognition in the Wild, MIT Press, 1995. Mind, Culture, and Activity: An International Journal, 3(192), 54-63.

Latour, B. (2005). Reassembling the social: An introduction to Actor-Network-Theory. (Clarendon Lectures in Management Studies). Oxford, England: Oxford University Press.

Luce, R. D. (1978, March). Dimensionally invariant numerical laws correspond to meaningful qualitative relations. Philosophy of Science, 45, 1-16.

Magnus, P. D. (2007). Distributed cognition and the task of science. Social Studies of Science, 37(2), 297-310.

Marcus-Roberts, H., & Roberts, F. S. (1987). Meaningless statistics. Journal of Educational Statistics, 12(4), 383-394.

Meehl, P. E. (1967). Theory-testing in psychology and physics: A methodological paradox. Philosophy of Science, 34(2), 103-115.

Michell, J. (1986). Measurement scales and statistics: A clash of paradigms. Psychological Bulletin, 100, 398-407.

Mundy, B. (1986, June). On the general theory of meaningful representation. Synthese, 67(3), 391-437.

Narens, L. (2002, December). A meaningful justification for the representational theory of measurement. Journal of Mathematical Psychology, 46(6), 746-68.

Rasch, G. (1960). Probabilistic models for some intelligence and attainment tests (Reprint, with Foreword and Afterword by B. D. Wright, Chicago: University of Chicago Press, 1980). Copenhagen, Denmark: Danmarks Paedagogiske Institut.

Rasch, G. (1977). On specific objectivity: An attempt at formalizing the request for generality and validity of scientific statements. Danish Yearbook of Philosophy, 14, 58-94.

Roberts, F. S. (1999). Meaningless statements. In R. Graham, J. Kratochvil, J. Nesetril & F. Roberts (Eds.), Contemporary trends in discrete mathematics, DIMACS Series, Volume 49 (pp. 257-274). Providence, RI: American Mathematical Society.

Rogosa, D. (1987). Casual [sic] models do not support scientific conclusions: A comment in support of Freedman. Journal of Educational Statistics, 12(2), 185-95.

Romanoski, J. T., & Douglas, G. (2002). Rasch-transformed raw scores and two-way ANOVA: A simulation analysis. Journal of Applied Measurement, 3(4), 421-430.

Stevens, S. S. (1951). Mathematics, measurement, and psychophysics. In S. S. Stevens (Ed.), Handbook of experimental psychology (pp. 1-49). New York: John Wiley & Sons.

Surowiecki, J. (2004). The wisdom of crowds: Why the many are smarter than the few and how collective wisdom shapes business, economies, societies and nations. New York: Doubleday.

van der Linden, W. J. (1992). Sufficient and necessary statistics. Rasch Measurement Transactions, 6(3), 231 [http://www.rasch.org/rmt/rmt63d.htm].

Whitehead, A. N. (1925). Science and the modern world. New York: Macmillan.

Wise, M. N. (Ed.). (1995). The values of precision. Princeton, New Jersey: Princeton University Press.

Wright, B. D. (1980). Foreword, Afterword. In Probabilistic models for some intelligence and attainment tests, by Georg Rasch (pp. ix-xix, 185-199) [Reprint; original work published in 1960 by the Danish Institute for Educational Research]. Chicago, Illinois: University of Chicago Press. [http://www.rasch.org/memo63.htm]


The “Standard Model,” Part II: Natural Law, Economics, Measurement, and Capital

July 15, 2009

At Tjalling Koopmans’ invitation, Rasch became involved with the Cowles Commission, working at the University of Chicago in the 1947 academic year, and giving presentations in the same seminar series as Milton Friedman, Kenneth Arrow, and Jimmie Savage (Linacre, 1998; Cowles Foundation, 1947, 1952; Rasch, 1953). Savage would later be instrumental in bringing Rasch back to Chicago in 1960.

Rasch was prompted to approach Savage about giving a course at Chicago after receiving a particularly strong response to some of his ideas from his old mentor, Ragnar Frisch, when Frisch came to Copenhagen to receive an honorary doctorate in 1959. Frisch shared the first Nobel Prize in economics with Tinbergen; co-founded the Econometric Society with Irving Fisher; coined words such as “econometrics” and “macro-economics”; and edited Econometrica for many years. As recounted by Rasch (1977, pp. 63-66; also see Andrich, 1997; Wright, 1980, 1998), Frisch was struck by the disappearance of the person parameter from the comparisons of item calibrations in the series of equations he presented. In response to Frisch’s reaction, Rasch formalized his mathematical ideas in a Separability Theorem.
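
What reportedly struck Frisch can be illustrated numerically. In the sketch below (a minimal illustration of my own, not Rasch's derivation), the comparison of two items, conditional on exactly one of the pair being answered correctly, comes out the same for every value of the person parameter:

```python
import math

def p_success(theta, b):
    """Rasch model probability of success."""
    return 1.0 / (1.0 + math.exp(-(theta - b)))

def conditional_comparison(theta, b1, b2):
    """P(first item right, second wrong | exactly one of the pair right).
    In the Rasch model this comparison of the two items is free of the
    person parameter: theta cancels out of the ratio."""
    p1, p2 = p_success(theta, b1), p_success(theta, b2)
    return p1 * (1 - p2) / (p1 * (1 - p2) + (1 - p1) * p2)

b1, b2 = -0.5, 1.0                 # illustrative item calibrations (logits)
for theta in (-3.0, 0.0, 3.0):     # same result regardless of ability
    print(round(conditional_comparison(theta, b1, b2), 4))  # 0.8176 each time
```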

Why were the separable parameters significant to Frisch? Because they addressed the problem that was at the center of Frisch’s network of concepts: autonomy, better known today as structural invariance (Aldrich, 1989, p. 15; Boumans, 2005, pp. 51 ff.; Haavelmo, 1948). Autonomy concerns the capacity of data to represent a pattern of relationships that holds up across the local particulars. It is, in effect, Frisch’s own particular way of extending the Standard Model. Irving Fisher (1930) had similarly stated what he termed a Separation Theorem, which, in the manner of previous work by Walras, Jevons, and others, was also presented in terms of a multiplicative relation between three variables. Frisch (1930) complemented Irving Fisher’s instrumental approach with a mathematical, axiomatic approach (Boumans, 2005) offering necessary and sufficient conditions for tests of Irving Fisher’s theorem.

When Rasch left Frisch, he went directly to London to work with Ronald Fisher, where he remained for a year. In the following decades, Rasch became known as the foremost advocate of Ronald Fisher’s ideas in Denmark. In particular, he stressed the value of statistical sufficiency, calling it the “high mark” of Fisher’s work (Fisher, 1922). Rasch’s student, Erling Andersen, later showed that when raw scores are both necessary and sufficient statistics for autonomous, separable parameters, the model employed is Rasch’s (Andersen, 1977; Fischer, 1981; van der Linden, 1992).

Whether or not Rasch’s conditions exactly reproduce Frisch’s, and whether or not his Separability Theorem is identical with Irving Fisher’s Separation Theorem, it would seem that his time with Frisch exerted a significant influence on Rasch, likely focusing his attention on statistical sufficiency, the autonomy implied by separable parameters, and the multiplicative relations of variable triples.

These developments, and those documented in previous posts in this blog, suggest the existence of powerful and untapped potentials hidden within psychometrics and econometrics. The story told thus far remains incomplete. However compelling the logic and personal histories may be, central questions remain unanswered. To provide a more well-rounded assessment of the situation, we must take up several unresolved philosophical issues (Fisher, 2003a, 2003b, 2004).

It is my contention that, for better measurement to become more mainstream, a certain kind of cultural shift is going to have to happen. This shift has already been underway for decades, and has precedents that go back centuries. Its features are becoming more apparent as long-term economic sustainability is understood to involve significant investments in humanly, socially, and environmentally responsible practices. For such practices to be more than superficial expressions of intentions more interested in selfish gain than in the greater good, they have to emerge organically from cultural roots that are already alive and thriving.

It is not difficult to see how such an organic emergence might happen, though describing it appropriately requires an ability to keep the relationship of the local individual to the global universal always in mind. And even if and when that description might be provided, having it in hand in no way shows how it could be brought about. All we can do is to persist in preparing ourselves for the opportunities that arise, reading, thinking, discussing, and practicing. Then, and only then, might we start to plant the seeds, nurture them, and see them grow.

References

Aldrich, J. (1989). Autonomy. Oxford Economic Papers, 41, 15-34.

Andersen, E. B. (1977). Sufficient statistics and latent trait models. Psychometrika, 42(1), 69-81.

Andrich, D. (1997). Georg Rasch in his own words [excerpt from a 1979 interview]. Rasch Measurement Transactions, 11(1), 542-3. [http://www.rasch.org/rmt/rmt111.htm#Georg].

Bjerkholt, O. (2001). Tracing Haavelmo’s steps from Confluence Analysis to the Probability Approach (Tech. Rep. No. 25). Oslo, Norway: Department of Economics, University of Oslo, in cooperation with The Frisch Centre for Economic Research.

Boumans, M. (1993). Paul Ehrenfest and Jan Tinbergen: A case of limited physics transfer. In N. De Marchi (Ed.), Non-natural social science: Reflecting on the enterprise of “More Heat than Light” (pp. 131-156). Durham, NC: Duke University Press.

Boumans, M. (2001). Fisher’s instrumental approach to index numbers. In M. S. Morgan & J. Klein (Eds.), The age of economic measurement (pp. 313-44). Durham, North Carolina: Duke University Press.

Boumans, M. (2005). How economists model the world into numbers. New York: Routledge.

Burdick, D. S., Stone, M. H., & Stenner, A. J. (2006). The Combined Gas Law and a Rasch Reading Law. Rasch Measurement Transactions, 20(2), 1059-60 [http://www.rasch.org/rmt/rmt202.pdf].

Cowles Foundation for Research in Economics. (1947). Report for period 1947, Cowles Commission for Research in Economics. Retrieved 7 July 2009, from Yale University Dept. of Economics: http://cowles.econ.yale.edu/P/reports/1947.htm.

Cowles Foundation for Research in Economics. (1952). Biographies of Staff, Fellows, and Guests, 1932-1952. Retrieved 7 July 2009 from Yale University Dept. of Economics: http://cowles.econ.yale.edu/P/reports/1932-52d.htm#Biographies.

Fischer, G. H. (1981, March). On the existence and uniqueness of maximum-likelihood estimates in the Rasch model. Psychometrika, 46(1), 59-77.

Fisher, I. (1930). The theory of interest. New York: Macmillan.

Fisher, R. A. (1922). On the mathematical foundations of theoretical statistics. Philosophical Transactions of the Royal Society of London, A, 222, 309-368.

Fisher, W. P., Jr. (1992). Objectivity in measurement: A philosophical history of Rasch’s separability theorem. In M. Wilson (Ed.), Objective measurement: Theory into practice. Vol. I (pp. 29-58). Norwood, New Jersey: Ablex Publishing Corporation.

Fisher, W. P., Jr. (2003a, December). Mathematics, measurement, metaphor, metaphysics: Part I. Implications for method in postmodern science. Theory & Psychology, 13(6), 753-90.

Fisher, W. P., Jr. (2003b, December). Mathematics, measurement, metaphor, metaphysics: Part II. Accounting for Galileo’s “fateful omission.” Theory & Psychology, 13(6), 791-828.

Fisher, W. P., Jr. (2004, October). Meaning and method in the social sciences. Human Studies: A Journal for Philosophy and the Social Sciences, 27(4), 429-54.

Fisher, W. P., Jr. (2007, Summer). Living capital metrics. Rasch Measurement Transactions, 21(1), 1092-3 [http://www.rasch.org/rmt/rmt211.pdf].

Fisher, W. P., Jr. (2008, March 28). Rasch, Frisch, two Fishers and the prehistory of the Separability Theorem. In Session 67.056. Reading Rasch Closely: The History and Future of Measurement. American Educational Research Association, Rasch Measurement SIG, New York University, New York City.

Frisch, R. (1930). Necessary and sufficient conditions regarding the form of an index number which shall meet certain of Fisher’s tests. Journal of the American Statistical Association, 25, 397-406.

Haavelmo, T. (1948). The autonomy of an economic relation. In R. Frisch et al. (Eds.), Autonomy of economic relations (pp. 25-38). Oslo, Norway: Memo DE-UO.

Heilbron, J. L. (1993). Weighing imponderables and other quantitative science around 1800. Historical Studies in the Physical and Biological Sciences, 24(Supplement), Part I, 1-337.

Jammer, M. (1999). Concepts of mass in contemporary physics and philosophy. Princeton, NJ: Princeton University Press.

Linacre, J. M. (1998). Rasch at the Cowles Commission. Rasch Measurement Transactions, 11(4), 603.

Maas, H. (2001). An instrument can make a science: Jevons’s balancing acts in economics. In M. S. Morgan & J. Klein (Eds.), The age of economic measurement (pp. 277-302). Durham, North Carolina: Duke University Press.

Mirowski, P. (1988). Against mechanism. Lanham, MD: Rowman & Littlefield.

Rasch, G. (1953, March 17-19). On simultaneous factor analysis in several populations. From the Uppsala Symposium on Psychological Factor Analysis. Nordisk Psykologi’s Monograph Series, 3, 65-71, 76-79, 82-88, 90.

Rasch, G. (1960). Probabilistic models for some intelligence and attainment tests (Reprint, with Foreword and Afterword by B. D. Wright, Chicago: University of Chicago Press, 1980). Copenhagen, Denmark: Danmarks Paedagogiske Institut.

Rasch, G. (1977). On specific objectivity: An attempt at formalizing the request for generality and validity of scientific statements. Danish Yearbook of Philosophy, 14, 58-94.

van der Linden, W. J. (1992). Sufficient and necessary statistics. Rasch Measurement Transactions, 6(3), 231 [http://www.rasch.org/rmt/rmt63d.htm].

Wright, B. D. (1980). Foreword, Afterword. In Probabilistic models for some intelligence and attainment tests, by Georg Rasch (pp. ix-xix, 185-199) [Reprint; original work published in 1960 by the Danish Institute for Educational Research]. Chicago, Illinois: University of Chicago Press. [http://www.rasch.org/memo63.htm]

Wright, B. D. (1994, Summer). Theory construction from empirical observations. Rasch Measurement Transactions, 8(2), 362 [http://www.rasch.org/rmt/rmt82h.htm].

Wright, B. D. (1998, Spring). Georg Rasch: The man behind the model. Popular Measurement, 1, 15-22 [http://www.rasch.org/pm/pm1-15.pdf].