Posts Tagged ‘theory’

A Summary End-of-Year Philosophical Overview

December 25, 2009

So the end of the year and the start of a new one makes a good time to reflect a bit on just what the situation in the world looks like, philosophically speaking.

As is so often the case, we hold the keys to our own liberation, but don’t know it, can’t see them, or refuse out of pure contrariness to fit them in the locks. Here, then, is a list of locks and keys for those who might want to match them up and see new ways of doing things.

  • The way we define a problem sets up a class of solutions as a restricted range of ways that things can be done. Historians and philosophers of science have shown that, contrary to the way we usually think of things, solutions come first. As the old expression goes, “When the only tool you have is a hammer, everything looks like a nail.” Science is so dependent on the available technology for the way it defines problems that this point has led to the emergence of the term “technoscience” as an explicit marker of the difference between this new point of view and the old one (among many works in this area, see Ihde, 1983; Latour, 1987).
  • One of the most ancient human technologies is language itself; the word “text” has the same root in the Sanskrit TEK and Greek techne as technique and textile. Just as is the case with technoscience, before we have the slightest chance to do anything about it, language prethinks the world for us. In the same way that the Grateful Dead sings about the music playing the band, the words and grammar we use are using us much more than vice versa. We have rightly become more sensitive to the way words restrict our expectations, so that “man” is no longer taken to refer to all people. But the problem is far more complex than this example might lead us to believe. The very way in which words represent things is itself the paradigmatic model for science, as becomes apparent as we think this through.
  • One very important way that language sets us up to think in a particular way stems from the subject-verb-object structure of Western European languages. We habitually define problems in terms of what is sometimes called the Cartesian duality or subject-object split. Our language has led to the perception that thinking subjects are completely separate from and independent of the objects they encounter and act on. The limited framework in which this split can be reasonably entertained has been enormously productive, but has led to equally enormous undesired consequences in terms of human, social, and environmental waste.
  • Descartes himself recognized the limits of separating the thinking subject from the world of objects, but took a pragmatic attitude toward simplifying things. If Descartes hadn’t existed, we would have had to invent him, and to some extent, we probably already have. Descartes (1971, pp. 183-4) understood the situation very well, saying: “I have often observed that philosophers make the mistake of trying to explain by logical definitions those things which are most simple and self-evident; they thus only make them more obscure. When I said that the proposition I experience (cogito) therefore I am is the first and most certain of those we come across when we philosophize in an orderly way, I was not denying that we must first know what is meant by experience, existence, certainty; again, we must know such things as that it is impossible for that which is experiencing to be non-existent; but I thought it needless to enumerate these notions, for they are of the greatest simplicity, and by themselves they can give us no knowledge that anything exists.”
  • Descartes then did not use the phrase ‘cogito, ergo sum’ in the rigid and over-simplified way which is often attributed to him.  Heidegger (1967, p. 104) explains that
    “The formula which the proposition sometimes has, ‘cogito, ergo sum,’ suggests the misunderstanding that it is here a question of inference.  That is not the case and cannot be so, because this conclusion would have to have as its major premise: Id quod cogitat, est; and the minor premise: cogito; conclusion: ergo sum.  However, the major premise would only be a formal generalization of what lies in the proposition: ‘cogito-sum.’ Descartes himself emphasizes that no inference is present.  The sum is not a consequence of the thinking, but vice versa; it is the ground of thinking, the fundamentum.”
  • Today, though, the matters that were too simple for Descartes to concern himself with have become problems of huge proportion.  In a note to Heidegger’s discussion of this passage from Descartes, the editor suggests that the greatest part of Heidegger’s philosophical work has been devoted to enumerating and putting on record what Descartes left out as too simple to be concerned with (Krell in Heidegger 1982b, p. 125).
  • No doubt a great many thinkers and scholars have an intellectual grasp of these issues. Putting those thoughts into action is proving difficult, to say the least. Institutionalized habits of mind seem nearly impossible to overcome. In one of those great ironies of history, we now have a situation in which we are trying to solve a new class of problems (nonCartesian ones) using the approaches that are the cause of the class of problems (Cartesian ones). Of course, as long as we insist on operating this way, all we can do is make things worse. (For more on this, see a previous blog describing how the problem is the problem.)
  • We can see our way out of this, and moreover find the motivation to act, by considering how we got into it. Descartes (1961, p. 8) held that “…in seeking the correct path to truth we should be concerned with nothing about which we cannot have a certainty equal to that of the demonstrations of arithmetic and geometry.” In saying this, Descartes identifies himself as a student of Plato, as someone experienced enough in mathematics to have met the requirements for admission to the Academy. Plato wanted students familiar with arithmetic and geometry because such students already know that numerals and geometric figures plainly are not the mathematical objects they stand for. Geometrical analyses of squares, circles, and triangles always come out the same, no matter which particular figure of a type is involved. Understanding this distinction was fundamental to taking up the study of philosophy, which actually involves nothing but the independence of figure from meaning, of word from concept. The Cartesian duality is a natural extension of this Platonic distinction into that between mind and body, subject and object.
  • So we look right through the particular words, numbers, and geometrical figures representing things and see the things themselves in terms of abstract ideals that are basically mathematical. But even in naming abstract ideals as such we do not come any closer to grasping or apprehending the complete truth of being. All we have are words, but this does not mean that we are trapped forever in a linguistic cage. The situation is quite the contrary, in fact. Science is poetry in motion. Science is a systematic way of simultaneously inventing and discovering things brought into words via dialogues with life. Science is the way we let the metaphoric process do its thing (among many works in this area, see especially Gerhart & Russell, 1984, and Kuhn, 1993; for an example of recent work, see Colburn & Shute, 2008).
  • Far from controlling and dominating the world, what science enables us to do via metaphor is to subject ourselves systematically to very specific aspects of the world.  Our problem today is not one of overcoming the way we have subdued nature, each other, and ourselves so much as it is one of subjecting ourselves to a more comprehensive range of things about which we can “have a certainty equal to that of the demonstrations of arithmetic and geometry,” as Descartes put it. In other words, how do we extend the power of nonCartesian scientific metaphor-making into the human, social, and environmental sciences? This project has been the focus of my work from the beginning of my professional career to the present, and is elaborated in detail in a number of works (Fisher, 1988, 1992, 2004, 2010b).
  • Though explanations and logic can be compelling to some readers, the real power of ideas is exhibited in practice. Living the change we want to see happen has, for me, involved acting on yet another aspect of the way science poetically extends language’s prethinking of the world. The identity and coherence of a culture or an historical epoch is largely a matter of the way particular metaphors inform a worldview and the paradigmatic objects of the conversations of the time. Individual thoughts and behaviors are coordinated and harmonized via conversations that take place in terms, of course, of the words and concepts in circulation. And so we see that language is the original network that makes collective cognition and action possible. Language is the model for the not-always-so-wise wisdom of crowds effect that synchronizes everything from markets to laboratories to rush hours.
  • Seen from this angle, then, the problem is one of seeing how mathematical clarity can be embodied in the instruments of a technoscience distributed across the nodes of networks. How can we think and act together on the problems of the human, social, and environmental sciences with the same kind of coordination we experience in time via clocks or in the sequencing of the SARS virus via laboratories sharing metrological standards (to cite an example given by Surowiecki (2004), with Latour (1987, 2005) in the background)? The answer to this question lies in the calibration of instruments that are linked together and are thus traceable to reference standards in a kind of metric system for each major construct of interest, such as the abilities, health, attitudes, trust, and environmental qualities essential to human, social, and natural capital (Fisher, 1996, 2000a, 2000b, 2002, 2005, 2009a, 2009b, 2010a).
  • Instruments are being calibrated on a broad scale across a great many applied and research contexts in business and academia (among thousands of publications, see Bezruczko, 2005; Drehmer, Belohlav, & Coye, 2000; Masters, 2007; Salzberger & Sinkovics, 2006). Though local or proprietary implementations work to coordinate thought and behavior within restricted communities, systematic approaches to creating universally uniform metric systems for human, social, and natural capital are as yet nonexistent (Fisher, 2009a, 2009b).
  • Finally, in accord with our acceptance of the way we are always already caught up in the play and flow of language, what does a nonCartesian approach to facilitating networked harmonizations look like? There are four main features to be aware of. First off, we want to be acutely aware of and vigilantly sensitive to the role of metaphor. In abstracting from individuals to universals, we generalize from particulars in ways that must be justified (Ballard, 1978, pp. 186-190; Ricoeur, 1974; Gadamer, 1991, pp. 7-8).  All generalization involves telling a story that is largely true of everyone and everything that has a part in it, but which simultaneously is not perfectly or exactly true of any of them. As Rasch (1960, p. 115) points out, if force, mass, and acceleration are measured with enough precision we see that the actual measures do not accord exactly with Newton’s laws; rather, their parameters in probability distributions do. Respect and attention to the potential for what Ricoeur (1974) called the violence of the premature conclusion must be brought to bear in systematic ways to aid in “recalling the uniqueness of the person measured” (Ballard, 1978, p. 189). It will be essential to incorporate the ontological method’s (Fisher, 2010b; Heidegger, 1982a, pp. 21-23, 32-330) deconstructive moment as a judicial element in a balance of powers with the legislative moment’s experimentally justified reductions and the executive moment’s constructive applications.
  • Second, attuned to those instances in which the philosophical thesis of the independence of figure and meaning, or the separation of signifier and signified, is difficult to satisfy (Derrida, 1982, p. 229; Wood & Bernasconi, 1988, 88-89), a nonCartesian approach to facilitating network harmonizations requires that we focus on identifying where, when, and what signifier-signified separations can be obtained. Because the universality and objectivity of mathematical objects make them “the absolute model for any object whatsoever” (Derrida, 1989, p. 66, also see p. 27), and because it is number and not word that is the real paradigm of the domain of things that can be understood in language (Gadamer, 1989, p. 412), we now strive to test the limits of the mathematical as “the fundamental presupposition of all ‘academic’ work” and “of the knowledge of things” (Heidegger, 1967, pp. 75-76).  This is the same thing as attending to the calibration of the instruments that are ultimately to be linked to reference standards. This is the domain of Rasch measurement (Andrich, 1988, 2004; Bond & Fox, 2007; Rasch, 1960; Wilson, 2005; Wright, 1997), which takes the assessment of data consistency, unidimensionality, reliability, and construct validity as essential.
  • Third, with calibrated instruments in hand, attention turns to linking and equating them systematically in networks tracing connections to and from metrological reference standards, adapting the methods for maintaining the existing metric system (Fisher, 1996, 2000a, 2000b, 2005, 2009a, 2009b, 2010a). The goal here will be one of coordinating and synchronizing the self-organizing structures of each distinct construct, much as was done for the measurement of literacy (Stenner, et al., 2006).
  • Fourth, though we have to this point completely respected our inescapable immersion in the play of language, there still remains the question of how such a massive transformation from the modern Cartesian dualist point of view to a postmodern nonCartesian one will be brought about. Like any paradigm shift, the new way of doing things emerges as a function of the returns–economic, political, social, and psychological–that can be expected from the investments made. And in accord with the broad qualitative sense of the mathematical as learning through what we already know (Heidegger, 1967; Kisiel, 1973), the new will emerge as an amplification of something old. A great deal of attention and investment is currently being focused on creating whole new sources of sustainable, socially responsible, and long-term profits from closer management of human, social, and natural capital. In the same way that the metric system is an essential component of global trade, and in the same way that origins of the metric system coincide with the scientific, industrial, and political revolutions of the late 18th and early 19th centuries, so, too, will a new metric system for human, social, and natural capital provide a foundation for new efficiencies and degrees of effectiveness across multiple domains. The profit motive is an engine of great energy and resources. We need to learn how to harness it as a driver of growth in realized human potential, social cohesion, and environmental quality. What other way of giving ourselves over to the nonCartesian and playful creation of meaning is there, in fact, except to extend the rule of law and the invisible hand’s matching of supply and demand into all of the areas essential to human being?

Philosophically speaking, then, it would seem that all of the elements are in place for a positive answer to Zimmerman’s (1990, p. 274) question, “can we develop the non-absolutist, non-foundational categories necessary to assess, to confront, and to transform the technological and economic mobilization of humanity and the earth at the beginning of the twenty-first century?” Zimmerman might not agree with my sense that we can, since, reflecting on Heidegger’s efforts to put his political philosophy in action, he (1990, p. 257) remarks that “Heidegger’s political engagement in 1933-34 led him to conclude that all merely human ‘revolutions’ and ‘decisions’ would simply reinforce the system already in play. The question for us is: Is that conclusion tenable?” Zimmerman (pp. 245-246) apparently hopes it is not, and looks to love, compassion, and respect as alternatives to Heidegger’s hope for divine intervention.

But let’s consider what is “merely human.” The nonhuman is not necessarily divine, even if that is what Heidegger might have meant. And has not Heidegger (1962) himself already identified care as the defining characteristic of human being, with Habermas (1995) underscoring “considerateness” for our shared vulnerability, Ricoeur (1974) focusing on the desire for meaning and the choice in favor of discourse over violence, and Gadamer (1991, p. 61) also holding that “the first concern of all dialogical and dialectical inquiry is a care for the unity and sameness of the thing under discussion”? Beyond these are shifts of focus away from death as our common end, and toward our common birth from women as our shared beginning (Fielding, 2003; Schues, 1997; Schutz, 1962, 1966; Tymieniecka 1998, 2000; Zaner, 2002). And even in this, we must inevitably draw from Plato, now in Socrates’ stress on his role as a midwife of ideas, and from Aristotle, who provides the model for how to take possession of the value of living meaning in theory (Gadamer, 1980, p. 200).

Further, the conception, gestation, midwifery, and nurturing of ideas that takes place via considerateness and the desire for meaning were never the product of “merely human” intentions or designs, any more than biological reproduction was. Rather, we submit to the demands of the ways meaning is created to the same extent that we submit to the ways that life is recreated; in both cases, there is such Hegelian joy in the ways we find ourselves in each other that we can hardly complain (though whole cultures have figured out ways of doing so).

And we can indeed fault Heidegger, as Zimmerman (1990, p. 244, 258) does, for having “refused to take seriously the organic dimension of human existence,” and for somehow managing “to ignore the concrete history of actual existence and actual inquiry.”  We arrive at an entirely different, democratic, sphere of political implications (Ihde, 1990; Latour, 2004; Latour & Weibel, 2005), when we extend the deconstruction of metaphysics into examinations of the actual material practices of science, as Latour (1987, 2005) and others have done (Ihde, 1991, 1998; Ihde & Selinger, 2003). The dialogue with nonhuman others (Latour, 1994) is conceived as explicitly nonCartesian and nondualist, such that it is literally impossible to conceive of anything that does not incorporate social relations, or of any social relations that do not incorporate nonhuman others.

The self-organized unfolding of such dialogues plays out the self-representative activity of the things themselves, with method defined as their movement in thought (Gadamer, 1989; Fisher, 2004). Reinforcing some aspect or aspects of the system already in play is indeed inevitable, as Heidegger concluded. But no important “revolutions” or “decisions” have ever been based in “merely human” inputs (Latour, 1993), as becomes apparent if we pay close attention to the concrete behaviors and communications through which meaning is created and shared. The “non-absolutist, non-foundational categories necessary to assess, to confront, and to transform the technological and economic mobilization of humanity and the earth at the beginning of the twenty-first century” referred to by Zimmerman are indeed in hand. Though many unfamiliar with the evidence, theory, and instruments may doubt this is true, a contemporary Galileo might be heard to mutter, “E pur si muove!”

References

Andrich, D. (1988). Rasch models for measurement (Sage University Paper Series on Quantitative Applications in the Social Sciences, no. 07-068). Beverly Hills, California: Sage Publications.

Andrich, D. (2004, January). Controversy and the Rasch model: A characteristic of incompatible paradigms? Medical Care, 42(1), I-7–I-16.

Ballard, E. G. (1978). Man and technology: Toward the measurement of a culture. Pittsburgh, Pennsylvania: Duquesne University Press.

Bezruczko, N. (Ed.). (2005). Rasch measurement in health sciences. Maple Grove, MN: JAM Press.

Bond, T., & Fox, C. (2007). Applying the Rasch model: Fundamental measurement in the human sciences, 2d edition. Mahwah, New Jersey: Lawrence Erlbaum Associates.

Colburn, T. R., & Shute, G. M. (2008, December). Metaphor in computer science. Journal of Applied Logic, 6(4), 526-533.

Derrida, J. (1982). Margins of philosophy. Chicago, Illinois: University of Chicago Press.

Derrida, J. (1989). Edmund Husserl’s Origin of Geometry: An introduction. Lincoln: University of Nebraska Press.

Descartes, R. (1961). Rules for the direction of the mind. Indianapolis: Bobbs-Merrill.

Descartes, R. (1971). Philosophical writings (E. Anscombe & P. T. Geach, Eds.). Indianapolis, Indiana: Bobbs-Merrill.

Drehmer, D. E., Belohlav, J. A., & Coye, R. W. (2000, December). An exploration of employee participation using a scaling approach. Group & Organization Management, 25(4), 397-418.

Fielding, H. (2003, March). Questioning nature: Irigaray, Heidegger and the potentiality of matter. Continental Philosophy Review, 36(1), 1-26.

Fisher, W. P., Jr. (1988). Truth, method, and measurement: The hermeneutic of instrumentation and the Rasch model [Diss]. Dissertation Abstracts International (University of Chicago, Dept. of Education, Division of the Social Sciences), 49, 0778A.

Fisher, W. P., Jr. (1992). Objectivity in measurement: A philosophical history of Rasch’s separability theorem. In M. Wilson (Ed.), Objective measurement: Theory into practice. Vol. I (pp. 29-58). Norwood, New Jersey: Ablex Publishing Corporation.

Fisher, W. P., Jr. (1996, Winter). The Rasch alternative. Rasch Measurement Transactions, 9(4), 466-467 [http://www.rasch.org/rmt/rmt94.htm].

Fisher, W. P., Jr. (2000a). Objectivity in psychosocial measurement: What, why, how. Journal of Outcome Measurement, 4(2), 527-563 [http://www.livingcapitalmetrics.com/images/WP_Fisher_Jr_2000.pdf].

Fisher, W. P., Jr. (2000b). Rasch measurement as the definition of scientific agency. Rasch Measurement Transactions, 14(3), 761 [http://www.rasch.org/rmt/rmt143f.htm].

Fisher, W. P., Jr. (2002, Spring). “The Mystery of Capital” and the human sciences. Rasch Measurement Transactions, 15(4), 854 [http://www.rasch.org/rmt/rmt154j.htm].

Fisher, W. P., Jr. (2004, October). Meaning and method in the social sciences. Human Studies: A Journal for Philosophy and the Social Sciences, 27(4), 429-54.

Fisher, W. P., Jr. (2005). Daredevil barnstorming to the tipping point: New aspirations for the human sciences. Journal of Applied Measurement, 6(3), 173-9 [http://www.livingcapitalmetrics.com/images/FisherJAM05.pdf].

Fisher, W. P., Jr. (2009a, November). Invariance and traceability for measures of human, social, and natural capital: Theory and application. Measurement (Elsevier), 42(9), 1278-1287.

Fisher, W. P., Jr. (2009b). NIST critical national need idea white paper: Metrological infrastructure for human, social, and natural capital (Tech. Rep.). New Orleans: LivingCapitalMetrics.com [http://www.livingcapitalmetrics.com/images/FisherNISTWhitePaper2.pdf].

Fisher, W. P., Jr. (2010a). Bringing human, social, and natural capital to life: Practical consequences and opportunities. Journal of Applied Measurement, 11, in press [http://www.livingcapitalmetrics.com/images/BringingHSN_FisherARMII.pdf].

Fisher, W. P., Jr. (2010b). Reducible or irreducible? Mathematical reasoning and the ontological method. Journal of Applied Measurement, 11, in press.

Gadamer, H.-G. (1980). Dialogue and dialectic: Eight hermeneutical studies on Plato (P. C. Smith, Trans.). New Haven: Yale University Press.

Gadamer, H.-G. (1989). Truth and method (J. Weinsheimer & D. G. Marshall, Trans.) (Rev. ed.). New York: Crossroad.

Gadamer, H.-G. (1991). Plato’s dialectical ethics: Phenomenological interpretations relating to the Philebus (R. M. Wallace, Trans.). New Haven, Connecticut: Yale University Press.

Gerhart, M., & Russell, A. (1984). Metaphoric process: The creation of scientific and religious understanding [Foreword by Paul Ricoeur]. Fort Worth, Texas: Texas Christian University Press.

Habermas, J. (1995). Moral consciousness and communicative action. Cambridge, Massachusetts: MIT Press.

Heidegger, M. (1962). Being and time (J. Macquarrie & E. Robinson, Trans.). New York: Harper & Row.

Heidegger, M. (1967). What is a thing? (W. B. Barton, Jr. & V. Deutsch, Trans.). South Bend, Indiana: Regnery/Gateway.

Heidegger, M. (1982a). The basic problems of phenomenology (J. M. Edie, Ed.) (A. Hofstadter, Trans.). Studies in Phenomenology and Existential Philosophy. Bloomington, Indiana: Indiana University Press (Original work published 1975).

Heidegger, M. (1982b). Nietzsche, Vol. 4: Nihilism (D. F. Krell, Ed.) (F. A. Capuzzi, Trans.). San Francisco: Harper & Row.

Ihde, D. (1983). The historical and ontological priority of technology over science. In D. Ihde, Existential technics (pp. 25-46). Albany, New York: State University of New York Press.

Ihde, D. (1990). Technology and the lifeworld: From garden to earth. Bloomington, Indiana: Indiana University Press.

Ihde, D. (1991). Instrumental realism: The interface between philosophy of science and philosophy of technology (The Indiana Series in the Philosophy of Technology). Bloomington, Indiana: Indiana University Press.

Ihde, D. (1998). Expanding hermeneutics: Visualism in science (Northwestern University Studies in Phenomenology and Existential Philosophy). Evanston, Illinois: Northwestern University Press.

Ihde, D., & Selinger, E. (Eds.). (2003). Chasing technoscience: Matrix for materiality (Indiana Series in Philosophy of Technology). Bloomington, Indiana: Indiana University Press.

Kisiel, T. (1973). The mathematical and the hermeneutical: On Heidegger’s notion of the apriori. In E. G. Ballard & C. E. Scott (Eds.), Martin Heidegger: In Europe and America (pp. 109-20). The Hague: Martinus Nijhoff.

Kuhn, T. S. (1993). Metaphor in science. In A. Ortony (Ed.), Metaphor and thought (2nd Ed.) (pp. 533-42). Cambridge: Cambridge University Press.

Latour, B. (1987). Science in action: How to follow scientists and engineers through society. New York: Cambridge University Press.

Latour, B. (1993). We have never been modern. Cambridge, Massachusetts: Harvard University Press.

Latour, B. (1994, May). Pragmatogonies: A mythical account of how humans and nonhumans swap properties. American Behavioral Scientist, 37(6), 791-808.

Latour, B. (2004). Politics of nature: How to bring the sciences into democracy. Cambridge, Massachusetts: Harvard University Press.

Latour, B. (2005). Reassembling the social: An introduction to Actor-Network-Theory (Clarendon Lectures in Management Studies). Oxford, England: Oxford University Press.

Latour, B., & Weibel, P. (2005). Making things public: Atmospheres of democracy. Cambridge, MA: MIT Press.

Masters, G. N. (2007). Special issue: Programme for International Student Assessment (PISA). Journal of Applied Measurement, 8(3), 235-335.

Rasch, G. (1960). Probabilistic models for some intelligence and attainment tests (Reprint, with Foreword and Afterword by B. D. Wright, Chicago: University of Chicago Press, 1980). Copenhagen, Denmark: Danmarks Paedogogiske Institut.

Ricoeur, P. (1974). Violence and language. In D. Stewart & J. Bien (Eds.), Political and social essays by Paul Ricoeur (pp. 88-101). Athens, Ohio: Ohio University Press.

Salzberger, T., & Sinkovics, R. R. (2006). Reconsidering the problem of data equivalence in international marketing research: Contrasting approaches based on CFA and the Rasch model for measurement. International Marketing Review, 23(4), 390-417.

Schues, C. (1997). The birth of difference. Human Studies, 20(2), 243-52.

Schutz, A. (1962). Scheler’s theory of intersubjectivity and the general thesis of the alter ego. In Collected Papers of Alfred Schutz, Volume I. The Hague, The Netherlands: Martinus Nijhoff.

Schutz, A. (1966). The problem of transcendental intersubjectivity in Husserl. In Collected Papers of Alfred Schutz, Volume III. The Hague, The Netherlands: Martinus Nijhoff.

Stenner, A. J., Burdick, H., Sanford, E. E., & Burdick, D. S. (2006). How accurate are Lexile text measures? Journal of Applied Measurement, 7(3), 307-22.

Surowiecki, J. (2004). The wisdom of crowds: Why the many are smarter than the few and how collective wisdom shapes business, economies, societies and nations. New York: Doubleday.

Tymieniecka, A.-T. (1998). The ontopoesis of life as a new philosophical paradigm. Phenomenological Inquiry, 22, 12-59.

Tymieniecka, A.-T. (2000). Origins of life and the new critique of reason. Analecta Husserliana, 66, 3-16.

Wilson, M. (2005). Constructing measures: An item response modeling approach. Mahwah, New Jersey: Lawrence Erlbaum Associates.

Wood, D., & Bernasconi, R. (1988). Derrida and différance. Evanston, Illinois: Northwestern University Press.

Wright, B. D. (1997, Winter). A history of social science measurement. Educational Measurement: Issues and Practice, 16(4), 33-45, 52 [http://www.rasch.org/memo62.htm].

Zaner, R. M. (2002). Making music together while growing older: Further reflections on intersubjectivity. Human Studies, 25, 1-18.

Zimmerman, M. E. (1990). Heidegger’s confrontation with modernity: Technology, politics, art. Bloomington, Indiana: Indiana University Press.



Clarifying the Goal: Submitting Rasch-based White Papers to NIST

October 23, 2009

NIST does not currently have any metrological standards (metrics to which all instruments measuring a particular construct are traceable) for anything measured with tests, surveys, rating scale assessments, or rankings–i.e., for anything of core interest in education, psychology, sociology, health status assessment, etc.

The ostensible reason for the lack of these standards is that no one has stepped up to demand them, to demonstrate their feasibility, or to argue on behalf of their value. So any construct of general interest for which we would want universally uniform and available metrics could be proposed. As can be seen in the NIST call, you have to be able to argue for the viability of a fundamentally new innovation that would produce high returns on the investment in a system of networked, equated, or item-banked instruments all measuring in a common metric.

Jack Stenner expressed the opinion some years ago that constructs already measured on mass scales using many different instruments that could conceivably be equated present the most persuasive cases for which strong metrological arguments could be made. I have wondered if that is necessarily true.

The idea is to establish a new division in NIST, managed jointly with the National Institutes of Health and of Education, that focuses on creating a new kind of metric system for informing human, social, and natural capital management, quality improvement, and research.

Because NIST has historically focused on metrological systems in the physical sciences, the immediate goal is only one of informing researchers at NIST as to the viability and potential value to be realized in analogous systems for the psychosocial sciences. No one understands the human, social, and economic value of measurement standards like NIST does.

Work that results in fundamental measures of psychosocial constructs should be proposed as an area deserving of NIST’s support. White Papers describing the “high risk-high reward” potential of Rasch applications might prompt NIST to start considering the possibility of a whole new domain of metrics.

For more info, see http://www.nist.gov/tip/call_for_white_papers_sept09.pdf, and feel free to reference the arguments I made in the White Paper I submitted (www.livingcapitalmetrics.com/images/FisherNISTWhitePaper2.pdf), or in my recent paper in Measurement: Fisher, W. P., Jr. (2009, November). Invariance and traceability for measures of human, social, and natural capital: Theory and application. Measurement (Elsevier), 42(9), 1278-1287.


Reliability Coefficients: Starting from the Beginning

August 31, 2009

[This posting was prompted by questions concerning a previous blog entry, Reliability Revisited, and provides background on reliability that only Rasch measurement practitioners are likely to possess.] Most measurement applications based in ordinal data do not implement rigorous checks of the internal consistency of the observations, nor do they typically use the log-odds transformation to convert the nonlinear scores into linear measures. Measurement is usually defined in statistical terms, applying population-level models to obtain group-level summary scores, means, and percentages. Measurement, however, ought to involve individual-level models and case-specific location estimates. (See one of my earlier blogs for more on this distinction between statistics and measurement.)
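
To make that distinction concrete, here is a minimal sketch (in Python, with made-up proportions) of the log-odds transformation mentioned above; it shows why equal differences in raw scores do not correspond to equal differences in linear measures.

```python
import math

def logit(proportion_correct):
    """Convert a raw-score proportion (0 < p < 1) into a linear log-odds measure, in logits."""
    return math.log(proportion_correct / (1 - proportion_correct))

# Equal raw-score steps of .10 are not equal measure steps:
for p in (0.50, 0.60, 0.70, 0.80, 0.90):
    print(f"proportion {p:.2f} -> {logit(p):+.2f} logits")
# 0.50 -> +0.00, 0.60 -> +0.41, 0.70 -> +0.85, 0.80 -> +1.39, 0.90 -> +2.20
```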

Given the appropriate measurement focus on the individual, the instrument is initially calibrated and measures are estimated in a simultaneous conjoint process. Once the instrument is calibrated, the item estimates can be anchored, measures can be routinely produced from them, and new items can be calibrated into the system, and others dropped, over time. This method has been the norm in admissions, certification, licensure, and high stakes testing for decades (Fisher & Wright, 1994; Bezruczko, 2005).

Measurement modelling of individual response processes has to be stochastic, or else we run into the attenuation paradox (Engelhard, 1993, 1994). This is the situation in which a deterministic progression of observations from one end of the instrument to the other produces apparently error-free data strings that look like this (1 being a correct answer, a higher rating, or the presence of an attribute, and 0 being incorrect, a lower rating, or the absence of the attribute):

00000000000

10000000000

11000000000

11100000000

11110000000

11111000000

11111100000

11111110000

11111111000

11111111100

11111111110

11111111111

In this situation, strings with all 0s and all 1s give no information useful for estimating measures (rows) or calibrations (columns). It is as though some of the people are shorter than the first unit on the ruler, and others are taller than the top unit. We don’t really have any way of knowing how short or tall they are, so their rows drop out. But eliminating the top and bottom rows makes the leftmost and rightmost columns all 0s and 1s, and eliminating them then gives new rows with all 0s and 1s, etc., until there’s no data left. (See my Revisiting Reliability blog for evaluations of five different probabilistically-structured data sets of this kind simulated to contrast various approaches to assessing reliability and internal consistency.)
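
The unraveling described in the last paragraph can be followed step by step in a short sketch like the one below (Python, applied to the deterministic progression shown above); it simply keeps dropping rows and columns of all 0s or all 1s until nothing remains.

```python
def drop_extremes(matrix):
    """Iteratively remove rows and columns containing all 0s or all 1s."""
    rows = [list(r) for r in matrix]
    while rows and rows[0]:
        extreme_rows = [i for i, r in enumerate(rows) if len(set(r)) == 1]
        columns = list(zip(*rows))
        extreme_cols = [j for j, c in enumerate(columns) if len(set(c)) == 1]
        if not extreme_rows and not extreme_cols:
            break
        rows = [r for i, r in enumerate(rows) if i not in extreme_rows]
        rows = [[v for j, v in enumerate(r) if j not in extreme_cols] for r in rows]
    return rows

# The deterministic progression shown above: 12 strings over 11 items
guttman = [[1] * k + [0] * (11 - k) for k in range(12)]
print(drop_extremes(guttman))  # [] -- every row and column eventually drops out
```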

The problem for estimation (Linacre, 1991, 1999, 2000) in data like those shown above is that the lack of informational overlaps between the columns, on the one hand, and between the rows, on the other, gives us no basis for knowing how much more of the variable is represented by any one item relative to any other, or by any one person measured relative to any other. In addition, whenever we actually construct measures of abilities, attitudes, or behaviors that conform with this kind of Guttman (1950) structure (Andrich, 1985; Douglas & Wright, 1989; Engelhard, 2008), the items have to be of such markedly different difficulties or agreeabilities that the results tend to involve large numbers of indistinguishable groups of respondents. But when that information is present in a probabilistically consistent way, we have an example of the phenomenon of stochastic resonance (Fisher, 1992b), so called because of the way noise amplifies weak deterministic signals (Andò & Graziani, 2000; Benzi, Sutera, & Vulpiani, 1981; Bulsara & Gammaitoni, 1996; Dykman & McClintock, 1998; Schimansky-Geier, Freund, Neiman, & Shulgin, 1998).

We need the noise, but we can’t let it overwhelm the system. We have to be able to know how much error there is relative to actual signal. Reliability is traditionally defined (Guilford 1965, pp. 439-40) as an estimate of this relation of signal and noise:

“The reliability of any set of measurements is logically defined as the proportion of their variance that is true variance…. We think of the total variance of a set of measures as being made up of two sources of variance: true variance and error variance… The true measure is assumed to be the genuine value of whatever is being measured… The error components occur independently and at random.”

Traditional reliability coefficients, like Cronbach’s alpha, are correlational, implementing a statistical model of group-level information. Error is taken to be the unexplained portion of the variance:

“In his description of alpha Cronbach (1951) proved (1) that alpha is the mean of all possible split-half coefficients, (2) that alpha is the value expected when two random samples of items from a pool like those in the given test are correlated, and (3) that alpha is a lower bound to the proportion of test variance attributable to common factors among the items” (Hattie, 1985, pp. 143-4).
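
For readers who have not worked with it directly, here is a minimal sketch of the conventional group-level computation of Cronbach’s alpha from a persons-by-items matrix (Python, with a small hypothetical data set); note that it uses only variances, with no individual error terms anywhere.

```python
def cronbach_alpha(scores):
    """Coefficient alpha computed from a persons x items matrix of numeric item scores."""
    n_items = len(scores[0])
    totals = [sum(person) for person in scores]

    def variance(values):  # population variance, as in the usual alpha formula
        mean = sum(values) / len(values)
        return sum((v - mean) ** 2 for v in values) / len(values)

    item_variances = [variance([person[i] for person in scores]) for i in range(n_items)]
    return (n_items / (n_items - 1)) * (1 - sum(item_variances) / variance(totals))

# Hypothetical dichotomous responses: four persons by three items
data = [[1, 1, 1], [1, 1, 0], [1, 0, 0], [0, 0, 0]]
print(cronbach_alpha(data))  # 0.75
```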

But measurement models of individual-level response processes (Rasch, 1960; Andrich, 1988; Wright, 1977; Fisher & Wright, 1994; Bond & Fox, 2007; Wilson, 2005; Bezruczko, 2005) employ individual-level error estimates (Wright, 1977; Wright & Stone, 1979; Wright & Masters, 1982), not correlational group-level variance estimates. The individual measurement errors are statistically equivalent to sampling confidence intervals, as is evident in both Wright’s equations and in plots of errors and confidence intervals (see Figure 4 in Fisher, 2008). That is, error and confidence intervals both decline at the same rate with larger numbers of item responses per person, or larger numbers of person responses per item.
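
A rough sense of that rate of decline can be had from the usual Rasch approximation for a person’s standard error of measurement, the reciprocal of the square root of the summed binomial variances of the responses. The sketch below (Python, assuming all items are perfectly targeted so that each response variance is about 0.25) illustrates the principle only; it does not describe any particular data set.

```python
import math

def approximate_person_error(n_items, p=0.5):
    """Approximate Rasch standard error for a person responding to n equally well-targeted items."""
    return 1 / math.sqrt(n_items * p * (1 - p))

for n in (10, 25, 50, 100):
    print(f"{n:>3} items -> SEM of about {approximate_person_error(n):.2f} logits")
# 10 -> 0.63, 25 -> 0.40, 50 -> 0.28, 100 -> 0.20
```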

This phenomenon has a constructive application in instrument design. If a reasonable expectation for the measurement standard deviation can be formulated and related to the error expected on the basis of the number of items and response categories, a good estimate of the measurement reliability can be read off a nomograph (Linacre, 1993).
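
The relation the nomograph packages can also be written out directly: once an expected measure standard deviation and an expected root mean square error are in hand (both in logits), the implied reliability follows. The numbers below are hypothetical.

```python
def expected_reliability(measure_sd, rmse):
    """Reliability implied by an expected measure SD and an expected root mean square error."""
    return (measure_sd ** 2 - rmse ** 2) / measure_sd ** 2

# e.g., an expected SD of 2.0 logits with an expected error of 0.5 logits
print(round(expected_reliability(2.0, 0.5), 2))  # 0.94
```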

Wright (Wright & Masters, 1982, pp. 92, 106; Wright, 1996) introduced several vitally important measurement precision concepts and tools that follow from access to individual person and item error estimates. They improve on the traditional KR-20 or Cronbach reliability coefficients because the individualized error estimates better account for the imprecisions of mistargeted instruments, and for missing data, and so more accurately and conservatively estimate reliability.

Wright and Masters introduce a new reliability statistic, G, the measurement separation reliability index. The availability of individual error estimates makes it possible to estimate the true variance of the measures more directly, by subtracting the mean square error from the total variance. The standard deviation based on this estimate of true variance is then made the numerator of a ratio, G, having the root mean square error as its denominator.
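
A minimal sketch of that computation, using hypothetical person measures and their individual standard errors (all in logits), looks like this:

```python
import math

def separation_index(measures, errors):
    """G: the true standard deviation of the measures divided by their root mean square error."""
    n = len(measures)
    mean = sum(measures) / n
    observed_variance = sum((m - mean) ** 2 for m in measures) / n
    mean_square_error = sum(e ** 2 for e in errors) / n
    true_variance = max(observed_variance - mean_square_error, 0.0)
    return math.sqrt(true_variance) / math.sqrt(mean_square_error)

# Hypothetical person measures and individual errors, in logits
measures = [-1.2, -0.4, 0.1, 0.6, 1.5]
errors = [0.45, 0.38, 0.36, 0.39, 0.48]
print(round(separation_index(measures, errors), 2))  # about 1.96
```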

Each unit increase in this G index then represents another multiple of the error unit in the amount of quantitative variation present in the measures. This multiple is nonlinearly represented in the traditional reliability coefficients expressed in the 0.00 – 1.00 range, such that the same separation index unit difference is found in the 0.00 to 0.50, 0.50 to 0.80, 0.80 to 0.90, 0.90 to 0.94, 0.94 to 0.96, and 0.96 to 0.97 reliability ranges (see Fisher, 1992a, for a table of values; available online: see references).

G can also be estimated as the square root of the ratio of the reliability to one minus the reliability. Conversely, a reliability coefficient roughly equivalent to Cronbach’s alpha is estimated as G squared divided by one plus G squared (that is, as the true variance divided by the sum of the true and error variances). Because individual error estimates are inflated in the presence of missing data and when an instrument is mistargeted and measures tend toward the extremes, the Rasch-based reliability coefficients tend to be more conservative than Cronbach’s alpha, which hides these sources of error within its variances and correlations. For a comparison of the G separation index, the G reliability coefficient, and Cronbach’s alpha over five simulated data sets, see the Reliability Revisited blog entry.
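
Those conversions, along with the nonlinear correspondence noted in the preceding paragraph, can be checked with a few lines of arithmetic (assuming the relation in which reliability equals G squared over one plus G squared):

```python
import math

def g_from_reliability(r):
    """Separation index implied by a reliability coefficient."""
    return math.sqrt(r / (1 - r))

def reliability_from_g(g):
    """Reliability coefficient implied by a separation index."""
    return g ** 2 / (1 + g ** 2)

# Roughly equal steps in G correspond to shrinking steps in reliability:
for r in (0.50, 0.80, 0.90, 0.94, 0.96):
    print(f"reliability {r:.2f} -> separation G of about {g_from_reliability(r):.1f}")
# 0.50 -> 1.0, 0.80 -> 2.0, 0.90 -> 3.0, 0.94 -> 4.0, 0.96 -> 4.9
```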

Error estimates can be made more conservative yet by multiplying each individual error term by the larger of either 1.0 or the square root of the associated individual mean square fit statistic for that case (Wright, 1995). (The mean square fit statistics are chi-squares divided by their degrees of freedom, and so have an expected value of 1.00; see Smith (1996) for more on fit, and see my recent blog, Revisiting Reliability, for more on the conceptualization and evaluation of reliability relative to fit.)
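
A sketch of that inflation rule, with hypothetical error and infit values:

```python
import math

def fit_inflated_error(standard_error, infit_mean_square):
    """Inflate an individual error by the square root of its mean square fit when fit is worse than expected."""
    return standard_error * max(1.0, math.sqrt(infit_mean_square))

print(round(fit_inflated_error(0.40, 0.85), 2))  # overfitting case: error stays at 0.40
print(round(fit_inflated_error(0.40, 1.44), 2))  # misfitting case: 0.40 * 1.2 = 0.48
```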

Wright and Masters (1982, pp. 92, 105-6) also introduce the concept of strata, ranges on the measurement continuum with centers separated by three errors. Strata are in effect a more forgiving expression of the separation reliability index, G, since the latter approximates strata with centers separated by four errors. An estimate of strata defined as having centers separated by four errors is very nearly identical with the separation index. If three errors define a 95% confidence interval, four are equivalent to 99% confidence.
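
Strata are commonly computed from G in the Rasch literature with the formula (4G + 1) / 3; the sketch below simply applies it to a couple of hypothetical separation values.

```python
def strata(separation_index):
    """Number of measurement levels with centers three errors apart: (4G + 1) / 3."""
    return (4 * separation_index + 1) / 3

print(round(strata(2.0), 2))  # G = 2.0 -> 3.0 strata
print(round(strata(3.0), 2))  # G = 3.0 -> about 4.33 strata
```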

There is a particular relevance in all of this for practical applications involving the combination or aggregation of physical, chemical, and other previously calibrated measures. This is illustrated in, for instance, the use of chemical indicators in assessing disease severity, environmental pollution, etc. Though any individual measure of the amount of a chemical or compound is valid within the limits of its intended purpose, to arrive at measures delineating disease severity, overall pollution levels, etc., the relevant instruments must be designed, tested, calibrated, and maintained, just as any instruments are (Alvarez, 2005; Cipriani, Fox, Khuder, et al., 2005; Fisher, Bernstein, et al., 2002; Fisher, Priest, Gilder, et al., 2008; Hughes, Perkins, Wright, et al., 2003; Perkins, Wright, & Dorsey, 2005; Wright, 2000).

The same methodology that is applied in this work, involving the rating or assessment of the quality of the outcomes or impacts counted, expressed as percentages, or given in an indicator’s native metric (parts per million, acres, number served, etc.), is needed in the management of all forms of human, social, and natural capital. (Watch this space for a forthcoming blog applying this methodology to the scaling of the UN Millennium Development Goals data.) The practical advantages of working from calibrated instrumentation in these contexts include data quality evaluations, the replacement of nonlinear percentages with linear measures, data volume reduction with no loss of information, and the integration of meaningful and substantive qualities with additive quantities on annotated metrics.

References

Alvarez, P. (2005). Several noncategorical measures define air pollution. In N. Bezruczko (Ed.), Rasch measurement in health sciences (pp. 277-93). Maple Grove, MN: JAM Press.

Andò, B., & Graziani, S. (2000). Stochastic resonance theory and applications. New York: Kluwer Academic Publishers.

Andrich, D. (1985). An elaboration of Guttman scaling with Rasch models for measurement. In N. B. Tuma (Ed.), Sociological methodology 1985 (pp. 33-80). San Francisco, California: Jossey-Bass.

Andrich, D. (1988). Rasch models for measurement (Sage University Paper Series on Quantitative Applications in the Social Sciences, no. 07-068). Beverly Hills, California: Sage Publications.

Benzi, R., Sutera, A., & Vulpiani, A. (1981). The mechanism of stochastic resonance. Journal of Physics. A. Mathematical and General, 14, L453-L457.

Bezruczko, N. (Ed.). (2005). Rasch measurement in health sciences. Maple Grove, MN: JAM Press.

Bond, T., & Fox, C. (2007). Applying the Rasch model: Fundamental measurement in the human sciences, 2d edition. Mahwah, New Jersey: Lawrence Erlbaum Associates.

Bulsara, A. R., & Gammaitoni, L. (1996, March). Tuning in to noise. Physics Today, 49, 39-45.

Cipriani, D., Fox, C., Khuder, S., & Boudreau, N. (2005). Comparing Rasch analyses probability estimates to sensitivity, specificity and likelihood ratios when examining the utility of medical diagnostic tests. Journal of Applied Measurement, 6(2), 180-201.

Cronbach, L. J. (1951). Coefficient alpha and the internal structure of tests. Psychometrika, 16(3), 297-334.

Douglas, G. A., & Wright, B. D. (1989). Response patterns and their probabilities. Rasch Measurement Transactions, 3(4), 75-77 [http://www.rasch.org/rmt/rmt34.htm].

Dykman, M. I., & McClintock, P. V. E. (1998, January 22). What can stochastic resonance do? Nature, 391(6665), 344.

Engelhard, G., Jr. (1993). What is the attenuation paradox? Rasch Measurement Transactions, 6(4), 257 [http://www.rasch.org/rmt/rmt64.htm].

Engelhard, G., Jr. (1994). Resolving the attenuation paradox. Rasch Measurement Transactions, 8(3), 379.

Engelhard, G. (2008, July). Historical perspectives on invariant measurement: Guttman, Rasch, and Mokken. Measurement: Interdisciplinary Research & Perspectives, 6(3), 155-189.

Fisher, W. P., Jr. (1992a). Reliability statistics. Rasch Measurement Transactions, 6(3), 238 [http://www.rasch.org/rmt/rmt63i.htm].

Fisher, W. P., Jr. (1992b, Spring). Stochastic resonance and Rasch measurement. Rasch Measurement Transactions, 5(4), 186-187 [http://www.rasch.org/rmt/rmt54k.htm].

Fisher, W. P., Jr. (2008, Summer). The cash value of reliability. Rasch Measurement Transactions, 22(1), 1160-3 [http://www.rasch.org/rmt/rmt221.pdf].

Fisher, W. P., Jr., Bernstein, L. H., Qamar, A., Babb, J., Rypka, E. W., & Yasick, D. (2002, February). At the bedside: Measuring patient outcomes. Advance for Administrators of the Laboratory, 11(2), 8, 10 [http://laboratory-manager.advanceweb.com/Article/At-the-Bedside-7.aspx].

Fisher, W. P., Jr., Priest, E., Gilder, R., Blankenship, D., & Burton, E. C. (2008, July 3-6). Development of a novel heart failure measure to identify hospitalized patients at risk for intensive care unit admission. Presented at the World Congress on Controversies in Cardiovascular Diseases [http://www.comtecmed.com/ccare/2008/authors_abstract.aspx#Author15], Intercontinental Hotel, Berlin, Germany.

Fisher, W. P., Jr., & Wright, B. D. (Eds.). (1994). Applications of probabilistic conjoint measurement. International Journal of Educational Research, 21(6), 557-664.

Guilford, J. P. (1965). Fundamental statistics in psychology and education (4th ed.). New York: McGraw-Hill.

Guttman, L. (1950). The basis for scalogram analysis. In S. A. Stouffer et al. (Eds.), Studies in social psychology in World War II. Volume 4: Measurement and prediction (pp. 60-90). New York: Wiley.

Hattie, J. (1985, June). Methodology review: Assessing unidimensionality of tests and items. Applied Psychological Measurement, 9(2), 139-64.

Hughes, L., Perkins, K., Wright, B. D., & Westrick, H. (2003). Using a Rasch scale to characterize the clinical features of patients with a clinical diagnosis of uncertain, probable or possible Alzheimer disease at intake. Journal of Alzheimer’s Disease, 5(5), 367-373.

Linacre, J. M. (1991, Spring). Stochastic Guttman order. Rasch Measurement Transactions, 5(4), 189 [http://www.rasch.org/rmt/rmt54p.htm].

Linacre, J. M. (1993). Rasch-based generalizability theory. Rasch Measurement Transactions, 7(1), 283-284; [http://www.rasch.org/rmt/rmt71h.htm].

Linacre, J. M. (1999). Understanding Rasch measurement: Estimation methods for Rasch measures. Journal of Outcome Measurement, 3(4), 382-405.

Linacre, J. M. (2000, Autumn). Guttman coefficients and Rasch data. Rasch Measurement Transactions, 14(2), 746-7 [http://www.rasch.org/rmt/rmt142e.htm].

Perkins, K., Wright, B. D., & Dorsey, J. K. (2005). Using Rasch measurement with medical data. In N. Bezruczko (Ed.), Rasch measurement in health sciences (pp. 221-34). Maple Grove, MN: JAM Press.

Rasch, G. (1960). Probabilistic models for some intelligence and attainment tests (Reprint, with Foreword and Afterword by B. D. Wright, Chicago: University of Chicago Press, 1980). Copenhagen, Denmark: Danmarks Paedogogiske Institut.

Schimansky-Geier, L., Freund, J. A., Neiman, A. B., & Shulgin, B. (1998). Noise induced order: Stochastic resonance. International Journal of Bifurcation and Chaos, 8(5), 869-79.

Smith, R. M. (2000). Fit analysis in latent trait measurement models. Journal of Applied Measurement, 1(2), 199-218.

Wilson, M. (2005). Constructing measures: An item response modeling approach. Mahwah, New Jersey: Lawrence Erlbaum Associates.

Wright, B. D. (1977). Solving measurement problems with the Rasch model. Journal of Educational Measurement, 14(2), 97-116 [http://www.rasch.org/memo42.htm].

Wright, B. D. (1995, Summer). Which standard error? Rasch Measurement Transactions, 9(2), 436-437 [http://www.rasch.org/rmt/rmt92n.htm].

Wright, B. D. (1996, Winter). Reliability and separation. Rasch Measurement Transactions, 9(4), 472 [http://www.rasch.org/rmt/rmt94n.htm].

Wright, B. D. (2000). Rasch regression: My recipe. Rasch Measurement Transactions, 14(3), 758-9 [http://www.rasch.org/rmt/rmt143u.htm].

Wright, B. D., & Masters, G. N. (1982). Rating scale analysis: Rasch measurement. Chicago, Illinois: MESA Press.

Wright, B. D., & Stone, M. H. (1979). Best test design: Rasch measurement. Chicago, Illinois: MESA Press.


Reliability Revisited: Distinguishing Consistency from Error

August 28, 2009

When something is meaningful to us, and we understand it, then we can successfully restate it in our own words and predictably reproduce approximately the same representation across situations as was obtained in the original formulation. When data fit a Rasch model, the implications are (1) that different subsets of items (that is, different ways of composing a series of observations summarized in a sufficient statistic) will all converge on the same pattern of person measures, and (2) that different samples of respondents or examinees will all converge on the same pattern of item calibrations. The meaningfulness of propositions based in these patterns will then not depend on which collection of items (instrument) or sample of persons is obtained, and all instruments might be equated relative to a single universal, uniform metric so that the same symbols reliably represent the same amount of the same thing.
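
For readers new to it, the dichotomous Rasch model referred to throughout this post can be stated in one line: the probability of a success depends only on the difference between the person’s measure and the item’s calibration. A minimal sketch, with hypothetical values in logits:

```python
import math

def rasch_probability(person_measure, item_calibration):
    """Probability of a correct or affirmative response under the dichotomous Rasch model."""
    return 1 / (1 + math.exp(-(person_measure - item_calibration)))

# A person 1.0 logits above an item's calibration succeeds about 73% of the time
print(round(rasch_probability(1.5, 0.5), 2))  # 0.73
```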

Statistics and research methods textbooks in psychology and the social sciences commonly make statements like the following about reliability: “Reliability is consistency in measurement. The reliability of individual scale items increases with the number of points in the item. The reliability of the complete scale increases with the number of items.” (These sentences are found at the top of p. 371 in Experimental Methods in Psychology, by Gustav Levine and Stanley Parkinson (Lawrence Erlbaum Associates, 1994).) The unproven, perhaps unintended, and likely unfounded implication of these statements is that consistency increases as items are added.

Despite the popularity of doing so, Green, Lissitz, and Mulaik (1977) argue that reliability coefficients are misused when they are interpreted as indicating the extent to which data are internally consistent. “Green et al. (1977) observed that though high ‘internal consistency’ as indexed by a high alpha results when a general factor runs through the items, this does not rule out obtaining high alpha when there is no general factor running through the test items…. They concluded that the chief defect of alpha as an index of dimensionality is its tendency to increase as the number of items increase” (Hattie, 1985, p. 144).

In addressing the internal consistency of data, the implicit but incompletely realized purpose of estimating scale reliability is to evaluate the extent to which sum scores function as sufficient statistics. How limited is reliability as a tool for this purpose? To answer this question, five dichotomous data sets of 23 items and 22 persons were simulated. The first one was constructed so as to be highly likely to fit a Rasch model, with a deliberately orchestrated probabilistic Guttman pattern. The second one was made nearly completely random. The third, fourth, and fifth data sets were modifications of the first one in which increasing numbers of increasingly inconsistent responses were introduced. (The inconsistencies were not introduced in any systematic way apart from inserting contrary responses in the ordered matrix.) The data sets are shown in the Appendix. Tables 1 and 2 summarize the results.
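
The five data sets themselves are given in the Appendix; the sketch below is only a rough illustration of how data of the first and second kinds might be generated (a probabilistic Guttman pattern from Rasch-model probabilities, and pure noise), with made-up person and item values and no attempt to reproduce the actual matrices.

```python
import math
import random

random.seed(1)  # hypothetical seed; the actual data sets were constructed by hand

def simulate_rasch(person_measures, item_calibrations):
    """Dichotomous responses generated from Rasch-model probabilities."""
    data = []
    for b in person_measures:
        row = []
        for d in item_calibrations:
            p = 1 / (1 + math.exp(-(b - d)))
            row.append(1 if random.random() < p else 0)
        data.append(row)
    return data

persons = [0.3 * i - 3.0 for i in range(22)]   # 22 person measures spread over about 6 logits
items = [0.25 * j - 2.75 for j in range(23)]   # 23 item calibrations
probabilistic_guttman = simulate_rasch(persons, items)                        # cf. data set 1
pure_noise = [[random.randint(0, 1) for _ in range(23)] for _ in range(22)]   # cf. data set 2
```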

Table 1 shows that the reliability coefficients do in fact decrease, along with the global model fit log-likelihood chi-squares, as the amount of randomness and inconsistency is increased. Contrary to what is implied in Levine and Parkinson’s statements, however, reliability can vary within a given number of items, as it might across different data sets produced from the same test, survey, or assessment, depending on how much structural invariance is present within them.

Two other points about the tables are worthy of note. First, the Rasch-based person separation reliability coefficients drop at a faster rate than Cronbach’s alpha does. This is probably an effect of the individualized error estimates in the Rasch context, which make its reliability coefficients more conservative than those based on correlational, group-level error estimates. (It is worth noting, as well, that the Winsteps and SPSS estimates of Cronbach’s alpha match. They are reported to one fewer decimal place by Winsteps, but the third decimal place is shown for the SPSS values for contrast.)

Second, the fit statistics are most affected by the initial and most glaring introduction of inconsistencies, in data set three. As the randomness in the data increases, the reliabilities continue to drop, but the fit statistics improve, culminating in the case of data set two, where complete randomness results in near-perfect model fit. This is, of course, the situation in which both the instrument and the sample are as well targeted as they can be, since all respondents have about the same measure and all the items about the same calibration; see Wood (1978) for a commentary on this situation, where coin tosses fit a Rasch model.

Table 2 shows the results of the Winsteps Principal Components Analysis of the standardized residuals for all five data sets. Again, the results conform with and support the pattern shown in the reliability coefficients. It is, however, interesting to note that, for data sets 4 and 5, with their Cronbach’s alphas of about .89 and .80, respectively, which are typically deemed quite good, the PCA shows more variance left unexplained than is explained by the Rasch dimension. The PCA is suggesting that two or more constructs might be represented in the data, but this would never be known from Cronbach’s alpha alone.
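
For those who want to see what such an analysis involves computationally, the following is a rough sketch (Python with NumPy, hypothetical inputs) of a principal components analysis of standardized Rasch residuals; Winsteps’ actual procedure differs in its details, so this is only meant to convey the idea of looking for structure in what the Rasch dimension leaves unexplained.

```python
import numpy as np

def residual_pca_eigenvalues(observed, person_measures, item_calibrations):
    """Eigenvalues of the inter-item correlations of standardized Rasch residuals."""
    b = np.asarray(person_measures, dtype=float)[:, None]
    d = np.asarray(item_calibrations, dtype=float)[None, :]
    expected = 1 / (1 + np.exp(-(b - d)))                      # Rasch expected values
    z = (np.asarray(observed, dtype=float) - expected) / np.sqrt(expected * (1 - expected))
    eigenvalues = np.linalg.eigvalsh(np.corrcoef(z, rowvar=False))
    return sorted(eigenvalues, reverse=True)                   # largest first
```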

Alpha alone would indicate the presence of a unidimensional construct for data sets 3, 4 and 5, despite large standard deviations in the fit statistics and even though more than half the variance cannot be explained by the primary dimension. Worse, for the fifth data set, more variance is captured in the first three contrasts than is explained by the Rasch dimension. But with Cronbach’s alpha at .80, most researchers would consider this scale quite satisfactorily unidimensional and internally consistent.

These results suggest that, first, in seeking high reliability, what is sought more fundamentally is fit to a Rasch model (Andrich & Douglas, 1977; Andrich, 1982; Wright, 1977). That is, in addressing the internal consistency of data, the popular conception of reliability is taking on the concerns of construct validity. A conceptually clearer sense of reliability focuses on the extent to which an instrument works as expected every time it is used, in the sense of the way a car can be reliable. For instance, with an alpha of .70, a screening tool would be able to reliably distinguish measures into two statistically distinct groups (Fisher, 1992; Wright, 1996), problematic and typical. Within the limits of this purpose, the tool would meet the need for the repeated production of information capable of meeting the needs of the situation. Applications in research, accountability, licensure/certification, or diagnosis, however, might demand alphas of .95 and the kind of precision that allows for statistically distinct divisions into six or more groups. In these kinds of applications, where experimental designs or practical demands require more statistical power, measurement precision articulates finer degrees of differences. Finely calibrated instruments provide sensitivity over the entire length of the measurement continuum, which is needed for repeated reproductions of the small amounts of change that might accrue from hard to detect treatment effects.
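
The groupings mentioned in this paragraph follow from the standard conversions relating reliability, separation, and statistically distinct strata (Fisher, 1992; Wright, 1996). Writing R for the reliability and G for the separation index, and working the arithmetic through:

G = \sqrt{\frac{R}{1-R}}, \qquad \text{strata} = \frac{4G + 1}{3}.

With R = .70, G \approx 1.53 and the strata come to about 2.4, or two statistically distinct groups; with R = .95, G \approx 4.36 and the strata come to about 6.1, or six distinct groups.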

Separating the construct, internal consistency, and unidimensionality issues from the repeatability and reproducibility of a given degree of measurement precision provides a much-needed conceptual and methodological clarification of reliability. This clarification is routinely made in Rasch measurement applications (Andrich, 1982; Andrich & Douglas, 1977; Fisher, 1992; Linacre, 1993, 1996, 1997). It is reasonable to want the error estimates and reliability coefficients to account for inconsistencies in the data, and so errors and reliabilities are routinely reported in terms of both the modeled expectations and a fit-inflated form (Wright, 1995). The fundamental value of proceeding from a basis in individual error and fit statistics (Wright, 1996) is that local imprecisions and failures of invariance can be isolated for further study and selective attention.

The results of the simulated data analyses suggest, second, that reliability coefficients used in isolation can be misleading. As Green et al. (1977) point out, reliability estimates tend to increase systematically as the number of items increases (Fisher, 2008). The simulated data show that reliability coefficients also decrease systematically as inconsistency increases.
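
The dependence of reliability on test length noted by Green et al. follows the familiar Spearman-Brown prophecy formula; worked through for one case:

R_n = \frac{n R_1}{1 + (n - 1) R_1},

where R_1 is the reliability of the existing test and n the factor by which its length is changed. Doubling a test with a reliability of .70, for instance, is expected to raise the coefficient to about 2(.70)/1.70 = .82, without any change in the consistency of the underlying responses.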

The primary problem with relying on reliability coefficients alone as indications of data consistency hinges on their inability to reveal the location of departures from modeled expectations. Most uses of reliability coefficients take place in contexts in which the model remains unstated and expectations are not formulated or compared with observations. The best that can be done in the absence of a model statement and test of data fit to it is to compare the reliability obtained against that expected on the basis of the number of items and response categories, relative to the observed standard deviation in the scores, expressed in logits (Linacre, 1993). One might then raise questions as to targeting, data consistency, etc. in order to explain larger than expected differences.

A more methodical way, however, is to employ multiple avenues of approach to the evaluation of the data, including model fit statistics and Principal Components Analysis in the evaluation of differential item and person functioning. Being able to see which individual observations depart the furthest from modeled expectation can provide illuminating qualitative information on the meaningfulness of the data, the measures, and the calibrations, or the lack thereof. This information is crucial for correcting data entry errors, identifying sources of differential item or person functioning, separating constructs and populations, and improving the instrument. The power of this approach to data quality evaluation, relative to reliance on reliability coefficients alone, is multiplied many times over when the researcher sets up a nested series of iterative dialectics in which repeated data analyses explore various hypotheses as to what the construct is, and in which these analyses feed into revisions to the instrument, its administration, and/or the population sampled.

For instance, following the point made by Smith (1996), it may be expected that the PCA results will illuminate the presence of multiple constructs in the data with greater clarity than the fit statistics, when there are nearly equal numbers of items representing each different measured dimension. But the PCA does not work as well as the fit statistics when there are only a few items and/or people exhibiting inconsistencies.

This work should result in a full-circle return to the drawing board (Wright, 1994; Wright & Stone, 2003), such that a theory of the measured construct ultimately provides rigorously precise predictive control over item calibrations, in the manner of the Lexile Framework (Stenner et al., 2006) or developmental theories of hierarchical complexity (Dawson, 2004). Given that the five data sets employed here were simulations with no associated item content, the invariant stability and meaningfulness of the construct cannot be illustrated or annotated. But such illustration is also implicit in the quest for reliable instrumentation: the evidentiary basis for a delineation of meaningful expressions of amounts of the thing measured. The hope to be gleaned from the successes in theoretical prediction achieved to date is that we might arrive at practical applications of psychosocial measures that are as meaningful, useful, and economically productive as the theoretical applications of electromagnetism, thermodynamics, and the like that we take for granted in the technologies of everyday life.

Table 1

Reliability and Consistency Statistics

22 Persons, 23 Items, 506 Data Points

| Data set | Intended reliability | Person separation reliability (Winsteps Real/Model) | Cronbach’s alpha (Winsteps/SPSS) | Person Infit/Outfit avg. MnSq (Winsteps) | Person Infit/Outfit SD (Winsteps) | Item separation reliability (Winsteps Real/Model) | Item Infit/Outfit avg. MnSq (Winsteps) | Item Infit/Outfit SD (Winsteps) | Log-likelihood chi-square/d.f./p |
|---|---|---|---|---|---|---|---|---|---|
| First | Best | .96/.97 | .96/.957 | 1.04/.35 | .49/.25 | .95/.96 | 1.08/.35 | .36/.19 | 185/462/1.00 |
| Second | Worst | .00/.00 | .00/-1.668 | 1.00/1.00 | .05/.06 | .00/.00 | 1.00/1.00 | .05/.06 | 679/462/.0000 |
| Third | Good | .90/.91 | .93/.927 | .92/2.21 | .30/2.83 | .85/.88 | .90/2.13 | .64/3.43 | 337/462/.9996 |
| Fourth | Fair | .86/.87 | .89/.891 | .96/1.91 | .25/2.18 | .79/.83 | .94/1.68 | .53/2.27 | 444/462/.7226 |
| Fifth | Poor | .76/.77 | .80/.797 | .98/1.15 | .24/.67 | .59/.65 | .99/1.15 | .41/.84 | 550/462/.0029 |

Table 2

Principal Components Analysis

| Data set | Intended reliability | % raw variance explained by measures/persons/items | % raw variance captured in first three contrasts | Number of loadings with absolute value > .40 in first contrast |
|---|---|---|---|---|
| First | Best | 76/41/35 | 12 | 8 |
| Second | Worst | 4.3/1.7/2.6 | 56 | 15 |
| Third | Good | 59/34/25 | 20 | 14 |
| Fourth | Fair | 47/27/20 | 26 | 13 |
| Fifth | Poor | 29/17/11 | 41 | 15 |

References

Andrich, D. (1982, June). An index of person separation in Latent Trait Theory, the traditional KR-20 index, and the Guttman scale response pattern. Education Research and Perspectives, 9(1), http://www.rasch.org/erp7.htm.

Andrich, D. & G. A. Douglas. (1977). Reliability: Distinctions between item consistency and subject separation with the simple logistic model. Paper presented at the Annual Meeting of the American Educational Research Association, New York.

Dawson, T. L. (2004, April). Assessing intellectual development: Three approaches, one sequence. Journal of Adult Development, 11(2), 71-85.

Fisher, W. P., Jr. (1992). Reliability statistics. Rasch Measurement Transactions, 6(3), 238  [http://www.rasch.org/rmt/rmt63i.htm].

Fisher, W. P., Jr. (2008, Summer). The cash value of reliability. Rasch Measurement Transactions, 22(1), 1160-3.

Green, S. B., Lissitz, R. W., & Mulaik, S. A. (1977, Winter). Limitations of coefficient alpha as an index of test unidimensionality. Educational and Psychological Measurement, 37(4), 827-833.

Hattie, J. (1985, June). Methodology review: Assessing unidimensionality of tests and items. Applied Psychological Measurement, 9(2), 139-64.

Levine, G., & Parkinson, S. (1994). Experimental methods in psychology. Mahwah, NJ: Lawrence Erlbaum Associates, Inc.

Linacre, J. M. (1993). Rasch-based generalizability theory. Rasch Measurement Transactions, 7(1), 283-284 [http://www.rasch.org/rmt/rmt71h.htm].

Linacre, J. M. (1996). True-score reliability or Rasch statistical validity? Rasch Measurement Transactions, 9(4), 455 [http://www.rasch.org/rmt/rmt94a.htm].

Linacre, J. M. (1997). KR-20 or Rasch reliability: Which tells the “Truth?”. Rasch Measurement Transactions, 11(3), 580-1 [http://www.rasch.org/rmt/rmt113l.htm].

Smith, R. M. (1996). A comparison of methods for determining dimensionality in Rasch measurement. Structural Equation Modeling, 3(1), 25-40.

Stenner, A. J., Burdick, H., Sanford, E. E., & Burdick, D. S. (2006). How accurate are Lexile text measures? Journal of Applied Measurement, 7(3), 307-22.

Wood, R. (1978). Fitting the Rasch model: A heady tale. British Journal of Mathematical and Statistical Psychology, 31, 27-32.

Wright, B. D. (1977). Solving measurement problems with the Rasch model. Journal of Educational Measurement, 14(2), 97-116 [http://www.rasch.org/memo42.htm].

Wright, B. D. (1980). Foreword, Afterword. In Probabilistic models for some intelligence and attainment tests, by Georg Rasch (pp. ix-xix, 185-199. http://www.rasch.org/memo63.htm) [Reprint; original work published in 1960 by the Danish Institute for Educational Research]. Chicago, Illinois: University of Chicago Press.

Wright, B. D. (1994, Summer). Theory construction from empirical observations. Rasch Measurement Transactions, 8(2), 362 [http://www.rasch.org/rmt/rmt82h.htm].

Wright, B. D. (1995, Summer). Which standard error? Rasch Measurement Transactions, 9(2), 436-437 [http://www.rasch.org/rmt/rmt92n.htm].

Wright, B. D. (1996, Winter). Reliability and separation. Rasch Measurement Transactions, 9(4), 472 [http://www.rasch.org/rmt/rmt94n.htm].

Wright, B. D., & Stone, M. H. (2003). Five steps to science: Observing, scoring, measuring, analyzing, and applying. Rasch Measurement Transactions, 17(1), 912-913 [http://www.rasch.org/rmt/rmt171j.htm].

Appendix

Data Set 1

01100000000000000000000
10100000000000000000000
11000000000000000000000
11100000000000000000000
11101000000000000000000
11011000000000000000000
11100100000000000000000
11110100000000000000000
11111010100000000000000
11111101000000000000000
11111111010101000000000
11111111101010100000000
11111111111010101000000
11111111101101010010000
11111111111010101100000
11111111111111010101000
11111111111111101010100
11111111111111110101011
11111111111111111010110
11111111111111111111001
11111111111111111111101
11111111111111111111100

Data Set 2

01101010101010101001001
10100101010101010010010
11010010101010100100101
10101001010101001001000
01101010101010110010011
11011010010101100100101
01100101001001001001010
10110101000110010010100
01011010100100100101001
11101101001001001010010
11011010010101010100100
10110101101010101001001
01101011010000101010010
11010110101001010010100
10101101010000101101010
11011010101010010101010
10110101010101001010101
11101010101010110101011
11010101010101011010110
10101010101010110111001
01010101010101101111101
10101010101011011111100

Data Set 3

01100000000000100000010
10100000000000000010001
11000000000000100000010
11100000000000100000000
11101000000000100010000
11011000000000000000000
11100100000000100000000
11110100000000000000000
11111010100000100000000
11111101000000000000000
11111111010101000000000
11111111101010100000000
11111111111010001000000
11011111111111010010000
11011111111111101100000
11111111111111010101000
11011111111111101010100
11111111111111010101011
11011111111111111010110
11111111111111111111001
11011111111111111111101
10111111111111111111110

Data Set 4

01100000000000100010010
10100000000000000010001
11000000000000100000010
11100000000000100000001
11101000000000100010000
11011000000000000010000
11100100000000100010000
11110100000000000000000
11111010100000100010000
11111101000000000000000
11111011010101000010000
11011110111010100000000
11111111011010001000000
11011111101011110010000
11011111101101101100000
11111111110101010101000
11011111111011101010100
11111111111101110101011
01011111111111011010110
10111111111111111111001
11011111111111011111101
10111111111111011111110

Data Set 5

11100000010000100010011
10100000000000000011001
11000000010000100001010
11100000010000100000011
11101000000000100010010
11011000000000000010011
11100100000000100010000
11110100000000000000011
11111010100000100010000
00000000000011111111111
11111011010101000010000
11011110111010100000000
11111111011010001000000
11011111101011110010000
11011111101101101100000
11111111110101010101000
11011111101011101010100
11111111111101110101011
01011111111111011010110
10111111101111111111001
11011111101111011111101
00111111101111011111110


Statistics and Measurement: Clarifying the Differences

August 26, 2009

Measurement is qualitatively and paradigmatically quite different from statistics, even though statistics obviously play important roles in measurement, and vice versa. The perception of measurement as conceptually difficult stems in part from its rearrangement of most of the concepts that we take for granted in the statistical paradigm as landmarks of quantitative thinking. When we recognize and accept the qualitative differences between statistics and measurement, they both become easier to understand.

Statistical analyses are commonly referred to as quantitative, even though the numbers analyzed usually have not been derived from the mapping of an invariant substantive unit onto a number line. Measurement takes such mapping as its primary concern, focusing on the quantitative meaningfulness of numbers (Falmagne & Narens, 1983; Luce, 1978; Marcus-Roberts & Roberts, 1987; Mundy, 1986; Narens, 2002; Roberts, 1999). Statistical models focus on group processes and relations among variables, while measurement models focus on individual processes and relations within variables (Duncan, 1992; Duncan & Stenbeck, 1988; Rogosa, 1987). Statistics makes assumptions about factors beyond its control, while measurement sets requirements for objective inference (Andrich, 1989). Statistics primarily involves data analysis, while measurement primarily calibrates instruments in common metrics for interpretation at the point of use (Cohen, 1994; Fisher, 2000; Guttman, 1985; Goodman, 1999a-c; Rasch, 1960).

Statistics focuses on making the most of the data in hand, while measurement focuses on using the data in hand to inform (a) instrument calibration and improvement, and (b) the prediction and efficient gathering of meaningful new data on individuals in practical applications. Where statistical “measures” are defined inherently by a particular analytic method, measures read from calibrated instruments—and the raw observations informing these measures—need not be computerized for further analysis.

Because statistical “measures” are usually derived from ordinal raw scores, changes to the instrument change their meaning, resulting in a strong inclination to avoid improving the instrument. Measures, in contrast, take missing data into account, so their meaning remains invariant over instrument configurations, resulting in a firm basis for the emergence of a measurement quality improvement culture. So statistical “measurement” begins and ends with data analysis, whereas measurement from calibrated instruments is in a constant cycle of application, new item calibrations, and critical recalibrations that require only intermittent resampling.
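
As a concrete illustration of what it means for a measure to retain its meaning across instrument configurations, the sketch below estimates a person measure by maximum likelihood from whichever calibrated items happen to have been administered; the item bank values and responses are hypothetical, and the routine is deliberately minimal.

import numpy as np

def person_measure(responses, difficulties, tol=1e-6):
    # Maximum likelihood estimate of a person measure (in logits) from
    # dichotomous responses to any subset of items with known calibrations.
    # (Extreme scores of zero or perfect would need special handling.)
    x = np.asarray(responses, dtype=float)
    b = np.asarray(difficulties, dtype=float)
    theta = 0.0
    for _ in range(100):                      # Newton-Raphson iterations
        p = 1 / (1 + np.exp(-(theta - b)))
        step = (x.sum() - p.sum()) / (p * (1 - p)).sum()
        theta += step
        if abs(step) < tol:
            break
    return theta

# Hypothetical bank of calibrated items (logits). Any subset of the bank
# yields a measure in the same metric, so incomplete data are not fatal.
bank = np.array([-2.0, -1.0, -0.5, 0.0, 0.5, 1.0, 2.0])
print(person_measure([1, 1, 1, 0], bank[[0, 2, 4, 6]]))
print(person_measure([1, 1, 0], bank[[1, 3, 5]]))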

The vast majority of statistical methods and models make strong assumptions about the nature of the unit of measurement, but provide either very limited ways of checking those assumptions, or no checks at all. Statistical models are descriptive in nature, meaning that models are fit to data, that the validity of the data is beyond the immediate scope of interest, and that the model accounting for the most variation is regarded as best. Finally, and perhaps most importantly, statistical models are inherently oriented toward the relations among variables at the level of samples and populations.

Measurement models, however, impose strong requirements on data quality in order to achieve the unit of measurement that is easiest to think with, one that stays constant and remains invariant across the local particulars of instrument and sample. Measurement methods and models, then, provide extensive and varied ways of checking the quality of the unit, and so must be prescriptive rather than descriptive. That is, measurement models define the data quality that must be obtained for objective inference. In the measurement paradigm, data are fit to models, data quality is of paramount interest, and data quality evaluation must be informed as much by qualitative criteria as by quantitative.

To repeat the most fundamental point, measurement models are oriented toward individual-level response processes, not group-level aggregate processes. Herbert Blumer pointed out as early as 1930 that quantitative method is not equivalent to statistical method, and that the natural sciences had conspicuous degrees of success long before the emergence of statistical techniques (Hammersley, 1989, pp. 113-4). Both the initial scientific revolution in the 16th-17th centuries and the second scientific revolution of the 19th century found a basis in measurement for publicly objective and reproducible results, but statistics played little or no role in the major discoveries of the times.

The scientific value of statistics resides largely in the reproducibility of cross-variable data relations, and statisticians widely agree that statistical analyses should depend only on sufficient statistics (Arnold, 1982, p. 79). Measurement theoreticians and practitioners also agree, but the sufficiency of the mean and standard deviation relative to a normal distribution is one thing, and the sufficiency of individual responses relative to an invariant construct is quite another (Andersen, 1977; Arnold, 1985; Dynkin, 1951; Fischer, 1981; Hall, Wijsman, & Ghosh, 1965; van der Linden, 1992).

It is of historical interest, though, to point out that Rasch, foremost proponent of the latter, attributes credit for the general value of the concept of sufficiency to Ronald Fisher, foremost proponent of the former. Rasch’s strong statements concerning the fundamental inferential value of sufficiency (Andrich, 1997; Rasch, 1977; Wright, 1980) would seem to contradict his repeated joke about burning all the statistics texts making use of the normal distribution (Andersen, 1995, p. 385) were it not for the paradigmatic distinction between statistical models of group-level relations among variables, and measurement models of individual processes. Indeed, this distinction is made on the first page of Rasch’s (1980) book.

Now we are in a position to appreciate a comment by Ernest Rutherford, the winner of the 1908 Nobel Prize in Chemistry, who held that, if you need statistics to understand the results of your experiment, then you should have designed a better experiment (Wise, 1995, p. 11). A similar point was made by Feinstein (1995) concerning meta-analysis. The rarely appreciated point is that the generalizable replication and application of results depend heavily on the existence of a portable and universally uniform observational framework. The inferences, judgments, and adjustments that can be made at the point of use by clinicians, teachers, managers, etc. provided with additive measures expressed in a substantively meaningful common metric far outstrip those that can be made using ordinal measures expressed in instrument- and sample-dependent scores. See Andrich (1989, 2002, 2004), Cohen (1994), Davidoff (1999), Duncan (1992), Embretson (1996), Goodman (1999a, 1999b, 1999c), Guttman (1981, 1985), Meehl (1967), Michell (1986), Rogosa (1987), Romanoski and Douglas (2002), and others for more on this distinction between statistics and measurement.

These contrasts show that the confounding of statistics and measurement is a problem of vast significance that persists in spite of repeated efforts to clarify the distinction. For a wide variety of reasons ranging from cultural presuppositions about the nature of number to the popular notion that quantification is as easy as assigning numbers to observations, measurement is not generally well understood by the public (or even by statisticians!). And so statistics textbooks rarely, if ever, include even passing mention of instrument calibration methods, metric equating processes, the evaluation of data quality relative to the requirements of objective inference, traceability to metrological reference standards, or the integration of qualitative and quantitative methods in the interpretation of measures.

Similarly, in business, marketing, health care, and quality improvement circles, we find near-universal repetition of the mantra, “You manage what you measure,” with very little or no attention paid to the quality of the numbers treated as measures. And so, we find ourselves stuck with so-called measurement systems where,

• instead of linear measures defined by a unit that remains constant across samples and instruments we saddle ourselves with nonlinear scores and percentages defined by units that vary in unknown ways across samples and instruments;
• instead of availing ourselves of the capacity to take missing data into account, we hobble ourselves with the need for complete data;
• instead of dramatically reducing data volume with no loss of information, we insist on constantly re-enacting the meaningless ritual of poring over indigestible masses of numbers;
• instead of adjusting measures for the severity or leniency of judges assigning ratings, we allow measures to depend unfairly on which rater happens to make the observations;
• instead of using methods that give the same result across different distributions, we restrict ourselves to ones that give different results when assumptions of normality are not met and/or standard deviations differ;
• instead of calibrating instruments in an experimental test of the hypothesis that the intended construct is in fact structured in such a way as to make its mapping onto a number line meaningful, we assign numbers and make quantitative inferences with no idea as to whether they relate at all to anything real;
• instead of checking to see whether rating scales work as intended, with higher ratings consistently representing more of the variable, we make assumptions that may be contradicted by the order and spacing of the way rating scales actually work in practice;
• instead of defining a comprehensive framework for interpreting measures relative to a construct, we accept the narrow limits of frameworks defined by the local sample and items;
• instead of capitalizing on the practicality and convenience of theories capable of accurately predicting item calibrations and measures apart from data, we counterproductively define measurement empirically in terms of data analysis;
• instead of putting calibrated tools into the hands of front-line managers, service representatives, teachers and clinicians, we require them to submit to cumbersome data entry, analysis, and reporting processes that defeat the purpose of measurement by ensuring the information provided is obsolete by the time it gets back to the person who could act on it; and
• instead of setting up efficient systems for communicating meaningful measures in common languages with shared points of reference, we settle for inefficient systems for communicating meaningless scores in local incommensurable languages.

Because measurement is simultaneously ubiquitous and rarely well understood, we find ourselves in a world that gives near-constant lip service to the importance of measurement while it does almost nothing to provide measures that behave the way we assume they do. This state of affairs seems to have emerged in large part due to our failure to distinguish between the group-level orientation of statistics and the individual-level orientation of measurement. We seem to have been seduced by a variation on what Whitehead (1925, pp. 52-8) called the fallacy of misplaced concreteness. That is, we have assumed that the power of lawful regularities in thought and behavior would be best revealed and acted on via statistical analyses of data that themselves embody the aggregate mass of the patterns involved.

It now appears, however, in light of studies in the history of science (Latour, 1987, 2005; Wise, 1995), that an alternative and likely more successful approach will be to capitalize on the “wisdom of crowds” (Surowiecki, 2004) phenomenon of collective, distributed cognition (Akkerman, et al., 2007; Douglas, 1986; Hutchins, 1995; Magnus, 2007). This will be done by embodying lawful regularities in instruments calibrated in ideal, abstract, and portable metrics put to work by front-line actors on mass scales (Fisher, 2000, 2005, 2009a, 2009b). In this way, we will inform individual decision processes and structure communicative transactions with efficiencies, meaningfulness, substantive effectiveness, and power that go far beyond anything that could be accomplished by trying to make inferences about individuals from group-level statistics.

We ought not accept the factuality of data as the sole criterion of objectivity, with all theory and instruments constrained by and focused on the passing ephemera of individual sets of local particularities. Properly defined and operationalized via a balanced interrelation of theory, data, and instrument, advanced measurement is not a mere mathematical exercise but offers a wealth of advantages and conveniences that cannot otherwise be obtained. We ignore its potentials at our peril.

References
Akkerman, S., Van den Bossche, P., Admiraal, W., Gijselaers, W., Segers, M., Simons, R.-J., et al. (2007, February). Reconsidering group cognition: From conceptual confusion to a boundary area between cognitive and socio-cultural perspectives? Educational Research Review, 2, 39-63.

Andersen, E. B. (1977). Sufficient statistics and latent trait models. Psychometrika, 42(1), 69-81.

Andersen, E. B. (1995). What George Rasch would have thought about this book. In G. H. Fischer & I. W. Molenaar (Eds.), Rasch models: Foundations, recent developments, and applications (pp. 383-390). New York: Springer-Verlag.

Andrich, D. (1989). Distinctions between assumptions and requirements in measurement in the social sciences. In J. A. Keats, R. Taft, R. A. Heath & S. H. Lovibond (Eds.), Mathematical and Theoretical Systems: Proceedings of the 24th International Congress of Psychology of the International Union of Psychological Science, Vol. 4 (pp. 7-16). North-Holland: Elsevier Science Publishers.

Andrich, D. (1997). Georg Rasch in his own words [excerpt from a 1979 interview]. Rasch Measurement Transactions, 11(1), 542-3. [http://www.rasch.org/rmt/rmt111.htm#Georg].

Andrich, D. (2002). Understanding resistance to the data-model relationship in Rasch’s paradigm: A reflection for the next generation. Journal of Applied Measurement, 3(3), 325-59.

Andrich, D. (2004, January). Controversy and the Rasch model: A characteristic of incompatible paradigms? Medical Care, 42(1), I-7–I-16.

Arnold, S. F. (1982-1988). Sufficient statistics. In S. Kotz, N. L. Johnson & C. B. Read (Eds.), Encyclopedia of Statistical Sciences (pp. 72-80). New York: John Wiley & Sons.

Arnold, S. F. (1985, September). Sufficiency and invariance. Statistics & Probability Letters, 3, 275-279.

Cohen, J. (1994). The earth is round (p < 0.05). American Psychologist, 49, 997-1003.

Davidoff, F. (1999, 15 June). Standing statistics right side up (Editorial). Annals of Internal Medicine, 130(12), 1019-1021.

Douglas, M. (1986). How institutions think. Syracuse, New York: Syracuse University Press.

Dynkin, E. B. (1951). Necessary and sufficient statistics for a family of probability distributions. Selected Translations in Mathematical Statistics and Probability, 1, 23-41.

Duncan, O. D. (1992, September). What if? Contemporary Sociology, 21(5), 667-668.

Duncan, O. D., & Stenbeck, M. (1988). Panels and cohorts: Design and model in the study of voting turnout. In C. C. Clogg (Ed.), Sociological Methodology 1988 (pp. 1-35). Washington, DC: American Sociological Association.

Embretson, S. E. (1996, September). Item Response Theory models and spurious interaction effects in factorial ANOVA designs. Applied Psychological Measurement, 20(3), 201-212.

Falmagne, J.-C., & Narens, L. (1983). Scales and meaningfulness of quantitative laws. Synthese, 55, 287-325.

Feinstein, A. R. (1995, January). Meta-analysis: Statistical alchemy for the 21st century. Journal of Clinical Epidemiology, 48(1), 71-79.

Fischer, G. H. (1981, March). On the existence and uniqueness of maximum-likelihood estimates in the Rasch model. Psychometrika, 46(1), 59-77.

Fisher, W. P., Jr. (2000). Objectivity in psychosocial measurement: What, why, how. Journal of Outcome Measurement, 4(2), 527-563.

Fisher, W. P., Jr. (2005). Daredevil barnstorming to the tipping point: New aspirations for the human sciences. Journal of Applied Measurement, 6(3), 173-9.

Fisher, W. P., Jr. (2009a). Bringing human, social, and natural capital to life: Practical consequences and opportunities. In M. Wilson, K. Draney, N. Brown & B. Duckor (Eds.), Advances in Rasch Measurement, Vol. Two (p. in press). Maple Grove, MN: JAM Press.

Fisher, W. P., Jr. (2009b, July). Invariance and traceability for measures of human, social, and natural capital: Theory and application. Measurement (Elsevier), in press.

Goodman, S. N. (1999a, 6 April). Probability at the bedside: The knowing of chances or the chances of knowing? (Editorial). Annals of Internal Medicine, 130(7), 604-6.

Goodman, S. N. (1999b, 15 June). Toward evidence-based medical statistics. 1: The p-value fallacy. Annals of Internal Medicine, 130(12), 995-1004.

Goodman, S. N. (1999c, 15 June). Toward evidence-based medical statistics. 2: The Bayes factor. Annals of Internal Medicine, 130(12), 1005-1013.

Guttman, L. (1981). What is not what in theory construction. In I. Borg (Ed.), Multidimensional data representations: When & why. Ann Arbor, MI: Mathesis Press.

Guttman, L. (1985). The illogic of statistical inference for cumulative science. Applied Stochastic Models and Data Analysis, 1, 3-10.

Hall, W. J., Wijsman, R. A., & Ghosh, J. K. (1965). The relationship between sufficiency and invariance with applications in sequential analysis. Annals of Mathematical Statistics, 36, 575-614.

Hammersley, M. (1989). The dilemma of qualitative method: Herbert Blumer and the Chicago Tradition. New York: Routledge.

Hutchins, E. (1995). Cognition in the wild. Cambridge, Massachusetts: MIT Press.

Latour, B. (1987). Science in action: How to follow scientists and engineers through society. New York: Cambridge University Press.

Latour, B. (1995). Cogito ergo sumus! Or psychology swept inside out by the fresh air of the upper deck: Review of Hutchins’ Cognition in the Wild, MIT Press, 1995. Mind, Culture, and Activity: An International Journal, 3(192), 54-63.

Latour, B. (2005). Reassembling the social: An introduction to Actor-Network-Theory. (Clarendon Lectures in Management Studies). Oxford, England: Oxford University Press.

Luce, R. D. (1978, March). Dimensionally invariant numerical laws correspond to meaningful qualitative relations. Philosophy of Science, 45, 1-16.

Magnus, P. D. (2007). Distributed cognition and the task of science. Social Studies of Science, 37(2), 297-310.

Marcus-Roberts, H., & Roberts, F. S. (1987). Meaningless statistics. Journal of Educational Statistics, 12(4), 383-394.

Meehl, P. E. (1967). Theory-testing in psychology and physics: A methodological paradox. Philosophy of Science, 34(2), 103-115.

Michell, J. (1986). Measurement scales and statistics: A clash of paradigms. Psychological Bulletin, 100, 398-407.

Mundy, B. (1986, June). On the general theory of meaningful representation. Synthese, 67(3), 391-437.

Narens, L. (2002, December). A meaningful justification for the representational theory of measurement. Journal of Mathematical Psychology, 46(6), 746-68.

Rasch, G. (1960). Probabilistic models for some intelligence and attainment tests (Reprint, with Foreword and Afterword by B. D. Wright, Chicago: University of Chicago Press, 1980). Copenhagen, Denmark: Danmarks Paedogogiske Institut.

Rasch, G. (1977). On specific objectivity: An attempt at formalizing the request for generality and validity of scientific statements. Danish Yearbook of Philosophy, 14, 58-94.

Roberts, F. S. (1999). Meaningless statements. In R. Graham, J. Kratochvil, J. Nesetril & F. Roberts (Eds.), Contemporary trends in discrete mathematics, DIMACS Series, Volume 49 (pp. 257-274). Providence, RI: American Mathematical Society.

Rogosa, D. (1987). Casual [sic] models do not support scientific conclusions: A comment in support of Freedman. Journal of Educational Statistics, 12(2), 185-95.

Romanoski, J. T., & Douglas, G. (2002). Rasch-transformed raw scores and two-way ANOVA: A simulation analysis. Journal of Applied Measurement, 3(4), 421-430.

Stevens, S. S. (1951). Mathematics, measurement, and psychophysics. In S. S. Stevens (Ed.), Handbook of experimental psychology (pp. 1-49). New York: John Wiley & Sons.

Surowiecki, J. (2004). The wisdom of crowds: Why the many are smarter than the few and how collective wisdom shapes business, economies, societies and nations. New York: Doubleday.

van der Linden, W. J. (1992). Sufficient and necessary statistics. Rasch Measurement Transactions, 6(3), 231 [http://www.rasch.org/rmt/rmt63d.htm].

Whitehead, A. N. (1925). Science and the modern world. New York: Macmillan.

Wise, M. N. (Ed.). (1995). The values of precision. Princeton, New Jersey: Princeton University Press.

Wright, B. D. (1980). Foreword, Afterword. In Probabilistic models for some intelligence and attainment tests, by Georg Rasch (pp. ix-xix, 185-199. http://www.rasch.org/memo63.htm) [Reprint; original work published in 1960 by the Danish Institute for Educational Research]. Chicago, Illinois: University of Chicago Press.


Contesting the Claim, Part III: References

July 24, 2009

References

Andersen, E. B. (1977). Sufficient statistics and latent trait models. Psychometrika, 42(1), 69-81.

Andersen, E. B. (1995). What George Rasch would have thought about this book. In G. H. Fischer & I. W. Molenaar (Eds.), Rasch models: Foundations, recent developments, and applications (pp. 383-390). New York: Springer-Verlag.

Andrich, D. (1988). Rasch models for measurement. (Vols. series no. 07-068). Sage University Paper Series on Quantitative Applications in the Social Sciences). Beverly Hills, California: Sage Publications.

Andrich, D. (1998). Thresholds, steps and rating scale conceptualization. Rasch Measurement Transactions, 12(3), 648-9 [http://209.238.26.90/rmt/rmt1239.htm].

Arnold, S. F. (1985, September). Sufficiency and invariance. Statistics & Probability Letters, 3, 275-279.

Bond, T., & Fox, C. (2001). Applying the Rasch model: Fundamental measurement in the human sciences. Mahwah, New Jersey: Lawrence Erlbaum Associates.

Burdick, D. S., Stone, M. H., & Stenner, A. J. (2006). The Combined Gas Law and a Rasch Reading Law. Rasch Measurement Transactions, 20(2), 1059-60 [http://www.rasch.org/rmt/rmt202.pdf].

Burdick, H., & Stenner, A. J. (1996). Theoretical prediction of test items. Rasch Measurement Transactions, 10(1), 475 [http://www.rasch.org/rmt/rmt101b.htm].

Choi, E. (1998, Spring). Rasch invents “Ounces.” Popular Measurement, 1(1), 29 [http://www.rasch.org/pm/pm1-29.pdf].

Cohen, J. (1994). The earth is round (p < 0.05). American Psychologist, 49, 997-1003.

DeBoeck, P., & Wilson, M. (Eds.). (2004). Explanatory item response models: A generalized linear and nonlinear approach. (Statistics for Social and Behavioral Sciences). New York: Springer-Verlag.

Dynkin, E. B. (1951). Necessary and sufficient statistics for a family of probability distributions. Selected Translations in Mathematical Statistics and Probability, 1, 23-41.

Embretson, S. E. (1996, September). Item Response Theory models and spurious interaction effects in factorial ANOVA designs. Applied Psychological Measurement, 20(3), 201-212.

Falmagne, J.-C., & Narens, L. (1983). Scales and meaningfulness of quantitative laws. Synthese, 55, 287-325.

Fischer, G. H. (1981, March). On the existence and uniqueness of maximum-likelihood estimates in the Rasch model. Psychometrika, 46(1), 59-77.

Fischer, G. H. (1995). Derivations of the Rasch model. In G. Fischer & I. Molenaar (Eds.), Rasch models: Foundations, recent developments, and applications (pp. 15-38). New York: Springer-Verlag.

Fisher, W. P., Jr. (1988). Truth, method, and measurement: The hermeneutic of instrumentation and the Rasch model [diss]. Dissertation Abstracts International, 49, 0778A, Dept. of Education, Division of the Social Sciences: University of Chicago (376 pages, 23 figures, 31 tables).

Fisher, W. P., Jr. (1997). Physical disability construct convergence across instruments: Towards a universal metric. Journal of Outcome Measurement, 1(2), 87-113.

Fisher, W. P., Jr. (1997, June). What scale-free measurement means to health outcomes research. Physical Medicine & Rehabilitation State of the Art Reviews, 11(2), 357-373.

Fisher, W. P., Jr. (1999). Foundations for health status metrology: The stability of MOS SF-36 PF-10 calibrations across samples. Journal of the Louisiana State Medical Society, 151(11), 566-578.

Fisher, W. P., Jr. (2000). Objectivity in psychosocial measurement: What, why, how. Journal of Outcome Measurement, 4(2), 527-563.

Fisher, W. P., Jr. (2004, October). Meaning and method in the social sciences. Human Studies: A Journal for Philosophy and the Social Sciences, 27(4), 429-54.

Fisher, W. P., Jr. (2008, Summer). The cash value of reliability. Rasch Measurement Transactions, 22(1), 1160-3 [http://www.rasch.org/rmt/rmt221.pdf].

Fisher, W. P., Jr. (2009, July). Invariance and traceability for measures of human, social, and natural capital: Theory and application. Measurement (Elsevier), in press.

Goodman, S. N. (1999, 15 June). Toward evidence-based medical statistics. 1: The p-value fallacy. Annals of Internal Medicine, 130(12), 995-1004.

Guttman, L. (1985). The illogic of statistical inference for cumulative science. Applied Stochastic Models and Data Analysis, 1, 3-10.

Hall, W. J., Wijsman, R. A., & Ghosh, J. K. (1965). The relationship between sufficiency and invariance with applications in sequential analysis. Annals of Mathematical Statistics, 36, 575-614.

Linacre, J. M. (1993). Rasch-based generalizability theory. Rasch Measurement Transactions, 7(1), 283-284 [http://www.rasch.org/rmt/rmt71h.htm].

Luce, R. D., & Tukey, J. W. (1964). Simultaneous conjoint measurement: A new kind of fundamental measurement. Journal of Mathematical Psychology, 1(1), 1-27.

Meehl, P. E. (1967). Theory-testing in psychology and physics: A methodological paradox. Philosophy of Science, 34(2), 103-115.

Meehl, P. E. (1978). Theoretical risks and tabular asterisks: Sir Karl, Sir Ronald, and the slow progress of soft psychology. Journal of Consulting and Clinical Psychology, 46, 806-34.

Michell, J. (1999). Measurement in psychology: A critical history of a methodological concept. Cambridge: Cambridge University Press.

Moulton, M. (1993). Probabilistic mapping. Rasch Measurement Transactions, 7(1), 268 [http://www.rasch.org/rmt/rmt71b.htm].

Mundy, B. (1986, June). On the general theory of meaningful representation. Synthese, 67(3), 391-437.

Narens, L. (2002). Theories of meaningfulness (S. W. Link & J. T. Townsend, Eds.). Scientific Psychology Series. Mahwah, New Jersey: Lawrence Erlbaum Associates.

Newby, V. A., Conner, G. R., Grant, C. P., & Bunderson, C. V. (2009). The Rasch model and additive conjoint measurement. Journal of Applied Measurement, 10(4), 348-354.

Pelton, T., & Bunderson, V. (2003). The recovery of the density scale using a stochastic quasi-realization of additive conjoint measurement. Journal of Applied Measurement, 4(3), 269-81.

Ramsay, J. O., Bloxom, B., & Cramer, E. M. (1975, June). Review of Foundations of Measurement, Vol. 1, by D. H. Krantz et al. Psychometrika, 40(2), 257-262.

Rasch, G. (1960). Probabilistic models for some intelligence and attainment tests (Reprint, with Foreword and Afterword by B. D. Wright, Chicago: University of Chicago Press, 1980). Copenhagen, Denmark: Danmarks Paedogogiske Institut.

Roberts, F. S., & Rosenbaum, Z. (1986). Scale type, meaningfulness, and the possible psychophysical laws. Mathematical Social Sciences, 12, 77-95.

Romanoski, J. T., & Douglas, G. (2002). Rasch-transformed raw scores and two-way ANOVA: A simulation analysis. Journal of Applied Measurement, 3(4), 421-430.

Rozeboom, W. W. (1960). The fallacy of the null-hypothesis significance test. Psychological Bulletin, 57(5), 416-428.

Smith, R. M., & Taylor, P. (2004). Equating rehabilitation outcome scales: Developing common metrics. Journal of Applied Measurement, 5(3), 229-42.

Thurstone, L. L. (1928). Attitudes can be measured. American Journal of Sociology, XXXIII, 529-544. Reprinted in L. L. Thurstone, The Measurement of Values. Midway Reprint Series. Chicago, Illinois: University of Chicago Press, 1959, pp. 215-233.

van der Linden, W. J. (1992). Sufficient and necessary statistics. Rasch Measurement Transactions, 6(3), 231 [http://www.rasch.org/rmt/rmt63d.htm].

Velleman, P. F., & Wilkinson, L. (1993). Nominal, ordinal, interval, and ratio typologies are misleading. The American Statistician, 47(1), 65-72.

Wright, B. D. (1977). Solving measurement problems with the Rasch model. Journal of Educational Measurement, 14(2), 97-116 [http://www.rasch.org/memo42.htm].

Wright, B. D. (1997, Winter). A history of social science measurement. Educational Measurement: Issues and Practice, pp. 33-45, 52 [http://www.rasch.org/memo62.htm].


Contesting the Claim, Part II: Are Rasch Measures Really as Objective as Physical Measures?

July 22, 2009

When a raw score is sufficient to the task of measurement, the model is the Rasch model; we can then estimate the parameters consistently and evaluate the fit of the data to the model. The invariance properties that follow from a sufficient statistic include virtually the entire class of invariant rules (Hall, Wijsman, & Ghosh, 1965; Arnold, 1985), and similar relationships with other key measurement properties follow from there (Fischer, 1981, 1995; Newby, Conner, Grant, & Bunderson, 2009; Wright, 1977, 1997).

What does this all actually mean? Imagine we were able to ask an infinite number of people an infinite number of questions that all work together to measure the same thing. Because (1) the scores are sufficient statistics, (2) the ruler is not affected by what is measured, (3) the parameters separate, and (4) the data fit the model, any subset of the questions asked would give the same measure. This means that any subscore for any person measured would be a function of any and all other subscores. When a sufficient statistic is a function of all other sufficient statistics, it is not only sufficient, it is necessary, and is referred to as a minimally sufficient statistic. Thus, if separable, independent model parameters can be estimated, the model must be the Rasch model, and the raw score is both sufficient and necessary (Andersen, 1977; Dynkin, 1951; van der Linden, 1992).
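
The factorization behind these claims can be written out directly. With x_{ni} as person n's response to item i, \beta_n the person parameter, \delta_i the item calibration, and r_n the raw score, the standard derivation (not specific to any one of the sources cited above) runs:

P(x_{ni} = 1 \mid \beta_n, \delta_i) = \frac{e^{\,\beta_n - \delta_i}}{1 + e^{\,\beta_n - \delta_i}},
\qquad
P(\mathbf{x}_n \mid \beta_n, \boldsymbol{\delta}) = \frac{\exp\!\big(r_n \beta_n - \sum_i x_{ni}\,\delta_i\big)}{\prod_i \big(1 + e^{\,\beta_n - \delta_i}\big)},
\qquad
r_n = \sum_i x_{ni}.

Because \beta_n enters the likelihood only through r_n, the distribution of the response pattern conditional on r_n is free of \beta_n: the raw score carries all of the information in the data about the person measure, which is what its sufficiency means.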

This means that scores, ratings, and percentages actually stand for something measurable only when they fit a Rasch model.  After all, what actually would be the point of using data that do not support the estimation of independent parameters? If the meaning of the results is tied in unknown ways to the specific particulars of a given situation, then those results are meaningless, by definition (Roberts & Rosenbaum, 1986; Falmagne & Narens, 1983; Mundy, 1986; Narens, 2002; also see Embretson, 1996; Romanoski and Douglas, 2002). There would be no point in trying to learn anything from them, as whatever happened was a one-time unique event that tells us nothing we can use in any future event (Wright, 1977, 1997).

What we’ve done here is akin to taking a narrative stroll through a garden of mathematical proofs. These conceptual analyses can be very convincing, but actual demonstrations of them are essential. Demonstrations would be especially persuasive if there were some way of showing three things. First, shouldn’t there be some way of constructing ordinal ratings or scores for one or another physical variable that, when scaled, give us measures that are the same as the usual measures we are accustomed to?

This would show that we can use the type of instrument usually found in the social sciences to construct physical measures with the characteristics we expect. There are four available examples, in fact, involving paired comparisons of weights (Choi, 1998), measures of short lengths (Fisher, 1988), ratings of medium-range distances (Moulton, 1993), and a recovery of the density scale (Pelton & Bunderson, 2003). In each case, the Rasch-calibrated experimental instruments produced measures equivalent to the controls, as shown in linear plots of the pairs of measures.

A second thing to build out from the mathematical proofs is a set of experiments in which we check the purported stability of measures and calibrations. We can do this by splitting large data sets, using different groups of items to produce two or more measures for each person, or using different groups of respondents/examinees to provide data for two or more sets of item calibrations. This is a routine experimental procedure in many psychometric labs, and results tend to conform with theory, with strong associations found between increasing sample sizes and increasing reliability coefficients for the respective measures or calibrations. These associations can be plotted (Fisher, 2008), as can the pairs of calibrations estimated from different samples (Fisher, 1999) and the pairs of measures estimated from different instruments (Fisher, Harvey, Kilgore, et al., 1995; Smith & Taylor, 2004). The theoretical expectation of tighter plots for better-designed instruments, larger sample sizes, and longer tests is confirmed so regularly that it should itself have the status of a law of nature (Linacre, 1993).
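
A bare-bones version of the split-sample check described here might look like the following, using a crude log-odds (PROX-style) approximation to the item calibrations in place of a full Rasch estimation; the sample sizes and parameter ranges are arbitrary, and the point is simply that the two independently obtained sets of calibrations line up.

import numpy as np

rng = np.random.default_rng(1)

def rough_calibrations(data):
    # Centered log-odds of failure for each item: a crude, PROX-style
    # stand-in for properly estimated Rasch item calibrations.
    p = data.mean(axis=0).clip(0.01, 0.99)
    d = np.log((1 - p) / p)
    return d - d.mean()

# Simulate one Rasch-fitting data set, then split the sample in half.
theta = rng.normal(0, 1.5, size=1000)
delta = np.linspace(-2, 2, 25)
prob = 1 / (1 + np.exp(-(theta[:, None] - delta[None, :])))
data = (rng.random(prob.shape) < prob).astype(int)

half = rng.permutation(len(theta))
cal_a = rough_calibrations(data[half[:500]])
cal_b = rough_calibrations(data[half[500:]])
print(np.corrcoef(cal_a, cal_b)[0, 1])   # typically well above .95 at this sample size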

A third convincing demonstration is to compare studies of the same thing conducted in different times and places by different researchers using different instruments on different samples. If the instruments really measure the same thing, there will not only be obvious similarities in their item contents, but similar items will calibrate in similar positions on the metric across samples. Results of this kind have been obtained in at least three published studies (Fisher, 1997a, 1997b; Belyukova, Stone, & Fox, 2004).

All of these arguments are spelled out in greater length and detail, with illustrations, in a forthcoming article (Fisher, 2009). I learned all of this from Benjamin Wright, who worked directly with Rasch himself, and who, perhaps more importantly, was prepared for what he could learn from Rasch in his previous career as a physicist. Before encountering Rasch in 1960, Wright had worked with Feynman at Cornell, Townes at Bell Labs, and Mulliken at the University of Chicago. Taught and influenced not just by three of the great minds of twentieth-century physics, but also by Townes’ philosophical perspectives on meaning and beauty, Wright had left physics in search of life. He was happy to transfer his experience with computers into his new field of educational research, but he was dissatisfied with the quality of the data and how it was treated.

Rasch’s ideas gave Wright the conceptual tools he needed to integrate his scientific values with the demands of the field he was in. Over the course of his 40-year career in measurement, Wright wrote the first software for estimating Rasch model parameters and continuously improved it; he adapted new estimation algorithms for Rasch’s models and was involved in the articulation of new models; he applied the models to hundreds of data sets using his software; he vigorously invested himself in students and colleagues; he founded new professional societies, meetings, and journals;  and he never stopped learning how to think anew about measurement and the meaning of numbers. Through it all, there was always a yardstick handy as a simple way of conveying the basic requirements of measurement as we intuitively understand it in physical terms.

Those of us who spend a lot of time working with these ideas and trying them out on lots of different kinds of data forget, or never realize, how skewed our experience is relative to everyone else’s. You live in a different world when you have the sustained luxury of working with very large databases, as I have had, and you see the constancy and stability of well-designed measures and calibrations over time, across instruments, and over repeated samples ranging from 30 to several million.

When you have that experience, it becomes a basic description of reasonable expectation to read the work of a colleague and see him say that “when the key features of a statistical model relevant to the analysis of social science data are the same as those of the laws of physics, then those features are difficult to ignore” (Andrich, 1988, p. 22). After calibrating dozens of instruments over 25 years, some of them many times over, it just seems like the plainest statement of the obvious to see the same guy say “Our measurement principles should be the same for properties of rocks as for the properties of people. What we say has to be consistent with physical measurement” (Andrich, 1998, p. 3).

And I find myself wishing more people held the opinion expressed by two other colleagues, that “scientific measures in the social sciences must hold to the same standards as do measures in the physical sciences if they are going to lead to the same quality of generalizations” (Bond & Fox, 2001, p. 2). When these sentiments are taken to their logical conclusion in a practical application, the real value of “attempting for reading comprehension what Newtonian mechanics achieved for astronomy” (Burdick & Stenner, 1996) becomes apparent. Rasch’s analogy of the structure of his model for reading tests and Newton’s Second Law can be restated relative to any physical law expressed as universal conditionals among variable triplets; a theory of the variable measured capable of predicting item calibrations provides the causal story for the observed variation (Burdick, Stone, & Stenner, 2006; DeBoeck & Wilson, 2004).
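
One common way of writing the analogy, using the multiplicative form in which Rasch himself presented the model, is the following; the notation here is illustrative rather than drawn from any one of the sources just cited.

\frac{F_j}{M_v} = A_{vj}
\quad\longleftrightarrow\quad
\frac{\xi_v}{\delta_i} = \zeta_{vi},
\qquad
P(x_{vi} = 1) = \frac{\zeta_{vi}}{1 + \zeta_{vi}},

where force, mass, and acceleration on the left correspond to person ability, item difficulty, and the odds of success on the right, each triplet standing in the same lawful relation of two parameters to an observable outcome.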

Knowing what I know, from the mathematical principles I’ve been trained in and from the extensive experimental work I’ve done, it seems amazing that so little attention is actually paid to tools and concepts that receive daily lip service as to their central importance in every facet of life, from health care to education to economics to business. Measurement technology rose up decades ago in preparation for the demands of today’s challenges. It is just plain weird that we are not using it at anything near its potential.

I’m convinced, though, that the problem is not a matter of persuasive rhetoric applied to the minds of the right people. Rather, someone, hopefully me, has got to configure the right combination of players in the right situation at the right time and at the right place to create a new form of real value that can’t be created any other way. Like they say, money talks. Persuasion is all well and good, but things will really take off only when people see that better measurement can aid in removing inefficiencies from the management of human, social, and natural capital, that better measurement is essential to creating sustainable and socially responsible policies and practices, and that better measurement means new sources of profitability.  I’m convinced that advanced measurement techniques are really nothing more than a new form of IT or communications technology. They will fit right into the existing networks and multiply their efficiencies many times over.

And when they do, we may be in a position to finally

“confront the remarkable fact that throughout the gigantic range of physical knowledge numerical laws assume a remarkably simple form provided fundamental measurement has taken place. Although the authors cannot explain this fact to their own satisfaction, the extension to behavioral science is obvious: we may have to await fundamental measurement before we will see any real progress in quantitative laws of behavior. In short, ordinal scales (even continuous ordinal scales) are perhaps not good enough and it may not be possible to live forever with a dozen different procedures for quantifying the same piece of behavior, each making strong but untestable and basically unlikely assumptions which result in nonlinear plots of one scale against another. Progress in physics would have been impossibly difficult without fundamental measurement and the reader who believes that all that is at stake in the axiomatic treatment of measurement is a possible criterion for canonizing one scaling procedure at the expense of others is missing the point” (Ramsay, Bloxom, and Cramer, 1975, p. 262).


Contesting the Claim, Part I: Are Rasch Measures Really as Objective as Physical Measures?

July 21, 2009

Psychometricians, statisticians, metrologists, and measurement theoreticians tend to be pretty unassuming kinds of people. They’re unobtrusive and retiring, by and large. But there is one thing some of them are prone to say that will raise the ire of others in a flash, and the poor innocent geek will suddenly be subjected to previously unknown forms and degrees of social exclusion.

What is that one thing? “Instruments calibrated by fitting data to a Rasch model measure with the same kind of objectivity as is obtained with physical measures.” That’s one version. Another could be along these lines: “When data fit a Rasch model, we’ve discovered a pattern in human attitudes or behaviors so regular that it is conceptually equivalent to a law of nature.”

Maybe it is the implication of objectivity as something that must be politically incorrect that causes the looks of horror and recoiling retreats in the nonmetrically inclined when they hear things like this. Maybe it is the ingrained cultural predisposition to thinking such claims outrageously preposterous that makes those unfamiliar with 80 years of developments and applications so dismissive. Maybe it’s just fear of the unknown, or a desire not to have to be responsible for knowing something important that hardly anyone else knows.

Of course, it could just be a simple misunderstanding. When people hear the word “objective,” do most of them have an image of an object in mind? Does objectivity connote physical concreteness to most people? That doesn’t hold up well for me, since we can be objective about events and about things people do without any confusion over whether we can touch and feel what’s at issue.

No, I think something else is going on. I think it has to do with the persistent idea that objectivity requires a disconnected, alienated point of view, one that ignores the mutual implication of subject and object in favor of analytically tractable formulations of problems that, though solvable, are irrelevant to anything important or real. But that is hardly the only available meaning of objectivity, and it isn’t anywhere near the best. It certainly is not what is meant in the world of measurement theory and practice.

It’s better to think of objectivity as something having to do with things like the object of a conversation, or an object of linguistic reference: “chair” as referring to the entire class of all forms of seating technology, for instance. In these cases, we know right away that we’re dealing with what might be considered a heuristic ideal, an abstraction. It also helps to think of objectivity in terms of fairness and justice. After all, don’t we want our educational, health care, and social services systems to respect the equality of all individuals and their rights?

That is not, of course, how measurement theoreticians in psychology have always thought about objectivity. In fact, it was only 70-80 years ago that most psychologists gave up on objective measurement because they couldn’t find enough evidence of concrete phenomena to support the claims to objectivity they wanted to make (Michell, 1999). The focus on the reflex arc led a lot of psychologists into psychophysics, and the effects of operant conditioning led others to behaviorism. But a lot of the problems studied in these fields, though solvable, turned out to be uninteresting and unrelated to the larger issues of life demanding attention.

And so, with no physical entity that could be laid end-to-end and concatenated in the way weights are in a balance scale, psychologists just redefined measurement to suit what they perceived to be the inherent limits of their subject matter. Measurement didn’t have to be just ratio or interval, it could also be ordinal and even nominal. The important thing was to get numbers that could be statistically manipulated. That would provide more than enough credibility, or obfuscation, to create the appearance of legitimate science.

But while mainstream psychology was focused on hunting for statistically significant p-values, there were others trying to figure out if attitudes, abilities, and behaviors could be measured in a rigorously meaningful way.

Louis Thurstone, an electrical engineer turned psychologist, was among the first to formulate the problem. Writing in 1928, Thurstone rightly made the instrument itself the center of attention:

“The scale must transcend the group measured.–One crucial experimental test must be applied to our method of measuring attitudes before it can be accepted as valid. A measuring instrument must not be seriously affected in its measuring function by the object of measurement. To the extent that its measuring function is so affected, the validity of the instrument is impaired or limited. If a yardstick measured differently because of the fact that it was a rug, a picture, or a piece of paper that was being measured, then to that extent the trustworthiness of that yardstick as a measuring device would be impaired. Within the range of objects for which the measuring instrument is intended, its function must be independent of the object of measurement” (Thurstone, 1959, p. 228).

Thurstone aptly captures what is meant when it is said that attitudes, abilities, or behaviors can be measured with the same kind of objectivity as is obtained in the natural sciences. Objectivity is realized when a test, survey, or assessment functions the same way no matter who is being measured, and, conversely (Thurstone took this up, too), an attitude, ability, or behavior exhibits the same amount of what is measured no matter which instrument is used.

This claim, too, may seem to some to be so outrageously improbable as to be worthy of rejecting out of hand. After all, hasn’t everyone learned how the fact of being measured changes the measure? Thing is, this is just as true in physics and ecology as it is in psychiatry or sociology, and the natural sciences haven’t abandoned their claims to objectivity. So what’s up?

What’s up is that all sciences now have participant observers. The old Cartesian duality of the subject-object split still resides in various rhetorical habits and still shapes our choices and behaviors, but, in actual practice, scientific methods have always had to deal with the way questions imply particular answers.

And there’s more. Qualitative methods have grown out of some of the deep philosophical introspections of the twentieth century, such as phenomenology, hermeneutics, deconstruction, postmodernism, etc. But most researchers who are adopting qualitative methods over quantitative ones don’t know that the philosophers legitimating the new focuses on narrative, interpretation, and the construction of meaning did quite a lot of very good thinking about mathematics and quantitative reasoning. Much of my own published work engages with these philosophers to find new ways of thinking about measurement (Fisher, 2004, for instance). And there are some very interesting connections to be made that show quantification does not necessarily have to involve a positivist, subject-object split.

So where does that leave us? Well, with probability. Not in the sense of statistical hypothesis testing, but in the sense of calibrating instruments with known probabilistic characteristics. If the social sciences are ever to be scientific, null hypothesis significance tests are going to have to be replaced with universally uniform metrics embodying and deploying the regularities of natural laws, as is the case in the physical sciences. Various arguments on this issue have been offered for decades (Cohen, 1994; Meehl, 1967, 1978; Goodman, 1999; Guttman, 1985; Rozeboom, 1960). The point is not to proscribe allowable statistics based on scale type  (Velleman & Wilkinson, 1993). Rather, we need to shift and simplify the focus of inference from the statistical analysis of data to the calibration and distribution of instruments that support distributed cognition, unify networks, lubricate markets, and coordinate collective thinking and acting (Fisher, 2000, 2009). Persuasion will likely matter far less in resolving the matter than an ability to create new value, efficiencies, and profits.

In 1964, Luce and Tukey gave us another way of stating what Thurstone was getting at:

“The axioms of conjoint measurement apply naturally to problems of classical physics and permit the measurement of conventional physical quantities on ratio scales…. In the various fields, including the behavioral and biological sciences, where factors producing orderable effects and responses deserve both more useful and more fundamental measurement, the moral seems clear: when no natural concatenation operation exists, one should try to discover a way to measure factors and responses such that the ‘effects’ of different factors are additive.”

In other words, if we cannot find some physical thing that we can make add up the way numbers do, as we did with length, weight, volts, temperature, time, etc., then we ought to ask questions in a way that allows the answers to reveal the kind of patterns we expect to see when things do concatenate. What Thurstone and others working in his wake have done is to see that we could possibly do some things virtually in terms of abstract relations that we cannot do actually in terms of concrete relations.

The concept is no more difficult to comprehend than understanding the difference between playing solitaire with actual cards and writing a computer program to play solitaire with virtual cards. Either way, the same relationships hold.

A Danish mathematician, Georg Rasch, understood this. Working in the 1950s with data from psychological and reading tests, Rasch drew on his training in mathematics and the natural sciences to arrive at a conception of measurement that would apply equally well in the natural and human sciences. He realized that

“…the acceleration of a body cannot be determined; the observation of it is admittedly liable to … ‘errors of measurement’, but … this admittance is paramount to defining the acceleration per se as a parameter in a probability distribution — e.g., the mean value of a Gaussian distribution — and it is such parameters, not the observed estimates, which are assumed to follow the multiplicative law [acceleration = force / mass, or mass * acceleration = force].

“Thus, in any case an actual observation can be taken as nothing more than an accidental response, as it were, of an object — a person, a solid body, etc. — to a stimulus — a test, an item, a push, etc. — taking place in accordance with a potential distribution of responses — the qualification ‘potential’ referring to experimental situations which cannot possibly be [exactly] reproduced.

“In the cases considered [earlier in the book] this distribution depended on one relevant parameter only, which could be chosen such as to follow the multiplicative law.

“Where this law can be applied it provides a principle of measurement on a ratio scale of both stimulus parameters and object parameters, the conceptual status of which is comparable to that of measuring mass and force. Thus, … the reading accuracy of a child … can be measured with the same kind of objectivity as we may tell its weight …” (Rasch, 1960, p. 115).
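For readers who prefer symbols, here is the multiplicative law Rasch refers to, written for dichotomous (right/wrong) responses in the notation that has since become standard rather than in Rasch’s original symbols:

\[
P(x_{vi}=1) \;=\; \frac{\xi_v/\delta_i}{1+\xi_v/\delta_i}
\qquad\Longleftrightarrow\qquad
\ln\frac{P_{vi}}{1-P_{vi}} \;=\; \theta_v - b_i ,
\qquad \theta_v=\ln\xi_v,\;\; b_i=\ln\delta_i ,
\]

where \xi_v is the ability of person v and \delta_i the difficulty of item i. The two parameters enter only as the ratio \xi_v/\delta_i, just as force and mass enter the second law only as the ratio F/m; that shared structure is what licenses the comparison in the last sentence of the quotation.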

Rasch’s model not only sets out the parameters for data sufficient to the task of measurement, it lays out the relationships that must be found in data for objective results to be possible. Rasch studied with Ronald Fisher in London in 1935, deepened his understanding of statistical sufficiency with him, and then applied it in his measurement work, though not in the way most statisticians understand it. Yes, in the context of group-level statistics, sufficiency concerns the way the mean and the standard deviation carry all the information needed to reproduce a normal distribution. But sufficiency is something quite different in the context of individual-level measurement. Here, counts of correct answers or sums of ratings are sufficient statistics for the model’s person and item parameters when they contain all of the information in the data relevant to estimating those parameters, so that the parameters can be estimated independently of one another rather than remaining tied together. So despite his respect for Ronald Fisher and the concept of sufficiency, Rasch’s work with models and methods that worked equally well with many different kinds of distributions led him to jokingly suggest (Andersen, 1995, p. 385) that all textbooks mentioning the normal distribution should be burned!
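To make the idea concrete, here is a minimal sketch in Python, using hypothetical item difficulties and the standard dichotomous Rasch model, of what sufficiency means at the individual level: once the raw score is fixed, the probabilities of the particular response patterns that could have produced it no longer depend on the person’s ability at all.

import numpy as np
from itertools import product

def rasch_p(theta, deltas):
    # Probability of a correct response under the dichotomous Rasch model
    return 1.0 / (1.0 + np.exp(-(theta - deltas)))

def pattern_prob(pattern, theta, deltas):
    # Probability of a complete response pattern for one person
    p = rasch_p(theta, deltas)
    x = np.array(pattern)
    return float(np.prod(np.where(x == 1, p, 1.0 - p)))

deltas = np.array([-1.0, 0.0, 1.5])       # hypothetical item difficulties (logits)
score_two = [s for s in product([0, 1], repeat=3) if sum(s) == 2]

for theta in (-0.5, 2.0):                 # two very different abilities
    probs = np.array([pattern_prob(s, theta, deltas) for s in score_two])
    print(theta, np.round(probs / probs.sum(), 4))   # conditional on raw score = 2

Both abilities print the same conditional distribution over the three patterns: the person parameter cancels out once the raw score is known, which is precisely the sense in which the count of correct answers contains all of the information about that parameter.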

In plain English, all that we’re talking about here is what Thurstone said: the ruler has to work the same way no matter what or who it is measuring, and we have to get the same results for what or who we are measuring no matter which ruler we use. When parameters are not separable, when they stick together because some measures change depending on which questions are asked or because some calibrations change depending on who answers them, we have encountered a “failure of invariance” that tells us something is wrong. If we are to persist in our efforts to determine if something objective exists and can be measured, we need to investigate these interactions and explain them. Maybe there was a data entry error. Maybe a form was misprinted. Maybe a question was poorly phrased. Maybe we have questions that address different constructs all mixed together. Maybe math word problems work like reading test items for students who can’t read the language they’re written in.  Standard statistical modeling ignores these potential violations of construct validity in favor of adding more parameters to the model.
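Here, in the same spirit, is a rough sketch of what checking for a failure of invariance can look like. It simulates Rasch data for a low-ability and a high-ability sample under hypothetical item difficulties and calibrates the items separately in each, using a simple pairwise comparison of items in the style of Choppin (1968); it illustrates the logic only and is not a production estimation routine. Agreement within sampling error says the ruler is working the same way for both groups; systematic disagreement is the invariance failure that calls for investigation.

import numpy as np

rng = np.random.default_rng(7)

def simulate(thetas, deltas):
    # Persons-by-items matrix of 0/1 responses under the Rasch model
    p = 1.0 / (1.0 + np.exp(-(thetas[:, None] - deltas[None, :])))
    return (rng.random(p.shape) < p).astype(int)

def pairwise_difficulties(responses):
    # Rough pairwise calibration: average log-odds of item dominance, centered at zero
    k = responses.shape[1]
    est = np.zeros(k)
    for i in range(k):
        contrasts = []
        for j in range(k):
            if i == j:
                continue
            i_not_j = np.sum((responses[:, i] == 1) & (responses[:, j] == 0))
            j_not_i = np.sum((responses[:, i] == 0) & (responses[:, j] == 1))
            contrasts.append(np.log(j_not_i / i_not_j))   # estimates delta_i - delta_j
        est[i] = np.mean(contrasts)
    return est - est.mean()

deltas = np.array([-1.5, -0.5, 0.0, 0.5, 1.5])   # hypothetical item difficulties (logits)
low_group = simulate(rng.normal(-1.0, 1.0, 3000), deltas)
high_group = simulate(rng.normal(+1.0, 1.0, 3000), deltas)

print(np.round(pairwise_difficulties(low_group), 2))
print(np.round(pairwise_difficulties(high_group), 2))    # should agree within sampling error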

But that’s another story for another time. Tomorrow we’ll take a closer look at sufficiency, in both conceptual and practical terms. Cited references are always available on request, but I’ll post them in a couple of days.

The “Standard Model”, Part I: Natural Law, Economics, Measurement, and Capital

July 14, 2009

In the late 18th and early 19th centuries, scientists took Newton’s successful study of gravitation and the laws of motion as a model for the conduct of any other field of investigation that would purport to be a science. Heilbron (1993) documents how the “Standard Model” evolved and eventually informed the quantitative study of areas of physical nature that had previously been studied only qualitatively, such as cohesion, affinity, heat, light, electricity, and magnetism. These “six imponderables” shaped experimental practice through the widely shared idea that satisfactory understandings of such fundamental forces would be obtained only when they could be treated mathematically in a manner analogous, for instance, to the relations of force, mass, and acceleration in Newton’s Second Law of Motion.

The basic concept is that each parameter in the model has to be measurable independently of the other two, and that any combination of two parameters has to predict the third.  These relationships are demonstrably causal, not just unexplained associations. So force has to be the product of mass and acceleration; mass has to be force divided by acceleration; and acceleration has to be force divided by mass.
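In symbols, the three mutually consistent readings of the second law that define this structure are simply:

\[
F = m\,a, \qquad m = \frac{F}{a}, \qquad a = \frac{F}{m},
\]

so that any two of the parameters, each measurable on its own, determine the third.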

The ideal of a mathematical model incorporating these kinds of relations not only guided much of 19th-century science; the effects of acceleration and force on mass were also a vital consideration for Einstein in his formulation of the relation of mass and energy relative to the speed of light, with the result that energy is now separated from mass in the context of relativity theory (Jammer, 1999, pp. 41-42). He realized that, in the same way humans experience nothing unpleasant or destructive as body mass (or, as is now held, its energy) increases when accelerated to the relatively high speeds of trains, so, too, might we experience similar changes in the relation of mass and energy relative to the speed of light. The basic intellectual accomplishment, however, was one in a still-growing history of analogies from the Standard Model, which is itself deeply indebted to the insights of Plato and Euclid in geometry and arithmetic (Fisher, 1992).

Pursuing an independent line of research, historians of economics and econometrics have documented another extension of the Standard Model. The analogies to the new field of energetics made in the period of 1850-1880, and the use of the balance scale as a model by early economists such as Stanley Jevons and Irving Fisher, are too widespread to ignore. Mirowski (1988, p. 2) says that, in Walras’ first effort at formulating a mathematical expression of economic relations, he “attempted to implement a Newtonian model of market relations, postulating that ‘the price of things is in inverse ratio to the quantity offered and in direct ratio to the quantity demanded.’”

Jevons similarly studied energetics, in his case, with Michael Faraday, in the 1850s. Pareto also trained as an engineer; he made “a direct extrapolation of the path-independence of equilibrium energy states in rational mechanics and thermodynamics” to “the path-independence of the realization of utility” (Mirowski, 1988, p. 21).

The concept of equilibrium models stems from this work, and was also extensively elaborated in the analogies Jan Tinbergen was well known for drawing between economic phenomena and James Clerk Maxwell’s encapsulation of Newton’s second law. In making these analogies, Tinbergen was deliberately employing Maxwell’s own method of analogy for guiding his thinking (Boumans, 2005, p. 24).

In his 1934-35 studies with Frisch in Oslo and with Ronald Fisher in London, the Danish mathematician Georg Rasch (Andrich, 1997; Wright, 1980) made the acquaintance of a number of Tinbergen’s students, such as Tjalling Koopmans (Bjerkholt 2001, p. 9), from whom he may have heard of Tinbergen’s use of Maxwell’s method of analogy (Fisher, 2008). Rasch employs such an analogy in the presentation of his measurement model (1960, p. 115), pointing out

“…the acceleration of a body cannot be determined; the observation of it is admittedly liable to … ‘errors of measurement’, but … this admittance is paramount to defining the acceleration per se as a parameter in a probability distribution — e.g., the mean value of a Gaussian distribution — and it is such parameters, not the observed estimates, which are assumed to follow the multiplicative law [acceleration = force / mass].
Thus, in any case an actual observation can be taken as nothing more than an accidental response, as it were, of an object — a person, a solid body, etc. — to a stimulus — a test, an item, a push, etc. — taking place in accordance with a potential distribution of responses — the qualification ‘potential’ referring to experimental situations which cannot possibly be [exactly] reproduced.
In the cases considered [earlier in the book] this distribution depended on one relevant parameter only, which could be chosen such as to follow the multiplicative law.
Where this law can be applied it provides a principle of measurement on a ratio scale of both stimulus parameters and object parameters, the conceptual status of which is comparable to that of measuring mass and force. Thus, … the reading accuracy of a child … can be measured with the same kind of objectivity as we may tell its weight ….”

What Rasch provides in the models that incorporate this structure is a portable way of applying Maxwell’s method of analogy from the Standard Model. Data fitting a Rasch model show a pattern of associations suggesting that richer causal explanatory processes may be at work, but model fit alone cannot, of course, provide a construct theory in and of itself (Burdick, Stone, & Stenner, 2006; Wright, 1994). This echoes Tinbergen’s repeated emphasis on the difference between the mathematical model and the substantive meaning of the relationships it represents.
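A compact way of displaying the analogy, in standard modern notation rather than Rasch’s original symbols, is to set the second law beside the ratio that governs the model’s response probabilities:

\[
a = \frac{F}{m}
\quad\longleftrightarrow\quad
\frac{P_{vi}}{1-P_{vi}} = \frac{\xi_v}{\delta_i},
\qquad\text{or, in logarithmic form,}\qquad
\ln a = \ln F - \ln m
\quad\longleftrightarrow\quad
\ln\frac{P_{vi}}{1-P_{vi}} = \theta_v - b_i .
\]

In both columns the observable outcome is governed by a ratio of two parameters that can be estimated independently of one another, which is what makes Rasch’s comparison with measuring mass and force more than a figure of speech.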

It also shows appreciation for the reason why Ludwig Boltzmann was so enamored of Maxwell’s method of analogy. As Boumans (1993, p. 136; also see Boumans, 2005, p. 28) puts it, “it allowed him to continue to develop mechanical explanations without having to assert, for example, that a gas ‘really’ consists of molecules that ‘really’ interact with one another according to a particular force law. If a scientific theory is only an image or a picture of nature, one need not worry about developing ‘the only true theory,’ and one can be content to portray the phenomena as simply and clearly as possible.” Rasch (1980, pp. 37-38) similarly held that a model is meant to be useful, not true.

Part II continues soon with more on Rasch’s extrapolation of the Standard Model, and references cited.

Creative Commons License
LivingCapitalMetrics Blog by William P. Fisher, Jr., Ph.D. is licensed under a Creative Commons Attribution-Noncommercial-No Derivative Works 3.0 United States License.
Based on a work at livingcapitalmetrics.wordpress.com.
Permissions beyond the scope of this license may be available at http://www.livingcapitalmetrics.com.

Publications Documenting Score, Rating, Percentage Contrasts with Real Measures

July 7, 2009

A few brief and easy introductions to the contrast between scores, ratings, and percentages vs measures include:

Linacre, J. M. (1992, Autumn). Why fuss about statistical sufficiency? Rasch Measurement Transactions, 6(3), 230 [http://www.rasch.org/rmt/rmt63c.htm].

Linacre, J. M. (1994, Summer). Likert or Rasch? Rasch Measurement Transactions, 8(2), 356 [http://www.rasch.org/rmt/rmt82d.htm].

Wright, B. D. (1992, Summer). Scores are not measures. Rasch Measurement Transactions, 6(1), 208 [http://www.rasch.org/rmt/rmt61n.htm].

Wright, B. D. (1989). Rasch model from counting right answers: Raw scores as sufficient statistics. Rasch Measurement Transactions, 3(2), 62 [http://www.rasch.org/rmt/rmt32e.htm].

Wright, B. D. (1993). Thinking with raw scores. Rasch Measurement Transactions, 7(2), 299-300 [http://www.rasch.org/rmt/rmt72r.htm].

Wright, B. D. (1999). Common sense for measurement. Rasch Measurement Transactions, 13(3), 704-5  [http://www.rasch.org/rmt/rmt133h.htm].

Longer and more technical comparisons include:

Andrich, D. (1989). Distinctions between assumptions and requirements in measurement in the social sciences. In J. A. Keats, R. Taft, R. A. Heath & S. H. Lovibond (Eds.), Mathematical and Theoretical Systems: Proceedings of the 24th International Congress of Psychology of the International Union of Psychological Science, Vol. 4 (pp. 7-16). North-Holland: Elsevier Science Publishers.

van Alphen, A., Halfens, R., Hasman, A., & Imbos, T. (1994). Likert or Rasch? Nothing is more applicable than good theory. Journal of Advanced Nursing, 20, 196-201.

Wright, B. D., & Linacre, J. M. (1989). Observations are always ordinal; measurements, however, must be interval. Archives of Physical Medicine and Rehabilitation, 70(12), 857-867 [http://www.rasch.org/memo44.htm].

Zhu, W. (1996). Should total scores from a rating scale be used directly? Research Quarterly for Exercise and Sport, 67(3), 363-372.

The following lists provide some key resources. The lists are intended to be representative, not comprehensive.  There are many works in addition to these that document the claims in yesterday’s table. Many of these books and articles are highly technical.  Good introductions can be found in Bezruczko (2005), Bond and Fox (2007), Smith and Smith (2004), Wilson (2005), Wright and Stone (1979), Wright and Masters (1982), Wright and Linacre (1989), and elsewhere. The www.rasch.org web site has comprehensive and current information on seminars, consultants, software, full text articles, professional association meetings, etc.

Books and Journal Issues

Andrich, D. (1988). Rasch models for measurement. Sage University Paper Series on Quantitative Applications in the Social Sciences, vol. series no. 07-068. Beverly Hills, California: Sage Publications.

Andrich, D., & Douglas, G. A. (Eds.). (1982). Rasch models for measurement in educational and psychological research [Special issue]. Education Research and Perspectives, 9(1), 5-118. [Full text available at www.rasch.org.]

Bezruczko, N. (Ed.). (2005). Rasch measurement in health sciences. Maple Grove, MN: JAM Press.

Bond, T., & Fox, C. (2007). Applying the Rasch model: Fundamental measurement in the human sciences, 2d edition. Mahwah, New Jersey: Lawrence Erlbaum Associates.

Choppin, B. (1985). In Memoriam: Bruce Choppin (T. N. Postlethwaite ed.) [Special issue]. Evaluation in Education: An International Review Series, 9(1).

De Boeck, P., & Wilson, M. (Eds.). (2004). Explanatory item response models: A generalized linear and nonlinear approach (Statistics for Social and Behavioral Sciences). New York: Springer-Verlag.

Embretson, S. E., & Hershberger, S. L. (Eds.). (1999). The new rules of measurement: What every psychologist and educator should know. Hillsdale, New Jersey: Lawrence Erlbaum Associates.

Engelhard, G., Jr., & Wilson, M. (1996). Objective measurement: Theory into practice, Vol. 3. Norwood, New Jersey: Ablex.

Fischer, G. H., & Molenaar, I. (1995). Rasch models: Foundations, recent developments, and applications. New York: Springer-Verlag.

Fisher, W. P., Jr., & Wright, B. D. (Eds.). (1994). Applications of Probabilistic Conjoint Measurement [Special Issue]. International Journal of Educational Research, 21(6), 557-664.

Garner, M., Draney, K., Wilson, M., Engelhard, G., Jr., & Fisher, W. P., Jr. (Eds.). (2009). Advances in Rasch measurement, Vol. One. Maple Grove, MN: JAM Press.

Granger, C. V., & Gresham, G. E. (Eds). (1993, August). New Developments in Functional Assessment [Special Issue]. Physical Medicine and Rehabilitation Clinics of North America, 4(3), 417-611.

Linacre, J. M. (1989). Many-facet Rasch measurement. Chicago, Illinois: MESA Press.

Liu, X., & Boone, W. (2006). Applications of Rasch measurement in science education. Maple Grove, MN: JAM Press.

Masters, G. N. (2007). Special issue: Programme for International Student Assessment (PISA). Journal of Applied Measurement, 8(3), 235-335.

Masters, G. N., & Keeves, J. P. (Eds.). (1999). Advances in measurement in educational research and assessment. New York: Pergamon.

Osborne, J. W. (Ed.). (2007). Best practices in quantitative methods. Thousand Oaks, CA: Sage.

Rasch, G. (1960). Probabilistic models for some intelligence and attainment tests (Reprint, with Foreword and Afterword by B. D. Wright, Chicago: University of Chicago Press, 1980). Copenhagen, Denmark: Danmarks Paedagogiske Institut.

Smith, E. V., Jr., & Smith, R. M. (Eds.) (2004). Introduction to Rasch measurement. Maple Grove, MN: JAM Press.

Smith, E. V., Jr., & Smith, R. M. (2007). Rasch measurement: Advanced and specialized applications. Maple Grove, MN: JAM Press.

Smith, R. M. (Ed.). (1997, June). Outcome Measurement [Special Issue]. Physical Medicine & Rehabilitation State of the Art Reviews, 11(2), 261-428.

Smith, R. M. (1999). Rasch measurement models. Maple Grove, MN: JAM Press.

von Davier, M. (2006). Multivariate and mixture distribution Rasch models. New York: Springer.

Wilson, M. (1992). Objective measurement: Theory into practice, Vol. 1. Norwood, New Jersey: Ablex.

Wilson, M. (1994). Objective measurement: Theory into practice, Vol. 2. Norwood, New Jersey: Ablex.

Wilson, M. (2005). Constructing measures: An item response modeling approach. Mahwah, New Jersey: Lawrence Erlbaum Associates.

Wilson, M., Draney, K., Brown, N., & Duckor, B. (Eds.). (2009). Advances in Rasch measurement, Vol. Two (in press). Maple Grove, MN: JAM Press.

Wilson, M., & Engelhard, G. (2000). Objective measurement: Theory into practice, Vol. 5. Westport, Connecticut: Ablex Publishing.

Wilson, M., Engelhard, G., & Draney, K. (Eds.). (1997). Objective measurement: Theory into practice, Vol. 4. Norwood, New Jersey: Ablex.

Wright, B. D., & Masters, G. N. (1982). Rating scale analysis: Rasch measurement. Chicago, Illinois: MESA Press.

Wright, B. D., & Stone, M. H. (1979). Best test design: Rasch measurement. Chicago, Illinois: MESA Press.

Wright, B. D., & Stone, M. H. (1999). Measurement essentials. Wilmington, DE: Wide Range, Inc. [http://www.rasch.org/memos.htm#measess].

Key Articles

Andersen, E. B. (1977). Sufficient statistics and latent trait models. Psychometrika, 42(1), 69-81.

Andrich, D. (1978). A rating formulation for ordered response categories. Psychometrika, 43, 561-73.

Andrich, D. (2002). Understanding resistance to the data-model relationship in Rasch’s paradigm: A reflection for the next generation. Journal of Applied Measurement, 3(3), 325-59.

Andrich, D. (2004, January). Controversy and the Rasch model: A characteristic of incompatible paradigms? Medical Care, 42(1), I-7–I-16.

Beltyukova, S. A., Stone, G. E., & Fox, C. M. (2008). Magnitude estimation and categorical rating scaling in social sciences: A theoretical and psychometric controversy. Journal of Applied Measurement, 9(2), 151-159.

Choppin, B. (1968). An item bank using sample-free calibration. Nature, 219, 870-872.

Embretson, S. E. (1996, September). Item Response Theory models and spurious interaction effects in factorial ANOVA designs. Applied Psychological Measurement, 20(3), 201-212.

Engelhard, G. (2008, July). Historical perspectives on invariant measurement: Guttman, Rasch, and Mokken. Measurement: Interdisciplinary Research & Perspectives, 6(3), 155-189.

Fischer, G. H. (1973). The linear logistic test model as an instrument in educational research. Acta Psychologica, 37, 359-374.

Fischer, G. H. (1981, March). On the existence and uniqueness of maximum-likelihood estimates in the Rasch model. Psychometrika, 46(1), 59-77.

Fischer, G. H. (1989). Applying the principles of specific objectivity and of generalizability to the measurement of change. Psychometrika, 52(4), 565-587.

Fisher, W. P., Jr. (1997). Physical disability construct convergence across instruments: Towards a universal metric. Journal of Outcome Measurement, 1(2), 87-113.

Fisher, W. P., Jr. (2004, October). Meaning and method in the social sciences. Human Studies: A Journal for Philosophy and the Social Sciences, 27(4), 429-54.

Fisher, W. P., Jr. (2009, July). Invariance and traceability for measures of human, social, and natural capital: Theory and application. Measurement (Elsevier), in press.

Grosse, M. E., & Wright, B. D. (1986, Sep). Setting, evaluating, and maintaining certification standards with the Rasch model. Evaluation & the Health Professions, 9(3), 267-285.

Hall, W. J., Wijsman, R. A., & Ghosh, J. K. (1965). The relationship between sufficiency and invariance with applications in sequential analysis. Annals of Mathematical Statistics, 36, 575-614.

Kamata, A. (2001, March). Item analysis by the Hierarchical Generalized Linear Model. Journal of Educational Measurement, 38(1), 79-93.

Karabatsos, G., & Ullrich, J. R. (2002). Enumerating and testing conjoint measurement models. Mathematical Social Sciences, 43, 487-505.

Linacre, J. M. (1997). Instantaneous measurement and diagnosis. Physical Medicine and Rehabilitation State of the Art Reviews, 11(2), 315-324.

Linacre, J. M. (2002). Optimizing rating scale category effectiveness. Journal of Applied Measurement, 3(1), 85-106.

Lunz, M. E., & Bergstrom, B. A. (1991). Comparability of decision for computer adaptive and written examinations. Journal of Allied Health, 20(1), 15-23.

Lunz, M. E., Wright, B. D., & Linacre, J. M. (1990). Measuring the impact of judge severity on examination scores. Applied Measurement in Education, 3/4, 331-345.

Masters, G. N. (1985, March). Common-person equating with the Rasch model. Applied Psychological Measurement, 9(1), 73-82.

Mislevy, R. J., Steinberg, L. S., & Almond, R. G. (2003). On the structure of educational assessments. Measurement: Interdisciplinary Research and Perspectives, 1(1), 3-62.

Pelton, T., & Bunderson, V. (2003). The recovery of the density scale using a stochastic quasi-realization of additive conjoint measurement. Journal of Applied Measurement, 4(3), 269-81.

Rasch, G. (1961). On general laws and the meaning of measurement in psychology. In Proceedings of the fourth Berkeley symposium on mathematical statistics and probability (pp. 321-333 [http://www.rasch.org/memo1960.pdf]). Berkeley, California: University of California Press.

Rasch, G. (1966). An individualistic approach to item analysis. In P. F. Lazarsfeld & N. W. Henry (Eds.), Readings in mathematical social science (pp. 89-108). Chicago, Illinois: Science Research Associates.

Rasch, G. (1966, July). An informal report on the present state of a theory of objectivity in comparisons. Unpublished paper [http://www.rasch.org/memo1966.pdf].

Rasch, G. (1966). An item analysis which takes individual differences into account. British Journal of Mathematical and Statistical Psychology, 19, 49-57.

Rasch, G. (1968, September 6). A mathematical theory of objectivity and its consequences for model construction. [Unpublished paper [http://www.rasch.org/memo1968.pdf]], Amsterdam, the Netherlands: Institute of Mathematical Statistics, European Branch.

Rasch, G. (1977). On specific objectivity: An attempt at formalizing the request for generality and validity of scientific statements. Danish Yearbook of Philosophy, 14, 58-94.

Romanoski, J. T., & Douglas, G. (2002). Rasch-transformed raw scores and two-way ANOVA: A simulation analysis. Journal of Applied Measurement, 3(4), 421-430.

Smith, R. M. (1996). A comparison of methods for determining dimensionality in Rasch measurement. Structural Equation Modeling, 3(1), 25-40.

Smith, R. M. (2000). Fit analysis in latent trait measurement models. Journal of Applied Measurement, 1(2), 199-218.

Stenner, A. J., & Smith III, M. (1982). Testing construct theories. Perceptual and Motor Skills, 55, 415-426.

Stenner, A. J. (1994). Specific objectivity – local and general. Rasch Measurement Transactions, 8(3), 374 [http://www.rasch.org/rmt/rmt83e.htm].

Stone, G. E., Beltyukova, S. A., & Fox, C. M. (2008). Objective standard setting for judge-mediated examinations. International Journal of Testing, 8(2), 180-196.

Stone, M. H. (2003). Substantive scale construction. Journal of Applied Measurement, 4(3), 282-97.

Wilson, M., & Sloane, K. (2000). From principles to practice: An embedded assessment system. Applied Measurement in Education, 13(2), 181-208.

Wright, B. D. (1968). Sample-free test calibration and person measurement. In Proceedings of the 1967 invitational conference on testing problems (pp. 85-101 [http://www.rasch.org/memo1.htm]). Princeton, New Jersey: Educational Testing Service.

Wright, B. D. (1977). Solving measurement problems with the Rasch model. Journal of Educational Measurement, 14(2), 97-116 [http://www.rasch.org/memo42.htm].

Wright, B. D. (1980). Foreword, Afterword. In Probabilistic models for some intelligence and attainment tests, by Georg Rasch (pp. ix-xix, 185-199. http://www.rasch.org/memo63.htm). Chicago, Illinois: University of Chicago Press.

Wright, B. D. (1984). Despair and hope for educational measurement. Contemporary Education Review, 3(1), 281-288 [http://www.rasch.org/memo41.htm].

Wright, B. D. (1985). Additivity in psychological measurement. In E. Roskam (Ed.), Measurement and personality assessment. North Holland: Elsevier Science Ltd.

Wright, B. D. (1996). Comparing Rasch measurement and factor analysis. Structural Equation Modeling, 3(1), 3-24.

Wright, B. D. (1997, June). Fundamental measurement for outcome evaluation. Physical Medicine & Rehabilitation State of the Art Reviews, 11(2), 261-88.

Wright, B. D. (1997, Winter). A history of social science measurement. Educational Measurement: Issues and Practice, 16(4), 33-45, 52 [http://www.rasch.org/memo62.htm].

Wright, B. D. (1999). Fundamental measurement for psychology. In S. E. Embretson & S. L. Hershberger (Eds.), The new rules of measurement: What every educator and psychologist should know (pp. 65-104 [http://www.rasch.org/memo64.htm]). Hillsdale, New Jersey: Lawrence Erlbaum Associates.

Wright, B. D., & Bell, S. R. (1984, Winter). Item banks: What, why, how. Journal of Educational Measurement, 21(4), 331-345 [http://www.rasch.org/memo43.htm].

Wright, B. D., & Linacre, J. M. (1989). Observations are always ordinal; measurements, however, must be interval. Archives of Physical Medicine and Rehabilitation, 70(12), 857-867 [http://www.rasch.org/memo44.htm].

Wright, B. D., & Mok, M. (2000). Understanding Rasch measurement: Rasch models overview. Journal of Applied Measurement, 1(1), 83-106.

Model Applications

Adams, R. J., Wu, M. L., & Macaskill, G. (1997). Scaling methodology and procedures for the mathematics and science scales. In M. O. Martin & D. L. Kelly (Eds.), Third International Mathematics and Science Study Technical Report: Vol. 2: Implementation and Analysis – Primary and Middle School Years. Boston: Center for the Study of Testing, Evaluation, and Educational Policy.

Andrich, D., & Van Schoubroeck, L. (1989, May). The General Health Questionnaire: A psychometric analysis using latent trait theory. Psychological Medicine, 19(2), 469-485.

Beltyukova, S. A., Stone, G. E., & Fox, C. M. (2004). Equating student satisfaction measures. Journal of Applied Measurement, 5(1), 62-9.

Bergstrom, B. A., & Lunz, M. E. (1999). CAT for certification and licensure. In F. Drasgow & J. B. Olson-Buchanan (Eds.), Innovations in computerized assessment (pp. 67-91). Mahwah, New Jersey: Lawrence Erlbaum Associates, Inc., Publishers.

Bond, T. G. (1994). Piaget and measurement II: Empirical validation of the Piagetian model. Archives de Psychologie, 63, 155-185.

Bunderson, C. V., & Newby, V. A. (2009). The relationships among design experiments, invariant measurement scales, and domain theories. Journal of Applied Measurement, 10(2), 117-137.

Cavanagh, R. F., & Romanoski, J. T. (2006, October). Rating scale instruments and measurement. Learning Environments Research, 9(3), 273-289.

Cipriani, D., Fox, C., Khuder, S., & Boudreau, N. (2005). Comparing Rasch analyses probability estimates to sensitivity, specificity and likelihood ratios when examining the utility of medical diagnostic tests. Journal of Applied Measurement, 6(2), 180-201.

Dawson, T. L. (2004, April). Assessing intellectual development: Three approaches, one sequence. Journal of Adult Development, 11(2), 71-85.

DeSalvo, K., Fisher, W. P. Jr., Tran, K., Bloser, N., Merrill, W., & Peabody, J. W. (2006, March). Assessing measurement properties of two single-item general health measures. Quality of Life Research, 15(2), 191-201.

Engelhard, G., Jr. (1992). The measurement of writing ability with a many-faceted Rasch model. Applied Measurement in Education, 5(3), 171-191.

Engelhard, G., Jr. (1997). Constructing rater and task banks for performance assessment. Journal of Outcome Measurement, 1(1), 19-33.

Fisher, W. P., Jr. (1998). A research program for accountable and patient-centered health status measures. Journal of Outcome Measurement, 2(3), 222-239.

Fisher, W. P., Jr., Harvey, R. F., Taylor, P., Kilgore, K. M., & Kelly, C. K. (1995, February). Rehabits: A common language of functional assessment. Archives of Physical Medicine and Rehabilitation, 76(2), 113-122.

Heinemann, A. W., Gershon, R., & Fisher, W. P., Jr. (2006). Development and application of the Orthotics and Prosthetics User Survey: Applications and opportunities for health care quality improvement. Journal of Prosthetics and Orthotics, 18(1), 80-85 [http://www.oandp.org/jpo/library/2006_01S_080.asp].

Heinemann, A. W., Linacre, J. M., Wright, B. D., Hamilton, B. B., & Granger, C. V. (1994). Prediction of rehabilitation outcomes with disability measures. Archives of Physical Medicine and Rehabilitation, 75(2), 133-143.

Hobart, J. C., Cano, S. J., O’Connor, R. J., Kinos, S., Heinzlef, O., Roullet, E. P., C., et al. (2003). Multiple Sclerosis Impact Scale-29 (MSIS-29):  Measurement stability across eight European countries. Multiple Sclerosis, 9, S23.

Hobart, J. C., Cano, S. J., Zajicek, J. P., & Thompson, A. J. (2007, December). Rating scales as outcome measures for clinical trials in neurology: Problems, solutions, and recommendations. Lancet Neurology, 6, 1094-1105.

Lai, J., Fisher, A., Magalhaes, L., & Bundy, A. C. (1996). Construct validity of the sensory integration and praxis tests. Occupational Therapy Journal of Research, 16(2), 75-97.

Lee, N. P., & Fisher, W. P., Jr. (2005). Evaluation of the Diabetes Self Care Scale. Journal of Applied Measurement, 6(4), 366-81.

Ludlow, L. H., & Haley, S. M. (1995, December). Rasch model logits: Interpretation, use, and transformation. Educational and Psychological Measurement, 55(6), 967-975.

Markward, N. J., & Fisher, W. P., Jr. (2004). Calibrating the genome. Journal of Applied Measurement, 5(2), 129-41.

Massof, R. W. (2007, August). An interval-scaled scoring algorithm for visual function questionnaires. Optometry & Vision Science, 84(8), E690-E705.

Massof, R. W. (2008, July-August). Editorial: Moving toward scientific measurements of quality of life. Ophthalmic Epidemiology, 15, 209-211.

Masters, G. N., Adams, R. J., & Lokan, J. (1994). Mapping student achievement. International Journal of Educational Research, 21(6), 595-610.

Mead, R. J. (2009). The ISR: Intelligent Student Reports. Journal of Applied Measurement, 10(2), 208-224.

Pelton, T., & Bunderson, V. (2003). The recovery of the density scale using a stochastic quasi-realization of additive conjoint measurement. Journal of Applied Measurement, 4(3), 269-81.

Smith, E. V., Jr. (2000). Metric development and score reporting in Rasch measurement. Journal of Applied Measurement, 1(3), 303-26.

Smith, R. M., & Taylor, P. (2004). Equating rehabilitation outcome scales: Developing common metrics. Journal of Applied Measurement, 5(3), 229-42.

Solloway, S., & Fisher, W. P., Jr. (2007). Mindfulness in measurement: Reconsidering the measurable in mindfulness. International Journal of Transpersonal Studies, 26, 58-81 [http://www.transpersonalstudies.org/volume_26_2007.html].

Stenner, A. J. (2001). The Lexile Framework: A common metric for matching readers and texts. California School Library Journal, 25(1), 41-2.

Wendt, A., & Tatum, D. S. (2005). Credentialing health care professionals. In N. Bezruczko (Ed.), Rasch measurement in health sciences (pp. 161-75). Maple Grove, MN: JAM Press.

Wolfe, E. W., Ray, L. M., & Harris, D. C. (2004, October). A Rasch analysis of three measures of teacher perception generated from the School and Staffing Survey. Educational and Psychological Measurement, 64(5), 842-860.

Wolfe, F., Hawley, D., Goldenberg, D., Russell, I., Buskila, D., & Neumann, L. (2000, Aug). The assessment of functional impairment in fibromyalgia (FM): Rasch analyses of 5 functional scales and the development of the FM Health Assessment Questionnaire. Journal of Rheumatology, 27(8), 1989-99.

Creative Commons License
LivingCapitalMetrics Blog by William P. Fisher, Jr., Ph.D. is licensed under a Creative Commons Attribution-Noncommercial-No Derivative Works 3.0 United States License.
Based on a work at livingcapitalmetrics.wordpress.com.
Permissions beyond the scope of this license may be available at http://www.livingcapitalmetrics.com.