Posts Tagged ‘quantitative methods’

Psychology and the social sciences: An atheoretical, scattered, and disconnected body of research

February 16, 2019

A new article in Nature Human Behaviour (NHB) points toward the need for better theory and more rigorous mathematical models in psychology and the social sciences (Muthukrishna & Henrich, 2019). The authors rightly say that the lack of an overarching cumulative theoretical framework makes it very difficult to see whether new results fit well with previous work, or if something surprising has come to light. Mathematical models are especially emphasized as being of value in specifying clear and precise expectations.

The point that the social sciences and psychology need better theories and models is painfully obvious. But there are in fact thousands of published studies and practical real-world applications that not only provide, but indeed often surpass, the kinds of predictive theories and mathematical models called for in the NHB article. The article not only makes no mention of any of this work, it also frames its argument entirely in a statistical context instead of the more appropriate context of measurement science.

The concept of reliability provides an excellent point of entry. Most behavioral scientists think of reliability statistically, as a coefficient with a positive numeric value usually between 0.00 and 1.00. The tangible sense of reliability as indicating exactly how predictable an outcome is does not usually figure in most researchers’ thinking. But that sense of the specific predictability of results has been the focus of attention in social and psychological measurement science for decades.
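One conventional way to make a reliability coefficient more tangible is to convert it into a separation ratio and a count of statistically distinct strata. The sketch below assumes reliability is the ratio of true to observed variance and uses the formulas as they are commonly given in the Rasch measurement literature; the function names are illustrative only.

```python
import math

def separation(reliability: float) -> float:
    """Separation ratio G = sqrt(R / (1 - R)): true spread in error units."""
    if not 0.0 <= reliability < 1.0:
        raise ValueError("reliability must be in [0, 1)")
    return math.sqrt(reliability / (1.0 - reliability))

def strata(reliability: float) -> float:
    """Number of distinct levels, (4G + 1) / 3, centered 3 error units apart."""
    return (4.0 * separation(reliability) + 1.0) / 3.0

# Illustrative reliabilities only, not results from any study.
for r in (0.5, 0.8, 0.9):
    print(f"R={r:.2f}  G={separation(r):.2f}  strata={strata(r):.2f}")
```

Read this way, a coefficient of 0.80 says the instrument can distinguish roughly three levels of the measured construct in the sample, which is a more direct statement of predictability than the coefficient alone.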

For instance, the measurement of time is reliable in the sense that the position of the sun relative to the earth can be precisely predicted from geographic location, the time of day, and the day of the year. The numbers and words assigned to noon are closely associated with the sun being at its high point in the sky (though there are political variations by season and location across time zones).

That kind of reproducible association is rarely sought in psychology and the social sciences, but it is far from nonexistent. One can discern different degrees to which that kind of association is included in models of measured constructs. Though most behavioral research doesn’t mention the connection between linear amounts of a measured phenomenon and a reproducible numeric representation of it (level 0), quite a significant body of work focuses on that connection (level 1). The disappointing thing about that level 1 work is that the relentless obsession with statistical methods prevents most researchers from connecting a reproducible quantity with a single expression of it in a standard unit, and with an associated uncertainty term (level 2). That is, level 1 researchers conceive measurement in statistical terms, as a product of data analysis. Even when results across data sets are highly correlated and could be equated to a common metric, level 1 researchers do not leverage that source of potential value for simplified communication and accumulated comparability.

For their part, level 2 researchers usually do not articulate theories about the measured constructs by augmenting the mathematical data model with an explanatory model that predicts variation (level 3). Level 2 researchers are empirically grounded in data, and can expand their network of measures only by gathering more data and analyzing it in ways that bring it into their standard unit’s frame of reference.

Level 3 researchers, however, have come to see what makes their measures tick. They understand the mechanisms that make their questions vary. They can write new questions to their theoretical specifications, test those questions by asking them of a relevant sample, and produce the predicted calibrations. For instance, reading comprehension is well established to be a function of the difference between a person’s reading ability and the complexity of the text they encounter (see articles by Stenner in the list below). We have built our entire educational system around this idea, as we deliberately introduce children first to the alphabet, then to the most common words, then to short sentences, and then to ever longer and more complicated text. But stating the construct model, testing it against data, calibrating a unit to which all tests and measures can be traced, and connecting together all the books, articles, tests, curricula, and students is a process that began (in English and Spanish) only in the 1980s. The process still is far from finished, and most reading research still does not use the common metric.
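As a minimal sketch of that relation, assume the dichotomous Rasch form, with reader and text measures both expressed in logits. The function name and the sample values are illustrative and are not taken from any operational reading framework. The expected probability of success then depends only on the difference between the two measures:

```python
import math

def success_probability(reader_ability: float, text_complexity: float) -> float:
    """Dichotomous Rasch model: probability of success given the
    difference between a person measure and a text measure (logits)."""
    difference = reader_ability - text_complexity
    return 1.0 / (1.0 + math.exp(-difference))

# Hypothetical values chosen only for illustration.
for ability, complexity in [(1.0, 1.0), (2.0, 1.0), (0.0, 1.0)]:
    p = success_probability(ability, complexity)
    print(f"ability={ability:+.1f}, complexity={complexity:+.1f} -> p={p:.2f}")
```

A reader matched to a text has an even chance of success, and the same one-logit gap gives the same probability wherever it falls on the scale, which is what allows a single unit to connect readers, texts, and tests.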

In this kind of theory-informed context, new items can be automatically generated on the fly at the point of measurement. Those items and inferences made from them are validated by the consistency of the responses and the associated expression of the expected probability of success, agreement, etc. The expense of constant data gathering and analysis can be cut to a very small fraction of what it is at levels 0-2.

Level 3 research methods are not widely known or used, but they are not new. They are gaining traction as their use by national metrology institutes globally grows. As high profile critiques of social and psychological research practices continue to emerge, perhaps more attention will be paid to this important body of work. A few key references are provided below, and virtually every post in this blog pertains to these issues.


Baghaei, P. (2008). The Rasch model as a construct validation tool. Rasch Measurement Transactions, 22(1), 1145-6.

Bergstrom, B. A., & Lunz, M. E. (1994). The equivalence of Rasch item calibrations and ability estimates across modes of administration. In M. Wilson (Ed.), Objective measurement: Theory into practice, Vol. 2 (pp. 122-128). Norwood, New Jersey: Ablex.

Cano, S., Pendrill, L., Barbic, S., & Fisher, W. P., Jr. (2018). Patient-centred outcome metrology for healthcare decision-making. Journal of Physics: Conference Series, 1044, 012057.

Dimitrov, D. M. (2010). Testing for factorial invariance in the context of construct validation. Measurement & Evaluation in Counseling & Development, 43(2), 121-149.

Embretson, S. E. (2010). Measuring psychological constructs: Advances in model-based approaches. Washington, DC: American Psychological Association.

Fischer, G. H. (1973). The linear logistic test model as an instrument in educational research. Acta Psychologica, 37, 359-374.

Fischer, G. H. (1983). Logistic latent trait models with linear constraints. Psychometrika, 48(1), 3-26.

Fisher, W. P., Jr. (1992). Reliability statistics. Rasch Measurement Transactions, 6(3), 238.

Fisher, W. P., Jr. (2008). The cash value of reliability. Rasch Measurement Transactions, 22(1), 1160-1163.

Fisher, W. P., Jr., & Stenner, A. J. (2016). Theory-based metrological traceability in education: A reading measurement network. Measurement, 92, 489-496.

Green, S. B., Lissitz, R. W., & Mulaik, S. A. (1977). Limitations of coefficient alpha as an index of test unidimensionality. Educational and Psychological Measurement, 37(4), 827-833.

Hattie, J. (1985). Methodology review: Assessing unidimensionality of tests and items. Applied Psychological Measurement, 9(2), 139-64.

Hobart, J. C., Cano, S. J., Zajicek, J. P., & Thompson, A. J. (2007). Rating scales as outcome measures for clinical trials in neurology: Problems, solutions, and recommendations. Lancet Neurology, 6, 1094-1105.

Irvine, S. H., Dunn, P. L., & Anderson, J. D. (1990). Towards a theory of algorithm-determined cognitive test construction. British Journal of Psychology, 81, 173-195.

Kline, T. L., Schmidt, K. M., & Bowles, R. P. (2006). Using LinLog and FACETS to model item components in the LLTM. Journal of Applied Measurement, 7(1), 74-91.

Lunz, M. E., & Linacre, J. M. (2010). Reliability of performance examinations: Revisited. In M. Garner, G. Engelhard, Jr., W. P. Fisher, Jr. & M. Wilson (Eds.), Advances in Rasch Measurement, Vol. 1 (pp. 328-341). Maple Grove, MN: JAM Press.

Mari, L., & Wilson, M. (2014). An introduction to the Rasch measurement approach for metrologists. Measurement, 51, 315-327.

Markward, N. J., & Fisher, W. P., Jr. (2004). Calibrating the genome. Journal of Applied Measurement, 5(2), 129-141.

Maul, A., Mari, L., Torres Irribarra, D., & Wilson, M. (2018). The quality of measurement results in terms of the structural features of the measurement process. Measurement, 116, 611-620.

Muthukrishna, M., & Henrich, J. (2019). A problem in theory. Nature Human Behaviour, 1-9.

Obiekwe, J. C. (1999, August 1). Application and validation of the linear logistic test model for item difficulty prediction in the context of mathematics problems. Dissertation Abstracts International: Section B: The Sciences & Engineering, 60(2-B), 0851.

Pendrill, L. (2014). Man as a measurement instrument [Special Feature]. NCSLi Measure: The Journal of Measurement Science, 9(4), 22-33.

Pendrill, L., & Fisher, W. P., Jr. (2015). Counting and quantification: Comparing psychometric and metrological perspectives on visual perceptions of number. Measurement, 71, 46-55.

Pendrill, L., & Petersson, N. (2016). Metrology of human-based and other qualitative measurements. Measurement Science and Technology, 27(9), 094003.

Sijtsma, K. (2009). Correcting fallacies in validity, reliability, and classification. International Journal of Testing, 8(3), 167-194.

Sijtsma, K. (2009). On the use, the misuse, and the very limited usefulness of Cronbach’s alpha. Psychometrika, 74(1), 107-120.

Stenner, A. J. (2001). The necessity of construct theory. Rasch Measurement Transactions, 15(1), 804-5.

Stenner, A. J., Fisher, W. P., Jr., Stone, M. H., & Burdick, D. S. (2013). Causal Rasch models. Frontiers in Psychology: Quantitative Psychology and Measurement, 4(536), 1-14.

Stenner, A. J., & Horabin, I. (1992). Three stages of construct definition. Rasch Measurement Transactions, 6(3), 229.

Stenner, A. J., Stone, M. H., & Fisher, W. P., Jr. (2018). The unreasonable effectiveness of theory based instrument calibration in the natural sciences: What can the social sciences learn? Journal of Physics: Conference Series, 1044, 012070.

Stone, M. H. (2003). Substantive scale construction. Journal of Applied Measurement, 4(3), 282-297.

Wilson, M. (2005). Constructing measures: An item response modeling approach. Mahwah, New Jersey: Lawrence Erlbaum Associates.

Wilson, M. R. (2013). Using the concept of a measurement system to characterize measurement models used in psychometrics. Measurement, 46, 3766-3774.

Wright, B. D., & Stone, M. H. (1979). Chapter 5: Constructing a variable. In Best test design: Rasch measurement (pp. 83-128). Chicago, Illinois: MESA Press.

Wright, B. D., & Stone, M. H. (1999). Measurement essentials. Wilmington, DE: Wide Range, Inc.

Wright, B. D., Stone, M., & Enos, M. (2000). The evolution of meaning in practice. Rasch Measurement Transactions, 14(1), 736.

Creative Commons License
LivingCapitalMetrics Blog by William P. Fisher, Jr., Ph.D. is licensed under a Creative Commons Attribution-Noncommercial-No Derivative Works 3.0 United States License.


Dispelling Myths about Measurement in Psychology and the Social Sciences

August 27, 2013

Seven common assumptions about measurement and method in psychology and the social sciences stand as inconsistent anomalies in the experience of those who have taken the trouble to challenge them. As evidence, theory, and instrumentation accumulate, will we see a revolutionary break and disruptive change across multiple social and economic levels and areas as a result? Will there be a slower, more gradual transition to a new paradigm? Or will the status quo simply roll on, oblivious to the potential for new questions and new directions? We shall see.

1. Myth: Qualitative data and methods cannot really be integrated with quantitative data and methods because of opposing philosophical assumptions.

Fact: Qualitative methods incorporate a critique of quantitative methods that leads to a more scientific theory and practice of measurement.

2. Myth: Statistics is the logic of measurement.

Fact: Statistics did not emerge as a discipline until the 19th century, while measurement, of course, has been around for millennia. Measurement is modeled at the individual level within a single variable whereas statistics model at the population level between variables. Data are fit to prescriptive measurement models using the Garbage-In, Garbage-Out (GIGO) Principle, while descriptive statistical models are fit to data.

3. Myth: Linear measurement from ordinal test and survey data is impossible.

Fact: Ordinal data have been used as a basis for invariant linear measures for decades.

4. Myth: Scientific laws like Newton’s laws of motion cannot be successfully formulated, tested, or validated in psychology and the social sciences.

Fact: Mathematical laws of human behavior and cognition in the same form as Newton’s laws are formulated, tested, and validated in numerous Rasch model applications.

5. Myth: Experimental manipulations of psychological and social phenomena are inherently impossible or unethical.

Fact: Decades of research across multiple fields have successfully shown how theory-informed interventions on items/indicators/questions can result in predictable, consistent, and substantively meaningful quantitative changes.

6. Myth: “Real” measurement is impossible in psychology and the social sciences.

Fact: Success in predictive theory, instrument calibration, and in maintaining stable units of comparison over time are all evidence supporting the viability of meaningful uniform units of measurement in psychology and the social sciences.

7. Myth: Efficient economic markets can incorporate only manufactured and liquid capital, and property. Human, social, and natural capital, being intangible, have permanent status as market externalities as they cannot be measured well enough to enable accountability, pricing, or transferable representations (common currency instruments).

Fact: The theory and methods necessary for establishing an Intangible Assets Metric System are in hand. What’s missing is the awareness of the scientific, human, social, and economic value that would be returned from the admittedly very large investments that would be required.

References and examples are available in other posts in this blog, in my publications, or on request.

Contesting the Claim, Part I: Are Rasch Measures Really as Objective as Physical Measures?

July 21, 2009

Psychometricians, statisticians, metrologists, and measurement theoreticians tend to be pretty unassuming kinds of people. They’re unobtrusive and retiring, by and large. But there is one thing some of them are prone to say that will raise the ire of others in a flash, and the poor innocent geek will suddenly be subjected to previously unknown forms and degrees of social exclusion.

What is that one thing? “Instruments calibrated by fitting data to a Rasch model measure with the same kind of objectivity as is obtained with physical measures.” That’s one version. Another could be along these lines: “When data fit a Rasch model, we’ve discovered a pattern in human attitudes or behaviors so regular that it is conceptually equivalent to a law of nature.”

Maybe it is the implication of objectivity as something that must be politically incorrect that causes the looks of horror and recoiling retreats in the nonmetrically inclined when they hear things like this. Maybe it is the ingrained cultural predisposition to thinking such claims outrageously preposterous that makes those unfamiliar with 80 years of developments and applications so dismissive. Maybe it’s just fear of the unknown, or a desire not to have to be responsible for knowing something important that hardly anyone else knows.

Of course, it could just be a simple misunderstanding. When people hear the word “objective” do most of them have an image of an object in mind? Does objectivity connote physical concreteness to most people? That doesn’t hold up well for me, since we can be objective about events and things people do without any confusions involving being able to touch and feel what’s at issue.

No, I think something else is going on. I think it has to do with the persistent idea that objectivity requires a disconnected, alienated point of view, one that ignores the mutual implication of subject and object in favor of analytically tractable formulations of problems that, though solvable, are irrelevant to anything important or real. But that is hardly the only available meaning of objectivity, and it isn’t anywhere near the best. It certainly is not what is meant in the world of measurement theory and practice.

It’s better to think of objectivity as something having to do with things like the object of a conversation, or an object of linguistic reference: “chair” as referring to the entire class of all forms of seating technology, for instance. In these cases, we know right away that we’re dealing with what might be considered a heuristic ideal, an abstraction. It also helps to think of objectivity in terms of fairness and justice. After all, don’t we want our educational, health care, and social services systems to respect the equality of all individuals and their rights?

That is not, of course, how measurement theoreticians in psychology have always thought about objectivity. In fact, it was only 70-80 years ago that most psychologists gave up on objective measurement because they couldn’t find enough evidence of concrete phenomena to support the claims to objectivity they wanted to make (Michell, 1999). The focus on the reflex arc led a lot of psychologists into psychophysics, and the effects of operant conditioning led others to behaviorism. But a lot of the problems studied in these fields, though solvable, turned out to be uninteresting and unrelated to the larger issues of life demanding attention.

And so, with no physical entity that could be laid end-to-end and concatenated in the way weights are in a balance scale, psychologists just redefined measurement to suit what they perceived to be the inherent limits of their subject matter. Measurement didn’t have to be just ratio or interval, it could also be ordinal and even nominal. The important thing was to get numbers that could be statistically manipulated. That would provide more than enough credibility, or obfuscation, to create the appearance of legitimate science.

But while mainstream psychology was focused on hunting for statistically significant p-values, there were others trying to figure out if attitudes, abilities, and behaviors could be measured in a rigorously meaningful way.

Louis Thurstone, a former electrical engineer turned psychologist, was among the first to formulate the problem. Writing in 1928, Thurstone rightly made the instrument the focus of attention:

“The scale must transcend the group measured.–One crucial experimental test must be applied to our method of measuring attitudes before it can be accepted as valid. A measuring instrument must not be seriously affected in its measuring function by the object of measurement. To the extent that its measuring function is so affected, the validity of the instrument is impaired or limited. If a yardstick measured differently because of the fact that it was a rug, a picture, or a piece of paper that was being measured, then to that extent the trustworthiness of that yardstick as a measuring device would be impaired. Within the range of objects for which the measuring instrument is intended, its function must be independent of the object of measurement” (Thurstone, 1959, p. 228).

Thurstone aptly captures what is meant when it is said that attitudes, abilities, or behaviors can be measured with the same kind of objectivity as is obtained in the natural sciences. Objectivity is realized when a test, survey, or assessment functions the same way no matter who is being measured, and, conversely (Thurstone took this up, too), an attitude, ability, or behavior exhibits the same amount of what is measured no matter which instrument is used.

This claim, too, may seem to some to be so outrageously improbable as to be worthy of rejecting out of hand. After all, hasn’t everyone learned how the fact of being measured changes the measure? Thing is, this is just as true in physics and ecology as it is in psychiatry or sociology, and the natural sciences haven’t abandoned their claims to objectivity. So what’s up?

What’s up is that all sciences now have participant observers. The old Cartesian duality of the subject-object split still lingers in various rhetorical habits and affects our choices and behaviors, but, in actual practice, scientific methods have always had to deal with the way questions imply particular answers.

And there’s more. Qualitative methods have grown out of some of the deep philosophical introspections of the twentieth century, such as phenomenology, hermeneutics, deconstruction, postmodernism, etc. But most researchers who are adopting qualitative methods over quantitative ones don’t know that the philosophers legitimating the new focuses on narrative, interpretation, and the construction of meaning did quite a lot of very good thinking about mathematics and quantitative reasoning. Much of my own published work engages with these philosophers to find new ways of thinking about measurement (Fisher, 2004, for instance). And there are some very interesting connections to be made that show quantification does not necessarily have to involve a positivist, subject-object split.

So where does that leave us? Well, with probability. Not in the sense of statistical hypothesis testing, but in the sense of calibrating instruments with known probabilistic characteristics. If the social sciences are ever to be scientific, null hypothesis significance tests are going to have to be replaced with universally uniform metrics embodying and deploying the regularities of natural laws, as is the case in the physical sciences. Various arguments on this issue have been offered for decades (Cohen, 1994; Meehl, 1967, 1978; Goodman, 1999; Guttman, 1985; Rozeboom, 1960). The point is not to prescribe allowable statistics based on scale type (Velleman & Wilkinson, 1993). Rather, we need to shift and simplify the focus of inference from the statistical analysis of data to the calibration and distribution of instruments that support distributed cognition, unify networks, lubricate markets, and coordinate collective thinking and acting (Fisher, 2000, 2009). Persuasion will likely matter far less in resolving the matter than an ability to create new value, efficiencies, and profits.

In 1964, Luce and Tukey gave us another way of stating what Thurstone was getting at:

“The axioms of conjoint measurement apply naturally to problems of classical physics and permit the measurement of conventional physical quantities on ratio scales…. In the various fields, including the behavioral and biological sciences, where factors producing orderable effects and responses deserve both more useful and more fundamental measurement, the moral seems clear: when no natural concatenation operation exists, one should try to discover a way to measure factors and responses such that the ‘effects’ of different factors are additive.”

In other words, if we cannot find some physical thing that we can make add up the way numbers do, as we did with length, weight, volts, temperature, time, etc., then we ought to ask questions in a way that allows the answers to reveal the kind of patterns we expect to see when things do concatenate. What Thurstone and others working in his wake have done is to see that we could possibly do some things virtually in terms of abstract relations that we cannot do actually in terms of concrete relations.

The concept is no more difficult to comprehend than understanding the difference between playing solitaire with actual cards and writing a computer program to play solitaire with virtual cards. Either way, the same relationships hold.

A Danish mathematician, Georg Rasch, understood this. Working in the 1950s with data from psychological and reading tests, Rasch drew on his training in the natural sciences and mathematics to arrive at a conception of measurement that would apply in the natural and human sciences equally well. He realized that

“…the acceleration of a body cannot be determined; the observation of it is admittedly liable to … ‘errors of measurement’, but … this admittance is paramount to defining the acceleration per se as a parameter in a probability distribution — e.g., the mean value of a Gaussian distribution — and it is such parameters, not the observed estimates, which are assumed to follow the multiplicative law [acceleration = force / mass, or mass * acceleration = force].

“Thus, in any case an actual observation can be taken as nothing more than an accidental response, as it were, of an object — a person, a solid body, etc. — to a stimulus — a test, an item, a push, etc. — taking place in accordance with a potential distribution of responses — the qualification ‘potential’ referring to experimental situations which cannot possibly be [exactly] reproduced.

“In the cases considered [earlier in the book] this distribution depended on one relevant parameter only, which could be chosen such as to follow the multiplicative law.

“Where this law can be applied it provides a principle of measurement on a ratio scale of both stimulus parameters and object parameters, the conceptual status of which is comparable to that of measuring mass and force. Thus, … the reading accuracy of a child … can be measured with the same kind of objectivity as we may tell its weight …” (Rasch, 1960, p. 115).
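The parallel Rasch draws can be written out in the notation commonly used for the dichotomous Rasch model. The symbols below are a conventional choice and are not taken from the quoted passage: the odds of success for person n on item i are a ratio of a person parameter to an item parameter, in the same form as acceleration = force / mass.

```latex
% Multiplicative form: odds of success as a ratio of two parameters,
% set beside the law of motion it is compared with.
\frac{P_{ni}}{1 - P_{ni}} = \frac{\xi_n}{\delta_i}
\qquad\text{compare}\qquad
a = \frac{F}{m}

% Taking logarithms gives the additive (logit) form,
% with B_n = \ln \xi_n and D_i = \ln \delta_i.
\ln\!\left(\frac{P_{ni}}{1 - P_{ni}}\right) = B_n - D_i
```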

Rasch’s model not only sets the parameters for data sufficient to the task of measurement, it lays out the relationships that must be found in data for objective results to be possible. Rasch studied with Ronald Fisher in London in 1935, expanded his understanding of statistical sufficiency with him, and then applied it in his measurement work, but not in the way that most statisticians understand it. Yes, in the context of group-level statistics, sufficiency concerns the reproducibility of a normal distribution when all that is known are the mean and the standard deviation. But sufficiency is something quite different in the context of individual-level measurement. Here, counts of correct answers or sums of ratings serve as sufficient statistics for any statistical model’s parameters when they contain all of the information needed to establish that the parameters are independent of one another, and are not interacting in ways that keep them tied together. So despite his respect for Ronald Fisher and the concept of sufficiency, Rasch’s work with models and methods that worked equally well with many different kinds of distributions led him to jokingly suggest (Andersen, 1995, p. 385) that all textbooks mentioning the normal distribution should be burned!
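One way to make the sufficiency point concrete is a small numerical check. This sketch assumes a dichotomous Rasch model with two items, and the item difficulties and ability values are made up for illustration. Once the raw score is known, the probability of any particular response pattern no longer depends on the person's ability:

```python
import math

def p_correct(ability: float, difficulty: float) -> float:
    """Dichotomous Rasch probability of a correct response (logits)."""
    return 1.0 / (1.0 + math.exp(-(ability - difficulty)))

def pattern_given_score_one(ability: float, d1: float, d2: float) -> float:
    """P(item 1 right, item 2 wrong | raw score = 1) for one person."""
    p1, p2 = p_correct(ability, d1), p_correct(ability, d2)
    right_wrong = p1 * (1.0 - p2)
    wrong_right = (1.0 - p1) * p2
    return right_wrong / (right_wrong + wrong_right)

d1, d2 = -0.5, 1.0  # hypothetical item difficulties
for ability in (-2.0, 0.0, 3.0):
    print(ability, round(pattern_given_score_one(ability, d1, d2), 6))
```

The printed conditional probability is the same for every ability value because the person parameter cancels out of the ratio, leaving only the difference between the two item difficulties. That cancellation is what makes the parameters separable.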

In plain English, all that we’re talking about here is what Thurstone said: the ruler has to work the same way no matter what or who it is measuring, and we have to get the same results for what or who we are measuring no matter which ruler we use. When parameters are not separable, when they stick together because some measures change depending on which questions are asked or because some calibrations change depending on who answers them, we have encountered a “failure of invariance” that tells us something is wrong. If we are to persist in our efforts to determine if something objective exists and can be measured, we need to investigate these interactions and explain them. Maybe there was a data entry error. Maybe a form was misprinted. Maybe a question was poorly phrased. Maybe we have questions that address different constructs all mixed together. Maybe math word problems work like reading test items for students who can’t read the language they’re written in.  Standard statistical modeling ignores these potential violations of construct validity in favor of adding more parameters to the model.
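A simple screen for that kind of failure of invariance can be sketched as follows. It assumes the same items have been calibrated separately in two samples and placed on a common origin. The two-standard-error threshold, the function name, and all of the numbers are illustrative assumptions, not a prescribed procedure.

```python
import math

def invariance_flags(calib_a, calib_b, se_a, se_b, threshold=2.0):
    """Return indices of items whose calibrations differ between two
    samples by more than `threshold` joint standard errors."""
    flagged = []
    for i, (da, db, sa, sb) in enumerate(zip(calib_a, calib_b, se_a, se_b)):
        z = (da - db) / math.sqrt(sa ** 2 + sb ** 2)
        if abs(z) > threshold:
            flagged.append(i)
    return flagged

# Made-up calibrations (logits) for the same four items in two groups.
group_a = [-1.0, 0.0, 0.5, 1.2]
group_b = [-0.9, 0.1, 1.6, 1.1]
errors = [0.2, 0.2, 0.2, 0.2]
print(invariance_flags(group_a, group_b, errors, errors))
```

With these made-up numbers, only the third item (index 2) is flagged. A flagged item is a prompt to look for a substantive explanation, such as those listed above, rather than a reason to add parameters to the model.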

But that’s another story for another time. Tomorrow we’ll take a closer look at sufficiency, in both conceptual and practical terms. Cited references are always available on request, but I’ll post them in a couple of days.