(The following reply was sent in response to an invitation from researchers at the University of Queensland in Brisbane, Australia, to participate in a survey on the replication crisis in psychology.)
Thank you for alerting me to your important survey, and for providing an opportunity to address the replication of results in psychology. Given the content of the survey, it seems appropriate to offer an alternative perspective on the nature of the situation.
Initially, I looked at the first question in your survey on the replication crisis in psychology and closed the page. It did not seem to me that the question could be properly answered given the information provided. Later I went back and responded as reasonably as I could, given that the entire survey is biased toward the standard misconceptions of psychological measurement, namely, that ordinal scores gathered with the aim of applying descriptive statistics are definitive, and that quantitative methods have no need for hypotheses, models, proofs, or evidence of meaningful interval units of comparison.
To my mind, the replication crisis in psychology is in part a function of ignoring the distinction between statistical models and scientific models (Cohen, 1994; Duncan, 1992; Fisher, 2010b; Meehl, 1967; Michell, 1986; Wilson, 2013a; Zhu, 2012). The statistical motivation for making models probabilistic concerns sampling; the scientific motivation concerns the individual response process. As Duncan and Stenbeck (1988, pp. 24-25) put it,
“The main point to emphasize here is that the postulate of probabilistic response must be clearly distinguished in both concept and research design from the stochastic variation of data that arises from random sampling of a heterogeneous population. The distinction is completely blurred in our conventional statistical training and practice of data analysis, wherein the stochastic aspects of the statistical model are most easily justified by the idea of sampling from a population distribution. We seldom stop to wonder if sampling is the only reason for making the model stochastic. The perverse consequence of doing good statistics is, therefore, to suppress curiosity about the actual processes that generate the data.”
This distinction between scientific and statistical models is old and well worn. It often seems that the mainstream will never pick up on it. Yet whenever the sum of counts or ratings from an individual-level response process is treated inferentially as a sufficient statistic (i.e., as a score to which no outside information is added), an identified scientific measurement model of a particular form is assumed, whether or not the researcher or analyst is aware of it or actually applies it (Andersen, 1977, 1999; Fischer, 1981; San Martin, Gonzalez, & Tuerlinckx, 2009).
Forty-three years ago, the situation was described by Wright (1977, p. 114):
“Unweighted scores are appropriate for person measurement if and only if what happens when a person responds to an item can be usefully approximated by a Rasch model…. Ironically, for anyone who claims skepticism about ‘the assumptions’ of the Rasch model, those who use unweighted scores are, however unwittingly, counting on the Rasch model to see them through. Whether this is useful in practice is a question not for more theorizing, but for empirical study.”
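The sufficiency point can be checked numerically. What follows is a minimal sketch, assuming a dichotomous Rasch model with local independence; the three item calibrations, the response pattern, and the two person measures are hypothetical values chosen only for illustration. It computes the probability of one response pattern conditional on its raw score at two very different person measures.

```python
import math
from itertools import product

def rasch_p(theta, b):
    """Probability of a scored-1 response under the dichotomous Rasch model."""
    return 1.0 / (1.0 + math.exp(-(theta - b)))

def pattern_prob(pattern, theta, difficulties):
    """Joint probability of a 0/1 response pattern, assuming local independence."""
    prob = 1.0
    for x, b in zip(pattern, difficulties):
        p = rasch_p(theta, b)
        prob *= p if x == 1 else 1.0 - p
    return prob

def conditional_given_score(pattern, theta, difficulties):
    """P(pattern | raw score): normalize over all patterns with the same score."""
    score = sum(pattern)
    total = sum(
        pattern_prob(other, theta, difficulties)
        for other in product((0, 1), repeat=len(difficulties))
        if sum(other) == score
    )
    return pattern_prob(pattern, theta, difficulties) / total

difficulties = [-1.0, 0.0, 1.5]  # hypothetical item calibrations, in logits
pattern = (1, 0, 1)              # one response pattern with raw score 2

# Hypothetical person measures far apart on the scale.
low = conditional_given_score(pattern, -2.0, difficulties)
high = conditional_given_score(pattern, 3.0, difficulties)
print(low, high)
```

If the model holds, the two printed values agree to within floating-point error. Once the raw score is known, the particular pattern carries no further information about the person measure, which is the sense in which the unweighted score is sufficient.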
Insofar as measurement results are replicable, they converge on a common construct and unit definition, and support collective learning processes, the coherence of communities of research and practice, and the emergence of metrological standards (Barbic, et al., 2019; Cano, et al., 2019; Fisher, 1997a/b, 2004, 2009, 2010a, 2012, 2017a; Fisher & Stenner, 2016; Mari & Wilson, 2014, 2020; Pendrill, 2014, 2019; Pendrill & Fisher, 2015; Wilson, 2013b).
Researchers’ subjective guesses as to what measured constructs look like and how they behave tend to be borne out, more or less, in ways that allow us all to learn from each other, if and when we take the trouble to prepare, scale, and present our results in the form required to make that happen (for guidance in this regard, see Fisher & Wright, 1994; Smith, 2005; Stone, Wright, & Stenner, 1999; Wilson, 2005, 2009, 2018; Wilson, et al., 2012; Wright & Stone, 1979, 1999, 2003).
You would never know it from the kind of research assumed in your online survey, but the successful replication of results naturally should and does lead to the detailed mapping of variables (constructs), and the definition of unit standards that bring the thing measured into language as common metrics and shared objects of reference.
This is not a new idea, or an unproven one (Luce & Tukey, 1964; Narens & Luce, 1986; Rasch, 1960; Thurstone, 1928; Wright, 1997; among many others). Proofs of the form of the model following from the sufficiency of the scores are cited above, and experimental proofs of the utility of the models for designing and calibrating interval unit measures are provided in thousands of peer-reviewed publications. Explanatory scientific models predicting item calibrations have been in development and in practical use for decades (Embretson, 2010; Fischer, 1973; Latimer, 1982; Prien, 1989; Stenner & Smith, 1982; Stenner, et al., 2013; Wright & Stone, 1979; among many others).
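As one illustration of such an explanatory model, the linear logistic test model (Fischer, 1973) can be sketched as follows. The notation is a common textbook convention assumed here for illustration, not taken from any particular study cited above: \theta_v is the measure of person v, \beta_i the calibration of item i, w_{ij} the weight assigned in advance to component j in item i, \eta_j the difficulty contributed by component j, and c a normalization constant.

```latex
% Dichotomous Rasch model for person v responding to item i
P(X_{vi} = 1 \mid \theta_v, \beta_i)
  = \frac{\exp(\theta_v - \beta_i)}{1 + \exp(\theta_v - \beta_i)}

% Linear logistic test model: each item calibration is predicted
% from a weighted sum of m component difficulties
\beta_i = \sum_{j=1}^{m} w_{ij}\,\eta_j + c
```

The second equation is what makes the model explanatory: item calibrations are not only estimated from data but also predicted from theory about what makes items easy or hard, so the theory can be tested against the empirical calibrations.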
Preconceptions and unexamined assumptions about measurement blind many researchers and limit their vision of what’s possible to conventional repetitions of more of the same, even when the methods used do not work and have repeatedly been shown to be ineffectual for decades. In this regard, it is worth noting, contra widespread assumptions, that another difference between statistical and scientific models is the reductionist whole-is-the-sum-of-the-parts perspective of the former, and the emergent whole-is-greater-than-the-sum-of-the-parts perspective of the latter (Fisher, 2004, 2017b, 2019b; Fisher & Stenner, 2018). In contrast to the lack of vision and imagination resulting from the myopia of statistical methods, I think it is essential that we seek a capacity to extend everyday language so as to inform locally situated dialogues and negotiations via the mediations of meaningful common metrics integrating concrete things with formal concepts, as has been routinely the case in a wide range of applications for quite some time (Chien, et al., 2009; Masters, 1994; Masters, et al., 1994, 1999; Wilson, 2018; Wilson, et al., 2012; Wright, et al., 1980; among many others).
Sweden’s national metrology institute (the Research Institutes of Sweden, RISE) is aggressively taking up research in this domain (Cano, et al., 2019; Fisher, 2019a; Pendrill, 2014, 2019; Pendrill & Fisher, 2015), as are a number of other national metrology institutes globally that have been involved over the last decade in the meetings of the International Measurement Confederation (IMEKO; Fisher, 2008, 2010c, 2012a). An IMEKO Joint Symposium that Mark Wilson and I hosted at UC Berkeley in 2016 had nearly equal numbers of psychometricians and metrology engineers (Wilson & Fisher, 2016). This and later Joint Symposia have included enough full-length papers for special issues of IMEKO’s Measurement journal (Wilson & Fisher, 2018, 2019).
Psychology and the social sciences seem hopelessly stuck on continued use of statistical significance tests and ordinal scores as the paradigm of measurement. Now that these issues have garnered the attention of metrologists, however, a sound basis has emerged for hope that new directions will be explored on broader scales and to greater depths. The new partnerships being sought out and research initiatives being proposed at RISE, for instance, promise to enhance awareness across fields as to the challenges and opportunities at hand.
References
Andersen, E. B. (1977). Sufficient statistics and latent trait models. Psychometrika, 42(1), 69-81.
Andersen, E. B. (1999). Sufficient statistics in educational measurement. In G. N. Masters & J. P. Keeves (Eds.), Advances in measurement in educational research and assessment (pp. 122-125). New York: Pergamon.
Andrich, D. (2010). Sufficiency and conditional estimation of person parameters in the polytomous Rasch model. Psychometrika, 75(2), 292-308.
Barbic, S., Cano, S. J., Tee, K., & Mathias, S. (2019). Patient-centered outcome measurement in psychiatry: How metrology can optimize health services and outcomes. TMQ–Techniques, Methodologies, and Quality, 10(Special Issue on Health Metrology), 10-19.
Cano, S., Pendrill, L., Melin, J., & Fisher, W. P., Jr. (2019). Towards consensus measurement standards for patient-centered outcomes. Measurement, 141, 62-69.
Chien, T.-W., Wang, W.-C., Wang, H.-Y., & Lin, H.-J. (2009). Online assessment of patients’ views on hospital performances using Rasch model’s KIDMAP diagram. BMC Health Services Research, 9, 135.
Cohen, J. (1994). The earth is round (p < 0.05). American Psychologist, 49, 997-1003.
Duncan, O. D. (1992, September). What if? Contemporary Sociology, 21(5), 667-668.
Duncan, O. D., & Stenbeck, M. (1988). Panels and cohorts: Design and model in the study of voting turnout. In C. C. Clogg (Ed.), Sociological Methodology 1988 (pp. 1-35). Washington, DC: American Sociological Association.
Embretson, S. E. (2010). Measuring psychological constructs: Advances in model-based approaches. Washington, DC: American Psychological Association.
Fischer, G. H. (1973). The linear logistic test model as an instrument in educational research. Acta Psychologica, 37, 359-374.
Fischer, G. H. (1981). On the existence and uniqueness of maximum-likelihood estimates in the Rasch model. Psychometrika, 46(1), 59-77.
Fisher, W. P., Jr. (1997a). Physical disability construct convergence across instruments: Towards a universal metric. Journal of Outcome Measurement, 1(2), 87-113 [http://jampress.org/JOM_V1N2.pdf].
Fisher, W. P., Jr. (1997b). What scale-free measurement means to health outcomes research. Physical Medicine & Rehabilitation State of the Art Reviews, 11(2), 357-373.
Fisher, W. P., Jr. (2004). Meaning and method in the social sciences. Human Studies: A Journal for Philosophy and the Social Sciences, 27(4), 429-454.
Fisher, W. P., Jr. (2008). Notes on IMEKO symposium. Rasch Measurement Transactions, 22(1), 1147 [http://www.rasch.org/rmt/rmt221.pdf].
Fisher, W. P., Jr. (2009). Invariance and traceability for measures of human, social, and natural capital: Theory and application. Measurement, 42(9), 1278-1287.
Fisher, W. P., Jr. (2010a). The standard model in the history of the natural sciences, econometrics, and the social sciences. Journal of Physics Conference Series, 238(012016).
Fisher, W. P., Jr. (2010b). Statistics and measurement: Clarifying the differences. Rasch Measurement Transactions, 23(4), 1229-1230 [http://www.rasch.org/rmt/rmt234.pdf].
Fisher, W. P., Jr. (2010c). Unifying the language of measurement. Rasch Measurement Transactions, 24(2), 1278-1281 [http://www.rasch.org/rmt/rmt242.pdf].
Fisher, W. P., Jr. (2012a). 2011 IMEKO conference proceedings available online. Rasch Measurement Transactions, 25(4), 1349 [http://www.rasch.org/rmt/rmt254.pdf].
Fisher, W. P., Jr. (2012b, June 1). What the world needs now: A bold plan for new standards [Third place, 2011 NIST/SES World Standards Day paper competition]. Standards Engineering, 64(3), 1 & 3-5 [http://ssrn.com/abstract=2083975].
Fisher, W. P., Jr. (2017a). Metrology, psychometrics, and new horizons for innovation. 18th International Congress of Metrology, Paris, 09007. doi: 10.1051/metrology/201709007
Fisher, W. P., Jr. (2017b). A practical approach to modeling complex adaptive flows in psychology and social science. Procedia Computer Science, 114, 165-174. Retrieved from https://doi.org/10.1016/j.procs.2017.09.027
Fisher, W. P., Jr. (2018). Update on Rasch in metrology. Rasch Measurement Transactions, 32(1), 1685-1687.
Fisher, W. P., Jr. (2019a). News from Sweden’s National Metrology Institute. Rasch Measurement Transactions, 32(3), 1719-1723.
Fisher, W. P., Jr. (2019b). A nondualist social ethic: Fusing subject and object horizons in measurement. TMQ–Techniques, Methodologies, and Quality [Special Issue on Health Metrology], 10, 21-40.
Fisher, W. P., Jr., & Stenner, A. J. (2016). Theory-based metrological traceability in education: A reading measurement network. Measurement, 92, 489-496.
Fisher, W. P., Jr., & Stenner, A. J. (2018). On the complex geometry of individuality and growth: Cook’s 1914 ‘Curves of Life’ and reading measurement. Journal of Physics Conference Series, 1065, 072040.
Fisher, W. P., Jr., & Wright, B. D. (Eds.). (1994). Applications of probabilistic conjoint measurement. International Journal of Educational Research, 21(6), 557-664.
Green, K. E., & Smith, R. M. (1987). A comparison of two methods of decomposing item difficulties. Journal of Educational Statistics, 12(4), 369-381.
Latimer, S. L. (1982). Using the Linear Logistic Test Model to investigate a discourse-based model of reading comprehension. Education Research and Perspectives, 9(1), 73-94 [http://www.rasch.org/erp7.htm].
Luce, R. D., & Tukey, J. W. (1964). Simultaneous conjoint measurement: A new kind of fundamental measurement. Journal of Mathematical Psychology, 1(1), 1-27.
Mari, L., & Wilson, M. (2014, May). An introduction to the Rasch measurement approach for metrologists. Measurement, 51, 315-327.
Mari, L., & Wilson, M. (2020). Measurement across the sciences [in press]. Cham: Springer.
Masters, G. N. (1994). KIDMAP – a history. Rasch Measurement Transactions, 8(2), 366 [http://www.rasch.org/rmt/rmt82k.htm].
Masters, G. N., Adams, R. J., & Lokan, J. (1994). Mapping student achievement. International Journal of Educational Research, 21(6), 595-610.
Masters, G. N., Adams, R. J., & Wilson, M. (1999). Charting of student progress. In G. N. Masters & J. P. Keeves (Eds.), Advances in measurement in educational research and assessment (pp. 254-267). New York: Pergamon.
Meehl, P. E. (1967). Theory-testing in psychology and physics: A methodological paradox. Philosophy of Science, 34(2), 103-115.
Michell, J. (1986). Measurement scales and statistics: A clash of paradigms. Psychological Bulletin, 100, 398-407.
Narens, L., & Luce, R. D. (1986). Measurement: The theory of numerical assignments. Psychological Bulletin, 99(2), 166-180.
Pendrill, L. (2014). Man as a measurement instrument [Special Feature]. NCSLi Measure: The Journal of Measurement Science, 9(4), 22-33.
Pendrill, L. (2019). Quality assured measurement: Unification across social and physical sciences. Cham: Springer.
Pendrill, L., & Fisher, W. P., Jr. (2015). Counting and quantification: Comparing psychometric and metrological perspectives on visual perceptions of number. Measurement, 71, 46-55.
Prien, B. (1989). How to predetermine the difficulty of items of examinations and standardized tests. Studies in Educational Evaluation, 15, 309-317.
Rasch, G. (1960). Probabilistic models for some intelligence and attainment tests (Reprint, with Foreword and Afterword by B. D. Wright, Chicago: University of Chicago Press, 1980). Copenhagen, Denmark: Danmarks Paedagogiske Institut.
San Martin, E., Gonzalez, J., & Tuerlinckx, F. (2009). Identified parameters, parameters of interest, and their relationships. Measurement: Interdisciplinary Research and Perspectives, 7(2), 97-105.
Smith, E. V., Jr. (2005). Representing treatment effects with variable maps. In N. Bezruczko (Ed.), Rasch measurement in health sciences (pp. 247-259). Maple Grove, MN: JAM Press.
Stenner, A. J., Fisher, W. P., Jr., Stone, M. H., & Burdick, D. S. (2013). Causal Rasch models. Frontiers in Psychology: Quantitative Psychology and Measurement, 4(536), 1-14.
Stenner, A. J., & Smith, M., III. (1982). Testing construct theories. Perceptual and Motor Skills, 55, 415-426.
Stone, M. H., Wright, B., & Stenner, A. J. (1999). Mapping variables. Journal of Outcome Measurement, 3(4), 308-322 [http://jampress.org/JOM_V3N4.pdf].
Thurstone, L. L. (1928). Attitudes can be measured. American Journal of Sociology, XXXIII, 529-544. (Rpt. in L. L. Thurstone, (1959). The measurement of values (pp. 215-233). Chicago, Illinois: University of Chicago Press, Midway Reprint Series.)
Wilson, M. (2005). Constructing measures: An item response modeling approach. Mahwah, New Jersey: Lawrence Erlbaum Associates.
Wilson, M. R. (2009). Measuring progressions: Assessment structures underlying a learning progression. Journal of Research in Science Teaching, 46, 716-730.
Wilson, M. R. (2013a). Seeking a balance between the statistical and scientific elements in psychometrics. Psychometrika, 78(2), 211-236.
Wilson, M. R. (2013b). Using the concept of a measurement system to characterize measurement models used in psychometrics. Measurement, 46, 3766-3774.
Wilson, M. (2018). Making measurement important for education: The crucial role of classroom assessment. Educational Measurement: Issues and Practice, 37(1), 5-20.
Wilson, M., Bejar, I., Scalise, K., Templin, J., Wiliam, D., & Torres Irribarra, D. (2012). Perspectives on methodological issues. In P. Griffin, B. McGaw & E. Care (Eds.), Assessment and teaching of 21st century skills (pp. 67-141). Dordrecht: Springer Netherlands.
Wilson, M., & Fisher, W. (2016). Preface: 2016 IMEKO TC1-TC7-TC13 Joint Symposium: Metrology across the Sciences: Wishful Thinking? Journal of Physics Conference Series, 772(1), 011001.
Wilson, M., & Fisher, W. (2018). Preface of special issue Metrology across the Sciences: Wishful Thinking? Measurement, 127, 577.
Wilson, M., & Fisher, W. (2019). Preface of special issue, Psychometric Metrology. Measurement, 145, 190.
Wright, B. D. (1977). Solving measurement problems with the Rasch model. Journal of Educational Measurement, 14(2), 97-116 [http://www.rasch.org/memo42.htm].
Wright, B. D. (1997). A history of social science measurement. Educational Measurement: Issues and Practice, 16(4), 33-45, 52 [http://www.rasch.org/memo62.htm].
Wright, B. D., Mead, R. J., & Ludlow, L. H. (1980). KIDMAP: Person-by-item interaction mapping (Tech. Rep. No. MESA Memorandum #29). Chicago: MESA Press [http://www.rasch.org/memo29.pdf].
Wright, B. D., & Stone, M. H. (1979). Best test design: Rasch measurement. Chicago, Illinois: MESA Press.
Wright, B. D., & Stone, M. H. (1999). Measurement essentials. Wilmington, DE: Wide Range, Inc. [http://www.rasch.org/measess/me-all.pdf].
Wright, B. D., & Stone, M. H. (2003). Five steps to science: Observing, scoring, measuring, analyzing, and applying. Rasch Measurement Transactions, 17(1), 912-913 [http://www.rasch.org/rmt/rmt171j.htm].
Zhu, W. (2012). Sadly, the earth is still round (p < 0.05). Journal of Sport and Health Science, 1(1), 9-11.
LivingCapitalMetrics Blog by William P. Fisher, Jr., Ph.D. is licensed under a Creative Commons Attribution-Noncommercial-No Derivative Works 3.0 United States License.
Based on a work at livingcapitalmetrics.wordpress.com.
Permissions beyond the scope of this license may be available at http://www.livingcapitalmetrics.com.