Archive for the ‘Reproducibility’ Category

Remarks on a Survey Concerning Results Replications in Psychology

May 22, 2020

(The following reply was sent in response to an invitation from researchers at the University of Queensland in Brisbane, Australia to participate in a survey on the replication crisis in psychology.)

Thank you for alerting me to your important survey, and for providing an opportunity to address issues of results replications in psychology. Given the content of the survey, it seems appropriate to offer an alternative perspective on the nature of the situation.

Initially, I had a look at the first question in your survey on the replication crisis in psychology and closed the page. It did not seem to me that the question could be properly answered given the information provided. Later I went back and responded as reasonably as I could, given that the entire survey is biased toward the standard misconceptions of psychological measurement, namely, that ordinal scores gathered with the aim of applying descriptive statistics are definitive, and that quantitative methods have no need for hypotheses, models, proofs, or evidence of meaningful interval units of comparison.

To my mind, the replication crisis in psychology is in part a function of ignoring the distinction between statistical models and scientific models (Cohen, 1994; Duncan, 1992; Fisher, 2010b; Meehl, 1967; Michell, 1986; Wilson, 2013a; Zhu, 2012). The statistical motivation for making models probabilistic concerns sampling; the scientific motivation concerns the individual response process. As Duncan and Stenbeck (1988, pp. 24-25) put it,

“The main point to emphasize here is that the postulate of probabilistic response must be clearly distinguished in both concept and research design from the stochastic variation of data that arises from random sampling of a heterogeneous population. The distinction is completely blurred in our conventional statistical training and practice of data analysis, wherein the stochastic aspects of the statistical model are most easily justified by the idea of sampling from a population distribution. We seldom stop to wonder if sampling is the only reason for making the model stochastic. The perverse consequence of doing good statistics is, therefore, to suppress curiosity about the actual processes that generate the data.”

This distinction between scientific and statistical models is old and worn. It often seems the mainstream will never pick up on it, despite the fact that whenever the sum of counts or ratings produced by an individual-level response process is treated inferentially as a sufficient statistic (i.e., as a score to which no outside information is added), an identified scientific measurement model of a particular form is assumed, whether or not the researcher or analyst is aware of it or actually applies it (Andersen, 1977, 1999; Andrich, 2010; Fischer, 1981; San Martin, Gonzalez, & Tuerlinckx, 2009).
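To make the logic concrete, here is a minimal sketch in Python (the item difficulties, response patterns, and grid search are purely illustrative, not taken from any of the cited studies). Under the dichotomous Rasch model, two different response patterns with the same raw score lead to the same maximum likelihood estimate of the person measure, which is just another way of saying that the unweighted sum score already carries all the information the responses offer about the person parameter:

```python
import numpy as np

def rasch_prob(theta, b):
    """Probability of a correct/affirmative response under the dichotomous Rasch model."""
    return 1.0 / (1.0 + np.exp(-(theta - b)))

def person_loglik(theta, responses, b):
    """Log-likelihood of a 0/1 response vector for a person located at theta (logits)."""
    p = rasch_prob(theta, b)
    return np.sum(responses * np.log(p) + (1 - responses) * np.log(1 - p))

b = np.array([-1.0, 0.0, 1.0])    # illustrative item difficulties, in logits
pattern_a = np.array([1, 1, 0])   # two different response patterns...
pattern_b = np.array([0, 1, 1])   # ...with the same raw score of 2

grid = np.linspace(-4, 4, 801)    # crude grid search for the maximum likelihood estimate
mle_a = grid[np.argmax([person_loglik(t, pattern_a, b) for t in grid])]
mle_b = grid[np.argmax([person_loglik(t, pattern_b, b) for t in grid])]
print(mle_a, mle_b)               # identical: only the raw score matters, not the pattern
```

The sketch only illustrates the logic; the proofs cited above establish the general result.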

Forty-three years ago, the situation was described by Wright (1977, p. 114):

“Unweighted scores are appropriate for person measurement if and only if what happens when a person responds to an item can be usefully approximated by a Rasch model…. Ironically, for anyone who claims skepticism about ‘the assumptions’ of the Rasch model, those who use unweighted scores are, however unwittingly, counting on the Rasch model to see them through. Whether this is useful in practice is a question not for more theorizing, but for empirical study.”

Insofar as measurement results are replicable, they converge on a common construct and unit definition, and support collective learning processes, the coherence of communities of research and practice, and the emergence of metrological standards (Barbic, et al., 2019; Cano, et al., 2019; Fisher, 1997a/b, 2004, 2009, 2010a, 2012, 2017a; Fisher & Stenner, 2016; Mari & Wilson, 2014, 2020; Pendrill, 2014, 2019; Pendrill & Fisher, 2015; Wilson, 2013b).

Researchers’ subjective guesses as to what measured constructs look like and how they behave tend to be borne out, more or less, in ways that allow us all to learn from each other, if and when we take the trouble to prepare, scale, and present our results in the form required to make that happen (for guidance in this regard, see Fisher & Wright, 1994; Smith, 2005; Stone, Wright, & Stenner, 1999; Wilson, 2005, 2009, 2018; Wilson, et al., 2012; Wright & Stone, 1979, 1999, 2003).

You would never know it from the kind of research assumed in your online survey, but the successful replication of results naturally should and does lead to the detailed mapping of variables (constructs), and the definition of unit standards that bring the thing measured into language as common metrics and shared objects of reference.

This is not a new idea, or an unproven one (Luce & Tukey, 1964; Narens & Luce, 1986; Rasch, 1960; Thurstone, 1928; Wright, 1997; among many others). Proofs of the form of the model following from the sufficiency of the scores are cited above, and experimental proofs of the utility of the models for designing and calibrating interval unit measures are provided in thousands of peer-reviewed publications. Explanatory scientific models predicting item calibrations have been in development and practical use for decades (Embretson, 2010; Fischer, 1973; Latimer, 1982; Prien, 1989; Stenner & Smith, 1982; Stenner, et al., 2013; Wright & Stone, 1979; among many others).

Preconceptions and unexamined assumptions about measurement blind many researchers, limiting their vision of what is possible to conventional repetitions of more of the same, even when the methods used do not work and have been shown ineffectual repeatedly over decades. In this regard, it is worth noting, contra widespread assumptions, that another difference between statistical and scientific models lies in the reductionist whole-is-the-sum-of-the-parts perspective of the former and the emergent whole-is-greater-than-the-sum-of-the-parts perspective of the latter (Fisher, 2004, 2017b, 2019b; Fisher & Stenner, 2018). In contrast to the narrowed vision that results from this statistical myopia, I think it is essential that we seek a capacity to extend everyday language so as to inform locally situated dialogues and negotiations via the mediation of meaningful common metrics that integrate concrete things with formal concepts, as has routinely been done in a wide range of applications for quite some time (Chien, et al., 2009; Masters, 1994; Masters, et al., 1994, 1999; Wilson, 2018; Wilson, et al., 2012; Wright, et al., 1980; among many others).

Sweden’s national metrology institute (the Research Institute of Sweden, RISE) is aggressively taking up research in this domain (Cano, et al., 2019; Fisher, 2019a; Pendrill, 2014, 2019; Pendrill & Fisher, 2015), as are a number of other national metrology institutes around the world that have been involved over the last decade in the meetings of the International Measurement Confederation (IMEKO; Fisher, 2008, 2010c, 2012a). An IMEKO Joint Symposium hosted by Mark Wilson and me at UC Berkeley in 2016 drew nearly equal numbers of psychometricians and metrology engineers (Wilson & Fisher, 2016). This and later Joint Symposia have produced enough full-length papers for special issues of IMEKO’s journal Measurement (Wilson & Fisher, 2018, 2019).

Though psychology and the social sciences seem hopelessly stuck on the continued use of statistical significance tests and ordinal scores as the paradigm of measurement, the attention these issues have garnered from metrologists provides a sound basis for hope that new directions will be explored on broader scales and to greater depths. The new partnerships being sought out and the research initiatives being proposed at RISE, for instance, promise to enhance awareness across fields of the challenges and opportunities at hand.

References

Andersen, E. B. (1977). Sufficient statistics and latent trait models. Psychometrika, 42(1), 69-81.

Andersen, E. B. (1999). Sufficient statistics in educational measurement. In G. N. Masters & J. P. Keeves (Eds.), Advances in measurement in educational research and assessment (pp. 122-125). New York: Pergamon.

Andrich, D. (2010). Sufficiency and conditional estimation of person parameters in the polytomous Rasch model. Psychometrika, 75(2), 292-308.

Barbic, S., Cano, S. J., Tee, K., & Mathias, S. (2019). Patient-centered outcome measurement in psychiatry: How metrology can optimize health services and outcomes. TMQ – Techniques, Methodologies and Quality, 10(Special Issue on Health Metrology), 10-19.

Cano, S., Pendrill, L., Melin, J., & Fisher, W. P., Jr. (2019). Towards consensus measurement standards for patient-centered outcomes. Measurement, 141, 62-69.

Chien, T.-W., Wang, W.-C., Wang, H.-Y., & Lin, H.-J. (2009). Online assessment of patients’ views on hospital performances using Rasch model’s KIDMAP diagram. BMC Health Services Research, 9, 135.

Cohen, J. (1994). The earth is round (p < 0.05). American Psychologist, 49, 997-1003.

Duncan, O. D. (1992, September). What if? Contemporary Sociology, 21(5), 667-668.

Duncan, O. D., & Stenbeck, M. (1988). Panels and cohorts: Design and model in the study of voting turnout. In C. C. Clogg (Ed.), Sociological Methodology 1988 (pp. 1-35). Washington, DC: American Sociological Association.

Embretson, S. E. (2010). Measuring psychological constructs: Advances in model-based approaches. Washington, DC: American Psychological Association.

Fischer, G. H. (1973). The linear logistic test model as an instrument in educational research. Acta Psychologica, 37, 359-374.

Fischer, G. H. (1981). On the existence and uniqueness of maximum-likelihood estimates in the Rasch model. Psychometrika, 46(1), 59-77.

Fisher, W. P., Jr. (1997a). Physical disability construct convergence across instruments: Towards a universal metric. Journal of Outcome Measurement, 1(2), 87-113 [http://jampress.org/JOM_V1N2.pdf]

Fisher, W. P., Jr. (1997b). What scale-free measurement means to health outcomes research. Physical Medicine & Rehabilitation State of the Art Reviews, 11(2), 357-373.

Fisher, W. P., Jr. (2004). Meaning and method in the social sciences. Human Studies: A Journal for Philosophy and the Social Sciences, 27(4), 429-454.

Fisher, W. P., Jr. (2008). Notes on IMEKO symposium. Rasch Measurement Transactions, 22(1), 1147 [http://www.rasch.org/rmt/rmt221.pdf].

Fisher, W. P., Jr. (2009). Invariance and traceability for measures of human, social, and natural capital: Theory and application. Measurement, 42(9), 1278-1287.

Fisher, W. P., Jr. (2010a). The standard model in the history of the natural sciences, econometrics, and the social sciences. Journal of Physics Conference Series, 238(012016).

Fisher, W. P., Jr. (2010b). Statistics and measurement: Clarifying the differences. Rasch Measurement Transactions, 23(4), 1229-1230  [http://www.rasch.org/rmt/rmt234.pdf].

Fisher, W. P., Jr. (2010c). Unifying the language of measurement. Rasch Measurement Transactions, 24(2), 1278-1281  [http://www.rasch.org/rmt/rmt242.pdf].

Fisher, W. P., Jr. (2012a). 2011 IMEKO conference proceedings available online. Rasch Measurement Transactions, 25(4), 1349 [http://www.rasch.org/rmt/rmt254.pdf].

Fisher, W. P., Jr. (2012b, June 1). What the world needs now: A bold plan for new standards [Third place, 2011 NIST/SES World Standards Day paper competition]. Standards Engineering, 64(3), 1 & 3-5 [http://ssrn.com/abstract=2083975].

Fisher, W. P., Jr. (2017a). Metrology, psychometrics, and new horizons for innovation. 18th International Congress of Metrology, Paris, 09007. doi: 10.1051/metrology/201709007

Fisher, W. P., Jr. (2017b). A practical approach to modeling complex adaptive flows in psychology and social science. Procedia Computer Science, 114, 165-174. Retrieved from https://doi.org/10.1016/j.procs.2017.09.027

Fisher, W. P., Jr. (2018). Update on Rasch in metrology. Rasch Measurement Transactions, 32(1), 1685-1687.

Fisher, W. P., Jr. (2019a). News from Sweden’s National Metrology Institute. Rasch Measurement Transactions, 32(3), 1719-1723.

Fisher, W. P., Jr. (2019b). A nondualist social ethic: Fusing subject and object horizons in measurement. TMQ–Techniques, Methodologies, and Quality [Special Issue on Health Metrology], 10, 21-40.

Fisher, W. P., Jr., & Stenner, A. J. (2016). Theory-based metrological traceability in education: A reading measurement network. Measurement, 92, 489-496.

Fisher, W. P., Jr., & Stenner, A. J. (2018). On the complex geometry of individuality and growth: Cook’s 1914 ‘Curves of Life’ and reading measurement. Journal of Physics Conference Series, 1065, 072040.

Fisher, W. P., Jr., & Wright, B. D. (Eds.). (1994). Applications of probabilistic conjoint measurement. International Journal of Educational Research, 21(6), 557-664.

Green, K. E., & Smith, R. M. (1987). A comparison of two methods of decomposing item difficulties. Journal of Educational Statistics, 12(4), 369-381.

Latimer, S. L. (1982). Using the Linear Logistic Test Model to investigate a discourse-based model of reading comprehension. Education Research and Perspectives, 9(1), 73-94 [http://www.rasch.org/erp7.htm].

Luce, R. D., & Tukey, J. W. (1964). Simultaneous conjoint measurement: A new kind of fundamental measurement. Journal of Mathematical Psychology, 1(1), 1-27.

Mari, L., & Wilson, M. (2014, May). An introduction to the Rasch measurement approach for metrologists. Measurement, 51, 315-327.

Mari, L., & Wilson, M. (2020). Measurement across the sciences [in press]. Cham: Springer.

Masters, G. N. (1994). KIDMAP – a history. Rasch Measurement Transactions, 8(2), 366 [http://www.rasch.org/rmt/rmt82k.htm].

Masters, G. N., Adams, R. J., & Lokan, J. (1994). Mapping student achievement. International Journal of Educational Research, 21(6), 595-610.

Masters, G. N., Adams, R. J., & Wilson, M. (1999). Charting of student progress. In G. N. Masters & J. P. Keeves (Eds.), Advances in measurement in educational research and assessment (pp. 254-267). New York: Pergamon.

Meehl, P. E. (1967). Theory-testing in psychology and physics: A methodological paradox. Philosophy of Science, 34(2), 103-115.

Michell, J. (1986). Measurement scales and statistics: A clash of paradigms. Psychological Bulletin, 100, 398-407.

Narens, L., & Luce, R. D. (1986). Measurement: The theory of numerical assignments. Psychological Bulletin, 99(2), 166-180.

Pendrill, L. (2014). Man as a measurement instrument [Special Feature]. NCSLi Measure: The Journal of Measurement Science, 9(4), 22-33.

Pendrill, L. (2019). Quality assured measurement: Unification across social and physical sciences. Cham: Springer.

Pendrill, L., & Fisher, W. P., Jr. (2015). Counting and quantification: Comparing psychometric and metrological perspectives on visual perceptions of number. Measurement, 71, 46-55.

Prien, B. (1989). How to predetermine the difficulty of items of examinations and standardized tests. Studies in Educational Evaluation, 15, 309-317.

Rasch, G. (1960). Probabilistic models for some intelligence and attainment tests (Reprint, with Foreword and Afterword by B. D. Wright, Chicago: University of Chicago Press, 1980). Copenhagen, Denmark: Danmarks Paedagogiske Institut.

San Martin, E., Gonzalez, J., & Tuerlinckx, F. (2009). Identified parameters, parameters of interest, and their relationships. Measurement: Interdisciplinary Research and Perspectives, 7(2), 97-105.

Smith, E. V., Jr. (2005). Representing treatment effects with variable maps. In N. Bezruczko (Ed.), Rasch measurement in health sciences (pp. 247-259). Maple Grove, MN: JAM Press.

Stenner, A. J., Fisher, W. P., Jr., Stone, M. H., & Burdick, D. S. (2013). Causal Rasch models. Frontiers in Psychology: Quantitative Psychology and Measurement, 4(536), 1-14.

Stenner, A. J., & Smith, M., III. (1982). Testing construct theories. Perceptual and Motor Skills, 55, 415-426.

Stone, M. H., Wright, B., & Stenner, A. J. (1999). Mapping variables. Journal of Outcome Measurement, 3(4), 308-322. [http://jampress.org/JOM_V3N4.pdf]

Thurstone, L. L. (1928). Attitudes can be measured. American Journal of Sociology, XXXIII, 529-544. (Rpt. in L. L. Thurstone, (1959). The measurement of values (pp. 215-233). Chicago, Illinois: University of Chicago Press, Midway Reprint Series.)

Wilson, M. (2005). Constructing measures: An item response modeling approach. Mahwah, New Jersey: Lawrence Erlbaum Associates.

Wilson, M. R. (2009). Measuring progressions: Assessment structures underlying a learning progression. Journal of Research in Science Teaching, 46, 716-730.

Wilson, M. R. (2013a). Seeking a balance between the statistical and scientific elements in psychometrics. Psychometrika, 78(2), 211-236.

Wilson, M. R. (2013b). Using the concept of a measurement system to characterize measurement models used in psychometrics. Measurement, 46, 3766-3774.

Wilson, M. (2018). Making measurement important for education: The crucial role of classroom assessment. Educational Measurement: Issues and Practice, 37(1), 5-20.

Wilson, M., Bejar, I., Scalise, K., Templin, J., Wiliam, D., & Torres Irribarra, D. (2012). Perspectives on methodological issues. In P. Griffin, B. McGaw & E. Care (Eds.), Assessment and teaching of 21st century skills (pp. 67-141). Dordrecht: Springer Netherlands.

Wilson, M., & Fisher, W. (2016). Preface: 2016 IMEKO TC1-TC7-TC13 Joint Symposium: Metrology across the Sciences: Wishful Thinking? Journal of Physics Conference Series, 772(1), 011001.

Wilson, M., & Fisher, W. (2018). Preface of special issue Metrology across the Sciences: Wishful Thinking? Measurement, 127, 577.

Wilson, M., & Fisher, W. (2019). Preface of special issue, Psychometric Metrology. Measurement, 145, 190.

Wright, B. D. (1977). Solving measurement problems with the Rasch model. Journal of Educational Measurement, 14(2), 97-116 [http://www.rasch.org/memo42.htm].

Wright, B. D. (1997). A history of social science measurement. Educational Measurement: Issues and Practice, 16(4), 33-45, 52 [http://www.rasch.org/memo62.htm].

Wright, B. D., Mead, R. J., & Ludlow, L. H. (1980). KIDMAP: person-by-item interaction mapping (Tech. Rep. No. MESA Memorandum #29). Chicago: MESA Press [http://www.rasch.org/memo29.pdf].

Wright, B. D., & Stone, M. H. (1979). Best test design: Rasch measurement. Chicago, Illinois: MESA Press.

Wright, B. D., & Stone, M. H. (1999). Measurement essentials. Wilmington, DE: Wide Range, Inc. [http://www.rasch.org/measess/me-all.pdf].

Wright, B. D., & Stone, M. H. (2003). Five steps to science: Observing, scoring, measuring, analyzing, and applying. Rasch Measurement Transactions, 17(1), 912-913 [http://www.rasch.org/rmt/rmt171j.htm].

Zhu, W. (2012). Sadly, the earth is still round (p < 0.05). Journal of Sport and Health Science, 1(1), 9-11.

Creative Commons License

LivingCapitalMetrics Blog by William P. Fisher, Jr., Ph.D. is licensed under a Creative Commons Attribution-Noncommercial-No Derivative Works 3.0 United States License.
Based on a work at livingcapitalmetrics.wordpress.com.
Permissions beyond the scope of this license may be available at http://www.livingcapitalmetrics.com.

Psychology and the social sciences: An atheoretical, scattered, and disconnected body of research

February 16, 2019

A new article in Nature Human Behaviour (NHB) points toward the need for better theory and more rigorous mathematical models in psychology and the social sciences (Muthukrishna & Henrich, 2019). The authors rightly say that the lack of an overarching cumulative theoretical framework makes it very difficult to see whether new results fit well with previous work, or if something surprising has come to light. Mathematical models are especially emphasized as being of value in specifying clear and precise expectations.

The point that the social sciences and psychology need better theories and models is painfully obvious. But there are in fact thousands of published studies and practical real-world applications that not only provide, but often surpass, the kinds of predictive theories and mathematical models called for in the NHB article. Not only does the article make no mention of any of this work; its argument is also framed entirely in a statistical context instead of the more appropriate context of measurement science.

The concept of reliability provides an excellent point of entry. Most behavioral scientists think of reliability statistically, as a coefficient with a positive numeric value usually between 0.00 and 1.00. The tangible sense of reliability as indicating exactly how predictable an outcome is does not usually figure in most researchers’ thinking. But that sense of the specific predictability of results has been the focus of attention in social and psychological measurement science for decades.
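One way to recover that tangible sense is to restate a reliability coefficient in terms of measurement error, along the lines of the Fisher (1992, 2008) notes listed in the references below. The short sketch that follows uses the standard separation and strata formulas from that literature; the reliability values are arbitrary examples:

```python
import math

def separation(reliability):
    """Separation index G: spread of the measures expressed in units of their average error."""
    return math.sqrt(reliability / (1.0 - reliability))

def strata(reliability):
    """Approximate number of statistically distinct levels distinguished, (4G + 1) / 3."""
    g = separation(reliability)
    return (4.0 * g + 1.0) / 3.0

for r in (0.70, 0.80, 0.90, 0.95):
    print(f"reliability {r:.2f}: separation {separation(r):.2f}, about {strata(r):.1f} distinct strata")
```

Read this way, a reliability of 0.80 is not merely "acceptable"; it says the instrument can reproducibly distinguish about three levels of the variable, which is a directly testable prediction about what will replicate.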

For instance, the measurement of time is reliable in the sense that the position of the sun relative to the earth can be precisely predicted from geographic location, the time of day, and the day of the year. The numbers and words assigned to noon time are closely associated with the Sun being at the high point in the sky (though there are political variations by season and location across time zones).

That kind of a reproducible association is rarely sought in psychology and the social sciences, but it is far from nonexistent. One can discern different degrees to which that kind of association is included in models of measured constructs. Though most behavioral research doesn’t mention the connection between linear amounts of a measured phenomenon and a reproducible numeric representation of it (level 0), quite a significant body of work focuses on that connection (level 1). The disappointing thing about that level 1 work is that the relentless obsession with statistical methods prevents most researchers from connecting a reproducible quantity with a single expression of it in a standard unit, and with an associated uncertainty term (level 2). That is, level 1 researchers conceive measurement in statistical terms, as a product of data analysis. Even when results across data sets are highly correlated and could be equated to a common metric, level 1 researchers do not leverage that source of potential value for simplified communication and accumulated comparability.

And then, for their part, level 2 researchers usually do not articulate theories about the measured constructs, by augmenting the mathematical data model with an explanatory model predicting variation (level 3). Level 2 researchers are empirically grounded in data, and can expand their network of measures only by gathering more data and analyzing it in ways that bring it into their standard unit’s frame of reference.
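A minimal sketch of what bringing new data into a standard unit's frame of reference can look like (the item difficulties are invented, and a real linking study would also examine fit and estimate the error of the link): when a new instrument shares a few items with an established bank, the mean difference between the two sets of calibrations on those shared items gives the constant that places the new items on the existing scale.

```python
import numpy as np

# Illustrative item difficulties (logits) from two separately calibrated instruments.
# Items "c1"-"c3" appear in both; the remaining items are unique to each instrument.
bank = {"c1": -0.8, "c2": 0.1, "c3": 0.9, "a1": -1.5, "a2": 1.2}
new  = {"c1": -0.4, "c2": 0.6, "c3": 1.2, "b1": -0.9, "b2": 2.0}

common = sorted(set(bank) & set(new))
link = float(np.mean([bank[i] - new[i] for i in common]))   # mean-difference link constant

linked = {item: round(d + link, 2) for item, d in new.items()}
print(f"link constant: {link:.2f} logits")
print(linked)   # the new instrument's items, now expressed on the bank's scale
```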

Level 3 researchers, however, have come to see what makes their measures tick. They understand the mechanisms that make their questions vary. They can write new questions to their theoretical specifications, test those questions by asking them of a relevant sample, and produce the predicted calibrations. For instance, reading comprehension is well established to be a function of the difference between a person’s reading ability and the complexity of the text they encounter (see articles by Stenner in the list below). We have built our entire educational system around this idea, as we deliberately introduce children first to the alphabet, then to the most common words, then to short sentences, and then to ever longer and more complicated text. But stating the construct model, testing it against data, calibrating a unit to which all tests and measures can be traced, and connecting together all the books, articles, tests, curricula, and students is a process that began (in English and Spanish) only in the 1980s. The process still is far from finished, and most reading research still does not use the common metric.
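The structure of that claim can be written down in a few lines. In the sketch below, the logistic form follows the Rasch model cited throughout this blog, and the reader and text measures are invented for illustration; the point is only that predicted comprehension depends on nothing but the difference between reader ability and text complexity on a common scale.

```python
import math

def comprehension_rate(reader_ability, text_complexity):
    """Expected comprehension as a logistic function of the reader-text gap (logits)."""
    gap = reader_ability - text_complexity
    return 1.0 / (1.0 + math.exp(-gap))

reader = 1.0                              # illustrative reader measure
for text in (-1.0, 0.5, 1.0, 2.5):        # illustrative text complexities
    rate = comprehension_rate(reader, text)
    print(f"text at {text:+.1f}: expected comprehension {rate:.0%}")
```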

In this kind of theory-informed context, new items can be automatically generated on the fly at the point of measurement. Those items and inferences made from them are validated by the consistency of the responses and the associated expression of the expected probability of success, agreement, etc. The expense of constant data gathering and analysis can be cut to a very small fraction of what it is at levels 0-2.
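One way such theory-based calibration is operationalized (this is a deliberately simplified sketch in the spirit of the linear logistic test model of Fischer (1973), cited below; the item features and weights are hypothetical) is to regress empirical item difficulties on theoretically relevant item properties, then use the fitted weights to predict the difficulty of newly written or automatically generated items before any responses are collected:

```python
import numpy as np

# Hypothetical construct specification: each row is an item, each column a
# theorized difficulty-driving feature (e.g., number of solution steps, vocabulary rarity).
features = np.array([
    [1, 0.2],
    [2, 0.5],
    [2, 1.1],
    [3, 0.9],
    [4, 1.6],
], dtype=float)
calibrations = np.array([-1.2, -0.3, 0.2, 0.6, 1.5])      # empirical item difficulties (logits)

X = np.column_stack([np.ones(len(features)), features])   # add an intercept column
weights, *_ = np.linalg.lstsq(X, calibrations, rcond=None)

new_item = np.array([1.0, 3.0, 1.4])                       # intercept plus features of a new item
print(f"predicted difficulty of the new item: {new_item @ weights:.2f} logits")
```

When predicted and empirically estimated calibrations agree within error across new samples and new items, the construct theory itself is being tested, which is exactly the kind of replication the field says it wants.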

Level 3 research methods are not widely known or used, but they are not new. They are gaining traction as their use by national metrology institutes globally grows. As high profile critiques of social and psychological research practices continue to emerge, perhaps more attention will be paid to this important body of work. A few key references are provided below, and virtually every post in this blog pertains to these issues.

References

Baghaei, P. (2008). The Rasch model as a construct validation tool. Rasch Measurement Transactions, 22(1), 1145-6 [http://www.rasch.org/rmt/rmt221a.htm].

Bergstrom, B. A., & Lunz, M. E. (1994). The equivalence of Rasch item calibrations and ability estimates across modes of administration. In M. Wilson (Ed.), Objective measurement: Theory into practice, Vol. 2 (pp. 122-128). Norwood, New Jersey: Ablex.

Cano, S., Pendrill, L., Barbic, S., & Fisher, W. P., Jr. (2018). Patient-centred outcome metrology for healthcare decision-making. Journal of Physics: Conference Series, 1044, 012057.

Dimitrov, D. M. (2010). Testing for factorial invariance in the context of construct validation. Measurement & Evaluation in Counseling & Development, 43(2), 121-149.

Embretson, S. E. (2010). Measuring psychological constructs: Advances in model-based approaches. Washington, DC: American Psychological Association.

Fischer, G. H. (1973). The linear logistic test model as an instrument in educational research. Acta Psychologica, 37, 359-374.

Fischer, G. H. (1983). Logistic latent trait models with linear constraints. Psychometrika, 48(1), 3-26.

Fisher, W. P., Jr. (1992). Reliability statistics. Rasch Measurement Transactions, 6(3), 238 [http://www.rasch.org/rmt/rmt63i.htm].

Fisher, W. P., Jr. (2008). The cash value of reliability. Rasch Measurement Transactions, 22(1), 1160-1163 [http://www.rasch.org/rmt/rmt221.pdf].

Fisher, W. P., Jr., & Stenner, A. J. (2016). Theory-based metrological traceability in education: A reading measurement network. Measurement, 92, 489-496.

Green, S. B., Lissitz, R. W., & Mulaik, S. A. (1977). Limitations of coefficient alpha as an index of test unidimensionality. Educational and Psychological Measurement, 37(4), 827-833.

Hattie, J. (1985). Methodology review: Assessing unidimensionality of tests and items. Applied Psychological Measurement, 9(2), 139-64.

Hobart, J. C., Cano, S. J., Zajicek, J. P., & Thompson, A. J. (2007). Rating scales as outcome measures for clinical trials in neurology: Problems, solutions, and recommendations. Lancet Neurology, 6, 1094-1105.

Irvine, S. H., Dunn, P. L., & Anderson, J. D. (1990). Towards a theory of algorithm-determined cognitive test construction. British Journal of Psychology, 81, 173-195.

Kline, T. L., Schmidt, K. M., & Bowles, R. P. (2006). Using LinLog and FACETS to model item components in the LLTM. Journal of Applied Measurement, 7(1), 74-91.

Lunz, M. E., & Linacre, J. M. (2010). Reliability of performance examinations: Revisited. In M. Garner, G. Engelhard, Jr., W. P. Fisher, Jr. & M. Wilson (Eds.), Advances in Rasch Measurement, Vol. 1 (pp. 328-341). Maple Grove, MN: JAM Press.

Mari, L., & Wilson, M. (2014). An introduction to the Rasch measurement approach for metrologists. Measurement, 51, 315-327.

Markward, N. J., & Fisher, W. P., Jr. (2004). Calibrating the genome. Journal of Applied Measurement, 5(2), 129-141.

Maul, A., Mari, L., Torres Irribarra, D., & Wilson, M. (2018). The quality of measurement results in terms of the structural features of the measurement process. Measurement, 116, 611-620.

Muthukrishna, M., & Henrich, J. (2019). A problem in theory. Nature Human Behaviour, 1-9.

Obiekwe, J. C. (1999, August 1). Application and validation of the linear logistic test model for item difficulty prediction in the context of mathematics problems. Dissertation Abstracts International: Section B: The Sciences & Engineering, 60(2-B), 0851.

Pendrill, L. (2014). Man as a measurement instrument [Special Feature]. NCSLi Measure: The Journal of Measurement Science, 9(4), 22-33.

Pendrill, L., & Fisher, W. P., Jr. (2015). Counting and quantification: Comparing psychometric and metrological perspectives on visual perceptions of number. Measurement, 71, 46-55.

Pendrill, L., & Petersson, N. (2016). Metrology of human-based and other qualitative measurements. Measurement Science and Technology, 27(9), 094003.

Sijtsma, K. (2009). Correcting fallacies in validity, reliability, and classification. International Journal of Testing, 8(3), 167-194.

Sijtsma, K. (2009). On the use, the misuse, and the very limited usefulness of Cronbach’s alpha. Psychometrika, 74(1), 107-120.

Stenner, A. J. (2001). The necessity of construct theory. Rasch Measurement Transactions, 15(1), 804-5 [http://www.rasch.org/rmt/rmt151q.htm].

Stenner, A. J., Fisher, W. P., Jr., Stone, M. H., & Burdick, D. S. (2013). Causal Rasch models. Frontiers in Psychology: Quantitative Psychology and Measurement, 4(536), 1-14.

Stenner, A. J., & Horabin, I. (1992). Three stages of construct definition. Rasch Measurement Transactions, 6(3), 229 [http://www.rasch.org/rmt/rmt63b.htm].

Stenner, A. J., Stone, M. H., & Fisher, W. P., Jr. (2018). The unreasonable effectiveness of theory based instrument calibration in the natural sciences: What can the social sciences learn? Journal of Physics Conference Series, 1044(012070).

Stone, M. H. (2003). Substantive scale construction. Journal of Applied Measurement, 4(3), 282-297.

Wilson, M. (2005). Constructing measures: An item response modeling approach. Mahwah, New Jersey: Lawrence Erlbaum Associates.

Wilson, M. R. (2013). Using the concept of a measurement system to characterize measurement models used in psychometrics. Measurement, 46, 3766-3774.

Wright, B. D., & Stone, M. H. (1979). Chapter 5: Constructing a variable. In Best test design: Rasch measurement (pp. 83-128). Chicago, Illinois: MESA Press.

Wright, B. D., & Stone, M. H. (1999). Measurement essentials. Wilmington, DE: Wide Range, Inc. [http://www.rasch.org/measess/me-all.pdf].

Wright, B. D., Stone, M., & Enos, M. (2000). The evolution of meaning in practice. Rasch Measurement Transactions, 14(1), 736 [http://www.rasch.org/rmt/rmt141g.htm].


How Evidence-Based Decision Making Suffers in the Absence of Theory and Instrument: The Power of a More Balanced Approach

January 28, 2010

The Basis of Evidence in Theory and Instrument

The ostensible point of basing decisions in evidence is to have reasons for proceeding in one direction versus any other. We want to be able to say why we are proceeding as we are. When we give evidence-based reasons for our decisions, we typically couch them in terms of what worked in past experience. That experience might have been accrued over time in practical applications, or it might have been deliberately arranged in one or more experimental comparisons and tests of concisely stated hypotheses.

At its best, generalizing from past experience to as yet unmet future experiences enables us to navigate life and succeed in ways that would not be possible if we could not learn and had no memories. The application of a lesson learned from particular past events to particular future events involves a very specific inferential process. To be able to recognize repeated iterations of the same things requires the accumulation of patterns of evidence. Experience in observing such patterns allows us to develop confidence in our understanding of what a pattern represents in terms of pleasant or painful consequences. When we can conceptualize and articulate a pattern, and then recognize a new occurrence of it, we have formed a working idea of it.

Evidence-based decision making is then a matter of formulating expectations from repeatedly demonstrated and routinely reproducible patterns of observations that lend themselves to conceptual representations, as ideas expressed in words. Linguistic and cultural frameworks selectively focus attention by projecting expectations and filtering observations into meaningful patterns represented by words, numbers, and other symbols. The point of efforts aimed at basing decisions in evidence is to try to go with the flow of this inferential process more deliberately and effectively than might otherwise be the case.

None of this is new or controversial. However, the inferential step from evidence to decision always involves unexamined and unjustified assumptions. That is, there is always an element of metaphysical faith behind the expectation that any given symbol or word is going to work as a representation of something in the same way that it has in the past. We can never completely eliminate this leap of faith, since we cannot predict the future with 100% confidence. We can, however, do a lot to reduce the size of the leap, and the risks that go with it, by questioning our assumptions in experimental research that tests hypotheses as to the invariant stability and predictive utility of the representations we make.

Theoretical and Instrumental Assumptions Hidden Behind the Evidence

For instance, evidence as to the effectiveness of an intervention or treatment is often expressed in terms of measures commonly described as quantitative. But it is unusual for any evidence to be produced justifying that description in terms of something that really adds up in the way numbers do. So we often find ourselves in situations in which our evidence is much less meaningful, reliable, and valid than we suppose it to be.

Quantitative measures are often valued as the hallmark of rational science. But their capacity to live up to this billing depends on the quality of the inferences that can be supported. Very few researchers thoroughly investigate the quality of their measures and justify the inferences they make relative to that quality.

Measurement presumes a reproducible pattern of evidence that can serve as the basis for a decision concerning how much of something has been observed. It naturally follows that we often base measurement in counts of some kind—successes, failures, ratings, frequencies, etc. The counts, scores, or sums are then often transformed into percentages by dividing them by the maximum possible that could be obtained. Sometimes the scores are averaged for each person measured, and/or for each item or question on the test, assessment, or survey. These scores and percentages are then almost universally fed directly into decision processes or statistical analyses with no further consideration.
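One simple way to see what gets lost in that routine (a minimal sketch; the percentages are arbitrary examples) is to compare equal differences in percent-correct against the log-odds scale implied by treating the raw score as a sufficient statistic: the same five-point gain corresponds to very different amounts of the underlying variable depending on where it occurs.

```python
import math

def logit(p):
    """Log-odds of a success proportion: the interval-scale location it implies."""
    return math.log(p / (1.0 - p))

pairs = [(0.50, 0.55), (0.90, 0.95)]   # the same 5-point gain in percent-correct...
for low, high in pairs:
    gain = logit(high) - logit(low)
    print(f"{low:.0%} -> {high:.0%}: {gain:.2f} logits")
# ...is nearly four times larger on the log-odds scale when it occurs near the ceiling.
```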

The reproducible pattern of evidence on which decisions are based is presumed to exist between the measures, not within them. In other words, the focus is on the group or population statistics, not on the individual measures. Attention is typically focused on the tip of the iceberg, the score or percentage, not on the much larger, but hidden, mass of information beneath it. Evidence is presumed to be sufficient to the task when the differences between groups of scores are of a consistent size or magnitude, but is this sufficient?

Going Past Assumptions to Testable Hypotheses

In other words, does not science require that evidence be explained by theory, and embodied in instrumentation that provides a shared medium of observation? As shown in the blue lines in the Figure below,

  • theory, whether or not it is explicitly articulated, inevitably influences both what counts as valid data and the configuration of the medium of its representation, the instrument;
  • data, whether or not it is systematically gathered and evaluated, inevitably influences both the medium of its representation, the instrument, and the implicit or explicit theory that explains its properties and justifies its applications; and
  • instruments, whether or not they are actually calibrated from a mapping of symbols and substantive amounts, inevitably influence data gathering and the image of the object explained by theory.

The rhetoric of evidence-based decision making skips over the roles of theory and instrumentation, drawing a direct line from data to decision. In leaving theory laxly formulated, we allow any story that makes a bit of sense and is communicated by someone with a bit of charm or power to carry the day. In not requiring calibrated instrumentation, we allow any data that cross the threshold into our awareness to serve as an acceptable basis for decisions.

What we want, however, is to require meaningful measures that really provide the evidence needed for instruments that exhibit invariant calibrations and for theories that provide predictive explanatory control over the variable. As shown in the Figure, we want data that push theory away from the instrument, theory that separates the data and instrument, and instruments that get in between the theory and data.

We all know to distrust too close a correspondence between theory and data, but we too rarely understand or capitalize on the role of the instrument in mediating the theory-data relation. Similarly, when the questions used as a medium for making observations are obviously biased to produce responses conforming overly closely with a predetermined result, we see that the theory and the instrument are too close for the data to serve as an effective mediator.

Finally, the situation predominating in the social sciences is one in which both construct and measurement theories are nearly nonexistent, which leaves data completely dependent on the instrument it came from. In other words, because counts of correct answers or sums of ratings are mistakenly treated as measures, instruments fully determine and restrict the range of measurement to that defined by the numbers of items and rating categories. Once the instrument is put in play, changes to it would make new data incommensurable with old, so, to retain at least the appearance of comparability, the data structure then fully determines and restricts the instrument.

What we want, though, is a situation in which construct and measurement theories work together to make the data autonomous of the particular instrument it came from. We want a theory that explains what is measured well enough for us to be able to modify existing instruments, or create entirely new ones, that give the same measures for the same amounts as the old instruments. We want to be able to predict item calibrations from the properties of the items, we want to obtain the same item calibrations across data sets, and we want to be able to predict measures on the basis of the observed responses (data) no matter which items or instrument was used to produce them.

Most importantly, we want a theory and practice of measurement that allows us to take missing data into account by providing us with the structural invariances we need as media for predicting the future from the past. As Ben Wright (1997, p. 34) said, any data analysis method that requires complete data to produce results disqualifies itself automatically as a viable basis for inference because we never have complete data—any practical system of measurement has to be positioned so as to be ready to receive, process, and incorporate all of the data we have yet to gather. This goal is accomplished to varying degrees in Rasch measurement (Rasch, 1960; Burdick, Stone, & Stenner, 2006; Dawson, 2004). Stenner and colleagues (Stenner, Burdick, Sanford, & Burdick, 2006) provide a trajectory of increasing degrees to which predictive theory is employed in contemporary measurement practice.
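A minimal sketch of that readiness for incomplete data (the item bank and response records are invented, and a production implementation would add fit diagnostics and handle perfect and zero scores): because each item carries its own calibration, a person's measure can be estimated from whatever subset of items was actually answered, and records with different patterns of missingness remain directly comparable in the same frame of reference.

```python
import numpy as np

def measure_from_incomplete(responses, difficulties, max_iter=50, tol=1e-6):
    """Maximum likelihood person measure from whichever calibrated items were answered.

    `responses` holds 0/1 with np.nan marking items never administered; missing
    entries are skipped rather than imputed, because the item calibrations place
    whatever was answered in the common frame of reference.
    """
    answered = ~np.isnan(responses)
    x, b = responses[answered], difficulties[answered]
    theta = 0.0
    for _ in range(max_iter):                            # Newton-Raphson iterations
        p = 1.0 / (1.0 + np.exp(-(theta - b)))
        step = (x.sum() - p.sum()) / (p * (1.0 - p)).sum()
        theta += step
        if abs(step) < tol:
            break
    return theta

bank = np.array([-1.5, -0.5, 0.0, 0.5, 1.5])             # illustrative calibrated item bank
person_a = np.array([1, 1, np.nan, 0, np.nan])           # answered only items 1, 2, and 4
person_b = np.array([np.nan, 1, 1, 1, 0])                # answered only items 2 through 5
print(round(measure_from_incomplete(person_a, bank), 2),
      round(measure_from_incomplete(person_b, bank), 2))
```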

The explanatory and predictive power of theory is embodied in instruments that focus attention on recording observations of salient phenomena. These observations become data that inform the calibration of instruments, which then are used to gather further data that can be used in practical applications and in checks on the calibrations and the theory.

“Nothing is so practical as a good theory” (Lewin, 1951, p. 169). Good theory makes it possible to create symbolic representations of things that are easy to think with. To facilitate clear thinking, our words, numbers, and instruments must be transparent. We have to be able to look right through them at the thing itself, with no concern as to distortions introduced by the instrument, the sample, the observer, the time, the place, etc. This happens only when the structure of the instrument corresponds with invariant features of the world. And where words effect this transparency to an extent, it is realized most completely when we can measure in ways that repeatedly give the same results for the same amounts in the same conditions no matter which instrument, sample, operator, etc. is involved.

Where Might Full Mathematization Lead?

The attainment of mathematical transparency in measurement is remarkable for the way it focuses attention and constrains the imagination. It is essential to appreciate the context in which this focusing occurs, as popular opinion is at odds with historical research in this regard. Over the last 60 years, historians of science have come to vigorously challenge the widespread assumption that technology is a product of experimentation and/or theory (Kuhn, 1961/1977; Latour, 1987, 2005; Maas, 2001; Mendelsohn, 1992; Rabkin, 1992; Schaffer, 1992; Heilbron, 1993; Hankins & Silverman, 1999; Baird, 2002). Neither theory nor experiment typically advances until a key technology is widely available to end users in applied and/or research contexts. Rabkin (1992) documents multiple roles played by instruments in the professionalization of scientific fields. Thus, “it is not just a clever historical aphorism, but a general truth, that ‘thermodynamics owes much more to the steam engine than ever the steam engine owed to thermodynamics’” (Price, 1986, p. 240).

The prior existence of the relevant technology comes to bear on theory and experiment again in the common, but mistaken, assumption that measures are made and experimentally compared in order to discover scientific laws. History shows that measures are rarely made until the relevant law is effectively embodied in an instrument (Kuhn, 1961/1977, pp. 218-9): “…historically the arrow of causality is largely from the technology to the science” (Price, 1986, p. 240). Instruments do not provide just measures; rather they produce the phenomenon itself in a way that can be controlled, varied, played with, and learned from (Heilbron, 1993, p. 3; Hankins & Silverman, 1999; Rabkin, 1992). The term “technoscience” has emerged as an expression denoting recognition of this priority of the instrument (Baird, 1997; Ihde & Selinger, 2003; Latour, 1987).

Because technology often dictates what, if any, phenomena can be consistently produced, it constrains experimentation and theorizing by focusing attention selectively on reproducible, potentially interpretable effects, even when those effects are not well understood (Ackermann, 1985; Daston & Galison, 1992; Ihde, 1998; Hankins & Silverman, 1999; Maasen & Weingart, 2001). Criteria for theory choice in this context stem from competing explanatory frameworks’ experimental capacities to facilitate instrument improvements, prediction of experimental results, and gains in the efficiency with which a phenomenon is produced.

In this context, the relatively recent introduction of measurement models requiring additive, invariant parameterizations (Rasch, 1960) provokes speculation as to the effect on the human sciences that might be wrought by the widespread availability of consistently reproducible effects expressed in common quantitative languages. Paraphrasing Price’s comment on steam engines and thermodynamics, might it one day be said that as yet unforeseeable advances in reading theory will owe far more to the Lexile analyzer (Stenner, et al., 2006) than ever the Lexile analyzer owed to reading theory?

Kuhn (1961/1977) speculated that the second scientific revolution of the early- to mid-nineteenth century followed in large part from the full mathematization of physics, i.e., the emergence of metrology as a professional discipline focused on providing universally accessible, theoretically predictable, and evidence-supported uniform units of measurement (Roche, 1998). Kuhn (1961/1977, p. 220) specifically suggests that a number of vitally important developments converged about 1840 (also see Hacking, 1983, p. 234). This was the year in which the metric system was formally instituted in France after 50 years of development (it had already been obligatory in other nations for 20 years at that point), and metrology emerged as a professional discipline (Alder, 2002, p. 328, 330; Heilbron, 1993, p. 274; Kula, 1986, p. 263). Daston (1992) independently suggests that the concept of objectivity came of age in the period from 1821 to 1856, and gives examples illustrating the way in which the emergence of strong theory, shared metric standards, and experimental data converged in a context of particular social mores to winnow out unsubstantiated and unsupportable ideas and contentions.

Might a similar revolution and new advances in the human sciences follow from the introduction of evidence-based, theoretically predictive, instrumentally mediated, and mathematical uniform measures? We won’t know until we try.

Figure. The Dialectical Interactions and Mutual Mediations of Theory, Data, and Instruments

Acknowledgment. These ideas have been drawn in part from long consideration of many works in the history and philosophy of science, primarily Ackermann (1985), Ihde (1991), and various works of Martin Heidegger, as well as key works in measurement theory and practice. A few obvious points of departure are listed in the references.

References

Ackermann, J. R. (1985). Data, instruments, and theory: A dialectical approach to understanding science. Princeton, New Jersey: Princeton University Press.

Alder, K. (2002). The measure of all things: The seven-year odyssey and hidden error that transformed the world. New York: The Free Press.

Aldrich, J. (1989). Autonomy. Oxford Economic Papers, 41, 15-34.

Andrich, D. (2004, January). Controversy and the Rasch model: A characteristic of incompatible paradigms? Medical Care, 42(1), I-7–I-16.

Baird, D. (1997, Spring-Summer). Scientific instrument making, epistemology, and the conflict between gift and commodity economics. Techné: Journal of the Society for Philosophy and Technology, 3-4, 25-46. Retrieved 08/28/2009, from http://scholar.lib.vt.edu/ejournals/SPT/v2n3n4/baird.html.

Baird, D. (2002, Winter). Thing knowledge – function and truth. Techné: Journal of the Society for Philosophy and Technology, 6(2). Retrieved 19/08/2003, from http://scholar.lib.vt.edu/ejournals/SPT/v6n2/baird.html.

Burdick, D. S., Stone, M. H., & Stenner, A. J. (2006). The Combined Gas Law and a Rasch Reading Law. Rasch Measurement Transactions, 20(2), 1059-60 [http://www.rasch.org/rmt/rmt202.pdf].

Carroll-Burke, P. (2001). Tools, instruments and engines: Getting a handle on the specificity of engine science. Social Studies of Science, 31(4), 593-625.

Daston, L. (1992). Baconian facts, academic civility, and the prehistory of objectivity. Annals of Scholarship, 8, 337-363. (Rpt. in L. Daston, (Ed.). (1994). Rethinking objectivity (pp. 37-64). Durham, North Carolina: Duke University Press.)

Daston, L., & Galison, P. (1992, Fall). The image of objectivity. Representations, 40, 81-128.

Dawson, T. L. (2004, April). Assessing intellectual development: Three approaches, one sequence. Journal of Adult Development, 11(2), 71-85.

Galison, P. (1999). Trading zone: Coordinating action and belief. In M. Biagioli (Ed.), The science studies reader (pp. 137-160). New York, New York: Routledge.

Hacking, I. (1983). Representing and intervening: Introductory topics in the philosophy of natural science. Cambridge: Cambridge University Press.

Hankins, T. L., & Silverman, R. J. (1999). Instruments and the imagination. Princeton, New Jersey: Princeton University Press.

Heelan, P. A. (1983, June). Natural science as a hermeneutic of instrumentation. Philosophy of Science, 50, 181-204.

Heelan, P. A. (1998, June). The scope of hermeneutics in natural science. Studies in History and Philosophy of Science Part A, 29(2), 273-98.

Heidegger, M. (1977). Modern science, metaphysics, and mathematics. In D. F. Krell (Ed.), Basic writings [reprinted from M. Heidegger, What is a thing? South Bend, Regnery, 1967, pp. 66-108] (pp. 243-282). New York: Harper & Row.

Heidegger, M. (1977). The question concerning technology. In D. F. Krell (Ed.), Basic writings (pp. 283-317). New York: Harper & Row.

Heilbron, J. L. (1993). Weighing imponderables and other quantitative science around 1800. Historical Studies in the Physical and Biological Sciences, 24(Supplement), Part I, 1-337.

Hessenbruch, A. (2000). Calibration and work in the X-ray economy, 1896-1928. Social Studies of Science, 30(3), 397-420.

Ihde, D. (1983). The historical and ontological priority of technology over science. In D. Ihde, Existential technics (pp. 25-46). Albany, New York: State University of New York Press.

Ihde, D. (1991). Instrumental realism: The interface between philosophy of science and philosophy of technology. (The Indiana Series in the Philosophy of Technology). Bloomington, Indiana: Indiana University Press.

Ihde, D. (1998). Expanding hermeneutics: Visualism in science. (Northwestern University Studies in Phenomenology and Existential Philosophy). Evanston, Illinois: Northwestern University Press.

Ihde, D., & Selinger, E. (Eds.). (2003). Chasing technoscience: Matrix for materiality. (Indiana Series in Philosophy of Technology). Bloomington, Indiana: Indiana University Press.

Kuhn, T. S. (1961/1977). The function of measurement in modern physical science. Isis, 52(168), 161-193. (Rpt. In T. S. Kuhn, The essential tension: Selected studies in scientific tradition and change (pp. 178-224). Chicago: University of Chicago Press, 1977).

Kula, W. (1986). Measures and men (R. Szreter, Trans.). Princeton, New Jersey: Princeton University Press (Original work published 1970).

Lapre, M. A., & Van Wassenhove, L. N. (2002, October). Learning across lines: The secret to more efficient factories. Harvard Business Review, 80(10), 107-11.

Latour, B. (1987). Science in action: How to follow scientists and engineers through society. New York, New York: Cambridge University Press.

Latour, B. (2005). Reassembling the social: An introduction to Actor-Network-Theory. (Clarendon Lectures in Management Studies). Oxford, England: Oxford University Press.

Lewin, K. (1951). Field theory in social science: Selected theoretical papers (D. Cartwright, Ed.). New York: Harper & Row.

Maas, H. (2001). An instrument can make a science: Jevons’s balancing acts in economics. In M. S. Morgan & J. Klein (Eds.), The age of economic measurement (pp. 277-302). Durham, North Carolina: Duke University Press.

Maasen, S., & Weingart, P. (2001). Metaphors and the dynamics of knowledge. (Vol. 26. Routledge Studies in Social and Political Thought). London: Routledge.

Mendelsohn, E. (1992). The social locus of scientific instruments. In R. Bud & S. E. Cozzens (Eds.), Invisible connections: Instruments, institutions, and science (pp. 5-22). Bellingham, WA: SPIE Optical Engineering Press.

Polanyi, M. (1964/1946). Science, faith and society. Chicago: University of Chicago Press.

Price, D. J. d. S. (1986). Of sealing wax and string. In Little Science, Big Science–and Beyond (pp. 237-253). New York, New York: Columbia University Press.

Rabkin, Y. M. (1992). Rediscovering the instrument: Research, industry, and education. In R. Bud & S. E. Cozzens (Eds.), Invisible connections: Instruments, institutions, and science (pp. 57-82). Bellingham, Washington: SPIE Optical Engineering Press.

Rasch, G. (1960). Probabilistic models for some intelligence and attainment tests (Reprint, with Foreword and Afterword by B. D. Wright, Chicago: University of Chicago Press, 1980). Copenhagen, Denmark: Danmarks Paedagogiske Institut.

Roche, J. (1998). The mathematics of measurement: A critical history. London: The Athlone Press.

Schaffer, S. (1992). Late Victorian metrology and its instrumentation: A manufactory of Ohms. In R. Bud & S. E. Cozzens (Eds.), Invisible connections: Instruments, institutions, and science (pp. 23-56). Bellingham, WA: SPIE Optical Engineering Press.

Stenner, A. J., Burdick, H., Sanford, E. E., & Burdick, D. S. (2006). How accurate are Lexile text measures? Journal of Applied Measurement, 7(3), 307-22.

Thurstone, L. L. (1959). The measurement of values. Chicago: University of Chicago Press, Midway Reprint Series.

Wright, B. D. (1997, Winter). A history of social science measurement. Educational Measurement: Issues and Practice, 16(4), 33-45, 52 [http://www.rasch.org/memo62.htm].


Draft Legislation on Development and Adoption of an Intangible Assets Metric System

November 19, 2009

In my opinion, more could be done to effect meaningful and effective health care reform with legislation like that proposed below, which has fewer than 3,800 words, than will ever be possible with the 2,074 pages in Congress’s current health care reform bill. What’s more, creating the infrastructure for human, social, and natural capital markets in this way would not only cost a tiny fraction of the $847 billion projected for the bill being debated, it would be an investment paying returns many times larger than the initial outlay. See previous posts in this blog for more on how and why this is so.

The draft legislation below is adapted from The Metric Conversion Act (Title 15 U.S.C. Chapter 6 §(204) 205a – 205k). The viability of a metric system for human, social, and natural capital is indicated by the realized state of scientific rigor in the measurement of human, social, and natural capital (Fisher, 2009b). The need for such a system is indicated by the current crisis’s pointed economic demands that all forms of capital be unified within a common econometric and financial framework (Fisher, 2009a). It is equally demanded by the moral and philosophical requirements of fair play and meaningfulness (Fisher, 2004). The day is fast approaching when a metric system for intangible assets will be recognized as the urgent need that it is (Fisher, 2009c).

At some point in the near future, it can be expected that a table showing how to interpret the units of the Intangible Assets Metric System will be published in the Federal Register, just as the International System units have been.

For those unfamiliar with the state of the art in measurement, these may seem like wildly unrealistic goals. Those wondering how a reasonable person might arrive at such opinions are urged to consult other posts in this blog, and the references cited in them. The advantages of an intangible assets metric system for sustainable and socially responsible economic policies and practices are nothing short of profound. As Georg Rasch (1980, p. xx) said in reference to the stringent demands of his measurement models, “this is a huge challenge, but once the problem has been formulated it does seem possible to meet it.” We are less likely to attain goals that we do not actively formulate. In the spirit of John Dewey’s student, Chiang Mon-Lin, what we need are “wild hypotheses and careful tests.” There is no wilder idea with greater potential impact for redefining profit as the reduction of waste, and for thereby mitigating human suffering, sociopolitical discontent, and environmental degradation.

Fisher, W. P., Jr. (2004, October). Meaning and method in the social sciences. Human Studies: A Journal for Philosophy and the Social Sciences, 27(4), 429-54.

Fisher, W. P., Jr. (2009a). Bringing human, social, and natural capital to life: Practical consequences and opportunities. In M. Wilson, K. Draney, N. Brown, B. Duckor (Eds.), Advances in Rasch Measurement, Vol. Two (p. in press). Maple Grove, MN: JAM Press.

Fisher, W. P., Jr. (2009b, November). Invariance and traceability for measures of human, social, and natural capital: Theory and application. Measurement (Elsevier), 42(9), 1278-1287.

Fisher, W. P., Jr. (2009c). NIST critical national need idea white paper: Metrological infrastructure for human, social, and natural capital (Tech. Rep.). New Orleans: LivingCapitalMetrics.com.

Rasch, G. (1980). Probabilistic models for some intelligence and attainment tests (Reprint, with Foreword and Afterword by B. D. Wright, Chicago: University of Chicago Press). Copenhagen, Denmark: Danmarks Paedagogiske Institut.

Title xx U.S.C. Chapter x §(100) 101a – 101k
METRIC SYSTEM FOR INTANGIBLE ASSETS DEVELOPMENT LAW
(Pub. L. 10-xxx, §x, Intangible Assets Metrics Development Act, July 25, 2010)

§ 100. New metric system development authorized. – A new national effort is hereby initiated throughout the United States of America focusing on building and realizing the benefits of a metric system for the intangible assets known as human, social, and natural capital.

§ 101a. Congressional statement of findings. – The Congress finds as follows:

(1) The United States was an original signatory party to the 1875 Treaty of the Meter (20 Stat. 709), which established the General Conference of Weights and Measures, the International Committee of Weights and Measures and the International Bureau of Weights and Measures.

(2) The use of metric measurement standards in the United States was authorized by law in 1866; with the Metric Conversion Act of 1975 this Nation established a national policy of committing itself and taking steps to facilitate conversion to the metric system.

(3) World trade is dependent on the metric system of measurement; continuing trends toward globalization demand expansion of the metric system to include vital economic resources shown to be scientifically measurable in research conducted over the last 80 years.

(4) Industries and consumers in the United States are often at competitive disadvantages when dealing in domestic and international markets because no existing systems for measuring intangible assets (human, social, and natural capital) are expressed in standardized, universally uniform metrics. The end result is that education, health care, human resource, and other markets are unable to reward quality; supply and demand are unmatched; consumers make decisions with no or insufficient information; and quality cannot be systematically improved.

(5) The inherent simplicity of the metric system of measurement and standardization of weights and measures has led to major cost savings in certain industries which have converted to that system; similar savings are expected to follow from the development and implementation of a metric system for intangible assets.

(6) The Federal Government has a responsibility to develop procedures and techniques to assist industry, especially small business, as it voluntarily seeks to adopt a new metric system of measurement for intangible assets that have always required management but which have not yet been uniformly and systematically measured.

(7) A new metric system of measurement for human, social, and natural capital can provide substantial advantages to the Federal Government in its own operations.

§ 101b. Declaration of policy. – It is therefore the declared policy of the United States-

(1) to support the development and implementation of a new metric system of intangible assets measurement as the preferred system of weights and measures for United States trade and commerce involving human, social, and natural capital;

(2) to require that each Federal agency, by a date certain and to the extent economically feasible by the end of the fiscal year 2011, use the new metric system of intangibles measurement in its procurements, grants, and other business-related activities, except to the extent that such use is impractical or is likely to cause significant inefficiencies or loss of markets to United States firms, such as when foreign competitors are producing competing products in non-metric units; and

(3) to seek out ways to increase understanding of the new metric system of intangibles measurement through educational information and guidance and in Government publications.

§ 101c. Definitions

As used in this subchapter, the term-

(1) ‘Board’ means the United States Intangible Assets Metrics Board, established under section 101d of this Title;

(2) ‘engineering standard’ means a standard which prescribes (A) a concise set of conditions and requirements that must be satisfied by a material, product, process, procedure, convention, or test method; and (B) the physical, functional, performance and/or conformance characteristics thereof;

(3) ‘international standard or recommendation’ means an engineering standard or recommendation which is (A) formulated and promulgated by an international organization and (B) recommended for adoption by individual nations as a national standard;

(4) ‘metric system of measurement’ means the International System of Units as established by the General Conference of Weights and Measures in 1960 and as interpreted or modified for the United States by the Secretary of Commerce;

(5) ‘full and open competition’ has the same meaning as defined in section 403 of title 41;

(6) ‘total installed price’ means the price of purchasing a product or material, trimming or otherwise altering some or all of that product or material, if necessary to fit with other building components, and then installing that product or material into a Federal facility;

(7) ‘hard-metric’ means measurement, design, and manufacture using the metric system of measurement, but does not include measurement, design, and manufacture using English system measurement units which are subsequently reexpressed in the metric system of measurement;

(8) ‘cost or pricing data or price analysis’ has the meaning given such terms in section 254b of title 41; and

(9) ‘Federal facility’ means any public building (as defined under section 612 of title 40) and shall include any Federal building or construction project: (A) on lands in the public domain; (B) on lands used in connection with Federal programs for agriculture research, recreation, and conservation programs; (C) on or used in connection with river, harbor, flood control, reclamation, or power projects; (D) on or used in connection with housing and residential projects; (E) on military installations (including any fort, camp, post, naval training station, airfield, proving ground, military supply depot, military school, any similar facility of the Department of Defense); (F) on installations of the Department of Veterans Affairs used for hospital or domiciliary purposes; or (G) on lands used in connection with Federal prisons, but does not include (i) any Federal building or construction project the exclusion of which the President deems to be justified in the public interest, or (ii) any construction project or building owned or controlled by a State government, local government, Indian tribe, or any private entity.

§101d. United States Intangible Assets Metrics Board

(a) Establishment. – There is established, in accordance with this section, an independent instrumentality to be known as a United States Intangible Assets Metrics Board.

(b) Membership; Chairman; appointment of members; term of office; vacancies. – The Board shall consist of 17 individuals, as follows:

(1) the Chairman, a qualified individual who shall be appointed by the President, by and with the advice and consent of the Senate;

(2) seventeen members who shall be appointed by the President, by and with the advice and consent of the Senate, on the following basis-

(A) one to be selected from lists of qualified individuals recommended by psychometricians and organizations representative of psychometric interests;

(B) one to be selected from lists of qualified individuals recommended by social scientists, the scientific and technical community, and organizations representative of social scientists and technicians;

(C) one to be selected from lists of qualified individuals recommended by environmental scientists, the scientific and technical community, and organizations representative of environmental scientists and technicians;

(D) one to be selected from a list of qualified individuals recommended by the National Association of Manufacturers or its successor;

(E) one to be selected from lists of qualified individuals recommended by the United States Chamber of Commerce, or its successor, retailers, and other commercial organizations;

(F) two to be selected from lists of qualified individuals recommended by the American Federation of Labor and Congress of Industrial Organizations or its successor, who are representative of workers directly affected by human capital metrics for health, skills, motivations, and productivity, and by other organizations representing labor;

(G) one to be selected from a list of qualified individuals recommended by the National Governors Conference, the National Council of State Legislatures, and organizations representative of State and local government;

(H) two to be selected from lists of qualified individuals recommended by organizations representative of small business;

(I) one to be selected from lists of qualified individuals representative of the human resource management industry;

(J) one to be selected from a list of qualified individuals recommended by the National Conference on Weights and Measures and standards making organizations;

(K) one to be selected from lists of qualified individuals recommended by educators, the educational community, and organizations representative of educational interests; and

(L) four at-large members to represent consumers and other interests deemed suitable by the President and who shall be qualified individuals.

As used in this subsection, each ‘list’ shall include the names of at least three individuals for each applicable vacancy. The terms of office of the members of the Board first taking office shall expire as designated by the President at the time of nomination: five at the end of the second year, five at the end of the fourth year, and six at the end of the sixth year. The term of office of the Chairman of such Board shall be six years. Members, including the Chairman, may be appointed to an additional term of six years, in the same manner as the original appointment. Successors to members of such Board shall be appointed in the same manner as the original members and shall have terms of office expiring six years from the date of expiration of the terms for which their predecessors were appointed. Any individual appointed to fill a vacancy occurring prior to the expiration of any term of office shall be appointed for the remainder of that term. Beginning 45 days after the date of incorporation of the Board, six members of such Board shall constitute a quorum for the transaction of any function of the Board.

(c) Compulsory powers. – Unless otherwise provided by the Congress, the Board shall have no compulsory powers.

(d) Termination. – The Board shall cease to exist when the Congress, by law, determines that its mission has been accomplished.

§101e. – Functions and powers of Board. – It shall be the function of the Board to devise and carry out a broad program of planning, coordination, and public education, consistent with other national policy and interests, with the aim of implementing the policy set forth in this subchapter. In carrying out this program, the Board shall-

(1) consult with and take into account the interests, views, and costs relevant to the inefficiencies that have long plagued the management of unmeasured forms of capital in United States commerce and industry, including small business; science; engineering; labor; education; consumers; government agencies at the Federal, State, and local level; nationally recognized standards developing and coordinating organizations; intangibles metrics development, planning and coordinating groups; and such other individuals or groups as are considered appropriate by the Board to the carrying out of the purposes of this subchapter. The Board shall take into account activities underway in the private and public sectors, so as not to duplicate unnecessarily such activities;

(2) provide for appropriate procedures whereby various groups, under the auspices of the Board, may formulate, and recommend or suggest, to the Board specific programs for coordinating intangibles metrics development in each industry and segment thereof and specific dimensions and configurations in the new metric system and in other measurements for general use. Such programs, dimensions, and configurations shall be consistent with (A) the needs, interests, and capabilities of manufacturers (large and small), suppliers, labor, consumers, educators, and other interested groups, and (B) the national interest;

(3) publicize, in an appropriate manner, proposed programs and provide an opportunity for interested groups or individuals to submit comments on such programs. At the request of interested parties, the Board, in its discretion, may hold hearings with regard to such programs. Such comments and hearings may be considered by the Board;

(4) encourage activities of standardization organizations to develop or revise, as rapidly as practicable, policy and IT standards based on the new intangibles metrics, and to take advantage of opportunities to promote (A) rationalization or simplification of relationships, (B) improvements of design, (C) reduction of size variations, (D) increases in economy, and (E) where feasible, the efficient use of energy and the conservation of natural resources;

(5) encourage the retention, in the new metric language of human, social, and natural capital standards, of those United States policy and IT designs, practices, and conventions that are internationally accepted or that embody superior technology;

(6) consult and cooperate with foreign governments, and intergovernmental organizations, in collaboration with the Department of State, and, through appropriate member bodies, with private international organizations, which are or become concerned with the encouragement and coordination of increased use of intangible assets metrics measurement units or policy and IT standards based on such units, or both. Such consultation shall include efforts, where appropriate, to gain international recognition for intangible assets metrics standards proposed by the United States;

(7) assist the public through information and education programs, to become familiar with the meaning and applicability of metric terms and measures in daily life. Such programs shall include –

(A) public information programs conducted by the Board, through the use of newspapers, magazines, radio, television, the Internet, social networking, and other media, and through talks before appropriate citizens’ groups, and trade and public organizations;

(B) counseling and consultation by the Secretary of Education; the Secretary of Labor; the Administrator of the Small Business Administration; and the Director of the National Science Foundation, with educational associations, State and local educational agencies, labor education committees, apprentice training committees, and other interested groups, in order to assure (i) that the new intangible assets metric system of measurement is included in the curriculum of the Nation’s educational institutions, and (ii) that teachers and other appropriate personnel are properly trained to teach the intangible assets metric system of measurement;

(C) consultation by the Secretary of Commerce with the National Conference of Weights and Measures in order to assure that State and local weights and measures officials are (i) appropriately involved in intangible assets metric development and adoption activities and (ii) assisted in their efforts to bring about timely amendments to weights and measures laws; and

(D) such other public information activities, by any Federal agency in support of this subchapter, as relate to the mission of such agency;

(8) collect, analyze, and publish information about the extent of usage of intangible assets metric measurements; evaluate the costs and benefits of that usage; and make efforts to minimize any adverse effects resulting from increasing intangible assets metric usage;

(9) conduct research, including appropriate surveys; publish the results of such research; and recommend to the Congress and to the President such action as may be appropriate to deal with any unresolved problems, issues, and questions associated with intangible assets metric development, adoption, or usage. Such problems, issues, and questions may include, but are not limited to, the impact on different occupations and industries, possible increased costs to consumers, the impact on society and the economy, effects on small business, the impact on the international trade position of the United States, the appropriateness of and methods for using procurement by the Federal Government as a means to effect development and adoption of the intangible assets metric system, the proper conversion or transition period in particular sectors of society, and consequences for national defense;

(10) submit annually to the Congress and to the President a report on its activities. Each such report shall include a status report on the development and adoption process as well as projections for continued progress in that process. Such report may include recommendations covering any legislation or executive action needed to implement the programs of development and adoption accepted by the Board. The Board may also submit such other reports and recommendations as it deems necessary; and

(11) submit to the President, not later than 1 year after the date of enactment of the Act making appropriations for carrying out this subchapter, a report on the need to provide an effective structural mechanism for adopting intangible assets metric units in statutes, regulations, and other laws at all levels of government, on a coordinated and timely basis, in response to voluntary programs adopted and implemented by various sectors of society under the auspices and with the approval of the Board. If the Board determines that such a need exists, such report shall include recommendations as to appropriate and effective means for establishing and implementing such a mechanism.

§101f. – Duties of Board. – In carrying out its duties under this subchapter, the Board may –

(1) establish an Executive Committee, and such other committees as it deems desirable;

(2) establish such committees and advisory panels as it deems necessary to work with the various sectors of the Nation’s economy and with Federal and State governmental agencies in the development and implementation of detailed development and adoption plans for those sectors. The Board may reimburse, to the extent authorized by law, the members of such committees;

(3) conduct hearings at such times and places as it deems appropriate;

(4) enter into contracts, in accordance with the Federal Property and Administrative Services Act of 1949, as amended (40 U.S.C. 471 et seq.), with Federal or State agencies, private firms, institutions, and individuals for the conduct of research or surveys, the preparation of reports, and other activities necessary to the discharge of its duties;

(5) delegate to the Executive Director such authority as it deems advisable; and

(6) perform such other acts as may be necessary to carry out the duties prescribed by this subchapter.

§101g. – Gifts, donations and bequests to Board

(a) Authorization; deposit into Treasury and disbursement. – The Board may accept, hold, administer, and utilize gifts, donations, and bequests of property, both real and personal, and personal services, for the purpose of aiding or facilitating the work of the Board. Gifts and bequests of money, and the proceeds from the sale of any other property received as gifts or bequests, shall be deposited in the Treasury in a separate fund and shall be disbursed upon order of the Board.

(b) Federal income, estate, and gift taxation of property. – For purpose of Federal income, estate, and gift taxation, property accepted under subsection (a) of this section shall be considered as a gift or bequest to or for the use of the United States.

(c) Investment of moneys; disbursement of accrued income. – Upon the request of the Board, the Secretary of the Treasury may invest and reinvest, in securities of the United States, any moneys contained in the fund authorized in subsection (a) of this section. Income accruing from such securities, and from any other property accepted to the credit of such fund, shall be disbursed upon the order of the Board.

(d) Reversion to Treasury of unexpended funds. – Funds not expended by the Board as of the date when it ceases to exist, in accordance with section 101d(d) of this title, shall revert to the Treasury of the United States as of such date.

§101h. – Compensation of Board members; travel expenses. – Members of the Board who are not in the regular full-time employ of the United States shall, while attending meetings or conferences of the Board or while otherwise engaged in the business of the Board, be entitled to receive compensation at a rate not to exceed the daily rate currently being paid grade 18 of the General Schedule (under section 5332 of title 5), including travel time. While so serving on the business of the Board away from their homes or regular places of business, members of the Board may be allowed travel expenses, including per diem in lieu of subsistence, as authorized by section 5703 of title 5 for persons employed intermittently in the Government service. Payments under this section shall not render members of the Board employees or officials of the United States for any purpose. Members of the Board who are in the employ of the United States shall be entitled to travel expenses when traveling on the business of the Board.

§101i. – Personnel

(a) Executive Director; appointment; tenure; duties. – The Board shall appoint a qualified individual to serve as the Executive Director of the Board at the pleasure of the Board. The Executive Director, subject to the direction of the Board, shall be responsible to the Board and shall carry out the intangible assets metric development and adoption program, pursuant to the provisions of this subchapter and the policies established by the Board.

(b) Executive Director; salary. – The Executive Director of the Board shall serve full time and be subject to the provisions of chapter 51 and subchapter III of chapter 53 of title 5. The annual salary of the Executive Director shall not exceed level III of the Executive Schedule under section 5314 of such title.

(c) Staff personnel; appointment and compensation. – The Board may appoint and fix the compensation of such staff personnel as may be necessary to carry out the provisions of this subchapter in accordance with the provisions of chapter 51 and subchapter III of chapter 53 of title 5.

(d) Experts and consultants; employment and compensation; annual review of contracts. – The Board may (1) employ experts and consultants or organizations thereof, as authorized by section 3109 of title 5; (2) compensate individuals so employed at rates not in excess of the rate currently being paid grade 18 of the General Schedule under section 5332 of such title, including travel time; and (3) allow such individuals, while away from their homes or regular places of business, travel expenses (including per diem in lieu of subsistence) as authorized by section 5703 of such title 5 for persons in the Government service employed intermittently: Provided, however, that contracts for such temporary employment may be renewed annually.

§101j. – Financial and administrative services; source and reimbursement. – Financial and administrative services, including those related to budgeting, accounting, financial reporting, personnel, and procurement, and such other staff services as may be needed by the Board, may be obtained by the Board from the Secretary of Commerce or other appropriate sources in the Federal Government. Payment for such services shall be made by the Board, in advance or by reimbursement, from funds of the Board in such amounts as may be agreed upon by the Chairman of the Board and by the source of the services being rendered.

§101k. – Authorization of appropriations; availability. – There are authorized to be appropriated such sums as may be necessary to carry out the provisions of this subchapter. Appropriations to carry out the provisions of this subchapter may remain available for obligation and expenditure for such period or periods as may be specified in the Acts making such appropriations.


NIST Call for White Papers

September 22, 2009

As I’ve been preparing the statistics.com course and consulting on a couple of projects, it’s been difficult to make time for postings here. There’s no lack of things to say, that’s for sure! The following is an alert to an opportunity that should not be passed up….

NIST Call for White Papers

The National Institute of Standards and Technology has posted a new Call for White Papers (http://www.nist.gov/tip/call_for_white_papers_sept09.pdf) as part of its mission “to support, promote, and accelerate innovation in the United States through high-risk, high-reward research in areas of critical national need.”

The White Papers are NIST’s mechanism for collaborating with practitioners in the field in the development of new areas of research into fundamental measurement and metrological systems. NIST is specifically seeking out areas of measurement research that are not currently a priority with any federal funding agency and that have the potential for bringing about fundamental transformations in particular scientific areas.

As was evident in its celebration of World Metrology Day last May, NIST is well aware of the human, economic, and scientific value of technical standards. Mathematics becomes the language of science most fully when universally uniform common currencies provide a lingua franca for communicating experimental results and theoretical predictions, and for conducting economic exchanges of quantitative value. When this truth is fully appreciated, it is obvious that metrological standards for human, social, and natural capital are an area of critical national need that could be highly rewarding. Given the decades of supporting research that are on the books, the risks of investing in this research are quite reasonable. This is especially so when considered relative to the rewards that could accrue from order-of-magnitude improvements in the meaningfulness, utility, and efficiency of measurement based in ordinal observations.

The Call for White Papers is not a funding opportunity but a chance to influence the substance of the areas to be focused on in future funding competitions. One might imagine that NIST would be very interested in supporting research exploring the potential for expanding any of a number of existing measurement systems and methodologies into publicly recognized reference standards.

Deadlines over the next year for White Papers are November 9, February 15, May 10, and July 12, though submissions will be accepted any time between November 9, 2009 and September 30, 2010.

A PDF of a White Paper that builds a case for Rasch-based metrological standards and that was submitted to NIST in its previous round is available at http://www.livingcapitalmetrics.com/images/FisherNISTWhitePaper2.pdf.

Further articulations of connections between Rasch measurement and the wider concerns of instruments traceable to reference standards within metrological networks are available in the following, among others:

Fisher, W. P., Jr. (1996, Winter). The Rasch alternative. Rasch Measurement Transactions, 9(4), 466-467 [http://www.rasch.org/rmt/rmt94.htm].

Fisher, W. P., Jr. (1997). Thurstone’s missed opportunity. Rasch Measurement Transactions, 11(1), 554 [http://www.rasch.org/rmt/rmt111p.htm].

Fisher, W. P., Jr. (2000). Objectivity in psychosocial measurement: What, why, how. Journal of Outcome Measurement, 4(2), 527-563 [http://www.livingcapitalmetrics.com/images/WP_Fisher_Jr_2000.pdf].

Fisher, W. P., Jr. (2008). Vanishing tricks and intellectualist condescension: Measurement, metrology, and the advancement of science. Rasch Measurement Transactions, 21(3), 1118-1121 [http://www.rasch.org/rmt/rmt213c.htm].

Fisher, W. P., Jr. (2009, November). Invariance and traceability for measures of human, social, and natural capital: Theory and application. Measurement (Elsevier), 42(9), 1278-1287.


Reliability Revisited: Distinguishing Consistency from Error

August 28, 2009

When something is meaningful to us, and we understand it, then we can successfully restate it in our own words and predictably reproduce approximately the same representation across situations as was obtained in the original formulation. When data fit a Rasch model, the implications are (1) that different subsets of items (that is, different ways of composing a series of observations summarized in a sufficient statistic) will all converge on the same pattern of person measures, and (2) that different samples of respondents or examinees will all converge on the same pattern of item calibrations. The meaningfulness of propositions based in these patterns will then not depend on which collection of items (instrument) or sample of persons is obtained, and all instruments might be equated relative to a single universal, uniform metric so that the same symbols reliably represent the same amount of the same thing.
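
To make the model being invoked here explicit: the dichotomous Rasch model gives the probability of a response of 1 as a logistic function of the difference between a person measure and an item calibration, and the unweighted sum score is the sufficient statistic for the person measure. The following minimal sketch uses hypothetical measures and calibrations chosen purely for illustration; it is not drawn from any of the analyses reported below.

```python
import numpy as np

def rasch_prob(theta, delta):
    """Dichotomous Rasch model: P(x = 1 | theta, delta) =
    exp(theta - delta) / (1 + exp(theta - delta))."""
    return 1.0 / (1.0 + np.exp(-(theta - delta)))

# Hypothetical person measure and item calibrations, in logits.
theta = 0.5
deltas = np.array([-1.0, -0.5, 0.0, 0.5, 1.0])

p = rasch_prob(theta, deltas)
print(np.round(p, 3))      # response probabilities for each item
print(round(p.sum(), 2))   # expected raw score: the sufficient statistic for theta
```

Because the raw score exhausts the information the model uses about the person, any subset of items that fits the model should yield statistically equivalent person measures, which is the invariance described above.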

Statistics and research methods textbooks in psychology and the social sciences commonly make statements like the following about reliability: “Reliability is consistency in measurement. The reliability of individual scale items increases with the number of points in the item. The reliability of the complete scale increases with the number of items.” (These sentences are found at the top of p. 371 in Experimental Methods in Psychology, by Gustav Levine and Stanley Parkinson (Lawrence Erlbaum Associates, 1994).) The unproven, perhaps unintended, and likely unfounded implication of these statements is that consistency increases as items are added.

Despite the popularity of the practice, Green, Lissitz, and Mulaik (1977) argue that reliability coefficients are misused when they are interpreted as indicating the extent to which data are internally consistent. “Green et al. (1977) observed that though high ‘internal consistency’ as indexed by a high alpha results when a general factor runs through the items, this does not rule out obtaining high alpha when there is no general factor running through the test items…. They concluded that the chief defect of alpha as an index of dimensionality is its tendency to increase as the number of items increase” (Hattie, 1985, p. 144).
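
For concreteness, coefficient alpha can be computed directly from a persons-by-items score matrix, and Green et al.’s point about test length is easy to demonstrate: alpha climbs as parallel items are added even though nothing about the construct or the data quality changes. This sketch uses simulated, hypothetical data, not the matrices analyzed below.

```python
import numpy as np

def cronbach_alpha(data):
    """Coefficient alpha for a persons x items matrix of item scores."""
    data = np.asarray(data, dtype=float)
    k = data.shape[1]                          # number of items
    item_vars = data.var(axis=0, ddof=1)       # variance of each item
    total_var = data.sum(axis=1).var(ddof=1)   # variance of the total scores
    return (k / (k - 1)) * (1.0 - item_vars.sum() / total_var)

rng = np.random.default_rng(0)
ability = rng.normal(size=200)

# Doubling the number of parallel dichotomous items raises alpha
# without any change in the respondents or the construct.
for k in (10, 20, 40):
    items = (ability[:, None] + rng.normal(size=(200, k)) > 0).astype(int)
    print(k, round(cronbach_alpha(items), 2))
```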

In addressing the internal consistency of data, the implicit but incompletely realized purpose of estimating scale reliability is to evaluate the extent to which sum scores function as sufficient statistics. How limited is reliability as a tool for this purpose? To answer this question, five dichotomous data sets of 23 items and 22 persons were simulated. The first one was constructed so as to be highly likely to fit a Rasch model, with a deliberately orchestrated probabilistic Guttman pattern. The second one was made nearly completely random. The third, fourth, and fifth data sets were modifications of the first one in which increasing numbers of increasingly inconsistent responses were introduced. (The inconsistencies were not introduced in any systematic way apart from inserting contrary responses in the ordered matrix.) The data sets are shown in the Appendix. Tables 1 and 2 summarize the results.
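
Readers who would like to experiment with comparable matrices can generate them along the following lines. The parameters and random seed here are my own illustrative choices, not the procedure actually used to construct the Appendix data sets.

```python
import numpy as np

rng = np.random.default_rng(42)
n_persons, n_items = 22, 23

# Spread person measures and item calibrations along the same logit scale.
thetas = np.linspace(-2.5, 2.5, n_persons)
deltas = np.linspace(-2.5, 2.5, n_items)

# In the spirit of the first data set: responses generated from Rasch
# probabilities, giving a probabilistic Guttman pattern.
p = 1.0 / (1.0 + np.exp(-(thetas[:, None] - deltas[None, :])))
rasch_like = (rng.random((n_persons, n_items)) < p).astype(int)

# In the spirit of the second data set: pure coin tosses.
random_data = (rng.random((n_persons, n_items)) < 0.5).astype(int)

print(rasch_like.sum(axis=1))   # raw scores ordered roughly with theta
print(random_data.sum(axis=1))  # raw scores clustered near the middle
```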

Table 1 shows that the reliability coefficients do in fact decrease, along with the global model fit log-likelihood chi-squares, as the amount of randomness and inconsistency is increased. Contrary to what is implied in Levine and Parkinson’s statements, however, reliability can vary within a given number of items, as it might across different data sets produced from the same test, survey, or assessment, depending on how much structural invariance is present within them.

Two other points about the tables are worthy of note. First, the Rasch-based person separation reliability coefficients drop at a faster rate than Cronbach’s alpha does. This is probably an effect of the individualized error estimates used in the Rasch context, which make its reliability coefficients more conservative than those based on correlation-based, group-level error estimates. (It is worth noting, as well, that the Winsteps and SPSS estimates of Cronbach’s alpha match. Winsteps reports one fewer decimal place, so the third decimal place is shown for the SPSS values for contrast.)
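
The conservatism noted here follows from the way the Rasch person separation reliability is conventionally computed: the average of the individual error variances is subtracted from the observed variance of the person measures. The sketch below shows that generic convention, not Winsteps’ exact internal routine, and the measures and standard errors are hypothetical.

```python
import numpy as np

def person_separation_reliability(measures, std_errors):
    """Rasch-style person separation reliability:
    (observed variance - mean error variance) / observed variance."""
    measures = np.asarray(measures, dtype=float)
    errors = np.asarray(std_errors, dtype=float)
    observed_var = measures.var(ddof=1)
    error_var = np.mean(errors ** 2)
    true_var = max(observed_var - error_var, 0.0)
    return true_var / observed_var

# Hypothetical person measures (logits) and their standard errors.
measures = [-1.8, -0.9, -0.3, 0.2, 0.8, 1.5, 2.1]
se = [0.45, 0.40, 0.38, 0.38, 0.40, 0.44, 0.52]
print(round(person_separation_reliability(measures, se), 2))
```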

Second, the fit statistics are most affected by the initial and most glaring introduction of inconsistencies, in data set three. As the randomness in the data increases, the reliabilities continue to drop, but the fit statistics improve, culminating in the case of data set two, where complete randomness results in near-perfect model fit. This is, of course, the situation in which both the instrument and the sample are as well targeted as they can be, since all respondents have about the same measure and all the items about the same calibration; see Wood (1978) for a commentary on this situation, where coin tosses fit a Rasch model.

Table 2 shows the results of the Winsteps Principal Components Analysis of the standardized residuals for all five data sets. Again, the results conform with and support the pattern shown in the reliability coefficients. It is, however, interesting to note that, for data sets 4 and 5, with their Cronbach’s alphas of about .89 and .80, respectively, which are typically deemed quite good, the PCA shows more variance left unexplained than is explained by the Rasch dimension. The PCA is suggesting that two or more constructs might be represented in the data, but this would never be known from Cronbach’s alpha alone.

Alpha alone would indicate the presence of a unidimensional construct for data sets 3, 4 and 5, despite large standard deviations in the fit statistics and even though more than half the variance cannot be explained by the primary dimension. Worse, for the fifth data set, more variance is captured in the first three contrasts than is explained by the Rasch dimension. But with Cronbach’s alpha at .80, most researchers would consider this scale quite satisfactorily unidimensional and internally consistent.

These results suggest that, first, in seeking high reliability, what is sought more fundamentally is fit to a Rasch model (Andrich & Douglas, 1977; Andrich, 1982; Wright, 1977). That is, in addressing the internal consistency of data, the popular conception of reliability is taking on the concerns of construct validity. A conceptually clearer sense of reliability focuses on the extent to which an instrument works as expected every time it is used, in the sense of the way a car can be reliable. For instance, with an alpha of .70, a screening tool would be able to reliably distinguish measures into two statistically distinct groups (Fisher, 1992; Wright, 1996), problematic and typical. Within the limits of this purpose, the tool would meet the need for the repeated production of information capable of meeting the needs of the situation. Applications in research, accountability, licensure/certification, or diagnosis, however, might demand alphas of .95 and the kind of precision that allows for statistically distinct divisions into six or more groups. In these kinds of applications, where experimental designs or practical demands require more statistical power, measurement precision articulates finer degrees of differences. Finely calibrated instruments provide sensitivity over the entire length of the measurement continuum, which is needed for repeated reproductions of the small amounts of change that might accrue from hard to detect treatment effects.
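
The link drawn here between a reliability coefficient and the number of statistically distinct groups it can support can be made explicit with the separation index. The sketch below uses the conventional formulas, G = sqrt(R / (1 - R)) and strata = (4G + 1) / 3, which reproduce the two-group reading of a reliability near .70 and the six-or-more-group reading of a reliability near .95.

```python
import math

def separation(reliability):
    """Separation index G: ratio of the 'true' spread of measures to their average error."""
    return math.sqrt(reliability / (1.0 - reliability))

def strata(reliability):
    """Number of statistically distinct measure levels, (4G + 1) / 3."""
    return (4.0 * separation(reliability) + 1.0) / 3.0

for r in (0.70, 0.80, 0.90, 0.95):
    print(r, round(separation(r), 2), round(strata(r), 1))
# A reliability of .70 supports roughly 2 strata; .95 supports roughly 6.
```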

Separating the construct, internal consistency, and unidimensionality issues from the repeatability and reproducibility of a given degree of measurement precision provides a much-needed conceptual and methodological clarification of reliability. This clarification is routinely made in Rasch measurement applications (Andrich, 1982; Andrich & Douglas, 1977; Fisher, 1992; Linacre, 1993, 1996, 1997). It is reasonable to want to account for inconsistencies in the data in the error estimates and in the reliability coefficients, and so errors and reliabilities are routinely reported both in terms of the modeled expectations and in a fit-inflated form (Wright, 1995). The fundamental value of proceeding from a basis in individual error and fit statistics (Wright, 1996) is that local imprecisions and failures of invariance can be isolated for further study and selective attention.

The results of the simulated data analyses suggest, second, that, used in isolation, reliability coefficients can be misleading. As Green, et al. say, reliability estimates tend to systematically increase as the number of items increases (Fisher, 2008). The simulated data show that reliability coefficients also systematically decrease as inconsistency increases.

The primary problem with relying on reliability coefficients alone as indications of data consistency hinges on their inability to reveal the location of departures from modeled expectations. Most uses of reliability coefficients take place in contexts in which the model remains unstated and expectations are not formulated or compared with observations. The best that can be done in the absence of a model statement and test of data fit to it is to compare the reliability obtained against that expected on the basis of the number of items and response categories, relative to the observed standard deviation in the scores, expressed in logits (Linacre, 1993). One might then raise questions as to targeting, data consistency, etc. in order to explain larger than expected differences.
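
Linacre’s (1993) comparison works from the relationship among test length, response categories, and the measure standard deviation in logits. As a rough numerical stand-in for that comparison in the dichotomous case, offered only as my own approximation rather than his published procedure, the expected error variance can be estimated from the test information function and combined with an assumed person standard deviation.

```python
import numpy as np

def expected_reliability(item_deltas, person_sd, n_draws=10000, seed=0):
    """Approximate the person reliability expected for a dichotomous test
    with the given item calibrations and an assumed person SD (in logits)."""
    rng = np.random.default_rng(seed)
    thetas = rng.normal(0.0, person_sd, n_draws)
    deltas = np.asarray(item_deltas, dtype=float)
    p = 1.0 / (1.0 + np.exp(-(thetas[:, None] - deltas[None, :])))
    info = (p * (1.0 - p)).sum(axis=1)   # test information at each sampled theta
    error_var = np.mean(1.0 / info)      # average measurement error variance
    return person_sd ** 2 / (person_sd ** 2 + error_var)

# 23 items spread over +/- 2.5 logits and a person SD of 1.5 logits (hypothetical values).
deltas = np.linspace(-2.5, 2.5, 23)
print(round(expected_reliability(deltas, 1.5), 2))
```

An observed reliability falling well below a figure like this one would then prompt the questions about targeting and data consistency mentioned above.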

A more methodical way, however, would be to employ multiple avenues of approach to the evaluation of the data, including the use of model fit statistics and Principal Components Analysis in the evaluation of differential item and person functioning. Being able to see which individual observations depart the furthest from modeled expectation can provide illuminating qualitative information on the meaningfulness of the data, the measures, and the calibrations, or the lack thereof.  This information is crucial to correcting data entry errors, identifying sources of differential item or person functioning, separating constructs and populations, and improving the instrument. The power of the reliability-coefficient-only approach to data quality evaluation is multiplied many times over when the researcher sets up a nested series of iterative dialectics in which repeated data analyses explore various hypotheses as to what the construct is, and in which these analyses feed into revisions to the instrument, its administration, and/or the population sampled.
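
For readers unfamiliar with the residual-based diagnostics referred to here, the following is a generic sketch of a principal components analysis of standardized Rasch residuals. It is not Winsteps’ implementation, and the function and its arguments are illustrative only.

```python
import numpy as np

def standardized_residual_pca(data, thetas, deltas):
    """Principal components of standardized Rasch residuals for a
    persons x items dichotomous response matrix."""
    data = np.asarray(data, dtype=float)
    thetas = np.asarray(thetas, dtype=float)
    deltas = np.asarray(deltas, dtype=float)
    p = 1.0 / (1.0 + np.exp(-(thetas[:, None] - deltas[None, :])))
    z = (data - p) / np.sqrt(p * (1.0 - p))   # standardized residuals
    corr = np.corrcoef(z, rowvar=False)       # item-by-item residual correlations
    eigvals, eigvecs = np.linalg.eigh(corr)
    order = np.argsort(eigvals)[::-1]
    # Eigenvalues and loadings of the residual contrasts, largest first.
    return eigvals[order], eigvecs[:, order]
```

A large first-contrast eigenvalue, or many items loading above about 0.4 on it, points to a secondary dimension that a reliability coefficient alone would never reveal.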

For instance, following the point made by Smith (1996), it may be expected that the PCA results will illuminate the presence of multiple constructs in the data with greater clarity than the fit statistics, when there are nearly equal numbers of items representing each different measured dimension. But the PCA does not work as well as the fit statistics when there are only a few items and/or people exhibiting inconsistencies.

This work should result in a full circle return to the drawing board (Wright, 1994; Wright & Stone, 2003), such that a theory of the measured construct ultimately provides rigorously precise predictive control over item calibrations, in the manner of the Lexile Framework (Stenner, et al., 2006) or developmental theories of hierarchical complexity (Dawson, 2004). Given that the five data sets employed here were simulations with no associated item content, the invariant stability and meaningfulness of the construct cannot be illustrated or annotated. But such illustration also is implicit in the quest for reliable instrumentation: the evidentiary basis for a delineation of meaningful expressions of amounts of the thing measured. The hope to be gleaned from the successes in theoretical prediction achieved to date is that we might arrive at practical applications of psychosocial measures that are as meaningful, useful, and economically productive as the theoretical applications of electromagnetism, thermodynamics, etc. that we take for granted in the technologies of everyday life.

Table 1

Reliability and Consistency Statistics

22 Persons, 23 Items, 506 Data Points

| Data set | Intended reliability | Winsteps Real/Model Person Separation Reliability | Winsteps/SPSS Cronbach’s alpha | Winsteps Person Infit/Outfit Average Mn Sq | Winsteps Person Infit/Outfit SD | Winsteps Real/Model Item Separation Reliability | Winsteps Item Infit/Outfit Average Mn Sq | Winsteps Item Infit/Outfit SD | Log-Likelihood Chi-Sq/d.f./p |
|---|---|---|---|---|---|---|---|---|---|
| First | Best | .96/.97 | .96/.957 | 1.04/.35 | .49/.25 | .95/.96 | 1.08/0.35 | .36/.19 | 185/462/1.00 |
| Second | Worst | .00/.00 | .00/-1.668 | 1.00/1.00 | .05/.06 | .00/.00 | 1.00/1.00 | .05/.06 | 679/462/.0000 |
| Third | Good | .90/.91 | .93/.927 | .92/2.21 | .30/2.83 | .85/.88 | .90/2.13 | .64/3.43 | 337/462/.9996 |
| Fourth | Fair | .86/.87 | .89/.891 | .96/1.91 | .25/2.18 | .79/.83 | .94/1.68 | .53/2.27 | 444/462/.7226 |
| Fifth | Poor | .76/.77 | .80/.797 | .98/1.15 | .24/.67 | .59/.65 | .99/1.15 | .41/.84 | 550/462/.0029 |

Table 2

Principal Components Analysis

| Data set | Intended reliability | % Raw Variance Explained by Measures/Persons/Items | % Raw Variance Captured in First Three Contrasts | Total number of loadings greater than .40 in absolute value in first contrast |
|---|---|---|---|---|
| First | Best | 76/41/35 | 12 | 8 |
| Second | Worst | 4.3/1.7/2.6 | 56 | 15 |
| Third | Good | 59/34/25 | 20 | 14 |
| Fourth | Fair | 47/27/20 | 26 | 13 |
| Fifth | Poor | 29/17/11 | 41 | 15 |

References

Andrich, D. (1982, June). An index of person separation in Latent Trait Theory, the traditional KR-20 index, and the Guttman scale response pattern. Education Research and Perspectives, 9(1), http://www.rasch.org/erp7.htm.

Andrich, D., & Douglas, G. A. (1977). Reliability: Distinctions between item consistency and subject separation with the simple logistic model. Paper presented at the Annual Meeting of the American Educational Research Association, New York.

Dawson, T. L. (2004, April). Assessing intellectual development: Three approaches, one sequence. Journal of Adult Development, 11(2), 71-85.

Fisher, W. P., Jr. (1992). Reliability statistics. Rasch Measurement Transactions, 6(3), 238  [http://www.rasch.org/rmt/rmt63i.htm].

Fisher, W. P., Jr. (2008, Summer). The cash value of reliability. Rasch Measurement Transactions, 22(1), 1160-3.

Green, S. B., Lissitz, R. W., & Mulaik, S. A. (1977, Winter). Limitations of coefficient alpha as an index of test unidimensionality. Educational and Psychological Measurement, 37(4), 827-833.

Hattie, J. (1985, June). Methodology review: Assessing unidimensionality of tests and items. Applied Psychological Measurement, 9(2), 139-64.

Levine, G., & Parkinson, S. (1994). Experimental methods in psychology. Mahwah, NJ: Lawrence Erlbaum Associates, Inc.

Linacre, J. M. (1993). Rasch-based generalizability theory. Rasch Measurement Transactions, 7(1), 283-284; [http://www.rasch.org/rmt/rmt71h.htm].

Linacre, J. M. (1996). True-score reliability or Rasch statistical validity? Rasch Measurement Transactions, 9(4), 455 [http://www.rasch.org/rmt/rmt94a.htm].

Linacre, J. M. (1997). KR-20 or Rasch reliability: Which tells the “Truth?”. Rasch Measurement Transactions, 11(3), 580-1 [http://www.rasch.org/rmt/rmt113l.htm].

Smith, R. M. (1996). A comparison of methods for determining dimensionality in Rasch measurement. Structural Equation Modeling, 3(1), 25-40.

Stenner, A. J., Burdick, H., Sanford, E. E., & Burdick, D. S. (2006). How accurate are Lexile text measures? Journal of Applied Measurement, 7(3), 307-22.

Wood, R. (1978). Fitting the Rasch model: A heady tale. British Journal of Mathematical and Statistical Psychology, 31, 27-32.

Wright, B. D. (1977). Solving measurement problems with the Rasch model. Journal of Educational Measurement, 14(2), 97-116 [http://www.rasch.org/memo42.htm].

Wright, B. D. (1980). Foreword, Afterword. In Probabilistic models for some intelligence and attainment tests, by Georg Rasch (pp. ix-xix, 185-199. http://www.rasch.org/memo63.htm) [Reprint; original work published in 1960 by the Danish Institute for Educational Research]. Chicago, Illinois: University of Chicago Press.

Wright, B. D. (1994, Summer). Theory construction from empirical observations. Rasch Measurement Transactions, 8(2), 362 [http://www.rasch.org/rmt/rmt82h.htm].

Wright, B. D. (1995, Summer). Which standard error? Rasch Measurement Transactions, 9(2), 436-437 [http://www.rasch.org/rmt/rmt92n.htm].

Wright, B. D. (1996, Winter). Reliability and separation. Rasch Measurement Transactions, 9(4), 472 [http://www.rasch.org/rmt/rmt94n.htm].

Wright, B. D., & Stone, M. H. (2003). Five steps to science: Observing, scoring, measuring, analyzing, and applying. Rasch Measurement Transactions, 17(1), 912-913 [http://www.rasch.org/rmt/rmt171j.htm].

Appendix

Data Set 1

01100000000000000000000

10100000000000000000000

11000000000000000000000

11100000000000000000000

11101000000000000000000

11011000000000000000000

11100100000000000000000

11110100000000000000000

11111010100000000000000

11111101000000000000000

11111111010101000000000

11111111101010100000000

11111111111010101000000

11111111101101010010000

11111111111010101100000

11111111111111010101000

11111111111111101010100

11111111111111110101011

11111111111111111010110

11111111111111111111001

11111111111111111111101

11111111111111111111100

Data Set 2

01101010101010101001001

10100101010101010010010

11010010101010100100101

10101001010101001001000

01101010101010110010011

11011010010101100100101

01100101001001001001010

10110101000110010010100

01011010100100100101001

11101101001001001010010

11011010010101010100100

10110101101010101001001

01101011010000101010010

11010110101001010010100

10101101010000101101010

11011010101010010101010

10110101010101001010101

11101010101010110101011

11010101010101011010110

10101010101010110111001

01010101010101101111101

10101010101011011111100

Data Set 3

01100000000000100000010

10100000000000000010001

11000000000000100000010

11100000000000100000000

11101000000000100010000

11011000000000000000000

11100100000000100000000

11110100000000000000000

11111010100000100000000

11111101000000000000000

11111111010101000000000

11111111101010100000000

11111111111010001000000

11011111111111010010000

11011111111111101100000

11111111111111010101000

11011111111111101010100

11111111111111010101011

11011111111111111010110

11111111111111111111001

11011111111111111111101

10111111111111111111110

Data Set 4

01100000000000100010010

10100000000000000010001

11000000000000100000010

11100000000000100000001

11101000000000100010000

11011000000000000010000

11100100000000100010000

11110100000000000000000

11111010100000100010000

11111101000000000000000

11111011010101000010000

11011110111010100000000

11111111011010001000000

11011111101011110010000

11011111101101101100000

11111111110101010101000

11011111111011101010100

11111111111101110101011

01011111111111011010110

10111111111111111111001

11011111111111011111101

10111111111111011111110

Data Set 5

11100000010000100010011

10100000000000000011001

11000000010000100001010

11100000010000100000011

11101000000000100010010

11011000000000000010011

11100100000000100010000

11110100000000000000011

11111010100000100010000

00000000000011111111111

11111011010101000010000

11011110111010100000000

11111111011010001000000

11011111101011110010000

11011111101101101100000

11111111110101010101000

11011111101011101010100

11111111111101110101011

01011111111111011010110

10111111101111111111001

11011111101111011111101

00111111101111011111110


A Tale of Two Industries: Contrasting Quality Assessment and Improvement Frameworks

July 8, 2009

Imagine the chaos that would result if industrial engineers each had their own tool sets calibrated in idiosyncratic metrics with unit sizes that changed depending on the size of what they measured, and if they conducted quality improvement studies focused on statistical significance tests of effect sizes. Imagine further that these engineers ignored the statistical power of their designs, so that they could not tell when they were finding statistically significant results by pure chance and when they were not. And finally, imagine that they also ignored the substantive meaning of the numbers, never considering the differences they were studying in terms of varying probabilities of response to the questions they asked.

So when one engineer tries to generalize a result across applications, what happens is that it kind of works sometimes, doesn’t work at all other times, is often ignored, and does not command a compelling response from anyone, because everyone is invested in their own metrics, samples, and results, which differ from everyone else’s. If there is any discussion of the relative merits of the research done, it is easy to fall into acrimonious and heated arguments that cannot be resolved because of the lack of consensus on what constitutes valid data, instrumentation, and theory.

Thus, the engineers put up the appearance of polite decorum. They smile and nod at each other’s local, sample-dependent, and irreproducible results, while they build mini-empires of funding, students, quoting circles, and professional associations on the basis of their personal authority and charisma. As they do so, costs in their industry spiral out of control, profits are almost nonexistent, fewer and fewer people can afford their products, smart people are going into other fields, and overall product quality is declining.

Of course, this is the state of affairs in education and health care, not in industrial engineering. In the latter field, the situation is much different. Here, everyone everywhere is very concerned to be sure they are always measuring the same thing as everyone else and in the same unit. Unexpected results of individual measures pop out instantly and are immediately redone. Innovations are more easily generated and disseminated because everyone is thinking together in the same language and seeing effects expressed in the same images. Someone else’s ideas and results can be easily fitted into anyone else’s experience, and the viability of a new way of doing things can be evaluated on the basis of one’s own experience and skills.

Arguments can be quite productive, as consensus on basic values drives the demand for evidence. Associations and successes are defined more in terms of merit earned from productivity and creativity demonstrated through the accumulation of generalized results. Costs in these industries are constantly dropping, profits are steady or increasing, more and more people can afford their products, smart people are coming into the field, and overall product quality is improving.

There is absolutely no reason why education and health care cannot thrive and grow like other industries. It is up to us to show how.
