Archive for the ‘Construct theory’ Category

Psychology and the social sciences: An atheoretical, scattered, and disconnected body of research

February 16, 2019

A new article in Nature Human Behaviour (NHB) points toward the need for better theory and more rigorous mathematical models in psychology and the social sciences (Muthukrishna & Henrich, 2019). The authors rightly say that the lack of an overarching cumulative theoretical framework makes it very difficult to see whether new results fit well with previous work, or if something surprising has come to light. Mathematical models are especially emphasized as being of value in specifying clear and precise expectations.

The point that the social sciences and psychology need better theories and models is painfully obvious. But there are in fact thousands of published studies and practical real-world applications that not only provide, but often surpass, the kinds of predictive theories and mathematical models called for in the NHB article. Not only does the article make no mention of any of this work; its argument is also framed entirely in a statistical context instead of the more appropriate context of measurement science.

The concept of reliability provides an excellent point of entry. Most behavioral scientists think of reliability statistically, as a coefficient usually taking a value between 0.00 and 1.00. The tangible sense of reliability as an index of exactly how predictable an outcome is rarely figures in researchers’ thinking. But that sense of the specific predictability of results has been the focus of attention in social and psychological measurement science for decades.

For instance, the measurement of time is reliable in the sense that the position of the sun relative to the earth can be precisely predicted from geographic location, the time of day, and the day of the year. The numbers and words assigned to noon are closely associated with the sun being at its high point in the sky (though there are political variations by season and location across time zones).

That kind of a reproducible association is rarely sought in psychology and the social sciences, but it is far from nonexistent. One can discern different degrees to which that kind of association is included in models of measured constructs. Though most behavioral research doesn’t mention the connection between linear amounts of a measured phenomenon and a reproducible numeric representation of it (level 0), quite a significant body of work focuses on that connection (level 1). The disappointing thing about that level 1 work is that the relentless obsession with statistical methods prevents most researchers from connecting a reproducible quantity with a single expression of it in a standard unit, and with an associated uncertainty term (level 2). That is, level 1 researchers conceive measurement in statistical terms, as a product of data analysis. Even when results across data sets are highly correlated and could be equated to a common metric, level 1 researchers do not leverage that source of potential value for simplified communication and accumulated comparability.
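The equating step that level 1 researchers leave on the table is mechanically simple once instruments share anchor items. As a minimal sketch (the item names and logit values below are hypothetical, not calibrations from any real data set), two instruments can be placed on a common metric by shifting one calibration by the mean difficulty difference on the shared items:

```python
# Hypothetical logit calibrations of two instrument forms that share
# two anchor items ("item2" and "item3"); all values are invented.
form_a = {"item1": -1.2, "item2": 0.3, "item3": 1.1}   # Form A's frame
form_b = {"item2": 0.8, "item3": 1.6, "item4": 2.0}    # Form B's frame

# The linking constant is the mean difference in the anchors' difficulties.
common = set(form_a) & set(form_b)
shift = sum(form_a[i] - form_b[i] for i in common) / len(common)

# Re-express all of Form B's calibrations in Form A's frame of reference,
# so measures from either form are reported in one common metric.
form_b_equated = {item: d + shift for item, d in form_b.items()}
```

With the values above, the shift is -0.5 logits, so Form B's unique item lands at 1.5 in Form A's frame. In practice the anchor-item differences would also be checked for stability before trusting the link.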

And then, for their part, level 2 researchers usually do not articulate theories about the measured constructs, by augmenting the mathematical data model with an explanatory model predicting variation (level 3). Level 2 researchers are empirically grounded in data, and can expand their network of measures only by gathering more data and analyzing it in ways that bring it into their standard unit’s frame of reference.

Level 3 researchers, however, have come to see what makes their measures tick. They understand the mechanisms that make their questions vary. They can write new questions to their theoretical specifications, test those questions by asking them of a relevant sample, and produce the predicted calibrations. For instance, reading comprehension is well established to be a function of the difference between a person’s reading ability and the complexity of the text they encounter (see articles by Stenner in the list below). We have built our entire educational system around this idea, as we deliberately introduce children first to the alphabet, then to the most common words, then to short sentences, and then to ever longer and more complicated text. But stating the construct model, testing it against data, calibrating a unit to which all tests and measures can be traced, and connecting together all the books, articles, tests, curricula, and students is a process that began (in English and Spanish) only in the 1980s. The process still is far from finished, and most reading research still does not use the common metric.
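The reading-comprehension relation described above can be sketched as a Rasch-type model in which the probability of comprehension depends only on the difference between reader ability and text complexity, expressed in a shared logit unit. The numbers below are hypothetical illustrations, not calibrated values from the reading research cited:

```python
import math

def p_comprehension(ability, text_complexity):
    """Rasch-type model: the probability of successful comprehension is a
    logistic function of the difference (in logits) between a reader's
    ability and the text's complexity."""
    return 1.0 / (1.0 + math.exp(-(ability - text_complexity)))

# A reader exactly matched to a text succeeds about half the time;
# one logit of advantage raises the expected success rate to about 73%.
p_matched = p_comprehension(2.0, 2.0)     # 0.5
p_advantage = p_comprehension(3.0, 2.0)   # ≈ 0.731
```

This difference structure is what makes theory-based calibration possible: if text complexity can be predicted from text features, the expected comprehension rate for any reader-text pairing follows without new data collection.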

In this kind of theory-informed context, new items can be automatically generated on the fly at the point of measurement. Those items and inferences made from them are validated by the consistency of the responses and the associated expression of the expected probability of success, agreement, etc. The expense of constant data gathering and analysis can be cut to a very small fraction of what it is at levels 0-2.

Level 3 research methods are not widely known or used, but they are not new. They are gaining traction as their use by national metrology institutes globally grows. As high profile critiques of social and psychological research practices continue to emerge, perhaps more attention will be paid to this important body of work. A few key references are provided below, and virtually every post in this blog pertains to these issues.

References

Baghaei, P. (2008). The Rasch model as a construct validation tool. Rasch Measurement Transactions, 22(1), 1145-1146 [http://www.rasch.org/rmt/rmt221a.htm].

Bergstrom, B. A., & Lunz, M. E. (1994). The equivalence of Rasch item calibrations and ability estimates across modes of administration. In M. Wilson (Ed.), Objective measurement: Theory into practice, Vol. 2 (pp. 122-128). Norwood, New Jersey: Ablex.

Cano, S., Pendrill, L., Barbic, S., & Fisher, W. P., Jr. (2018). Patient-centred outcome metrology for healthcare decision-making. Journal of Physics: Conference Series, 1044, 012057.

Dimitrov, D. M. (2010). Testing for factorial invariance in the context of construct validation. Measurement & Evaluation in Counseling & Development, 43(2), 121-149.

Embretson, S. E. (2010). Measuring psychological constructs: Advances in model-based approaches. Washington, DC: American Psychological Association.

Fischer, G. H. (1973). The linear logistic test model as an instrument in educational research. Acta Psychologica, 37, 359-374.

Fischer, G. H. (1983). Logistic latent trait models with linear constraints. Psychometrika, 48(1), 3-26.

Fisher, W. P., Jr. (1992). Reliability statistics. Rasch Measurement Transactions, 6(3), 238 [http://www.rasch.org/rmt/rmt63i.htm].

Fisher, W. P., Jr. (2008). The cash value of reliability. Rasch Measurement Transactions, 22(1), 1160-1163 [http://www.rasch.org/rmt/rmt221.pdf].

Fisher, W. P., Jr., & Stenner, A. J. (2016). Theory-based metrological traceability in education: A reading measurement network. Measurement, 92, 489-496.

Green, S. B., Lissitz, R. W., & Mulaik, S. A. (1977). Limitations of coefficient alpha as an index of test unidimensionality. Educational and Psychological Measurement, 37(4), 827-833.

Hattie, J. (1985). Methodology review: Assessing unidimensionality of tests and items. Applied Psychological Measurement, 9(2), 139-164.

Hobart, J. C., Cano, S. J., Zajicek, J. P., & Thompson, A. J. (2007). Rating scales as outcome measures for clinical trials in neurology: Problems, solutions, and recommendations. Lancet Neurology, 6, 1094-1105.

Irvine, S. H., Dunn, P. L., & Anderson, J. D. (1990). Towards a theory of algorithm-determined cognitive test construction. British Journal of Psychology, 81, 173-195.

Kline, T. L., Schmidt, K. M., & Bowles, R. P. (2006). Using LinLog and FACETS to model item components in the LLTM. Journal of Applied Measurement, 7(1), 74-91.

Lunz, M. E., & Linacre, J. M. (2010). Reliability of performance examinations: Revisited. In M. Garner, G. Engelhard, Jr., W. P. Fisher, Jr. & M. Wilson (Eds.), Advances in Rasch Measurement, Vol. 1 (pp. 328-341). Maple Grove, MN: JAM Press.

Mari, L., & Wilson, M. (2014). An introduction to the Rasch measurement approach for metrologists. Measurement, 51, 315-327.

Markward, N. J., & Fisher, W. P., Jr. (2004). Calibrating the genome. Journal of Applied Measurement, 5(2), 129-141.

Maul, A., Mari, L., Torres Irribarra, D., & Wilson, M. (2018). The quality of measurement results in terms of the structural features of the measurement process. Measurement, 116, 611-620.

Muthukrishna, M., & Henrich, J. (2019). A problem in theory. Nature Human Behaviour, 1-9.

Obiekwe, J. C. (1999, August 1). Application and validation of the linear logistic test model for item difficulty prediction in the context of mathematics problems. Dissertation Abstracts International: Section B: The Sciences & Engineering, 60(2-B), 0851.

Pendrill, L. (2014). Man as a measurement instrument [Special Feature]. NCSLi Measure: The Journal of Measurement Science, 9(4), 22-33.

Pendrill, L., & Fisher, W. P., Jr. (2015). Counting and quantification: Comparing psychometric and metrological perspectives on visual perceptions of number. Measurement, 71, 46-55.

Pendrill, L., & Petersson, N. (2016). Metrology of human-based and other qualitative measurements. Measurement Science and Technology, 27(9), 094003.

Sijtsma, K. (2009). Correcting fallacies in validity, reliability, and classification. International Journal of Testing, 8(3), 167-194.

Sijtsma, K. (2009). On the use, the misuse, and the very limited usefulness of Cronbach’s alpha. Psychometrika, 74(1), 107-120.

Stenner, A. J. (2001). The necessity of construct theory. Rasch Measurement Transactions, 15(1), 804-805 [http://www.rasch.org/rmt/rmt151q.htm].

Stenner, A. J., Fisher, W. P., Jr., Stone, M. H., & Burdick, D. S. (2013). Causal Rasch models. Frontiers in Psychology: Quantitative Psychology and Measurement, 4(536), 1-14.

Stenner, A. J., & Horabin, I. (1992). Three stages of construct definition. Rasch Measurement Transactions, 6(3), 229 [http://www.rasch.org/rmt/rmt63b.htm].

Stenner, A. J., Stone, M. H., & Fisher, W. P., Jr. (2018). The unreasonable effectiveness of theory based instrument calibration in the natural sciences: What can the social sciences learn? Journal of Physics: Conference Series, 1044, 012070.

Stone, M. H. (2003). Substantive scale construction. Journal of Applied Measurement, 4(3), 282-297.

Wilson, M. (2005). Constructing measures: An item response modeling approach. Mahwah, New Jersey: Lawrence Erlbaum Associates.

Wilson, M. R. (2013). Using the concept of a measurement system to characterize measurement models used in psychometrics. Measurement, 46, 3766-3774.

Wright, B. D., & Stone, M. H. (1979). Chapter 5: Constructing a variable. In Best test design: Rasch measurement (pp. 83-128). Chicago, Illinois: MESA Press.

Wright, B. D., & Stone, M. H. (1999). Measurement essentials. Wilmington, DE: Wide Range, Inc. [http://www.rasch.org/measess/me-all.pdf].

Wright, B. D., Stone, M., & Enos, M. (2000). The evolution of meaning in practice. Rasch Measurement Transactions, 14(1), 736 [http://www.rasch.org/rmt/rmt141g.htm].

Creative Commons License
LivingCapitalMetrics Blog by William P. Fisher, Jr., Ph.D. is licensed under a Creative Commons Attribution-Noncommercial-No Derivative Works 3.0 United States License.
Based on a work at livingcapitalmetrics.wordpress.com.
Permissions beyond the scope of this license may be available at http://www.livingcapitalmetrics.com.


Evaluating Questionnaires as Measuring Instruments

June 23, 2018

An email came in today asking whether three different short (4- and 5-item) questionnaires could be expected to provide reasonable quality measurement. Here’s my response.

-----

Thanks for raising this question. These questionnaires plainly were not designed to provide data suitable for measurement. Though much can be learned about making constructs measurable from data produced by this kind of questionnaire, “Rasch analysis” cannot magically create a silk purse from a sow’s ear (as the old expression goes). Use Linacre’s (1993) generalizability theory nomograph to see what reliabilities can be expected for each subscale, given the numbers of items and rating categories, and applying a conservative estimate of the adjusted standard deviations (1.0 logit, for instance). Then convert the reliability coefficients into strata (Fisher, 1992, 2008; Wright & Masters, 1982, pp. 92, 105-106) to make the practical meaning of the precision obtained obvious.
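The reliability-to-strata conversion can be sketched as follows, using the separation and strata formulas given in Wright & Masters (1982) and Fisher (1992); treat this as an illustration of the arithmetic, not a substitute for consulting those sources:

```python
import math

def separation(reliability):
    """Separation index G = sqrt(R / (1 - R)): the ratio of the true
    standard deviation to the average measurement error
    (Wright & Masters, 1982)."""
    return math.sqrt(reliability / (1.0 - reliability))

def strata(reliability):
    """Strata H = (4G + 1) / 3: the number of statistically distinct
    levels of measurement the instrument can distinguish (Fisher, 1992)."""
    return (4.0 * separation(reliability) + 1.0) / 3.0

# A reliability of 0.80 corresponds to a separation of 2.0, i.e. about
# three statistically distinct strata of measures.
g = separation(0.80)   # 2.0
h = strata(0.80)       # 3.0
```

Expressing precision as strata makes the stakes concrete: a subscale that can only distinguish two or three levels of the variable may be adequate for some decisions and useless for others.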

So if you have data, analyze it and compare the expected and observed reliabilities. If the uncertainties are quite different, is that because of targeting issues? But before you do that, ask experts in the area to rank order:

  • the courses by relevance to the job;
  • the evaluation criteria from easy to hard; and
  • the skills/competencies in order of importance to job performance.

Then study the correspondence between the rankings and the calibration results. Where do they converge and diverge? Why? What’s unexpected? What can be learned?

Analyze all of the items in each area (student, employer, instructor) together in Winsteps and study each of the three tables 23.x, setting PRCOMP=S. Remember that the total variance explained is not interpreted simply in terms of “more is better” and that the total variance explained is not as important as the ratio of that variance to the variance in the first contrast (see Linacre, 2006, 2008). If the ratio is greater than 3, the scale is essentially unidimensional (though significant problems may remain to be diagnosed and corrected).
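A minimal sketch of that ratio check, using hypothetical eigenvalue figures of the kind Winsteps reports in its Table 23 output (the specific numbers here are invented for illustration):

```python
# Hypothetical quantities from a Winsteps Table 23.0-style summary,
# both expressed in eigenvalue (item) units.
variance_explained_by_measures = 12.0   # variance explained by Rasch measures
first_contrast_eigenvalue = 2.4         # unexplained variance, first contrast

# The criterion described above: the explained variance should dominate
# the largest residual contrast by more than a factor of three.
ratio = variance_explained_by_measures / first_contrast_eigenvalue
essentially_unidimensional = ratio > 3
```

With these values the ratio is 5.0, so the scale would be judged essentially unidimensional, though, as noted, significant problems may remain to be diagnosed at the item level.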

Common practice holds that unexplained variance eigenvalues should be less than 1.5, but this overly simplistic rule of thumb (Chou & Wang, 2010; Raîche, 2005) has been contradicted in practice many times. Even when one or more eigenvalues exceed 1.5, theory may say the items belong to the same construct, and the disattenuated correlations of the measures implied by the separate groups of items (provided in tables 23.x) may still approach 1.00, indicating that the same measures are produced across subscales. See Green (1996) and Smith (1996), among others, for more on this.
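The disattenuated correlation referred to here is Spearman's classic correction for attenuation, which divides the observed correlation by the geometric mean of the two reliabilities. A minimal sketch, with hypothetical subscale values:

```python
import math

def disattenuated_correlation(r_observed, reliability_a, reliability_b):
    """Spearman's correction for attenuation: estimate the correlation
    between two sets of measures after removing the dampening effect of
    each subscale's measurement error."""
    return r_observed / math.sqrt(reliability_a * reliability_b)

# Hypothetical short subscales: a modest observed correlation of 0.63
# between two unreliable 4-item subscales implies a much stronger
# association between the underlying measures.
r_true = disattenuated_correlation(0.63, 0.70, 0.65)  # ≈ 0.93
```

This is why short subscales with eigenvalues over 1.5 can still turn out to measure the same thing: their raw intercorrelations are dragged down by error that the correction removes.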

If subscales within each of the three groups of items are markedly different in the measures they produce, then separate them in different analyses. If these further analyses reveal still more multidimensionalities, it’s time to go back to the drawing board, given how short these scales are. If you define a plausible scale, study the item difficulty orders closely with one or more experts in the area. If there is serious interest in precision measurement and its application to improved management, and not just a bureaucratic need for data to satisfy empty demands for a mere appearance of quality assessment, then trace the evolution of the construct as it changes from less to more across the items.

What, for instance, is the common theme addressed across the courses that makes them all relevant to job performance? The courses were each created with an intention and they were brought together into a curriculum for a purpose. These intentions and purposes are the raw material of a construct theory. Spell out the details of how the courses build competency in translation.

Furthermore, I imagine that this curriculum, by definition, was set up to be effective in training students no matter who is in the courses (within the constraints of the admission criteria), and no matter which particular challenges relevant to job performance are sampled from the universe of all possible challenges. You will recognize these unexamined and unarticulated assumptions as exactly what needs to be explicitly stated as hypotheses informing a model of the educational enterprise. Such a model transforms implicit assumptions into requirements that are never fully satisfied but can be very usefully approximated.

As I’ve been saying for a long time (Fisher, 1989), please do not accept the shorthand language of references to “the Rasch model”, “Rasch scaling”, “Rasch analysis”, etc. Rasch did not invent the form of these models, which are at least as old as Plato. And measurement is not a function of data analysis. Data provide experimental evidence testing model-based hypotheses concerning construct theories. When explanatory theory corroborates and validates data in calibrated instrumentation, the instrument can be applied at the point of use, with no need for data analysis, to produce measures, uncertainty (error) estimates, and graphical fit assessments (Connolly, Nachtman, & Pritchett, 1971; Davis et al., 2008; Fisher, 2006; Fisher, Harvey, & Kilgore, 1995; Linacre, 1997; many others).

So instead of using those common shorthand phrases, please speak directly to the problem of modeling the situation in order to produce a practical tool for managing it.

Further information is available in the references below.

 

Aryadoust, S. V. (2009). Mapping Rasch-based measurement onto the argument-based validity framework. Rasch Measurement Transactions, 23(1), 1192-3 [http://www.rasch.org/rmt/rmt231.pdf].

Chang, C.-H. (1996). Finding two dimensions in MMPI-2 depression. Structural Equation Modeling, 3(1), 41-49.

Chou, Y. T., & Wang, W. C. (2010). Checking dimensionality in item response models with principal component analysis on standardized residuals. Educational and Psychological Measurement, 70, 717-731.

Connolly, A. J., Nachtman, W., & Pritchett, E. M. (1971). Keymath: Diagnostic Arithmetic Test. Circle Pines, Minnesota: American Guidance Service. Retrieved 23 June 2018 from https://images.pearsonclinical.com/images/pa/products/keymath3_da/km3-da-pub-summary.pdf

Davis, A. M., Perruccio, A. V., Canizares, M., Tennant, A., Hawker, G. A., Conaghan, P. G. et al. (2008, May). The development of a short measure of physical function for hip OA HOOS-Physical Function Shortform (HOOS-PS): An OARSI/OMERACT initiative. Osteoarthritis Cartilage, 16(5), 551-559.

Fisher, W. P., Jr. (1989). What we have to offer. Rasch Measurement Transactions, 3(3), 72 [http://www.rasch.org/rmt/rmt33d.htm].

Fisher, W. P., Jr. (1992). Reliability statistics. Rasch Measurement Transactions, 6(3), 238 [http://www.rasch.org/rmt/rmt63i.htm].

Fisher, W. P., Jr. (2006). Survey design recommendations [expanded from Fisher, W. P. Jr. (2000) Popular Measurement, 3(1), pp. 58-59]. Rasch Measurement Transactions, 20(3), 1072-1074 [http://www.rasch.org/rmt/rmt203.pdf].

Fisher, W. P., Jr. (2008). The cash value of reliability. Rasch Measurement Transactions, 22(1), 1160-1163 [http://www.rasch.org/rmt/rmt221.pdf].

Fisher, W. P., Jr., Harvey, R. F., & Kilgore, K. M. (1995). New developments in functional assessment: Probabilistic models for gold standards. NeuroRehabilitation, 5(1), 3-25.

Green, K. E. (1996). Dimensional analyses of complex data. Structural Equation Modeling, 3(1), 50-61.

Linacre, J. M. (1993). Rasch-based generalizability theory. Rasch Measurement Transactions, 7(1), 283-284; [http://www.rasch.org/rmt/rmt71h.htm].

Linacre, J. M. (1997). Instantaneous measurement and diagnosis. Physical Medicine and Rehabilitation State of the Art Reviews, 11(2), 315-324 [http://www.rasch.org/memo60.htm].

Linacre, J. M. (1998). Detecting multidimensionality: Which residual data-type works best? Journal of Outcome Measurement, 2(3), 266-83.

Linacre, J. M. (1998). Structure in Rasch residuals: Why principal components analysis? Rasch Measurement Transactions, 12(2), 636 [http://www.rasch.org/rmt/rmt122m.htm].

Linacre, J. M. (2003). PCA: Data variance: Explained, modeled and empirical. Rasch Measurement Transactions, 17(3), 942-943 [http://www.rasch.org/rmt/rmt173g.htm].

Linacre, J. M. (2006). Data variance explained by Rasch measures. Rasch Measurement Transactions, 20(1), 1045 [http://www.rasch.org/rmt/rmt201a.htm].

Linacre, J. M. (2008). PCA: Variance in data explained by Rasch measures. Rasch Measurement Transactions, 22(1), 1164 [http://www.rasch.org/rmt/rmt221j.htm].

Raîche, G. (2005). Critical eigenvalue sizes in standardized residual Principal Components Analysis. Rasch Measurement Transactions, 19(1), 1012 [http://www.rasch.org/rmt/rmt191h.htm].

Schumacker, R. E., & Linacre, J. M. (1996). Factor analysis and Rasch. Rasch Measurement Transactions, 9(4), 470 [http://www.rasch.org/rmt/rmt94k.htm].

Smith, E. V., Jr. (2002). Detecting and evaluating the impact of multidimensionality using item fit statistics and principal component analysis of residuals. Journal of Applied Measurement, 3(2), 205-31.

Smith, R. M. (1996). A comparison of methods for determining dimensionality in Rasch measurement. Structural Equation Modeling, 3(1), 25-40.

Wright, B. D. (1996). Comparing Rasch measurement and factor analysis. Structural Equation Modeling, 3(1), 3-24.

Wright, B. D., & Masters, G. N. (1982). Rating scale analysis: Rasch measurement. Chicago, Illinois: MESA Press.

A New Agenda for Measurement Theory and Practice in Education and Health Care

April 15, 2011

Two key issues on my agenda offer different answers to the question “Why do you do things the way you do in measurement theory and practice?”

First, we can take up the “Because of…” answer to this question. We need to articulate an historical account of measurement that does three things:

  1. that builds on Rasch’s use of Maxwell’s method of analogy by employing it and expanding on it in new applications;
  2. that unifies the vocabulary and concepts of measurement across the sciences into a single framework so far as possible by situating probabilistic models of invariant individual-level within-variable phenomena in the context of measurement’s GIGO principle and data-to-model fit, as distinct from the interactions of group-level between-variable phenomena in the context of statistics’ model-to-data fit; and
  3. that stresses the social, collective cognition facilitated by networks of individuals whose point-of-use measurement-informed decisions and behaviors are coordinated and harmonized virtually, at a distance, with no need for communication or negotiation.

We need multiple publications in leading journals on these issues, as well as one or more books that people can cite as a way of making this real and true history of measurement, properly speaking, credible and accepted in the mainstream. A draft article of my own in this vein is available at http://ssrn.com/abstract=1698919; I offer it for critique, and other material is available on request. Anyone who works on this paper with me and makes a substantial contribution to its publication will be added as co-author.

Second, we can take up the “In order that…” answer to the question “Why do you do things the way you do?” From this point of view, we need to broaden the scope of the measurement research agenda beyond data analysis, estimation, models, and fit assessment in three ways:

  1. by emphasizing predictive construct theories that exhibit the fullest possible understanding of what is measured and so enable the routine reproduction of desired proportionate effects efficiently, with no need to analyze data to obtain an estimate;
  2. by defining the standard units to which all calibrated instruments measuring given constructs are traceable; and
  3. by disseminating to front line users on mass scales instruments measuring in publicly available standard units and giving immediate feedback at the point of use.

These two sets of issues define a series of talking points that together constitute a new narrative for measurement in education, psychology, health care, and many other fields. We and others may see our way to organizing new professional societies, new journals, new university-based programs of study, etc. around these principles.
