Posts Tagged ‘instruments’

Taking the Scales of Justice Seriously as a Model for Sustainable Political Economies

February 28, 2019

We all take standards of measurement for granted, as background assumptions we never have to think about. But as technical, mundane, and boring as these standards are, they define our systems of fair dealing and just relations. The image of blind justice holding a balance scale is a universal ideal, and it is being compromised in multiple ways by the chaotic forces at work in today’s complicated world arena.

Even so, astoundingly little effort is being invested in systematically exploring how the scales of justice might be more meaningfully and resiliently embedded within our social, economic, educational, health care, and political institutions. This may well be because the idea that people’s abilities, behaviors, and knowledge could be precisely weighed on a scale, like fruit in a grocery store, seems outrageously immoral, opening the door to treating people as commodities to be bought and sold. And even if the political will for such measures could be found, the regulatory enforcement of legally binding contracts and accounting standards appears so implausibly complicated as to make the whole matter not worth any serious consideration.

On the face of it, a literal application of the scales of justice to human affairs echoes ideas discredited so thoroughly and for so long that bringing them up now seems ridiculous at best and truly dangerous at worst, with no possible result except the crushing reduction of human beings to cogs in a soulless machine.

But what if there is some basic way in which measurement is misunderstood when it is taken to mean that people will be treated like mass-produced commodities for sale? What if we could measure, legally own, invest in, and profit from our literacy, health, and trustworthiness in the same way we do with property and material things? What if precision measurement were not a tool of oppressive manipulation but a means of obtaining, sharing, and communicating valuable information? What if local contexts could be allowed a latitude of variation that does not compromise navigable continuity?

Circumstances are conspiring to take humanity in new directions. Complex new necessities are nurturing the conception and birth of new innovations. A wealth of diverse possibilities for adaptive experimentation proposed in the past, sometimes the distant past, is finding new life in today’s technological context. And science has changed a great deal in the last 100 years. In fact, the public is largely unaware that the old paradigm of mechanical reduction has been completely demolished and replaced with a new paradigm of organic emergence and complex adaptive systems. Even Newtonian mechanics and the basic number theory of arithmetic have had to be reworked. It is also true that very few experts have thought through what the demise of the mechanical root metaphor, and the birth of an organic ecosystem metaphor, mean philosophically, socially, historically, and culturally.

Bottom-up manifestations of repeating patterns that can be scaled, measured, quantified, and explained open up a wide array of new opportunities for learning from shared experiences. And, just as humanity has long understood about music, we now know how to contextualize group and individual assessment and survey response patterns in ways that let everyone be what they are, uniquely improvising playful, creative performances on high-tech instruments tuned to shared standards. A huge amount of conceptual and practical work remains to be done, but there are multiple historical precedents suggesting that betting against human ingenuity would be a losing wager.

Two new projects I’m involved in, one concerning sustainability impact investing and the other a metrology center for categorical measures, begin a new exploration of the consequences of this paradigm shift for our image of the scales of justice as a moral imperative. These projects ask whether more complex combinations of mathematics, experiment, technology, and theory can be overtly conceived and implemented in terms of participatory and democratic social and cognitive ecosystems. If so, we may then find our way to new standards of measurement, new languages, and new forms of social organization sufficient to redefine what we take for granted as satisfying our shared sense of fair dealing and just relations.

Creative Commons License
LivingCapitalMetrics Blog by William P. Fisher, Jr., Ph.D. is licensed under a Creative Commons Attribution-Noncommercial-No Derivative Works 3.0 United States License.
Based on a work at livingcapitalmetrics.wordpress.com.
Permissions beyond the scope of this license may be available at http://www.livingcapitalmetrics.com.


Psychology and the social sciences: An atheoretical, scattered, and disconnected body of research

February 16, 2019

A new article in Nature Human Behaviour (NHB) points toward the need for better theory and more rigorous mathematical models in psychology and the social sciences (Muthukrishna & Henrich, 2019). The authors rightly say that the lack of an overarching cumulative theoretical framework makes it very difficult to see whether new results fit well with previous work, or if something surprising has come to light. Mathematical models are especially emphasized as being of value in specifying clear and precise expectations.

The point that the social sciences and psychology need better theories and models is painfully obvious. But there are in fact thousands of published studies and practical real-world applications that not only provide, but often surpass, the kinds of predictive theories and mathematical models called for in the NHB article. Not only does the article make no mention of any of this work; its argument is also framed entirely in a statistical context rather than in the more appropriate context of measurement science.

The concept of reliability provides an excellent point of entry. Most behavioral scientists think of reliability statistically, as a coefficient with a positive numeric value usually between 0.00 and 1.00. The tangible sense of reliability as indicating exactly how predictable an outcome is rarely figures in researchers’ thinking. But that sense of the specific predictability of results has been the focus of attention in social and psychological measurement science for decades.
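To make that tangible sense concrete, the standard classical relationships can be written out directly: the standard error of measurement is SD·sqrt(1 − R), the separation (signal-to-noise) index is sqrt(R/(1 − R)), and the number of statistically distinct performance strata is (4·separation + 1)/3. The short Python sketch below is purely illustrative; it is not drawn from the NHB article or any particular data set.

    import math

    def predictability_from_reliability(reliability, observed_sd=1.0):
        """Translate a reliability coefficient into tangible terms of predictability."""
        sem = observed_sd * math.sqrt(1.0 - reliability)           # standard error of measurement
        separation = math.sqrt(reliability / (1.0 - reliability))  # true spread in error units
        strata = (4.0 * separation + 1.0) / 3.0                    # statistically distinct levels
        return sem, separation, strata

    for r in (0.70, 0.80, 0.90, 0.95):
        sem, g, h = predictability_from_reliability(r)
        print(f"R = {r:.2f}: SEM = {sem:.2f} SD units, separation = {g:.2f}, strata = {h:.1f}")

Read this way, a reliability of 0.70 distinguishes only two or three levels of performance, while 0.95 distinguishes about six; the coefficient becomes a statement about how reproducibly people or items can be ordered and spaced.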

For instance, the measurement of time is reliable in the sense that the position of the sun relative to the earth can be precisely predicted from geographic location, the time of day, and the day of the year. The numbers and words assigned to noon are closely associated with the sun being at its high point in the sky (though there are politically determined variations by season and location across time zones).
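As a rough illustration of that kind of predictability, the sun’s elevation above the horizon can be approximated from latitude, day of year, and local solar time using textbook formulas. The sketch below uses a common approximation for solar declination; it is meant only to show how a shared frame of reference supports prediction, not as precise astronomy.

    import math

    def solar_elevation(latitude_deg, day_of_year, solar_hour):
        """Approximate solar elevation (degrees) from location, date, and local solar time."""
        # Approximate solar declination for the given day of the year
        declination = -23.44 * math.cos(math.radians(360.0 / 365.0 * (day_of_year + 10)))
        # Hour angle: the sun moves 15 degrees per hour relative to local solar noon
        hour_angle = 15.0 * (solar_hour - 12.0)
        lat, dec, ha = (math.radians(x) for x in (latitude_deg, declination, hour_angle))
        return math.degrees(math.asin(
            math.sin(lat) * math.sin(dec) + math.cos(lat) * math.cos(dec) * math.cos(ha)))

    # At local solar noon on the June solstice at 42 degrees north latitude,
    # the sun should stand near its annual maximum of roughly 71 degrees.
    print(round(solar_elevation(42.0, day_of_year=172, solar_hour=12.0), 1))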

That kind of a reproducible association is rarely sought in psychology and the social sciences, but it is far from nonexistent. One can discern different degrees to which that kind of association is included in models of measured constructs. Though most behavioral research doesn’t mention the connection between linear amounts of a measured phenomenon and a reproducible numeric representation of it (level 0), quite a significant body of work focuses on that connection (level 1). The disappointing thing about that level 1 work is that the relentless obsession with statistical methods prevents most researchers from connecting a reproducible quantity with a single expression of it in a standard unit, and with an associated uncertainty term (level 2). That is, level 1 researchers conceive measurement in statistical terms, as a product of data analysis. Even when results across data sets are highly correlated and could be equated to a common metric, level 1 researchers do not leverage that source of potential value for simplified communication and accumulated comparability.
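For readers unfamiliar with what equating to a common metric involves, the sketch below illustrates one simple approach, mean/sigma linking through anchor items calibrated in two separate data sets. All item names and calibration values are hypothetical, chosen only to show the rescaling step.

    # Hypothetical anchor-item calibrations (in logits) estimated separately from two data sets.
    form_a = {"item1": -1.2, "item2": -0.3, "item3": 0.4, "item4": 1.1}
    form_b = {"item1": -0.9, "item2": 0.1, "item3": 0.8, "item4": 1.6}

    common = sorted(set(form_a) & set(form_b))
    mean_a = sum(form_a[i] for i in common) / len(common)
    mean_b = sum(form_b[i] for i in common) / len(common)

    def spread(values, mean):
        """Standard deviation of the anchor-item calibrations."""
        return (sum((v - mean) ** 2 for v in values) / len(values)) ** 0.5

    sd_a = spread([form_a[i] for i in common], mean_a)
    sd_b = spread([form_b[i] for i in common], mean_b)

    # Mean/sigma linking: re-express Form B calibrations in Form A's frame of reference.
    slope = sd_a / sd_b
    intercept = mean_a - slope * mean_b
    form_b_linked = {item: slope * d + intercept for item, d in form_b.items()}
    print(form_b_linked)

Once the second instrument’s calibrations are expressed in the first one’s frame of reference, measures from either can be reported, compared, and accumulated in the same unit.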

And then, for their part, level 2 researchers usually do not articulate theories about the measured constructs, by augmenting the mathematical data model with an explanatory model predicting variation (level 3). Level 2 researchers are empirically grounded in data, and can expand their network of measures only by gathering more data and analyzing it in ways that bring it into their standard unit’s frame of reference.

Level 3 researchers, however, have come to see what makes their measures tick. They understand the mechanisms that make their questions vary. They can write new questions to their theoretical specifications, test those questions by asking them of a relevant sample, and produce the predicted calibrations. For instance, it is well established that reading comprehension is a function of the difference between a person’s reading ability and the complexity of the text they encounter (see the articles by Stenner in the list below). We have built our entire educational system around this idea, deliberately introducing children first to the alphabet, then to the most common words, then to short sentences, and then to ever longer and more complicated texts. But stating the construct model, testing it against data, calibrating a unit to which all tests and measures can be traced, and connecting together all the books, articles, tests, curricula, and students is a process that began (in English and Spanish) only in the 1980s. The process is still far from finished, and most reading research still does not use the common metric.
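The Stenner articles listed below develop this relationship formally. In its simplest, Rasch-type form, the expected probability of comprehending a text depends only on the difference between reader ability and text complexity; the numbers in the sketch below are illustrative logit values, not calibrations from any published metric.

    import math

    def comprehension_probability(reader_ability, text_complexity):
        """Rasch-type success probability driven by ability minus complexity (in logits)."""
        return 1.0 / (1.0 + math.exp(-(reader_ability - text_complexity)))

    reader = 1.0  # one reader facing progressively harder texts (illustrative values)
    for text in (-1.0, 0.0, 1.0, 2.0):
        p = comprehension_probability(reader, text)
        print(f"ability - complexity = {reader - text:+.1f} logits -> expected success = {p:.2f}")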

In this kind of theory-informed context, new items can be automatically generated on the fly at the point of measurement. Those items, and the inferences made from them, are validated by the consistency of the responses with the expected probabilities of success, agreement, and so on. The expense of constant data gathering and analysis can then be cut to a small fraction of what it is at levels 0-2.
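One way such theoretical specifications can be written down is Fischer’s linear logistic test model (see the references below), in which a generated item’s difficulty is predicted as a weighted sum of its design features and then checked against the empirical calibration. The feature names, weights, and values in this sketch are hypothetical.

    # Hypothetical feature weights (in logits) estimated from previous calibrations, LLTM-style.
    feature_weights = {"rare_vocabulary": 0.8, "sentence_length": 0.5, "inference_required": 1.1}

    def predicted_difficulty(item_features):
        """Predict a new item's calibration from its design features, before any data are collected."""
        return sum(feature_weights[name] * amount for name, amount in item_features.items())

    new_item = {"rare_vocabulary": 1.0, "sentence_length": 0.6, "inference_required": 1.0}
    theoretical = predicted_difficulty(new_item)
    empirical = 2.05  # calibration later estimated from a relevant sample (illustrative)
    print(f"predicted = {theoretical:.2f} logits, observed = {empirical:.2f}, "
          f"discrepancy = {empirical - theoretical:+.2f}")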

Level 3 research methods are not widely known or used, but they are not new. They are gaining traction as their use by national metrology institutes around the world grows. As high-profile critiques of social and psychological research practices continue to emerge, perhaps more attention will be paid to this important body of work. A few key references are provided below; virtually every post in this blog pertains to these issues.

References

Baghaei, P. (2008). The Rasch model as a construct validation tool. Rasch Measurement Transactions, 22(1), 1145-1146 [http://www.rasch.org/rmt/rmt221a.htm].

Bergstrom, B. A., & Lunz, M. E. (1994). The equivalence of Rasch item calibrations and ability estimates across modes of administration. In M. Wilson (Ed.), Objective measurement: Theory into practice, Vol. 2 (pp. 122-128). Norwood, New Jersey: Ablex.

Cano, S., Pendrill, L., Barbic, S., & Fisher, W. P., Jr. (2018). Patient-centred outcome metrology for healthcare decision-making. Journal of Physics: Conference Series, 1044, 012057.

Dimitrov, D. M. (2010). Testing for factorial invariance in the context of construct validation. Measurement & Evaluation in Counseling & Development, 43(2), 121-149.

Embretson, S. E. (2010). Measuring psychological constructs: Advances in model-based approaches. Washington, DC: American Psychological Association.

Fischer, G. H. (1973). The linear logistic test model as an instrument in educational research. Acta Psychologica, 37, 359-374.

Fischer, G. H. (1983). Logistic latent trait models with linear constraints. Psychometrika, 48(1), 3-26.

Fisher, W. P., Jr. (1992). Reliability statistics. Rasch Measurement Transactions, 6(3), 238 [http://www.rasch.org/rmt/rmt63i.htm].

Fisher, W. P., Jr. (2008). The cash value of reliability. Rasch Measurement Transactions, 22(1), 1160-1163 [http://www.rasch.org/rmt/rmt221.pdf].

Fisher, W. P., Jr., & Stenner, A. J. (2016). Theory-based metrological traceability in education: A reading measurement network. Measurement, 92, 489-496.

Green, S. B., Lissitz, R. W., & Mulaik, S. A. (1977). Limitations of coefficient alpha as an index of test unidimensionality. Educational and Psychological Measurement, 37(4), 827-833.

Hattie, J. (1985). Methodology review: Assessing unidimensionality of tests and items. Applied Psychological Measurement, 9(2), 139-164.

Hobart, J. C., Cano, S. J., Zajicek, J. P., & Thompson, A. J. (2007). Rating scales as outcome measures for clinical trials in neurology: Problems, solutions, and recommendations. Lancet Neurology, 6, 1094-1105.

Irvine, S. H., Dunn, P. L., & Anderson, J. D. (1990). Towards a theory of algorithm-determined cognitive test construction. British Journal of Psychology, 81, 173-195.

Kline, T. L., Schmidt, K. M., & Bowles, R. P. (2006). Using LinLog and FACETS to model item components in the LLTM. Journal of Applied Measurement, 7(1), 74-91.

Lunz, M. E., & Linacre, J. M. (2010). Reliability of performance examinations: Revisited. In M. Garner, G. Engelhard, Jr., W. P. Fisher, Jr. & M. Wilson (Eds.), Advances in Rasch Measurement, Vol. 1 (pp. 328-341). Maple Grove, MN: JAM Press.

Mari, L., & Wilson, M. (2014). An introduction to the Rasch measurement approach for metrologists. Measurement, 51, 315-327.

Markward, N. J., & Fisher, W. P., Jr. (2004). Calibrating the genome. Journal of Applied Measurement, 5(2), 129-141.

Maul, A., Mari, L., Torres Irribarra, D., & Wilson, M. (2018). The quality of measurement results in terms of the structural features of the measurement process. Measurement, 116, 611-620.

Muthukrishna, M., & Henrich, J. (2019). A problem in theory. Nature Human Behaviour, 1-9.

Obiekwe, J. C. (1999, August 1). Application and validation of the linear logistic test model for item difficulty prediction in the context of mathematics problems. Dissertation Abstracts International: Section B: The Sciences & Engineering, 60(2-B), 0851.

Pendrill, L. (2014). Man as a measurement instrument [Special Feature]. NCSLi Measure: The Journal of Measurement Science, 9(4), 22-33.

Pendrill, L., & Fisher, W. P., Jr. (2015). Counting and quantification: Comparing psychometric and metrological perspectives on visual perceptions of number. Measurement, 71, 46-55.

Pendrill, L., & Petersson, N. (2016). Metrology of human-based and other qualitative measurements. Measurement Science and Technology, 27(9), 094003.

Sijtsma, K. (2009). Correcting fallacies in validity, reliability, and classification. International Journal of Testing, 8(3), 167-194.

Sijtsma, K. (2009). On the use, the misuse, and the very limited usefulness of Cronbach’s alpha. Psychometrika, 74(1), 107-120.

Stenner, A. J. (2001). The necessity of construct theory. Rasch Measurement Transactions, 15(1), 804-805 [http://www.rasch.org/rmt/rmt151q.htm].

Stenner, A. J., Fisher, W. P., Jr., Stone, M. H., & Burdick, D. S. (2013). Causal Rasch models. Frontiers in Psychology: Quantitative Psychology and Measurement, 4(536), 1-14.

Stenner, A. J., & Horabin, I. (1992). Three stages of construct definition. Rasch Measurement Transactions, 6(3), 229 [http://www.rasch.org/rmt/rmt63b.htm].

Stenner, A. J., Stone, M. H., & Fisher, W. P., Jr. (2018). The unreasonable effectiveness of theory based instrument calibration in the natural sciences: What can the social sciences learn? Journal of Physics: Conference Series, 1044, 012070.

Stone, M. H. (2003). Substantive scale construction. Journal of Applied Measurement, 4(3), 282-297.

Wilson, M. (2005). Constructing measures: An item response modeling approach. Mahwah, New Jersey: Lawrence Erlbaum Associates.

Wilson, M. R. (2013). Using the concept of a measurement system to characterize measurement models used in psychometrics. Measurement, 46, 3766-3774.

Wright, B. D., & Stone, M. H. (1979). Chapter 5: Constructing a variable. In Best test design: Rasch measurement (pp. 83-128). Chicago, Illinois: MESA Press.

Wright, B. D., & Stone, M. H. (1999). Measurement essentials. Wilmington, DE: Wide Range, Inc. [http://www.rasch.org/measess/me-all.pdf].

Wright, B. D., Stone, M., & Enos, M. (2000). The evolution of meaning in practice. Rasch Measurement Transactions, 14(1), 736 [http://www.rasch.org/rmt/rmt141g.htm].


Measurement as a Medium for the Expression of Creative Passions in Education

April 23, 2014

Measurement is often viewed as a purely technical task involving a reduction of complex phenomena to numbers. It is accordingly also experienced as mechanical in nature, and disconnected from the world of life. Educational examinations are often seen as an especially egregious form of inappropriate reduction.

This perspective on measurement is contradicted, however, by the essential roles of calibrated instrumentation, mathematical scales, and high technology in the production of music, which, ironically, is widely considered the most alive, captivating, and emotionally powerful of the arts.

The question then arises as to whether and how measurement in other areas, such as education, might be conceived, designed, and practiced as a medium for the expression and fulfillment of creative passions. Key issues involved in substantively realizing a musical metaphor in human and social measurement include capacities to tune instruments, to define common scales, to orchestrate harmonious relationships, to enhance choral grace-note effects, and to combine elements in unique but pleasing and recognizable forms.

Practical methods of this kind are in place in hundreds of schools nationally and internationally. With such tools in hand, formative applications of integrated instruction and assessment could be conceived as intuitive media for composing and conducting expressions of creative passions.

Student outcomes in reading, mathematics, and other domains may then come to be seen in terms of portfolios of works akin to those produced by musicians, sculptors, filmmakers, or painters. Hundreds of thousands of books and millions of articles tuned to the same text complexity scale provide readers with an extensive palette of colorful tones and timbres for expressing their desires and capacities for learning. Graphical presentations of individual students’ outcomes, as well as outcomes aggregated by classroom, school, district, and so on, may be interpreted and experienced as public displays of artful developmental narratives enabling dramatic performances of personal uniqueness and social generality.

Technical canvases capture, aggregate, and organize literacy performances into special portfolios documenting the play and dance of emerging new understandings. As in any creative process, accidents, errors, and idiosyncratic patterns of strengths and weaknesses may evoke powerful expressions of beauty and of human and social value. Just as members of musical ensembles may complement one another’s skills, using rhythm and harmony to improve each other’s playing in practice, so, too, instruments of formative assessment tuned to the same scale can be used to enhance individual teachers’ skills.

Possibilities for orchestrating such performances across educational, health care, social service, environmental management, and other fields could similarly take advantage of existing instrument calibration and measurement technologies.

Real-life scenarios illustrating the value of better measurement

June 4, 2009

I’ve seen consultants walk hospital employees through a game in which the object is to manage the care of various kinds of patients who enter the system at different points. Patients might have the same conditions, prognosis, payor, and demographics but come in through the ED or a clinic, or emerge from the OR. Others vary medically but enter at the same point. Real-world odds are used to simulate decisions and events as the game proceeds via random card draws. The variation in decisions and outcomes across groups of decision-makers/players is fascinating.

It just occurred to me to set up a game like this with two major scenarios contrasting on a single variable: the quality of measurement. One inning, or half of the game, is the status quo, where existing ratings and percentages are contrived and set up within the actual constraints of real data to illustrate the dangers of relying on numbers that are not measures. (Contact me for examples of how percentages can, and sometimes do, mean exactly the opposite of what they appear to mean.)

In this part of the game, we see the kinds of normal, par-for-the-course inefficiencies, errors, outcomes, and costs that everyone expects to see.
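One simple way the status-quo numbers mislead is that equal percentage-point gains represent unequal amounts of underlying change at different points on the scale, so apparent rankings can reverse. The sketch below uses made-up numbers, not the actual examples I have in mind; it converts proportions to logits to show a unit that seems to have improved more actually improving less.

    import math

    def logit(p):
        """Convert a proportion to an interval-scale (logit) value."""
        return math.log(p / (1.0 - p))

    # Hypothetical quality-indicator results for two units, before and after an intervention.
    units = {"Unit A": (0.45, 0.55), "Unit B": (0.80, 0.88)}

    for name, (before, after) in units.items():
        point_gain = (after - before) * 100
        logit_gain = logit(after) - logit(before)
        print(f"{name}: +{point_gain:.0f} percentage points, {logit_gain:.2f} logits of underlying change")

Unit A looks better on percentage points (+10 versus +8) yet shows less underlying change (about 0.40 versus 0.61 logits), which is the kind of inversion this half of the game would surface.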

In the second half of the game, we set up the same kind of scenario, but this time decisions are informed by meaningfully calibrated and contextualized measures. Everyone in the system has the same frame of reference, and decisions are coordinated virtually by the way the information is harmonized.

I imagine the two parts of the game might be played simultaneously by two equally experienced groups of managers and clinicians. Each group might be given a systems perspective, and would be encouraged to innovate with comparative effectiveness studies. When they have both arrived at their outcomes, tracked on a scorecard, they are debriefed together, the results are compared, and they are informed about the inner workings of the data they worked from.

Part of the point here would be to show that evidence-based decision-making is only worth as much as the evidence in hand. Evidence that is not constructed on the basis of a scientific theory and that is not mediated by calibrated instrumentation is worth much less than evidence that is theoretically justified and read off calibrated instruments.
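A toy simulation can make the point concrete: when a triage threshold is applied to a noisy indicator rather than to the quantity it is supposed to represent, a predictable share of decisions flip. The error levels, threshold, and sample size below are arbitrary, for illustration only.

    import random

    random.seed(0)

    def misclassification_rate(measurement_error_sd, n=20000, threshold=1.0):
        """Share of threshold decisions that flip when true severity is read through a noisy indicator."""
        flips = 0
        for _ in range(n):
            true_severity = random.gauss(0.0, 1.0)
            observed = true_severity + random.gauss(0.0, measurement_error_sd)
            if (true_severity >= threshold) != (observed >= threshold):
                flips += 1
        return flips / n

    for sd in (0.2, 0.5, 1.0):  # better versus worse measurement, in true-score SD units
        print(f"measurement error SD = {sd:.1f}: {misclassification_rate(sd):.1%} of decisions flip")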

It might be useful to imagine a seminar or workshop in which these scenarios are explored as illustrations of the way fully formed metrological systems reduce transaction costs and market frictions, greasing the wheels of health care commerce. Maybe the contrast could also be brought out with a survey or multiple-choice test.

Variations on the scenarios could be constructed for education or human resource contexts, as well.

Just wanted to put this down in writing. What do you think?