Clarifying the Goal: Submitting Rasch-based White Papers to NIST

October 23, 2009 by livingcapitalmetrics

NIST does not currently have any metrological standards (metrics to which all instruments measuring a particular construct are traceable) for anything measured with tests, surveys, rating scale assessments, or rankings–i.e., for anything of core interest in education, psychology, sociology, health status assessment, etc.

The ostensible reason for the lack of these standards is that no one has stepped up to demand them, to demonstrate their feasibility, or to argue on behalf of their value. So anything of general interest as something for which we would want universally uniform and available metrics could be proposed. As can be seen in the NIST call, you have to be able to argue for the viability of a fundamentally new innovation that would produce high returns on the investment in a system of networked, equated, or item banked instruments all measuring in a common metric.

Jack Stenner expressed the opinion some years ago that constructs already measured on mass scales using many different instruments that could conceivably be equated present the most persuasive cases for which strong metrological arguments could be made. I have wondered if that is necessarily true.

The idea is to establish a new division in NIST, managed jointly with the National Institutes of Health and of Education, that focuses on creating a new kind of metric system for informing human, social, and natural capital management, quality improvement, and research.

Because NIST has historically focused on metrological systems in the physical sciences, the immediate goal is only one of informing researchers at NIST as to the viability and potential value to be realized in analogous systems for the psychosocial sciences. No one understands the human, social, and economic value of measurement standards like NIST does.

Work that results in fundamental measures of psychosocial constructs should be proposed as areas deserving of NIST’s support. White Papers describing the “high risk-high reward” potential of Rasch applications might get them to start to consider the possibility of a whole new domain of metrics.

For more info, see www.nist.gov/tip/call_for_white_papers_sept09.pdf, and feel free to reference the arguments I made in the White Paper I submitted (www.livingcapitalmetrics.com/images/FisherNISTWhitePaper2.pdf), or in my recent paper in Measurement: Fisher, W. P., Jr. (2009, November). Invariance and traceability for measures of human, social, and natural capital: Theory and application. Measurement (Elsevier), 42(9), 1278-1287.

How Measurement, Contractual Trust, and Care Combine to Grow Social Capital: Creating Social Bonds We Can Really Trade On

October 14, 2009 by livingcapitalmetrics

Last Saturday, I went to Miami, Florida, at the invitation of Paula Lalinde (see her profile at http://www.linkedin.com/pub/paula-lalinde/11/677/a12) to attend MILITARY 101: Military Life and Combat Trauma As Seen By Troops, Their Families, and Clinicians. This day-long free presentation was sponsored by The Veterans Project of South Florida-SOFAR, in association with The Southeast Florida Association for Psychoanalytic Psychology, The Florida Psychoanalytic Society, the Soldiers & Veterans Initiative, and the Florida BRAIVE Fund. The goals of the session “included increased understanding of the unique experiences and culture related to the military experience during wartime, enhanced understanding of the assessment and treatment of trauma specific difficulties, including posttraumatic stress disorder, common co-occurring conditions, and demands of treatment on trauma clinicians.”

Listening to the speakers on Saturday morning at the Military 101 orientation, I was struck by what seemed to me to be a developmental trajectory implied in the construct of therapy-aided healing. I don’t recall if anyone explicitly mentioned Maslow’s hierarchy but it was certainly implied by the dysfunctionality that attends being pushed down to a basic mode of physical survival.

Also, the various references to the stigma of therapy reminded me of Paula’s arguments as to why a community-based preventative approach would be more accessible and likely more successful than individual programs focused on treating problems. (Echoes here of positive psychology and appreciative inquiry.)

In one part of the program, the ritualized formality of the soldier, family, and support groups’ stated promises to each other suggested a way of operationalizing the community-based approach. The expectations structuring relationships among the parties in this community are usually left largely unstated, unexamined, and unmanaged in all but the broadest, and most haphazard, ways (as most relationships’ expectations usually are). The hierarchy of needs and progressive movement towards greater self-actualization implies a developmental sequence of steps or stages that comprise the actual focus of the implied contracts between the members of the community. This sequence is a measurable continuum along which change can be monitored and managed, with all parties accountable for their contracted role in producing specific outcomes.

The process would begin from the predeployment baseline, taking that level of reliability and basis of trust existing in the community as what we want to maintain, what we might want to get back to, and what we definitely want to build on and surpass, in time. The contract would provide a black-and-white record of expectations. It would embody an image of the desired state of the relationships and it could be returned to repeatedly in communications and in visualizations over time. I’ll come back to this after describing the structure of the relational patterns we can expect to observe over the course of events.

The Saturday morning discussion made repeated reference to the role of chains in the combat experience: the chain of command, and the unit being a chain only as strong as its weakest link. The implication was that normal community life tolerates looser expectations, more informal associations, and involves more in the way of team interactions. The contrast between chains and teams brought to mind work by Wright (1995, 1996a, 1996b; Bainer, 1997) on the way the difficulties of the challenges we face influence how we organize ourselves into groups.

Chains tend to form when the challenge is very difficult and dangerous; here we have mountain climbers roped together, bucket brigades putting out fires, and people stretching out end-to-end over thin ice to rescue someone who’s fallen through. In combat, as was stressed repeatedly last Saturday, the chain is one requiring strict follow-through on orders and promises; lives are at risk and protecting them requires the most rigorous adherence to the most specific details in an operation.

Teams form when the challenge is not difficult and it is possible to coordinate a fluid response of partners whose roles shift in importance as the situation changes. Balls are passed and the lead is taken by each in turn, with others getting out of the way or providing supports that might be vitally important or merely convenient.

A third kind of group, packs, forms when the very nature of the problem is at issue; here, individuals take completely different approaches in an exploratory determination of what is at issue, and how it might be addressed. Examples include the Manhattan Project, where scientists following personal hunches went in their own directions looking for solutions to complex problems. Wolves and other hunting parties form packs when it is impossible to know where the game might be. And though the old joke says that the best place to look for lost keys is where there’s the most light, if you have others helping you, it’s best to split up and not be looking for them in the same place.

After identifying these three major forms of organization, Wright (1996b) saw that individual groups might transition to and from different modes of organization as the nature of the problem changed. For instance, a 19th-century wagon train of settlers heading through the American West might function well as a team when everyone feels safe traveling along with a cavalry detachment, the road is good, the weather is pleasant, and food and water are plentiful. Given vulnerability to attacks by Native Americans, storms, accidents, lack of game, and/or muddy, rutted roads, however, the team might shift toward a chain formation and circle the wagons, with a later return to the team formation after the danger has passed. In the worst case scenario, disaster breaks the chain into individuals scattered like a pack to fend for themselves, with the limited hope of possibly re-uniting at some later time as a chain or team.

In the current context of the military, it would seem that deployment fragments the team, with the soldier training for a position in the chain of command in which she or he will function as a strong link for the unit. The family and support network can continue to function together and separately as teams to some extent, but the stress may require intermittent chain forms of organization. Further, the separation of the soldier from the family and support would seem to approach a pack level of organization for the three groups taken as a whole.

An initial contract between the parties would describe the functioning of the team at the predeployment stage, recognize the imminent breaking up of the team into chains and packs, and visualize the day when the team would be re-formed under conditions in which significant degrees of healing will be required to move out of the pack and chain formations. Perhaps there will be some need and means of countering the forcible boot camp enculturation with medicinal antidote therapies of equal but opposite force. Perhaps some elements of the boot camp experience could be safely modified without compromising the operational chain to set the stage for reintegrating the family and community team.

We would want to be able to draw qualitative information from all three groups as to the nature of their experiences at every stage. I think we would want to focus the information on descriptions of the extent to which each level in Maslow’s hierarchy is realized. This information would be used in the design of an assessment that would map out the changes over time, set up the evaluation framework, and guide interventions toward re-forming the team. Given their experience with the healing process, the presenters from last Saturday have obvious capacities for an informed perspective on what’s needed here. And what we build with their input would then also plainly feed back into the kind of presentation they did.

There will likely be signature events in the process that will be used to trigger new additions to the contract, as when the consequences of deployment, trauma, loss, or return relative to Maslow’s hierarchy can be predicted. That is, the contract will be a living document that changes as goals are reached or as new challenges emerge.

All of this is, of course, situated within the context of measures calibrated and shared across the community to inform contracts, treatment, expectations, etc., following the general metrological principles I outline in my published work (see references).

The idea will be for the consistent production of predictable amounts of impact in the legally binding contractual relationships, such that the benefits produced in terms of individual functionality will attract investments from those in positions to employ those individuals, and from the wider society that wants to improve its overall level of mental health. One could imagine that counselors, social workers, and psychotherapists will sell social capital bonds at prices set by market forces on the basis of information analogous to the information currently available in financial markets, grocery stores, or auto sales lots. Instead of paying taxes, corporations would be required to have minimum levels of social capitalization. These levels might be set relative to the value the organization realizes from the services provided by public schools, hospitals, and governments relative to the production of an educated, motivated, healthy workforce able to get to work on public roads, able to drink public water, and living in a publicly maintained quality environment.

There will be a lot more to say on this latter piece, following up on previous blogs here that take up the topic. The contractual groundwork that sets up the binding obligations for formal agreements is the thought of the day that emerged last weekend at the session in Miami. Good stuff, long way to go, as always….

References
Bainer, D. (1997, Winter). A comparison of four models of group efforts and their implications for establishing educational partnerships. Journal of Research in Rural Education, 13(3), 143-152.

Fisher, W. P., Jr. (1995). Opportunism, a first step to inevitability? Rasch Measurement Transactions, 9(2), 426 [http://www.rasch.org/rmt/rmt92.htm].

Fisher, W. P., Jr. (1996, Winter). The Rasch alternative. Rasch Measurement Transactions, 9(4), 466-467 [http://www.rasch.org/rmt/rmt94.htm].

Fisher, W. P., Jr. (1997a). Physical disability construct convergence across instruments: Towards a universal metric. Journal of Outcome Measurement, 1(2), 87-113.

Fisher, W. P., Jr. (1997b, June). What scale-free measurement means to health outcomes research. Physical Medicine & Rehabilitation State of the Art Reviews, 11(2), 357-373.

Fisher, W. P., Jr. (1998). A research program for accountable and patient-centered health status measures. Journal of Outcome Measurement, 2(3), 222-239.

Fisher, W. P., Jr. (2000). Objectivity in psychosocial measurement: What, why, how. Journal of Outcome Measurement, 4(2), 527-563 [http://www.livingcapitalmetrics.com/images/WP_Fisher_Jr_2000.pdf].

Fisher, W. P., Jr. (2004, October). Meaning and method in the social sciences. Human Studies: A Journal for Philosophy and the Social Sciences, 27(4), 429-454.

Fisher, W. P., Jr. (2005). Daredevil barnstorming to the tipping point: New aspirations for the human sciences. Journal of Applied Measurement, 6(3), 173-179 [http://www.livingcapitalmetrics.com/images/FisherJAM05.pdf].

Fisher, W. P., Jr. (2008). Vanishing tricks and intellectualist condescension: Measurement, metrology, and the advancement of science. Rasch Measurement Transactions, 21(3), 1118-1121 [http://www.rasch.org/rmt/rmt213c.htm].

Fisher, W. P., Jr. (2009, November). Invariance and traceability for measures of human, social, and natural capital: Theory and application. Measurement (Elsevier), 42(9), 1278-1287.

Wright, B. D. (1995). Teams, packs, and chains. Rasch Measurement Transactions, 9(2), 432 [http://www.rasch.org/rmt/rmt92j.htm].

Wright, B. D. (1996a). Composition analysis: Teams, packs, chains. In G. Engelhard & M. Wilson (Eds.), Objective measurement: Theory into practice, Vol. 3 (pp. 241-264). Norwood, New Jersey: Ablex [http://www.rasch.org/memo67.htm].

Wright, B. D. (1996b). Pack to chain to team. Rasch Measurement Transactions, 10(2), 501 [http://www.rasch.org/rmt/rmt102s.htm].

Posted in response to October 5, 2009 Business Week Viewpoint: “Problems with Obama’s Innovation Strategy”

October 8, 2009 by livingcapitalmetrics

Everything you (Jeneanne Rae, the author of the Viewpoint) say is quite true, but in the end you don’t offer any more in the way of details than the administration has. Everyone repeats the mantra of needing clear accountability metrics but no one is focusing on the need for a metric system and reference standards for all of the new forms of intangible capital we’re trying to manage. The collective mind needs a common language of shared terms and objects to innovate effectively. But our measures of innovation, risk, governance, trust, abilities, health, and environmental quality are all expressed in incommensurable, instrument-dependent units. THEY DON’T HAVE TO BE!! Measurement science has 80 years of experience in calibrating and equating tests, surveys, and assessments into measurement systems that retain their metric properties over time, space, respondent samples, different collections of questions asked, etc. If measurement is so important to management, why aren’t more people talking about investing in the infrastructure we need for managing human, social, and natural capital? For more information, see http://livingcapitalmetrics.wordpress.com, http://www.livingcapitalmetrics.com, or http://www.rasch.org.

Rae’s article is at http://www.businessweek.com/innovate/content/oct2009/id2009105_684520.htm?link_position=link1.

On the alleged difficulty of quantifying this or that

October 5, 2009 by livingcapitalmetrics

That this effect or that phenomenon is “difficult to quantify” is one of those phrases that people use from time to time. But, you know, building a computer is difficult, too. I couldn’t do it, and you probably couldn’t, either. Computers are, however, readily available for purchase and it doesn’t matter if you or I can make our own.

Same thing with measurement. Of course, instrument design and calibration are highly technical endeavors, and despite 80+ years of success, most people seem to think it is impossible to really quantify abstract things like abilities, attitudes, motivations, trust, outcomes and impacts, or maturational development. But real quantification, the kind that is commonly thought possible only for physical things, has been underway in psychology and the social sciences for a long time. More people need to know this.

As anyone who has read much of this blog knows, I’m not talking about some kind of simplistic survey or assessment process that takes measurement to be a mere assignment of numbers to observations. Instrument calibration takes a lot more thought and effort than is usually invested in it. But it isn’t impossible, not by a long shot.

Just as you would not despair of ever having your own computer just because you cannot make one yourself, those who throw up their hands at the supposed difficulty of quantifying something need to think again. Where there’s a will, there’s a way, and scientifically rigorous methods of determining whether something is measurable are a lot more ready to hand than most people realize.

For more information, see my survey design recommendations on pages 1072-1074 at http://www.rasch.org/rmt/rmt203.pdf and Ben Wright’s 15 steps to measurement at http://www.rasch.org/rmt/rmt141g.htm.

Comments on the National Accounts of Well-Being

October 4, 2009 by livingcapitalmetrics

Well-designed measures of human, social, and natural capital captured in genuine progress indicators and properly put to work on the front lines of education, health care, social services, human and environmental resource management, etc. will harness the profit motive as a driver of growth in human potential, community trust, and environmental quality. But it is a tragic shame that so many well-meaning efforts ignore the decisive advantages of readily available measurement methods. For instance, consider the National Accounts of Well-Being (available at http://www.nationalaccountsofwellbeing.org/learn/download-report.html).

This report’s authors admirably say that “Advances in the measurement of well-being mean that now we can reclaim the true purpose of national accounts as initially conceived and shift towards more meaningful measures of progress and policy effectiveness which capture the real wealth of people’s lived experience” (p. 2).

Of course, as is evident in so many of my posts here and in the focus of my scientific publications, I couldn’t agree more!

But look at p. 61, where the authors say “we acknowledge that we need to be careful about interpreting the distribution of transformed scores. The curvilinear transformation results in scores at one end of the distribution being stretched more than those at the other end. This means that standard deviations, for example, of countries with higher scores, are likely to be distorted upwards. As the results section shows, however, this pattern was not in fact found in our data, so it appears that this distortion does not have too much effect. Furthermore, being overly concerned with the distortion would imply absolute faith that the original scales used in the questions are linear. Such faith would be ill-founded. For example, it is not necessarily the case that the difference between ‘all or almost all of the time’ (a response scored as ‘4’ for some questions) and ‘most of the time’ (scored as ‘3’), is the same as the difference between ‘most of the time’ (‘3’) and ‘some of the time’ (‘2’).”

This is just incredible, that the authors admit so baldly that their numbers don’t add up at the same time that they offer those very same numbers in voluminous masses to a global audience that largely takes them at face value. What exactly does it mean to most people “to be careful about interpreting the distribution of transformed scores”?

More to the point, what does it mean that faith in the linearity of the scales is ill-founded? They are doing arithmetic with those scores! There is no way to do that without assuming a constant difference between adjacent numbers on the scale! Instead of offering cautions, the creators of anything as visible and important as National Accounts of Well-Being ought to do the work needed to construct scales that measure in numbers that add up. Instead of saying they don’t know what the size of the unit of measurement is at different places on the ruler, why don’t they formulate a theory of the thing they want to measure, state testable hypotheses as to the constancy and invariance of the measuring unit, and conduct the experiments? It is not, after all, as though we do not have a mature measurement science that has been doing this kind of thing for more than 80 years.

By its very nature, the act of adding up ratings into a sum, and dividing by the number of ratings included in that sum to produce an average, demands the assumption of a common unit of measurement. But practical science does not function or advance on the basis of untested assumptions. Different numbers that add up to the same sum have to mean the same thing: 1+3+4=8=2+3+3, etc. So the capacity of the measurement system to support meaningful inferences as to the invariance of the unit has to be established experimentally.
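To make the arithmetic concrete, here is a minimal sketch of what goes wrong when the unit is not constant. The category locations below are hypothetical values I made up purely for illustration, not estimates from any data set; the only point is that two rating patterns with the same raw sum need not represent the same amount once the spacing between categories is unequal.

```python
# Hypothetical locations of rating categories 1-4 on an interval scale.
# These values are assumed for illustration only; they are not estimates.
CATEGORY_LOCATION = {1: -2.0, 2: -0.5, 3: 0.3, 4: 2.5}


def raw_sum(ratings):
    """Sum the ratings as if every step between categories were the same size."""
    return sum(ratings)


def located_sum(ratings):
    """Sum the (hypothetical) interval-scale locations of the same ratings."""
    return sum(CATEGORY_LOCATION[r] for r in ratings)


pattern_a = [1, 3, 4]
pattern_b = [2, 3, 3]

# Both patterns have the same raw sum ...
print(raw_sum(pattern_a), raw_sum(pattern_b))
# ... but not the same total once unequal category spacing is taken into account.
print(located_sum(pattern_a), located_sum(pattern_b))
```

Whether real categories are spaced like this is exactly the empirical question that has to be tested rather than assumed.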

There is no way to do arithmetic and compute statistics on ordinal rating data without assuming a constant, additive unit of measurement. Either unrealistic demands are being made on people’s cognitive abilities to stretch and shrink numeric units, or the value of the numbers as a basis for action is seriously and unnecessarily compromised.

A lot can be done to construct linear units of measurement that provide the meaningfulness desired by the developers of the National Accounts of Well-Being.

For explanations and illustrations of why scores and percentages are not measures, see http://livingcapitalmetrics.wordpress.com/2009/07/01/graphic-illustrations-of-why-scores-ratings-and-percentages-are-not-measures-part-one/.

The numerous advantages real measures have over raw ratings are listed at http://livingcapitalmetrics.wordpress.com/2009/07/06/table-comparing-scores-ratings-and-percentages-with-rasch-measures/.

To understand the contrast between dead and living capital as it applies to measures based in ordinal data from tests and rating scales, see http://www.rasch.org/rmt/rmt154j.htm.

For a peer-reviewed scientific paper on the theory and research supporting the viability of a metric system for human, social, and natural capital, see http://dx.doi.org/doi:10.1016/j.measurement.2009.03.014.

Just posted on www.economist.com in response to Sept 26 Schumpeter article

September 29, 2009 by livingcapitalmetrics

Let’s cut through the Gordian Knot to the real issue. That we manage what we measure is as close to an absolute truth as there ever was. What got us into this mess was the inadequacy of the vast majority of our measures. So-called “measures” that only get in the way of management are a sign that new standards, criteria, and methods of measurement are needed. The core issue we face is how to transform socialized externalities into capitalized internalities. Transaction costs are the most important and largest costs in any economic exchange. We reduce and control these via measurement. Human, social, and natural capital transaction costs are virtually uncontrolled and unmeasured. We need a metric system for universally uniform measures of abilities and skills, health, motivation, loyalty and trust, and environmental quality. And we needed it yesterday. But who is working on it? Who is talking about it? Most importantly, who is taking advantage of the huge strides that have been made in measurement science over the last 50 years, strides that have made measurement far more rigorous, practical, and flexible than anyone in business seems to know? As to business being an art, so is music, but music is played on and reproduced by some of the highest technology and finest precision instrumentation around. What we need to do is tune the instruments of the management arts and sciences so that we can harmonize our relationships, get with the beat, and sing the melodies we feel in our hearts and souls. For more information, see http://www.livingcapitalmetrics.com, or my blog at http://livingcapitalmetrics.wordpress.com.

Three demands of meaningful measurement

September 28, 2009 by livingcapitalmetrics

The core issue in measurement is meaningfulness. There are three major aspects of meaningfulness to take into account in measurement. These have to do with the constancy of the unit, interpreting the size of differences in measures, and evaluating the coherence of the units and differences.

First, raw scores (counts of right answers or other events, sums of ratings, or rankings) do not stand for anything that adds up the way they do (see previous blogs for more on this). Any given raw score unit can be 4-5 times larger than another, depending on where they fall in the range. Meaningful measurement demands a constant unit. Instrument scaling methods provide it.
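A rough way to see the inconstant unit is to convert raw scores to log-odds and compare the size of a one-point step at different places in the range. The sketch below assumes a hypothetical 50-point test and uses the simple log-odds transformation ln(r / (L - r)) as a stand-in for a full instrument calibration, so the exact ratio it prints should be read as illustrative only.

```python
import math


def raw_score_to_logit(raw_score, max_score):
    """Log-odds of a raw score, ln(r / (L - r)); extreme scores have no finite value."""
    if not 0 < raw_score < max_score:
        raise ValueError("extreme scores have no finite logit")
    return math.log(raw_score / (max_score - raw_score))


def unit_size(raw_score, max_score):
    """Logit distance covered by the one-point step from raw_score to raw_score + 1."""
    return (raw_score_to_logit(raw_score + 1, max_score)
            - raw_score_to_logit(raw_score, max_score))


MAX_SCORE = 50  # hypothetical test length, assumed for illustration

middle_step = unit_size(25, MAX_SCORE)   # one point gained in the middle of the range
extreme_step = unit_size(47, MAX_SCORE)  # one point gained near the top of the range
ratio = extreme_step / middle_step

print(f"middle step:  {middle_step:.3f} logits")
print(f"extreme step: {extreme_step:.3f} logits")
print(f"ratio:        {ratio:.1f}")
```

The one-point step near the top of the range comes out several times larger than the one-point step in the middle, even though both count as "one point" in the raw score.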

Second, meaningful measurement requires that we be able to say just what any quantitative amount of difference is supposed to represent. What does a difference between two measures stand for in the way of what is and isn’t done at those two levels? Is the difference within the range of error, and so random? Is the difference many times more than the error, and so repeatedly reproducible and constant? Meaningful measurement demands that we be able to make reliable distinctions.
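As a minimal sketch of the error question, a difference between two measures can be expressed in units of its joint standard error. The function name and the measures and standard errors used here are hypothetical, and the calculation assumes the two errors are independent.

```python
import math


def difference_in_error_units(measure_a, se_a, measure_b, se_b):
    """Difference between two measures divided by its joint standard error.

    Assumes the two standard errors are independent of one another.
    """
    joint_se = math.sqrt(se_a ** 2 + se_b ** 2)
    return (measure_a - measure_b) / joint_se


# Hypothetical measures and standard errors, in logits.
small = difference_in_error_units(1.2, 0.4, 1.0, 0.4)
large = difference_in_error_units(2.5, 0.4, 0.5, 0.4)

print(f"small difference: {small:.2f} joint standard errors")
print(f"large difference: {large:.2f} joint standard errors")
```

A difference that is only a fraction of its joint error cannot be told apart from noise; one that is several times its joint error can be expected to reproduce.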

Third, meaningful measurement demands that the items work together to measure the same thing. If reliable distinctions can be made between measures, what is the one thing that all of the items tap into? If the data exhibit a consistency that is shared across items and across persons, what is the nature of that consistency? Meaningful measurement posits a model of what data must look like to be interpretable and coherent, and then it evaluates data in light of that model.
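One such model is the dichotomous Rasch model, which I take as the working assumption for the sketch below. It computes the modeled probability of success for each item, and then an unweighted mean-square fit statistic (the mean of the squared standardized residuals) for one person's responses. The item difficulties and response strings are hypothetical, and the person's measure is treated as already known rather than estimated.

```python
import math


def rasch_probability(ability, difficulty):
    """Modeled probability of success under the dichotomous Rasch model."""
    return 1.0 / (1.0 + math.exp(-(ability - difficulty)))


def standardized_residual(observed, ability, difficulty):
    """(observed - expected) divided by the modeled standard deviation."""
    p = rasch_probability(ability, difficulty)
    return (observed - p) / math.sqrt(p * (1.0 - p))


def outfit_mean_square(responses, ability, difficulties):
    """Unweighted mean of the squared standardized residuals."""
    squared = [standardized_residual(x, ability, d) ** 2
               for x, d in zip(responses, difficulties)]
    return sum(squared) / len(squared)


# Hypothetical item difficulties (logits), ordered from easy to hard.
difficulties = [-2.0, -1.0, 0.0, 1.0, 2.0]
ability = 0.0  # assumed known here; in practice it would be estimated

consistent = [1, 1, 1, 0, 0]    # succeeds on easy items, fails on hard ones
inconsistent = [0, 0, 1, 1, 1]  # fails on easy items, succeeds on hard ones

fit_consistent = outfit_mean_square(consistent, ability, difficulties)
fit_inconsistent = outfit_mean_square(inconsistent, ability, difficulties)

print(f"consistent pattern:   {fit_consistent:.2f}")
print(f"inconsistent pattern: {fit_inconsistent:.2f}")
```

Values near 1 indicate about as much randomness as the model expects. The response string that contradicts the item ordering produces a much larger value, which is the kind of signal that prompts a closer look at what the items are really tapping into.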

When a constant unit is in hand, when the limits of randomness relative to stable differences are known, and when individual responses are consistent with one another, then, and only then, is measurement meaningful. Inconstant units, unknown amounts of random variation, and inconsistent data can never amount to the science we need for understanding and managing skills, abilities, health, motivations, social bonds, and environmental quality.

Managing our investments in human, social, and natural capital for positive returns demands that meaningful measurement be universalized in uniformly calibrated and accessible metrics. Scientifically rigorous, practical, and convenient methods for setting reference standards and making instruments traceable to them are readily available.

We have the means in hand for effecting order-of-magnitude improvements in the meaningfulness of the measures used in education, health care, human and environmental resource management, etc. It’s time we got to work on it.

We are what we measure. It’s time we measured what we want to be.

Response to John Carney’s comments on Sept 25th’s Marketplace with Kai Ryssdal on NPR

September 25, 2009 by livingcapitalmetrics

Way to go, John Carney!! You’re my new hero!

Finally someone has said it right out loud: Even after a year in the economic trough, we are far from a consensus on what went wrong. Everyone is still fighting over the core problem, and so it is impossible to formulate a way forward.

Yes, it has been incredibly frustrating to watch everyone go round and round the central issue without ever really seeing it. Talk about the blind men and the elephant!

The first part of what needs to be done is on the table. There does seem to be a fair degree of consensus on the idea that, somehow or other, we need to bring ALL forms of capital–human, social, and natural–into the econometric models, the financial spreadsheets, and the Genuine Progress Indicators or Happiness Indexes, so that the profit motive can be harnessed in sustainable, socially responsible ways to build communities, the environment, and human potential. As some have said, we need to transform socialized externalities into capitalized internalities.

I am hardly the first to suggest that. But what has been missing in previous proposals was the means by which we would devise transparent, fungible representations of each significant form of capital. How do we create common currencies for trading in each form of capital? By calibrating the instruments and deploying the reference standard common metrics we will use as those currencies. MEASUREMENT is the core problem. “You manage what you measure” is repeated like a mantra everywhere, but the quality of our measures of risk, opportunity, outcomes, and impacts–in short, of all measures of intangible forms of capital (human, social, intellectual, etc.)–is terrible!

You would never know it from most current measurement practice in business and government, but huge advances have been made over the last 50 years in scaling technology. The mathematical rigor, meaningfulness, practicality, and convenience of measures based in ratings, ability tests, surveys, and assessments could be an order of magnitude better if we just took advantage of existing technologies. We need universally uniform measures of health, skills and abilities, motivation, community life, governance, risk, and environmental quality akin to the measures of weight, volume, time, kilowatts, etc. that we take for granted in economic exchanges every day.

It is essential to realize that universal uniformity in no way requires universal acceptance of exactly the same instruments, observations, content, questions, items, etc. Any brand tool that verifiably produces measures of the desired outcome, impact, process, etc. in the reference standard metric can compete in the measurement market, just as is the case with clocks or thermometers. Further, far from reducing rich complexities to a meaningless number, new measurement technologies open up new horizons for meaningful relationships by improving our understandings of ourselves and others.

The way forward centers on building a scientifically rigorous metric system that will make our stocks of human, social, and natural capital fungible. We need universally uniform metrics for the outcomes and impacts of schools, hospitals, social services, human, organizational, and natural resources, etc. When you appreciate the extent to which any economy lives and dies on its measurement standards, the need for national and global investments in quantitatively rigorous and qualitatively meaningful metrics becomes all too painfully obvious.

For more information, see http://livingcapitalmetrics.wordpress.com, http://www.livingcapitalmetrics.com, http://www.linkedin.com/in/livingcapitalmetrics, and other sources, such as the Wikipedia entry on Georg Rasch, www.rasch.org, and elsewhere.

These comments were also posted at http://marketplace.publicradio.org/display/web/2009/09/25/pm-weekly-wrap-q/.

NIST Call for White Papers

September 22, 2009 by livingcapitalmetrics

As I’ve been preparing the statistics.com course and consulting on a couple of projects, it’s been difficult to make time for postings here. There’s no lack of things to say, that’s for sure! The following is an alert to an opportunity that should not be passed up….

The National Institute of Standards and Technology has posted a new Call for White Papers (http://www.nist.gov/tip/call_for_white_papers_sept09.pdf) as part of its mission “to support, promote, and accelerate innovation in the United States through high-risk, high-reward research in areas of critical national need.”

The White Papers are NIST’s mechanism for collaborating with practitioners in the field in the development of new areas of research into fundamental measurement and metrological systems. NIST is specifically seeking out areas of measurement research that are not currently a priority with any federal funding agency and that have the potential for bringing about fundamental transformations in particular scientific areas.

As was evident in its celebration of World Metrology Day last May, NIST is well aware of the human, economic, and scientific value of technical standards. Mathematics becomes the language of science most fully when universally uniform common currencies provide a lingua franca for communicating experimental results and theoretical predictions, and for economic exchanges of quantitative value. When this truth is fully appreciated, it is obvious that metrological standards for human, social, and natural capital are an area of critical national need that could be highly rewarding. Given the decades of supporting research that are on the books, the risks of investing in this research are quite reasonable. This is especially so when considered relative to the rewards that could accrue from order-of-magnitude improvements in the meaningfulness, utility, and efficiency of measurement based in ordinal observations.

The Call for White Papers is not a funding opportunity but a chance to influence the substance of the areas to be focused on in future funding competitions. One might imagine that NIST would be very interested in supporting research exploring the potential for expanding any of a number of existing measurement systems and methodologies into publicly recognized reference standards.

Deadlines over the next year for White Papers are November 9, February 15, May 10, and July 12, though submissions will be accepted any time between November 9, 2009 and September 30, 2010.

A PDF of a White Paper that builds a case for Rasch-based metrological standards and that was submitted to NIST in its previous round is available at http://www.livingcapitalmetrics.com/images/FisherNISTWhitePaper2.pdf.

Further articulations of connections between Rasch measurement and the wider concerns of instruments traceable to reference standards within metrological networks are available in the following, among others:

Fisher, W. P., Jr. (1996, Winter). The Rasch alternative. Rasch Measurement Transactions, 9(4), 466-467 [http://www.rasch.org/rmt/rmt94.htm].

Fisher, W. P., Jr. (1997). Thurstone’s missed opportunity. Rasch Measurement Transactions, 11(1), 554 [http://www.rasch.org/rmt/rmt111p.htm].

Fisher, W. P., Jr. (2000). Objectivity in psychosocial measurement: What, why, how. Journal of Outcome Measurement, 4(2), 527-563 [http://www.livingcapitalmetrics.com/images/WP_Fisher_Jr_2000.pdf].

Fisher, W. P., Jr. (2008). Vanishing tricks and intellectualist condescension: Measurement, metrology, and the advancement of science. Rasch Measurement Transactions, 21(3), 1118-1121 [http://www.rasch.org/rmt/rmt213c.htm].

Fisher, W. P., Jr. (2009, November). Invariance and traceability for measures of human, social, and natural capital: Theory and application. Measurement (Elsevier), 42(9), 1278-1287.

Reliability Coefficients: Starting from the Beginning

August 31, 2009 by livingcapitalmetrics

[This posting was prompted by questions concerning a previous blog entry, Reliability Revisited, and provides background on reliability that only Rasch measurement practitioners are likely to possess.] Most measurement applications based in ordinal data do not implement rigorous checks of the internal consistency of the observations, nor do they typically use the log-odds transformation to convert the nonlinear scores into linear measures. Measurement is usually defined in statistical terms, applying population-level models to obtain group-level summary scores, means, and percentages. Measurement, however, ought to involve individual-level models and case-specific location estimates. (See one of my earlier blogs for more on this distinction between statistics and measurement.)
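To make the log-odds transformation mentioned above concrete, here is a minimal Python sketch. The function name `score_to_logit` is hypothetical, and the calculation is only the simple log-odds of a raw score on dichotomous items. It is not a full Rasch estimate, which would also take the item calibrations into account.

```python
import math


def score_to_logit(raw_score, n_items):
    """Log-odds (logit) of a raw score on n_items dichotomous items.

    Extreme scores (0 or n_items) have no finite log-odds, so they raise.
    """
    if not 0 < raw_score < n_items:
        raise ValueError("extreme scores have no finite log-odds")
    p = raw_score / n_items
    return math.log(p / (1.0 - p))


# Equal one-point steps in the raw score are unequal steps in logits.
for raw in range(1, 10):
    print(raw, round(score_to_logit(raw, 10), 2))
```

The one-point step from 8 to 9 spans about twice as many logits as the step from 5 to 6, which is the sense in which raw scores are nonlinear.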

Given the appropriate measurement focus on the individual, the instrument is initially calibrated and measures are estimated in a simultaneous conjoint process. Once the instrument is calibrated, the item estimates can be anchored, measures can be routinely produced from them, and new items can be calibrated into the system, and others dropped, over time. This method has been the norm in admissions, certification, licensure, and high stakes testing for decades (Fisher & Wright, 1994; Bezruczko, 2005).

Measurement modelling of individual response processes has to be stochastic, or else we run into the attenuation paradox (Engelhard, 1993, 1994). This is the situation in which a deterministic progression of observations from one end of the instrument to the other produces apparently error-free data strings that look like this (1 being a correct answer, a higher rating, or the presence of an attribute, and 0 being incorrect, a lower rating, or the absence of the attribute):

00000000000

10000000000

11000000000

11100000000

11110000000

11111000000

11111100000

11111110000

11111111000

11111111100

11111111110

11111111111

In this situation, strings with all 0s and all 1s give no information useful for estimating measures (rows) or calibrations (columns). It is as though some of the people are shorter than the first unit on the ruler, and others are taller than the top unit. We don’t really have any way of knowing how short or tall they are, so their rows drop out. But eliminating the top and bottom rows makes the leftmost and rightmost columns all 1s and all 0s, respectively, and eliminating them then gives new rows with all 0s or all 1s, etc., until there’s no data left. (See my Reliability Revisited blog for evaluations of five different probabilistically-structured data sets of this kind simulated to contrast various approaches to assessing reliability and internal consistency.)
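The elimination process described above can be sketched in a few lines of Python. The helper names (`guttman_matrix`, `is_extreme`, `drop_extremes`) and the small `noisy` data set are illustrative assumptions, not anything taken from the cited estimation literature. The sketch only shows that a perfect Guttman pattern leaves nothing to estimate from, while data with overlapping responses survive.

```python
def guttman_matrix(n_items):
    """Perfect Guttman pattern: one row per raw score, 1s always before 0s."""
    return [[1] * k + [0] * (n_items - k) for k in range(n_items + 1)]


def is_extreme(values):
    """True when a row or column is all 0s or all 1s."""
    return sum(values) in (0, len(values))


def drop_extremes(rows):
    """Repeatedly drop extreme rows, then extreme columns, until none remain."""
    while rows and rows[0]:
        kept = [r for r in rows if not is_extreme(r)]
        if not kept:
            return []
        cols = [j for j in range(len(kept[0]))
                if not is_extreme([r[j] for r in kept])]
        reduced = [[r[j] for j in cols] for r in kept]
        if reduced == rows:  # nothing was removed: the data are usable
            return rows
        rows = reduced
    return []


deterministic = guttman_matrix(11)
noisy = [[1, 0, 1, 0],   # hypothetical responses with overlaps between rows
         [0, 1, 1, 0],
         [1, 1, 0, 1],
         [0, 1, 0, 1]]

print(len(drop_extremes(deterministic)))  # no rows survive
print(len(drop_extremes(noisy)))          # all four rows survive
```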

The problem for estimation (Linacre, 1991, 1999, 2000) in data like those shown above is that the lack of informational overlaps between the columns, on the one hand, and between the rows, on the other, gives us no basis for knowing how much more of the variable is represented by any one item relative to any other, or by any one person measured relative to any other. In addition, whenever we actually construct measures of abilities, attitudes, or behaviors that conform with this kind of Guttman (1950) structure (Andrich, 1985; Douglas & Wright, 1989; Engelhard, 2008), the items have to be of such markedly different difficulties or agreeabilities that the results tend to involve large numbers of indistinguishable groups of respondents. But when that information is present in a probabilistically consistent way, we have an example of the phenomenon of stochastic resonance (Fisher, 1992b), so called because of the way noise amplifies weak deterministic signals (Andò & Graziani, 2000; Benzi, Sutera, & Vulpiani, 1981; Bulsara & Gammaitoni, 1996; Dykman & McClintock, 1998; Schimansky-Geier, Freund, Neiman, & Shulgin, 1998).

We need the noise, but we can’t let it overwhelm the system. We have to be able to know how much error there is relative to actual signal. Reliability is traditionally defined (Guilford, 1965, pp. 439-40) as an estimate of this relation of signal and noise:

“The reliability of any set of measurements is logically defined as the proportion of their variance that is true variance…. We think of the total variance of a set of measures as being made up of two sources of variance: true variance and error variance… The true measure is assumed to be the genuine value of whatever is being measured… The error components occur independently and at random.”

Traditional reliability coefficients, like Cronbach’s alpha, are correlational, implementing a statistical model of group-level information. Error is taken to be the unexplained portion of the variance:

“In his description of alpha Cronbach (1951) proved (1) that alpha is the mean of all possible split-half coefficients, (2) that alpha is the value expected when two random samples of items from a pool like those in the given test are correlated, and (3) that alpha is a lower bound to the proportion of test variance attributable to common factors among the items” (Hattie, 1985, pp. 143-4).

But measurement models of individual-level response processes (Rasch, 1960; Andrich, 1988; Wright, 1977; Fisher & Wright, 1994; Bond & Fox, 2007; Wilson, 2005; Bezruczko, 2005) employ individual-level error estimates (Wright, 1977; Wright & Stone, 1979; Wright & Masters, 1982), not correlational group-level variance estimates. The individual measurement errors are statistically equivalent to sampling confidence intervals, as is evident in both Wright’s equations and in plots of errors and confidence intervals (see Figure 4 in Fisher, 2008). That is, error and confidence intervals both decline at the same rate with larger numbers of item responses per person, or larger numbers of person responses per item.
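The way individual errors shrink with more responses can be illustrated with a short Python sketch. It assumes the dichotomous Rasch model and the usual information-based standard error (one over the square root of the summed p(1 - p) terms). The function names are hypothetical, and the perfectly targeted items are chosen only to keep the arithmetic simple.

```python
import math


def rasch_probability(ability, difficulty):
    """Probability of a 1 under the dichotomous Rasch model (in logits)."""
    return 1.0 / (1.0 + math.exp(difficulty - ability))


def person_standard_error(ability, difficulties):
    """Model standard error: 1 / sqrt(sum of p * (1 - p) over the items)."""
    information = sum(
        p * (1.0 - p)
        for p in (rasch_probability(ability, d) for d in difficulties)
    )
    return 1.0 / math.sqrt(information)


# Perfectly targeted items (difficulty == ability) each contribute 0.25.
for n_items in (4, 16, 64):
    print(n_items, person_standard_error(0.0, [0.0] * n_items))
```

Under these assumptions, quadrupling the number of well-targeted items halves the error, the same square-root rate at which a sampling confidence interval narrows.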

This phenomenon has a constructive application in instrument design. If a reasonable expectation for the measurement standard deviation can be formulated and related to the error expected on the basis of the number of items and response categories, a good estimate of the measurement reliability can be read off a nomograph (Linacre, 1993).

Wright (Wright & Masters, 1982, pp. 92, 106; Wright, 1996) introduced several vitally important measurement precision concepts and tools that follow from access to individual person and item error estimates. They improve on the traditional KR-20 or Cronbach reliability coefficients because the individualized error estimates better account for the imprecisions of mistargeted instruments, and for missing data, and so more accurately and conservatively estimate reliability.

Wright and Masters introduce a new reliability statistic, G, the measurement separation reliability index. The availability of individual error estimates makes it possible to estimate the true variance of the measures more directly, by subtracting the mean square error from the total variance. The standard deviation based on this estimate of true variance is then made the numerator of a ratio, G, having the root mean square error as its denominator.
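As a rough sketch of the calculation just described, the Python below computes G from a set of measures and their individual standard errors. The function name and the example values are hypothetical, and the use of the population variance (rather than the sample variance) is an assumption made here for simplicity.

```python
import math
from statistics import pvariance


def separation_index(measures, errors):
    """G = sqrt(total variance - mean square error) / root mean square error."""
    total_variance = pvariance(measures)  # assumption: population variance
    mean_square_error = sum(e * e for e in errors) / len(errors)
    # True variance cannot be negative; floor it at zero.
    true_variance = max(total_variance - mean_square_error, 0.0)
    return math.sqrt(true_variance) / math.sqrt(mean_square_error)


# Hypothetical measures (in logits) and their standard errors.
measures = [-2.0, -1.0, 0.0, 1.0, 2.0]
errors = [0.5, 0.5, 0.5, 0.5, 0.5]
print(round(separation_index(measures, errors), 2))
```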

Each unit increase in this G index then represents another multiple of the error unit in the amount of quantitative variation present in the measures. This multiple is nonlinearly represented in the traditional reliability coefficients expressed in the 0.00 – 1.00 range, such that the same separation index unit difference is found in the 0.00 to 0.50, 0.50 to 0.80, 0.80 to 0.90, 0.90 to 0.94, 0.94 to 0.96, and 0.96 to 0.97 reliability ranges (see Fisher, 1992a, for a table of values; available online: see references).

G can also be estimated as the square root of the reliability divided by one minus the reliability. Conversely, a reliability coefficient roughly equivalent to Cronbach’s alpha is estimated as G squared divided by one plus G squared, which is the true variance divided by the sum of the true variance and the error variance when both are expressed in error units. Because individual error estimates are inflated in the presence of missing data and when an instrument is mistargeted and measures tend toward the extremes, the Rasch-based reliability coefficients tend to be more conservative than Cronbach’s alpha, in which these sources of error are hidden within the variances and correlations. For a comparison of the G separation index, the G reliability coefficient, and Cronbach’s alpha over five simulated data sets, see the Reliability Revisited blog entry.
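The two conversions can be written out as a small Python sketch. The function names are hypothetical, and G is taken to be expressed in error units, so that the error variance equals 1.

```python
import math


def separation_from_reliability(reliability):
    """G = sqrt(R / (1 - R)), defined for 0 <= R < 1."""
    if not 0.0 <= reliability < 1.0:
        raise ValueError("reliability must be in [0, 1)")
    return math.sqrt(reliability / (1.0 - reliability))


def reliability_from_separation(g):
    """R = G^2 / (1 + G^2), with G in error units."""
    return g * g / (1.0 + g * g)


# Each unit step in G maps onto a shrinking step in reliability.
for g in range(7):
    print(g, round(reliability_from_separation(g), 2))
```

The loop gives 0.0, 0.5, 0.8, 0.9, 0.94, 0.96, and 0.97 for G = 0 through 6, which are the boundaries of the reliability ranges listed above.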

Error estimates can be made more conservative yet by multiplying each individual error term by the larger of either 1.0 or the square root of the associated individual mean square fit statistic for that case (Wright, 1995). (The mean square fit statistics are chi-squares divided by their degrees of freedom, and so have an expected value of 1.00; see Smith (2000) for more on fit, and see my recent blog, Reliability Revisited, for more on the conceptualization and evaluation of reliability relative to fit.)
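The adjustment is a one-line calculation. In this Python sketch the function name `inflate_error` is hypothetical, and the inputs are assumed to be a model standard error and a mean square fit statistic for the same case.

```python
import math


def inflate_error(standard_error, mean_square_fit):
    """Multiply the error by the larger of 1.0 and sqrt(mean square fit)."""
    return standard_error * max(1.0, math.sqrt(mean_square_fit))


print(inflate_error(0.5, 0.64))  # overfitting case: error left unchanged
print(inflate_error(0.5, 4.0))   # misfitting case: error doubled
```

Because the multiplier never drops below 1.0, cases that fit better than expected do not have their errors reduced. Only misfit makes the estimate more conservative.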

Wright and Masters (1982, pp. 92, 105-6) also introduce the concept of strata, ranges on the measurement continuum with centers separated by three errors. Strata are in effect a more forgiving expression of the separation reliability index, G, since the latter approximates strata with centers separated by four errors. An estimate of strata defined as having centers separated by four errors is very nearly identical with the separation index. If three errors define a 95% confidence interval, four are equivalent to 99% confidence.
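For reference, the number of strata is commonly computed from the separation index as (4G + 1) / 3. That formula is not stated in the paragraph above and is supplied here as the usual textbook form, so treat this Python sketch, including the function name `strata`, as an assumption to be checked against Wright and Masters.

```python
def strata(g):
    """Number of strata with centers three errors apart: (4G + 1) / 3."""
    return (4.0 * g + 1.0) / 3.0


for g in (1.0, 2.0, 3.0, 5.0):
    print(g, round(strata(g), 2))
```

With this formula the strata count is always somewhat larger than G itself, which is the sense in which strata are the more forgiving expression.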

There is a particular relevance in all of this for practical applications involving the combination or aggregation of physical, chemical, and other previously calibrated measures. This is illustrated in, for instance, the use of chemical indicators in assessing disease severity, environmental pollution, etc. Though any individual measure of the amount of a chemical or compound is valid within the limits of its intended purpose, to arrive at measures delineating disease severity, overall pollution levels, etc., the relevant instruments must be designed, tested, calibrated, and maintained, just as any instruments are (Alvarez, 2005; Cipriani, Fox, Khuder, et al., 2005; Fisher, Bernstein, et al., 2002; Fisher, Priest, Gilder, et al., 2008; Hughes, Perkins, Wright, et al., 2003; Perkins, Wright, & Dorsey, 2005; Wright, 2000).

The same methodology that is applied in this work, involving the rating or assessment of the quality of the outcomes or impacts counted, expressed as percentages, or given in an indicator’s native metric (parts per million, acres, number served, etc.), is needed in the management of all forms of human, social, and natural capital. (Watch this space for a forthcoming blog applying this methodology to the scaling of the UN Millennium Development Goals data.) The practical advantages of working from calibrated instrumentation in these contexts include data quality evaluations, the replacement of nonlinear percentages with linear measures, data volume reduction with no loss of information, and the integration of meaningful and substantive qualities with additive quantities on annotated metrics.

References

Alvarez, P. (2005). Several noncategorical measures define air pollution. In N. Bezruczko (Ed.), Rasch measurement in health sciences (pp. 277-93). Maple Grove, MN: JAM Press.

Andò, B., & Graziani, S. (2000). Stochastic resonance theory and applications. New York: Kluwer Academic Publishers.

Andrich, D. (1985). An elaboration of Guttman scaling with Rasch models for measurement. In N. B. Tuma (Ed.), Sociological methodology 1985 (pp. 33-80). San Francisco, California: Jossey-Bass.

Andrich, D. (1988). Rasch models for measurement (Sage University Paper Series on Quantitative Applications in the Social Sciences, series no. 07-068). Beverly Hills, California: Sage Publications.

Benzi, R., Sutera, A., & Vulpiani, A. (1981). The mechanism of stochastic resonance. Journal of Physics. A. Mathematical and General, 14, L453-L457.

Bezruczko, N. (Ed.). (2005). Rasch measurement in health sciences. Maple Grove, MN: JAM Press.

Bond, T., & Fox, C. (2007). Applying the Rasch model: Fundamental measurement in the human sciences, 2d edition. Mahwah, New Jersey: Lawrence Erlbaum Associates.

Bulsara, A. R., & Gammaitoni, L. (1996, March). Tuning in to noise. Physics Today, 49, 39-45.

Cipriani, D., Fox, C., Khuder, S., & Boudreau, N. (2005). Comparing Rasch analyses probability estimates to sensitivity, specificity and likelihood ratios when examining the utility of medical diagnostic tests. Journal of Applied Measurement, 6(2), 180-201.

Cronbach, L. J. (1951). Coefficient alpha and the internal structure of tests. Psychometrika, 16(3), 297-334.

Douglas, G. A., & Wright, B. D. (1989). Response patterns and their probabilities. Rasch Measurement Transactions, 3(4), 75-77 [http://www.rasch.org/rmt/rmt34.htm].

Dykman, M. I., & McClintock, P. V. E. (1998, January 22). What can stochastic resonance do? Nature, 391(6665), 344.

Engelhard, G., Jr. (1993). What is the attenuation paradox? Rasch Measurement Transactions, 6(4), 257 [http://www.rasch.org/rmt/rmt64.htm].

Engelhard, G., Jr. (1994). Resolving the attenuation paradox. Rasch Measurement Transactions, 8(3), 379.

Engelhard, G. (2008, July). Historical perspectives on invariant measurement: Guttman, Rasch, and Mokken. Measurement: Interdisciplinary Research & Perspectives, 6(3), 155-189.

Fisher, W. P., Jr. (1992a). Reliability statistics. Rasch Measurement Transactions, 6(3), 238 [http://www.rasch.org/rmt/rmt63i.htm].

Fisher, W. P., Jr. (1992b, Spring). Stochastic resonance and Rasch measurement. Rasch Measurement Transactions, 5(4), 186-187 [http://www.rasch.org/rmt/rmt54k.htm].

Fisher, W. P., Jr. (2008, Summer). The cash value of reliability. Rasch Measurement Transactions, 22(1), 1160-3 [http://www.rasch.org/rmt/rmt221.pdf].

Fisher, W. P., Jr., Bernstein, L. H., Qamar, A., Babb, J., Rypka, E. W., & Yasick, D. (2002, February). At the bedside: Measuring patient outcomes. Advance for Administrators of the Laboratory, 11(2), 8, 10 [http://laboratory-manager.advanceweb.com/Article/At-the-Bedside-7.aspx].

Fisher, W. P., Jr., Priest, E., Gilder, R., Blankenship, D., & Burton, E. C. (2008, July 3-6). Development of a novel heart failure measure to identify hospitalized patients at risk for intensive care unit admission. Presented at the World Congress on Controversies in Cardiovascular Diseases [http://www.comtecmed.com/ccare/2008/authors_abstract.aspx#Author15], Intercontinental Hotel, Berlin, Germany.

Fisher, W. P., Jr., & Wright, B. D. (Eds.). (1994). Applications of probabilistic conjoint measurement. International Journal of Educational Research, 21(6), 557-664.

Guilford, J. P. (1965). Fundamental statistics in psychology and education. 4th Edn. New York: McGraw-Hill.

Guttman, L. (1950). The basis for scalogram analysis. In S. A. Stouffer et al. (Eds.), Studies in social psychology in World War II. Volume 4: Measurement and prediction (pp. 60-90). New York: Wiley.

Hattie, J. (1985, June). Methodology review: Assessing unidimensionality of tests and items. Applied Psychological Measurement, 9(2), 139-64.

Hughes, L., Perkins, K., Wright, B. D., & Westrick, H. (2003). Using a Rasch scale to characterize the clinical features of patients with a clinical diagnosis of uncertain, probable or possible Alzheimer disease at intake. Journal of Alzheimer’s Disease, 5(5), 367-373.

Linacre, J. M. (1991, Spring). Stochastic Guttman order. Rasch Measurement Transactions, 5(4), 189 [http://www.rasch.org/rmt/rmt54p.htm].

Linacre, J. M. (1993). Rasch-based generalizability theory. Rasch Measurement Transactions, 7(1), 283-284 [http://www.rasch.org/rmt/rmt71h.htm].

Linacre, J. M. (1999). Understanding Rasch measurement: Estimation methods for Rasch measures. Journal of Outcome Measurement, 3(4), 382-405.

Linacre, J. M. (2000, Autumn). Guttman coefficients and Rasch data. Rasch Measurement Transactions, 14(2), 746-7 [http://www.rasch.org/rmt/rmt142e.htm].

Perkins, K., Wright, B. D., & Dorsey, J. K. (2005). Using Rasch measurement with medical data. In N. Bezruczko (Ed.), Rasch measurement in health sciences (pp. 221-34). Maple Grove, MN: JAM Press.

Rasch, G. (1960). Probabilistic models for some intelligence and attainment tests (Reprint, with Foreword and Afterword by B. D. Wright, Chicago: University of Chicago Press, 1980). Copenhagen, Denmark: Danmarks Paedagogiske Institut.

Schimansky-Geier, L., Freund, J. A., Neiman, A. B., & Shulgin, B. (1998). Noise induced order: Stochastic resonance. International Journal of Bifurcation and Chaos, 8(5), 869-79.

Smith, R. M. (2000). Fit analysis in latent trait measurement models. Journal of Applied Measurement, 1(2), 199-218.

Wilson, M. (2005). Constructing measures: An item response modeling approach. Mahwah, New Jersey: Lawrence Erlbaum Associates.

Wright, B. D. (1977). Solving measurement problems with the Rasch model. Journal of Educational Measurement, 14(2), 97-116 [http://www.rasch.org/memo42.htm].

Wright, B. D. (1995, Summer). Which standard error? Rasch Measurement Transactions, 9(2), 436-437 [http://www.rasch.org/rmt/rmt92n.htm].

Wright, B. D. (1996, Winter). Reliability and separation. Rasch Measurement Transactions, 9(4), 472 [http://www.rasch.org/rmt/rmt94n.htm].

Wright, B. D. (2000). Rasch regression: My recipe. Rasch Measurement Transactions, 14(3), 758-9 [http://www.rasch.org/rmt/rmt143u.htm].

Wright, B. D., & Masters, G. N. (1982). Rating scale analysis: Rasch measurement. Chicago, Illinois: MESA Press.

Wright, B. D., & Stone, M. H. (1979). Best test design: Rasch measurement. Chicago, Illinois: MESA Press.