Archive for the ‘health care’ Category

How Measurement, Contractual Trust, and Care Combine to Grow Social Capital: Creating Social Bonds We Can Really Trade On

October 14, 2009

Last Saturday, I went to Miami, Florida, at the invitation of Paula Lalinde (see her profile at http://www.linkedin.com/pub/paula-lalinde/11/677/a12) to attend MILITARY 101: Military Life and Combat Trauma As Seen By Troops, Their Families, and Clinicians. This day-long free presentation was sponsored by The Veterans Project of South Florida-SOFAR, in association with The Southeast Florida Association for Psychoanalytic Psychology, The Florida Psychoanalytic Society, the Soldiers & Veterans Initiative, and the Florida BRAIVE Fund. The goals of the session “included increased understanding of the unique experiences and culture related to the military experience during wartime, enhanced understanding of the assessment and treatment of trauma specific difficulties, including posttraumatic stress disorder, common co-occurring conditions, and demands of treatment on trauma clinicians.”

Listening to the speakers on Saturday morning at the Military 101 orientation, I was struck by what seemed to me a developmental trajectory implied in the construct of therapy-aided healing. I don't recall whether anyone explicitly mentioned Maslow's hierarchy, but it was certainly implied by the dysfunction that attends being pushed down to a basic mode of physical survival.

Also, the various references to the stigma of therapy reminded me of Paula’s arguments as to why a community-based preventative approach would be more accessible and likely more successful than individual programs focused on treating problems. (Echoes here of positive psychology and appreciative inquiry.)

In one part of the program, the ritualized formality of the soldier, family, and support groups' stated promises to each other suggested a way of operationalizing the community-based approach. The expectations structuring relationships among the parties in such a community are usually left unstated, unexamined, and unmanaged in all but the broadest and most haphazard ways (as most relationships' expectations are). The hierarchy of needs and the progressive movement toward greater self-actualization imply a developmental sequence of steps or stages, and that sequence would form the actual focus of the implied contracts between the members of the community. It is a measurable continuum along which change can be monitored and managed, with all parties accountable for their contracted roles in producing specific outcomes.

The process would begin from the predeployment baseline, taking that level of reliability and basis of trust existing in the community as what we want to maintain, what we might want to get back to, and what we definitely want to build on and surpass, in time. The contract would provide a black-and-white record of expectations. It would embody an image of the desired state of the relationships and it could be returned to repeatedly in communications and in visualizations over time. I’ll come back to this after describing the structure of the relational patterns we can expect to observe over the course of events.

The Saturday morning discussion made repeated reference to the role of chains in the combat experience: the chain of command, and the unit as a chain only as strong as its weakest link. The implication was that normal community life tolerates looser expectations and more informal associations, and involves more in the way of team interactions. The contrast between chains and teams brought to mind work by Wright (1995, 1996a, 1996b; Bainer, 1997) on the way the difficulty of the challenges we face influences how we organize ourselves into groups.

Chains tend to form when the challenge is very difficult and dangerous; here we have mountain climbers roped together, bucket brigades putting out fires, and people stretching out end-to-end over thin ice to rescue someone who’s fallen through. In combat, as was stressed repeatedly last Saturday, the chain is one requiring strict follow-through on orders and promises; lives are at risk and protecting them requires the most rigorous adherence to the most specific details in an operation.

Teams form when the challenge is less difficult and it is possible to coordinate a fluid response among partners whose roles shift in importance as the situation changes. Balls are passed and the lead is taken by each in turn, with others getting out of the way or providing support that might be vitally important or merely convenient.

A third kind of group, the pack, forms when the very nature of the problem is at issue; here, individuals take completely different approaches in an exploratory effort to determine what is at issue and how it might be addressed. The Manhattan Project is one example: scientists following personal hunches went in their own directions looking for solutions to complex problems. Wolves and other hunting parties form packs when it is impossible to know where the game might be. And though the old joke says the best place to look for lost keys is where there's the most light, if you have others helping you, it's best to split up rather than all search in the same place.

After identifying these three major forms of organization, Wright (1996b) saw that individual groups might transition to and from different modes of organization as the nature of the problem changed. For instance, a 19th-century wagon train of settlers heading through the American West might function well as a team when everyone feels safe traveling along with a cavalry detachment, the road is good, the weather is pleasant, and food and water are plentiful. Given vulnerability to attacks by Native Americans, storms, accidents, lack of game, or muddy, rutted roads, however, the team might shift toward a chain formation and circle the wagons, returning to the team formation after the danger has passed. In the worst-case scenario, disaster breaks the chain into individuals scattered like a pack to fend for themselves, with only a limited hope of reuniting at some later time as a chain or team.

In the current context of the military, it would seem that deployment fragments the team, with the soldier training for a position in the chain of command in which she or he will function as a strong link for the unit. The family and support network can continue to function together and separately as teams to some extent, but the stress may require intermittent chain forms of organization. Further, the separation of the soldier from the family and support would seem to approach a pack level of organization for the three groups taken as a whole.

An initial contract between the parties would describe the functioning of the team at the predeployment stage, recognize the imminent breaking up of the team into chains and packs, and visualize the day when the team would be reformed under conditions in which significant degrees of healing will be required to move out of the pack and chain formations. Perhaps there will be some need and means of countering the forcible boot camp enculturation with medicinal antidote therapies of equal but opposite force. Perhaps some elements of the boot camp experience could be safely modified without compromising the operational chain to set the stage for reintegrating the family and community team.

We would want to be able to draw qualitative information from all three groups as to the nature of their experiences at every stage. I think we would want to focus that information on descriptions of the extent to which each level in Maslow's hierarchy is realized. This information would be used in the design of an assessment that would map out the changes over time, set up the evaluation framework, and guide interventions aimed at re-forming the team. Given their experience with the healing process, the presenters from last Saturday are obviously well positioned to offer an informed perspective on what's needed here, and what we build with their input would in turn feed back into the kind of presentation they gave.

There will likely be signature events in the process that will be used to trigger new additions to the contract, as when the consequences of deployment, trauma, loss, or return relative to Maslow’s hierarchy can be predicted. That is, the contract will be a living document that changes as goals are reached or as new challenges emerge.

All of this is, of course, situated within the context of measures calibrated and shared across the community to inform contracts, treatment, expectations, etc., following the general metrological principles I outline in my published work (see references).

The idea is for the legally binding contractual relationships to consistently produce predictable amounts of impact, such that the benefits realized in individual functionality attract investments from those in a position to employ those individuals and from the wider society that wants to improve its overall level of mental health. One could imagine counselors, social workers, and psychotherapists selling social capital bonds at prices set by market forces on the basis of information analogous to what is currently available in financial markets, grocery stores, or auto sales lots. Instead of paying taxes, corporations would be required to maintain minimum levels of social capitalization. These levels might be set relative to the value the organization realizes from public schools, hospitals, and governments in the form of an educated, motivated, healthy workforce able to get to work on public roads, to drink public water, and to live in a publicly maintained, quality environment.

There will be a lot more to say on this latter piece, following up on previous blogs here that take up the topic. The contractual groundwork that sets up the binding obligations for formal agreements is the thought of the day that emerged last weekend at the session in Miami. Good stuff, long way to go, as always….

References
Bainer, D. (1997, Winter). A comparison of four models of group efforts and their implications for establishing educational partnerships. Journal of Research in Rural Education, 13(3), 143-152.

Fisher, W. P., Jr. (1995). Opportunism, a first step to inevitability? Rasch Measurement Transactions, 9(2), 426 [http://www.rasch.org/rmt/rmt92.htm].

Fisher, W. P., Jr. (1996, Winter). The Rasch alternative. Rasch Measurement Transactions, 9(4), 466-467 [http://www.rasch.org/rmt/rmt94.htm].

Fisher, W. P., Jr. (1997a). Physical disability construct convergence across instruments: Towards a universal metric. Journal of Outcome Measurement, 1(2), 87-113.

Fisher, W. P., Jr. (1997b, June). What scale-free measurement means to health outcomes research. Physical Medicine & Rehabilitation State of the Art Reviews, 11(2), 357-373.

Fisher, W. P., Jr. (1998). A research program for accountable and patient-centered health status measures. Journal of Outcome Measurement, 2(3), 222-239.

Fisher, W. P., Jr. (2000). Objectivity in psychosocial measurement: What, why, how. Journal of Outcome Measurement, 4(2), 527-563 [http://www.livingcapitalmetrics.com/images/WP_Fisher_Jr_2000.pdf].

Fisher, W. P., Jr. (2004, October). Meaning and method in the social sciences. Human Studies: A Journal for Philosophy and the Social Sciences, 27(4), 429-454.

Fisher, W. P., Jr. (2005). Daredevil barnstorming to the tipping point: New aspirations for the human sciences. Journal of Applied Measurement, 6(3), 173-179 [http://www.livingcapitalmetrics.com/images/FisherJAM05.pdf].

Fisher, W. P., Jr. (2008). Vanishing tricks and intellectualist condescension: Measurement, metrology, and the advancement of science. Rasch Measurement Transactions, 21(3), 1118-1121 [http://www.rasch.org/rmt/rmt213c.htm].

Fisher, W. P., Jr. (2009, November). Invariance and traceability for measures of human, social, and natural capital: Theory and application. Measurement (Elsevier), 42(9), 1278-1287.

Wright, B. D. (1995). Teams, packs, and chains. Rasch Measurement Transactions, 9(2), 432 [http://www.rasch.org/rmt/rmt92j.htm].

Wright, B. D. (1996a). Composition analysis: Teams, packs, chains. In G. Engelhard & M. Wilson (Eds.), Objective measurement: Theory into practice, Vol. 3 (pp. 241-264). Norwood, New Jersey: Ablex [http://www.rasch.org/memo67.htm].

Wright, B. D. (1996b). Pack to chain to team. Rasch Measurement Transactions, 10(2), 501 [http://www.rasch.org/rmt/rmt102s.htm].

Creative Commons License
LivingCapitalMetrics Blog by William P. Fisher, Jr., Ph.D. is licensed under a Creative Commons Attribution-Noncommercial-No Derivative Works 3.0 United States License.
Based on a work at livingcapitalmetrics.wordpress.com.
Permissions beyond the scope of this license may be available at http://www.livingcapitalmetrics.com.

Comments on the National Accounts of Well-Being

October 4, 2009

Well-designed measures of human, social, and natural capital captured in genuine progress indicators and properly put to work on the front lines of education, health care, social services, human and environmental resource management, etc. will harness the profit motive as a driver of growth in human potential, community trust, and environmental quality. But it is a tragic shame that so many well-meaning efforts ignore the decisive advantages of readily available measurement methods. For instance, consider the National Accounts of Well-Being (available at http://www.nationalaccountsofwellbeing.org/learn/download-report.html).

This report’s authors admirably say that “Advances in the measurement of well-being mean that now we can reclaim the true purpose of national accounts as initially conceived and shift towards more meaningful measures of progress and policy effectiveness which capture the real wealth of people’s lived experience” (p. 2).

Of course, as is evident in so many of my posts here and in the focus of my scientific publications, I couldn’t agree more!

But look at p. 61, where the authors say “we acknowledge that we need to be careful about interpreting the distribution of transformed scores. The curvilinear transformation results in scores at one end of the distribution being stretched more than those at the other end. This means that standard deviations, for example, of countries with higher scores, are likely to be distorted upwards. As the results section shows, however, this pattern was not in fact found in our data, so it appears that this distortion does not have too much effect. Furthermore, being overly concerned with the distortion would imply absolute faith that the original scales used in the questions are linear. Such faith would be ill-founded. For example, it is not necessarily the case that the difference between ‘all or almost all of the time’ (a response scored as ‘4’ for some questions) and ‘most of the time’ (scored as ‘3’), is the same as the difference between ‘most of the time’ (‘3’) and ‘some of the time’ (‘2’).”

It is incredible that the authors so baldly admit their numbers don't add up at the very same time that they offer those numbers, in voluminous masses, to a global audience that largely takes them at face value. What exactly does it mean to most people "to be careful about interpreting the distribution of transformed scores"?

More to the point, what does it mean that faith in the linearity of the scales is ill-founded? They are doing arithmetic with those scores, and that arithmetic necessarily assumes a constant difference between each number on the scale! Instead of offering cautions, the creators of anything as visible and important as the National Accounts of Well-Being ought to do the work needed to construct scales that measure in numbers that add up. Instead of saying they don't know what the size of the unit of measurement is at different places on the ruler, why don't they formulate a theory of the thing they want to measure, state testable hypotheses as to the constancy and invariance of the measuring unit, and conduct the experiments? It is not, after all, as though we lack a mature measurement science that has been doing exactly this kind of thing for more than 80 years.

By its very nature, the act of adding up ratings into a sum, and dividing by the number of ratings included in that sum to produce an average, demands the assumption of a common unit of measurement. But practical science does not function or advance on the basis of untested assumptions. Different numbers that add up to the same sum have to mean the same thing: 1+3+4=8=2+3+3, etc. So the capacity of the measurement system to support meaningful inferences as to the invariance of the unit has to be established experimentally.

There is no way to do arithmetic and compute statistics on ordinal rating data without assuming a constant, additive unit of measurement. Either unrealistic demands are being made on people’s cognitive abilities to stretch and shrink numeric units, or the value of the numbers as a basis for action is seriously and unnecessarily compromised.
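To make the point concrete, here is a tiny sketch in Python; the category spacings are invented purely for illustration, not taken from the report:

```python
# Hypothetical example: if the ordinal codes 1-4 do not mark off equal amounts of the
# attribute, equal sums can stand for unequal quantities.

true_amount = {1: 0.0, 2: 0.7, 3: 2.1, 4: 2.6}   # invented category locations, in linear units

pattern_a = [1, 3, 4]   # raw sum = 8
pattern_b = [2, 3, 3]   # raw sum = 8

print(sum(pattern_a), sum(pattern_b))                      # 8 8
print(round(sum(true_amount[r] for r in pattern_a), 2),
      round(sum(true_amount[r] for r in pattern_b), 2))    # 4.7 4.9
# The raw sums declare the two respondents identical; the underlying amounts need not agree.
```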

A lot can be done to construct linear units of measurement that provide the meaningfulness desired by the developers of the National Accounts of Well-Being.

For explanations and illustrations of why scores and percentages are not measures, see https://livingcapitalmetrics.wordpress.com/2009/07/01/graphic-illustrations-of-why-scores-ratings-and-percentages-are-not-measures-part-one/.

The numerous advantages real measures have over raw ratings are listed at https://livingcapitalmetrics.wordpress.com/2009/07/06/table-comparing-scores-ratings-and-percentages-with-rasch-measures/.

To understand the contrast between dead and living capital as it applies to measures based in ordinal data from tests and rating scales, see http://www.rasch.org/rmt/rmt154j.htm.

For a peer-reviewed scientific paper on the theory and research supporting the viability of a metric system for human, social, and natural capital, see http://dx.doi.org/doi:10.1016/j.measurement.2009.03.014.


Three demands of meaningful measurement

September 28, 2009

The core issue in measurement is meaningfulness. There are three major aspects of meaningfulness to take into account in measurement. These have to do with the constancy of the unit, interpreting the size of differences in measures, and evaluating the coherence of the units and differences.

First, raw scores (counts of right answers or other events, sums of ratings, or rankings) do not stand for anything that adds up the way they do (see previous blogs for more on this). Any given raw score unit can be four to five times larger than another, depending on where it falls in the range. Meaningful measurement demands a constant unit. Instrument scaling methods provide it.
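As a minimal sketch of how much the raw-score unit can stretch, the simple log-odds transformation of a proportion correct stands in here for a full instrument calibration:

```python
import math

def logit(p):
    """Log-odds: the linear metric underlying a proportion or percent correct."""
    return math.log(p / (1.0 - p))

# The same ten-percentage-point raw gain, taken near the middle and near the top of the range:
middle_step  = logit(0.60) - logit(0.50)   # about 0.41 logits
extreme_step = logit(0.98) - logit(0.88)   # about 1.90 logits

print(round(middle_step, 2), round(extreme_step, 2), round(extreme_step / middle_step, 1))
# 0.41 1.9 4.7: the "same" raw gain is more than four times larger at the extreme.
```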

Second, meaningful measurement requires that we be able to say just what any quantitative amount of difference is supposed to represent. What does a difference between two measures stand for in the way of what is and isn’t done at those two levels? Is the difference within the range of error, and so random? Is the difference many times more than the error, and so repeatedly reproducible and constant? Meaningful measurement demands that we be able to make reliable distinctions.

Third, meaningful measurement demands that the items work together to measure the same thing. If reliable distinctions can be made between measures, what is the one thing that all of the items tap into? If the data exhibit a consistency that is shared across items and across persons, what is the nature of that consistency? Meaningful measurement posits a model of what data must look like to be interpretable and coherent, and then it evaluates data in light of that model.

When a constant unit is in hand, when the limits of randomness relative to stable differences are known, and when individual responses are consistent with one another, then, and only then, is measurement meaningful. Inconstant units, unknown amounts of random variation, and inconsistent data can never amount to the science we need for understanding and managing skills, abilities, health, motivations, social bonds, and environmental quality.
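A small, hypothetical example of the second demand, checking whether an observed difference exceeds its error:

```python
import math

# Made-up numbers: is the observed difference between two measures large
# relative to the error in those measures?

measure_a, se_a = 1.20, 0.25   # hypothetical measure and standard error, in logits
measure_b, se_b = 0.65, 0.30

difference = measure_a - measure_b
se_difference = math.sqrt(se_a**2 + se_b**2)   # error of a difference of independent measures

print(round(difference, 2), round(se_difference, 2), round(difference / se_difference, 2))
# 0.55 0.39 1.41: about 1.4 error units, so not yet a reliably reproducible distinction.
```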

Managing our investments in human, social, and natural capital for positive returns demands that meaningful measurement be universalized in uniformly calibrated and accessible metrics. Scientifically rigorous, practical, and convenient methods for setting reference standards and making instruments traceable to them are readily available.

We have the means in hand for effecting order-of-magnitude improvements in the meaningfulness of the measures used in education, health care, human and environmental resource management, etc. It’s time we got to work on it.

We are what we measure. It’s time we measured what we want to be.


Response to John Carney’s comments on Sept 25th’s Marketplace with Kai Ryssdal on NPR

September 25, 2009

Way to go, John Carney!! You’re my new hero!

Finally someone has said it right out loud: Even after a year in the economic trough, we are far from a consensus on what went wrong. Everyone is still fighting over the core problem, and so it is impossible to formulate a way forward.

Yes, it has been incredibly frustrating to watch everyone go round and round the central issue without ever really seeing it. Talk about the blind men and the elephant!

The first part of what needs to be done is on the table.  There does seem to be a fair degree of consensus on the idea that, somehow or other, we need to bring ALL forms of capital–human, social, and natural–into the econometric models, the financial spreadsheets, and the Genuine Progress Indicators or Happiness Indexes, so that the profit motive can be harnessed in sustainable, socially responsible ways to build communities, the environment, and human potential. As some have said, we need to transform socialized externalities into capitalized internalities.

I am hardly the first to suggest that. But what has been missing from previous proposals is the means by which we would devise transparent, fungible representations of each significant form of capital. How do we create common currencies for trading in each form of capital? By calibrating the instruments and deploying the reference standard common metrics we will use as those currencies. MEASUREMENT is the core problem. "You manage what you measure" is repeated like a mantra everywhere, but the quality of our measures of risk, opportunity, outcomes, and impacts (in short, of all measures of intangible forms of capital: human, social, intellectual, etc.) is terrible!

You would never know it from most current measurement practice in business and government, but huge advances have been made over the last 50 years in scaling technology. The mathematical rigor, meaningfulness, practicality, and convenience of measures based in ratings, ability tests, surveys, and assessments could be an order of magnitude better if we just took advantage of existing technologies. We need universally uniform measures of health, skills and abilities, motivation, community life, governance, risk, and environmental quality akin to the measures of weight, volume, time, kilowatts, etc. that we take for granted in economic exchanges every day.

It is essential to realize that universal uniformity in no way requires universal acceptance of exactly the same instruments, observations, content, questions, items, etc. Any brand of instrument that verifiably produces measures of the desired outcome, impact, process, etc. in the reference standard metric can compete in the measurement market, just as is the case with clocks or thermometers. Further, far from reducing rich complexities to a meaningless number, new measurement technologies open up new horizons for meaningful relationships by improving our understandings of ourselves and others.

The way forward centers on building a scientifically rigorous metric system that will make our stocks of human, social, and natural capital fungible. We need universally uniform metrics for the outcomes and impacts of schools, hospitals, social services, human, organizational, and natural resources, etc. When you appreciate the extent to which any economy lives and dies on its measurement standards, the need for national and global investments in quantitatively rigorous and qualitatively meaningful metrics becomes all too painfully obvious.

For more information, see https://livingcapitalmetrics.wordpress.com, http://www.livingcapitalmetrics.com, http://www.linkedin.com/in/livingcapitalmetrics, and other sources, such as the Wikipedia entry on Georg Rasch, http://www.rasch.org, and elsewhere.

These comments were also posted at http://marketplace.publicradio.org/display/web/2009/09/25/pm-weekly-wrap-q/.


Posted today at HealthReform.gov

July 26, 2009

Any bill serious about health care reform needs to demand that the industry take advantage of readily available and dramatically improved measurement methods. We manage what we measure, and 99% of existing outcome measures are measures in name only. A kind of metric system for outcomes could provide standard product definitions, could effect huge reductions in information transaction costs, and could bring about a whole new magnitude of market efficiencies. Far from being a drag on the system, the profit motive is the best source of energy we have for driving innovation and resetting the cost-quality equation. But the disastrously low quality of our measures corrupts the data and prevents informed decision making by consumers and quality improvement experts. Any health care reform effort that does not demand improved measurement is doomed to fall far short of the potential that is within our reach. For more information, see www.Rasch.org, www.livingcapitalmetrics.com, http://dx.doi.org/10.1016/j.measurement.2009.03.014, or http://home.att.net/~rsmith.arm/RMHS_flyer.pdf.


Contesting the Claim, Part I: Are Rasch Measures Really as Objective as Physical Measures?

July 21, 2009

Psychometricians, statisticians, metrologists, and measurement theoreticians tend to be pretty unassuming kinds of people. They’re unobtrusive and retiring, by and large. But there is one thing some of them are prone to say that will raise the ire of others in a flash, and the poor innocent geek will suddenly be subjected to previously unknown forms and degrees of social exclusion.

What is that one thing? “Instruments calibrated by fitting data to a Rasch model measure with the same kind of objectivity as is obtained with physical measures.” That’s one version. Another could be along these lines: “When data fit a Rasch model, we’ve discovered a pattern in human attitudes or behaviors so regular that it is conceptually equivalent to a law of nature.”

Maybe it is the implication of objectivity as something that must be politically incorrect that causes the looks of horror and recoiling retreats in the nonmetrically inclined when they hear things like this. Maybe it is the ingrained cultural predisposition to thinking such claims outrageously preposterous that makes those unfamiliar with 80 years of developments and applications so dismissive. Maybe it’s just fear of the unknown, or a desire not to have to be responsible for knowing something important that hardly anyone else knows.

Of course, it could just be a simple misunderstanding. When people hear the word “objective” do most of them have an image of an object in mind? Does objectivity connote physical concreteness to most people? That doesn’t hold up well for me, since we can be objective about events and things people do without any confusions involving being able to touch and feel what’s at issue.

No, I think something else is going on. I think it has to do with the persistent idea that objectivity requires a disconnected, alienated point of view, one that ignores the mutual implication of subject and object in favor of analytically tractable formulations of problems that, though solvable, are irrelevant to anything important or real. But that is hardly the only available meaning of objectivity, and it isn’t anywhere near the best. It certainly is not what is meant in the world of measurement theory and practice.

It’s better to think of objectivity as something having to do with things like the object of a conversation, or an object of linguistic reference: “chair” as referring to the entire class of all forms of seating technology, for instance. In these cases, we know right away that we’re dealing with what might be considered a heuristic ideal, an abstraction. It also helps to think of objectivity in terms of fairness and justice. After all, don’t we want our educational, health care, and social services systems to respect the equality of all individuals and their rights?

That is not, of course, how measurement theoreticians in psychology have always thought about objectivity. In fact, it was only 70-80 years ago that most psychologists gave up on objective measurement because they couldn’t find enough evidence of concrete phenomena to support the claims to objectivity they wanted to make (Michell, 1999). The focus on the reflex arc led a lot of psychologists into psychophysics, and the effects of operant conditioning led others to behaviorism. But a lot of the problems studied in these fields, though solvable, turned out to be uninteresting and unrelated to the larger issues of life demanding attention.

And so, with no physical entity that could be laid end-to-end and concatenated in the way weights are in a balance scale, psychologists just redefined measurement to suit what they perceived to be the inherent limits of their subject matter. Measurement didn’t have to be just ratio or interval, it could also be ordinal and even nominal. The important thing was to get numbers that could be statistically manipulated. That would provide more than enough credibility, or obfuscation, to create the appearance of legitimate science.

But while mainstream psychology was focused on hunting for statistically significant p-values, there were others trying to figure out if attitudes, abilities, and behaviors could be measured in a rigorously meaningful way.

Louis Thurstone, a former electrical engineer turned psychologist, was among the first to formulate the problem. Writing in 1928, Thurstone rightly made the instrument the focus of attention:

"The scale must transcend the group measured.–One crucial experimental test must be applied to our method of measuring attitudes before it can be accepted as valid. A measuring instrument must not be seriously affected in its measuring function by the object of measurement. To the extent that its measuring function is so affected, the validity of the instrument is impaired or limited. If a yardstick measured differently because of the fact that it was a rug, a picture, or a piece of paper that was being measured, then to that extent the trustworthiness of that yardstick as a measuring device would be impaired. Within the range of objects for which the measuring instrument is intended, its function must be independent of the object of measurement" (Thurstone, 1959, p. 228).

Thurstone aptly captures what is meant when it is said that attitudes, abilities, or behaviors can be measured with the same kind of objectivity as is obtained in the natural sciences. Objectivity is realized when a test, survey, or assessment functions the same way no matter who is being measured, and, conversely (Thurstone took this up, too), an attitude, ability, or behavior exhibits the same amount of what is measured no matter which instrument is used.

This claim, too, may seem to some to be so outrageously improbable as to be worthy of rejecting out of hand. After all, hasn’t everyone learned how the fact of being measured changes the measure? Thing is, this is just as true in physics and ecology as it is in psychiatry or sociology, and the natural sciences haven’t abandoned their claims to objectivity. So what’s up?

What’s up is that all sciences now have participant observers. The old Cartesian duality of the subject-object split still resides in various rhetorical choices and affects our choices and behaviors, but, in actual practice, scientific methods have always had to deal with the way questions imply particular answers.

And there’s more. Qualitative methods have grown out of some of the deep philosophical introspections of the twentieth century, such as phenomenology, hermeneutics, deconstruction, postmodernism, etc. But most researchers who are adopting qualitative methods over quantitative ones don’t know that the philosophers legitimating the new focuses on narrative, interpretation, and the construction of meaning did quite a lot of very good thinking about mathematics and quantitative reasoning. Much of my own published work engages with these philosophers to find new ways of thinking about measurement (Fisher, 2004, for instance). And there are some very interesting connections to be made that show quantification does not necessarily have to involve a positivist, subject-object split.

So where does that leave us? Well, with probability. Not in the sense of statistical hypothesis testing, but in the sense of calibrating instruments with known probabilistic characteristics. If the social sciences are ever to be scientific, null hypothesis significance tests are going to have to be replaced with universally uniform metrics embodying and deploying the regularities of natural laws, as is the case in the physical sciences. Various arguments on this issue have been offered for decades (Cohen, 1994; Meehl, 1967, 1978; Goodman, 1999; Guttman, 1985; Rozeboom, 1960). The point is not to proscribe allowable statistics based on scale type  (Velleman & Wilkinson, 1993). Rather, we need to shift and simplify the focus of inference from the statistical analysis of data to the calibration and distribution of instruments that support distributed cognition, unify networks, lubricate markets, and coordinate collective thinking and acting (Fisher, 2000, 2009). Persuasion will likely matter far less in resolving the matter than an ability to create new value, efficiencies, and profits.

In 1964, Luce and Tukey gave us another way of stating what Thurstone was getting at:

“The axioms of conjoint measurement apply naturally to problems of classical physics and permit the measurement of conventional physical quantities on ratio scales…. In the various fields, including the behavioral and biological sciences, where factors producing orderable effects and responses deserve both more useful and more fundamental measurement, the moral seems clear: when no natural concatenation operation exists, one should try to discover a way to measure factors and responses such that the ‘effects’ of different factors are additive.”

In other words, if we cannot find some physical thing that we can make add up the way numbers do, as we did with length, weight, volts, temperature, time, etc., then we ought to ask questions in a way that allows the answers to reveal the kind of patterns we expect to see when things do concatenate. What Thurstone and others working in his wake have done is to see that we could possibly do some things virtually in terms of abstract relations that we cannot do actually in terms of concrete relations.

The concept is no more difficult to comprehend than understanding the difference between playing solitaire with actual cards and writing a computer program to play solitaire with virtual cards. Either way, the same relationships hold.

A Danish mathematician, Georg Rasch, understood this. Working in the 1950s with data from psychological and reading tests, Rasch drew on his training in the natural sciences and mathematics to arrive at a conception of measurement that applies equally well in the natural and human sciences. He realized that

“…the acceleration of a body cannot be determined; the observation of it is admittedly liable to … ‘errors of measurement’, but … this admittance is paramount to defining the acceleration per se as a parameter in a probability distribution — e.g., the mean value of a Gaussian distribution — and it is such parameters, not the observed estimates, which are assumed to follow the multiplicative law [acceleration = force / mass, or mass * acceleration = force].

“Thus, in any case an actual observation can be taken as nothing more than an accidental response, as it were, of an object — a person, a solid body, etc. — to a stimulus — a test, an item, a push, etc. — taking place in accordance with a potential distribution of responses — the qualification ‘potential’ referring to experimental situations which cannot possibly be [exactly] reproduced.

“In the cases considered [earlier in the book] this distribution depended on one relevant parameter only, which could be chosen such as to follow the multiplicative law.

“Where this law can be applied it provides a principle of measurement on a ratio scale of both stimulus parameters and object parameters, the conceptual status of which is comparable to that of measuring mass and force. Thus, … the reading accuracy of a child … can be measured with the same kind of objectivity as we may tell its weight …” (Rasch, 1960, p. 115).

Rasch’s model not only sets the parameters for data sufficient to the task of measurement, it lays out the relationships that must be found in data for objective results to be possible. Rasch studied with Ronald Fisher in London in 1935, expanded his understanding of statistical sufficiency with him, and then applied it in his measurement work, but not in the way that most statisticians understand it. Yes, in the context of group-level statistics, sufficiency concerns the reproducibility of a normal distribution when all that is known are the mean and the standard deviation. But sufficiency is something quite different in the context of individual-level measurement. Here, counts of correct answers or sums of ratings serve as sufficient statistics  for any statistical model’s parameters when they contain all of the information needed to establish that the parameters are independent of one another, and are not interacting in ways that keep them tied together. So despite his respect for Ronald Fisher and the concept of sufficiency, Rasch’s work with models and methods that worked equally well with many different kinds of distributions led him to jokingly suggest (Andersen, 1995, p. 385) that all textbooks mentioning the normal distribution should be burned!

In plain English, all that we’re talking about here is what Thurstone said: the ruler has to work the same way no matter what or who it is measuring, and we have to get the same results for what or who we are measuring no matter which ruler we use. When parameters are not separable, when they stick together because some measures change depending on which questions are asked or because some calibrations change depending on who answers them, we have encountered a “failure of invariance” that tells us something is wrong. If we are to persist in our efforts to determine if something objective exists and can be measured, we need to investigate these interactions and explain them. Maybe there was a data entry error. Maybe a form was misprinted. Maybe a question was poorly phrased. Maybe we have questions that address different constructs all mixed together. Maybe math word problems work like reading test items for students who can’t read the language they’re written in.  Standard statistical modeling ignores these potential violations of construct validity in favor of adding more parameters to the model.
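One way to look for such failures of invariance is to calibrate the same items on two very different samples and compare the results. The sketch below, with invented item difficulties and simulated respondents, uses a Choppin-style pairwise method to illustrate the idea; it is a sketch of the logic, not a reproduction of any particular software package:

```python
import numpy as np

rng = np.random.default_rng(1)
difficulties = np.array([-1.0, -0.3, 0.2, 1.1])   # hypothetical "true" item calibrations, in logits

def simulate(abilities, difficulties):
    """Generate dichotomous Rasch responses for a sample of abilities."""
    p = 1.0 / (1.0 + np.exp(-(abilities[:, None] - difficulties[None, :])))
    return (rng.random(p.shape) < p).astype(int)

def pairwise_calibration(data):
    """Choppin-style pairwise estimates: d_i - d_j is estimated by log(n_ji / n_ij),
    where n_ij counts persons who passed item i but failed item j."""
    k = data.shape[1]
    est = np.zeros(k)
    for i in range(k):
        total = 0.0
        for j in range(k):
            if i == j:
                continue
            n_ij = np.sum((data[:, i] == 1) & (data[:, j] == 0))
            n_ji = np.sum((data[:, j] == 1) & (data[:, i] == 0))
            total += np.log(n_ji / n_ij)
        est[i] = total / k   # average over all k items; the i = j term contributes zero
    return est

low_group  = simulate(rng.normal(-1.0, 1.0, 3000), difficulties)   # a less able sample
high_group = simulate(rng.normal(+1.0, 1.0, 3000), difficulties)   # a more able sample

print(np.round(pairwise_calibration(low_group), 2))
print(np.round(pairwise_calibration(high_group), 2))
# When the data fit the model, the two sets of calibrations agree within sampling error
# even though the samples differ sharply in ability: Thurstone's requirement that the
# scale transcend the group measured.
```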

But that’s another story for another time. Tomorrow we’ll take a closer look at sufficiency, in both conceptual and practical terms. Cited references are always available on request, but I’ll post them in a couple of days.

A Tale of Two Industries: Contrasting Quality Assessment and Improvement Frameworks

July 8, 2009

Imagine the chaos that would result if industrial engineers each had their own tool sets, calibrated in idiosyncratic metrics with unit sizes that changed depending on the size of what they measured, and if they conducted quality improvement studies focused on statistical significance tests of effect sizes. Imagine, furthermore, that these engineers ignore the statistical power of their designs and cannot tell when they are finding statistically significant results by pure chance and when they are not. And imagine, finally, that they also ignore the substantive meaning of the numbers, never considering the differences they're studying in terms of varying probabilities of response to the questions they ask.

So when one engineer tries to generalize a result across applications, what happens is that it kind of works sometimes, doesn’t work at all other times, is often ignored, and does not command a compelling response from anyone because they are invested in their own metrics, samples, and results, which are different from everyone else’s. If there is any discussion of the relative merits of the research done, it is easy to fall into acrimonious and heated arguments that cannot be resolved because of the lack of consensus on what constitutes valid data, instrumentation, and theory.

Thus, the engineers put up the appearance of polite decorum. They smile and nod at each other’s local, sample-dependent, and irreproducible results, while they build mini-empires of funding, students, quoting circles, and professional associations on the basis of their personal authority and charisma. As they do so, costs in their industry go spiralling out of control, profits are almost nonexistent, fewer and fewer people can afford their products, smart people are going into other fields, and overall product quality is declining.

Of course, this is the state of affairs in education and health care, not in industrial engineering. In the latter field, the situation is much different. Here, everyone everywhere is very concerned to be sure they are always measuring the same thing as everyone else and in the same unit. Unexpected results of individual measures pop out instantly and are immediately redone. Innovations are more easily generated and disseminated because everyone is thinking together in the same language and seeing effects expressed in the same images. Someone else’s ideas and results can be easily fitted into anyone else’s experience, and the viability of a new way of doing things can be evaluated on the basis of one’s own experience and skills.

Arguments can be quite productive, as consensus on basic values drives the demand for evidence. Associations and successes are defined more in terms of merit earned from productivity and creativity demonstrated through the accumulation of generalized results. Costs in these industries are constantly dropping, profits are steady or increasing, more and more people can afford their products, smart people are coming into the field, and overall product quality is improving.

There is absolutely no reason why education and health care cannot thrive and grow like other industries. It is up to us to show how.


Table Comparing Scores, Ratings, and Percentages with Real Measures

July 6, 2009

(Documentation to be posted tomorrow.)

| Characteristics | Raw Scores and/or Percentages | Rasch Measurement |
| --- | --- | --- |
| Quantitative hypothesis | Neither formulated nor tested | Formulated and tested |
| Criteria for falsifying quantitative hypothesis | None | Additivity, conjoint transitivity, parameter separation, unidimensionality, invariance, statistical sufficiency, monotonicity, homogeneity, infinite divisibility, etc. |
| Relation to sample distribution | Dependent | Independent |
| Paradigm | Descriptive statistics | Prescriptive measurement |
| Model-data relation | Models describe data; models fit to data; model with best statistics chosen | Models prescribe data quality needed for objective inference; data fit to models; GIGO principle |
| Relation to structure of natural laws | None | Identical |
| Statistical tests of quantitative hypothesis | None | Information-weighted and outlier-sensitive model fit, Principal Components Analysis, many other fit statistics available |
| Reliability coefficients | Cronbach's alpha, KR-20, etc. | Cronbach's alpha, KR-20, etc., plus Separation and Strata |
| Reliability error source | Unexplained portion of variance | Mean square of individual error estimates |
| Range of measurement | Arbitrary, from minimum to maximum score | Nonarbitrary, infinite |
| Unit status | Ordinal, nonlinear | Interval, linear |
| Unit status assumed in statistical comparisons | Interval, linear | Interval, linear |
| Proofs of unit status | Correlational | Axiomatic; reproduced physical metrics; graphical plots; independent cross-sample recalibrations; etc. |
| Error theory for individual scores/measures | None | Derived from sampling theory |
| Architecture (capacity to add/delete items) | Closed | Open |
| Supports adaptive administration and mass customization | No (changes to items change meaning of scores) | Yes (changes to items do not change meaning of measures) |
| Supports traceability to metrological reference standard | No | Yes |
| Domains scored | Either persons or items, but rarely both | All facets in model (persons, items, rating categories, judges, tasks, etc.) |
| Comparability of domains scored | Would be incomparable if scored | Comparable; each interpreted in terms of the other |
| Unscored domain characteristics | Assumed all same score or random (though probably not) | No unscored domain |
| Relation with other measures of same construct | Incommensurable | Commensurable and equatable |
| Construct definition | None | Consistency, meaningfulness, interpretability, and predictability of calibration/measure hierarchies |
| Focus of interpretation | Mean scores or percentages relative to demographics or experimental groups | Measures relative to calibrations and vice versa; measures relative to demographics or experimental groups |
| Relation to qualitative methods | Stark difference in philosophical commitments | Rooted in same philosophical commitments |
| Quality of research dialogue | Researchers' expertise elevated relative to research subjects | Research subjects voice individual and collective perspectives on coherence of the construct as defined by researchers' questions |
| Source of narrative theme | Researcher | Object of unfolding dialogue |


Graphic Illustrations of Why Scores, Ratings, and Percentages Are Not Measures, Part Two

July 2, 2009

Part One of this two-part blog offered pictures illustrating the difference between numbers that stand for something that adds up and those that do not. The uncontrolled variation in the numbers that pass for measures in health care, education, satisfaction surveys, performance assessments, etc. is analogous to the variation in weights and measures found in Medieval European markets. It is well established that metric uniformity played a vital role in the industrial and scientific revolutions of the nineteenth century. Metrology will inevitably play a similarly central role in the economic and scientific revolutions taking place today.

Clients and students often express their need for measures that are manageable, understandable, and relevant. But sometimes it turns out that we do not understand what we think we understand. New understandings can make what previously seemed manageable and relevant appear unmanageable and irrelevant. Perhaps our misunderstandings about measurement will one day explain why we have failed to innovate and improve as much as we could have.

Of course, there are statistical methods for standardizing scores and proportions that make them comparable across different normal distributions, but I’ve never once seen them applied to employee, customer, or patient survey results reported to business or hospital managers. They certainly are not used in determining comparable proficiency levels of students under No Child Left Behind. Perhaps there are consultants and reporting systems that make standardized z-scores a routine part of their practices, but even if they are, why should anyone willingly base their decisions on the assumption that normal distributions have been obtained? Why not use methods that give the same result no matter how scores are distributed?

To bring the point home, if statistical standardization is a form of measurement, why don’t we use the z-scores for height distributions instead of the direct measures of how tall we each are? Plainly, the two kinds of numbers have different applications. Somehow, though, we try to make do without the measures in many applications involving tests and surveys, with the unfortunate consequence of much lost information and many lost opportunities for better communication.

Sometimes I wonder: if we gave managers, executives, and entrepreneurs a test on the meaning of the scores, percentages, and logits discussed in Part One, would many do any better on the parts they think they understand than on the parts they find unfamiliar? I suspect not. Some executives whose pay-for-performance bonuses are inflated by statistical accidents are going to be unhappy with what I say here, but, as I've been saying for years, clarifying the financial implications will go a long way toward motivating the needed changes.

How could that be true? Well, consider the way we treat percentages. Imagine that three different hospitals see their patients' percent agreement with a key survey item change as follows. Which one changed the most?


A. from 30.85% to 50.00%: a 19.15% change

B. from 6.68% to 15.87%: a 9.18% change

C. from 69.15% to 84.13%: a 14.99% change

As is illustrated in Figure 1 below, given that all three pairs of administrations of the survey are included together in the same measure distribution, it is likely that the three changes were all the same size.

In this scenario, all the survey administrations shared the same standard deviation in the underlying measure distribution that the key item’s percentage was drawn from, and they started from different initial measures. Different ranges in the measures are associated with different parts of the sample’s distribution, and so different numbers and percentages of patients are associated with the same amount of measured change. It is easy to see that 100-unit measured gains in the range of 50-150 or 1000-1100 on the horizontal axis would scarcely amount to 1% changes, but the same measured gain in the middle of the distribution could be as much as 25%.

Figure 1. Different Percentages, Same Measures
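For readers who want to check the arithmetic behind the Figure 1 scenario, here is a minimal sketch; the shared distribution is taken as standard normal purely for convenience:

```python
from scipy.stats import norm

# The three hospitals' percent-agreement changes, re-expressed as locations in one
# shared normal distribution of measures (standard deviation taken as 1).

changes = {"A": (0.3085, 0.5000), "B": (0.0668, 0.1587), "C": (0.6915, 0.8413)}

for hospital, (before, after) in changes.items():
    measured_gain = norm.ppf(after) - norm.ppf(before)   # gain in standard-deviation units
    print(hospital, round(after - before, 4), round(measured_gain, 2))
# A 0.1915 0.5
# B 0.0919 0.5
# C 0.1498 0.5
# Three very different percentage changes, one and the same half-SD measured gain.
```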

Figure 1 shows how the same measured gain can look wildly different when expressed as a percentage, depending on where the initial measure is positioned in the distribution. But what happens when percentage gains are situated in different distributions that have different patterns of variation?

More specifically, consider a situation in which three different hospitals see their percent agreement with a key survey item change as follows.

A. from 30.85% to 50.00%: a 19.15% change

B. from 30.85% to 50.00%: a 19.15% change

C. from 30.85% to 50.00%: a 19.15% change

Did one change more than the others? Of course, the three percentages are all the same, so we would naturally think that the three increases are all the same. But what if the standard deviations characterizing the three different hospitals’ score distributions are different?

Figure 2, below, shows that the three 19.15% changes could be associated with quite different measured gains. When the distribution is wider and the standard deviation is larger, any given percentage change will be associated with a larger measured change than in cases with narrower distributions and smaller standard deviations.

Figure 2. Same Percentage Gains, Different Measured Gains
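The same check can be run for the Figure 2 scenario; the standard deviations below are hypothetical, chosen only to illustrate the point:

```python
from scipy.stats import norm

# The same 19.15% gain (30.85% -> 50.00%) in three hospitals whose measure
# distributions have different (hypothetical) spreads.

z_gain = norm.ppf(0.50) - norm.ppf(0.3085)   # 0.5 in standard-normal units

for hospital, sd in {"A": 10.0, "B": 20.0, "C": 40.0}.items():
    print(hospital, round(z_gain * sd, 1))   # 5.0, 10.0, 20.0 measure units
# One and the same percentage gain, measured gains differing fourfold.
```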

And if this is not enough evidence of the foolhardiness of treating percentages as measures, bear with me through one more example. Imagine another situation in which three different hospitals see their percent agreement with a key survey item change as follows.

A. from 30.85% to 50.00%: a 19.15% change

B. from 36.96% to 50.00%: a 13.04% change

C. from 36.96% to 50.00%: a 13.04% change

Did one change more than the others? Plainly A obtains the largest percentage gain. But Figure 3 shows that, depending on the underlying distribution, A’s 19.15% gain might be a smaller measured change than either B’s or C’s. Further, B’s and C’s measures might not be identical, contrary to what would be expected from the percentages alone.

Figure 3. Percentages Completely at Odds with Measures
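And a sketch of the Figure 3 scenario, again with hypothetical standard deviations:

```python
from scipy.stats import norm

# The largest percentage gain can be the smallest measured gain.

gain_a = (norm.ppf(0.50) - norm.ppf(0.3085)) * 10.0   # A: 19.15% gain, narrow distribution
gain_b = (norm.ppf(0.50) - norm.ppf(0.3696)) * 25.0   # B: 13.04% gain, wider distribution
gain_c = (norm.ppf(0.50) - norm.ppf(0.3696)) * 30.0   # C: 13.04% gain, wider still

print(round(gain_a, 1), round(gain_b, 1), round(gain_c, 1))   # 5.0 8.3 10.0
# A's larger percentage change is the smallest measured gain, and B and C, identical in
# percentage terms, differ in measured terms.
```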

Now we have a fuller appreciation of the scope of the problems associated with the changing unit size illustrated in Part One. Though we think we understand percentages and insist on using them as something familiar and routine, the world that they present to us is as crazily distorted as a carnival funhouse. And we won’t even begin to consider how things look in the context of distributions skewed toward one end of the continuum or the other! There is similarly no point at all in going to bimodal or multimodal distributions (ones that have more than one peak). The vast majority of business applications employing scores, ratings, and percentages as measures do not take the underlying distribution into account. Given the problems that arise in optimal conditions (i.e., with a normal distribution), there is no need to belabor the issue with an enumeration of all the possible things that could be going wrong. Far better to simply move on and construct measurement systems that remain invariant across the different shapes of local data sets’ particular distributions.

How could we have gone so far in making these nonsensical numbers the focus of our attention? To put things back in perspective, we need to keep in mind the evolving magnitude of the problems we face. When Florence Nightingale was deploring the lack of any available indications of the effectiveness of her efforts, a little bit of flawed information was a significant improvement over no information. Ordinal, situation-specific numbers provided highly useful information when problems emerged in local contexts on a scale that could be comprehended and addressed by individuals and small groups.

We no longer live in that world. Today’s problems require kinds of information that must be more meaningful, precise, and actionable than ever before. And not only that, this information cannot remain accessible only to managers, executives, researchers, and data managers. It must be brought to bear in every transaction and information exchange in the industry.

Information has to be formatted in the common currency of uniform metrics to make it as fluid and empowering as possible. Would the auto industry have been able to bring off a quality revolution if every worker's toolkit were calibrated in a different unit? Could we expect to coordinate schedules easily if we each had clocks scaled in different time units? Obviously not; so why should we expect quality revolutions in health care and education when nearly all of our relevant metrics are incommensurable?

Management consultants realized decades ago that information creates a sense of responsibility in the person who possesses it. We cannot expect clinicians and teachers to take full responsibility for the outcomes they produce until they have the information they need to evaluate and improve them. Existing data and systems plainly are not up to the task.

The problem is far less a matter of complex or difficult issues than it is one of culture and priorities. It often takes less effort to remain in a dysfunctional rut and deal with massive inefficiencies than it does to get out of the rut and invent a new system with new potentials. Big changes tend to take place only when systems become so bogged down by their problems that new systems emerge simply out of the need to find some way to keep things in motion. These blogs are written in the hope that we might be able to find our way to new methods without suffering the catastrophes of total system failure. One might well imagine an entrepreneurially-minded consortium of providers, researchers, payors, accreditors, and patient advocates joining forces in small pilot projects testing out new experimental systems.

To know how much of something we're getting for our money and whether it's a fair bargain, we need to be able to compare amounts across providers, vendors, treatment options, teaching methods, etc. Scores summed from tests, surveys, or assessments, individual ratings, and percentages of a maximum possible score or frequency do not provide this information because they are not measures. Their unit sizes vary across individuals, collections of indicators (instruments), time, and space. The consequences of treating scores and percentages as measures are not trivial. We will eventually come to see that measurement quality is the primary source of the difference between the current health care and education systems' regional variations and endlessly spiralling costs, on the one hand, and the geographically uniform quality, costs, and improvements of the systems we will create in the future, on the other.

Markets are dysfunctional when quality and costs cannot be evaluated in common terms by consumers, providers’ quality improvement specialists, researchers, accreditors, and payers. There are widespread calls for greater transparency in purchasing decisions, but transparency is not being defined and operationalized meaningfully or usefully. As currently employed, transparency refers to making key data available for public scrutiny. But these data are almost always expressed as scores, ratings, or percentages that are anything but transparent. In addition to not adding up, these data are also usually presented in indigestibly large volumes, and are not quality assessed.

All things considered, we're doing amazingly well with our health care and education systems given the way we've hobbled ourselves with dysfunctional, incommensurable measures. And that gives us real cause for hope! What will we be able to accomplish when we really put our minds to measuring what we want to manage? How much better will we be able to do when entrepreneurs have the tools they need to innovate new efficiencies? Who knows what we'll be capable of when we have meaningful measures that stand for amounts that really add up, when data volumes are dramatically reduced to manageable levels, and when data quality is effectively assessed and improved?

For more on the problems associated with these kinds of percentages in the context of NCLB, see Andrew Dean Ho’s article in the August/September, 2008 issue of Educational Researcher, and Charles Murray’s “By the Numbers” column in the July 25, 2006 Wall Street Journal.

This is not the end of the story as to what the new measurement paradigm brings to bear. Next, I’ll post a table contrasting the features of scores, ratings, and percentages with those of measures. Until then, check out the latest issue of the Journal of Applied Measurement at http://www.jampress.org, see what’s new in measurement software at http://www.winsteps.com or http://www.rummlab.com.au, or look into what’s up in the way of measurement research projects with the BEAR group at UC Berkeley (http://gse.berkeley.edu/research/BEAR/research.html).

Finally, keep in mind that we are what we measure. It’s time we measured what we want to be.

Creative Commons License
LivingCapitalMetrics Blog by William P. Fisher, Jr., Ph.D. is licensed under a Creative Commons Attribution-Noncommercial-No Derivative Works 3.0 United States License.
Based on a work at livingcapitalmetrics.wordpress.com.
Permissions beyond the scope of this license may be available at http://www.livingcapitalmetrics.com.

Graphic Illustrations of Why Scores, Ratings, and Percentages Are Not Measures, Part One

July 1, 2009

It happens occasionally when I’m speaking to a group unfamiliar with measurement concepts that my audiences audibly gasp at some of the things I say. What can be so shocking about anything as mundane as measurement? A lot of things, in fact, since we are in the strange situation of having valid and rigorous intuitions about what measures ought to be, while we simultaneously have entire domains of life in which our measures almost never live up to those intuitions in practice.

So today I’d like to spell out a few things about measurement, graphically. First, I’m going to draw a picture of what good measurement looks like. This picture will illustrate why we value numbers and want to use them for managing what’s important. Then I’m going to draw a picture of what scores, ratings, and percentages look like. Here we’ll see how numbers do not automatically stand for something that adds up the way they do, and why we don’t want to use these funny numbers for managing anything we really care about. What we will see here, in effect, is why high stakes graduation, admissions, and professional certification and licensure testing agencies have long since abandoned scores, ratings, and percentages as their primary basis for making decisions.

After contrasting those pictures, a third picture will illustrate how to blend the valid intuitions informing what we expect from measures with the equally valid intuitions informing the observations expressed in scores, ratings, and percentages.

Imagine measuring everything in the room you’re in twice, once with a yardstick and once with a meterstick. You record every measure in inches and in centimeters. Then you plot these pairs of measures against each other, with inches on the vertical axis and centimeters on the horizontal. You would come up with a picture like Figure 1, below.

Figure 1. How We Expect Measures to Work

The key thing to appreciate about this plot is that the amounts of length measured by the two different instruments stay the same no matter which number line they are mapped onto. You would get a plot like this even if you sawed a yardstick in half and plotted the inches read off the two halves. You'd also get the same kind of plot (obviously) if you paired up measures of the same things from two different manufacturers' inch rulers, or from two different brands of metersticks. And you could do the same kind of thing with ounces and grams, or degrees Fahrenheit and Celsius.
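
A quick sketch in Python reproduces the idea behind Figure 1; the particular lengths here are arbitrary, since any set of objects would do.

```python
# Sketch of the Figure 1 idea: the same lengths measured in centimeters and
# in inches always fall on a straight line, because both instruments map the
# same amounts onto their respective number lines.
import numpy as np
import matplotlib.pyplot as plt

lengths_cm = np.array([30, 75, 120, 180, 240, 300])   # arbitrary example lengths
lengths_in = lengths_cm / 2.54                        # the same lengths in inches

plt.scatter(lengths_cm, lengths_in)
plt.xlabel("Centimeters")
plt.ylabel("Inches")
plt.title("Same amounts, two number lines")
plt.show()
```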

So here we are immersed in the boring-to-the-point-of-being-banal details of measurement. We take these alignments completely for granted, but they are not given to us for nothing. They are products of the huge investments we make in metrological standards. Metrology came of age in the early nineteenth century. Until then, weights and measures varied from market to market. Units with the same name might be different sizes, and units with different names might be the same size. As was so rightly celebrated on World Metrology Day (May 20), metric uniformity contributes hugely to the world economy by reducing transaction costs and by structuring representations of fair value.

We are in dire need of similar metrological systems for human, social, and natural capital. Health care reform, improved education systems, and environmental management will not come anywhere near realizing their full potentials until we establish, implement, and monitor metrological standards that bring intangible forms of capital to economic life.

But can we construct plots like Figure 1 from the numeric scores, ratings, and percentages we commonly assume to be measures? Figure 2 shows the kind of picture we get when we plot percentages against each other (scores and ratings behave in the same way, for reasons given below). These data might be from easy and hard halves of the same reading or math test, from agreeable and disagreeable ends of the same rating scale survey, or from different tests or surveys that happen to vary in their difficulty or agreeability. The Figure 2 data might also come from different situations in which some event or outcome occurs more frequently in one place than it does in another (we’ll go more into this in Part Two of this report).

Figure 2. Percents Correct or Agreement from Different Tests or Surveys

In contrast with the linear relation obtained in the comparison of inches and centimeters, here we have a curve. Why must this relation necessarily be curved? It cannot be linear because both instruments limit their measurement ranges, and they set different limits. So, if someone scores a 0 on the easy instrument, they are highly likely to also score 0 on the instrument posing more difficult or disagreeable questions. Conversely, if someone scores 100 on the hard instrument, they are highly likely to also score 100 on the easy one.

But what is going to happen in the rest of the measurement range? By the definition of easy and hard, scores on the easy instrument will be higher than those on the hard one. And because the same measured amount is associated with different ranges in the easy and hard score distributions, the scores vary at different rates (Part Two will explore this phenomenon in more detail).
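
For readers who would like to see this curve emerge on their own screens, the following sketch simulates it under a simple one-parameter (Rasch-type) response model. The two hypothetical 40-item tests differ only in their average difficulty; none of the numbers come from real data.

```python
# Hypothetical simulation: percents correct from an easy and a hard test
# trace a curve, not a straight line, even though the same persons took both.
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)
abilities = np.linspace(-4, 4, 200)          # person measures in logits
easy_items = rng.normal(-1.0, 1.0, 40)       # easier items: lower difficulties
hard_items = rng.normal(+1.0, 1.0, 40)       # harder items: higher difficulties

def expected_percent_correct(thetas, difficulties):
    # Expected percent correct = average Rasch probability across the items.
    probs = 1.0 / (1.0 + np.exp(-(thetas[:, None] - difficulties[None, :])))
    return 100.0 * probs.mean(axis=1)

plt.plot(expected_percent_correct(abilities, hard_items),
         expected_percent_correct(abilities, easy_items))
plt.plot([0, 100], [0, 100], linestyle="--")  # identity line for reference
plt.xlabel("Percent correct, hard test")
plt.ylabel("Percent correct, easy test")
plt.title("Same persons, two tests: a curved relation")
plt.show()
```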

These kinds of numbers are called ordinal because they meaningfully convey information about rank order. They do not, however, stand for amounts that add up. We are, of course, completely free to treat these ordinal numbers however we want, in any kind of arithmetical or statistical comparison. Whether such comparisons are meaningful and useful is a completely different issue.

Figure 3 shows the Figure 2 data transformed. The mathematical transformation of the percentages produces what is known as a logit, so called because it is a log-odds unit, obtained as the natural logarithm of the response odds. (The response odds are the response probabilities, that is, the original percentages of the maximum possible score expressed as proportions, divided by one minus themselves.) This is the simplest possible way of estimating linear measures; virtually no computer program providing these kinds of estimates would employ an algorithm this simple and potentially fallible.
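
In code, the bare transformation just described is a one-liner; this is only the naive conversion, not the more careful estimation an actual measurement package would perform.

```python
# Naive log-odds (logit) conversion of a percent of the maximum possible score.
import numpy as np

def percent_to_logit(percent):
    """Natural log of the response odds: ln(p / (1 - p)), with p a proportion."""
    p = percent / 100.0
    return np.log(p / (1.0 - p))

for pct in [10, 30.85, 50, 69.15, 90]:
    print(f"{pct:6.2f}% -> {percent_to_logit(pct):+.2f} logits")
```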

Figure 3. Logit (Log-Odds Units) Estimates of the Figure 2 Data

Although the relationship shown in Figure 3 is not as precise as that shown in Figure 1, especially at the extremes, the values plotted fall far closer to the identity line than the values in Figure 2 do. Like Figure 1, Figure 3 shows that constant amounts of the thing measured exist irrespective of the particular number line they happen to be mapped onto.

What this means is that the two instruments could be designed so that the same numbers are read off of them when the same amounts are measured. We value numbers as much as we do because they are so completely transparent: 2+2=4 no matter what. But this transparency becomes a liability when we assume that every unit amount is the same as all the others when, in fact, they vary substantially. When different units stand for different amounts, confusion reigns. But we can reasonably hope and strive for great things as we bring human, social, and natural capital to life via universally uniform metrics traceable to reference standards.

A large literature on these methods exists and ought to be more widely read. For more information, see http://www.rasch.org, http://www.livingcapitalmetrics.com, etc.

Creative Commons License
LivingCapitalMetrics Blog by William P. Fisher, Jr., Ph.D. is licensed under a Creative Commons Attribution-Noncommercial-No Derivative Works 3.0 United States License.
Based on a work at livingcapitalmetrics.wordpress.com.
Permissions beyond the scope of this license may be available at http://www.livingcapitalmetrics.com.