Archive for the ‘Tuning instruments’ Category

Measuring Instruments as Media for the Expression of Creative Passions in Education

June 26, 2015

Measurement is often viewed as a reduction of complex phenomena to numbers. It is accordingly also often conceived as mechanical, and disconnected from the world of life. Educational examinations are seen by many as an especially egregious form of inappropriate reduction. This view is contradicted, however, by a perspective that sees an analogy between educational assessment and music. Calibrated instruments, mathematical scales, and high technology play key roles in the production of music, which, ironically, is widely considered the most alive, captivating, and emotionally powerful of the arts. Though behavioral psychology has indeed learned how to use music to manipulate consumer purchasing decisions, music is nonetheless unabashedly accepted as the highest expression of passion in art.

The question then arises as to whether and how measurement in other areas, such as education, might be conceived, designed, and practiced as a medium for the expression and fulfillment of creative passions. Key issues involved in substantively realizing a musical metaphor in human and social measurement include capacities to tune instruments, to define common scales, to score performances, to orchestrate harmonious relationships, to enhance choral grace note effects, and to combine elements in unique but pleasing and recognizable rhythmic arrangements.

Practical methods for making educational measurement the medium for the expression of creative passions for learning are in place in thousands of schools nationally and internationally. With such tools in hand, formative applications of integrated instruction and assessment could be conceived as intuitive media for composing and conducting expressions of creative passions. Student outcomes in reading, mathematics, and other domains may then come to be seen in terms of portfolios of works akin to those produced by musicians, sculptors, film makers, or painters.

Hundreds of thousands of books and millions of articles tuned to the same text complexity scale, for instance, provide readers an extensive palette of colorful tones and timbres for expressing their desires and capacities for learning. Graphical presentations of individual students’ outcomes, as well as outcomes aggregated by classroom, school, district, etc., could be presented, interpreted and experienced as public performances of artful developmental narratives enabling dramatic performances of personal uniqueness and social generality.

Measurement instrumentation in education is able to capture, aggregate, and organize literacy, numeracy, socio-emotional intelligence, and other performances into special portfolios documenting the play and dance of emerging new understandings. As in any creative process, accidents, errors, and idiosyncratic patterns of strengths and weaknesses may evoke powerful and dramatic expressions of beauty, and of human and social value. And just as members of musical ensembles may complement one another’s skills, using rhythm and harmony to improve each other’s playing abilities in practice, so, too, instruments of formative assessment tuned to the same scale can be used to coordinate and enhance individual student and teacher skill levels.

Possibilities for orchestrating such performances across educational, health care, social service, environmental management, and other fields could similarly take advantage of existing instrument calibration and measurement technologies.

Measurement as a Medium for the Expression of Creative Passions in Education

April 23, 2014

Measurement is often viewed as a purely technical task involving a reduction of complex phenomena to numbers. It is accordingly also experienced as mechanical in nature, and disconnected from the world of life. Educational examinations are often seen as an especially egregious form of inappropriate reduction.

This perspective on measurement is contradicted, however, by the essential roles of calibrated instrumentation, mathematical scales, and high technology in the production of music, which, ironically, is widely considered the most alive, captivating and emotionally powerful of the arts.

The question then arises as to whether and how measurement in other areas, such as education, might be conceived, designed, and practiced as a medium for the expression and fulfillment of creative passions. Key issues involved in substantively realizing a musical metaphor in human and social measurement include capacities to tune instruments, to define common scales, to orchestrate harmonious relationships, to enhance choral grace note effects, and to combine elements in unique but pleasing and recognizable forms.

Practical methods of this kind are in place in hundreds of schools nationally and internationally. With such tools in hand, formative applications of integrated instruction and assessment could be conceived as intuitive media for composing and conducting expressions of creative passions.

Student outcomes in reading, mathematics, and other domains may then come to be seen in terms of portfolios of works akin to those produced by musicians, sculptors, film makers, or painters. Hundreds of thousands of books and millions of articles tuned to the same text complexity scale provide readers an extensive palette of colorful tones and timbres for expressing their desires and capacities for learning. Graphical presentations of individual students’ outcomes, as well as outcomes aggregated by classroom, school, district, etc., may be interpreted and experienced as public performances of artful developmental narratives enabling dramatic performances of personal uniqueness and social generality.

Technical canvases capture, aggregate, and organize literacy performances into special portfolios documenting the play and dance of emerging new understandings. As in any creative process, accidents, errors, and idiosyncratic patterns of strengths and weaknesses may evoke powerful expressions of beauty, and of human and social value. Just as members of musical ensembles may complement one another’s skills, using rhythm and harmony to improve each other’s playing abilities in practice, so, too, instruments of formative assessment tuned to the same scale can be used to enhance individual teacher skill levels.

Possibilities for orchestrating such performances across educational, health care, social service, environmental management, and other fields could similarly take advantage of existing instrument calibration and measurement technologies.

Creatively Expressing How Love Matters for Justice: Setting the Stage and Tuning the Instruments

April 16, 2014

Nussbaum (2013) argues for the political importance of connecting with our bodies without shame or disgust, and for the relevance that musical and poetic public expressions of varieties of love offer to conceptions of justice. Institutions embodying principles of loving justice require media integrating emotional expression with technical calculation, in exactly the same way music does. Being able to dance at the revolution demands instruments tuned to shared scales, whether equal temperament, just intonation, meantone tuning, or any of a variety of other well or irregular temperaments is chosen.

The physicality of dancing, so often evoking romance and courtship, provides a point of entry to a metaphoric logic of reproduction applicable to the Socratic midwifery of ideas and to the products of social intercourse. Tuning the instruments of the human, social, and environmental arts and sciences to harmonize and choreograph relationships may then enable formulation of nonreductionist approaches to the problem of how to reconcile political emotions with physical or geometrical accounts of the scales of justice.

Historical accounts of (musical, medical, electrical, etc.) metrological standards describe ways in which passionate concern for shared vulnerabilities and common joys has sometimes succeeded in deploying systems realizing higher forms of just relations (Alder, 2002; Berg & Timmermans, 2000; Isacoff, 2001; Schaffer, 1992). The question of the day is whether we will succeed in creating new forms of such relations in the many areas of life where they are needed.

Yes, as Nussbaum (2013, p. 396) admits, the demand for love is a tall order, and unrealistic. But all heuristic fictions, from Pythagorean triangles to the mathematical pendulum, are unrealistic and are never actually observed in practice, as has been pointed out by a number of historians and philosophers (Butterfield, 1957, pp. 16-17; Heidegger, 1967, p. 89; Rasch, 1960, pp. 37-38, 1973/2011). These fictions are, however, eminently useful as guides, goals, and coherent ways of telling our stories, and that is the criterion by which they should be judged.


Alder, K. (2002). The measure of all things: The seven-year odyssey and hidden error that transformed the world. New York: The Free Press.

Berg, M., & Timmermans, S. (2000). Orders and their others: On the constitution of universalities in medical work. Configurations, 8(1), 31-61.

Butterfield, H. (1957). The origins of modern science (revised edition). New York: The Free Press.

Heidegger, M. (1967). What is a thing? (W. B. Barton, Jr. & V. Deutsch, Trans.). South Bend, Indiana: Regnery/Gateway.

Isacoff, S. M. (2001). Temperament: The idea that solved music’s greatest riddle. New York: Alfred A. Knopf.

Nussbaum, M. (2013). Political emotions: Why love matters for justice. Cambridge, MA: The Belknap Press of Harvard University Press.

Rasch, G. (1960). Probabilistic models for some intelligence and attainment tests. Copenhagen, Denmark: Danmarks Paedagogiske Institut. (Reprint, with Foreword and Afterword by B. D. Wright, Chicago: University of Chicago Press, 1980.)

Rasch, G. (1973/2011, Spring). All statistical models are wrong! Comments on a paper presented by Per Martin-Löf, at the Conference on Foundational Questions in Statistical Inference, Aarhus, Denmark, May 7-12, 1973. Rasch Measurement Transactions, 24(4), 1309 [http://www.rasch.org/rmt/rmt244.pdf].

Schaffer, S. (1992). Late Victorian metrology and its instrumentation: A manufactory of Ohms. In R. Bud & S. E. Cozzens (Eds.), Invisible connections: Instruments, institutions, and science (pp. 23-56). Bellingham, WA: SPIE Optical Engineering Press.

Subjectivity, Objectivity, Performance Measurement and Markets

April 23, 2011

Though he attributes his insight to a colleague (George Baker), Michael Jensen has once more succinctly stated a key point I’ve repeatedly tried to convey in my blog posts. As Jensen (2003, p. 397) puts it,

…any activity whose performance can be perfectly measured objectively does not belong inside the firm. If its performance can be adequately measured objectively it can be spun out of the firm and contracted for in a market transaction.

YES!! Though nothing is measured perfectly, my message has been a series of variations on precisely this theme. Well-measured property, services, products, and commodities in today’s economy are associated with scientific, legal and financial structures and processes that endow certain representations with meaningful indications of kind, amount, value and ownership. It is further well established that the ownership of the products of one’s creative endeavors is essential to economic advancement and the enlargement of the greater good. Markets could not exist without objective measures, and thus we have the central commercial importance of metric standards.

The improved measurement of service outcomes and performances is going to create an environment capable of supporting similar legal and financial indications of value and ownership. Many of the causes of today’s economic crises can be traced to poor quality information and inadequate measures of human, social, and natural value. Bringing publicly verifiable scientific data and methods to bear on the tuning of instruments for measuring these forms of value will make their harmonization much simpler than it ever could be otherwise. Social and environmental costs and value have been relegated to the marginal status of externalities because they have not been measured in ways that made it possible to bring them onto the books and into the models.

But the stage is being set for significant changes. Decades of research calibrating objective measures of a wide variety of performances and outcomes are inexorably leading to the creation of an intangible assets metric system (Fisher, 2009a, 2009b, 2011). Meaningful, rigorous, uniform, and universally available individual-level metrics for each significant intangible asset (abilities, health, trustworthiness, etc.) will

(a) make it possible for each of us to take full possession, ownership, and management control of our investments in and returns from these forms of capital,

(b) coordinate the decisions and behaviors of consumers, researchers, and quality improvement specialists to better match supply and demand, and thereby

(c) increase the efficiency of human, social, and natural capital markets, harnessing the profit motive for the removal of wasted human potential, lost community coherence, and destroyed environmental quality.

Jensen’s observation emerges in his analysis of performance measures as one of three factors in defining the incentives and payoffs for a linear compensation plan (the other two being the intercept and the slope of the bonus line relating salary and bonus to the performance measure targets). The two sentences quoted above occur in this broader context, where Jensen (2003, pp. 396-397) states that,

…we must decide how much subjectivity will be involved in each performance measure. In considering this we must recognize that every performance measurement system in a firm must involve an important amount of subjectivity. The reason, as my colleague George Baker has pointed out, is that any activity whose performance can be perfectly measured objectively does not belong inside the firm. If its performance can be adequately measured objectively it can be spun out of the firm and contracted for in a market transaction. Thus, one of the most important jobs of managers, complementing objective measures of performance with managerial subjective evaluation of subtle interdependencies and other factors is exactly what most managers would like to avoid. Indeed, it is this factor along with efficient risk bearing that is at the heart of what gives managers and firms an advantage over markets.

Jensen is here referring implicitly to the point Coase (1990) makes regarding the nature of the firm. A firm can be seen as a specialized market, one in which methods, insights, and systems not generally available elsewhere are employed for competitive advantage. Products are brought to market competitively by being endowed with value not otherwise available. Maximizing that value is essential to the viability of the firm.

Given conflicting incentives and the mixed messages of the balanced scorecard, managers have plenty of opportunities for creatively avoiding the difficult task of maximizing the value of the firm. Jensen (2001) shows that attending to the “managerial subjective evaluation of subtle interdependencies” is made impossibly complex when decisions and behaviors are pulled in different directions by each stakeholder’s particular interests. Other research shows that even traditional capital structures are plagued by the mismeasurement of leverage, distress costs, tax shields, and the speed with which individual firms adjust their capital needs relative to leverage targets (Graham & Leary, 2010). The objective measurement of intangible assets surely seems impossibly complex to those familiar with these problems.

But perhaps the problems associated with measuring traditional capital structures are not so different from those encountered in the domain of intangible assets. In both cases, a particular kind of unjustified self-assurance seems always to attend the mere availability of numeric data. To the unpracticed eye, numbers seem always to behave the same way, no matter if they are rigorous measures of physical commodities, like kilowatts, barrels, or bushels, or if they are currency units in an accounting spreadsheet, or if they are percentages of agreeable responses to a survey question. The problem is that, when interrogated in particular ways with respect to the question of how much of something is supposedly measured, these different kinds of numbers give markedly different kinds of answers.

The challenge we face is one of determining what kind of answers we want to the questions we have to ask. Presumably, we want to ask questions and get answers pertinent to obtaining the information we need to manage life creatively, meaningfully, effectively and efficiently. It may be useful then, as a kind of thought experiment, to make a bold leap and imagine a scenario in which relevant questions are answered with integrity, accountability, and transparency.

What will happen when the specialized expertise of human resource professionals is supplanted by a market in which meaningful and comparable measures of the hireability, retainability, productivity, and promotability of every candidate and employee are readily available? If Baker and Jensen have it right, perhaps firms will no longer have employees. This is not to say that no one will work for pay. Instead, firms will contract with individual workers at going market rates, and workers will undoubtedly be well aware of the market value of their available shares of their intangible assets.

A similar consequence follows for the social safety net and a host of other control, regulatory, and policing mechanisms. But we will no longer be stuck with blind faith in the invisible hand and market efficiency, following the faith of those willing to place their trust and their futures in the hands of mechanisms they only vaguely understand and cannot control. Instead, aggregate effects on individuals, communities, and the environment will be tracked in publicly available and critically examined measures, just as stocks, bonds, and commodities are tracked now.

Previous posts in this blog explore the economic possibilities that follow from having empirically substantiated, theoretically predictable, and instrumentally mediated measures embodying broad consensus standards. What we will have for human, social, and natural capital will be the same kind of objective measures that have made markets work as well as they have thus far. It will be a whole new ball game when profits become tied to human, social, and environmental outcomes.

References

Coase, R. (1990). The firm, the market, and the law. Chicago: University of Chicago Press.

Fisher, W. P., Jr. (2009a, November). Invariance and traceability for measures of human, social, and natural capital: Theory and application. Measurement, 42(9), 1278-1287.

Fisher, W. P., Jr. (2009b). NIST critical national need idea white paper: Metrological infrastructure for human, social, and natural capital (Tech. Rep.). New Orleans: LivingCapitalMetrics.com. Available at http://www.livingcapitalmetrics.com/images/FisherNISTWhitePaper2.pdf.

Fisher, W. P., Jr. (2010, 22 November). Meaningfulness, measurement, value seeking, and the corporate objective function: An introduction to new possibilities. Available at http://ssrn.com/abstract=1713467.

Fisher, W. P., Jr. (2011). Bringing human, social, and natural capital to life: Practical consequences and opportunities. Journal of Applied Measurement, 12(1), in press.

Graham, J. R., & Leary, M. T. (2010, 21 December). A review of empirical capital structure research and directions for the future. Available at http://ssrn.com/abstract=1729388.

Jensen, M. C. (2001, Fall). Value maximization, stakeholder theory, and the corporate objective function. Journal of Applied Corporate Finance, 14(3), 8-21.

Jensen, M. C. (2003). Paying people to lie: The truth about the budgeting process. European Financial Management, 9(3), 379-406.

Creative Commons License
LivingCapitalMetrics Blog by William P. Fisher, Jr., Ph.D. is licensed under a Creative Commons Attribution-Noncommercial-No Derivative Works 3.0 United States License.
Based on a work at livingcapitalmetrics.wordpress.com.
Permissions beyond the scope of this license may be available at http://www.livingcapitalmetrics.com.

A Simple Example of How Better Measurement Creates New Market Efficiencies, Reduces Transaction Costs, and Enables the Pricing of Intangible Assets

March 4, 2011

One of the ironies of life is that we often overlook the obvious in favor of the obscure. And so one hears of huge resources poured into finding and capitalizing on opportunities that provide infinitesimally small returns, while other opportunities—with equally certain odds of success but far more profitable returns—are completely neglected.

The National Institute of Standards and Technology (NIST) reports returns on investment ranging from 32% to over 400% in 32 metrological improvements made in semiconductors, construction, automation, computers, materials, manufacturing, chemicals, photonics, communications and pharmaceuticals (NIST, 2009). Previous posts in this blog offer more information on the economic value of metrology. The point is that the returns obtained from improvements in the measurement of tangible assets will likely also be achieved in the measurement of intangible assets.

How? With a little bit of imagination, each stage in the development of increasingly meaningful, efficient, and useful measures described in this previous post can be seen as implying a significant return on investment. As those returns are sought, investors will coordinate and align different technologies and resources relative to a roadmap of how these stages are likely to unfold in the future, as described in this previous post. The basic concepts of how efficient and meaningful measurement reduces transaction costs and market frictions, and how it brings capital to life, are explained and documented in my publications (Fisher, 2002-2011), but what would a concrete example of the new value created look like?

The examples I have in mind hinge on the difference between counting and measuring. Counting is a natural and obvious thing to do when we need some indication of how much of something there is. But counting is not measuring (Cooper & Humphry, 2010; Wright, 1989, 1992, 1993, 1999). This is not some minor academic distinction of no practical use or consequence. It is rather the source of the vast majority of the problems we have in comparing outcome and performance measures.

Imagine how things would be if we couldn’t weigh fruit in a grocery store, and all we could do was count pieces. We can tell when eight small oranges possess less overall mass of fruit than four large ones by weighing them; the eight small oranges might weigh 0.75 kilograms (about 1.6 pounds) while the four large ones come in at 1.0 kilo (2.2 pounds). If oranges were sold by count instead of weight, perceptive traders would buy small oranges and make more money selling them than they could if they bought large ones.
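The trader’s arbitrage here is simple arithmetic, and a few lines of Python make it concrete (the prices are hypothetical, invented for illustration):

```python
# Hypothetical prices: what happens when oranges are sold by
# weight versus by count.
price_per_kg = 2.00      # assumed price per kilogram
price_per_orange = 0.40  # assumed price per piece

small_batch_kg, small_count = 0.75, 8   # eight small oranges
large_batch_kg, large_count = 1.00, 4   # four large ones

# Sold by weight, cost tracks the measured amount of fruit:
cost_small_by_weight = small_batch_kg * price_per_kg  # 1.5
cost_large_by_weight = large_batch_kg * price_per_kg  # 2.0

# Sold by count, the smaller amount of fruit costs more:
cost_small_by_count = small_count * price_per_orange  # 3.2
cost_large_by_count = large_count * price_per_orange  # 1.6
```

Selling by weight makes price proportional to the measured quantity; selling by count rewards whoever games the counting unit, which is exactly the opening the perceptive trader exploits.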

But we can’t currently arrive so easily at the comparisons we need when we’re buying and selling intangible assets, like those produced as the outcomes of educational, health care, or other services. So I want to walk through a couple of very down-to-earth examples to bring the point home. Today we’ll focus on the simplest version of the story, and tomorrow we’ll take up a little more complicated version, dealing with the counts, percentages, and scores used in balanced scorecard and dashboard metrics of various kinds.

What if you score eight on one reading test and I score four on a different reading test? Who has more reading ability? In the same way that we might be able to tell just by looking that eight small oranges are likely to have less actual orange fruit than four big ones, we might also be able to tell just by looking that eight easy (short, common) words can likely be read correctly with less reading ability than four difficult (long, rare) words can be.

So let’s analyze the difference between buying oranges and buying reading ability. We’ll set up three scenarios for buying reading ability. In all three, we’ll imagine we’re comparing how we buy oranges with the way we would have to go about buying reading ability today if teachers were paid for the gains made on the tests they administer at the beginning and end of the school year.

In the first scenario, the teachers make up their own tests. In the second, the teachers each use a different standardized test. In the third, each teacher uses a computer program that draws questions from the same online bank of precalibrated items to construct a unique test custom tailored to each student. The first scenario is likely the most common in real life. The third is the rarest, but nonetheless describes a situation that has been available to millions of students in the U.S., Australia, and elsewhere for several years. Scenarios one, two and three correspond with developmental levels one, three, and five described in a previous blog entry.

Buying Oranges

When you go into one grocery store and I go into another, we don’t have any oranges with us. When we leave, I have eight and you have four. I have twice as many oranges as you, but yours weigh a kilo, about a third more than mine (0.75 kilos).

When we paid for the oranges, the transaction was finished in a few seconds. Neither one of us experienced any confusion, annoyance, or inconvenience in relation to the quality of information we had on the amount of orange fruits we were buying. I did not, however, pay twice as much as you did. In fact, you paid more for yours than I did for mine, in direct proportion to the difference in the measured amounts.

No negotiations were necessary to consummate the transactions, and there was no need for special inquiries about how much orange we were buying. We knew from experience in this and other stores that the prices we paid were comparable with those offered in other times and places. Our information was cheap, as it was printed on the bag of oranges or could be read off a scale, and it was very high quality, as the measures were directly comparable with measures from any other scale in any other store. So, in buying oranges, the impact of information quality on the overall cost of the transaction was so inexpensive as to be negligible.

Buying Reading Ability (Scenario 1)

So now you and I go through third grade as eight-year-olds. You’re in one school and I’m in another. We have different teachers. Each teacher makes up his or her own reading tests. When we started the school year, we each took a reading test (different ones), and we took another (again, different ones) as we ended the school year.

For each test, your teacher counted up your correct answers and divided by the total number of questions; so did mine. You got 72% correct on the first one, and 94% correct on the last one. I got 83% correct on the first one, and 86% correct on the last one. Your score went up 22 percentage points, much more than the 3 points mine went up. But did you learn more? It is impossible to tell. What if both of your tests were easier—not just for you or for me but for everyone—than both of mine? What if my second test was a lot harder than my first one? On the other hand, what if your tests were harder than mine? Perhaps you did even better than your scores seem to indicate.

We’ll just exclude from consideration other factors that might come to bear, such as whether your tests were significantly longer or shorter than mine, or if one of us ran out of time and did not answer a lot of questions.

If our parents had to pay the reading teacher at the end of the school year for the gains that were made, how would they tell what they were getting for their money? What if your teacher gave a hard test at the start of the year and an easy one at the end of the year so that you’d have a big gain and your parents would have to pay more? What if my teacher gave an easy test at the start of the year and a hard one at the end, so that a really high price could be put on very small gains? If our parents were to compare their experiences in buying our improved reading ability, they would have a lot of questions about how much improvement was actually obtained. They would be confused and annoyed at how inconvenient the scores are, because they are difficult, if not impossible, to compare. A lot of time and effort might be invested in examining the words and sentences in each of the four reading tests to try to determine how easy or hard they are in relation to each other. Or, more likely, everyone would throw their hands up and pay as little as they possibly can for outcomes they don’t understand.
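The trouble with scenario one can be put in a quick sketch. The raw-count arithmetic is easy to reproduce (the item counts below are hypothetical, chosen only to yield the percentages in the story), but notice that nothing in it carries any information about how hard the tests were:

```python
# Percent-correct scoring, as in scenario one. Raw counts are
# hypothetical, chosen to reproduce the percentages in the text.
def percent_correct(num_right, num_items):
    return round(100 * num_right / num_items)

your_start = percent_correct(36, 50)   # 72
your_end   = percent_correct(47, 50)   # 94
my_start   = percent_correct(83, 100)  # 83
my_end     = percent_correct(86, 100)  # 86

your_gain = your_end - your_start  # 22 percentage points
my_gain   = my_end - my_start      # 3 percentage points

# The function takes no argument for item difficulty, so the
# gains cannot be compared across different pairs of tests.
```

The signature of `percent_correct` is the whole problem in miniature: difficulty never enters the calculation, so two gains computed this way from different tests are expressed in no common unit at all.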

Buying Reading Ability (Scenario 2)

In this scenario, we are third graders again, in different schools with different reading teachers. Now, instead of our teachers making up their own tests, our reading abilities are measured at the beginning and the end of the school year using two different standardized tests sold by competing testing companies. You’re in a private suburban school that’s part of an independent schools association. I’m in a public school along with dozens of others in an urban school district.

For each test, our parents received a report in the mail showing our scores. As before, we know how many questions we each answered correctly, and, unlike before, we don’t know which particular questions we got right or wrong. Finally, we don’t know how easy or hard your tests were relative to mine, but we know that the two tests you took were equated, and so were the two I took. That means your tests will show how much reading ability you gained, and so will mine.

We have one new bit of information we didn’t have before, and that’s a percentile score. Now we know that at the beginning of the year, with a percentile ranking of 72, you performed better than 72% of the other private school third graders taking this test, and at the end of the year you performed better than 76% of them. In contrast, I had percentiles of 84 and 89.

The question we have to ask now is if our parents are going to pay for the percentile gain, or for the actual gain in reading ability. You and I each learned more than our peers did on average, since our percentile scores went up, but this would not work out as a satisfactory way to pay teachers. Averages being averages, if you and I learned more and faster, someone else learned less and slower, so that, in the end, it all balances out. Are we to have teachers paying parents when their children learn less, simply redistributing money in a zero sum game?

And so, additional individualized reports are sent to our parents by the testing companies. Your tests are equated with each other, and they measure in a comparable unit that ranges from 120 to 480. You had a starting score of 235 and finished the year with a score of 420, for a gain of 185.

The tests I took are comparable and measure in the same unit, too, but not the same unit as your tests measure in. Scores on my tests range from 400 to 1200. I started the year with a score of 790, and finished at 1080, for a gain of 290.

Now the confusion in the first scenario is overcome, in part. Our parents can see that we each made real gains in reading ability. The difficulty levels of the two tests you took are the same, as are the difficulties of the two tests I took. But our parents still don’t know what to pay the teacher because they can’t tell if you or I learned more. You had lower percentiles and test scores than I did, but you are being compared with what is likely a higher scoring group of suburban and higher socioeconomic status students than the urban group of disadvantaged students I’m compared against. And your scores aren’t comparable with mine, so you might have started and finished with more reading ability than I did, or maybe I had more than you. There isn’t enough information here to tell.

So, again, the information that is provided is insufficient to the task of settling on a reasonable price for the outcomes obtained. Our parents will again be annoyed and confused by the low quality information that makes it impossible to know what to pay the teacher.

Buying Reading Ability (Scenario 3)

In the third scenario, we are still third graders in different schools with different reading teachers. This time our reading abilities are measured by tests that are completely unique. Every student has a test custom tailored to their particular ability. Unlike the tests in the first and second scenarios, however, now all of the tests have been constructed carefully on the basis of extensive data analysis and experimental tests. Different testing companies are providing the service, but they have gone to the trouble to work together to create consensus standards defining the unit of measurement for any and all reading test items.

For each test, our parents received a report in the mail showing our measures. As before, we know how many questions we each answered correctly. Now, though we don’t know which particular questions we got right or wrong, we can see typical items ordered by difficulty lined up in a way that shows us what kind of items we got wrong, and which kind we got right. And now we also know your tests were equated relative to mine, so we can compare how much reading ability you gained relative to how much I gained. Now our parents can confidently determine how much they should pay the teacher, at least in proportion to their children’s relative measures. If our measured gains are equal, the same payment can be made. If one of us obtained more value, then proportionately more should be paid.

In this third scenario, we have a situation directly analogous to buying oranges. You have a measured amount of increased reading ability that is expressed in the same unit as my gain in reading ability, just as the weights of the oranges are comparable. Further, your test items were not identical with mine, and so the difficulties of the items we took surely differed, just as the sizes of the oranges we bought did.

This third scenario could be made yet more efficient by removing the need for creating and maintaining a calibrated item bank, as described by Stenner and Stone (2003) and in the sixth developmental level in a prior blog post here. Also, additional efficiencies could be gained by unifying the interpretation of the reading ability measures, so that progress through high school can be tracked with respect to the reading demands of adult life (Williamson, 2008).

Comparison of the Purchasing Experiences

In contrast with the grocery store experience, paying for increased reading ability in the first scenario is fraught with low quality information that greatly increases the cost of the transactions. The information is of such low quality that, of course, hardly anyone bothers to try to decipher it: the cost of doing so outweighs any benefit it could yield. So no one knows how much gain in reading ability is obtained, or what a unit gain might cost.

When a school district or educational researchers mount studies to try to find out what it costs to improve the reading ability of third graders by some standardized unit, they find so much unexplained variation in the costs that they, too, end up with more questions than answers.

In grocery stores and other markets, we don’t place the cost of making the value comparison on the consumer or the merchant. Instead, society as a whole picks up the cost by funding the creation and maintenance of consensus standard metrics. Until we take up the task of doing the same thing for intangible assets, we cannot expect human, social, and natural capital markets to obtain the efficiencies we take for granted in markets for tangible assets and property.

References

Cooper, G., & Humphry, S. M. (2010). The ontological distinction between units and entities. Synthese. doi:10.1007/s11229-010-9832-1.

Fisher, W. P., Jr. (2002, Spring). “The Mystery of Capital” and the human sciences. Rasch Measurement Transactions, 15(4), 854 [http://www.rasch.org/rmt/rmt154j.htm].

Fisher, W. P., Jr. (2003). Measurement and communities of inquiry. Rasch Measurement Transactions, 17(3), 936-8 [http://www.rasch.org/rmt/rmt173.pdf].

Fisher, W. P., Jr. (2004, October). Meaning and method in the social sciences. Human Studies: A Journal for Philosophy and the Social Sciences, 27(4), 429-54.

Fisher, W. P., Jr. (2005). Daredevil barnstorming to the tipping point: New aspirations for the human sciences. Journal of Applied Measurement, 6(3), 173-9 [http://www.livingcapitalmetrics.com/images/FisherJAM05.pdf].

Fisher, W. P., Jr. (2007, Summer). Living capital metrics. Rasch Measurement Transactions, 21(1), 1092-3 [http://www.rasch.org/rmt/rmt211.pdf].

Fisher, W. P., Jr. (2009a, November). Invariance and traceability for measures of human, social, and natural capital: Theory and application. Measurement, 42(9), 1278-1287.

Fisher, W. P., Jr. (2009b). NIST Critical national need idea White Paper: Metrological infrastructure for human, social, and natural capital (Tech. Rep., http://www.livingcapitalmetrics.com/images/FisherNISTWhitePaper2.pdf). New Orleans: LivingCapitalMetrics.com.

Fisher, W. P., Jr. (2011). Bringing human, social, and natural capital to life: Practical consequences and opportunities. Journal of Applied Measurement, 12(1), in press.

NIST. (2009, 20 July). Outputs and outcomes of NIST laboratory research. Available: http://www.nist.gov/director/planning/studies.cfm (Accessed 1 March 2011).

Stenner, A. J., & Stone, M. (2003). Item specification vs. item banking. Rasch Measurement Transactions, 17(3), 929-30 [http://www.rasch.org/rmt/rmt173a.htm].

Williamson, G. L. (2008). A text readability continuum for postsecondary readiness. Journal of Advanced Academics, 19(4), 602-632.

Wright, B. D. (1989). Rasch model from counting right answers: Raw scores as sufficient statistics. Rasch Measurement Transactions, 3(2), 62 [http://www.rasch.org/rmt/rmt32e.htm].

Wright, B. D. (1992, Summer). Scores are not measures. Rasch Measurement Transactions, 6(1), 208 [http://www.rasch.org/rmt/rmt61n.htm].

Wright, B. D. (1993). Thinking with raw scores. Rasch Measurement Transactions, 7(2), 299-300 [http://www.rasch.org/rmt/rmt72r.htm].

Wright, B. D. (1999). Common sense for measurement. Rasch Measurement Transactions, 13(3), 704-5  [http://www.rasch.org/rmt/rmt133h.htm].

Creative Commons License
LivingCapitalMetrics Blog by William P. Fisher, Jr., Ph.D. is licensed under a Creative Commons Attribution-Noncommercial-No Derivative Works 3.0 United States License.
Based on a work at livingcapitalmetrics.wordpress.com.
Permissions beyond the scope of this license may be available at http://www.livingcapitalmetrics.com.


One of the ironies of life is that we often overlook the obvious in favor of the obscure. And so one hears of huge resources poured into finding and capitalizing on opportunities that provide infinitesimally small returns, while other opportunities—with equally certain odds of success but far more profitable returns—are completely neglected.

The National Institute for Standards and Technology (NIST) reports returns on investment ranging from 32% to over 400% in 32 metrological improvements made in semiconductors, construction, automation, computers, materials, manufacturing, chemicals, photonics, communications and pharmaceuticals (NIST, 2009). Previous posts in this blog offer more information on the economic value of metrology. The point is that the returns obtained from improvements in the measurement of tangible assets will likely also be achieved in the measurement of intangible assets.

How? With a little bit of imagination, each stage in the development of increasingly meaningful, efficient, and useful measures described in this previous post can be seen as implying a significant return on investment. As those returns are sought, investors will coordinate and align different technologies and resources relative to a roadmap of how these stages are likely to unfold in the future, as described in this previous post. But what would a concrete example of the new value created look like?

The examples I have in mind hinge on the difference between counting and measuring. Counting is a natural and obvious thing to do when we need some indication of how much of something there is. But counting is not measuring (Cooper & Humphry, 2010; Wright, 1989, 1992, 1993, 1999). This is not some minor academic distinction of no practical use or consequence. It is rather the source of the vast majority of the problems we have in comparing outcome and performance measures.

Imagine how things would be if we couldn’t weigh fruit in a grocery store, and all we could do was count pieces. We can tell when eight small oranges possess less overall mass of fruit than four large ones by weighing them; the eight small oranges might weigh .75 kilograms (about 1.6 pounds) while the four large ones come in at 1.0 kilo (2.2 pounds). If oranges were sold by count instead of weight, perceptive traders would buy small oranges and make more money selling them than they could if they bought large ones.
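
The arithmetic can be checked in a few lines. Here is a sketch using the orange weights from above; the prices per kilo and per piece are hypothetical, invented only for illustration:

```python
# Counting vs. measuring: the same two orange purchases priced two ways.
# Prices are hypothetical, for illustration only.

small = {"count": 8, "weight_kg": 0.75}  # eight small oranges
large = {"count": 4, "weight_kg": 1.00}  # four large oranges

price_per_kg = 2.00     # pricing by measured weight
price_per_piece = 0.25  # pricing by count

# Priced by weight, the larger total mass costs more, as it should.
by_weight = {name: round(bag["weight_kg"] * price_per_kg, 2)
             for name, bag in {"small": small, "large": large}.items()}

# Priced by count, the eight small oranges cost more despite holding less fruit.
by_count = {name: bag["count"] * price_per_piece
            for name, bag in {"small": small, "large": large}.items()}

print(by_weight)  # {'small': 1.5, 'large': 2.0}
print(by_count)   # {'small': 2.0, 'large': 1.0}
```

Priced by count, the buyer of the small oranges pays twice as much for three-quarters of the fruit, which is exactly the arbitrage the perceptive trader exploits.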

But we can’t currently arrive so easily at the comparisons we need when we’re buying and selling intangible assets, like those produced as the outcomes of educational, health care, or other services. So I want to walk through a couple of very down-to-earth examples to bring the point home. Today we’ll focus on the simplest version of the story, and tomorrow we’ll take up a little more complicated version, dealing with the counts, percentages, and scores used in balanced scorecard and dashboard metrics of various kinds.

What if you score eight on one reading test and I score four on a different reading test? Who has more reading ability? In the same way that we might be able to tell just by looking that eight small oranges are likely to have less actual orange fruit than four big ones, we might also be able to tell just by looking that eight easy (short, common) words can likely be read correctly with less reading ability than four difficult (long, rare) words can be.
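
This intuition can be made concrete with the kind of model underlying the Rasch measurement work cited below (Wright, 1989): the probability of reading a word correctly depends only on the difference, in logits, between the reader's ability and the word's difficulty. The item difficulties, test lengths, and raw scores in this sketch are invented purely for illustration:

```python
import math

def p_correct(ability, difficulty):
    """Rasch model: success probability from the ability-difficulty gap."""
    return 1.0 / (1.0 + math.exp(-(ability - difficulty)))

def expected_score(ability, difficulties):
    return sum(p_correct(ability, d) for d in difficulties)

def ability_from_score(score, difficulties, lo=-6.0, hi=6.0):
    """Bisection: find the ability whose expected score matches the raw score."""
    for _ in range(60):
        mid = (lo + hi) / 2.0
        if expected_score(mid, difficulties) < score:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2.0

easy_test = [-1.0] * 10  # ten easy (short, common) words
hard_test = [+1.0] * 10  # ten difficult (long, rare) words

# Eight right on the easy test implies *less* ability
# than four right on the hard test.
print(round(ability_from_score(8, easy_test), 2))  # 0.39
print(round(ability_from_score(4, hard_test), 2))  # 0.59
```

The point is not the particular numbers but the ordering: raw counts of right answers mean nothing until item difficulty is taken into account.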

So let’s analyze the difference between buying oranges and buying reading ability. We’ll set up three scenarios for buying reading ability. In all three, we’ll imagine we’re comparing how we buy oranges with the way we would have to go about buying reading ability today if teachers were paid for the gains made on the tests they administer at the beginning and end of the school year.

In the first scenario, the teachers make up their own tests. In the second, the teachers each use a different standardized test. In the third, each teacher uses a computer program that draws questions from the same online bank of precalibrated items to construct a unique test custom tailored to each student. Scenario one is likely the most common in real life. Scenario three is the rarest, but nonetheless describes a situation that has been available to millions of students in the U.S., Australia, and elsewhere for several years. Scenarios one, two and three correspond with developmental levels one, three, and five described in a previous blog entry.

Buying Oranges

When you go into one grocery store and I go into another, we don’t have any oranges with us. When we leave, I have eight and you have four. I have twice as many oranges as you, but yours weigh a kilo, about a third more than mine (.75 kilos).

When we paid for the oranges, the transaction was finished in a few seconds. Neither one of us experienced any confusion, annoyance, or inconvenience in relation to the quality of information we had on the amount of orange fruit we were buying. I did not, however, pay twice as much as you did, even though I had twice as many oranges. In fact, you paid more for yours than I did for mine, in direct proportion to the difference in the measured amounts.

No negotiations were necessary to consummate the transactions, and there was no need for special inquiries about how much orange we were buying. We knew from experience in this and other stores that the prices we paid were comparable with those offered in other times and places. Our information was cheap, as it was printed on the bag of oranges or could be read off a scale, and it was very high quality, as the measures were directly comparable with measures from any other scale in any other store. So, in buying oranges, the impact of information quality on the overall cost of the transaction was so inexpensive as to be negligible.

Buying Reading Ability (Scenario 1)

So now you and I go through third grade as eight year olds. You’re in one school and I’m in another. We have different teachers. Each teacher makes up his or her own reading tests. When we started the school year, we each took a reading test (different ones), and we took another (again, different ones) as we ended the school year.

For each test, your teacher counted up your correct answers and divided by the total number of questions; so did mine. You got 72% correct on the first one, and 94% correct on the last one. I got 83% correct on the first one, and 86% correct on the last one. Your score went up 22 percentage points, much more than the 3 points mine went up. But did you learn more? It is impossible to tell. What if both of your tests were easier—not just for you or for me but for everyone—than both of mine? What if my second test was a lot harder than my first one? On the other hand, what if your tests were harder than mine? Perhaps you did even better than your scores seem to indicate.
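
A small model shows how easily this happens. Assuming a Rasch-type relation between ability and percent correct (an assumption made for illustration, not a description of these particular classroom tests), the very same gain in ability produces very different changes in percent correct depending on how hard the test is:

```python
import math

def percent_correct(ability, difficulty):
    """Expected percent correct on a test of uniform item difficulty (Rasch)."""
    return 100.0 / (1.0 + math.exp(-(ability - difficulty)))

gain = 1.0  # the same ability gain, in logits, for two students

# On a well-targeted test, the gain shows up as a large jump...
print(round(percent_correct(0.5, 0.0) - percent_correct(-0.5, 0.0)))   # 24

# ...but on a too-easy test, the same gain is squeezed against the ceiling.
print(round(percent_correct(0.5, -2.0) - percent_correct(-0.5, -2.0)))  # 11
```

Identical learning can thus look like a big gain on one test and a modest one on another, so percentage-point differences by themselves settle nothing.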

We’ll just exclude from consideration other factors that might come to bear, such as whether your tests were significantly longer or shorter than mine, or if one of us ran out of time and did not answer a lot of questions.

If our parents had to pay the reading teacher at the end of the school year for the gains that were made, how would they tell what they were getting for their money? What if your teacher gave a hard test at the start of the year and an easy one at the end of the year so that you’d have a big gain and your parents would have to pay more? What if my teacher gave an easy test at the start of the year and a hard one at the end, so that a really high price could be put on very small gains? If our parents were to compare their experiences in buying our improved reading ability, they would have a lot of questions about how much improvement was actually obtained. They would be confused and annoyed at how inconvenient the scores are, because they are difficult, if not impossible, to compare. A lot of time and effort might be invested in examining the words and sentences in each of the four reading tests to try to determine how easy or hard they are in relation to each other. Or, more likely, everyone would throw their hands up and pay as little as they possibly can for outcomes they don’t understand.

Buying Reading Ability (Scenario 2)

In this scenario, we are third graders again, in different schools with different reading teachers. Now, instead of our teachers making up their own tests, our reading abilities are measured at the beginning and the end of the school year using two different standardized tests sold by competing testing companies. You’re in a private suburban school that’s part of an independent schools association. I’m in a public school along with dozens of others in an urban school district.

For each test, our parents received a report in the mail showing our scores. As before, we know how many questions we each answered correctly, and, as before, we don’t know which particular questions we got right or wrong. Finally, we don’t know how easy or hard your tests were relative to mine, but we know that the two tests you took were equated, and so were the two I took. That means your tests will show how much reading ability you gained, and so will mine.

But we have one new bit of information we didn’t have before, and that’s a percentile score. Now we know that at the beginning of the year, with a percentile ranking of 72, you performed better than 72% of the other private school third graders taking this test, and at the end of the year you performed better than 76% of them. In contrast, I had percentiles of 84 and 89.

The question we have to ask now is whether our parents are going to pay for the percentile gain or for the actual gain in reading ability. You and I each learned more than our peers did on average, since our percentile scores went up, but this would not work out as a satisfactory way to pay teachers. Averages being averages, if you and I learned more and faster, someone else learned less and slower, so that, in the end, it all balances out. Are we to have teachers paying parents when their children learn less, simply redistributing money in a zero-sum game?

And so, additional individualized reports are sent to our parents by the testing companies. Your tests are equated with each other, so they measure in a comparable unit that ranges from 120 to 480. You had a starting score of 235 and finished the year with a score of 420, for a gain of 185.

The tests I took are comparable and measure in the same unit, too, but not the same unit as your tests measure in. Scores on my tests range from 400 to 1200. I started the year with a score of 790, and finished at 1080, for a gain of 290.

Now the confusion in the first scenario is overcome, in part. Our parents can see that we each made real gains in reading ability. The difficulty levels of the two tests you took are the same, as are the difficulties of the two tests I took. But our parents still don’t know what to pay the teacher because they can’t tell if you or I learned more. You had lower percentiles and test scores than I did, but you are being compared with what is likely a higher scoring group of suburban and higher socioeconomic status students than the urban group of disadvantaged students I’m compared against. And your scores aren’t comparable with mine, so you might have started and finished with more reading ability than I did, or maybe I had more than you. There isn’t enough information here to tell.
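
A tempting shortcut is to compare each gain against its own test's score range, but a quick calculation shows why that settles nothing:

```python
# Each gain as a fraction of its own test series' score range.
# Without equating across the two series, these fractions are not comparable:
# nothing says the two ranges span the same amount of reading ability.

your_gain = 420 - 235   # 185, on a scale running from 120 to 480
my_gain = 1080 - 790    # 290, on a scale running from 400 to 1200

your_fraction = your_gain / (480 - 120)   # fraction of your scale's range
my_fraction = my_gain / (1200 - 400)      # fraction of my scale's range

# By raw gain I "win" (290 > 185); by fraction of range you "win"
# (0.51 > 0.36). Neither comparison is grounded in a common unit.
print(round(your_fraction, 2), round(my_fraction, 2))  # 0.51 0.36
```

Two defensible-looking calculations, two opposite answers: that is what the absence of a common unit costs.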

So, again, the information that is provided is insufficient to the task of settling on a reasonable price for the outcomes obtained. Our parents will again be annoyed and confused by the low quality information that makes it impossible to know what to pay the teacher.

Buying Reading Ability (Scenario 3)

In the third scenario, we are still third graders in different schools with different reading teachers. This time our reading abilities are measured by tests that are completely unique. Every student has a test custom tailored to their particular ability. Unlike the tests in the first and second scenarios, however, now all of the tests have been constructed carefully on the basis of extensive data analysis and experimental tests. Different testing companies are providing the service, but they have gone to the trouble to work together to create consensus standards defining the unit of measurement for any and all reading test items.

For each test, our parents received a report in the mail showing our measures. As before, we know how many questions we each answered correctly. Now, though we don’t know which particular questions we got right or wrong, we can see typical items ordered by difficulty lined up in a way that shows us what kind of items we got wrong, and which kind we got right. And now we also know your tests were equated relative to mine, so we can compare how much reading ability you gained relative to how much I gained. Now our parents can confidently determine how much they should pay the teacher, at least in proportion to their children’s relative measures. If our measured gains are equal, the same payment can be made. If one of us obtained more value, then proportionately more should be paid.
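
Once gains are reported in one common unit, the payment arithmetic becomes trivial. The gain figures and the price per unit below are hypothetical, chosen only to show the proportionality:

```python
# Hypothetical equated gains, now expressed in the same unit for both students.
gains = {"you": 140, "me": 170}

price_per_unit = 0.50  # hypothetical payment per unit of measured gain

payments = {student: gain * price_per_unit for student, gain in gains.items()}
print(payments)  # {'you': 70.0, 'me': 85.0}

# Equal gains would yield equal payments; a larger gain pays proportionately more.
assert payments["me"] / payments["you"] == gains["me"] / gains["you"]
```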

In this third scenario, we have a situation directly analogous to buying oranges. You have a measured amount of increased reading ability that is expressed in the same unit as my gain in reading ability, just as the weights of the oranges are comparable. Further, your test items were not identical with mine, and so the difficulties of the items we took surely differed, just as the sizes of the oranges we bought did.

This third scenario could be made yet more efficient by removing the need for creating and maintaining a calibrated item bank, as described by Stenner and Stone (2003) and in the sixth developmental level in a prior blog post here. Also, additional efficiencies could be gained by unifying the interpretation of the reading ability measures, so that progress through high school can be tracked with respect to the reading demands of adult life (Williamson, 2008).

Comparison of the Purchasing Experiences

In contrast with the grocery store experience, paying for increased reading ability in the first scenario is fraught with low quality information that greatly increases the cost of the transactions. The information is of such low quality that, of course, hardly anyone bothers to try to decipher it: the cost of doing so outweighs any benefit it could yield. So no one knows how much gain in reading ability is obtained, or what a unit gain might cost.

When a school district or educational researchers mount studies to try to find out what it costs to improve the reading ability of third graders by some standardized unit, they find so much unexplained variation in the costs that they, too, end up with more questions than answers.

But in the grocery store we don’t place the cost of making the value comparison on the consumer or the merchant. Instead, society as a whole picks up the cost by funding the creation and maintenance of consensus standard metrics. Until we take up the task of doing the same thing for intangible assets, we cannot expect human, social, and natural capital markets to obtain the efficiencies we take for granted in markets for tangible assets and property.

References

Cooper, G., & Humphry, S. M. (2010). The ontological distinction between units and entities. Synthese. doi:10.1007/s11229-010-9832-1.

NIST. (2009, 20 July). Outputs and outcomes of NIST laboratory research. Available: http://www.nist.gov/director/planning/studies.cfm (Accessed 1 March 2011).

Stenner, A. J., & Stone, M. (2003). Item specification vs. item banking. Rasch Measurement Transactions, 17(3), 929-30 [http://www.rasch.org/rmt/rmt173a.htm].

Williamson, G. L. (2008). A text readability continuum for postsecondary readiness. Journal of Advanced Academics, 19(4), 602-632.

Wright, B. D. (1989). Rasch model from counting right answers: Raw scores as sufficient statistics. Rasch Measurement Transactions, 3(2), 62 [http://www.rasch.org/rmt/rmt32e.htm].

Wright, B. D. (1992, Summer). Scores are not measures. Rasch Measurement Transactions, 6(1), 208 [http://www.rasch.org/rmt/rmt61n.htm].

Wright, B. D. (1993). Thinking with raw scores. Rasch Measurement Transactions, 7(2), 299-300 [http://www.rasch.org/rmt/rmt72r.htm].

Wright, B. D. (1999). Common sense for measurement. Rasch Measurement Transactions, 13(3), 704-5  [http://www.rasch.org/rmt/rmt133h.htm].

How bad will the financial crises have to get before…?

April 30, 2010

More and more states and nations around the world face the possibility of defaulting on their financial obligations. The financial crises are of epic historical proportions. This is a disaster of the first order. And yet, oddly, we have the solutions and preventative measures we need at our fingertips, though no one knows about them or is looking for them.

So I am persuaded once again to wonder whether there might now be some real interest in the possibilities of capitalizing on

  • measurement’s well-known capacity for reducing transaction costs by improving information quality and reducing information volume;
  • instruments calibrated to measure in constant units (not ordinal ones) within known error ranges (not as though the measures are perfectly precise) with known data quality;
  • measures made meaningful by their association with invariant scales defined in terms of the questions asked;
  • adaptive instrument administration methods that make all measures equally precise by targeting the questions asked;
  • judge calibration methods that remove the person rating performances as a factor influencing the measures;
  • the metaphor of transparency, realized by calibrating instruments that we look right through to see the thing measured (risk, governance, abilities, health, performance, etc.);
  • efficient markets for human, social, and natural capital by means of the common currencies of uniform metrics, calibrated instrumentation, and metrological networks;
  • the means available for tuning the instruments of the human, social, and environmental sciences to well-tempered scales that enable us to more easily harmonize, orchestrate, arrange, and choreograph relationships;
  • our understandings that universal human rights require universal uniform measures, that fair dealing requires fair measures, and that our measures define who we are and what we value; and, last but very far from least,
  • the power of love: the back-and-forth of probing questions and honest answers in caring social intercourse plants seminal ideas in fertile minds, ideas that can be nurtured to maturity and Socratically midwifed as living meaning born into supportive ecologies of caring relations.
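
The adaptive administration mentioned in the fourth bullet can be sketched in a few lines: repeatedly present the item whose calibrated difficulty is closest to the current ability estimate, and adjust the estimate after each response. Everything here (the item bank, the crude step-update rule) is a toy illustration, not any operational testing algorithm:

```python
import math, random

def p_correct(ability, difficulty):
    # Rasch model probability of a correct response
    return 1.0 / (1.0 + math.exp(-(ability - difficulty)))

def adaptive_test(true_ability, bank, n_items=20, step=0.5, seed=0):
    """Toy adaptive test: always ask the item nearest the current estimate."""
    rng = random.Random(seed)
    remaining = list(bank)
    estimate = 0.0
    for _ in range(n_items):
        item = min(remaining, key=lambda d: abs(d - estimate))
        remaining.remove(item)
        correct = rng.random() < p_correct(true_ability, item)
        # Crude stepwise update; real systems re-estimate by maximum likelihood.
        estimate += step if correct else -step
        step = max(0.1, step * 0.9)  # shrink steps as evidence accumulates
    return estimate

bank = [d / 10.0 for d in range(-30, 31)]  # calibrated difficulties, -3 to +3
print(round(adaptive_test(1.2, bank), 2))
```

Because every question is targeted near the examinee's current estimate, each response carries close to maximal information, which is how adaptive administration makes all measures comparably precise.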

How bad do things have to get before we systematically and collectively implement the long-established and proven methods we have at our disposal? It is the most surreal kind of schizophrenia or passive-aggressive avoidance pathology to keep on tormenting ourselves with problems for which we have solutions.

For more information on these issues, see prior posts here, the extensive documentation provided, and http://www.livingcapitalmetrics.com.

Creative Commons License
LivingCapitalMetrics Blog by William P. Fisher, Jr., Ph.D. is licensed under a Creative Commons Attribution-Noncommercial-No Derivative Works 3.0 United States License.
Based on a work at livingcapitalmetrics.wordpress.com.
Permissions beyond the scope of this license may be available at http://www.livingcapitalmetrics.com.

Parameterizing Perfection: Practical Applications of a Mathematical Model of the Lean Ideal

April 2, 2010

To properly pursue perfection, we need to parameterize it. That is, taking perfection as the ideal, unattainable standard against which we judge our performance is equivalent to thinking of it as a mathematical model. Organizations are intended to realize their missions independent of the particular employees, customers, suppliers, challenges, products, etc. they happen to engage with at any particular time. Organizational performance measurement (Spitzer, 2007) ought to then be designed in terms of a model that posits, tests for, and capitalizes on the always imperfectly realized independence of those parameters.

Lean thinking (Womack & Jones, 1996) focuses on minimizing waste and maximizing value. At every point at which resources are invested in processes, services, or products, the question is asked, “What value is added here?” Resources are wasted when no value is added, when they can be removed with no detrimental effect on the value of the end product. In their book, Natural Capitalism: Creating the Next Industrial Revolution, Hawken, Lovins, and Lovins (1999, p. 133) say

“Lean thinking … changes the standard for measuring corporate success. … As they [Womack and Jones] express it: ‘Our earnest advice to lean firms today is simple. To hell with your competitors; compete against perfection by identifying all activities that are muda [the Japanese term for waste used in Toyota’s landmark quality programs] and eliminating them. This is an absolute rather than a relative standard which can provide the essential North Star for any organization.’”

Further, every input should “be presumed waste until shown otherwise.” A constant, ongoing, persistent pressure for removing waste is the basic characteristic of lean thinking. Perfection is never achieved, but it aptly serves as the ideal against which progress is measured.

Lean thinking sounds a lot like a mathematical model, though it does not seem to have been written out in a mathematical form, or used as the basis for calibrating instruments, estimating measures, evaluating data quality, or for practical assessments of lean organizational performance. The closest anyone seems to have come to parameterizing perfection is in the work of Genichi Taguchi (Ealey, 1988), which has several close parallels with Rasch measurement (Linacre, 1993). But meaningful and objective quantification, as required and achieved in the theory and practice of fundamental measurement (Andrich, 2004; Bezruczko, 2005; Bond & Fox, 2007; Smith & Smith, 2004; Wilson, 2005; Wright, 1999), in fact asserts abstract ideals of perfection as models of organizational, social, and psychological processes in education, health care, marketing, etc. These models test the extent to which outcomes remain invariant across examination or survey questions, across teachers, students, schools, and curricula, or across treatment methods, business processes, or policies.
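
What would testing such invariance look like in the simplest possible terms? In the sketch below, two samples with very different ability distributions respond to the same five items under a Rasch model; item difficulties estimated separately from each sample (here by crude centered log-odds, not by the maximum-likelihood methods real Rasch software uses) should nonetheless agree:

```python
import math, random

random.seed(42)

def simulate(abilities, difficulties):
    """Rasch-model response matrix: rows are persons, columns are items."""
    return [[random.random() < 1.0 / (1.0 + math.exp(-(a - d)))
             for d in difficulties]
            for a in abilities]

def item_logits(data):
    """Centered log-odds of failure per item: a crude difficulty estimate."""
    n = len(data)
    raw = []
    for j in range(len(data[0])):
        s = sum(row[j] for row in data)
        s = min(max(s, 1), n - 1)       # avoid log(0) at the extremes
        raw.append(math.log((n - s) / s))
    mean = sum(raw) / len(raw)
    return [r - mean for r in raw]      # center: the origin is arbitrary

difficulties = [-1.5, -0.5, 0.0, 0.5, 1.5]
low_group = [random.gauss(-1.0, 1.0) for _ in range(500)]   # low-ability sample
high_group = [random.gauss(+1.0, 1.0) for _ in range(500)]  # high-ability sample

est_low = item_logits(simulate(low_group, difficulties))
est_high = item_logits(simulate(high_group, difficulties))

# Despite the two-logit difference in sample means, the item orderings and
# spacings estimated from the two groups should roughly coincide.
print([round(e, 1) for e in est_low])
print([round(e, 1) for e in est_high])
```

Because the two samples share the same ability dispersion, this simple estimator compresses both sets of estimates by the same factor; the operative check is that the two sets agree with each other up to sampling noise, which is what invariance requires.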

Though as yet implemented only to a limited extent in business (Drehmer, Belohlav, & Coye, 2000; Drehmer & Deklava, 2001; Lunz & Linacre, 1998; Salzberger, 2009), advanced measurement’s potential rewards are great. Fundamental measurement theory has been successfully applied in research and practice thousands of times over the last 40 years and more, including in very large scale assessments and licensure/certification applications (Adams, Wu, & Macaskill, 1997; Masters, 2007; Smith, Julian, Lunz, et al., 1994). These successes speak to an opportunity for making broad improvements in outcome measurement that could provide more coherent product definition, and significant associated opportunities for improving product quality and the efficiency with which it is produced, in the manner that has followed from the use of fundamental measures in other industries.

Of course, processes and outcomes are never implemented or obtained with perfect consistency; they would be only in a perfect world. But to pursue perfection, we need to parameterize it. In other words, to raise the bar in any area of performance assessment, we have to know not only which direction is up, but also when the bar has been raised far enough. Yet we cannot tell up from down, we do not know how much to raise the bar, and we cannot properly evaluate the effects of lean experiments when we have no way of locating measures on a number line that embodies the lean ideal.

To think together collectively in ways that lead to significant new innovations, to rise above what Jaron Lanier calls the “global mush” of confused and self-confirming hive thinking, we need the common languages of widely accepted fundamental measures of the relevant processes and outcomes, measures that remain constant across samples of customers, patients, employees, students, etc., and across products, sales techniques, curricula, treatment processes, assessment methods, and brands of instrument.

We are all well aware that the consequences of not knowing where the bar is, of not having product definitions, can be disastrous. In many respects, as I’ve said previously in this blog, the success or failure of health care reform hinges on getting measurement right. The Institute of Medicine’s report of several years ago, To Err Is Human, stresses that system failures pose the greatest threat to safety in health care because they lead to human errors. When a system as complex as health care lacks a standard product definition, and product delivery is fragmented across multiple providers with different amounts and kinds of information in different settings, the system becomes dangerously cumbersome and over-complicated, with unacceptably wide variations and errors in its processes and outcomes, to say nothing of its economic inefficiency.

In contrast with the widespread use of fundamental measures in the product definitions of other industries, health care researchers typically implement neither the longstanding, repeatedly proven, and mathematically rigorous models of fundamental measurement theory nor the metrological networks through which reference standard metrics are engineered. Most industries carefully define, isolate, and estimate the parameters of their products, doing so in ways 1) that ensure industry-wide comparability and standardization, and 2) that facilitate continuous product improvement by revealing multiple opportunities for enhancement. Where organizations in other industries manage by metrics and thereby keep their eyes on the ball of product quality, health care organizations often manage only their own internal processes and cannot in fact bring the product quality ball into view.

In his message concerning the Institute for Healthcare Improvement’s Pursuing Perfection project a few years ago, Don Berwick, like others (Coye, 2001; Coye & Detmer, 1998), observed that health care does not yet have an organization setting new standards in the way that Toyota did for the auto industry in the 1970s. It still doesn’t, of course. Given the difference between the auto and health care industries’ uses of fundamental measures of product quality, and their associated abilities to keep their eyes on the quality ball, is it any wonder, then, that no one in health care has yet hit a home run? It may well be that no one will until reference standard measures of product quality are devised.

The need for reference standard measures in uniform data systems is crucial, and the methods for obtaining them are widely available and well-known. So what is preventing the health care industry from adopting and deploying them? Part of the answer is the cost of the initial investment required. In 1980, metrology comprised about six percent of the U.S. gross national product (Hunter, 1980). In the period from 1981 to 1994, annual expenditures on research and development in the U.S. were less than three percent of the GNP, and non-defense R&D was about two percent (NIST Subcommittee on Research, National Science and Technology Council, 1996). These costs, however, must be viewed as investments from which high rates of return can be obtained (Barber, 1987; Gallaher, Rowe, Rogozhin, et al., 2007; Swann, 2005).

For instance, the U.S. National Institute of Standards and Technology estimated the economic impact of 12 areas of research in metrology, in four broad areas including semiconductors, electrical calibration and testing, optical industries, and computer systems (NIST, 1996, Appendix C; also see NIST, 2003). The median rate of return in these 12 areas was 147 percent, and returns ranged from 41 to 428 percent. The report notes that these results compare favorably with those obtained in similar studies of return rates from other public and private research and development efforts. Even if health care metrology produces only a small fraction of the return rate produced in physical metrology, its economic impact could still amount to billions of dollars annually. The proposed pilot projects therefore focus on determining what an effective health care outcomes metrology system should look like. What should its primary functions be? What should it cost? What rates of return could be expected from it?

Metrology, the science of measurement (Pennella, 1997), requires 1) that instruments be calibrated within individual laboratories so as to isolate and estimate the values of the required parameters (Wernimont, 1978); and 2) that individual instruments’ capacities to provide the same measure for the same amount, and so be traceable to a reference standard, be established and monitored via interlaboratory round-robin trials (Mandel, 1978).

Fundamental measurement has already succeeded in demonstrating the viability of reference standard measures of health outcomes, measures whose meaningfulness does not depend on the particular samples of items employed or patients measured. Though this work succeeds as far as it goes, it has been done in a context that lacks any sense of the need for metrological infrastructure. Health care needs networks of scientists and technicians collaborating not only in the first, intralaboratory phase of metrological work, but also in the interlaboratory trials through which different brands or configurations of instruments intended to measure the same variable would be tuned to harmoniously produce the same measure for the same amount.

Implementation of the two phases of metrological innovation in health care would then begin with the intralaboratory calibration of existing and new instruments for measuring overall organizational performance, quality of care, and patients’ health status, quality of life, functionality, etc. The second phase takes up the interlaboratory equating of these instruments, and the concomitant deployment of reference standard units of measurement throughout a health care system and the industry as a whole. To answer questions concerning health care metrology’s potential returns on investment, the costs for, and the savings accrued from, accomplishing each phase of each pilot will be tracked or estimated.

When instruments measuring in universally uniform, meaningful units are put in the hands of clinicians, a new scientific revolution will occur in medicine, analogous to the earlier ones associated with the introduction of the thermometer and the instruments of optometry and the clinical laboratory. Such tools will multiply many times over the quality improvement methods used by Brent James, touted as holding the key to health care reform in a recent New York Times profile. Instead of implicitly hypothesizing models of perfection and assessing performance against them informally, we need a new science that systematically implements the lean ideal on industry-wide scales. The future belongs to those who master these techniques.

References

Adams, R. J., Wu, M. L., & Macaskill, G. (1997). Scaling methodology and procedures for the mathematics and science scales. In M. O. Martin & D. L. Kelly (Eds.), Third International Mathematics and Science Study Technical Report: Vol. 2: Implementation and Analysis – Primary and Middle School Years (pp. 111-145). Chestnut Hill, MA: Boston College.

Andrich, D. (2004, January). Controversy and the Rasch model: A characteristic of incompatible paradigms? Medical Care, 42(1), I-7–I-16.

Barber, J. M. (1987). Economic rationale for government funding of work on measurement standards. In R. Dobbie, J. Darrell, K. Poulter & R. Hobbs (Eds.), Review of DTI work on measurement standards (p. Annex 5). London: Department of Trade and Industry.

Berwick, D. M., James, B., & Coye, M. J. (2003, January). Connections between quality measurement and improvement. Medical Care, 41(1 (Suppl)), I30-38.

Bezruczko, N. (Ed.). (2005). Rasch measurement in health sciences. Maple Grove, MN: JAM Press.

Bond, T., & Fox, C. (2007). Applying the Rasch model: Fundamental measurement in the human sciences, 2d edition. Mahwah, New Jersey: Lawrence Erlbaum Associates.

Coye, M. J. (2001, November/December). No Toyotas in health care: Why medical care has not evolved to meet patients’ needs. Health Affairs, 20(6), 44-56.

Coye, M. J., & Detmer, D. E. (1998). Quality at a crossroads. The Milbank Quarterly, 76(4), 759-68.

Drehmer, D. E., Belohlav, J. A., & Coye, R. W. (2000, Dec). An exploration of employee participation using a scaling approach. Group & Organization Management, 25(4), 397-418.

Drehmer, D. E., & Deklava, S. M. (2001, April). A note on the evolution of software engineering practices. Journal of Systems and Software, 57(1), 1-7.

Ealey, L. A. (1988). Quality by design: Taguchi methods and U.S. industry. Dearborn MI: ASI Press.

Gallaher, M. P., Rowe, B. R., Rogozhin, A. V., Houghton, S. A., Davis, J. L., Lamvik, M. K., et al. (2007). Economic impact of measurement in the semiconductor industry (Tech. Rep. No. 07-2). Gaithersburg, MD: National Institute for Standards and Technology.

Hawken, P., Lovins, A., & Lovins, H. L. (1999). Natural capitalism: Creating the next industrial revolution. New York: Little, Brown, and Co.

Hunter, J. S. (1980, November). The national system of scientific measurement. Science, 210(21), 869-874.

Linacre, J. M. (1993). Quality by design: Taguchi and Rasch. Rasch Measurement Transactions, 7(2), 292.

Lunz, M. E., & Linacre, J. M. (1998). Measurement designs using multifacet Rasch modeling. In G. A. Marcoulides (Ed.), Modern methods for business research. Methodology for business and management (pp. 47-77). Mahwah, New Jersey: Lawrence Erlbaum Associates, Inc.

Mandel, J. (1978, December). Interlaboratory testing. ASTM Standardization News, 6, 11-12.

Masters, G. N. (2007). Special issue: Programme for International Student Assessment (PISA). Journal of Applied Measurement, 8(3), 235-335.

National Institute for Standards and Technology (NIST). (1996). Appendix C: Assessment examples. Economic impacts of research in metrology. In C. o. F. S. Subcommittee on Research (Ed.), Assessing fundamental science: A report from the Subcommittee on Research, Committee on Fundamental Science. Washington, DC: National Standards and Technology Council [http://www.nsf.gov/statistics/ostp/assess/nstcafsk.htm#Topic%207; last accessed 18 February 2008].

National Institute for Standards and Technology (NIST). (2003, 15 January). Outputs and outcomes of NIST laboratory research. Retrieved 12 July 2009, from http://www.nist.gov/director/planning/studies.htm#measures.

Pennella, C. R. (1997). Managing the metrology system. Milwaukee, WI: ASQ Quality Press.

Salzberger, T. (2009). Measurement in marketing research: An alternative framework. Northampton, MA: Edward Elgar.

Smith, R. M., Julian, E., Lunz, M., Stahl, J., Schulz, M., & Wright, B. D. (1994). Applications of conjoint measurement in admission and professional certification programs. International Journal of Educational Research, 21(6), 653-664.

Smith, E. V., Jr., & Smith, R. M. (2004). Introduction to Rasch measurement. Maple Grove, MN: JAM Press.

Spitzer, D. (2007). Transforming performance measurement: Rethinking the way we measure and drive organizational success. New York: AMACOM.

Swann, G. M. P. (2005, 2 December). John Barber’s pioneering work on the economics of measurement standards [Electronic version]. Notes for Workshop in Honor of John Barber held at University of Manchester. Retrieved from http://www.cric.ac.uk/cric/events/jbarber/swann.pdf.

Wernimont, G. (1978, December). Careful intralaboratory study must come first. ASTM Standardization News, 6, 11-12.

Wilson, M. (2005). Constructing measures: An item response modeling approach. Mahwah, New Jersey: Lawrence Erlbaum Associates.

Womack, J. P., & Jones, D. T. (1996, Sept./Oct.). Beyond Toyota: How to root out waste and pursue perfection. Harvard Business Review, 74, 140-58.

Wright, B. D. (1999). Fundamental measurement for psychology. In S. E. Embretson & S. L. Hershberger (Eds.), The new rules of measurement: What every educator and psychologist should know (pp. 65-104 [http://www.rasch.org/memo64.htm]). Hillsdale, New Jersey: Lawrence Erlbaum Associates.

Creative Commons License
LivingCapitalMetrics Blog by William P. Fisher, Jr., Ph.D. is licensed under a Creative Commons Attribution-Noncommercial-No Derivative Works 3.0 United States License.
Based on a work at livingcapitalmetrics.wordpress.com.
Permissions beyond the scope of this license may be available at http://www.livingcapitalmetrics.com.

Tuning our assessment instruments to harmonize our relationships

January 10, 2010

“Music is the art of measuring well.”
Augustine of Hippo

With the application of Rasch’s probabilistic models for measurement, we are tuning the instruments of the human, social, and environmental sciences, with the aim of being able to harmonize relationships of all kinds. This is not an empty metaphor: the new measurement scales are mathematically equivalent to the well-tempered scales, and later the 12-tone equal-temperament scale, that were introduced in response to the technological advances associated with the piano.

The idea that the regular patterns found in music are akin to those found in the world at large and in the human psyche is an ancient one. The Pythagoreans held that

“…music’s concordances [were] the covenants that tones form under heaven’s watchful eye. For the Pythagoreans, though, the importance of these special proportions went well beyond music. They were signs of the natural order, like the laws governing triangles; music’s rules were simply the geometry governing things in motion: not only vibrating strings but also celestial bodies and the human soul” (Isacoff, 2001, p. 38).

I have already elsewhere in this blog elaborated on the progressive expansion of geometrical thinking into natural laws and measurement models; now, let us turn our attention to music as another fertile source of the analogies that have proven so productive over the course of the history of science (also explored elsewhere in this blog).

You see, tuning systems up to the invention of the piano (1709) required instruments to be retuned for performers to play in different keys. Each key had a particular characteristic color to its sound. And not only that: some note pairings (such as the notorious wolf fifth that arises in mean-tone tuning) were so dissonant that they were said to howl, and were referred to as wolves. Composers went out of their way to avoid putting these notes together, or reserved them for rare circumstances calling for especially dramatic effects.

Dozens of tuning systems had been proposed in the 17th century, and the concept of an equal-temperament scale was in general currency at the time of the piano’s invention. Bach is said to have tuned his own keyboards so that he could switch keys fluidly within a composition. His “Well-Tempered Clavier” (published in 1722) demonstrates how a well temperament allows one to play in all 24 major and minor keys without retuning the instrument. Bach is also said to have deliberately used wolf note pairings to show that they did not howl the way they did under mean-tone tuning.

Equal temperament is not equal-interval in the Pythagorean sense of same-sized changes in the frequencies of vibrating strings. Rather, those frequencies are scaled using the natural logarithm, and that logarithmic scale is what is divided into equal intervals. This is precisely what is also done in Rasch scaling algorithms applied to test, assessment, and survey data in contemporary measurement models.
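
The arithmetic here can be sketched in a few lines. In the minimal Python example below, A4 = 440 Hz is a conventional reference pitch assumed for illustration; the point is that the raw frequency steps between adjacent semitones are unequal, while the steps in log-frequency are all identical:

```python
import math

# 12-tone equal temperament: each semitone multiplies frequency by 2**(1/12),
# so the logarithms of the frequencies are equally spaced even though the
# raw frequency differences grow as pitch rises.
A4 = 440.0  # reference pitch in Hz (a common convention, assumed here)

def tet_frequency(semitones_from_a4: int) -> float:
    """Frequency of the note n equal-tempered semitones above (or below) A4."""
    return A4 * 2 ** (semitones_from_a4 / 12)

freqs = [tet_frequency(n) for n in range(13)]  # A4 up one octave to A5

# Raw-frequency steps between adjacent notes are unequal...
hz_steps = [freqs[i + 1] - freqs[i] for i in range(12)]

# ...but steps in log-frequency are all identical: log(2)/12.
log_steps = [math.log(freqs[i + 1]) - math.log(freqs[i]) for i in range(12)]
```

Dividing the logarithmic scale, rather than the frequency scale, into equal parts is what makes every key interchangeable with every other.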

Pianos are tuned from middle C outward, with each successive pair of notes to the left and right tuned the same distance away from C. As the tuner moves further and further from C, the unit distance from middle C is slightly adjusted, or stretched, so that the sharps and flats converge on the same black keys (C-sharp and D-flat, for instance, become the same note).

In effect, the tuner is taking the logarithm of the note frequencies. In statistics, the logarithm of the odds formed from a bounded percentage scale is called a two-stretch transformation, because it pulls both ends of the distribution away from the center, stretching the extremes more than the regions nearer the middle. This stretching effect is of huge importance to measurement because it makes it possible for different collections of questions addressing the same thing to measure in the same unit.

That is, the instrument dependency of summed ratings, counts of right answers, or categorical response frequencies is like a key-dependent tuning system. The logarithmic transformation modulates transitions across musical notes in such a way as to make different keys work in the same scaling system, and it likewise modulates transitions across different reading tests so that they all measure in a unit that keeps the same size and the same meaning.
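
The common-unit claim can be sketched with the dichotomous Rasch model. The model itself is standard; the ability and difficulty values below are illustrative, not estimates from any real test:

```python
import math

# Dichotomous Rasch model: P(correct) = exp(b - d) / (1 + exp(b - d)) for a
# person of ability b and an item of difficulty d, both in logits. The
# log-odds of success is then simply b - d, so ability and difficulty
# separate additively on one common scale. All values are illustrative.
def p_correct(ability: float, difficulty: float) -> float:
    return 1.0 / (1.0 + math.exp(-(ability - difficulty)))

ability = 1.0                     # one person, in logits

easy_test = [-1.0, -0.5, 0.0]     # item difficulties of an easier test
hard_test = [0.5, 1.0, 1.5]       # item difficulties of a harder test

# Expected raw scores depend on which test was taken -- instrument-dependent,
# like a key-dependent tuning...
easy_expected = sum(p_correct(ability, d) for d in easy_test)
hard_expected = sum(p_correct(ability, d) for d in hard_test)

# ...but the log-odds of success on any single item recovers the same
# ability, no matter which test the item belongs to.
def logit(p: float) -> float:
    return math.log(p / (1 - p))

recovered = logit(p_correct(ability, 0.5)) + 0.5  # logit + difficulty = ability
```

The same person earns very different raw scores on the two tests, yet the logit transformation returns one and the same measure from either.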

Now, many people fear that the measurement of human abilities, attitudes, health, etc. must inherently involve a meaningless reduction of richly varied and infinite experience to a number. Many people are violently opposed to any suggestion that this could be done in a meaningful and productive way. However, is not music the most emotionally powerful and subtle art form in existence, and simultaneously also incredibly high-tech and mathematical? Even if you ignore the acoustical science and the studio electronics, the instruments themselves embody some of the oldest and most intensively studied mathematical principles in existence.

And, yes, these principles are used in TV, movies, dentists’ offices and retail stores to help create sympathies and environments conducive to the, sometimes painful and sometimes crass, commercial tasks at hand. But music is also by far the most popular art form, and it is accessible everywhere to everyone any time precisely as a result of the very technologies that many consider anathema in the human and social sciences.

But it seems to me that the issue is far more a matter of who controls the technology than it is one of the technology itself. In the current frameworks of the human and social sciences, and of the economic domains of human, social, and natural capital, whoever owns the instrument owns the measurement system and controls the interpretation of the data, since each instrument measures in its own unit. But in the new Rasch technology’s open architecture, anyone willing to master the skills needed can build instruments tuned to the reference standard, ubiquitous and universally available scale. What is more, the demand that all instruments measuring the same thing must harmonize will transfer control of data interpretation to a public sphere in which experimental reproducibility trumps authoritarian dictates.

This open standards system will open the door to creativity and innovation on a par with what musicians take for granted. Common measurement scales will allow people to jam out in an infinite variety of harmonic combinations, instrumental ensembles, choreographed moves, and melodic and rhythmic patterns. Just as music ranges from jazz to symphonic, rock to punk to hiphop to blues to country to techno, or atonal to R & B, so, too, do our relationships. A whole new world of potential innovations opens up in the context of methods for systematically evaluating naturally occurring and deliberately orchestrated variations in organizations, management, HR training methods, supply lines, social spheres, environmental quality, etc.

The current business world’s near-complete lack of comparable information on human, social, and natural capital is oppressive. It puts us in the situation of never knowing what we get for our money in education and healthcare, even as costs in these areas spiral into absolutely stratospheric levels. Having instruments in every area of education, health care, recreation, employment, and commerce tuned to common scales will be liberating, not oppressive. Having clear, reproducible, meaningful, and publicly negotiated measures of educational and clinical care outcomes, of productivity and innovation, and of trust, loyalty, and environmental quality will be a boon.

In conclusion, consider one more thing. About 100 years ago, a great many musicians and composers revolted against what they felt were the onerous and monotonous constraints of the equal-tempered tuning system. Thus we had an explosion of tonal and rhythmic innovations across the entire range of musical artistry. With the global popularity of world music’s blending of traditional forms with current technology and Western forms, the use of alternatives to equal temperament has never been greater. I read once that Joni Mitchell has used something like 32 different tunings in her recordings. Jimi Hendrix and Neil Young are also famous for using unique tunings to define their trademark sounds. What would the analogy of this kind of creativity be in the tuning of tests and surveys? I don’t know, but I’m looking forward to seeing it, experiencing it, and maybe even contributing to it. Les Paul may not be the only innovator in instrument design who figured out not only how to make it easy for others to express themselves in measured tones, but who also knew how to rock out his own yayas!

References and further reading:

Augustine of Hippo. (1947/2002). On music. In Writings of Saint Augustine Volume 2. Immortality of the soul and other works. (L. Schopp, Trans.) (pp. 169-384). New York: Catholic University of America Press.

Barbour, J. M. (2004/1954). Tuning and temperament: A historical survey. Mineola, NY: Dover Publications.

Heelan, P. A. (1979). Music as basic metaphor and deep structure in Plato and in ancient cultures. Journal of Social and Biological Structures, 2, 279-291.

Isacoff, S. M. (2001). Temperament: The idea that solved music’s greatest riddle. New York: Alfred A. Knopf.

Jorgensen, O. (1991). Tuning: Containing the perfection of eighteenth-century temperament, the lost art of nineteenth-century temperament and the science of equal temperament. East Lansing, Michigan: Michigan State University.

Kivy, P. (2002). Introduction to a philosophy of music. Oxford, England: Oxford University Press.

Mathieu, W. A. (1997). Harmonic experience: Tonal harmony from its natural origins to its modern expression. Rochester, Vermont: Inner Traditions International.

McClain, E. (1984/1976). The myth of invariance: The origin of the gods, mathematics and music from the Rg Veda to Plato (P. A. Heelan, Ed.). York Beach, Maine: Nicolas-Hays, Inc.

Russell, G. (2001/1953). Lydian chromatic concept of tonal organization (4th ed.). Brookline, MA: Concept Publishing.

Stone, M. (2002, Autumn). Musical temperament. Rasch Measurement Transactions, 16(2), 873.

Sullivan, A. T. (1985). The seventh dragon: The riddle of equal temperament. Lake Oswego, OR: Metamorphous Press.


Just posted on the LinkedIn Human Performance Discussion on the art and science of measurement

December 16, 2009

Great question and discussion!

Business performance measurement and management ought to be a blend of art and science akin to music: the most intuitive and absorbing of the arts, and simultaneously reliant on some of the most high-tech precision instrumentation available.

Unfortunately, the vast majority of the numbers used in HR and marketing are not scientific, and this is true in two ways, despite the fact that highly scientific instruments for intangibles measurement have been available for decades. First, to obtain measures of a qualitative substance that really add up the way numbers do, they have to be read off a calibrated instrument, and most surveys and assessments used in business are not calibrated. Second, once instruments measuring a particular thing are calibrated, to be fully scientific they all have to be linked together in a metric system, so that everyone everywhere thinks and acts in a common language.
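
How the linking step works can be sketched with a common-item equating of two hypothetical surveys. All item names and calibration values below are invented for illustration; the technique itself (estimating a linking constant from items the instruments share) is standard:

```python
# Hypothetical sketch of common-item linking: two surveys that measure the
# same construct share three items; the mean difference between their
# calibrations of those shared items is the constant that places both
# instruments on one scale. Names and logit values are illustrative only.
shared_items = ["q1", "q2", "q3"]
survey_a = {"q1": -0.5, "q2": 0.0, "q3": 0.8}  # calibrations on survey A's scale
survey_b = {"q1": -0.2, "q2": 0.3, "q3": 1.1}  # same items, survey B's scale

shift = sum(survey_b[q] - survey_a[q] for q in shared_items) / len(shared_items)

def b_to_a(measure_on_b: float) -> float:
    """Re-express a survey-B measure in survey-A units."""
    return measure_on_b - shift
```

Once the shift is known, any measure produced by either instrument can be read in the other's units, which is the beginning of a common language.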

The advantages of taking the trouble to calibrate and link instruments are numerous. The history of industry is the history of the ways we have capitalized on standardized technologies. A whole new economy is implied by our capacity to vastly improve the measurement and management of human, social, and natural capital.

The research on the integration of qualitative substance and quantitative precision in meaningful measurement is extensive. My most recent publication appeared in the November 2009 issue of Measurement (Elsevier): doi:10.1016/j.measurement.2009.03.014.

For more information, see some of my published papers and the references cited in them at http://www.livingcapitalmetrics.com/researchpapers.html.


Just posted on www.economist.com in response to Sept 26 Schumpeter article

September 29, 2009

Let’s cut through the Gordian Knot to the real issue. That we manage what we measure is as close to an absolute truth as there ever was. What got us into this mess was the inadequacy of the vast majority of our measures. So-called “measures” that only get in the way of management are a sign that new standards, criteria, and methods of measurement are needed.

The core issue we face is how to transform socialized externalities into capitalized internalities. Transaction costs are the most important and largest costs in any economic exchange, and we reduce and control them via measurement. Human, social, and natural capital transaction costs are virtually uncontrolled and unmeasured. We need a metric system for universally uniform measures of abilities and skills, health, motivation, loyalty and trust, and environmental quality. And we needed it yesterday. But who is working on it? Who is talking about it? Most importantly, who is taking advantage of the huge strides made in measurement science over the last 50 years, strides that have made measurement far more rigorous, practical, and flexible than anyone in business seems to know?

As to business being an art, so is music, but music is played on and reproduced by some of the highest technology and finest precision instrumentation around. What we need to do is tune the instruments of the management arts and sciences so that we can harmonize our relationships, get with the beat, and sing the melodies we feel in our hearts and souls. For more information, see http://www.livingcapitalmetrics.com, or my blog at https://livingcapitalmetrics.wordpress.com.
