Posts Tagged ‘amounts’

Number lines, counting, and measuring in arithmetic education

July 29, 2011

Over the course of two days spent at a meeting on mathematics education, a question started to form in my mind, one I don’t know how to answer, and to which there may be no answer. I’d like to try to formulate what’s on my mind in writing, and see if it’s just nonsense, a curiosity, some old debate that’s been long since resolved, issues too complex to try to use in elementary education, or something we might actually want to try to do something about.

The question stems from my long experience in measurement. It is one of the basic principles of the field that counting and measuring are different things (see the list of publications on this, below). Counts don’t behave like measures unless the things being counted are units of measurement established as equal ratios or intervals that remain invariant independent of the local particulars of the sample and instrument.

Plainly, if you count two groups of small and large rocks or oranges, the two groups can have the same number of things and the group with the larger things will have more rock or orange than the group with the smaller things. But the association of counting numbers and arithmetic operations with number lines insinuates and reinforces, to the point of automatic intuition, the false idea that numbers always represent quantity. I know that number lines are supposed to represent an abstract continuum, but I think it must be nearly impossible for children not to assume that the number line is basically a kind of ruler, a real physical thing that behaves much like a row of same-size wooden blocks laid end to end.

This could be completely irrelevant if the distinction between “How many?” and “How much?” is intensively taught and drilled into kids. Somehow I think it isn’t, though. And here’s where I get to the first part of my real question. Might not the universal, early, and continuous reinforcement of this simplistic equating of number and quantity have a lot to do with the equally simplistic assumption that all numeric data and statistical analysis is somehow quantitative? We count rocks or fish or sticks and call the resulting numbers quantities, and so we do the same thing when we count correct answers or ratings of “Strongly Agree.”

Though counting is a natural and obvious point from which to begin studying whether something is quantitatively measurable, there are no defined units of measurement in the ordinal data gathered from tests and surveys. The difference between any two adjacent scores varies depending on which two adjacent scores are compared. This has profound implications for the inferences we make and for our ability to think together as a field about our objects of investigation.
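One way to make that inconstancy concrete is to re-express counts of right answers as log-odds, the simplest version of the raw-score-to-measure conversions discussed in the Wright papers listed below. A minimal sketch in Python, using a hypothetical ten-item test:

```python
import math

# Percent-correct on a hypothetical ten-item test, re-expressed as
# log-odds (logits). Adjacent raw scores are not equal distances apart:
# going from 1 right to 2 right is a 0.81-logit jump, while going from
# 5 right to 6 right is only 0.41 logits.
for right in range(1, 10):
    p = right / 10
    print(f"{right} right -> {math.log(p / (1 - p)):+.2f} logits")
```

The one-point difference between adjacent raw scores stands for roughly twice as much at the extremes as in the middle of the range, which is exactly the inconstant unit size at issue here.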

Over the last 30 years and more, we have become increasingly sensitized to the way our words prefigure our expectations and color our perceptions. This struggle to say what we mean and to not prejudicially exclude others from recognition as full human beings is admirable and good. But if that is so, why is it then that we nonetheless go on unjustifiably reducing the real characteristics of people’s abilities, health, performances, etc. to numbers that do not and cannot stand for quantitative amounts? Why do we keep on referring to counts as quantities? Why do we insist on referring to inconstant and locally dependent scores as measures? And why do we refuse to use the readily available methods we have at our disposal to create universally uniform measures that consistently represent the same unit amount always and everywhere?

It seems to me that the image of the number line as a kind of ruler is so indelibly impressed on us as a habit of thought that it is very difficult to relinquish it in favor of a more abstract model of number. Might it be important for us to begin to plant the seeds for more sophisticated understandings of number early in mathematics education? I’m going to wonder out loud about this to some of my math education colleagues…

Cooper, G., & Humphry, S. M. (2010). The ontological distinction between units and entities. Synthese. DOI: 10.1007/s11229-010-9832-1.

Wright, B. D. (1989). Rasch model from counting right answers: Raw scores as sufficient statistics. Rasch Measurement Transactions, 3(2), 62 [http://www.rasch.org/rmt/rmt32e.htm].

Wright, B. D. (1993). Thinking with raw scores. Rasch Measurement Transactions, 7(2), 299-300 [http://www.rasch.org/rmt/rmt72r.htm].

Wright, B. D. (1994, Autumn). Measuring and counting. Rasch Measurement Transactions, 8(3), 371 [http://www.rasch.org/rmt/rmt83c.htm].

Wright, B. D., & Linacre, J. M. (1989). Observations are always ordinal; measurements, however, must be interval. Archives of Physical Medicine and Rehabilitation, 70(12), 857-867 [http://www.rasch.org/memo44.htm].


A Second Simple Example of Measurement’s Role in Reducing Transaction Costs, Enhancing Market Efficiency, and Enabling the Pricing of Intangible Assets

March 9, 2011

The prior post here showed why we should not confuse counts of things with measures of amounts, though counts are the natural starting place to begin constructing measures. That first simple example focused on an analogy between counting oranges and measuring the weight of oranges, versus counting correct answers on tests and measuring amounts of ability. This second example extends the first by, in effect, showing what happens when we want to aggregate value not just across different counts of some one thing but across different counts of different things. The point will be, in effect, to show how the relative values of apples, oranges, grapes, and bananas can be put into a common frame of reference and compared in a practical and convenient way.

For instance, you may go into a grocery store to buy raspberries and blackberries, and I go in to buy cantaloupe and watermelon. Your cost per individual fruit will be very low, and mine will be very high, but neither of us will find this annoying, confusing, or inconvenient because your fruits are very small, and mine, very large. Conversely, your cost per kilogram will be much higher than mine, but this won’t cause either of us any distress because we both recognize the differences in the labor, handling, nutritional, and culinary value of our purchases.

But what happens when we try to purchase something as complex as a unit of socioeconomic development? The eight UN Millennium Development Goals (MDGs) represent a start at a systematic effort to bring human, social, and natural capital together into the same economic and accountability framework as liquid and manufactured capital, and property. But that effort is stymied by the inefficiency and cost of making and using measures of the goals achieved. The existing MDG databases (http://data.un.org/Browse.aspx?d=MDG) and summary reports present overwhelming numbers of numbers. Individual indicators are presented for each year, each country, each region, and each program, goal by goal, target by target, indicator by indicator, and series by series, in an indigestible volume of data.

Though there are no doubt complex mathematical methods by which a philanthropic, governmental, or NGO investor might determine how much development is gained per million dollars invested, the cost of obtaining impact measures is so high that most funding decisions are made with little information concerning expected returns (Goldberg, 2009). Further, the percentages of various needs met by leading social enterprises typically range from 0.07% to 3.30%, and needs are growing, not diminishing. Progress at current rates means that it would take thousands of years to solve today’s problems of human suffering, social disparity, and environmental quality. The inefficiency of human, social, and natural capital markets is so overwhelming that there is little hope for significant improvements without the introduction of fundamental infrastructural supports, such as an Intangible Assets Metric System.

A basic question that needs to be asked of the MDG system is, how can anyone make any sense out of so much data? Most of the indicators are evaluated in terms of counts of the number of times something happens, the number of people affected, or the number of things observed to be present. These counts are usually then divided by the maximum possible (the count of the total population) and are expressed as percentages or rates.

As previously explained in various posts in this blog, counts and percentages are not measures in any meaningful sense. They are notoriously difficult to interpret, since the quantitative meaning of any given unit difference varies depending on the size of what is counted, or where the percentage falls in the 0-100 continuum. And because counts and percentages are interpreted one at a time, it is very difficult to know if and when any number included in the sheer mass of data is reasonable, all else considered, or if it is inconsistent with other available facts.

A study of the MDG data must focus on these three potential areas of data quality improvement: consistency evaluation, volume reduction, and interpretability. Each builds on the others. With consistent data lending themselves to summarization in sufficient statistics, data volume can be drastically reduced with no loss of information (Andersen, 1977, 1999; Wright, 1977, 1997), data quality can be readily assessed in terms of sufficiency violations (Smith, 2000; Smith & Plackner, 2009), and quantitative measures can be made interpretable in terms of a calibrated ruler’s repeatedly reproducible hierarchy of indicators (Bond & Fox, 2007; Masters, Lokan, & Doig, 1994).

The primary data quality criteria are qualitative relevance and meaningfulness, on the one hand, and mathematical rigor, on the other. The point here is one of following through on the maxim that we manage what we measure, with the goal of measuring in such a way that management is better focused on the program mission and not distracted by accounting irrelevancies.

Method

As written and deployed, each of the MDG indicators has the face and content validity of providing information on each respective substantive area of interest. But, as has been the focus of repeated emphases in this blog, counting something is not the same thing as measuring it.

Counts or rates of literacy or unemployment are not, in and of themselves, measures of development. Their capacity to serve as contributing indications of developmental progress is an empirical question that must be evaluated experimentally against the observable evidence. The measurement of progress toward an overarching developmental goal requires inferences made from a conceptual order of magnitude above and beyond that provided in the individual indicators. The calibration of an instrument for assessing progress toward the realization of the Millennium Development Goals requires, first, a reorganization of the existing data, and then an analysis that tests explicitly the relevant hypotheses as to the potential for quantification, before inferences supporting the comparison of measures can be scientifically supported.

A subset of the MDG data was selected from the MDG database available at http://data.un.org/Browse.aspx?d=MDG, recoded, and analyzed using Winsteps (Linacre, 2011). At least one indicator was selected from each of the eight goals, with 22 in total. All available data from these 22 indicators were recorded for each of 64 countries.

The reorganization of the data is nothing but a way of making the interpretation of the percentages explicit. The meaning of any one country’s percentage or rate of youth unemployment, cell phone users, or literacy has to be kept in context relative to expectations formed from other countries’ experiences. It would be nonsense to interpret any single indicator as good or bad in isolation. Sometimes 30% represents an excellent state of affairs, other times, a terrible one.

Therefore, the distributions of each indicator’s percentages across the 64 countries were divided into ranges and converted to ratings. A lower rating uniformly indicates a status further away from the goal than a higher rating. The ratings were devised by dividing the frequency distribution of each indicator roughly into thirds.

For instance, the youth unemployment rate was found to vary such that the countries furthest from the desired goal had rates of 25% and more (rated 1), and those closest to or exceeding the goal had rates of 0-10% (rated 3), leaving the middle range (10-25%) rated 2. In contrast, percentages of the population that are undernourished were rated 1 for 35% or more, 2 for 15-35%, and 3 for less than 15%.
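In code, the recoding step just described might look like the following sketch; the cut points are the ones given above, but the country values are invented for illustration:

```python
import numpy as np

def to_rating(values, cuts):
    """Recode an indicator's percentages into 1-3 ratings, for indicators
    where higher percentages mean being further from the goal; cuts are
    the two boundaries dividing the distribution into rough thirds."""
    low, high = cuts
    bands = np.digitize(values, [low, high])  # 0 below low, 1 in [low, high), 2 at or above high
    return 3 - bands                          # higher rating = closer to the goal

# Illustrative country values for the two indicators discussed above.
youth_unemployment = np.array([4.0, 12.5, 31.0])   # percent rates
undernourished = np.array([10.0, 22.0, 40.0])      # percent of population

print(to_rating(youth_unemployment, (10, 25)))  # -> [3 2 1]
print(to_rating(undernourished, (15, 35)))      # -> [3 2 1]
```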

Thirds of the distributions were decided upon only on the basis of the investigator’s prior experience with data of this kind. A more thorough approach to the data would begin from a finer-grained rating system, like that structuring the MDG table at http://mdgs.un.org/unsd/mdg/Resources/Static/Products/Progress2008/MDG_Report_2008_Progress_Chart_En.pdf. This greater detail would be sought in order to determine empirically just how many distinctions each indicator can support and contribute to the overall measurement system.

Sixty-four of the available 336 data points were selected for their representativeness, with no duplications of values and with a proportionate distribution along the entire continuum of observed values.

Data from the same 64 countries and the same years were then sought for the subsequent indicators. It turned out that the years in which data were available varied across data sets. Data within one or two years of the target year were sometimes substituted for missing data.

The data were analyzed twice, first with each indicator allowed its own rating scale, parameterizing each of the category difficulties separately for each item, and then with the full rating scale model, as the results of the first analysis showed all indicators shared strong consistency in the rating structure.

Results

Data were 65.2% complete. Countries were assessed on an average of 14.3 of the 22 indicators, and each indicator was applied on average to 41.7 of the 64 country cases. Measurement reliability was .89-.90, depending on how measurement error is estimated. Cronbach’s alpha for the by-country scores was .94. Calibration reliability was .93-.95. The rating scale worked well (see Linacre, 2002, for criteria). The data fit the measurement model reasonably well, with satisfactory data consistency, meaning that the hypothesis of a measurable developmental construct was not falsified.
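For readers who want to see how the classical consistency summary is computed, here is a minimal sketch of Cronbach’s alpha using the standard formula. The data are a simulated, complete-data stand-in; the actual analysis worked from incomplete data and Rasch calibration, which this toy does not attempt:

```python
import numpy as np

def cronbach_alpha(ratings):
    """Cronbach's alpha for a countries-by-indicators ratings matrix:
    alpha = k/(k-1) * (1 - sum of item variances / variance of totals)."""
    k = ratings.shape[1]
    item_vars = ratings.var(axis=0, ddof=1).sum()
    total_var = ratings.sum(axis=1).var(ddof=1)
    return (k / (k - 1)) * (1 - item_vars / total_var)

# Simulated stand-in for a 64-country x 22-indicator ratings matrix:
# one underlying level per "country" drives all of its 1-3 ratings.
rng = np.random.default_rng(7)
level = rng.normal(0.0, 1.0, size=(64, 1))
ratings = np.clip(np.round(2.0 + 0.8 * level + rng.normal(0.0, 0.6, (64, 22))), 1, 3)

print(f"alpha = {cronbach_alpha(ratings):.2f}")  # high, by construction
```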

The main result for our purposes here concerns how satisfactory data consistency makes it possible to dramatically reduce data volume and improve data interpretability. The figure below illustrates how. What does it mean for data volume to be drastically reduced with no loss of information? Let’s see exactly how much the data volume is reduced for the ten-item subset shown in the figure.

The horizontal continuum from -100 to 1300 in the figure is the metric, the ruler or yardstick. The number of countries at various locations along that ruler is shown across the bottom of the figure. The mean (M), first standard deviation (S), and second standard deviation (T) are shown beneath the numbers of countries. There are ten countries with a measure of just below 400, just to the left of the mean (M).

The MDG indicators are listed on the right of the figure, with the indicator most often found being achieved relative to the goals at the bottom, and the indicator least often being achieved at the top. The ratings in the middle of the figure increase from 1 to 3 left to right as the probability of goal achievement increases as the measures go from low to high. The position of the ratings in the middle of the figure shifts from left to right as one reads up the list of indicators because the difficulty of achieving the goals is increasing.

Because the ratings of the 64 countries relative to these ten goals are internally consistent, nothing but the developmental level of the country and the developmental challenge of the indicator affects the probability that a given rating will be attained. It is this relation that defines fit to a measurement model, the sufficiency of the summed ratings, and the interpretability of the scores. Given sufficient fit and consistency, any country’s measure implies a given rating on each of the ten indicators.

For instance, imagine a vertical line drawn through the figure at a measure of 500, just above the mean (M). This measure is interpreted relative to the places at which the vertical line crosses the ratings in each row associated with each of the ten items. A measure of 500 is read as implying, within a given range of error, uncertainty, or confidence, a rating of

  • 3 on debt service and female-to-male parity in literacy,
  • 2 or 3 on how much of the population is undernourished and how many children under five years of age are moderately or severely underweight,
  • 2 on infant mortality, the percent of the population aged 15 to 49 with HIV, and the youth unemployment rate,
  • 1 or 2 on the poor’s share of the national income, and
  • 1 on CO2 emissions and the rate of personal computers per 100 inhabitants.

For any one country with a measure of 500 on this scale, ten percentages or rates that appear completely incommensurable and incomparable are found to contribute consistently to a single valued function, developmental goal achievement. Instead of managing each separate indicator as a universe unto itself, this scale makes it possible to manage development itself at its own level of complexity. This ten-to-one ratio of reduced data volume is more than doubled when the total of 22 items included in the scale is taken into account.
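A hedged sketch of how a single measure implies ratings on every indicator, using the Andrich rating scale model that Winsteps applies in analyses of this kind. The indicator positions, the threshold values, and the conversion from the figure’s -100 to 1300 scale to logits are all assumptions chosen to mimic the figure, not the study’s actual estimates:

```python
import numpy as np

def rsm_probs(theta, delta, taus):
    """Rating probabilities (1..3) under the Andrich rating scale model:
    only the country measure (theta) and the indicator difficulty (delta)
    matter, plus two thresholds (taus) shared by all indicators."""
    steps = theta - delta - np.asarray(taus)            # one term per threshold
    logits = np.concatenate(([0.0], np.cumsum(steps)))  # ratings 1, 2, 3
    p = np.exp(logits - logits.max())                   # overflow guard
    return p / p.sum()

def expected_rating(theta, delta, taus):
    """Model-expected rating at a given measure: sum over k of k * P(k)."""
    return float(np.dot([1, 2, 3], rsm_probs(theta, delta, taus)))

to_logit = lambda x: (x - 500.0) / 100.0   # assumed rescaling to logits
taus = [-1.0, 1.0]                         # assumed threshold spacing
indicators = {"DebtServExpInc": 0, "PopUndernourished": 350,
              "InfantMortality": 450, "PcsPer100": 800}  # assumed positions

theta = to_logit(500)  # the measure-of-500 example worked through above
for name, pos in indicators.items():
    probs = rsm_probs(theta, to_logit(pos), taus)
    print(f"{name:18s} P(1,2,3) = {np.round(probs, 2)} "
          f"expected = {expected_rating(theta, to_logit(pos), taus):.2f}")
```

Run as written, the sketch reproduces the qualitative pattern in the list above: a near-certain 3 on debt service, a 2-or-3 split on undernourishment, a solid 2 on infant mortality, and a 1 on personal computers.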

This reduction is conceptually and practically important because it focuses attention on the actual object of management, development. When the individual indicators are the focus of attention, the forest is lost for the trees. Those who disparage the validity of the maxim, you manage what you measure, are often discouraged by the feeling of being pulled in too many directions at once. But a measure of the HIV infection rate is not in itself a measure of anything but the HIV infection rate. Interpreting it in terms of broader developmental goals requires evidence that it in fact takes a place in that larger context.

And once a connection with that larger context is established, the consistency of individual data points remains a matter of interest. As the world turns, the order of things may change, but, more likely, data entry errors, temporary data blips, and other factors will alter data quality. Such changes cannot be detected outside of the context defined by an explicit interpretive framework that requires consistent observations.

-100  100     300     500     700     900    1100    1300
|-------+-------+-------+-------+-------+-------+-------|  NUM   INDCTR
1                                 1  :    2    :  3     3    9  PcsPer100
1                         1   :   2    :   3            3    8  CO2Emissions
1                    1  :    2    :   3                 3   10  PoorShareNatInc
1                 1  :    2    :  3                     3   19  YouthUnempRatMF
1              1   :    2   :   3                       3    1  %HIV15-49
1            1   :   2    :   3                         3    7  InfantMortality
1          1  :    2    :  3                            3    4  ChildrenUnder5ModSevUndWgt
1         1   :    2    :  3                            3   12  PopUndernourished
1    1   :    2   :   3                                 3    6  F2MParityLit
1   :    2    :  3                                      3    5  DebtServExpInc
|-------+-------+-------+-------+-------+-------+-------|  NUM   INDCTR
-100  100     300     500     700     900    1100    1300
                   1
       1   1 13445403312323 41 221    2   1   1            COUNTRIES
       T      S       M      S       T

Discussion

A key element in the results obtained here concerns the fact that the data were about 35% missing. Whether or not any given indicator was actually rated for any given country, the measure can still be interpreted as implying the expected rating. This capacity to take missing data into account can be taken advantage of systematically by calibrating a large bank of indicators. With this in hand, it becomes possible to gather only the amount of data needed to make a specific determination, or to adaptively administer the indicators so as to obtain the lowest-error (most reliable) measure at the lowest cost (with the fewest indicators administered). Perhaps most importantly, different collections of indicators can then be equated to measure in the same unit, so that impacts may be compared more efficiently.
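A minimal sketch of the adaptive idea, assuming a calibrated bank of indicator difficulties (all values hypothetical): administer next whichever unused indicator sits nearest the current measure estimate, since that is roughly where a Rasch-calibrated item is most informative.

```python
def next_indicator(theta_hat, bank, administered):
    """Pick the unused indicator whose difficulty (in logits) is closest
    to the current measure estimate, a standard adaptive-testing heuristic."""
    remaining = {n: d for n, d in bank.items() if n not in administered}
    return min(remaining, key=lambda n: abs(remaining[n] - theta_hat))

# Hypothetical calibrated bank (difficulties in logits).
bank = {"DebtServExpInc": -5.0, "PopUndernourished": -1.5,
        "InfantMortality": -0.5, "YouthUnempRatMF": 0.5, "PcsPer100": 3.0}

print(next_indicator(0.2, bank, administered={"InfantMortality"}))
# -> YouthUnempRatMF: the most informative indicator left at an estimate of 0.2
```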

Instead of an international developmental aid market that is so inefficient as to preclude any expectation of measured returns on investment, setting up a calibrated bank of indicators to which all measures are traceable opens up numerous desirable possibilities. The cost of assessing and interpreting the data informing aid transactions could be reduced to negligible amounts, and the management of the processes and outcomes in which that aid is invested would be made much more efficient by reduced data volume and enhanced information content. Because capital would flow more efficiently to where supply is meeting demand, nonproducers would be cut out of the market, and the effectiveness of the aid provided would be multiplied many times over.

The capacity to harmonize counts of different but related events into a single measurement system presents the possibility that there may be a bright future for outcomes-based budgeting in education, health care, human resource management, environmental management, housing, corrections, social services, philanthropy, and international development. It may seem wildly unrealistic to imagine such a thing, but the return on the investment would be so monumental that not checking it out would be even crazier.

A full report on the MDG data, with the other references cited, is available on my SSRN page at http://papers.ssrn.com/sol3/papers.cfm?abstract_id=1739386.

Goldberg, S. H. (2009). Billions of drops in millions of buckets: Why philanthropy doesn’t advance social progress. New York: Wiley.


Draft Legislation on Development and Adoption of an Intangible Assets Metric System

November 19, 2009

In my opinion, more could be done to effect meaningful and effective health care reform with legislation like that proposed below, which has fewer than 3,800 words, than will ever be possible with the 2,074 pages in Congress’s current health care reform bill. What’s more, creating the infrastructure for human, social, and natural capital markets in this way would not only cost a tiny fraction of the projected $847 billion bill being debated, it would be an investment that would pay returns many times larger than the initial investment. See previous posts in this blog for more info on how and why this is so.

The draft legislation below is adapted from the Metric Conversion Act (Title 15 U.S.C. Chapter 6, §§ 204, 205a-205k). The viability of a metric system for human, social, and natural capital is indicated by the realized state of scientific rigor in the measurement of human, social, and natural capital (Fisher, 2009b). The need for such a system is indicated by the current crisis’s pointed economic demands that all forms of capital be unified within a common econometric and financial framework (Fisher, 2009a). It is equally demanded by the moral and philosophical requirements of fair play and meaningfulness (Fisher, 2004). The day is fast approaching when a metric system for intangible assets will be recognized as the urgent need that it is (Fisher, 2009c).

At some point in the near future, it can be expected that a table showing how to interpret the units of the Intangible Assets Metric System will be published in the Federal Register, just as the International System units have been.

For those unfamiliar with the state of the art in measurement, these may seem like wildly unrealistic goals. Those wondering how a reasonable person might arrive at such opinions are urged to consult other posts in this blog, and the references cited in them. The advantages of an intangible assets metric system for sustainable and socially responsible economic policies and practices are nothing short of profound. As Georg Rasch (1980, p. xx) said in reference to the stringent demands of his measurement models, “this is a huge challenge, but once the problem has been formulated it does seem possible to meet it.” We are less likely to attain goals that we do not actively formulate. In the spirit of John Dewey’s student, Chiang Mon-Lin, what we need are “wild hypotheses and careful tests.” There is no wilder idea with greater potential impact for redefining profit as the reduction of waste, and for thereby mitigating human suffering, sociopolitical discontent, and environmental degradation.

Fisher, W. P., Jr. (2004, October). Meaning and method in the social sciences. Human Studies: A Journal for Philosophy and the Social Sciences, 27(4), 429-54.

Fisher, W. P., Jr. (2009a). Bringing human, social, and natural capital to life: Practical consequences and opportunities. In M. Wilson, K. Draney, N. Brown, & B. Duckor (Eds.), Advances in Rasch Measurement, Vol. Two (in press). Maple Grove, MN: JAM Press.

Fisher, W. P., Jr. (2009b, November). Invariance and traceability for measures of human, social, and natural capital: Theory and application. Measurement (Elsevier), 42(9), 1278-1287.

Fisher, W. P., Jr. (2009c). NIST critical national need idea white paper: Metrological infrastructure for human, social, and natural capital (Tech. Rep.). New Orleans: LivingCapitalMetrics.com.

Rasch, G. (1980). Probabilistic models for some intelligence and attainment tests (Reprint, with Foreword and Afterword by B. D. Wright, Chicago: University of Chicago Press). Copenhagen, Denmark: Danmarks Paedogogiske Institut.

Title xx U.S.C. Chapter x §(100) 101a – 101k
METRIC SYSTEM FOR INTANGIBLE ASSETS DEVELOPMENT LAW
(Pub. L. 10-xxx, §x, Intangible Assets Metrics Development Act, July 25, 2010)

§ 100. New metric system development authorized. – A new national effort is hereby initiated throughout the United States of America focusing on building and realizing the benefits of a metric system for the intangible assets known as human, social, and natural capital.

§ 101a. Congressional statement of findings. – The Congress finds as follows:

(1) The United States was an original signatory party to the 1875 Treaty of the Meter (20 Stat. 709), which established the General Conference of Weights and Measures, the International Committee of Weights and Measures and the International Bureau of Weights and Measures.

(2) The use of metric measurement standards in the United States was authorized by law in 1866; with the Metric Conversion Act of 1975 this Nation established a national policy of committing itself and taking steps to facilitate conversion to the metric system.

(3) World trade is dependent on the metric system of measurement; continuing trends toward globalization demand expansion of the metric system to include vital economic resources shown scientifically measurable in research conducted over the last 80 years.

(4) Industries and consumers in the United States are often at competitive disadvantages when dealing in domestic and international markets because no existing systems for measuring intangible assets (human, social, and natural capital) are expressed in standardized, universally uniform metrics. The end result is that education, health care, human resource, and other markets are unable to reward quality; supply and demand go unmatched; consumers make decisions with no or insufficient information; and quality cannot be systematically improved.

(5) The inherent simplicity of the metric system of measurement and standardization of weights and measures has led to major cost savings in certain industries which have converted to that system; similar savings are expected to follow from the development and implementation of a metric system for intangible assets.

(6) The Federal Government has a responsibility to develop procedures and techniques to assist industry, especially small business, as it voluntarily seeks to adopt a new metric system of measurement for intangible assets that have always required management but which have not yet been uniformly and systematically measured.

(7) A new metric system of measurement for human, social, and natural capital can provide substantial advantages to the Federal Government in its own operations.

§ 101b. Declaration of policy. – It is therefore the declared policy of the United States-

(1) to support the development and implementation of a new metric system of intangible assets measurement as the preferred system of weights and measures for United States trade and commerce involving human, social, and natural capital;

(2) to require that each Federal agency, by a date certain and to the extent economically feasible by the end of the fiscal year 2011, use the new metric system of intangibles measurement in its procurements, grants, and other business-related activities, except to the extent that such use is impractical or is likely to cause significant inefficiencies or loss of markets to United States firms, such as when foreign competitors are producing competing products in non-metric units; and

(3) to seek out ways to increase understanding of the new metric system of intangibles measurement through educational information and guidance and in Government publications.

§ 101c. Definitions

As used in this subchapter, the term-

(1) ‘Board’ means the United States Intangible Assets Metrics Board, established under section 101d of this Title;

(2) ‘engineering standard’ means a standard which prescribes (A) a concise set of conditions and requirements that must be satisfied by a material, product, process, procedure, convention, or test method; and (B) the physical, functional, performance and/or conformance characteristics thereof;

(3) ‘international standard or recommendation’ means an engineering standard or recommendation which is (A) formulated and promulgated by an international organization and (B) recommended for adoption by individual nations as a national standard;

(4) ‘metric system of measurement’ means the International System of Units as established by the General Conference of Weights and Measures in 1960 and as interpreted or modified for the United States by the Secretary of Commerce;

(5) ‘full and open competition’ has the same meaning as defined in section 403 of title 41;

(6) ‘total installed price’ means the price of purchasing a product or material, trimming or otherwise altering some or all of that product or material, if necessary to fit with other building components, and then installing that product or material into a Federal facility;

(7) ‘hard-metric’ means measurement, design, and manufacture using the metric system of measurement, but does not include measurement, design, and manufacture using English system measurement units which are subsequently reexpressed in the metric system of measurement;

(8) ‘cost or pricing data or price analysis’ has the meaning given such terms in section 254b of title 41; and

(9) ‘Federal facility’ means any public building (as defined under section 612 of title 40) and shall include any Federal building or construction project: (A) on lands in the public domain; (B) on lands used in connection with Federal programs for agriculture research, recreation, and conservation programs; (C) on or used in connection with river, harbor, flood control, reclamation, or power projects; (D) on or used in connection with housing and residential projects; (E) on military installations (including any fort, camp, post, naval training station, airfield, proving ground, military supply depot, military school, any similar facility of the Department of Defense); (F) on installations of the Department of Veterans Affairs used for hospital or domiciliary purposes; or (G) on lands used in connection with Federal prisons, but does not include (i) any Federal building or construction project the exclusion of which the President deems to be justified in the public interest, or (ii) any construction project or building owned or controlled by a State government, local government, Indian tribe, or any private entity.

§101d. United States Intangible Assets Metrics Board

(a) Establishment. – There is established, in accordance with this section, an independent instrumentality to be known as a United States Intangible Assets Metrics Board.

(b) Membership; Chairman; appointment of members; term of office; vacancies. – The Board shall consist of 17 individuals, as follows:

(1) the Chairman, a qualified individual who shall be appointed by the President, by and with the advice and consent of the Senate;

(2) sixteen members who shall be appointed by the President, by and with the advice and consent of the Senate, on the following basis-

(A) one to be selected from lists of qualified individuals recommended by psychometricians and organizations representative of psychometric interests;

(B) one to be selected from lists of qualified individuals recommended by social scientists, the scientific and technical community, and organizations representative of social scientists and technicians;

(C) one to be selected from lists of qualified individuals recommended by environmental scientists, the scientific and technical community, and organizations representative of environmental scientists and technicians;

(D) one to be selected from a list of qualified individuals recommended by the National Association of Manufacturers or its successor;

(E) one to be selected from lists of qualified individuals recommended by the United States Chamber of Commerce, or its successor, retailers,and other commercial organizations;

(F) two to be selected from lists of qualified individuals recommended by the American Federation of Labor and Congress of Industrial Organizations or its successor, who are representative of workers directly affected by human capital metrics for health, skills, motivations, and productivity, and by other organizations representing labor;

(G) one to be selected from a list of qualified individuals recommended by the National Governors Conference, the National Council of State Legislatures, and organizations representative of State and local government;

(H) two to be selected from lists of qualified individuals recommended by organizations representative of small business;

(I) one to be selected from lists of qualified individuals representative of the human resource management industry;

(J) one to be selected from a list of qualified individuals recommended by the National Conference on Weights and Measures and standards making organizations;

(K) one to be selected from lists of qualified individuals recommended by educators, the educational community, and organizations representative of educational interests; and

(L) four at-large members to represent consumers and other interests deemed suitable by the President and who shall be qualified individuals.

As used in this subsection, each ‘list’ shall include the names of at least three individuals for each applicable vacancy. The terms of office of the members of the Board first taking office shall expire as designated by the President at the time of nomination; five at the end of the second year; five at the end of the fourth year; and six at the end of the sixth year. The term of office of the Chairman of such Board shall be six years. Members, including the Chairman, may be appointed to an additional term of six years, in the same manner as the original appointment. Successors to members of such Board shall be appointed in the same manner as the original members and shall have terms of office expiring six years from the date of expiration of the terms for which their predecessors were appointed. Any individual appointed to fill a vacancy occurring prior to the expiration of any term of office shall be appointed for the remainder of that term. Beginning 45 days after the date of incorporation of the Board, six members of such Board shall constitute a quorum for the transaction of any function of the Board.

(c) Compulsory powers. – Unless otherwise provided by the Congress, the Board shall have no compulsory powers.

(d) Termination. – The Board shall cease to exist when the Congress, by law, determines that its mission has been accomplished.

§101e. – Functions and powers of Board. – It shall be the function of the Board to devise and carry out a broad program of planning, coordination, and public education, consistent with other national policy and interests, with the aim of implementing the policy set forth in this subchapter. In carrying out this program, the Board shall-

(1) consult with and take into account the interests, views, and costs relevant to the inefficiencies that have long plagued the management of unmeasured forms of capital in United States commerce and industry, including small business; science; engineering; labor; education; consumers; government agencies at the Federal, State, and local level; nationally recognized standards developing and coordinating organizations; intangibles metrics development, planning and coordinating groups; and such other individuals or groups as are considered appropriate by the Board to the carrying out of the purposes of this subchapter. The Board shall take into account activities underway in the private and public sectors, so as not to duplicate unnecessarily such activities;

(2) provide for appropriate procedures whereby various groups, under the auspices of the Board, may formulate, and recommend or suggest, to the Board specific programs for coordinating intangibles metrics development in each industry and segment thereof and specific dimensions and configurations in the new metric system and in other measurements for general use. Such programs, dimensions, and configurations shall be consistent with (A) the needs, interests, and capabilities of manufacturers (large and small), suppliers, labor, consumers, educators, and other interested groups, and (B) the national interest;

(3) publicize, in an appropriate manner, proposed programs and provide an opportunity for interested groups or individuals to submit comments on such programs. At the request of interested parties, the Board, in its discretion, may hold hearings with regard to such programs. Such comments and hearings may be considered by the Board;

(4) encourage activities of standardization organizations to develop or revise, as rapidly as practicable, policy and IT standards based on the new intangibles metrics, and to take advantage of opportunities to promote (A) rationalization or simplification of relationships, (B) improvements of design, (C) reduction of size variations, (D) increases in economy, and (E) where feasible, the efficient use of energy and the conservation of natural resources;

(5) encourage the retention, in the new metric language of human, social, and natural capital standards, of those United States policy and IT designs, practices, and conventions that are internationally accepted or that embody superior technology;

(6) consult and cooperate with foreign governments, and intergovernmental organizations, in collaboration with the Department of State, and, through appropriate member bodies, with private international organizations, which are or become concerned with the encouragement and coordination of increased use of intangible assets metrics measurement units or policy and IT standards based on such units, or both. Such consultation shall include efforts, where appropriate, to gain international recognition for intangible assets metrics standards proposed by the United States;

(7) assist the public through information and education programs, to become familiar with the meaning and applicability of metric terms and measures in daily life. Such programs shall include –

(A) public information programs conducted by the Board, through the use of newspapers, magazines, radio, television, the Internet, social networking, and other media, and through talks before appropriate citizens’ groups, and trade and public organizations;

(B) counseling and consultation by the Secretary of Education; the Secretary of Labor; the Administrator of the Small Business Administration; and the Director of the National Science Foundation, with educational associations, State and local educational agencies, labor education committees, apprentice training committees, and other interested groups, in order to assure (i) that the new intangible assets metric system of measurement is included in the curriculum of the Nation’s educational institutions, and (ii) that teachers and other appropriate personnel are properly trained to teach the intangible assets metric system of measurement;

(C) consultation by the Secretary of Commerce with the National Conference of Weights and Measures in order to assure that State and local weights and measures officials are (i) appropriately involved in intangible assets metric development and adoption activities and (ii) assisted in their efforts to bring about timely amendments to weights and measures laws; and

(D) such other public information activities, by any Federal agency in support of this subchapter, as relate to the mission of such agency;

(8) collect, analyze, and publish information about the extent of usage of intangible assets metric measurements; evaluate the costs and benefits of that usage; and make efforts to minimize any adverse effects resulting from increasing intangible assets metric usage;

(9) conduct research, including appropriate surveys; publish the results of such research; and recommend to the Congress and to the President such action as may be appropriate to deal with any unresolved problems, issues, and questions associated with intangible assets metric development, adoption, or usage. Such problems, issues, and questions may include, but are not limited to, the impact on different occupations and industries, possible increased costs to consumers, the impact on society and the economy, effects on small business, the impact on the international trade position of the United States, the appropriateness of and methods for using procurement by the Federal Government as a means to effect development and adoption of the intangible assets metric system, the proper conversion or transition period in particular sectors of society, and consequences for national defense;

(10) submit annually to the Congress and to the President a report on its activities. Each such report shall include a status report on the development and adoption process as well as projections for continued progress in that process. Such report may include recommendations covering any legislation or executive action needed to implement the programs of development and adoption accepted by the Board. The Board may also submit such other reports and recommendations as it deems necessary; and

(11) submit to the President, not later than 1 year after the date of enactment of the Act making appropriations for carrying out this subchapter, a report on the need to provide an effective structural mechanism for adopting intangible assets metric units in statutes, regulations, and other laws at all levels of government, on a coordinated and timely basis, in response to voluntary programs adopted and implemented by various sectors of society under the auspices and with the approval of the Board. If the Board determines that such a need exists, such report shall include recommendations as to appropriate and effective means for establishing and implementing such a mechanism.

§101f. – Duties of Board. – In carrying out its duties under this subchapter, the Board may –

(1) establish an Executive Committee, and such other committees as it deems desirable;

(2) establish such committees and advisory panels as it deems necessary to work with the various sectors of the Nation’s economy and with Federal and State governmental agencies in the development and implementation of detailed development and adoption plans for those sectors. The Board may reimburse, to the extent authorized by law, the members of such committees;

(3) conduct hearings at such times and places as it deems appropriate;

(4) enter into contracts, in accordance with the Federal Property and Administrative Services Act of 1949, as amended (40 U.S.C. 471 et seq.), with Federal or State agencies, private firms, institutions, and individuals for the conduct of research or surveys, the preparation of reports, and other activities necessary to the discharge of its duties;

(5) delegate to the Executive Director such authority as it deems advisable; and

(6) perform such other acts as may be necessary to carry out the duties prescribed by this subchapter.

§101g. – Gifts, donations and bequests to Board

(a) Authorization; deposit into Treasury and disbursement. – The Board may accept, hold, administer, and utilize gifts, donations, and bequests of property, both real and personal, and personal services, for the purpose of aiding or facilitating the work of the Board. Gifts and bequests of money, and the proceeds from the sale of any other property received as gifts or bequests, shall be deposited in the Treasury in a separate fund and shall be disbursed upon order of the Board.

(b) Federal income, estate, and gift taxation of property. – For purpose of Federal income, estate, and gift taxation, property accepted under subsection (a) of this section shall be considered as a gift or bequest to or for the use of the United States.

(c) Investment of moneys; disbursement of accrued income. – Upon the request of the Board, the Secretary of the Treasury may invest and reinvest, in securities of the United States, any moneys contained in the fund authorized in subsection (a) of this section. Income accruing from such securities, and from any other property accepted to the credit of such fund, shall be disbursed upon the order of the Board.

(d) Reversion to Treasury of unexpended funds. – Funds not expended by the Board as of the date when it ceases to exist, in accordance with section 101d(d) of this title, shall revert to the Treasury of the United States as of such date.

§101h. – Compensation of Board members; travel expenses. – Members of the Board who are not in the regular full-time employ of the United States shall, while attending meetings or conferences of the Board or while otherwise engaged in the business of the Board, be entitled to receive compensation at a rate not to exceed the daily rate currently being paid grade 18 of the General Schedule (under section 5332 of title 5), including travel time. While so serving, on the business of the Board away from their homes or regular places of business, members of the Board may be allowed travel expenses, including per diem in lieu of subsistence, as authorized by section 5703 of title 5, for persons employed intermittently in the Government service. Payments under this section shall not render members of the Board employees or officials of the United States for any purpose. Members of the Board who are in the employ of the United States shall be entitled to travel expenses when traveling on the business of the Board.

§101i. – Personnel

(a) Executive Director; appointment; tenure; duties. – The Board shall appoint a qualified individual to serve as the Executive Director of the Board at the pleasure of the Board. The Executive Director, subject to the direction of the Board, shall be responsible to the Board and shall carry out the intangible assets metric development and adoption program, pursuant to the provisions of this subchapter and the policies established by the Board.

(b) Executive Director; salary. – The Executive Director of the Board shall serve full time and be subject to the provisions of chapter 51 and subchapter III of chapter 53 of title 5. The annual salary of the Executive Director shall not exceed level III of the Executive Schedule under section 5314 of such title.

(c) Staff personnel; appointment and compensation. – The Board may appoint and fix the compensation of such staff personnel as may be necessary to carry out the provisions of this subchapter in accordance with the provisions of chapter 51 and subchapter III of chapter 53 of title 5.

(d) Experts and consultants; employment and compensation; annual review of contracts. – The Board may (1) employ experts and consultants or organizations thereof, as authorized by section 3109 of title 5; (2) compensate individuals so employed at rates not in excess of the rate currently being paid grade 18 of the General Schedule under section 5332 of such title, including travel time; and (3) allow such individuals, while away from their homes or regular places of business, travel expenses (including per diem in lieu of subsistence) as authorized by section 5703 of such title 5 for persons in the Government service employed intermittently: Provided, however, that contracts for such temporary employment may be renewed annually.

§101j. – Financial and administrative services; source and reimbursement. – Financial and administrative services, including those related to budgeting, accounting, financial reporting, personnel, and procurement, and such other staff services as may be needed by the Board, may be obtained by the Board from the Secretary of Commerce or other appropriate sources in the Federal Government. Payment for such services shall be made by the Board, in advance or by reimbursement, from funds of the Board in such amounts as may be agreed upon by the Chairman of the Board and by the source of the services being rendered.

§101k. – Authorization of appropriations; availability. – There are authorized to be appropriated such sums as may be necessary to carry out the provisions of this subchapter. Appropriations to carry out the provisions of this subchapter may remain available for obligation and expenditure for such period or periods as may be specified in the Acts making such appropriations.


On the alleged difficulty of quantifying this or that

October 5, 2009

That this effect or that phenomenon is “difficult to quantify” is one of those phrases that people use from time to time. But, you know, building a computer is difficult, too. I couldn’t do it, and you probably couldn’t, either. Computers are, however, readily available for purchase and it doesn’t matter if you or I can make our own.

Same thing with measurement. Of course, instrument design and calibration are highly technical endeavors, and despite 80+ years of success, most people seem to think it is impossible to really quantify abstract things like abilities, attitudes, motivations, trust, outcomes and impacts, or maturational development. But real quantification, the kind that is commonly thought possible only for physical things, has been underway in psychology and the social sciences for a long time. More people need to know this.

As anyone who has read much of this blog knows, I’m not talking about some kind of simplistic survey or assessment process that takes measurement to be a mere assignment of numbers to observations. Instrument calibration takes a lot more thought and effort than is usually invested in it. But it isn’t impossible, not by a long shot.

Just as you would not despair of ever having your own computer just because you cannot make one yourself, those who throw up their hands at the supposed difficulty of quantifying something need to think again. Where there’s a will, there’s a way, and scientifically rigorous methods of determining whether something is measurable are a lot more ready to hand than most people realize.

For more information, see my survey design recommendations on pages 1,072-4 at http://www.rasch.org/rmt/rmt203.pdf and Ben Wright’s 15 steps to measurement at http://www.rasch.org/rmt/rmt141g.htm.


Graphic Illustrations of Why Scores, Ratings, and Percentages Are Not Measures, Part Two

July 2, 2009

Part One of this two-part blog offered pictures illustrating the difference between numbers that stand for something that adds up and those that do not. The uncontrolled variation in the numbers that pass for measures in health care, education, satisfaction surveys, performance assessments, etc. is analogous to the variation in weights and measures found in Medieval European markets. It is well established that metric uniformity played a vital role in the industrial and scientific revolutions of the nineteenth century. Metrology will inevitably play a similarly central role in the economic and scientific revolutions taking place today.

Clients and students often express their need for measures that are manageable, understandable, and relevant. But sometimes it turns out that we do not understand what we think we understand. New understandings can make what previously seemed manageable and relevant appear unmanageable and irrelevant. Perhaps our misunderstandings about measurement will one day explain why we have failed to innovate and improve as much as we could have.

Of course, there are statistical methods for standardizing scores and proportions that make them comparable across different normal distributions, but I’ve never once seen them applied to employee, customer, or patient survey results reported to business or hospital managers. They certainly are not used in determining comparable proficiency levels of students under No Child Left Behind. Perhaps there are consultants and reporting systems that make standardized z-scores a routine part of their practices, but even if they are, why should anyone willingly base their decisions on the assumption that normal distributions have been obtained? Why not use methods that give the same result no matter how scores are distributed?

To bring the point home, if statistical standardization is a form of measurement, why don’t we use the z-scores for height distributions instead of the direct measures of how tall we each are? Plainly, the two kinds of numbers have different applications. Somehow, though, we try to make do without the measures in many applications involving tests and surveys, with the unfortunate consequence of much lost information and many lost opportunities for better communication.

Sometimes I wonder, if we would give a test on the meaning of the scores, percentages, and logits discussed in Part One to managers, executives, and entrepreneurs, would many do any better on the parts they think they understand than on the parts they find unfamiliar? I suspect not. Some executives whose pay-for-performance bonuses are inflated by statistical accidents are going to be unhappy with what I’m going to say here, but, as I’ve been saying for years, clarifying financial implications will go a long way toward motivating the needed changes.

How could that be true? Well, consider the way we treat percentages. Imagine that three different hospitals see their patients’ percents agreement with a key survey item change as follows. Which one changed the most?

A. from 30.85% to 50.00%: a 19.15% change

B. from 6.68% to 15.87%: a 9.19% change

C. from 69.15% to 84.13%: a 14.99% change

As is illustrated in Figure 1 below, when all three pairs of survey administrations are drawn from the same underlying measure distribution, the three changes can all be exactly the same size.

In this scenario, all the survey administrations shared the same standard deviation in the underlying measure distribution that the key item’s percentage was drawn from, and they started from different initial measures. Different ranges in the measures are associated with different parts of the sample’s distribution, and so different numbers and percentages of patients are associated with the same amount of measured change. It is easy to see that 100-unit measured gains in the range of 50-150 or 1000-1100 on the horizontal axis would scarcely amount to 1% changes, but the same measured gain in the middle of the distribution could be as much as 25%.
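
For readers who want to verify the arithmetic, here is a minimal sketch in Python, using only the standard library's NormalDist, that reproduces all three percentage changes from identical half-standard-deviation measured gains made at different starting points (the specific measure values are illustrative assumptions chosen to match the scenario above):

    from statistics import NormalDist

    norm = NormalDist()  # standard normal: mean 0, SD 1

    # Three identical measured gains of 0.5 SD, starting from different points.
    changes = {"A": (-0.5, 0.0), "B": (-1.5, -1.0), "C": (0.5, 1.0)}

    for label, (before, after) in changes.items():
        pct_before = norm.cdf(before) * 100
        pct_after = norm.cdf(after) * 100
        print(f"{label}: {pct_before:.2f}% -> {pct_after:.2f}% "
              f"(a {pct_after - pct_before:.2f}% change from the same 0.5 SD gain)")

Running this prints exactly the 19.15%, 9.19%, and 14.99% changes listed above.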

Figure 1. Different Percentages, Same Measures

Figure 1 shows how the same measured gain can look wildly different when expressed as a percentage, depending on where the initial measure is positioned in the distribution. But what happens when percentage gains are situated in different distributions that have different patterns of variation?

More specifically, consider a situation in which three different hospitals see their percents agreement with a key survey item change as follows.

A. from 30.85% to 50.00%: a 19.15% change

B. from 30.85% to 50.00%: a 19.15% change

C. from 30.85% to 50.00%: a 19.15% change

Did one change more than the others? Of course, the three percentages are all the same, so we would naturally think that the three increases are all the same. But what if the standard deviations characterizing the three different hospitals’ score distributions are different?

Figure 2, below, shows that the three 19.15% changes could be associated with quite different measured gains. When the distribution is wider and the standard deviation is larger, any given percentage change will be associated with a larger measured change than in cases with narrower distributions and smaller standard deviations.
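
The point is easy to check by inverting the normal curve: hold the percentage change fixed and vary the spread of the distribution. A minimal sketch, with made-up standard deviations for illustration:

    from statistics import NormalDist

    norm = NormalDist()

    # The same 30.85% -> 50.00% change is about a 0.5 z-unit shift...
    z_gain = norm.inv_cdf(0.5000) - norm.inv_cdf(0.3085)

    # ...but it stands for different measured gains in distributions
    # with different standard deviations.
    for sd in (1.0, 2.0, 3.0):
        print(f"SD = {sd}: measured gain = {z_gain * sd:.2f} units")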

Figure 2. Same Percentage Gains, Different Measured Gains

And if this is not enough evidence as to the foolhardiness of treating percentages as measures, bear with me through one more example. Imagine another situation in which three different hospitals see their percents agreement with a key survey item change as follows.

A. from 30.85% to 50.00%: a 19.15% change

B. from 36.96% to 50.00%: a 13.04% change

C. from 36.96% to 50.00%: a 13.04% change

Did one change more than the others? Plainly A obtains the largest percentage gain. But Figure 3 shows that, depending on the underlying distribution, A’s 19.15% gain might be a smaller measured change than either B’s or C’s. Further, B’s and C’s measures might not be identical, contrary to what would be expected from the percentages alone.
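
Again, a few lines of code make the possibility concrete. Give A a narrow distribution and B and C wider ones (the standard deviations below are hypothetical), and A's larger percentage gain becomes the smallest measured gain, while B's and C's identical percentage gains come apart:

    from statistics import NormalDist

    norm = NormalDist()

    # Hospital, proportion before, proportion after, hypothetical SD.
    hospitals = [("A", 0.3085, 0.5000, 1.0),
                 ("B", 0.3696, 0.5000, 2.0),
                 ("C", 0.3696, 0.5000, 3.0)]

    for label, p0, p1, sd in hospitals:
        gain = (norm.inv_cdf(p1) - norm.inv_cdf(p0)) * sd
        print(f"{label}: {100 * p0:.2f}% -> {100 * p1:.2f}% "
              f"= a measured gain of {gain:.2f} units")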

Figure 3. Percentages Completely at Odds with Measures

Now we have a fuller appreciation of the scope of the problems associated with the changing unit size illustrated in Part One. Though we think we understand percentages and insist on using them as something familiar and routine, the world that they present to us is as crazily distorted as a carnival funhouse. And we won’t even begin to consider how things look in the context of distributions skewed toward one end of the continuum or the other! There is similarly no point at all in going to bimodal or multimodal distributions (ones that have more than one peak). The vast majority of business applications employing scores, ratings, and percentages as measures do not take the underlying distribution into account. Given the problems that arise in optimal conditions (i.e., with a normal distribution), there is no need to belabor the issue with an enumeration of all the possible things that could be going wrong. Far better to simply move on and construct measurement systems that remain invariant across the different shapes of local data sets’ particular distributions.

How could we have gone so far in making these nonsensical numbers the focus of our attention? To put things back in perspective, we need to keep in mind the evolving magnitude of the problems we face. When Florence Nightingale was deploring the lack of any available indications of the effectiveness of her efforts, a little bit of flawed information was a significant improvement over no information. Ordinal, situation-specific numbers provided highly useful information when problems emerged in local contexts on a scale that could be comprehended and addressed by individuals and small groups.

We no longer live in that world. Today’s problems require kinds of information that must be more meaningful, precise, and actionable than ever before. And not only that, this information cannot remain accessible only to managers, executives, researchers, and data managers. It must be brought to bear in every transaction and information exchange in the industry.

Information has to be formatted in the common currency of uniform metrics to make it as fluid and empowering as possible. Would the auto industry have been able to bring off a quality revolution if every worker's toolkit had been calibrated in a different unit? Could we expect to coordinate schedules easily if we each had clocks scaled in different time units? Obviously not; so why should we expect quality revolutions in health care and education when nearly all of our relevant metrics are incommensurable?

Management consultants realized decades ago that information creates a sense of responsibility in the person who possesses it. We cannot expect clinicians and teachers to take full responsibility for the outcomes they produce until they have the information they need to evaluate and improve them. Existing data and systems plainly are not up to the task.

The problem is far less a matter of complex or difficult issues than it is one of culture and priorities. It often takes less effort to remain in a dysfunctional rut and deal with massive inefficiencies than it does to get out of the rut and invent a new system with new potentials. Big changes tend to take place only when systems become so bogged down by their problems that new systems emerge simply out of the need to find some way to keep things in motion. These blogs are written in the hope that we might find our way to new methods without suffering the catastrophes of total system failure. One might well imagine an entrepreneurially minded consortium of providers, researchers, payers, accreditors, and patient advocates joining forces in small pilot projects to test new experimental systems.

To know how much of something we're getting for our money, and whether it's a fair bargain, we need to be able to compare amounts across providers, vendors, treatment options, teaching methods, etc. Scores summed from tests, surveys, or assessments, individual ratings, and percentages of a maximum possible score or frequency do not provide this information, because they are not measures. Their unit sizes vary across individuals, collections of indicators (instruments), time, and space. The consequences of treating scores and percentages as measures are not trivial. We will eventually come to see that measurement quality is the primary source of the difference between the current health care and education systems' regional variations and endlessly spiraling costs, on the one hand, and the geographically uniform quality, costs, and improvements of the systems we will create in the future, on the other.

Markets are dysfunctional when quality and costs cannot be evaluated in common terms by consumers, providers’ quality improvement specialists, researchers, accreditors, and payers. There are widespread calls for greater transparency in purchasing decisions, but transparency is not being defined and operationalized meaningfully or usefully. As currently employed, transparency refers to making key data available for public scrutiny. But these data are almost always expressed as scores, ratings, or percentages that are anything but transparent. In addition to not adding up, these data are also usually presented in indigestibly large volumes, and are not quality assessed.

All things considered, we're doing amazingly well with our health care and education systems, given the way we've hobbled ourselves with dysfunctional, incommensurable measures. And that gives us real cause for hope! What will we be able to accomplish when we really put our minds to measuring what we want to manage? How much better will we be able to do when entrepreneurs have the tools they need to innovate new efficiencies? Who knows what we'll be capable of when we have meaningful measures that stand for amounts that really add up, when data volumes are dramatically reduced to manageable levels, and when data quality is effectively assessed and improved?

For more on the problems associated with these kinds of percentages in the context of NCLB, see Andrew Dean Ho’s article in the August/September, 2008 issue of Educational Researcher, and Charles Murray’s “By the Numbers” column in the July 25, 2006 Wall Street Journal.

This is not the end of the story as to what the new measurement paradigm brings to bear. Next, I’ll post a table contrasting the features of scores, ratings, and percentages with those of measures. Until then, check out the latest issue of the Journal of Applied Measurement at http://www.jampress.org, see what’s new in measurement software at http://www.winsteps.com or http://www.rummlab.com.au, or look into what’s up in the way of measurement research projects with the BEAR group at UC Berkeley (http://gse.berkeley.edu/research/BEAR/research.html).

Finally, keep in mind that we are what we measure. It’s time we measured what we want to be.


Graphic Illustrations of Why Scores, Ratings, and Percentages Are Not Measures, Part One

July 1, 2009

It happens occasionally, when I'm speaking to a group unfamiliar with measurement concepts, that the audience audibly gasps at some of the things I say. What can be so shocking about anything as mundane as measurement? A lot of things, in fact, since we are in the strange situation of having valid and rigorous intuitions about what measures ought to be, while we simultaneously have entire domains of life in which our measures almost never live up to those intuitions in practice.

So today I’d like to spell out a few things about measurement, graphically. First, I’m going to draw a picture of what good measurement looks like. This picture will illustrate why we value numbers and want to use them for managing what’s important. Then I’m going to draw a picture of what scores, ratings, and percentages look like. Here we’ll see how numbers do not automatically stand for something that adds up the way they do, and why we don’t want to use these funny numbers for managing anything we really care about. What we will see here, in effect, is why high stakes graduation, admissions, and professional certification and licensure testing agencies have long since abandoned scores, ratings, and percentages as their primary basis for making decisions.

After contrasting those pictures, a third picture will illustrate how to blend the valid intuitions informing what we expect from measures with the equally valid intuitions informing the observations expressed in scores, ratings, and percentages.

Imagine measuring everything in the room you’re in twice, once with a yardstick and once with a meterstick. You record every measure in inches and in centimeters. Then you plot these pairs of measures against each other, with inches on the vertical axis and centimeters on the horizontal. You would come up with a picture like Figure 1, below.

Figure 1. How We Expect Measures to Work

The key thing to appreciate about this plot is that the amounts of length measured by the two different instruments stay the same no matter which number line they are mapped onto. You would get a plot like this even if you sawed a yardstick in half and plotted the inches read off the two halves. You'd also get the same kind of plot (obviously) if you paired up measures of the same things from two different manufacturers' inch rulers, or from two different brands of metersticks. And you could do the same kind of thing with ounces and grams, or degrees Fahrenheit and Celsius.

So here we are immersed in the boring-to-the-point-of-being-banal details of measurement. We take these alignments completely for granted, but they are not given to us for nothing. They are products of the huge investments we make in metrological standards. Metrology came of age in the early nineteenth century. Until then, weights and measures varied from market to market. Units with the same name might be different sizes, and units with different names might be the same size. As was so rightly celebrated on World Metrology Day (May 20), metric uniformity contributes hugely to the world economy by reducing transaction costs and by structuring representations of fair value.

We are in dire need of similar metrological systems for human, social, and natural capital. Health care reform, improved education systems, and environmental management will not come anywhere near realizing their full potentials until we establish, implement, and monitor metrological standards that bring intangible forms of capital to economic life.

But can we construct plots like Figure 1 from the numeric scores, ratings, and percentages we commonly assume to be measures? Figure 2 shows the kind of picture we get when we plot percentages against each other (scores and ratings behave in the same way, for reasons given below). These data might be from easy and hard halves of the same reading or math test, from agreeable and disagreeable ends of the same rating scale survey, or from different tests or surveys that happen to vary in their difficulty or agreeability. The Figure 2 data might also come from different situations in which some event or outcome occurs more frequently in one place than it does in another (we’ll go more into this in Part Two of this report).

Figure 2. Percents Correct or Agreement from Different Tests or Surveys

In contrast with the linear relation obtained in the comparison of inches and centimeters, here we have a curve. Why must this relation necessarily be curved? It cannot be linear because both instruments limit their measurement ranges, and they set different limits. So, if someone scores a 0 on the easy instrument, they are highly likely to also score 0 on the instrument posing more difficult or disagreeable questions. Conversely, if someone scores 100 on the hard instrument, they are highly likely to also score 100 on the easy one.

But what is going to happen in the rest of the measurement range? By the definition of easy and hard, scores on the easy instrument will be higher than those on the hard one. And because the same measured amount is associated with different ranges in the easy and hard score distributions, the scores vary at different rates (Part Two will explore this phenomenon in more detail).
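
To see where the curve comes from, consider a minimal sketch that generates paired percentages from a simple logistic (Rasch-type) response model. Treating each instrument as a single item sitting at its average difficulty is a simplification, and the difficulty values are arbitrary assumptions:

    import math

    def expected_percent(ability, difficulty):
        """Expected percent correct under a simple logistic (Rasch-type) model."""
        return 100 / (1 + math.exp(-(ability - difficulty)))

    # The easy instrument is two logits easier than the hard one.
    for ability in range(-4, 5):
        easy = expected_percent(ability, difficulty=-1.0)
        hard = expected_percent(ability, difficulty=+1.0)
        print(f"ability {ability:+d}: easy {easy:5.1f}%  hard {hard:5.1f}%")

Plotting the easy percentages against the hard ones traces exactly the kind of bowed curve shown in Figure 2: compressed at the extremes and steepest in the middle.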

These kinds of numbers are called ordinal because they meaningfully convey information about rank order. They do not, however, stand for amounts that add up. We are, of course, completely free to treat these ordinal numbers however we want, in any kind of arithmetical or statistical comparison. Whether such comparisons are meaningful and useful is a completely different issue.

Figure 3 shows the Figure 2 data transformed. The mathematical transformation of the percentages produces what is known as a logit, so called because it is a log-odds unit: the natural logarithm of the response odds, ln(p / (1 - p)), where p is the response probability, the original proportion of the maximum possible score. This is the simplest possible way of estimating linear measures; virtually no computer program providing these kinds of estimates would employ an algorithm this simple and potentially fallible.
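
In code, this direct conversion is a one-liner. To be clear, this is the simple, potentially fallible version just described, not what production estimation software does:

    import math

    def logit(percent):
        """Convert a percent of the maximum possible score to log-odds units."""
        p = percent / 100.0           # the response probability
        return math.log(p / (1 - p))  # natural log of the response odds

    print(logit(50.0))   # 0.0 logits
    print(logit(73.1))   # about +1.0 logits
    print(logit(26.9))   # about -1.0 logits
    # Percents of 0 or 100 have undefined logits; real estimation
    # programs handle these extremes with more robust methods.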

Figure 3. Logit (Log-Odds Units) Estimates of the Figure 2 Data

Although the relationship shown in Figure 3 is not as precise as that shown in Figure 1, especially at the extremes, the values plotted fall far closer to the identity line than the values in Figure 2 do. Like Figure 1, Figure 3 shows that constant amounts of the thing measured exist irrespective of the particular number line they happen to be mapped onto.

What this means is that the two instruments could be designed so that the same numbers are read off of them when the same amounts are measured. We value numbers as much as we do because they are so completely transparent: 2+2=4, no matter what. But this transparency becomes a liability when we assume that every unit amount is the same as all the others even though they actually vary substantially. When different units stand for different amounts, confusion reigns. But we can reasonably hope and strive for great things as we bring human, social, and natural capital to life via universally uniform metrics traceable to reference standards.

A large literature on these methods exists and ought to be more widely read. For more information, see http://www.rasch.org, http://www.livingcapitalmetrics.com, etc.


The Measurement Twilight Zone

June 3, 2009

Measurement is everywhere. Our symbol for justice is a balance scale. We have technical standards for air, water, and food quality. Trade and commerce, from local to global markets, depend on quick and easy ways of knowing what and how much is for sale.

We all depend on measurement, but hardly anyone knows anything about how instruments are calibrated or how meaningful expressions of quantity are created and maintained.

So measurement exists in a kind of twilight zone between the clearest and most rigorous mathematics, on the one hand, and the darkest and most obscure ignorance, on the other. Take temperature, for instance. Virtually everyone over the age of five or so knows how to read a thermometer. But very few people can correctly describe the thermodynamic relationships that make a thermometer work.

We can rely on thermometer manufacturers to do the work of calibrating temperature measures for us. But what happens when we need to measure something for which there are no commercially available solutions?

As demand increases for measures of human and organizational performance, of social capital, and of environmental impact, more and more managers, executives, entrepreneurs, accountants, philanthropists, and researchers unknowingly enter the measurement twilight zone.

In the measurement twilight zone, things are not as they seem. Numbers add up the way they always do, but they no longer stand for constant amounts. We manage what we measure, and so we ask customers, employees, or patients to rate performances, we count right answers on tests, and we compute the percentage of time that some event happens.

But none of these numbers are measures. None of them add up. This is a very serious situation. It is not a rare, academic technicality of no practical consequence. Improving the quality of our measures is an urgent matter that ought to be the focus of a great deal more attention and interest than it currently is.

For instance, did you know that a 15% difference can sometimes stand for as much as, or even a lot more than, a 39% difference? That three markedly different percentage values, differing by more than one standard error or even five, might actually stand for the same measured amount? Or that the difference between 1 percent and 2 percent can represent 4-8 times the difference between 49 percent and 50 percent?
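
The first of those claims can be checked with the same normal-curve reasoning used in the two "Graphic Illustrations" posts above (a sketch; the particular percentages are chosen only for illustration):

    from statistics import NormalDist

    norm = NormalDist()

    def z_gain(p0, p1):
        """Measured gain, in standard deviation units, implied by a
        change in percents drawn from a normal distribution."""
        return norm.inv_cdf(p1) - norm.inv_cdf(p0)

    print(z_gain(0.84, 0.99))    # a 15% change in the tail: about 1.33 SD
    print(z_gain(0.305, 0.695))  # a 39% change in the middle: about 1.02 SD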

Scores, ratings, and percentages are termed “ordinal” because, at best, they stand for a rank order of less and more. They do not stand for equal-interval amounts, though they can be a good start at creating real measures.

The general public doesn't know much about any of this because the math is pretty intense, the software is hard to use, and we have an ingrained cultural prejudice that says all we have to do is come up with numbers of some kind and, voila, we have measurement. Nothing could be further from the truth.

My goal in all of this is to figure out how to put tools that work into the hands of the people who need them. You don't need a PhD in thermodynamics to read a thermometer, so we ought to be able to calibrate similarly easy-to-read instruments for the other things we want to measure. And given the way demands for transparency and accountability are converging with economics and technology, I think the time is ripe for new ideas properly presented.

In my 25 years of experience in measurement, people often turn out not to understand what they think they understand. They are then also amazed at what they learn when they take the trouble to put some time and care into crafting an instrument that really measures what they're after.

For instance, did you know that there are mathematical ways of reducing data volume that not only involve no loss of information but that actually increase the amount of actionable value? We are swimming in seas of data that do not usually mean what we think they mean, so being able to ensure things add up properly at the same time we reduce the volume of numbers we have to deal with is an eminently practical aid to understanding and manageability.

Did you know that different sets of indicators or items can measure in a common metric? Or that a large bank of items can be adaptively administered, with the instrument individually tailored and customized for each respondent, organization, or situation, all without compromising the comparability of the measures?
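
To make the adaptive idea concrete, here is a minimal sketch of the core item-selection step, assuming a hypothetical calibrated item bank; real computerized adaptive testing systems add proper maximum-likelihood ability estimation, stopping rules, and item exposure controls:

    import math

    # Hypothetical item bank: item id -> calibrated difficulty in logits.
    bank = {"item1": -2.0, "item2": -1.0, "item3": 0.0, "item4": 1.0, "item5": 2.0}

    def next_item(ability, administered):
        """Pick the unused item whose difficulty is closest to the current
        ability estimate, since such items are the most informative under
        a Rasch-type model."""
        candidates = {i: d for i, d in bank.items() if i not in administered}
        return min(candidates, key=lambda i: abs(candidates[i] - ability))

    def update_ability(ability, difficulty, correct, step=0.5):
        """Crude up-or-down adjustment standing in for real estimation."""
        expected = 1 / (1 + math.exp(-(ability - difficulty)))
        return ability + step * ((1 if correct else 0) - expected)

    # One illustrative step: a respondent at 0.0 logits answers correctly.
    theta, seen = 0.0, set()
    item = next_item(theta, seen)  # selects "item3", difficulty 0.0
    seen.add(item)
    theta = update_ability(theta, bank[item], correct=True)
    print(item, round(theta, 2))   # item3 0.25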

These are highly practical things to be able to do. Markets live and die on shared product definitions and shared metrics. Innovation almost never happens as a result of one person’s efforts; it is almost always a result of activities coordinated through a network structured by a common language of reference standards. We are very far from having the markets and levels of innovation we need in large part because the quality of measurement in so many business applications is so poor.

And there’s lots more where that came from, but I’ll stop there. You can learn a lot more on these topics from a lot of sources. I’ll list a few below.

http://www.rasch.org
http://www.rasch.org/rmt
http://en.wikipedia.org/wiki/Rasch_model
http://www.lexile.com
http://www.winsteps.com
http://www.livingcapitalmetrics.com

William P. Fisher, Jr., Ph.D.

We are what we measure.
It’s time we measured what we want to be.