Archive for the ‘Reliability’ Category

The Counterproductive Consequences of Common Study Designs and Statistical Methods

May 21, 2015

Because of the ways studies are designed and the ways data are analyzed, research results in psychology and the social sciences often appear to be nonlinear, sample- and instrument-dependent, and incommensurable, even when they need not be. Contrary to common assumptions about the nature of the constructs involved, typical research designs and statistical methods may obscure invariant relations more than they clarify them.

To take a particularly salient example, the number of small factors with eigenvalues greater than 1.0 identified via factor analysis increases with the number of modes in a multi-modal distribution, and the interpretation of results is further complicated by the fact that the number of factors identified decreases as sample size increases (Smith, 1996).

Similarly, variation in employment test validity across settings was established as a basic assumption by the 1970s, after 50 years of studies observing the situational specificity of results. But then Schmidt and Hunter (1977) identified sampling error, measurement error, and range restriction as major sources of what was only the appearance of incommensurable variation in employment test validity. In other words, for most of the 20th century, the identification of constructs and comparisons of results across studies were pointlessly confused by mixed populations, uncontrolled variation in reliability, and unnoted floor and/or ceiling effects. Though they do nothing to establish information systems deploying common languages structured by standard units of measurement (Feinstein, 1995), meta-analysis techniques are a step forward in equating effect sizes (Hunter & Schmidt, 2004).
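The three artifacts named above can be illustrated with a minimal sketch. It uses the textbook formulas for the expected sampling error variance of a correlation, Spearman's correction for attenuation, and the Thorndike Case II correction for direct range restriction. The function names and the study values are hypothetical, chosen only for illustration, and the sketch is not a reproduction of the procedures in Schmidt and Hunter (1977).

```python
import math


def sampling_error_variance(mean_r, n):
    """Expected variance of observed correlations due to sampling error alone."""
    return (1 - mean_r ** 2) ** 2 / (n - 1)


def correct_for_unreliability(r_obs, rel_x, rel_y):
    """Spearman's correction for attenuation due to measurement error."""
    return r_obs / math.sqrt(rel_x * rel_y)


def correct_for_range_restriction(r_obs, u):
    """Thorndike Case II correction; u = unrestricted SD / restricted SD."""
    return r_obs * u / math.sqrt(1 + r_obs ** 2 * (u ** 2 - 1))


def bare_bones(studies):
    """Summarize (r, n) pairs: weighted mean r, observed variance of r,
    and the variance expected from sampling error at the average n."""
    total_n = sum(n for _, n in studies)
    mean_r = sum(r * n for r, n in studies) / total_n
    var_obs = sum(n * (r - mean_r) ** 2 for r, n in studies) / total_n
    mean_n = total_n / len(studies)
    var_err = sampling_error_variance(mean_r, mean_n)
    return mean_r, var_obs, var_err


if __name__ == "__main__":
    # Hypothetical validity coefficients and sample sizes from two settings.
    studies = [(0.20, 100), (0.40, 100)]
    mean_r, var_obs, var_err = bare_bones(studies)
    print("weighted mean r:", round(mean_r, 3))
    print("share of observed variance expected from sampling error:",
          round(var_err / var_obs, 2))
    print("corrected for unreliability:",
          round(correct_for_unreliability(mean_r, 0.80, 0.60), 3))
    print("corrected for range restriction:",
          round(correct_for_range_restriction(mean_r, 1.5), 3))
```

In this made-up case most of the apparent difference between the two settings is what sampling error alone would be expected to produce, which is the sense in which situational specificity can be only an appearance.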

Wright and Stone’s (1979) Best Test Design, in contrast, takes up each of these problems in an explicit way. Sampling error is addressed in that both the sample’s and the items’ representations of the same populations of persons and expressions of a construct are evaluated. The evaluation of reliability is foregrounded and clarified by taking advantage of the availability of individualized measurement uncertainty (error) estimates (following Andrich, 1982, presented at AERA in 1977). And range restriction becomes manageable in terms of equating and linking instruments measuring in different ranges of the same construct. As was demonstrated by Duncan (1985; Allerup, Bech, Loldrup, et al., 1994; Andrich & Styles, 1998), for instance, the restricted ranges of various studies assessing relationships between measures of attitudes and behaviors led to the mistaken conclusion that these were separate constructs. When the entire range of variation was explicitly modeled and studied, a consistent relationship was found.
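As a rough illustration of what individualized error estimates make possible, the sketch below computes a standard error for a single person measure under the dichotomous Rasch model, and then a separation reliability of the kind described by Andrich (1982): the share of observed variance in the measures that is not error variance. The function names and the example values are assumptions made for illustration only, and estimation of the measures themselves is not shown.

```python
import math


def rasch_probability(ability, difficulty):
    """Probability of success under the dichotomous Rasch model (logits)."""
    return 1.0 / (1.0 + math.exp(-(ability - difficulty)))


def person_standard_error(ability, difficulties):
    """Model standard error of a person measure: 1 / sqrt(test information)."""
    information = 0.0
    for d in difficulties:
        p = rasch_probability(ability, d)
        information += p * (1.0 - p)
    return 1.0 / math.sqrt(information)


def separation_reliability(measures, standard_errors):
    """(Observed variance - mean error variance) / observed variance."""
    n = len(measures)
    mean = sum(measures) / n
    observed_var = sum((m - mean) ** 2 for m in measures) / (n - 1)
    error_var = sum(se ** 2 for se in standard_errors) / n
    return (observed_var - error_var) / observed_var


if __name__ == "__main__":
    # Hypothetical item calibrations and person measures, in logits.
    items = [-1.5, -0.5, 0.0, 0.5, 1.5]
    persons = [-2.0, -0.5, 0.0, 1.0, 3.0]
    errors = [person_standard_error(b, items) for b in persons]
    for b, se in zip(persons, errors):
        print("measure", b, "standard error", round(se, 2))
    print("separation reliability:",
          round(separation_reliability(persons, errors), 2))
```

The standard errors grow as a person's measure moves away from the range the items cover, which is why restricted-range instruments need to be equated and linked before their results are compared.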

Statistical and correlational methods have long histories of preventing the discovery, assessment, and practical application of invariant relations because they fail to test for invariant units of measurement, do not define standard metrics, never calibrate all instruments measuring the same thing in common units, and have no concept of formal measurement systems of interconnected instruments. Wider appreciation of the distinction between statistics and measurement (Duncan & Stenbeck, 1988; Fisher, 2010; Wilson, 2013a), and of the potential for metrological traceability we have within our reach (Fisher, 2009, 2012; Fisher & Stenner, 2013; Mari & Wilson, 2013; Pendrill, 2014; Pendrill & Fisher, 2015; Wilson, 2013b; Wilson, Mari, Maul, & Torres Irribarra, 2015), are demonstrably fundamental to the advancement of a wide range of fields.

References

Allerup, P., Bech, P., Loldrup, D., Alvarez, P., Banegil, T., Styles, I., & Tenenbaum, G. (1994). Psychiatric, business, and psychological applications of fundamental measurement models. International Journal of Educational Research, 21(6), 611-622.

Andrich, D. (1982). An index of person separation in Latent Trait Theory, the traditional KR-20 index, and the Guttman scale response pattern. Education Research and Perspectives, 9(1), 95-104 [http://www.rasch.org/erp7.htm].

Andrich, D., & Styles, I. M. (1998). The structural relationship between attitude and behavior statements from the unfolding perspective. Psychological Methods, 3(4), 454-469.

Duncan, O. D. (1985). Probability, disposition and the inconsistency of attitudes and behaviour. Synthese, 42, 21-34.

Duncan, O. D., & Stenbeck, M. (1988). Panels and cohorts: Design and model in the study of voting turnout. In C. C. Clogg (Ed.), Sociological Methodology 1988 (pp. 1-35). Washington, DC: American Sociological Association.

Feinstein, A. R. (1995). Meta-analysis: Statistical alchemy for the 21st century. Journal of Clinical Epidemiology, 48(1), 71-79.

Fisher, W. P., Jr. (2009). Invariance and traceability for measures of human, social, and natural capital: Theory and application. Measurement, 42(9), 1278-1287.

Fisher, W. P., Jr. (2010). Statistics and measurement: Clarifying the differences. Rasch Measurement Transactions, 23(4), 1229-1230.

Fisher, W. P., Jr. (2012, May/June). What the world needs now: A bold plan for new standards [Third place, 2011 NIST/SES World Standards Day paper competition]. Standards Engineering, 64(3), 1 & 3-5.

Fisher, W. P., Jr., & Stenner, A. J. (2013). Overcoming the invisibility of metrology: A reading measurement network for education and the social sciences. Journal of Physics: Conference Series, 459(012024), http://iopscience.iop.org/1742-6596/459/1/012024.

Hunter, J. E., & Schmidt, F. L. (Eds.). (2004). Methods of meta-analysis: Correcting error and bias in research findings. Thousand Oaks, CA: Sage.

Mari, L., & Wilson, M. (2013). A gentle introduction to Rasch measurement models for metrologists. Journal of Physics: Conference Series, 459(1), http://iopscience.iop.org/1742-6596/459/1/012002/pdf/1742-6596_459_1_012002.pdf.

Pendrill, L. (2014). Man as a measurement instrument [Special Feature]. NCSLi Measure: The Journal of Measurement Science, 9(4), 22-33.

Pendrill, L., & Fisher, W. P., Jr. (2015). Counting and quantification: Comparing psychometric and metrological perspectives on visual perceptions of number. Measurement, 71, 46-55. doi: http://dx.doi.org/10.1016/j.measurement.2015.04.010

Schmidt, F. L., & Hunter, J. E. (1977). Development of a general solution to the problem of validity generalization. Journal of Applied Psychology, 62(5), 529-540.

Smith, R. M. (1996). A comparison of methods for determining dimensionality in Rasch measurement. Structural Equation Modeling, 3(1), 25-40.

Wilson, M. R. (2013a). Seeking a balance between the statistical and scientific elements in psychometrics. Psychometrika, 78(2), 211-236.

Wilson, M. R. (2013b). Using the concept of a measurement system to characterize measurement models used in psychometrics. Measurement, 46, 3766-3774.

Wilson, M., Mari, L., Maul, A., & Torres Irribarra, D. (2015). A comparison of measurement concepts across physical science and social science domains: Instrument design, calibration, and measurement. Journal of Physics: Conference Series, 588(012034), http://iopscience.iop.org/1742-6596/588/1/012034.

Wright, B. D., & Stone, M. H. (1979). Best test design: Rasch measurement. Chicago, Illinois: MESA Press.

Assignment from Wired’s Predict What’s Next page: “Imagine the Future of Medical Bills”

March 20, 2010

William P. Fisher, Jr.

william@livingcapitalmetrics.com
New Orleans, Louisiana
20 March 2010

Consider the following, formulated in response to Wired magazine’s 18.04 request for ideas on the future of medical bills, for possible use on the Predict What’s Next page. For background on the concepts presented here, see previous posts in this blog, such as https://livingcapitalmetrics.wordpress.com/2010/01/13/reinventing-capitalism/.

Visualize an online image of a Maieutic Renaissance Bank’s Monthly Living Capital Stock, Investment, and Income Report. The report is shown projected as a vertical plane in the space above an antique desk. Credits and debits to and from Mary Smith’s health capital account are listed, along with similar information on all of her capital accounts. Lying on the desk is a personalized MRB Living Capital Credit/Debit card, evidently somehow projecting the report from the eyes of Mary’s holographic image on it.

The report shows headings and entries for Mary Smith’s various capital accounts:

  • liquid (cash, checking and savings),
  • property (home, car, boat, rental, investments, etc.),
  • social capital (trust, honesty, commitment, loyalty, community building, etc.) credits/debits:
    • personal,
    • community’s,
    • employer’s,
    • regional,
    • national;
  • human capital:
    • literacy credits (shown in Lexiles; http://www.lexile.com),
    • numeracy credits (shown in Quantiles; http://www.quantiles.com),
    • occupational credits (hireability, promotability, retainability, productivity),
    • health credits/debits (genetic, cognitive reasoning, physical function, emotional function, chronic disease management status, etc.); and
  • natural capital:
    • carbon credits/debits,
    • local and global air, water, ecological diversity, and environmental quality share values.

Example social capital credits/debits shown in the report might include volunteering to build houses in N’Awlins Ninth Ward, tutoring fifth-graders in math, jury duty, voting, writing letters to Congress, or charitable donations (credits), on the one hand, or library fines, a parking ticket, unmaintained property, etc. (debits), on the other.

Natural capital credits might be increased or decreased depending on new efficiencies obtained in the electrical grid or in power generation, a newly installed solar panel, or by a recent major industrial accident, environmental disaster, hurricane, etc.

Mary’s share of the current value of the overall Genuine National Product, or Happiness Index, is broken out by each major form of capital (liquid, property, social, human, natural).

The monetary values of credits are shown at the going market rates, alongside the changes from last month, last year, and three years ago.

One entry could be a deferred income and property tax amount, given a social capital investment level above a recommended minimum. Another entry would show new profit potentials expressed in proportions of investments wasted due to inefficiencies, with suggestions for how these can be reduced, and with time to investment recovery and amount of new social capital generated also indicated.

The health capital portion of the report is broken out in a magnified overlay. Mary’s physical and emotional function measures are shown by an arrow pointing at a level on a vertical ruler. Other arrows point at the average levels for people her age (globally, nationally, regionally, and locally), for women and women of different ages, living in different countries/cities, etc.

Mary’s diabetes-specific chronic disease management metric is shown at a high level, indicating her success in using diet and exercise to control her condition. Her life expectancy and lifetime earning potentials are shown, alongside comparable values for others.

Recent clinical visits for preventative diabetes and dental care would be shown as debits against one account and as an investment in her health capital account. The debits might be paid out of a sale of shares of stock from her quite high social or natural capital accounts, or from credits transferred from those to her checking account.

The cost of declining function over the next ten years, given typical aging patterns, is shown as lower rates of new capital investment in her stock and lower ROIs.

The cost of maintaining or improving function is shown in terms of the required investments of time and resources in exercise, equipment, etc., balanced against a constant rate of new investments and ROI.

Also shown:

A footnote could read: Given your recent completion of post-baccalaureate courses in political economy and advanced living capital finance, your increased stocks of literacy, numeracy, and occupational capital qualify you for a promotion or new positions currently compensated at annual rates 17.7% higher than your current one. Watch for tweets and beams from new investors interested in your rising stock!

A warning box: We all pay when dead capital lies unleveragable in currencies expressed in ordinal or otherwise nonstandard metrics! Visit http://www.CapitalResuscitationServices.com today to convert your unaccredited capital currencies into recognized value. (Not responsible for fraudulent misrepresentations of value should your credits prove incommensurable or counterfeit. Always check your vendor’s social capital valuations before investing in any stock offering. Go to http://www.Rasch.org for accredited capital metrics equating information, courses, texts, and consultants.)

Ad: Click here to put your occupational capital stock on the market now! Employers are bidding $$$, ¥¥¥ and €€€ on others at your valuation level!

Ad: You are only 110 Lexiles away from a literacy capital stock level on which others receive 23% higher investment returns! Enroll at BobsOnlineLiteracyCapitalBoosters.com now for your increased income tomorrow! (Past performance is not a guarantee of future results. Your returns may vary. Click here to see Bob’s current social capital valuations.)

Bottom line: Think global, act local! It is up to you to represent your shares in the global marketplace. Only you can demand the improvements you seek by shifting and/or intensifying your investments. Do so whenever you are dissatisfied with your own, your global and local business partners’, your community’s, your employer’s, your region’s, or your nation’s stock valuations.

For background on the concepts involved in this scenario, see:

Fisher, W. P., Jr. (2002, Spring). “The Mystery of Capital” and the human sciences. Rasch Measurement Transactions, 15(4), 854 [http://www.rasch.org/rmt/rmt154j.htm].

Fisher, W. P., Jr. (2005). Daredevil barnstorming to the tipping point: New aspirations for the human sciences. Journal of Applied Measurement, 6(3), 173-9 [http://www.livingcapitalmetrics.com/images/FisherJAM05.pdf].

Fisher, W. P., Jr. (2007, Summer). Living capital metrics. Rasch Measurement Transactions, 21(1), 1092-3 [http://www.rasch.org/rmt/rmt211.pdf].

Fisher, W. P., Jr. (2009, November). Invariance and traceability for measures of human, social, and natural capital: Theory and application. Measurement (Elsevier), 42(9), 1278-1287.

Fisher, W. P., Jr. (2009). NIST Critical national need idea White Paper: Metrological infrastructure for human, social, and natural capital (Tech. Rep. No. http://www.livingcapitalmetrics.com/images/FisherNISTWhitePaper2.pdf). New Orleans: http://www.LivingCapitalMetrics.com.

Fisher, W. P., Jr. (2010). Bringing human, social, and natural capital to life: Practical consequences and opportunities. Journal of Applied Measurement, 11, in press [http://www.livingcapitalmetrics.com/images/BringingHSN_FisherARMII.pdf].

Creative Commons License
LivingCapitalMetrics Blog by William P. Fisher, Jr., Ph.D. is licensed under a Creative Commons Attribution-Noncommercial-No Derivative Works 3.0 United States License.
Based on a work at livingcapitalmetrics.wordpress.com.
Permissions beyond the scope of this license may be available at http://www.livingcapitalmetrics.com.

Draft Legislation on Development and Adoption of an Intangible Assets Metric System

November 19, 2009

In my opinion, more could be done to effect meaningful and effective health care reform with legislation like that proposed below, which has fewer than 3,800 words, than will ever be possible with the 2,074 pages in Congress’s current health care reform bill. What’s more, creating the infrastructure for human, social, and natural capital markets in this way would not only cost a tiny fraction of the projected $847 billion bill being debated, it would be an investment that would pay returns many times larger than the initial investment. See previous posts in this blog for more info on how and why this is so.

The draft legislation below is adapted from The Metric Conversion Act (Title 15 U.S.C. Chapter 6 §(204) 205a – 205k). The viability of a metric system for human, social, and natural capital is indicated by the realized state of scientific rigor in the measurement of human, social, and natural capital (Fisher, 2009b). The need for such a system is indicated by the current crisis’s pointed economic demands that all forms of capital be unified within a common econometric and financial framework (Fisher, 2009a). It is equally demanded by the moral and philosophical requirements of fair play and meaningfulness (Fisher, 2004). The day is fast approaching when a metric system for intangible assets will be recognized as the urgent need that it is (Fisher, 2009c).

At some point in the near future, it can be expected that a table showing how to interpret the units of the Intangible Assets Metric System will be published in the Federal Register, just as the International System units have been.

For those unfamiliar with the state of the art in measurement, these may seem like wildly unrealistic goals. Those wondering how a reasonable person might arrive at such opinions are urged to consult other posts in this blog, and the references cited in them. The advantages of an intangible assets metric system for sustainable and socially responsible economic policies and practices are nothing short of profound. As Georg Rasch (1980, p. xx) said in reference to the stringent demands of his measurement models, “this is a huge challenge, but once the problem has been formulated it does seem possible to meet it.” We are less likely to attain goals that we do not actively formulate. In the spirit of John Dewey’s student, Chiang Mon-Lin, what we need are “wild hypotheses and careful tests.” There is no wilder idea with greater potential impact for redefining profit as the reduction of waste, and for thereby mitigating human suffering, sociopolitical discontent, and environmental degradation.

Fisher, W. P., Jr. (2004, October). Meaning and method in the social sciences. Human Studies: A Journal for Philosophy and the Social Sciences, 27(4), 429-54.

Fisher, W. P., Jr. (2009a). Bringing human, social, and natural capital to life: Practical consequences and opportunities. In M. Wilson, K. Draney, N. Brown, B. Duckor (Eds.), Advances in Rasch Measurement, Vol. Two (p. in press). Maple Grove, MN: JAM Press.

Fisher, W. P., Jr. (2009b, November). Invariance and traceability for measures of human, social, and natural capital: Theory and application. Measurement (Elsevier), 42(9), 1278-1287.

Fisher, W. P., Jr. (2009c). NIST Critical national need idea White Paper: Metrological infrastructure for human, social, and natural capital (Tech. Rep.). New Orleans: LivingCapitalMetrics.com.

Rasch, G. (1980). Probabilistic models for some intelligence and attainment tests (Reprint, with Foreword and Afterword by B. D. Wright, Chicago: University of Chicago Press). Copenhagen, Denmark: Danmarks Paedagogiske Institut.

Title xx U.S.C. Chapter x §(100) 101a – 101k
METRIC SYSTEM FOR INTANGIBLE ASSETS DEVELOPMENT LAW
(Pub. L. 10-xxx, §x, Intangible Assets Metrics Development Act, July 25, 2010)

§ 100. New metric system development authorized. – A new national effort is hereby initiated throughout the United States of America focusing on building and realizing the benefits of a metric system for the intangible assets known as human, social, and natural capital.

§ 101a. Congressional statement of findings. – The Congress finds as follows:

(1) The United States was an original signatory party to the 1875 Treaty of the Meter (20 Stat. 709), which established the General Conference of Weights and Measures, the International Committee of Weights and Measures and the International Bureau of Weights and Measures.

(2) The use of metric measurement standards in the United States was authorized by law in 1866; with the Metric Conversion Act of 1975 this Nation established a national policy of committing itself and taking steps to facilitate conversion to the metric system.

(3) World trade is dependent on the metric system of measurement; continuing trends toward globalization demand expansion of the metric system to include vital economic resources shown scientifically measurable in research conducted over the last 80 years.

(4) Industries and consumers in the United States are often at competitive disadvantages when dealing in domestic and international markets because no existing systems for measuring intangible assets (human, social, and natural capital) are expressed in standardized, universally uniform metrics. The end result is that education, health care, human resource, and other markets are unable to reward quality; supply and demand are unmatched, consumers make decisions with no or insufficient information; and quality cannot be systematically improved.

(5) The inherent simplicity of the metric system of measurement and standardization of weights and measures has led to major cost savings in certain industries which have converted to that system; similar savings are expected to follow from the development and implementation of a metric system for intangible assets.

(6) The Federal Government has a responsibility to develop procedures and techniques to assist industry, especially small business, as it voluntarily seeks to adopt a new metric system of measurement for intangible assets that have always required management but which have not yet been uniformly and systematically measured.

(7) A new metric system of measurement for human, social, and natural capital can provide substantial advantages to the Federal Government in its own operations.

§ 101b. Declaration of policy. – It is therefore the declared policy of the United States-

(1) to support the development and implementation of a new metric system of intangible assets measurement as the preferred system of weights and measures for United States trade and commerce involving human, social, and natural capital;

(2) to require that each Federal agency, by a date certain and to the extent economically feasible by the end of the fiscal year 2011, use the new metric system of intangibles measurement in its procurements, grants, and other business-related activities, except to the extent that such use is impractical or is likely to cause significant inefficiencies or loss of markets to United States firms, such as when foreign competitors are producing competing products in non-metric units; and

(3) to seek out ways to increase understanding of the new metric system of intangibles measurement through educational information and guidance and in Government publications.

§ 101c. Definitions

As used in this subchapter, the term-

(1) ‘Board’ means the United States Intangible Assets Metrics Board, established under section 101d of this Title;

(2) ‘engineering standard’ means a standard which prescribes (A) a concise set of conditions and requirements that must be satisfied by a material, product, process, procedure, convention, or test method; and (B) the physical, functional, performance and/or conformance characteristics thereof;

(3) ‘international standard or recommendation’ means an engineering standard or recommendation which is (A) formulated and promulgated by an international organization and (B) recommended for adoption by individual nations as a national standard;

(4) ‘metric system of measurement’ means the International System of Units as established by the General Conference of Weights and Measures in 1960 and as interpreted or modified for the United States by the Secretary of Commerce;

(5) ‘full and open competition’ has the same meaning as defined in section 403 of title 41;

(6) ‘total installed price’ means the price of purchasing a product or material, trimming or otherwise altering some or all of that product or material, if necessary to fit with other building components, and then installing that product or material into a Federal facility;

(7) ‘hard-metric’ means measurement, design, and manufacture using the metric system of measurement, but does not include measurement, design, and manufacture using English system measurement units which are subsequently reexpressed in the metric system of measurement;

(8) ‘cost or pricing data or price analysis’ has the meaning given such terms in section 254b of title 41; and

(9) ‘Federal facility’ means any public building (as defined under section 612 of title 40) and shall include any Federal building or construction project: (A) on lands in the public domain; (B) on lands used in connection with Federal programs for agriculture research, recreation, and conservation programs; (C) on or used in connection with river, harbor, flood control, reclamation, or power projects; (D) on or used in connection with housing and residential projects; (E) on military installations (including any fort, camp, post, naval training station, airfield, proving ground, military supply depot, military school, or any similar facility of the Department of Defense); (F) on installations of the Department of Veterans Affairs used for hospital or domiciliary purposes; or (G) on lands used in connection with Federal prisons, but does not include (i) any Federal building or construction project the exclusion of which the President deems to be justified in the public interest, or (ii) any construction project or building owned or controlled by a State government, local government, Indian tribe, or any private entity.

§ 101d. United States Intangible Assets Metrics Board

(a) Establishment. – There is established, in accordance with this section, an independent instrumentality to be known as a United States Intangible Assets Metrics Board.

(b) Membership; Chairman; appointment of members; term of office; vacancies. – The Board shall consist of 18 individuals, as follows:

(1) the Chairman, a qualified individual who shall be appointed by the President, by and with the advice and consent of the Senate;

(2) seventeen members who shall be appointed by the President, by and with the advice and consent of the Senate, on the following basis-

(A) one to be selected from lists of qualified individuals recommended by psychometricians and organizations representative of psychometric interests;

(B) one to be selected from lists of qualified individuals recommended by social scientists, the scientific and technical community, and organizations representative of social scientists and technicians;

(C) one to be selected from lists of qualified individuals recommended by environmental scientists, the scientific and technical community, and organizations representative of environmental scientists and technicians;

(D) one to be selected from a list of qualified individuals recommended by the National Association of Manufacturers or its successor;

(E) one to be selected from lists of qualified individuals recommended by the United States Chamber of Commerce, or its successor, retailers, and other commercial organizations;

(F) two to be selected from lists of qualified individuals recommended by the American Federation of Labor and Congress of Industrial Organizations or its successor, who are representative of workers directly affected by human capital metrics for health, skills, motivations, and productivity, and by other organizations representing labor;

(G) one to be selected from a list of qualified individuals recommended by the National Governors Conference, the National Council of State Legislatures, and organizations representative of State and local government;

(H) two to be selected from lists of qualified individuals recommended by organizations representative of small business;

(I) one to be selected from lists of qualified individuals representative of the human resource management industry;

(J) one to be selected from a list of qualified individuals recommended by the National Conference on Weights and Measures and standards making organizations;

(K) one to be selected from lists of qualified individuals recommended by educators, the educational community, and organizations representative of educational interests; and

(L) four at-large members to represent consumers and other interests deemed suitable by the President and who shall be qualified individuals.

As used in this subsection, each ‘list’ shall include the names of at least three individuals for each applicable vacancy. The terms of office of the members of the Board first taking office shall expire as designated by the President at the time of nomination; five at the end of the second year; five at the end of the fourth year; and six at the end of the sixth year. The term of office of the Chairman of such Board shall be six years. Members, including the Chairman, may be appointed to an additional term of six years, in the same manner as the original appointment. Successors to members of such Board shall be appointed in the same manner as the original members and shall have terms of office expiring six years from the date of expiration of the terms for which their predecessors were appointed. Any individual appointed to fill a vacancy occurring prior to the expiration of any term of office shall be appointed for the remainder of that term. Beginning 45 days after the date of incorporation of the Board, six members of such Board shall constitute a quorum for the transaction of any function of the Board.

(c) Compulsory powers. – Unless otherwise provided by the Congress, the Board shall have no compulsory powers.

(d) Termination. – The Board shall cease to exist when the Congress, by law, determines that its mission has been accomplished.

§ 101e. Functions and powers of Board. – It shall be the function of the Board to devise and carry out a broad program of planning, coordination, and public education, consistent with other national policy and interests, with the aim of implementing the policy set forth in this subchapter. In carrying out this program, the Board shall-

(1) consult with and take into account the interests, views, and costs relevant to the inefficiencies that have long plagued the management of unmeasured forms of capital in United States commerce and industry, including small business; science; engineering; labor; education; consumers; government agencies at the Federal, State, and local level; nationally recognized standards developing and coordinating organizations; intangibles metrics development, planning and coordinating groups; and such other individuals or groups as are considered appropriate by the Board to the carrying out of the purposes of this subchapter. The Board shall take into account activities underway in the private and public sectors, so as not to duplicate unnecessarily such activities;

(2) provide for appropriate procedures whereby various groups, under the auspices of the Board, may formulate, and recommend or suggest, to the Board specific programs for coordinating intangibles metrics development in each industry and segment thereof and specific dimensions and configurations in the new metric system and in other measurements for general use. Such programs, dimensions, and configurations shall be consistent with (A) the needs, interests, and capabilities of manufacturers (large and small), suppliers, labor, consumers, educators, and other interested groups, and (B) the national interest;

(3) publicize, in an appropriate manner, proposed programs and provide an opportunity for interested groups or individuals to submit comments on such programs. At the request of interested parties, the Board, in its discretion, may hold hearings with regard to such programs. Such comments and hearings may be considered by the Board;

(4) encourage activities of standardization organizations to develop or revise, as rapidly as practicable, policy and IT standards based on the new intangibles metrics, and to take advantage of opportunities to promote (A) rationalization or simplification of relationships, (B) improvements of design, (C) reduction of size variations, (D) increases in economy, and (E) where feasible, the efficient use of energy and the conservation of natural resources;

(5) encourage the retention, in the new metric language of human, social, and natural capital standards, of those United States policy and IT designs, practices, and conventions that are internationally accepted or that embody superior technology;

(6) consult and cooperate with foreign governments, and intergovernmental organizations, in collaboration with the Department of State, and, through appropriate member bodies, with private international organizations, which are or become concerned with the encouragement and coordination of increased use of intangible assets metrics measurement units or policy and IT standards based on such units, or both. Such consultation shall include efforts, where appropriate, to gain international recognition for intangible assets metrics standards proposed by the United States;

(7) assist the public through information and education programs, to become familiar with the meaning and applicability of metric terms and measures in daily life. Such programs shall include –

(A) public information programs conducted by the Board, through the use of newspapers, magazines, radio, television, the Internet, social networking, and other media, and through talks before appropriate citizens’ groups, and trade and public organizations;

(B) counseling and consultation by the Secretary of Education; the Secretary of Labor; the Administrator of the Small Business Administration; and the Director of the National Science Foundation, with educational associations, State and local educational agencies, labor education committees, apprentice training committees, and other interested groups, in order to assure (i) that the new intangible assets metric system of measurement is included in the curriculum of the Nation’s educational institutions, and (ii) that teachers and other appropriate personnel are properly trained to teach the intangible assets metric system of measurement;

(C) consultation by the Secretary of Commerce with the National Conference of Weights and Measures in order to assure that State and local weights and measures officials are (i) appropriately involved in intangible assets metric development and adoption activities and (ii) assisted in their efforts to bring about timely amendments to weights and measures laws; and

(D) such other public information activities, by any Federal agency in support of this subchapter, as relate to the mission of such agency;

(8) collect, analyze, and publish information about the extent of usage of intangible assets metric measurements; evaluate the costs and benefits of that usage; and make efforts to minimize any adverse effects resulting from increasing intangible assets metric usage;

(9) conduct research, including appropriate surveys; publish the results of such research; and recommend to the Congress and to the President such action as may be appropriate to deal with any unresolved problems, issues, and questions associated with intangible assets metric development, adoption, or usage. Such problems, issues, and questions may include, but are not limited to, the impact on different occupations and industries, possible increased costs to consumers, the impact on society and the economy, effects on small business, the impact on the international trade position of the United States, the appropriateness of and methods for using procurement by the Federal Government as a means to effect development and adoption of the intangible assets metric system, the proper conversion or transition period in particular sectors of society, and consequences for national defense;

(10) submit annually to the Congress and to the President a report on its activities. Each such report shall include a status report on the development and adoption process as well as projections for continued progress in that process. Such report may include recommendations covering any legislation or executive action needed to implement the programs of development and adoption accepted by the Board. The Board may also submit such other reports and recommendations as it deems necessary; and

(11) submit to the President, not later than 1 year after the date of enactment of the Act making appropriations for carrying out this subchapter, a report on the need to provide an effective structural mechanism for adopting intangible assets metric units in statutes, regulations, and other laws at all levels of government, on a coordinated and timely basis, in response to voluntary programs adopted and implemented by various sectors of society under the auspices and with the approval of the Board. If the Board determines that such a need exists, such report shall include recommendations as to appropriate and effective means for establishing and implementing such a mechanism.

§101f. – Duties of Board. – In carrying out its duties under this subchapter, the Board may –

(1) establish an Executive Committee, and such other committees as it deems desirable;

(2) establish such committees and advisory panels as it deems necessary to work with the various sectors of the Nation’s economy and with Federal and State governmental agencies in the development and implementation of detailed development and adoption plans for those sectors. The Board may reimburse, to the extent authorized by law, the members of such committees;

(3) conduct hearings at such times and places as it deems appropriate;

(4) enter into contracts, in accordance with the Federal Property and Administrative Services Act of 1949, as amended (40 U.S.C. 471 et seq.), with Federal or State agencies, private firms, institutions, and individuals for the conduct of research or surveys, the preparation of reports, and other activities necessary to the discharge of its duties;

(5) delegate to the Executive Director such authority as it deems advisable; and

(6) perform such other acts as may be necessary to carry out the duties prescribed by this subchapter.

§101g. – Gifts, donations and bequests to Board

(a) Authorization; deposit into Treasury and disbursement. – The Board may accept, hold, administer, and utilize gifts, donations, and bequests of property, both real and personal, and personal services, for the purpose of aiding or facilitating the work of the Board. Gifts and bequests of money, and the proceeds from the sale of any other property received as gifts or bequests, shall be deposited in the Treasury in a separate fund and shall be disbursed upon order of the Board.

(b) Federal income, estate, and gift taxation of property. – For purpose of Federal income, estate, and gift taxation, property accepted under subsection (a) of this section shall be considered as a gift or bequest to or for the use of the United States.

(c) Investment of moneys; disbursement of accrued income. – Upon the request of the Board, the Secretary of the Treasury may invest and reinvest, in securities of the United States, any moneys contained in the fund authorized in subsection (a) of this section. Income accruing from such securities, and from any other property accepted to the credit of such fund, shall be disbursed upon the order of the Board.

(d) Reversion to Treasury of unexpended funds. – Funds not expended by the Board as of the date when it ceases to exist, in accordance with section 105d(d) of this title, shall revert to the Treasury of the United States as of such date.

§101h. – Compensation of Board members; travel expenses. – Members of the Board who are not in the regular full-time employ of the United States shall, while attending meetings or conferences of the Board or while otherwise engaged in the business of the Board, be entitled to receive compensation at a rate not to exceed the daily rate currently being paid grade 18 of the General Schedule (under section 5332 of title 5), including travel time. While so serving on the business of the Board away from their homes or regular places of business, members of the Board may be allowed travel expenses, including per diem in lieu of subsistence, as authorized by section 5703 of title 5, for persons employed intermittently in the Government service. Payments under this section shall not render members of the Board employees or officials of the United States for any purpose. Members of the Board who are in the employ of the United States shall be entitled to travel expenses when traveling on the business of the Board.

§101i. – Personnel

(a) Executive Director; appointment; tenure; duties. – The Board shall appoint a qualified individual to serve as the Executive Director of the Board at the pleasure of the Board. The Executive Director, subject to the direction of the Board, shall be responsible to the Board and shall carry out the intangible assets metric development and adoption program, pursuant to the provisions of this subchapter and the policies established by the Board.

(b) Executive Director; salary. – The Executive Director of the Board shall serve full time and be subject to the provisions of chapter 51 and subchapter III of chapter 53 of title 5. The annual salary of the Executive Director shall not exceed level III of the Executive Schedule under section 5314 of such title.

(c) Staff personnel; appointment and compensation. – The Board may appoint and fix the compensation of such staff personnel as may be necessary to carry out the provisions of this subchapter in accordance with the provisions of chapter 51 and subchapter III of chapter 53 of title 5.

(d) Experts and consultants; employment and compensation; annual review of contracts. – The Board may (1) employ experts and consultants or organizations thereof, as authorized by section 3109 of title 5; (2) compensate individuals so employed at rates not in excess of the rate currently being paid grade 18 of the General Schedule under section 5332 of such title, including travel time; and (3) allow such individuals, while away from their homes or regular places of business, travel expenses (including per diem in lieu of subsistence) as authorized by section 5703 of such title for persons in the Government service employed intermittently: Provided, however, that contracts for such temporary employment may be renewed annually.

§101j. – Financial and administrative services; source and reimbursement. – Financial and administrative services, including those related to budgeting, accounting, financial reporting, personnel, and procurement, and such other staff services as may be needed by the Board, may be obtained by the Board from the Secretary of Commerce or other appropriate sources in the Federal Government. Payment for such services shall be made by the Board, in advance or by reimbursement, from funds of the Board in such amounts as may be agreed upon by the Chairman of the Board and by the source of the services being rendered.

§101k. – Authorization of appropriations; availability. – There are authorized to be appropriated such sums as may be necessary to carry out the provisions of this subchapter. Appropriations to carry out the provisions of this subchapter may remain available for obligation and expenditure for such period or periods as may be specified in the Acts making such appropriations.

Creative Commons License
LivingCapitalMetrics Blog by William P. Fisher, Jr., Ph.D. is licensed under a Creative Commons Attribution-Noncommercial-No Derivative Works 3.0 United States License.
Based on a work at livingcapitalmetrics.wordpress.com.
Permissions beyond the scope of this license may be available at http://www.livingcapitalmetrics.com.

Three demands of meaningful measurement

September 28, 2009

The core issue in measurement is meaningfulness. There are three major aspects of meaningfulness to take into account in measurement. These have to do with the constancy of the unit, interpreting the size of differences in measures, and evaluating the coherence of the units and differences.

First, raw scores (counts of right answers or other events, sums of ratings, or rankings) do not stand for anything that adds up the way they do (see previous blogs for more on this). Any given raw score unit can be 4-5 times larger than another, depending on where they fall in the range. Meaningful measurement demands a constant unit. Instrument scaling methods provide it.
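The unevenness of raw score units can be illustrated with a minimal sketch. It assumes, purely for illustration, a 20-item test made up of equally difficult items, so that the log-odds of a raw score, ln(r / (L - r)), can stand in for a linear measure. The function name and the test length are hypothetical and are not taken from any particular scaling program.

```python
import math

def log_odds(raw_score, n_items):
    """Log-odds (logit) of a raw score; undefined for zero and perfect scores."""
    return math.log(raw_score / (n_items - raw_score))

n_items = 20  # hypothetical test length (an assumption for this sketch)

# Size, in logits, of a one-point raw score step at the middle of the
# score range and near the top of it.
middle_step = log_odds(11, n_items) - log_odds(10, n_items)
extreme_step = log_odds(19, n_items) - log_odds(18, n_items)

print(f"10 -> 11: {middle_step:.2f} logits")
print(f"18 -> 19: {extreme_step:.2f} logits")
print(f"ratio:    {extreme_step / middle_step:.1f}")
```

Under these assumptions, the same one-point raw score difference covers several times more of the variable near the extremes than it does in the middle of the range, which is why a raw score cannot serve as a constant unit.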

Second, meaningful measurement requires that we be able to say just what any quantitative amount of difference is supposed to represent. What does a difference between two measures stand for in the way of what is and isn’t done at those two levels? Is the difference within the range of error, and so random? Is the difference many times more than the error, and so repeatedly reproducible and constant? Meaningful measurement demands that we be able to make reliable distinctions.

Third, meaningful measurement demands that the items work together to measure the same thing. If reliable distinctions can be made between measures, what is the one thing that all of the items tap into? If the data exhibit a consistency that is shared across items and across persons, what is the nature of that consistency? Meaningful measurement posits a model of what data must look like to be interpretable and coherent, and then it evaluates data in light of that model.

When a constant unit is in hand, when the limits of randomness relative to stable differences are known, and when individual responses are consistent with one another, then, and only then, is measurement meaningful. Inconstant units, unknown amounts of random variation, and inconsistent data can never amount to the science we need for understanding and managing skills, abilities, health, motivations, social bonds, and environmental quality.

Managing our investments in human, social, and natural capital for positive returns demands that meaningful measurement be universalized in uniformly calibrated and accessible metrics. Scientifically rigorous, practical, and convenient methods for setting reference standards and making instruments traceable to them are readily available.

We have the means in hand for effecting order-of-magnitude improvements in the meaningfulness of the measures used in education, health care, human and environmental resource management, etc. It’s time we got to work on it.

We are what we measure. It’s time we measured what we want to be.

Reliability Coefficients: Starting from the Beginning

August 31, 2009

[This posting was prompted by questions concerning a previous blog entry, Reliability Revisited, and provides background on reliability that only Rasch measurement practitioners are likely to possess.] Most measurement applications based in ordinal data do not implement rigorous checks of the internal consistency of the observations, nor do they typically use the log-odds transformation to convert the nonlinear scores into linear measures. Measurement is usually defined in statistical terms, applying population-level models to obtain group-level summary scores, means, and percentages. Measurement, however, ought to involve individual-level models and case-specific location estimates. (See one of my earlier blogs for more on this distinction between statistics and measurement.)

Given the appropriate measurement focus on the individual, the instrument is initially calibrated and measures are estimated in a simultaneous conjoint process. Once the instrument is calibrated, the item estimates can be anchored, measures can be routinely produced from them, and new items can be calibrated into the system, and others dropped, over time. This method has been the norm in admissions, certification, licensure, and high stakes testing for decades (Fisher & Wright, 1994; Bezruczko, 2005).

Measurement modelling of individual response processes has to be stochastic, or else we run into the attenuation paradox (Engelhard, 1993, 1994). This is the situation in which a deterministic progression of observations from one end of the instrument to the other produces apparently error-free data strings that look like this (1 being a correct answer, a higher rating, or the presence of an attribute, and 0 being incorrect, a lower rating, or the absence of the attribute):

00000000000

10000000000

11000000000

11100000000

11110000000

11111000000

11111100000

11111110000

11111111000

11111111100

11111111110

11111111111

In this situation, strings with all 0s and all 1s give no information useful for estimating measures (rows) or calibrations (columns). It is as though some of the people are shorter than the first unit on the ruler, and others are taller than the top unit. We don’t really have any way of knowing how short or tall they are, so their rows drop out. But eliminating the top and bottom rows leaves the leftmost column all 1s and the rightmost column all 0s, and eliminating those columns then gives new rows with all 0s or all 1s, and so on, until there’s no data left. (See my Reliability Revisited blog for evaluations of five different probabilistically-structured data sets of this kind simulated to contrast various approaches to assessing reliability and internal consistency.)
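The peeling-away process described above can be sketched in a few lines. This is only an illustration of the logic, not the procedure used by any particular estimation program; the function name and the small noisy data set are hypothetical.

```python
def peel_extremes(rows):
    """Repeatedly drop rows and columns that are all 0s or all 1s.

    Returns the data that survive and the number of passes made.
    """
    data = [list(r) for r in rows]
    passes = 0
    while data and data[0]:
        # Rows with extreme scores carry no information about location.
        kept_rows = [r for r in data if 0 < sum(r) < len(r)]
        n = len(kept_rows)
        # The same holds for columns that are extreme among the kept rows.
        kept_cols = [j for j in range(len(data[0]))
                     if 0 < sum(r[j] for r in kept_rows) < n]
        if n == len(data) and len(kept_cols) == len(data[0]):
            break  # no extreme rows or columns remain
        data = [[r[j] for j in kept_cols] for r in kept_rows] if kept_cols else []
        passes += 1
    return data, passes

# A fully deterministic pattern on 11 items: scores of 0 through 11.
deterministic = [[1] * k + [0] * (11 - k) for k in range(12)]
left, passes = peel_extremes(deterministic)
print(len(left), passes)

# A small hypothetical data set with some noise keeps all of its rows.
noisy = [[1, 0, 1, 0], [0, 1, 0, 0], [1, 1, 0, 1], [0, 1, 1, 0]]
print(peel_extremes(noisy)[0] == noisy)
```

The deterministic pattern is consumed entirely, while the noisy data, having no extreme rows or columns, are left intact for estimation.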

The problem for estimation (Linacre, 1991, 1999, 2000) in data like those shown above is that the lack of informational overlaps between the columns, on the one hand, and between the rows, on the other, gives us no basis for knowing how much more of the variable is represented by any one item relative to any other, or by any one person measured relative to any other. In addition, whenever we actually construct measures of abilities, attitudes, or behaviors that conform with this kind of Guttman (1950) structure (Andrich, 1985; Douglas & Wright, 1989; Engelhard, 2008), the items have to be of such markedly different difficulties or agreeabilities that the results tend to involve large numbers of indistinguishable groups of respondents. But when that information is present in a probabilistically consistent way, we have an example of the phenomenon of stochastic resonance (Fisher, 1992b), so called because of the way noise amplifies weak deterministic signals (Andò & Graziani, 2000; Benzi, Sutera, & Vulpiani, 1981; Bulsara & Gammaitoni, 1996; Dykman & McClintock, 1998; Schimansky-Geier, Freund, Neiman, & Shulgin, 1998).

We need the noise, but we can’t let it overwhelm the system. We have to be able to know how much error there is relative to actual signal. Reliability is traditionally defined (Guilford 1965, pp. 439-40) as an estimate of this relation of signal and noise:

“The reliability of any set of measurements is logically defined as the proportion of their variance that is true variance…. We think of the total variance of a set of measures as being made up of two sources of variance: true variance and error variance… The true measure is assumed to be the genuine value of whatever is being measured… The error components occur independently and at random.”

Traditional reliability coefficients, like Cronbach’s alpha, are correlational, implementing a statistical model of group-level information. Error is taken to be the unexplained portion of the variance:

“In his description of alpha Cronbach (1951) proved (1) that alpha is the mean of all possible split-half coefficients, (2) that alpha is the value expected when two random samples of items from a pool like those in the given test are correlated, and (3) that alpha is a lower bound to the proportion of test variance attributable to common factors among the items” (Hattie, 1985, pp. 143-4).

But measurement models of individual-level response processes (Rasch, 1960; Andrich, 1988; Wright, 1977; Fisher & Wright, 1994; Bond & Fox, 2007; Wilson, 2005; Bezruczko, 2005) employ individual-level error estimates (Wright, 1977; Wright & Stone, 1979; Wright & Masters, 1982), not correlational group-level variance estimates. The individual measurement errors are statistically equivalent to sampling confidence intervals, as is evident in both Wright’s equations and in plots of errors and confidence intervals (see Figure 4 in Fisher, 2008). That is, error and confidence intervals both decline at the same rate with larger numbers of item responses per person, or larger numbers of person responses per item.
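As a rough illustration of how individual error declines with the number of responses, the sketch below uses the standard result for the dichotomous Rasch model that the model standard error of a measure is the reciprocal of the square root of the summed response variances p(1 - p). The formula is not spelled out in this post, and the function names and item difficulties are assumptions made for the example.

```python
import math

def rasch_probability(measure, difficulty):
    """Probability of success under the dichotomous Rasch model (logits)."""
    return 1.0 / (1.0 + math.exp(-(measure - difficulty)))

def standard_error(measure, difficulties):
    """Model standard error: 1 / sqrt(sum of p * (1 - p) over items)."""
    information = 0.0
    for d in difficulties:
        p = rasch_probability(measure, d)
        information += p * (1.0 - p)
    return 1.0 / math.sqrt(information)

# Hypothetical perfectly targeted items (difficulty equal to the measure).
for n_items in (25, 100, 400):
    print(n_items, round(standard_error(0.0, [0.0] * n_items), 3))
```

In this idealized case each fourfold increase in the number of items halves the error, the same square-root rate at which sampling confidence intervals narrow.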

This phenomenon has a constructive application in instrument design. If a reasonable expectation for the measurement standard deviation can be formulated and related to the error expected on the basis of the number of items and response categories, a good estimate of the measurement reliability can be read off a nomograph (Linacre, 1993).

Wright (Wright & Masters, 1982, pp. 92, 106; Wright, 1996) introduced several vitally important measurement precision concepts and tools that follow from access to individual person and item error estimates. They improve on the traditional KR-20 or Cronbach reliability coefficients because the individualized error estimates better account for the imprecisions of mistargeted instruments, and for missing data, and so more accurately and conservatively estimate reliability.

Wright and Masters introduce a new reliability statistic, G, the measurement separation reliability index. The availability of individual error estimates makes it possible to estimate the true variance of the measures more directly, by subtracting the mean square error from the total variance. The standard deviation based on this estimate of true variance is then made the numerator of a ratio, G, having the root mean square error as its denominator.
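The calculation just described can be sketched as follows. The measures and errors are invented for illustration, and the use of the sample (n - 1) variance is an assumption of this sketch; a given program may compute the total variance differently.

```python
import math

def separation(measures, errors):
    """Return (G, reliability) from measures and their individual errors."""
    n = len(measures)
    mean = sum(measures) / n
    # Total (observed) variance of the measures; n - 1 is an assumption here.
    observed_var = sum((m - mean) ** 2 for m in measures) / (n - 1)
    # Mean square error from the individual error estimates.
    mse = sum(e * e for e in errors) / n
    # True variance is what remains after error is subtracted.
    true_var = max(observed_var - mse, 0.0)
    g = math.sqrt(true_var) / math.sqrt(mse)
    reliability = true_var / observed_var
    return g, reliability

measures = [-2.0, -1.0, 0.0, 1.0, 2.0]  # hypothetical measures in logits
errors = [0.5] * 5                      # hypothetical standard errors
g, reliability = separation(measures, errors)
print(round(g, 2), round(reliability, 2))
```

With these made-up values the true standard deviation is three times the root mean square error, so G is 3 and the corresponding reliability is 0.90.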

Each unit increase in this G index then represents another multiple of the error unit in the amount of quantitative variation present in the measures. This multiple is nonlinearly represented in the traditional reliability coefficients expressed in the 0.00 – 1.00 range, such that the same separation index unit difference is found in the 0.00 to 0.50, 0.50 to 0.80, 0.80 to 0.90, 0.90 to 0.94, 0.94 to 0.96, and 0.96 to 0.97 reliability ranges (see Fisher, 1992a, for a table of values; available online: see references).

G can also be estimated as the square root of the reliability divided by one minus the reliability. Conversely, a reliability coefficient roughly equivalent to Cronbach’s alpha is estimated as G squared divided by one plus G squared, which is the same as the true variance divided by the sum of the true variance and the error variance. Because individual error estimates are inflated in the presence of missing data and when an instrument is mistargeted and measures tend toward the extremes, the Rasch-based reliability coefficients tend to be more conservative than Cronbach’s alpha, as these sources of error are hidden within the variances and correlations. For a comparison of the G separation index, the G reliability coefficient, and Cronbach’s alpha over five simulated data sets, see the Reliability Revisited blog entry.
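The two conversions can be checked with a short sketch. It takes the true variance in units of the error variance, so that the reliability is G squared over one plus G squared; the function names are hypothetical.

```python
import math

def reliability_from_separation(g):
    """Reliability as true variance over true plus error variance."""
    return g * g / (1.0 + g * g)

def separation_from_reliability(r):
    """G as the square root of reliability over one minus reliability."""
    return math.sqrt(r / (1.0 - r))

# Equal steps in G correspond to shrinking steps in reliability.
table = [(g, round(reliability_from_separation(g), 2)) for g in range(1, 7)]
for g, r in table:
    print(g, r)
```

The resulting values mark the boundaries of the 0.50, 0.80, 0.90, 0.94, 0.96, and 0.97 reliability ranges, showing how the 0.00 to 1.00 scale compresses each additional unit of separation.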

Error estimates can be made more conservative yet by multiplying each individual error term by the larger of either 1.0 or the square root of the associated individual mean square fit statistic for that case (Wright, 1995). (The mean square fit statistics are chi-squares divided by their degrees of freedom, and so have an expected value of 1.00; see Smith (1996) for more on fit, and see my recent blog, Reliability Revisited, for more on the conceptualization and evaluation of reliability relative to fit.)
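A minimal sketch of this adjustment follows; the function name and the example values are hypothetical.

```python
import math

def inflated_error(error, mean_square):
    """Multiply an error by the larger of 1.0 or sqrt(mean square fit)."""
    return error * max(1.0, math.sqrt(mean_square))

# A misfitting case (mean square above 1.0) has its error enlarged;
# an overfitting case (mean square below 1.0) keeps its original error.
print(inflated_error(0.5, 4.0))
print(inflated_error(0.5, 0.64))
```

Because the multiplier never falls below 1.0, the adjustment can only widen an error estimate, never shrink it.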

Wright and Masters (1982, pp. 92, 105-6) also introduce the concept of strata, ranges on the measurement continuum with centers separated by three errors. Strata are in effect a more forgiving expression of the separation reliability index, G, since the latter approximates strata with centers separated by four errors. An estimate of strata defined as having centers separated by four errors is very nearly identical with the separation index. If three errors define a 95% confidence interval, four are equivalent to 99% confidence.
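The relation between strata and separation is commonly given as (4G + 1) / 3. That formula is not stated in this post and is brought in here as an assumption, drawn from the Rasch measurement literature cited above; the function name is hypothetical.

```python
def strata(g):
    """Number of strata with centers three errors apart, given separation G."""
    return (4.0 * g + 1.0) / 3.0

for g in (1.0, 2.0, 3.0, 4.0):
    print(g, round(strata(g), 2))
```

For any G above 1.0 the number of strata exceeds G itself, which is the sense in which strata are the more forgiving expression of the same information.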

There is a particular relevance in all of this for practical applications involving the combination or aggregation of physical, chemical, and other previously calibrated measures. This is illustrated in, for instance, the use of chemical indicators in assessing disease severity, environmental pollution, etc. Though any individual measure of the amount of a chemical or compound is valid within the limits of its intended purpose, to arrive at measures delineating disease severity, overall pollution levels, etc., the relevant instruments must be designed, tested, calibrated, and maintained, just as any instruments are (Alvarez, 2005; Cipriani, Fox, Khuder, et al., 2005; Fisher, Bernstein, et al., 2002; Fisher, Priest, Gilder, et al., 2008; Hughes, Perkins, Wright, et al., 2003; Perkins, Wright, & Dorsey, 2005; Wright, 2000).

The same methodology that is applied in this work, involving the rating or assessment of the quality of the outcomes or impacts counted, expressed as percentages, or given in an indicator’s native metric (parts per million, acres, number served, etc.), is needed in the management of all forms of human, social, and natural capital. (Watch this space for a forthcoming blog applying this methodology to the scaling of the UN Millennium Development Goals data.) The practical advantages of working from calibrated instrumentation in these contexts include data quality evaluations, the replacement of nonlinear percentages with linear measures, data volume reduction with no loss of information, and the integration of meaningful and substantive qualities with additive quantities on annotated metrics.

References

Alvarez, P. (2005). Several noncategorical measures define air pollution. In N. Bezruczko (Ed.), Rasch measurement in health sciences (pp. 277-93). Maple Grove, MN: JAM Press.

Andò, B., & Graziani, S. (2000). Stochastic resonance theory and applications. New York: Kluwer Academic Publishers.

Andrich, D. (1985). An elaboration of Guttman scaling with Rasch models for measurement. In N. B. Tuma (Ed.), Sociological methodology 1985 (pp. 33-80). San Francisco, California: Jossey-Bass.

Andrich, D. (1988). Rasch models for measurement (Sage University Paper Series on Quantitative Applications in the Social Sciences, series no. 07-068). Beverly Hills, California: Sage Publications.

Benzi, R., Sutera, A., & Vulpiani, A. (1981). The mechanism of stochastic resonance. Journal of Physics. A. Mathematical and General, 14, L453-L457.

Bezruczko, N. (Ed.). (2005). Rasch measurement in health sciences. Maple Grove, MN: JAM Press.

Bond, T., & Fox, C. (2007). Applying the Rasch model: Fundamental measurement in the human sciences, 2d edition. Mahwah, New Jersey: Lawrence Erlbaum Associates.

Bulsara, A. R., & Gammaitoni, L. (1996, March). Tuning in to noise. Physics Today, 49, 39-45.

Cipriani, D., Fox, C., Khuder, S., & Boudreau, N. (2005). Comparing Rasch analyses probability estimates to sensitivity, specificity and likelihood ratios when examining the utility of medical diagnostic tests. Journal of Applied Measurement, 6(2), 180-201.

Cronbach, L. J. (1951). Coefficient alpha and the internal structure of tests. Psychometrika, 16(3), 297-334.

Douglas, G. A., & Wright, B. D. (1989). Response patterns and their probabilities. Rasch Measurement Transactions, 3(4), 75-77 [http://www.rasch.org/rmt/rmt34.htm].

Dykman, M. I., & McClintock, P. V. E. (1998, January 22). What can stochastic resonance do? Nature, 391(6665), 344.

Engelhard, G., Jr. (1993). What is the attenuation paradox? Rasch Measurement Transactions, 6(4), 257 [http://www.rasch.org/rmt/rmt64.htm].

Engelhard, G., Jr. (1994). Resolving the attenuation paradox. Rasch Measurement Transactions, 8(3), 379.

Engelhard, G. (2008, July). Historical perspectives on invariant measurement: Guttman, Rasch, and Mokken. Measurement: Interdisciplinary Research & Perspectives, 6(3), 155-189.

Fisher, W. P., Jr. (1992a). Reliability statistics. Rasch Measurement Transactions, 6(3), 238 [http://www.rasch.org/rmt/rmt63i.htm].

Fisher, W. P., Jr. (1992b, Spring). Stochastic resonance and Rasch measurement. Rasch Measurement Transactions, 5(4), 186-187 [http://www.rasch.org/rmt/rmt54k.htm].

Fisher, W. P., Jr. (2008, Summer). The cash value of reliability. Rasch Measurement Transactions, 22(1), 1160-3 [http://www.rasch.org/rmt/rmt221.pdf].

Fisher, W. P., Jr., Bernstein, L. H., Qamar, A., Babb, J., Rypka, E. W., & Yasick, D. (2002, February). At the bedside: Measuring patient outcomes. Advance for Administrators of the Laboratory, 11(2), 8, 10 [http://laboratory-manager.advanceweb.com/Article/At-the-Bedside-7.aspx].

Fisher, W. P., Jr., Priest, E., Gilder, R., Blankenship, D., & Burton, E. C. (2008, July 3-6). Development of a novel heart failure measure to identify hospitalized patients at risk for intensive care unit admission. Presented at the World Congress on Controversies in Cardiovascular Diseases [http://www.comtecmed.com/ccare/2008/authors_abstract.aspx#Author15], Intercontinental Hotel, Berlin, Germany.

Fisher, W. P., Jr., & Wright, B. D. (Eds.). (1994). Applications of probabilistic conjoint measurement. International Journal of Educational Research, 21(6), 557-664.

Guilford, J. P. (1965). Fundamental statistics in psychology and education. 4th Edn. New York: McGraw-Hill.

Guttman, L. (1950). The basis for scalogram analysis. In S. A. Stouffer et al. (Eds.), Studies in social psychology in World War II, Volume 4: Measurement and prediction (pp. 60-90). New York: Wiley.

Hattie, J. (1985, June). Methodology review: Assessing unidimensionality of tests and items. Applied Psychological Measurement, 9(2), 139-64.

Hughes, L., Perkins, K., Wright, B. D., & Westrick, H. (2003). Using a Rasch scale to characterize the clinical features of patients with a clinical diagnosis of uncertain, probable or possible Alzheimer disease at intake. Journal of Alzheimer’s Disease, 5(5), 367-373.

Linacre, J. M. (1991, Spring). Stochastic Guttman order. Rasch Measurement Transactions, 5(4), 189 [http://www.rasch.org/rmt/rmt54p.htm].

Linacre, J. M. (1993). Rasch-based generalizability theory. Rasch Measurement Transactions, 7(1), 283-284 [http://www.rasch.org/rmt/rmt71h.htm].

Linacre, J. M. (1999). Understanding Rasch measurement: Estimation methods for Rasch measures. Journal of Outcome Measurement, 3(4), 382-405.

Linacre, J. M. (2000, Autumn). Guttman coefficients and Rasch data. Rasch Measurement Transactions, 14(2), 746-7 [http://www.rasch.org/rmt/rmt142e.htm].

Perkins, K., Wright, B. D., & Dorsey, J. K. (2005). Using Rasch measurement with medical data. In N. Bezruczko (Ed.), Rasch measurement in health sciences (pp. 221-34). Maple Grove, MN: JAM Press.

Rasch, G. (1960). Probabilistic models for some intelligence and attainment tests (Reprint, with Foreword and Afterword by B. D. Wright, Chicago: University of Chicago Press, 1980). Copenhagen, Denmark: Danmarks Paedagogiske Institut.

Schimansky-Geier, L., Freund, J. A., Neiman, A. B., & Shulgin, B. (1998). Noise induced order: Stochastic resonance. International Journal of Bifurcation and Chaos, 8(5), 869-79.

Smith, R. M. (2000). Fit analysis in latent trait measurement models. Journal of Applied Measurement, 1(2), 199-218.

Wilson, M. (2005). Constructing measures: An item response modeling approach. Mahwah, New Jersey: Lawrence Erlbaum Associates.

Wright, B. D. (1977). Solving measurement problems with the Rasch model. Journal of Educational Measurement, 14(2), 97-116 [http://www.rasch.org/memo42.htm].

Wright, B. D. (1995, Summer). Which standard error? Rasch Measurement Transactions, 9(2), 436-437 [http://www.rasch.org/rmt/rmt92n.htm].

Wright, B. D. (1996, Winter). Reliability and separation. Rasch Measurement Transactions, 9(4), 472 [http://www.rasch.org/rmt/rmt94n.htm].

Wright, B. D. (2000). Rasch regression: My recipe. Rasch Measurement Transactions, 14(3), 758-9 [http://www.rasch.org/rmt/rmt143u.htm].

Wright, B. D., & Masters, G. N. (1982). Rating scale analysis: Rasch measurement. Chicago, Illinois: MESA Press.

Wright, B. D., & Stone, M. H. (1979). Best test design: Rasch measurement. Chicago, Illinois: MESA Press.

Creative Commons License
LivingCapitalMetrics Blog by William P. Fisher, Jr., Ph.D. is licensed under a Creative Commons Attribution-Noncommercial-No Derivative Works 3.0 United States License.
Based on a work at livingcapitalmetrics.wordpress.com.
Permissions beyond the scope of this license may be available at http://www.livingcapitalmetrics.com.

Reliability Revisited: Distinguishing Consistency from Error

August 28, 2009

When something is meaningful to us, and we understand it, then we can successfully restate it in our own words and predictably reproduce approximately the same representation across situations as was obtained in the original formulation. When data fit a Rasch model, the implications are (1) that different subsets of items (that is, different ways of composing a series of observations summarized in a sufficient statistic) will all converge on the same pattern of person measures, and (2) that different samples of respondents or examinees will all converge on the same pattern of item calibrations. The meaningfulness of propositions based on these patterns will then not depend on which collection of items (instrument) or sample of persons is obtained, and all instruments might be equated relative to a single universal, uniform metric so that the same symbols reliably represent the same amount of the same thing.

Statistics and research methods textbooks in psychology and the social sciences commonly make statements like the following about reliability: “Reliability is consistency in measurement. The reliability of individual scale items increases with the number of points in the item. The reliability of the complete scale increases with the number of items.” (These sentences are found at the top of p. 371 in Experimental Methods in Psychology, by Gustav Levine and Stanley Parkinson (Lawrence Erlbaum Associates, 1994).) The unproven, perhaps unintended, and likely unfounded implication of these statements is that consistency increases as items are added.

Despite the popularity of doing so, Green, Lissitz, and Mulaik (1977) argue that reliability coefficients are misused when they are interpreted as indicating the extent to which data are internally consistent. “Green et al. (1977) observed that though high ‘internal consistency’ as indexed by a high alpha results when a general factor runs through the items, this does not rule out obtaining high alpha when there is no general factor running through the test items…. They concluded that the chief defect of alpha as an index of dimensionality is its tendency to increase as the number of items increase” (Hattie, 1985, p. 144).
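
The tendency of alpha to rise with test length can be illustrated with the standardized form of coefficient alpha, which depends only on the number of items and the mean inter-item correlation. The following is a minimal sketch under that simplifying assumption (items treated as having equal variances); the function name is illustrative and does not come from Green et al. or Hattie.

```python
def standardized_alpha(k: int, mean_r: float) -> float:
    """Standardized coefficient alpha for k items whose mean
    inter-item correlation is mean_r (assumes equal item variances)."""
    if k < 2:
        raise ValueError("alpha needs at least two items")
    return k * mean_r / (1 + (k - 1) * mean_r)


if __name__ == "__main__":
    # Hold a weak mean inter-item correlation fixed and lengthen the test.
    for k in (5, 10, 23, 36, 50, 100):
        print(k, round(standardized_alpha(k, 0.10), 3))
```

With a mean inter-item correlation of only .10, the coefficient reaches .80 at 36 items, even though nothing about the coherence of the items has changed.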

In addressing the internal consistency of data, the implicit but incompletely realized purpose of estimating scale reliability is to evaluate the extent to which sum scores function as sufficient statistics. How limited is reliability as a tool for this purpose? To answer this question, five dichotomous data sets of 23 items and 22 persons were simulated. The first one was constructed so as to be highly likely to fit a Rasch model, with a deliberately orchestrated probabilistic Guttman pattern. The second one was made nearly completely random. The third, fourth, and fifth data sets were modifications of the first one in which increasing numbers of increasingly inconsistent responses were introduced. (The inconsistencies were not introduced in any systematic way apart from inserting contrary responses in the ordered matrix.) The data sets are shown in the Appendix. Tables 1 and 2 summarize the results.
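
How the five matrices were produced is not described beyond the summary above, so the following is only a sketch of one way a probabilistic Guttman pattern of the same size could be generated from the dichotomous Rasch model. The logit ranges, the seed, and the function names are assumptions made for illustration; this is not the procedure used for the data in the Appendix.

```python
import math
import random


def rasch_probability(ability: float, difficulty: float) -> float:
    """Probability of a 1 under the dichotomous Rasch model (logits)."""
    return 1.0 / (1.0 + math.exp(-(ability - difficulty)))


def evenly_spaced(n: int, low: float, high: float) -> list:
    """n values spread evenly from low to high."""
    step = (high - low) / (n - 1)
    return [low + i * step for i in range(n)]


def simulate_responses(abilities, difficulties, seed=0):
    """Persons-by-items matrix of 0/1 responses drawn from the model."""
    rng = random.Random(seed)
    return [
        [1 if rng.random() < rasch_probability(b, d) else 0 for d in difficulties]
        for b in abilities
    ]


if __name__ == "__main__":
    abilities = evenly_spaced(22, -4.0, 4.0)     # 22 persons, assumed spread
    difficulties = evenly_spaced(23, -4.0, 4.0)  # 23 items, assumed spread
    for row in simulate_responses(abilities, difficulties):
        print("".join(str(x) for x in row))
```

Setting every ability and difficulty to the same value makes each response a coin toss, which corresponds in spirit to the nearly random second data set.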

Table 1 shows that the reliability coefficients do in fact decrease as the amount of randomness and inconsistency is increased, while the global model fit log-likelihood chi-squares rise and their associated probabilities fall. Contrary to what is implied in Levine and Parkinson’s statements, however, reliability can vary within a given number of items, as it might across different data sets produced from the same test, survey, or assessment, depending on how much structural invariance is present within them.
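
Cronbach’s alpha can be recomputed directly from the Appendix. The sketch below uses the usual variance-ratio formula (equivalent to KR-20 for dichotomous data) on Data Set 1 as transcribed in the Appendix; the function and variable names are illustrative.

```python
from statistics import variance

# Data Set 1 from the Appendix: 22 persons (rows) by 23 items (columns).
DATA_SET_1 = """
01100000000000000000000
10100000000000000000000
11000000000000000000000
11100000000000000000000
11101000000000000000000
11011000000000000000000
11100100000000000000000
11110100000000000000000
11111010100000000000000
11111101000000000000000
11111111010101000000000
11111111101010100000000
11111111111010101000000
11111111101101010010000
11111111111010101100000
11111111111111010101000
11111111111111101010100
11111111111111110101011
11111111111111111010110
11111111111111111111001
11111111111111111111101
11111111111111111111100
""".split()

MATRIX_1 = [[int(ch) for ch in line] for line in DATA_SET_1]


def cronbach_alpha(rows):
    """Coefficient alpha for a persons-by-items score matrix."""
    k = len(rows[0])
    item_variances = [variance(column) for column in zip(*rows)]
    total_variance = variance([sum(row) for row in rows])
    return k / (k - 1) * (1 - sum(item_variances) / total_variance)


if __name__ == "__main__":
    print(round(cronbach_alpha(MATRIX_1), 3))  # 0.957
```

The result agrees with the SPSS value reported for the first data set in Table 1. Substituting the other Appendix matrices shows alpha changing while the number of items stays fixed at 23.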

Two other points about the tables are worthy of note. First, the Rasch-based person separation reliability coefficients drop at a faster rate than Cronbach’s alpha does. This is probably an effect of the individualized error estimates in the Rasch context, which make its reliability coefficients more conservative than those based on correlation-based, group-level error estimates. (It is worth noting, as well, that the Winsteps and SPSS estimates of Cronbach’s alpha match for all but the nearly random second data set, where Winsteps reports .00 and SPSS reports a negative value. Winsteps reports alpha to one fewer decimal place; the third decimal place is shown for the SPSS values for contrast.)
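
The role of individualized error estimates can be made concrete with a common definition of separation reliability in the Rasch literature: the observed variance of the measures, less the mean of the squared individual standard errors, divided by the observed variance. The sketch below assumes that definition and uses the population variance; the measures and standard errors in the example are invented for illustration and are not Winsteps output.

```python
from statistics import mean, pvariance


def separation_reliability(measures, standard_errors):
    """(observed variance - mean error variance) / observed variance."""
    observed_variance = pvariance(measures)
    error_variance = mean([se ** 2 for se in standard_errors])
    # Error can exceed the observed spread in noisy data; floor at zero.
    true_variance = max(observed_variance - error_variance, 0.0)
    return true_variance / observed_variance


if __name__ == "__main__":
    measures = [-2.0, -1.0, 0.0, 1.0, 2.0]   # hypothetical logit measures
    precise = [0.5] * 5                        # hypothetical standard errors
    noisy = [1.2] * 5                          # larger hypothetical errors
    print(separation_reliability(measures, precise))  # 0.875
    print(round(separation_reliability(measures, noisy), 3))
```

Because each standard error enters the calculation separately, a few poorly measured persons raise the error variance and lower the coefficient in a way that a single group-level error estimate would not show.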

Second, the infit and outfit statistics are most affected by the initial and most glaring introduction of inconsistencies, in data set three. As the randomness in the data increases, the reliabilities continue to drop, but these fit statistics improve, culminating in the case of data set two, where complete randomness results in near-perfect infit and outfit mean squares. This is, of course, the situation in which both the instrument and the sample are as well targeted as they can be, since all respondents have about the same measure and all the items about the same calibration; see Wood (1978) for a commentary on this situation, where coin tosses fit a Rasch model.
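
The coin-toss case can be checked with the standard definitions of the mean square fit statistics: outfit is the mean of the squared standardized residuals, and infit is the sum of squared residuals divided by the sum of the modeled variances. The sketch below treats the person measure and item calibrations as known, which is an assumption made for illustration; in practice they are estimated from the data by software such as Winsteps.

```python
import math


def rasch_probability(ability, difficulty):
    """Probability of a 1 under the dichotomous Rasch model (logits)."""
    return 1.0 / (1.0 + math.exp(-(ability - difficulty)))


def person_fit(responses, ability, difficulties):
    """Return (infit, outfit) mean squares for one person's 0/1 responses."""
    squared_residuals = []
    variances = []
    for x, d in zip(responses, difficulties):
        p = rasch_probability(ability, d)
        squared_residuals.append((x - p) ** 2)
        variances.append(p * (1.0 - p))
    outfit = sum(r / w for r, w in zip(squared_residuals, variances)) / len(responses)
    infit = sum(squared_residuals) / sum(variances)
    return infit, outfit


if __name__ == "__main__":
    # Perfectly targeted items: every response is a coin toss (p = .5).
    print(person_fit([1, 0, 1, 1, 0], 0.0, [0.0] * 5))  # (1.0, 1.0)
    # An easy item failed and a hard item passed: both mean squares exceed 1.
    print(person_fit([0, 1], 0.0, [-1.0, 1.0]))
```

When every modeled probability is .5, each squared residual equals its modeled variance, so both mean squares are exactly 1.0 whatever the response string. Fit statistics of this kind therefore cannot distinguish well-targeted measurement from noise without the reliability and variance information reported alongside them.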

Table 2 shows the results of the Winsteps Principal Components Analysis of the standardized residuals for all five data sets. Again, the results conform with and support the pattern shown in the reliability coefficients. It is, however, interesting to note that, for data sets 4 and 5, with their Cronbach’s alphas of about .89 and .80, respectively, which are typically deemed quite good, the PCA shows more variance left unexplained than is explained by the Rasch dimension. The PCA is suggesting that two or more constructs might be represented in the data, but this would never be known from Cronbach’s alpha alone.

Alpha alone would indicate the presence of a unidimensional construct for data sets 3, 4 and 5, despite large standard deviations in the fit statistics and even though, for data sets 4 and 5, more than half the variance cannot be explained by the primary dimension. Worse, for the fifth data set, more variance is captured in the first three contrasts than is explained by the Rasch dimension. But with Cronbach’s alpha at .80, most researchers would consider this scale quite satisfactorily unidimensional and internally consistent.

These results suggest that, first, in seeking high reliability, what is sought more fundamentally is fit to a Rasch model (Andrich & Douglas, 1977; Andrich, 1982; Wright, 1977). That is, in addressing the internal consistency of data, the popular conception of reliability is taking on the concerns of construct validity. A conceptually clearer sense of reliability focuses on the extent to which an instrument works as expected every time it is used, in the sense of the way a car can be reliable. For instance, with an alpha of .70, a screening tool would be able to reliably distinguish measures into two statistically distinct groups (Fisher, 1992; Wright, 1996), problematic and typical. Within the limits of this purpose, the tool would meet the need for the repeated production of information capable of meeting the needs of the situation. Applications in research, accountability, licensure/certification, or diagnosis, however, might demand alphas of .95 and the kind of precision that allows for statistically distinct divisions into six or more groups. In these kinds of applications, where experimental designs or practical demands require more statistical power, measurement precision articulates finer degrees of differences. Finely calibrated instruments provide sensitivity over the entire length of the measurement continuum, which is needed for repeated reproductions of the small amounts of change that might accrue from hard to detect treatment effects.
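
The group counts mentioned above can be reproduced with conversions commonly given in the Rasch literature cited here (Fisher, 1992; Wright, 1996): the separation ratio G is the square root of R/(1 - R), and the number of statistically distinct strata is (4G + 1)/3. The sketch below assumes those two formulas and treats alpha as the reliability R; the function names are illustrative.

```python
import math


def separation(reliability: float) -> float:
    """Separation ratio G = sqrt(R / (1 - R)) for a reliability 0 <= R < 1."""
    return math.sqrt(reliability / (1.0 - reliability))


def strata(reliability: float) -> float:
    """Statistically distinct levels, (4G + 1) / 3."""
    return (4.0 * separation(reliability) + 1.0) / 3.0


if __name__ == "__main__":
    for r in (0.70, 0.80, 0.90, 0.95):
        print(r, round(separation(r), 2), round(strata(r), 2))
```

Under these assumptions a reliability of .70 corresponds to a little over two strata and .95 to a little over six, which is the contrast drawn between screening and high-stakes applications.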

Separating the construct, internal consistency, and unidimensionality issues from the repeatability and reproducibility of a given degree of measurement precision provides a much-needed conceptual and methodological clarification of reliability. This clarification is routinely made in Rasch measurement applications (Andrich, 1982; Andrich & Douglas, 1977; Fisher, 1992; Linacre, 1993, 1996, 1997). It is reasonable to want to account for inconsistencies in the data in the error estimates and in the reliability coefficients, and so errors and reliabilities are routinely reported both in terms of the modeled expectations and in a fit-inflated form (Wright, 1995). The fundamental value of proceeding from a basis in individual error and fit statistics (Wright, 1996) is that local imprecisions and failures of invariance can be isolated for further study and selective attention.

The results of the simulated data analyses suggest, second, that, used in isolation, reliability coefficients can be misleading. As Green et al. (1977) say, reliability estimates tend to systematically increase as the number of items increases (Fisher, 2008). The simulated data show that reliability coefficients also systematically decrease as inconsistency increases.

The primary problem with relying on reliability coefficients alone as indications of data consistency hinges on their inability to reveal the location of departures from modeled expectations. Most uses of reliability coefficients take place in contexts in which the model remains unstated and expectations are not formulated or compared with observations. The best that can be done in the absence of a model statement and test of data fit to it is to compare the reliability obtained against that expected on the basis of the number of items and response categories, relative to the observed standard deviation in the scores, expressed in logits (Linacre, 1993). One might then raise questions as to targeting, data consistency, etc. in order to explain larger than expected differences.

A more methodical way, however, would be to employ multiple avenues of approach to the evaluation of the data, including the use of model fit statistics and Principal Components Analysis in the evaluation of differential item and person functioning. Being able to see which individual observations depart the furthest from modeled expectation can provide illuminating qualitative information on the meaningfulness of the data, the measures, and the calibrations, or the lack thereof. This information is crucial to correcting data entry errors, identifying sources of differential item or person functioning, separating constructs and populations, and improving the instrument. Relative to the reliability-coefficient-only approach, the power of data quality evaluation is multiplied many times over when the researcher sets up a nested series of iterative dialectics in which repeated data analyses explore various hypotheses as to what the construct is, and in which these analyses feed into revisions to the instrument, its administration, and/or the population sampled.

For instance, following the point made by Smith (1996), it may be expected that the PCA results will illuminate the presence of multiple constructs in the data with greater clarity than the fit statistics, when there are nearly equal numbers of items representing each different measured dimension. But the PCA does not work as well as the fit statistics when there are only a few items and/or people exhibiting inconsistencies.

This work should result in a full circle return to the drawing board (Wright, 1994; Wright & Stone, 2003), such that a theory of the measured construct ultimately provides rigorously precise predictive control over item calibrations, in the manner of the Lexile Framework (Stenner, et al., 2006) or developmental theories of hierarchical complexity (Dawson, 2004). Given that the five data sets employed here were simulations with no associated item content, the invariant stability and meaningfulness of the construct cannot be illustrated or annotated. But such illustration also is implicit in the quest for reliable instrumentation: the evidentiary basis for a delineation of meaningful expressions of amounts of the thing measured. The hope to be gleaned from the successes in theoretical prediction achieved to date is that we might arrive at practical applications of psychosocial measures that are as meaningful, useful, and economically productive as the theoretical applications of electromagnetism, thermodynamics, etc. that we take for granted in the technologies of everyday life.

Table 1

Reliability and Consistency Statistics

22 Persons, 23 Items, 506 Data Points

| Data set | Intended reliability | Winsteps Real/Model Person Separation Reliability | Winsteps/SPSS Cronbach’s alpha | Winsteps Person Infit/Outfit Average Mn Sq | Winsteps Person Infit/Outfit SD | Winsteps Real/Model Item Separation Reliability | Winsteps Item Infit/Outfit Average Mn Sq | Winsteps Item Infit/Outfit SD | Log-Likelihood Chi-Sq/d.f./p |
|---|---|---|---|---|---|---|---|---|---|
| First | Best | .96/.97 | .96/.957 | 1.04/.35 | .49/.25 | .95/.96 | 1.08/.35 | .36/.19 | 185/462/1.00 |
| Second | Worst | .00/.00 | .00/-1.668 | 1.00/1.00 | .05/.06 | .00/.00 | 1.00/1.00 | .05/.06 | 679/462/.0000 |
| Third | Good | .90/.91 | .93/.927 | .92/2.21 | .30/2.83 | .85/.88 | .90/2.13 | .64/3.43 | 337/462/.9996 |
| Fourth | Fair | .86/.87 | .89/.891 | .96/1.91 | .25/2.18 | .79/.83 | .94/1.68 | .53/2.27 | 444/462/.7226 |
| Fifth | Poor | .76/.77 | .80/.797 | .98/1.15 | .24/.67 | .59/.65 | .99/1.15 | .41/.84 | 550/462/.0029 |

Table 2

Principal Components Analysis

| Data set | Intended reliability | % Raw Variance Explained by Measures/Persons/Items | % Raw Variance Captured in First Three Contrasts | Total number of loadings greater than .40 in absolute value in first contrast |
|---|---|---|---|---|
| First | Best | 76/41/35 | 12 | 8 |
| Second | Worst | 4.3/1.7/2.6 | 56 | 15 |
| Third | Good | 59/34/25 | 20 | 14 |
| Fourth | Fair | 47/27/20 | 26 | 13 |
| Fifth | Poor | 29/17/11 | 41 | 15 |

References

Andrich, D. (1982, June). An index of person separation in Latent Trait Theory, the traditional KR-20 index, and the Guttman scale response pattern. Education Research and Perspectives, 9(1) [http://www.rasch.org/erp7.htm].

Andrich, D., & Douglas, G. A. (1977). Reliability: Distinctions between item consistency and subject separation with the simple logistic model. Paper presented at the Annual Meeting of the American Educational Research Association, New York.

Dawson, T. L. (2004, April). Assessing intellectual development: Three approaches, one sequence. Journal of Adult Development, 11(2), 71-85.

Fisher, W. P., Jr. (1992). Reliability statistics. Rasch Measurement Transactions, 6(3), 238 [http://www.rasch.org/rmt/rmt63i.htm].

Fisher, W. P., Jr. (2008, Summer). The cash value of reliability. Rasch Measurement Transactions, 22(1), 1160-3.

Green, S. B., Lissitz, R. W., & Mulaik, S. A. (1977, Winter). Limitations of coefficient alpha as an index of test unidimensionality. Educational and Psychological Measurement, 37(4), 827-833.

Hattie, J. (1985, June). Methodology review: Assessing unidimensionality of tests and items. Applied Psychological Measurement, 9(2), 139-64.

Levine, G., & Parkinson, S. (1994). Experimental methods in psychology. Mahwah, NJ: Lawrence Erlbaum Associates, Inc.

Linacre, J. M. (1993). Rasch-based generalizability theory. Rasch Measurement Transactions, 7(1), 283-284 [http://www.rasch.org/rmt/rmt71h.htm].

Linacre, J. M. (1996). True-score reliability or Rasch statistical validity? Rasch Measurement Transactions, 9(4), 455 [http://www.rasch.org/rmt/rmt94a.htm].

Linacre, J. M. (1997). KR-20 or Rasch reliability: Which tells the “Truth?”. Rasch Measurement Transactions, 11(3), 580-1 [http://www.rasch.org/rmt/rmt113l.htm].

Smith, R. M. (1996). A comparison of methods for determining dimensionality in Rasch measurement. Structural Equation Modeling, 3(1), 25-40.

Stenner, A. J., Burdick, H., Sanford, E. E., & Burdick, D. S. (2006). How accurate are Lexile text measures? Journal of Applied Measurement, 7(3), 307-22.

Wood, R. (1978). Fitting the Rasch model: A heady tale. British Journal of Mathematical and Statistical Psychology, 31, 27-32.

Wright, B. D. (1977). Solving measurement problems with the Rasch model. Journal of Educational Measurement, 14(2), 97-116 [http://www.rasch.org/memo42.htm].

Wright, B. D. (1980). Foreword, Afterword. In Probabilistic models for some intelligence and attainment tests, by Georg Rasch (pp. ix-xix, 185-199. http://www.rasch.org/memo63.htm) [Reprint; original work published in 1960 by the Danish Institute for Educational Research]. Chicago, Illinois: University of Chicago Press.

Wright, B. D. (1994, Summer). Theory construction from empirical observations. Rasch Measurement Transactions, 8(2), 362 [http://www.rasch.org/rmt/rmt82h.htm].

Wright, B. D. (1995, Summer). Which standard error? Rasch Measurement Transactions, 9(2), 436-437 [http://www.rasch.org/rmt/rmt92n.htm].

Wright, B. D. (1996, Winter). Reliability and separation. Rasch Measurement Transactions, 9(4), 472 [http://www.rasch.org/rmt/rmt94n.htm].

Wright, B. D., & Stone, M. H. (2003). Five steps to science: Observing, scoring, measuring, analyzing, and applying. Rasch Measurement Transactions, 17(1), 912-913 [http://www.rasch.org/rmt/rmt171j.htm].

Appendix

Data Set 1

01100000000000000000000

10100000000000000000000

11000000000000000000000

11100000000000000000000

11101000000000000000000

11011000000000000000000

11100100000000000000000

11110100000000000000000

11111010100000000000000

11111101000000000000000

11111111010101000000000

11111111101010100000000

11111111111010101000000

11111111101101010010000

11111111111010101100000

11111111111111010101000

11111111111111101010100

11111111111111110101011

11111111111111111010110

11111111111111111111001

11111111111111111111101

11111111111111111111100

Data Set 2

01101010101010101001001

10100101010101010010010

11010010101010100100101

10101001010101001001000

01101010101010110010011

11011010010101100100101

01100101001001001001010

10110101000110010010100

01011010100100100101001

11101101001001001010010

11011010010101010100100

10110101101010101001001

01101011010000101010010

11010110101001010010100

10101101010000101101010

11011010101010010101010

10110101010101001010101

11101010101010110101011

11010101010101011010110

10101010101010110111001

01010101010101101111101

10101010101011011111100

Data Set 3

01100000000000100000010

10100000000000000010001

11000000000000100000010

11100000000000100000000

11101000000000100010000

11011000000000000000000

11100100000000100000000

11110100000000000000000

11111010100000100000000

11111101000000000000000

11111111010101000000000

11111111101010100000000

11111111111010001000000

11011111111111010010000

11011111111111101100000

11111111111111010101000

11011111111111101010100

11111111111111010101011

11011111111111111010110

11111111111111111111001

11011111111111111111101

10111111111111111111110

Data Set 4

01100000000000100010010

10100000000000000010001

11000000000000100000010

11100000000000100000001

11101000000000100010000

11011000000000000010000

11100100000000100010000

11110100000000000000000

11111010100000100010000

11111101000000000000000

11111011010101000010000

11011110111010100000000

11111111011010001000000

11011111101011110010000

11011111101101101100000

11111111110101010101000

11011111111011101010100

11111111111101110101011

01011111111111011010110

10111111111111111111001

11011111111111011111101

10111111111111011111110

Data Set 5

11100000010000100010011

10100000000000000011001

11000000010000100001010

11100000010000100000011

11101000000000100010010

11011000000000000010011

11100100000000100010000

11110100000000000000011

11111010100000100010000

00000000000011111111111

11111011010101000010000

11011110111010100000000

11111111011010001000000

11011111101011110010000

11011111101101101100000

11111111110101010101000

11011111101011101010100

11111111111101110101011

01011111111111011010110

10111111101111111111001

11011111101111011111101

00111111101111011111110

Publications Documenting Score, Rating, Percentage Contrasts with Real Measures

July 7, 2009

A few brief and easy introductions to the contrast between scores, ratings, and percentages vs measures include:

Linacre, J. M. (1992, Autumn). Why fuss about statistical sufficiency? Rasch Measurement Transactions, 6(3), 230 [http://www.rasch.org/rmt/rmt63c.htm].

Linacre, J. M. (1994, Summer). Likert or Rasch? Rasch Measurement Transactions, 8(2), 356 [http://www.rasch.org/rmt/rmt82d.htm].

Wright, B. D. (1992, Summer). Scores are not measures. Rasch Measurement Transactions, 6(1), 208 [http://www.rasch.org/rmt/rmt61n.htm].

Wright, B. D. (1989). Rasch model from counting right answers: Raw scores as sufficient statistics. Rasch Measurement Transactions, 3(2), 62 [http://www.rasch.org/rmt/rmt32e.htm].

Wright, B. D. (1993). Thinking with raw scores. Rasch Measurement Transactions, 7(2), 299-300 [http://www.rasch.org/rmt/rmt72r.htm].

Wright, B. D. (1999). Common sense for measurement. Rasch Measurement Transactions, 13(3), 704-5 [http://www.rasch.org/rmt/rmt133h.htm].

Longer and more technical comparisons include:

Andrich, D. (1989). Distinctions between assumptions and requirements in measurement in the social sciences. In J. A. Keats, R. Taft, R. A. Heath & S. H. Lovibond (Eds.), Mathematical and Theoretical Systems: Proceedings of the 24th International Congress of Psychology of the International Union of Psychological Science, Vol. 4 (pp. 7-16). North-Holland: Elsevier Science Publishers.

van Alphen, A., Halfens, R., Hasman, A., & Imbos, T. (1994). Likert or Rasch? Nothing is more applicable than good theory. Journal of Advanced Nursing, 20, 196-201.

Wright, B. D., & Linacre, J. M. (1989). Observations are always ordinal; measurements, however, must be interval. Archives of Physical Medicine and Rehabilitation, 70(12), 857-867 [http://www.rasch.org/memo44.htm].

Zhu, W. (1996). Should total scores from a rating scale be used directly? Research Quarterly for Exercise and Sport, 67(3), 363-372.

The following lists provide some key resources. The lists are intended to be representative, not comprehensive. There are many works in addition to these that document the claims in yesterday’s table. Many of these books and articles are highly technical. Good introductions can be found in Bezruczko (2005), Bond and Fox (2007), Smith and Smith (2004), Wilson (2005), Wright and Stone (1979), Wright and Masters (1982), Wright and Linacre (1989), and elsewhere. The www.rasch.org web site has comprehensive and current information on seminars, consultants, software, full text articles, professional association meetings, etc.

Books and Journal Issues

Andrich, D. (1988). Rasch models for measurement. Sage University Paper Series on Quantitative Applications in the Social Sciences, vol. series no. 07-068. Beverly Hills, California: Sage Publications.

Andrich, D., & Douglas, G. A. (Eds.). (1982). Rasch models for measurement in educational and psychological research [Special issue]. Education Research and Perspectives, 9(1), 5-118. [Full text available at www.rasch.org.]

Bezruczko, N. (Ed.). (2005). Rasch measurement in health sciences. Maple Grove, MN: JAM Press.

Bond, T., & Fox, C. (2007). Applying the Rasch model: Fundamental measurement in the human sciences, 2nd edition. Mahwah, New Jersey: Lawrence Erlbaum Associates.

Choppin, B. (1985). In Memoriam: Bruce Choppin (T. N. Postlethwaite ed.) [Special issue]. Evaluation in Education: An International Review Series, 9(1).

DeBoeck, P., & Wilson, M. (Eds.). (2004). Explanatory item response models: A generalized linear and nonlinear approach (Statistics for Social and Behavioral Sciences). New York: Springer-Verlag.

Embretson, S. E., & Hershberger, S. L. (Eds.). (1999). The new rules of measurement: What every psychologist and educator should know. Hillsdale, New Jersey: Lawrence Erlbaum Associates.

Engelhard, G., Jr., & Wilson, M. (1996). Objective measurement: Theory into practice, Vol. 3. Norwood, New Jersey: Ablex.

Fischer, G. H., & Molenaar, I. (1995). Rasch models: Foundations, recent developments, and applications. New York: Springer-Verlag.

Fisher, W. P., Jr., & Wright, B. D. (Eds.). (1994). Applications of Probabilistic Conjoint Measurement [Special Issue]. International Journal of Educational Research, 21(6), 557-664.

Garner, M., Draney, K., Wilson, M., Engelhard, G., Jr., & Fisher, W. P., Jr. (Eds.). (2009). Advances in Rasch measurement, Vol. One. Maple Grove, MN: JAM Press.

Granger, C. V., & Gresham, G. E. (Eds). (1993, August). New Developments in Functional Assessment [Special Issue]. Physical Medicine and Rehabilitation Clinics of North America, 4(3), 417-611.

Linacre, J. M. (1989). Many-facet Rasch measurement. Chicago, Illinois: MESA Press.

Liu, X., & Boone, W. (2006). Applications of Rasch measurement in science education. Maple Grove, MN: JAM Press.

Masters, G. N. (2007). Special issue: Programme for International Student Assessment (PISA). Journal of Applied Measurement, 8(3), 235-335.

Masters, G. N., & Keeves, J. P. (Eds.). (1999). Advances in measurement in educational research and assessment. New York: Pergamon.

Osborne, J. W. (Ed.). (2007). Best practices in quantitative methods. Thousand Oaks, CA: Sage.

Rasch, G. (1960). Probabilistic models for some intelligence and attainment tests (Reprint, with Foreword and Afterword by B. D. Wright, Chicago: University of Chicago Press, 1980). Copenhagen, Denmark: Danmarks Paedogogiske Institut.

Smith, E. V., Jr., & Smith, R. M. (Eds.) (2004). Introduction to Rasch measurement. Maple Grove, MN: JAM Press.

Smith, E. V., Jr., & Smith, R. M. (2007). Rasch measurement: Advanced and specialized applications. Maple Grove, MN: JAM Press.

Smith, R. M. (Ed.). (1997, June). Outcome Measurement [Special Issue]. Physical Medicine & Rehabilitation State of the Art Reviews, 11(2), 261-428.

Smith, R. M. (1999). Rasch measurement models. Maple Grove, MN: JAM Press.

von Davier, M. (2006). Multivariate and mixture distribution Rasch models. New York: Springer.

Wilson, M. (1992). Objective measurement: Theory into practice, Vol. 1. Norwood, New Jersey: Ablex.

Wilson, M. (1994). Objective measurement: Theory into practice, Vol. 2. Norwood, New Jersey: Ablex.

Wilson, M. (2005). Constructing measures: An item response modeling approach. Mahwah, New Jersey: Lawrence Erlbaum Associates.

Wilson, M., Draney, K., Brown, N., & Duckor, B. (Eds.). (2009). Advances in Rasch measurement, Vol. Two (p. in press). Maple Grove, MN: JAM Press.

Wilson, M., & Engelhard, G. (2000). Objective measurement: Theory into practice, Vol. 5. Westport, Connecticut: Ablex Publishing.

Wilson, M., Engelhard, G., & Draney, K. (Eds.). (1997). Objective measurement: Theory into practice, Vol. 4. Norwood, New Jersey: Ablex.

Wright, B. D., & Masters, G. N. (1982). Rating scale analysis: Rasch measurement. Chicago, Illinois: MESA Press.

Wright, B. D., & Stone, M. H. (1979). Best test design: Rasch measurement. Chicago, Illinois: MESA Press.

Wright, B. D., & Stone, M. H. (1999). Measurement essentials. Wilmington, DE: Wide Range, Inc. [http://www.rasch.org/memos.htm#measess].

Key Articles

Andersen, E. B. (1977). Sufficient statistics and latent trait models. Psychometrika, 42(1), 69-81.

Andrich, D. (1978). A rating formulation for ordered response categories. Psychometrika, 43, 561-73.

Andrich, D. (2002). Understanding resistance to the data-model relationship in Rasch’s paradigm: A reflection for the next generation. Journal of Applied Measurement, 3(3), 325-59.

Andrich, D. (2004, January). Controversy and the Rasch model: A characteristic of incompatible paradigms? Medical Care, 42(1), I-7–I-16.

Beltyukova, S. A., Stone, G. E., & Fox, C. M. (2008). Magnitude estimation and categorical rating scaling in social sciences: A theoretical and psychometric controversy. Journal of Applied Measurement, 9(2), 151-159.

Choppin, B. (1968). An item bank using sample-free calibration. Nature, 219, 870-872.

Embretson, S. E. (1996, September). Item Response Theory models and spurious interaction effects in factorial ANOVA designs. Applied Psychological Measurement, 20(3), 201-212.

Engelhard, G. (2008, July). Historical perspectives on invariant measurement: Guttman, Rasch, and Mokken. Measurement: Interdisciplinary Research & Perspectives, 6(3), 155-189.

Fischer, G. H. (1973). The linear logistic test model as an instrument in educational research. Acta Psychologica, 37, 359-374.

Fischer, G. H. (1981, March). On the existence and uniqueness of maximum-likelihood estimates in the Rasch model. Psychometrika, 46(1), 59-77.

Fischer, G. H. (1987). Applying the principles of specific objectivity and of generalizability to the measurement of change. Psychometrika, 52(4), 565-587.

Fisher, W. P., Jr. (1997). Physical disability construct convergence across instruments: Towards a universal metric. Journal of Outcome Measurement, 1(2), 87-113.

Fisher, W. P., Jr. (2004, October). Meaning and method in the social sciences. Human Studies: A Journal for Philosophy and the Social Sciences, 27(4), 429-54.

Fisher, W. P., Jr. (2009, July). Invariance and traceability for measures of human, social, and natural capital: Theory and application. Measurement (Elsevier), in press.

Grosse, M. E., & Wright, B. D. (1986, Sep). Setting, evaluating, and maintaining certification standards with the Rasch model. Evaluation & the Health Professions, 9(3), 267-285.

Hall, W. J., Wijsman, R. A., & Ghosh, J. K. (1965). The relationship between sufficiency and invariance with applications in sequential analysis. Annals of Mathematical Statistics, 36, 575-614.

Kamata, A. (2001, March). Item analysis by the Hierarchical Generalized Linear Model. Journal of Educational Measurement, 38(1), 79-93.

Karabatsos, G., & Ullrich, J. R. (2002). Enumerating and testing conjoint measurement models. Mathematical Social Sciences, 43, 487-505.

Linacre, J. M. (1997). Instantaneous measurement and diagnosis. Physical Medicine and Rehabilitation State of the Art Reviews, 11(2), 315-324.

Linacre, J. M. (2002). Optimizing rating scale category effectiveness. Journal of Applied Measurement, 3(1), 85-106.

Lunz, M. E., & Bergstrom, B. A. (1991). Comparability of decision for computer adaptive and written examinations. Journal of Allied Health, 20(1), 15-23.

Lunz, M. E., Wright, B. D., & Linacre, J. M. (1990). Measuring the impact of judge severity on examination scores. Applied Measurement in Education, 3(4), 331-345.

Masters, G. N. (1985, March). Common-person equating with the Rasch model. Applied Psychological Measurement, 9(1), 73-82.

Mislevy, R. J., Steinberg, L. S., & Almond, R. G. (2003). On the structure of educational assessments. Measurement: Interdisciplinary Research and Perspectives, 1(1), 3-62.

Pelton, T., & Bunderson, V. (2003). The recovery of the density scale using a stochastic quasi-realization of additive conjoint measurement. Journal of Applied Measurement, 4(3), 269-81.

Rasch, G. (1961). On general laws and the meaning of measurement in psychology. In Proceedings of the fourth Berkeley symposium on mathematical statistics and probability (pp. 321-333 [http://www.rasch.org/memo1960.pdf]). Berkeley, California: University of California Press.

Rasch, G. (1966). An individualistic approach to item analysis. In P. F. Lazarsfeld & N. W. Henry (Eds.), Readings in mathematical social science (pp. 89-108). Chicago, Illinois: Science Research Associates.

Rasch, G. (1966, July). An informal report on the present state of a theory of objectivity in comparisons. Unpublished paper [http://www.rasch.org/memo1966.pdf].

Rasch, G. (1966). An item analysis which takes individual differences into account. British Journal of Mathematical and Statistical Psychology, 19, 49-57.

Rasch, G. (1968, September 6). A mathematical theory of objectivity and its consequences for model construction. Unpublished paper [http://www.rasch.org/memo1968.pdf], Amsterdam, the Netherlands: Institute of Mathematical Statistics, European Branch.

Rasch, G. (1977). On specific objectivity: An attempt at formalizing the request for generality and validity of scientific statements. Danish Yearbook of Philosophy, 14, 58-94.

Romanoski, J. T., & Douglas, G. (2002). Rasch-transformed raw scores and two-way ANOVA: A simulation analysis. Journal of Applied Measurement, 3(4), 421-430.

Smith, R. M. (1996). A comparison of methods for determining dimensionality in Rasch measurement. Structural Equation Modeling, 3(1), 25-40.

Smith, R. M. (2000). Fit analysis in latent trait measurement models. Journal of Applied Measurement, 1(2), 199-218.

Stenner, A. J., & Smith III, M. (1982). Testing construct theories. Perceptual and Motor Skills, 55, 415-426.

Stenner, A. J. (1994). Specific objectivity – local and general. Rasch Measurement Transactions, 8(3), 374 [http://www.rasch.org/rmt/rmt83e.htm].

Stone, G. E., Beltyukova, S. A., & Fox, C. M. (2008). Objective standard setting for judge-mediated examinations. International Journal of Testing, 8(2), 180-196.

Stone, M. H. (2003). Substantive scale construction. Journal of Applied Measurement, 4(3), 282-297.

Wilson, M., & Sloane, K. (2000). From principles to practice: An embedded assessment system. Applied Measurement in Education, 13(2), 181-208.

Wright, B. D. (1968). Sample-free test calibration and person measurement. In Proceedings of the 1967 invitational conference on testing problems (pp. 85-101 [http://www.rasch.org/memo1.htm]). Princeton, New Jersey: Educational Testing Service.

Wright, B. D. (1977). Solving measurement problems with the Rasch model. Journal of Educational Measurement, 14(2), 97-116 [http://www.rasch.org/memo42.htm].

Wright, B. D. (1980). Foreword, Afterword. In Probabilistic models for some intelligence and attainment tests, by Georg Rasch (pp. ix-xix, 185-199 [http://www.rasch.org/memo63.htm]). Chicago, Illinois: University of Chicago Press.

Wright, B. D. (1984). Despair and hope for educational measurement. Contemporary Education Review, 3(1), 281-288 [http://www.rasch.org/memo41.htm].

Wright, B. D. (1985). Additivity in psychological measurement. In E. Roskam (Ed.), Measurement and personality assessment. North Holland: Elsevier Science Ltd.

Wright, B. D. (1996). Comparing Rasch measurement and factor analysis. Structural Equation Modeling, 3(1), 3-24.

Wright, B. D. (1997, June). Fundamental measurement for outcome evaluation. Physical Medicine & Rehabilitation State of the Art Reviews, 11(2), 261-288.

Wright, B. D. (1997, Winter). A history of social science measurement. Educational Measurement: Issues and Practice, 16(4), 33-45, 52 [http://www.rasch.org/memo62.htm].

Wright, B. D. (1999). Fundamental measurement for psychology. In S. E. Embretson & S. L. Hershberger (Eds.), The new rules of measurement: What every educator and psychologist should know (pp. 65-104 [http://www.rasch.org/memo64.htm]). Hillsdale, New Jersey: Lawrence Erlbaum Associates.

Wright, B. D., & Bell, S. R. (1984, Winter). Item banks: What, why, how. Journal of Educational Measurement, 21(4), 331-345 [http://www.rasch.org/memo43.htm].

Wright, B. D., & Linacre, J. M. (1989). Observations are always ordinal; measurements, however, must be interval. Archives of Physical Medicine and Rehabilitation, 70(12), 857-867 [http://www.rasch.org/memo44.htm].

Wright, B. D., & Mok, M. (2000). Understanding Rasch measurement: Rasch models overview. Journal of Applied Measurement, 1(1), 83-106.

Model Applications

Adams, R. J., Wu, M. L., & Macaskill, G. (1997). Scaling methodology and procedures for the mathematics and science scales. In M. O. Martin & D. L. Kelly (Eds.), Third International Mathematics and Science Study Technical Report: Vol. 2: Implementation and Analysis – Primary and Middle School Years. Boston: Center for the Study of Testing, Evaluation, and Educational Policy.

Andrich, D., & Van Schoubroeck, L. (1989, May). The General Health Questionnaire: A psychometric analysis using latent trait theory. Psychological Medicine, 19(2), 469-485.

Beltyukova, S. A., Stone, G. E., & Fox, C. M. (2004). Equating student satisfaction measures. Journal of Applied Measurement, 5(1), 62-69.

Bergstrom, B. A., & Lunz, M. E. (1999). CAT for certification and licensure. In F. Drasgow & J. B. Olson-Buchanan (Eds.), Innovations in computerized assessment (pp. 67-91). Mahwah, New Jersey: Lawrence Erlbaum Associates, Inc., Publishers.

Bond, T. G. (1994). Piaget and measurement II: Empirical validation of the Piagetian model. Archives de Psychologie, 63, 155-185.

Bunderson, C. V., & Newby, V. A. (2009). The relationships among design experiments, invariant measurement scales, and domain theories. Journal of Applied Measurement, 10(2), 117-137.

Cavanagh, R. F., & Romanoski, J. T. (2006, October). Rating scale instruments and measurement. Learning Environments Research, 9(3), 273-289.

Cipriani, D., Fox, C., Khuder, S., & Boudreau, N. (2005). Comparing Rasch analyses probability estimates to sensitivity, specificity and likelihood ratios when examining the utility of medical diagnostic tests. Journal of Applied Measurement, 6(2), 180-201.

Dawson, T. L. (2004, April). Assessing intellectual development: Three approaches, one sequence. Journal of Adult Development, 11(2), 71-85.

DeSalvo, K., Fisher, W. P. Jr., Tran, K., Bloser, N., Merrill, W., & Peabody, J. W. (2006, March). Assessing measurement properties of two single-item general health measures. Quality of Life Research, 15(2), 191-201.

Engelhard, G., Jr. (1992). The measurement of writing ability with a many-faceted Rasch model. Applied Measurement in Education, 5(3), 171-191.

Engelhard, G., Jr. (1997). Constructing rater and task banks for performance assessment. Journal of Outcome Measurement, 1(1), 19-33.

Fisher, W. P., Jr. (1998). A research program for accountable and patient-centered health status measures. Journal of Outcome Measurement, 2(3), 222-239.

Fisher, W. P., Jr., Harvey, R. F., Taylor, P., Kilgore, K. M., & Kelly, C. K. (1995, February). Rehabits: A common language of functional assessment. Archives of Physical Medicine and Rehabilitation, 76(2), 113-122.

Heinemann, A. W., Gershon, R., & Fisher, W. P., Jr. (2006). Development and application of the Orthotics and Prosthetics User Survey: Applications and opportunities for health care quality improvement. Journal of Prosthetics and Orthotics, 18(1), 80-85 [http://www.oandp.org/jpo/library/2006_01S_080.asp].

Heinemann, A. W., Linacre, J. M., Wright, B. D., Hamilton, B. B., & Granger, C. V. (1994). Prediction of rehabilitation outcomes with disability measures. Archives of Physical Medicine and Rehabilitation, 75(2), 133-143.

Hobart, J. C., Cano, S. J., O’Connor, R. J., Kinos, S., Heinzlef, O., Roullet, E. P., C., et al. (2003). Multiple Sclerosis Impact Scale-29 (MSIS-29): Measurement stability across eight European countries. Multiple Sclerosis, 9, S23.

Hobart, J. C., Cano, S. J., Zajicek, J. P., & Thompson, A. J. (2007, December). Rating scales as outcome measures for clinical trials in neurology: Problems, solutions, and recommendations. Lancet Neurology, 6, 1094-1105.

Lai, J., Fisher, A., Magalhaes, L., & Bundy, A. C. (1996). Construct validity of the sensory integration and praxis tests. Occupational Therapy Journal of Research, 16(2), 75-97.

Lee, N. P., & Fisher, W. P., Jr. (2005). Evaluation of the Diabetes Self Care Scale. Journal of Applied Measurement, 6(4), 366-381.

Ludlow, L. H., & Haley, S. M. (1995, December). Rasch model logits: Interpretation, use, and transformation. Educational and Psychological Measurement, 55(6), 967-975.

Markward, N. J., & Fisher, W. P., Jr. (2004). Calibrating the genome. Journal of Applied Measurement, 5(2), 129-141.

Massof, R. W. (2007, August). An interval-scaled scoring algorithm for visual function questionnaires. Optometry & Vision Science, 84(8), E690-E705.

Massof, R. W. (2008, July-August). Editorial: Moving toward scientific measurements of quality of life. Ophthalmic Epidemiology, 15, 209-211.

Masters, G. N., Adams, R. J., & Lokan, J. (1994). Mapping student achievement. International Journal of Educational Research, 21(6), 595-610.

Mead, R. J. (2009). The ISR: Intelligent Student Reports. Journal of Applied Measurement, 10(2), 208-224.

Pelton, T., & Bunderson, V. (2003). The recovery of the density scale using a stochastic quasi-realization of additive conjoint measurement. Journal of Applied Measurement, 4(3), 269-281.

Smith, E. V., Jr. (2000). Metric development and score reporting in Rasch measurement. Journal of Applied Measurement, 1(3), 303-326.

Smith, R. M., & Taylor, P. (2004). Equating rehabilitation outcome scales: Developing common metrics. Journal of Applied Measurement, 5(3), 229-242.

Solloway, S., & Fisher, W. P., Jr. (2007). Mindfulness in measurement: Reconsidering the measurable in mindfulness. International Journal of Transpersonal Studies, 26, 58-81 [http://www.transpersonalstudies.org/volume_26_2007.html].

Stenner, A. J. (2001). The Lexile Framework: A common metric for matching readers and texts. California School Library Journal, 25(1), 41-42.

Wolfe, E. W., Ray, L. M., & Harris, D. C. (2004, October). A Rasch analysis of three measures of teacher perception generated from the School and Staffing Survey. Educational and Psychological Measurement, 64(5), 842-860.

Wolfe, F., Hawley, D., Goldenberg, D., Russell, I., Buskila, D., & Neumann, L. (2000, Aug). The assessment of functional impairment in fibromyalgia (FM): Rasch analyses of 5 functional scales and the development of the FM Health Assessment Questionnaire. Journal of Rheumatology, 27(8), 1989-1999.

Wendt, A., & Tatum, D. S. (2005). Credentialing health care professionals. In N. Bezruczko (Ed.), Rasch measurement in health sciences (pp. 161-175). Maple Grove, MN: JAM Press.

Creative Commons License
LivingCapitalMetrics Blog by William P. Fisher, Jr., Ph.D. is licensed under a Creative Commons Attribution-Noncommercial-No Derivative Works 3.0 United States License.
Based on a work at livingcapitalmetrics.wordpress.com.
Permissions beyond the scope of this license may be available at http://www.livingcapitalmetrics.com.

Real-life scenarios illustrating the value of better measurement

June 4, 2009

I’ve seen consultants work hospital employees through a game in which the object is to manage the care of various kinds of patients who enter the system at different points. Patients might have the same conditions, prognosis, payor, and demographics but come in through the ED, a clinic, or emerge from the OR. Others will vary medically but enter at the same point. Real-world odds are used to simulate decisions and events as the game proceeds via random card draws. The variation in decisions and outcomes across groups of decision-makers/players is fascinating.

It just occurred to me to set up a game like this with two major scenarios that differ on a single variable: the quality of measurement. One half, or inning, of the game is the status quo, in which ratings and percentages are constructed within the actual constraints of real data to illustrate the dangers of relying on numbers that are not measures. (Contact me for examples of how percentages can and sometimes do mean exactly the opposite of what they appear to mean.)

In this part of the game, we see the normal, par-for-the-course inefficiencies, errors, outcomes, and costs that everyone expects.
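As one hedged illustration of how a percentage can point the wrong way, here is a minimal Python sketch. It assumes a simple dichotomous Rasch model, and everything in it is hypothetical: the two forms, their item calibrations, and the two units' measures are numbers invented for the example, not data from any real study.

```python
import math


def rasch_probability(measure, difficulty):
    """Rasch model: chance of success given a measure and an item difficulty (both in logits)."""
    return 1.0 / (1.0 + math.exp(-(measure - difficulty)))


def expected_percent(measure, difficulties):
    """Expected percent of items passed on a form with the given item difficulties."""
    total = sum(rasch_probability(measure, d) for d in difficulties)
    return 100.0 * total / len(difficulties)


# Hypothetical item calibrations (logits) on one shared scale.
easy_form = [-2.0, -1.5, -1.0, -0.5, 0.0]
hard_form = [0.0, 0.5, 1.0, 1.5, 2.0]

# Hypothetical measures: unit B really is higher on the construct than unit A.
unit_a_measure = 0.0  # assessed with the easy form
unit_b_measure = 1.0  # assessed with the hard form

pct_a = expected_percent(unit_a_measure, easy_form)
pct_b = expected_percent(unit_b_measure, hard_form)

print(f"Unit A: measure {unit_a_measure:+.1f} logits, expected {pct_a:.1f}% on the easy form")
print(f"Unit B: measure {unit_b_measure:+.1f} logits, expected {pct_b:.1f}% on the hard form")
```

Under these assumed values, the unit with the lower measure posts the higher percentage, simply because it was assessed with easier items. Players in the status quo half of the game would see only the percentages.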

In the second half of the game, we set up the same kind of scenario, but this time decisions are informed by meaningfully calibrated and contextualized measures. Everyone in the system has the same frame of reference, and decisions are coordinated virtually by the way the information is harmonized.

I imagine the two parts of the game might be played simultaneously by two equally experienced groups of managers and clinicians. Each group might be given a systems perspective, and would be encouraged to innovate with comparative effectiveness studies. When they have both arrived at their outcomes, tracked on a scorecard, they are debriefed together, the results are compared, and they are informed about the inner workings of the data they worked from.

Part of the point here would be to show that evidence-based decision-making is only worth as much as the evidence in hand. Evidence that is not constructed on the basis of a scientific theory and that is not mediated by calibrated instrumentation is worth much less than evidence that is theoretically justified and read off calibrated instruments.

It might be useful to imagine a seminar or workshop in which these scenarios are explored as illustrations of the way fully formed metrological systems reduce transaction costs and market frictions, greasing the wheels of health care commerce. Maybe the contrast could also be brought out in terms of a survey or multiple-choice test.

Variations on the scenarios could be constructed for education or human resource contexts, as well.

Just wanted to put this down in writing. What do you think?