Mass Customization: Tailoring Tests, Surveys, and Assessments to Individuals without Sacrificing Comparability

One of the recurring themes in this blog concerns the technical capacities for more precise and meaningful measurement that remain unrecognized and under-utilized in business, finance, and economics. One especially productive capacity I have in mind relates to the techniques of adaptive measurement. These techniques make it possible to tailor measuring tools to the needs of the people measured, the diametric opposite of standard practice, which typically assumes that people must adapt themselves to the needs of the measuring instrument.

Think about what it means to try to measure every case using the same statements. When you define the limits of your instrument in terms of common content, you are looking for a one-size-fits-all solution. This design requires that you restrict the content of the statements to those that will be relevant in every case. The reason for proceeding in this way hinges on the assumption that all of the items must be administered to every case in order to make the measures comparable, but this is not true. To conceive measurement in this way is to be shackled to an obsolete technology. Instead of operating within the constraints of an overly limiting set of assumptions, you could be designing a system that takes missing data into account and that supports adaptive item administration, so that the instrument is tailored to the needs of the measurement situation. The benefits of taking this approach are extensive.

Think of the statements comprising the instrument as defining a hierarchy or continuum that extends from the most important, most agreeable, or easiest-to-achieve things at the bottom to the least important, least agreeable, or hardest-to-achieve things at the top. Imagine that your data are consistent, so that the probability of importance, agreeability, or success steadily decreases for any individual case as you read up the scale.
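
To make that picture concrete, the dichotomous Rasch model that figures in many of the articles referenced below expresses this ordering directly: the probability of importance, agreement, or success depends only on the difference between the person's measure and the item's calibration. Here is a minimal sketch in Python, with hypothetical calibration values chosen purely for illustration.

    # Dichotomous Rasch model: P(success) = exp(b - d) / (1 + exp(b - d)),
    # where b is the person measure and d is the item calibration, both in logits.
    import math

    def rasch_probability(person_measure, item_calibration):
        """Probability of importance/agreement/success for one case on one item."""
        return 1.0 / (1.0 + math.exp(item_calibration - person_measure))

    # Hypothetical items ordered from easiest (bottom of the scale) to hardest (top).
    item_calibrations = [-3.0, -1.5, 0.0, 1.5, 3.0]
    person_measure = 0.0  # a case located mid-scale

    for d in item_calibrations:
        print(f"item at {d:+.1f} logits: P = {rasch_probability(person_measure, d):.2f}")
    # Reading up the scale, P falls steadily: 0.95, 0.82, 0.50, 0.18, 0.05.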

Obtaining data consistency like this is not always easy, but it is essential to measurement and to calibrating a scientific instrument. Even when data do not provide the needed consistency, much can be learned from them as to what needs to be done to get it.

Now hold that thought: you have a matrix of complete data, with responses to every item for every case. In the typically assumed design scenario, in which all items are administered to every case, no matter how low a measure is, you still think you need to administer the items calibrated at the top of the scale, even though long experience and repeated recalibrations across multiple samples tell us that the response probabilities of importance, agreement, or success for these items are virtually 0.00.

Conversely, no matter how high a measure is, the usual design demands that all items be administered, even if we know from experience that the response probabilities for the items at the bottom of the scale are virtually 1.00.
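
A quick calculation shows just how extreme those probabilities are. Assuming, purely for illustration, one case measuring five logits below the hardest items and another measuring five logits above the easiest ones, the same Rasch expression gives:

    import math

    p = lambda measure, calibration: 1.0 / (1.0 + math.exp(calibration - measure))

    print(round(p(-2.0, 3.0), 3))  # low measure, hard item: 0.007, "virtually 0.00"
    print(round(p(2.0, -3.0), 3))  # high measure, easy item: 0.993, "virtually 1.00"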

In this scenario, we are wasting time and resources obtaining data on items for which we already know the answers. Furthermore, we are not asking other questions that would be particularly relevant to individual cases, because including them in a complete-data, one-size-fits-all design would make the instrument too long. So we are stuck with a situation in which perhaps only a tenth of the overall instrument is actually being used for cases with measures toward the extremes.

One consequence is that we have much less information about the very low and very high measures, and so much less confidence about where those measures lie than we have for more centrally located ones.
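
In Rasch terms, each item contributes statistical information equal to p(1 - p) at a given measure, and the standard error of the measure is roughly one over the square root of the total information, so items answered with near-certainty add almost nothing. The following minimal sketch, assuming a hypothetical 20-item fixed form, illustrates the loss of precision at the extremes.

    import math

    def rasch_p(measure, calibration):
        return 1.0 / (1.0 + math.exp(calibration - measure))

    def standard_error(measure, item_calibrations):
        # Each item contributes information p * (1 - p); SE = 1 / sqrt(total information).
        info = sum(rasch_p(measure, d) * (1.0 - rasch_p(measure, d)) for d in item_calibrations)
        return 1.0 / math.sqrt(info)

    # Hypothetical fixed form: 20 items spread evenly from -2.5 to +2.5 logits.
    fixed_form = [-2.5 + 5.0 * i / 19 for i in range(20)]

    for measure in (0.0, 4.0):  # a centrally located case and an extreme case
        print(f"measure {measure:+.1f}: SE = {standard_error(measure, fixed_form):.2f} logits")
    # The central case is located far more precisely than the extreme case.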

If measurement projects are oriented toward the development of an item bank, however, these problems can be overcome. You might develop and calibrate dozens, hundreds, or thousands of items. The bank might be administered in such a way that the same sets of items are only rarely applied to different cases. To the extent that the basic research on the bank shows that the items all measure the same thing, so that different item subsets all give the same result in resolving the location of the measure on the quantitative continuum, comparability is not compromised.
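
That comparability claim can be checked directly. The following simulation sketch, in which the 60-item bank, the case's location, and the two half-banks are all hypothetical, estimates the same simulated case's measure twice, once from each of two non-overlapping item subsets, so that the resulting measures and their standard errors can be compared directly.

    import math
    import random

    random.seed(1)

    def rasch_p(measure, calibration):
        return 1.0 / (1.0 + math.exp(calibration - measure))

    def ml_measure(responses):
        """Maximum-likelihood person measure from (calibration, 0/1 response) pairs."""
        measure = 0.0
        for _ in range(50):  # Newton-Raphson iterations on the Rasch likelihood
            expected = [rasch_p(measure, d) for d, _ in responses]
            gradient = sum(x - p for (_, x), p in zip(responses, expected))
            info = sum(p * (1.0 - p) for p in expected)
            measure += max(-1.0, min(1.0, gradient / info))  # damped step for stability
        return measure, 1.0 / math.sqrt(info)

    # Hypothetical bank of 60 calibrated items and one simulated case at +0.8 logits.
    bank = [-3.0 + 6.0 * i / 59 for i in range(60)]
    true_measure = 0.8
    data = [(d, 1 if random.random() < rasch_p(true_measure, d) else 0) for d in bank]

    # Two disjoint half-banks: every other item, offset by one.
    for label, subset in (("subset A", data[0::2]), ("subset B", data[1::2])):
        measure, se = ml_measure(subset)
        print(f"{label}: measure = {measure:+.2f} (SE {se:.2f})")

Splitting the bank into alternating items keeps both subsets spread along the whole continuum; in practice, of course, adaptive administration would select different, better-targeted subsets for each case.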

The big plus is that all cases can now be measured with the same degree of meaningfulness, precision, and confidence. We can administer the same number of items to every case, and even the same number as in the one-size-fits-all design, but now the items are targeted at each individual, providing maximum information. The quantitative properties, however, are only half the story. Real measurement integrates qualitative meaningfulness with quantitative precision.

As illustrated in the description of the typically assumed one-size-fits-all scenario, we interpret the measures in terms of the item calibrations. In the one-size-fits-all design, very low and very high measures can be associated with consistent variation on only a few items, since there is no variation on most of the items, which are too easy or too hard for the case. And it might happen that even cases in the middle of the scale have response probabilities of 1.00 and 0.00 for the items at the very bottom and very top of the scale, respectively, further impairing the efficiency of the measurement process.

In the adaptive scenario, though, items are selected from the item bank via an algorithm that uses the expected response probabilities to target the respondent. Success on an easy item causes the algorithm to pick a harder item, and vice versa. In this way, the instrument is tailored for the individual case. This kind of mass customization can also be qualitatively based. Items that are irrelevant to the particular characteristics of an individual case can be excluded from consideration.
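
A minimal sketch of this kind of algorithm follows, assuming a hypothetical pre-calibrated bank and deliberately simplified starting and stopping rules (see Riley et al., 2007, and Linacre, 2006, below on how those choices are actually made). After each response, the provisional measure is re-estimated and the next item administered is the unused one calibrated closest to it, which is the item offering the most information.

    import math
    import random

    def rasch_p(measure, calibration):
        return 1.0 / (1.0 + math.exp(calibration - measure))

    def run_cat(bank, respond, start=0.0, max_items=20, se_stop=0.45):
        """Administer items adaptively until the standard error falls below se_stop."""
        unused = sorted(bank)
        administered = []  # (calibration, 0/1 response) pairs
        measure = start
        while unused and len(administered) < max_items:
            item = min(unused, key=lambda d: abs(d - measure))  # best-targeted item
            unused.remove(item)
            administered.append((item, respond(item)))
            # Re-estimate the provisional measure (Newton-Raphson on the Rasch
            # likelihood, with extreme raw scores nudged off the boundary).
            score = sum(x for _, x in administered)
            score = min(max(score, 0.3), len(administered) - 0.3)
            for _ in range(20):
                expected = [rasch_p(measure, d) for d, _ in administered]
                info = sum(p * (1.0 - p) for p in expected)
                step = (score - sum(expected)) / info
                measure += max(-1.0, min(1.0, step))  # damped step for stability
            se = 1.0 / math.sqrt(info)
            if se < se_stop:  # stop once the measure is precise enough
                break
        return measure, se, len(administered)

    # Usage: a simulated respondent located at +1.2 logits answering probabilistically.
    random.seed(7)
    bank = [-3.0 + 6.0 * i / 99 for i in range(100)]  # hypothetical 100-item bank
    respond = lambda d: 1 if random.random() < rasch_p(1.2, d) else 0
    print(run_cat(bank, respond))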

And adaptive designs do not necessarily have to be computerized, since respondents, examinees, and judges can be instructed to complete a given number of contiguous items in a sequence ordered by calibration values. This produces a kind of self-targeting that reduces the overall number of items administered without the need for expensive investments in programming or hardware.

The literature on adaptive instrument administration is over 40 years old, and is quite technical and extensive. I’ve provided a sample of articles below, including some that offer programming guidelines.

The concepts of item banking and adaptive administration are, of course, the technical mechanisms on which metrological networks of instruments linked to reference standards will be built. See previously posted blog entries here for more on metrology and traceability.

References

Association of Test Publishers. (2001, Fall). Benjamin D. Wright, Ph.D. honored with the Career Achievement Award in Computer-Based Testing. Test Publisher, 8(2). Retrieved 20 May 2009, from http://www.testpublishers.org/newsletter7.htm#Wright.

Bergstrom, B. A., Lunz, M. E., & Gershon, R. C. (1992). Altering the level of difficulty in computer adaptive testing. Applied Measurement in Education, 5(2), 137-149.

Choppin, B. (1968). An item bank using sample-free calibration. Nature, 219, 870-872.

Choppin, B. (1976). Recent developments in item banking. In D. N. M. DeGruitjer & L. J. van der Kamp (Eds.), Advances in Psychological and Educational Measurement (pp. 233-245). New York: Wiley.

Cook, K., O’Malley, K. J., & Roddey, T. S. (2005, October). Dynamic assessment of health outcomes: Time to let the CAT out of the bag? Health Services Research, 40(Suppl 1), 1694-1711.

Dijkers, M. P. (2003). A computer adaptive testing simulation applied to the FIM instrument motor component. Archives of Physical Medicine & Rehabilitation, 84(3), 384-93.

Halkitis, P. N. (1993). Computer adaptive testing algorithm. Rasch Measurement Transactions, 6(4), 254-255.

Linacre, J. M. (1999). Individualized testing in the classroom. In G. N. Masters & J. P. Keeves (Eds.), Advances in measurement in educational research and assessment (pp. 186-94). New York: Pergamon.

Linacre, J. M. (2000). Computer-adaptive testing: A methodology whose time has come. In S. Chae, U. Kang, E. Jeon & J. M. Linacre (Eds.), Development of Computerized Middle School Achievement Tests [in Korean] (MESA Research Memorandum No. 69). Seoul, South Korea: Komesa Press. Available in English at http://www.rasch.org/memo69.htm.

Linacre, J. M. (2006). Computer adaptive tests (CAT), standard errors, and stopping rules. Rasch Measurement Transactions, 20(2), 1062 [http://www.rasch.org/rmt/rmt202f.htm].

Lunz, M. E., & Bergstrom, B. A. (1991). Comparability of decisions for computer adaptive and written examinations. Journal of Allied Health, 20(1), 15-23.

Lunz, M. E., & Bergstrom, B. A. (1994). An empirical study of computerized adaptive test administration conditions. Journal of Educational Measurement, 31(3), 251-263.

Lunz, M. E., & Bergstrom, B. A. (1995). Computerized adaptive testing: Tracking candidate response patterns. Journal of Educational Computing Research, 13(2), 151-162.

Lunz, M. E., Bergstrom, B. A., & Gershon, R. C. (1994). Computer adaptive testing. In W. P. Fisher, Jr. & B. D. Wright (Eds.), Special Issue: International Journal of Educational Research, 21(6), 623-634.

Lunz, M. E., Bergstrom, B. A., & Wright, B. D. (1992, Mar). The effect of review on student ability and test efficiency for computerized adaptive tests. Applied Psychological Measurement, 16(1), 33-40.

McHorney, C. A. (1997, Oct 15). Generic health measurement: Past accomplishments and a measurement paradigm for the 21st century. Annals of Internal Medicine, 127(8 Pt 2), 743-50.

Meijer, R. R., & Nering, M. L. (1999, Sep). Computerized adaptive testing: Overview and introduction. Applied Psychological Measurement, 23(3), 187-194.

Raîche, G., & Blais, J.-G. (2009). Considerations about expected a posteriori estimation in adaptive testing. Journal of Applied Measurement, 10(2), 138-156.

Raîche, G., Blais, J.-G., & Riopel, M. (2006, Autumn). A SAS solution to simulate a Rasch computerized adaptive test. Rasch Measurement Transactions, 20(2), 1061.

Reckase, M. D. (1989). Adaptive testing: The evolution of a good idea. Educational Measurement: Issues and Practice, 8, 3.

Revicki, D. A., & Cella, D. F. (1997, Aug). Health status assessment for the twenty-first century: Item response theory item banking and computer adaptive testing. Quality of Life Research, 6(6), 595-600.

Riley, B. B., Conrad, K., Bezruczko, N., & Dennis, M. L. (2007). Relative precision, efficiency, and construct validity of different starting and stopping rules for a computerized adaptive test: The GAIN substance problem scale. Journal of Applied Measurement, 8(1), 48-64.

van der Linden, W. J. (1999). Computerized educational testing. In G. N. Masters & J. P. Keeves (Eds.), Advances in measurement in educational research and assessment (pp. 138-50). New York: Pergamon.

Velozo, C. A., Wang, Y., Lehman, L., & Wang, J.-H. (2008). Utilizing Rasch measurement models to develop a computer adaptive self-report of walking, climbing, and running. Disability & Rehabilitation, 30(6), 458-67.

Vispoel, W. P., Rocklin, T. R., & Wang, T. (1994). Individual differences and test administration procedures: A comparison of fixed-item, computerized-adaptive, self-adapted testing. Applied Measurement in Education, 7(1), 53-79.

Wang, T., Hanson, B. A., & Lau, C. M. A. (1999, Sep). Reducing bias in CAT trait estimation: A comparison of approaches. Applied Psychological Measurement, 23(3), 263-278.

Ware, J. E., Bjorner, J., & Kosinski, M. (2000). Practical implications of item response theory and computerized adaptive testing: A brief summary of ongoing studies of widely used headache impact scales. Medical Care, 38(9 Suppl), II73-82.

Weiss, D. J. (1983). New horizons in testing: Latent trait test theory and computerized adaptive testing. New York: Academic Press.

Weiss, D. J., & Kingsbury, G. G. (1984). Application of computerized adaptive testing to educational problems. Journal of Educational Measurement, 21(4), 361-375.

Weiss, D. J., & Schleisman, J. L. (1999). Adaptive testing. In G. N. Masters & J. P. Keeves (Eds.), Advances in measurement in educational research and assessment (pp. 129-37). New York: Pergamon.

Wouters, H., Zwinderman, A. H., van Gool, W. A., Schmand, B., & Lindeboom, R. (2009). Adaptive cognitive testing in dementia. International Journal of Methods in Psychiatric Research, 18(2), 118-127.

Wright, B. D., & Bell, S. R. (1984, Winter). Item banks: What, why, how. Journal of Educational Measurement, 21(4), 331-345 [http://www.rasch.org/memo43.htm].

Wright, B. D., & Douglas, G. A. (1975). Best test design and self-tailored testing (Research Memorandum No. 19). Chicago, Illinois: MESA Laboratory, Department of Education, University of Chicago [http://www.rasch.org/memo19.pdf].

Creative Commons License
LivingCapitalMetrics Blog by William P. Fisher, Jr., Ph.D. is licensed under a Creative Commons Attribution-Noncommercial-No Derivative Works 3.0 United States License.
Based on a work at livingcapitalmetrics.wordpress.com.
Permissions beyond the scope of this license may be available at http://www.livingcapitalmetrics.com.
