Reliability Revisited: Distinguishing Consistency from Error

When something is meaningful to us, and we understand it, then we can successfully restate it in our own words and predictably reproduce approximately the same representation across situations as was obtained in the original formulation. When data fit a Rasch model, the implications are (1) that different subsets of items (that is, different ways of composing a series of observations summarized in a sufficient statistic) will all converge on the same pattern of person measures, and (2) that different samples of respondents or examinees will all converge on the same pattern of item calibrations. The meaningfulness of propositions based on these patterns will then not depend on which collection of items (instrument) or sample of persons is obtained, and all instruments might be equated relative to a single universal, uniform metric so that the same symbols reliably represent the same amount of the same thing.

Statistics and research methods textbooks in psychology and the social sciences commonly make statements like the following about reliability: “Reliability is consistency in measurement. The reliability of individual scale items increases with the number of points in the item. The reliability of the complete scale increases with the number of items.” (These sentences are found at the top of p. 371 in Experimental Methods in Psychology, by Gustav Levine and Stanley Parkinson (Lawrence Erlbaum Associates, 1994).) The unproven, perhaps unintended, and likely unfounded implication of these statements is that consistency increases as items are added.

Despite the popularity of doing so, Green, Lissitz, and Mulaik (1977) argue that reliability coefficients are misused when they are interpreted as indicating the extent to which data are internally consistent. “Green et al. (1977) observed that though high ‘internal consistency’ as indexed by a high alpha results when a general factor runs through the items, this does not rule out obtaining high alpha when there is no general factor running through the test items…. They concluded that the chief defect of alpha as an index of dimensionality is its tendency to increase as the number of items increase” (Hattie, 1985, p. 144).
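The dependence of alpha on test length can be made concrete with the standardized form of the coefficient, which is a function of only the number of items and the mean inter-item correlation. The sketch below assumes that standardized form rather than the raw-score alpha reported later in Table 1. The function name and the example correlation of .10 are illustrative choices, not values taken from Green et al. or Hattie.

```python
def standardized_alpha(k, mean_r):
    """Standardized coefficient alpha for k items with mean inter-item correlation mean_r."""
    return (k * mean_r) / (1 + (k - 1) * mean_r)

# Hypothetical case: a weak average inter-item correlation of .10 held constant
# while the number of items grows.
for k in (5, 10, 23, 50, 100):
    print(k, round(standardized_alpha(k, 0.10), 3))
```

With the mean correlation held at .10, the coefficient climbs from roughly .36 at 5 items to roughly .92 at 100 items, even though nothing about the consistency of the individual items has changed.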

In addressing the internal consistency of data, the implicit but incompletely realized purpose of estimating scale reliability is to evaluate the extent to which sum scores function as sufficient statistics. How limited is reliability as a tool for this purpose? To answer this question, five dichotomous data sets of 23 items and 22 persons were simulated. The first one was constructed so as to be highly likely to fit a Rasch model, with a deliberately orchestrated probabilistic Guttman pattern. The second one was made nearly completely random. The third, fourth, and fifth data sets were modifications of the first one in which increasing numbers of increasingly inconsistent responses were introduced. (The inconsistencies were not introduced in any systematic way apart from inserting contrary responses in the ordered matrix.) The data sets are shown in the Appendix. Tables 1 and 2 summarize the results.
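One way such data sets could be generated is sketched below. This is not the procedure used to build the five data sets reported here, which were constructed directly as ordered matrices. It is a minimal illustration that assumes a dichotomous Rasch model, evenly spaced measures and calibrations between -4 and +4 logits, and an arbitrary count of 40 reversed responses. All function and variable names are hypothetical.

```python
import math
import random

def simulate_rasch(person_measures, item_calibrations, rng):
    """Return a persons x items matrix of 0/1 responses drawn from the dichotomous Rasch model."""
    data = []
    for b in person_measures:
        row = []
        for d in item_calibrations:
            p = 1.0 / (1.0 + math.exp(-(b - d)))  # probability of a 1 given b and d
            row.append(1 if rng.random() < p else 0)
        data.append(row)
    return data

def evenly_spaced(n, low, high):
    """n values spread evenly from low to high (assumed logit locations)."""
    step = (high - low) / (n - 1)
    return [low + i * step for i in range(n)]

def add_contrary(data, n_flips, rng):
    """Copy data and reverse n_flips distinct responses chosen at random."""
    copy = [row[:] for row in data]
    cells = [(i, j) for i in range(len(copy)) for j in range(len(copy[0]))]
    for i, j in rng.sample(cells, n_flips):
        copy[i][j] = 1 - copy[i][j]
    return copy

rng = random.Random(2024)  # fixed seed so the sketch is repeatable
persons = evenly_spaced(22, -4.0, 4.0)
items = evenly_spaced(23, -4.0, 4.0)

ordered = simulate_rasch(persons, items, rng)  # probabilistic Guttman pattern
coin_toss = [[rng.randint(0, 1) for _ in items] for _ in persons]  # nearly random
noisy = add_contrary(ordered, 40, rng)  # ordered matrix with contrary responses inserted
```

Raising the number of reversed responses passed to add_contrary gives a graded series of increasingly inconsistent matrices, in the spirit of the third through fifth data sets.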

Table 1 shows that the reliability coefficients do in fact decrease, and that global model fit worsens (the log-likelihood chi-squares rise), as the amount of randomness and inconsistency is increased. Contrary to what is implied in Levine and Parkinson’s statements, however, reliability can vary within a given number of items, as it might across different data sets produced from the same test, survey, or assessment, depending on how much structural invariance is present within them.

Two other points about the tables are worthy of note. First, the Rasch-based person separation reliability coefficients drop at a faster rate than Cronbach’s alpha does. This is probably an effect of the individualized error estimates in the Rasch context, which make its reliability coefficients more conservative than those based on correlational, group-level error estimates. (It is worth noting, as well, that the Winsteps and SPSS estimates of Cronbach’s alpha match. They are reported to one fewer decimal place by Winsteps, but the third decimal place is shown for the SPSS values for contrast.)
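For reference, the group-level coefficient being compared here can be computed directly from a persons-by-items score matrix. The following is a minimal sketch of the usual variance-ratio form of Cronbach's alpha (equivalent to KR-20 for dichotomous items). The function name and the tiny example matrix are illustrative and are not drawn from the five simulated data sets.

```python
from statistics import pvariance

def cronbach_alpha(rows):
    """Cronbach's alpha for a persons x items matrix given as a list of per-person score lists."""
    k = len(rows[0])  # number of items
    item_variances = [pvariance([row[i] for row in rows]) for i in range(k)]
    total_variance = pvariance([sum(row) for row in rows])
    return (k / (k - 1)) * (1 - sum(item_variances) / total_variance)

# Hypothetical example: a perfect Guttman pattern for 4 persons and 3 items.
guttman = [
    [0, 0, 0],
    [1, 0, 0],
    [1, 1, 0],
    [1, 1, 1],
]
print(cronbach_alpha(guttman))
```

Note that only summed item variances and the total score variance enter the calculation. No person-level error estimate appears anywhere, which is the contrast with the Rasch separation reliability drawn above.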

Second, the fit statistics are most affected by the initial and most glaring introduction of inconsistencies, in data set three. As the randomness in the data increases, the reliabilities continue to drop, but the fit statistics improve, culminating in the case of data set two, where complete randomness results in near-perfect model fit. This is, of course, the situation in which both the instrument and the sample are as well targeted as they can be, since all respondents have about the same measure and all the items about the same calibration; see Wood (1978) for a commentary on this situation, where coin tosses fit a Rasch model.
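The behavior of the mean-square fit statistics in the coin-toss case can be checked with a short calculation. The sketch below assumes the standard definitions for dichotomous data: outfit is the average squared standardized residual, and infit is the sum of squared residuals divided by the sum of the model variances. The function name and the example values are illustrative.

```python
import math

def person_fit(responses, measure, calibrations):
    """Return (infit, outfit) mean squares for one person's 0/1 responses."""
    squared_residuals = []
    variances = []
    for x, d in zip(responses, calibrations):
        p = 1.0 / (1.0 + math.exp(-(measure - d)))  # Rasch expected score
        squared_residuals.append((x - p) ** 2)
        variances.append(p * (1.0 - p))  # model variance of the response
    outfit = sum(r / w for r, w in zip(squared_residuals, variances)) / len(variances)
    infit = sum(squared_residuals) / sum(variances)
    return infit, outfit

# Coin-toss case: person and items all at the same location.
print(person_fit([1, 0, 1, 1, 0], 0.0, [0.0] * 5))

# Contrary responses: failing an easy item and passing a hard one.
print(person_fit([0, 1, 1], 0.0, [-3.0, 0.0, 3.0]))
```

When the measure and all calibrations coincide, every response has probability .5, so both mean squares equal 1.0 whatever the responses are. In the second case the unexpected responses to off-target items push outfit well above infit, the same direction of difference seen in the outfit columns of Table 1 for data sets three and four.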

Table 2 shows the results of the Winsteps Principal Components Analysis of the standardized residuals for all five data sets. Again, the results conform with and support the pattern shown in the reliability coefficients. It is, however, interesting to note that, for data sets 4 and 5, with their Cronbach’s alphas of about .89 and .80, respectively, which are typically deemed quite good, the PCA shows more variance left unexplained than is explained by the Rasch dimension. The PCA is suggesting that two or more constructs might be represented in the data, but this would never be known from Cronbach’s alpha alone.

Alpha alone would indicate the presence of a unidimensional construct for data sets 3, 4 and 5, despite large standard deviations in the fit statistics and even though, for data sets 4 and 5, more than half the variance cannot be explained by the primary dimension. Worse, for the fifth data set, more variance is captured in the first three contrasts than is explained by the Rasch dimension. But with Cronbach’s alpha at .80, most researchers would consider this scale quite satisfactorily unidimensional and internally consistent.

These results suggest that, first, in seeking high reliability, what is sought more fundamentally is fit to a Rasch model (Andrich & Douglas, 1977; Andrich, 1982; Wright, 1977). That is, in addressing the internal consistency of data, the popular conception of reliability is taking on the concerns of construct validity. A conceptually clearer sense of reliability focuses on the extent to which an instrument works as expected every time it is used, in the sense of the way a car can be reliable. For instance, with an alpha of .70, a screening tool would be able to reliably distinguish measures into two statistically distinct groups (Fisher, 1992; Wright, 1996), problematic and typical. Within the limits of this purpose, the tool would meet the need for the repeated production of information capable of meeting the needs of the situation. Applications in research, accountability, licensure/certification, or diagnosis, however, might demand alphas of .95 and the kind of precision that allows for statistically distinct divisions into six or more groups. In these kinds of applications, where experimental designs or practical demands require more statistical power, measurement precision articulates finer degrees of differences. Finely calibrated instruments provide sensitivity over the entire length of the measurement continuum, which is needed for repeated reproductions of the small amounts of change that might accrue from hard to detect treatment effects.
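The link between a reliability coefficient and the number of statistically distinct groups can be worked out numerically. The sketch below assumes the conventional relations from the Rasch literature: a separation index G equal to the square root of R/(1 - R), and a strata count H equal to (4G + 1)/3. The function names are illustrative.

```python
import math

def separation(reliability):
    """Separation index G: ratio of true spread to average error, derived from reliability."""
    return math.sqrt(reliability / (1.0 - reliability))

def strata(reliability):
    """Number of statistically distinct levels, H = (4G + 1) / 3."""
    return (4.0 * separation(reliability) + 1.0) / 3.0

for r in (0.70, 0.80, 0.90, 0.95):
    print(r, round(separation(r), 2), round(strata(r), 2))
```

Under these assumptions a reliability of .70 gives a little over two strata and a reliability of .95 gives a little over six, in line with the two-group and six-group figures mentioned above.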

Separating the construct, internal consistency, and unidimensionality issues from the repeatability and reproducibility of a given degree of measurement precision provides a much-needed conceptual and methodological clarification of reliability. This clarification is routinely made in Rasch measurement applications (Andrich, 1982; Andrich & Douglas, 1977; Fisher, 1992; Linacre, 1993, 1996, 1997). It is reasonable to want to account for inconsistencies in the data in the error estimates and in the reliability coefficients, and so errors and reliabilities are routinely reported both in terms of the modeled expectations and in a fit-inflated form (Wright, 1995). The fundamental value of proceeding from a basis in individual error and fit statistics (Wright, 1996) is that local imprecisions and failures of invariance can be isolated for further study and selective attention.
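The difference between modeled and fit-inflated reliabilities can be sketched as follows. The code assumes that separation reliability is the share of observed measure variance not attributable to error, and that a fit-inflated ("real") standard error is the model standard error multiplied by the square root of the infit mean square whenever that mean square exceeds 1. Both the inflation rule and the use of population variance are assumptions made for illustration, and the measures, errors, and fit values are made up.

```python
from statistics import pvariance

def separation_reliability(measures, std_errors):
    """(observed variance - mean error variance) / observed variance, floored at zero."""
    observed = pvariance(measures)
    error = sum(se ** 2 for se in std_errors) / len(std_errors)
    return max(0.0, (observed - error) / observed)

def real_errors(model_errors, infit_mean_squares):
    """Inflate each model standard error by sqrt(max(1, infit mean square)). Assumed rule."""
    return [se * max(1.0, fit) ** 0.5 for se, fit in zip(model_errors, infit_mean_squares)]

# Hypothetical measures (logits), model standard errors, and infit mean squares.
measures = [-2.0, -1.0, 0.0, 1.0, 2.0]
model_se = [0.5, 0.5, 0.5, 0.5, 0.5]
infit = [1.0, 1.0, 2.0, 0.8, 1.0]

model_reliability = separation_reliability(measures, model_se)
real_reliability = separation_reliability(measures, real_errors(model_se, infit))
print(model_reliability, real_reliability)
```

Because each person carries an individual error and fit value, the one misfitting person in this example is identifiable as the source of the drop from the modeled to the fit-inflated coefficient.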

The results of the simulated data analyses suggest, second, that, used in isolation, reliability coefficients can be misleading. As Green et al. say, reliability estimates tend to systematically increase as the number of items increases (Fisher, 2008). The simulated data show that reliability coefficients also systematically decrease as inconsistency increases.

The primary problem with relying on reliability coefficients alone as indications of data consistency hinges on their inability to reveal the location of departures from modeled expectations. Most uses of reliability coefficients take place in contexts in which the model remains unstated and expectations are not formulated or compared with observations. The best that can be done in the absence of a model statement and test of data fit to it is to compare the reliability obtained against that expected on the basis of the number of items and response categories, relative to the observed standard deviation in the scores, expressed in logits (Linacre, 1993). One might then raise questions as to targeting, data consistency, etc. in order to explain larger than expected differences.

A more methodical way, however, would be to employ multiple avenues of approach to the evaluation of the data, including the use of model fit statistics and Principal Components Analysis in the evaluation of differential item and person functioning. Being able to see which individual observations depart the furthest from modeled expectation can provide illuminating qualitative information on the meaningfulness of the data, the measures, and the calibrations, or the lack thereof. This information is crucial to correcting data entry errors, identifying sources of differential item or person functioning, separating constructs and populations, and improving the instrument. The power of the reliability-coefficient-only approach to data quality evaluation is multiplied many times over when the researcher sets up a nested series of iterative dialectics in which repeated data analyses explore various hypotheses as to what the construct is, and in which these analyses feed into revisions to the instrument, its administration, and/or the population sampled.

For instance, following the point made by Smith (1996), it may be expected that the PCA results will illuminate the presence of multiple constructs in the data with greater clarity than the fit statistics, when there are nearly equal numbers of items representing each different measured dimension. But the PCA does not work as well as the fit statistics when there are only a few items and/or people exhibiting inconsistencies.

This work should result in a full circle return to the drawing board (Wright, 1994; Wright & Stone, 2003), such that a theory of the measured construct ultimately provides rigorously precise predictive control over item calibrations, in the manner of the Lexile Framework (Stenner et al., 2006) or developmental theories of hierarchical complexity (Dawson, 2004). Given that the five data sets employed here were simulations with no associated item content, the invariant stability and meaningfulness of the construct cannot be illustrated or annotated. But such illustration also is implicit in the quest for reliable instrumentation: the evidentiary basis for a delineation of meaningful expressions of amounts of the thing measured. The hope to be gleaned from the successes in theoretical prediction achieved to date is that we might arrive at practical applications of psychosocial measures that are as meaningful, useful, and economically productive as the theoretical applications of electromagnetism, thermodynamics, etc. that we take for granted in the technologies of everyday life.

Table 1

Reliability and Consistency Statistics

22 Persons, 23 Items, 506 Data Points

| Data set | Intended reliability | Winsteps Real/Model Person Separation Reliability | Winsteps/SPSS Cronbach’s alpha | Winsteps Person Infit/Outfit Average Mn Sq | Winsteps Person Infit/Outfit SD | Winsteps Real/Model Item Separation Reliability | Winsteps Item Infit/Outfit Average Mn Sq | Winsteps Item Infit/Outfit SD | Log-Likelihood Chi-Sq/d.f./p |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| First | Best | .96/.97 | .96/.957 | 1.04/.35 | .49/.25 | .95/.96 | 1.08/0.35 | .36/.19 | 185/462/1.00 |
| Second | Worst | .00/.00 | .00/-1.668 | 1.00/1.00 | .05/.06 | .00/.00 | 1.00/1.00 | .05/.06 | 679/462/.0000 |
| Third | Good | .90/.91 | .93/.927 | .92/2.21 | .30/2.83 | .85/.88 | .90/2.13 | .64/3.43 | 337/462/.9996 |
| Fourth | Fair | .86/.87 | .89/.891 | .96/1.91 | .25/2.18 | .79/.83 | .94/1.68 | .53/2.27 | 444/462/.7226 |
| Fifth | Poor | .76/.77 | .80/.797 | .98/1.15 | .24/.67 | .59/.65 | .99/1.15 | .41/.84 | 550/462/.0029 |

Table 2

Principal Components Analysis

| Data set | Intended reliability | % Raw Variance Explained by Measures/Persons/Items | % Raw Variance Captured in First Three Contrasts | Total number of loadings with absolute value > .40 in first contrast |
| --- | --- | --- | --- | --- |
| First | Best | 76/41/35 | 12 | 8 |
| Second | Worst | 4.3/1.7/2.6 | 56 | 15 |
| Third | Good | 59/34/25 | 20 | 14 |
| Fourth | Fair | 47/27/20 | 26 | 13 |
| Fifth | Poor | 29/17/11 | 41 | 15 |


Andrich, D. (1982, June). An index of person separation in Latent Trait Theory, the traditional KR-20 index, and the Guttman scale response pattern. Education Research and Perspectives, 9(1),

Andrich, D., & Douglas, G. A. (1977). Reliability: Distinctions between item consistency and subject separation with the simple logistic model. Paper presented at the Annual Meeting of the American Educational Research Association, New York.

Dawson, T. L. (2004, April). Assessing intellectual development: Three approaches, one sequence. Journal of Adult Development, 11(2), 71-85.

Fisher, W. P., Jr. (1992). Reliability statistics. Rasch Measurement Transactions, 6(3), 238.

Fisher, W. P., Jr. (2008, Summer). The cash value of reliability. Rasch Measurement Transactions, 22(1), 1160-3.

Green, S. B., Lissitz, R. W., & Mulaik, S. A. (1977, Winter). Limitations of coefficient alpha as an index of test unidimensionality. Educational and Psychological Measurement, 37(4), 827-833.

Hattie, J. (1985, June). Methodology review: Assessing unidimensionality of tests and items. Applied Psychological Measurement, 9(2), 139-64.

Levine, G., & Parkinson, S. (1994). Experimental methods in psychology. Mahwah, NJ: Lawrence Erlbaum Associates, Inc.

Linacre, J. M. (1993). Rasch-based generalizability theory. Rasch Measurement Transactions, 7(1), 283-284.

Linacre, J. M. (1996). True-score reliability or Rasch statistical validity? Rasch Measurement Transactions, 9(4), 455.

Linacre, J. M. (1997). KR-20 or Rasch reliability: Which tells the “Truth?”. Rasch Measurement Transactions, 11(3), 580-1.

Smith, R. M. (1996). A comparison of methods for determining dimensionality in Rasch measurement. Structural Equation Modeling, 3(1), 25-40.

Stenner, A. J., Burdick, H., Sanford, E. E., & Burdick, D. S. (2006). How accurate are Lexile text measures? Journal of Applied Measurement, 7(3), 307-22.

Wood, R. (1978). Fitting the Rasch model: A heady tale. British Journal of Mathematical and Statistical Psychology, 31, 27-32.

Wright, B. D. (1977). Solving measurement problems with the Rasch model. Journal of Educational Measurement, 14(2), 97-116.

Wright, B. D. (1980). Foreword, Afterword. In Probabilistic models for some intelligence and attainment tests, by Georg Rasch (pp. ix-xix, 185-199). [Reprint; original work published in 1960 by the Danish Institute for Educational Research]. Chicago, Illinois: University of Chicago Press.

Wright, B. D. (1994, Summer). Theory construction from empirical observations. Rasch Measurement Transactions, 8(2), 362.

Wright, B. D. (1995, Summer). Which standard error? Rasch Measurement Transactions, 9(2), 436-437.

Wright, B. D. (1996, Winter). Reliability and separation. Rasch Measurement Transactions, 9(4), 472.

Wright, B. D., & Stone, M. H. (2003). Five steps to science: Observing, scoring, measuring, analyzing, and applying. Rasch Measurement Transactions, 17(1), 912-913.


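Data Set 1 [response matrix not shown]

Data Set 2 [response matrix not shown]

Data Set 3 [response matrix not shown]

Data Set 4 [response matrix not shown]

Data Set 5 [response matrix not shown]
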
Creative Commons License
LivingCapitalMetrics Blog by William P. Fisher, Jr., Ph.D. is licensed under a Creative Commons Attribution-Noncommercial-No Derivative Works 3.0 United States License.