Assessment tools report many different types of scores, and these can be confusing to users. One type sometimes reported is the “strand” score, which is intended to indicate a student’s performance on a subtest, subdomain, or set of standards. Unfortunately, strand scores do this poorly, in part because they lack reliability and validity.
Even though there is robust evidence that formative assessment improves teacher practice and student outcomes (Kingston & Nash, 2011; Black & Wiliam, 1998), the use of so-called strand scores is misleading. Many researchers have examined the value of strand scores and subtest scores, and they consistently conclude that there are very few tests for which those scores add value (Puhan, Sinharay, Haberman, & Larkin, 2008; Sinharay & Haberman, 2008; Haberman, 2008; Haberman, Sinharay, & Puhan, 2009). Although strand scores may appear to be useful, they lack reliability and validity and therefore do not provide useful or unique information.
Evidence: Invalid Scores and Insignificant Effects
Research and assessment experts in both school districts and state departments of education have demonstrated that strand scores for individual students lack reliability and validity (e.g., Chan, Vanden Berk, & Denbleyker, 2010). Those researchers concluded that (a) strand scores are substantially less reliable and valid than the total score, (b) strand scores were poor predictors of strand scores on the state tests, (c) the full scale score was a better predictor of strand scores on the state test, and (d) test-retest reliability of strand scores was poor and insufficient for interpretation. This is not a problem that is specific to one vendor; it is a general problem with strand scores.
In addition, a large-scale study funded by the US Department of Education concluded that teachers do not differentiate their instruction or improve student outcomes when strand scores are implemented, even when teachers are provided with training in the interpretation and use of strand scores (Cordray, Pion, Brandt, Molefe, & Toby, 2012).
Here’s why. Most tests are developed to support the interpretation of full scale scores, not subtest or strand scores. This is true of the prominent computer adaptive tests (e.g., aReading, MAP, STAR). Almost all widely used computer adaptive tests are built on a unidimensional measurement model, which should preclude the use of strand scores. Reporting strand scores violates a fundamental assumption of that model: that the test assesses a single underlying dimension.
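The unidimensionality point can be illustrated with a small simulation. The sketch below (illustrative only; the item counts, difficulties, and strand labels are assumed, not taken from any real test) generates responses from a simple one-dimensional logistic model, arbitrarily labels the first half of the items “Strand A” and the second half “Strand B,” and shows that the two strand scores are highly correlated: both are just noisy measures of the same trait, not distinct skills.

```python
import math
import random

random.seed(1)

# Simulate a unidimensional test: every item measures the same single
# trait, so "strand" subscores carry no unique information.
n_students, n_items = 2000, 40
difficulties = [-1.5 + 3.0 * i / (n_items - 1) for i in range(n_items)]

def prob_correct(theta, b):
    """One-parameter logistic (Rasch-like) response probability."""
    return 1.0 / (1.0 + math.exp(-(theta - b)))

strand_a, strand_b = [], []
for _ in range(n_students):
    theta = random.gauss(0, 1)  # the single underlying dimension
    resp = [1 if random.random() < prob_correct(theta, b) else 0
            for b in difficulties]
    strand_a.append(sum(resp[:20]))   # items arbitrarily labeled "Strand A"
    strand_b.append(sum(resp[20:]))   # items arbitrarily labeled "Strand B"

def corr(x, y):
    """Pearson correlation, computed from scratch for self-containment."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return sxy / (sx * sy)

# The two "strands" correlate highly, as the unidimensional model predicts.
print(round(corr(strand_a, strand_b), 2))
```

Under this model, any difference between a student’s Strand A and Strand B scores is measurement noise, which is exactly why reporting the strands separately invites over-interpretation.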
Testing Standards. The Standards for Educational and Psychological Testing (1999, 2014) is the definitive authority on the development and use of tests and scores. Several of its standards explicitly require evidence for each proposed use of a test and score. For example:
- When interpretation of performance on specific items, or small subsets of items, is suggested, the rationale and relevant evidence in support of such interpretation should be provided (Standard 1.10).
- When test interpretation emphasizes differences between two observed scores … reliability data, including standard errors, should be provided for such differences (Standard 2.3).
Before strand scores can be used, test developers must report evidence that includes the standard error of measurement for each strand score. Those standard errors are often so large that it becomes clear why strand scores are not reliable or valid.
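The arithmetic behind those large standard errors can be sketched with two standard psychometric formulas: the Spearman-Brown prophecy formula, which predicts how reliability drops when a test is shortened to a strand’s worth of items, and the standard error of measurement, SEM = SD × √(1 − reliability). The numbers below are assumed for illustration and do not describe any particular test.

```python
import math

def spearman_brown(full_reliability, length_ratio):
    """Predicted reliability of a shortened test (Spearman-Brown prophecy)."""
    r, k = full_reliability, length_ratio
    return (k * r) / (1 + (k - 1) * r)

def sem(sd, reliability):
    """Standard error of measurement: SD * sqrt(1 - reliability)."""
    return sd * math.sqrt(1.0 - reliability)

# Illustrative (assumed) values: a 40-item test with total-score
# reliability .92, reported on a scale with SD = 15.
full_r, sd_scale = 0.92, 15.0

# A strand drawing on 8 of the 40 items is one fifth the length.
strand_r = spearman_brown(full_r, 8 / 40)
print(round(strand_r, 2))                 # strand reliability falls to about .70
print(round(sem(sd_scale, full_r), 1))    # total-score SEM: about 4.2 points
print(round(sem(sd_scale, strand_r), 1))  # strand SEM: about 8.3 points

# SEM of the difference between two such strand scores (cf. Standard 2.3):
# sqrt(sem1**2 + sem2**2), larger still than either strand's SEM alone.
print(round(math.sqrt(2) * sem(sd_scale, strand_r), 1))  # about 11.7 points
```

With a strand standard error of roughly half a standard deviation, a student’s “strength” in one strand and “weakness” in another can easily be nothing but measurement error, which is the practical meaning of the Standards’ requirement to report standard errors for score differences.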
References

American Educational Research Association, American Psychological Association, & National Council on Measurement in Education. (1999). Standards for educational and psychological testing. Washington, DC: American Educational Research Association.
Black, P., & Wiliam, D. (1998). Assessment and classroom learning. Assessment in Education: Principles, Policy & Practice, 5, 7–74.
Chan, C., Vanden Berk, E., & Denbleyker, J. (2010). Linking MAP strand scores to MCA-II Strand Scores. Presentation at the Minnesota Assessment Conference, Minnesota Department of Education with Minneapolis Public Schools.
Cordray, D., Pion, G., Brandt, C., Molefe, A., & Toby, M. (2012). The impact of the Measures of Academic Progress (MAP) program on student reading achievement. Institute of Educational Sciences, US Department of Education (NCEE 2013-4000).
Haberman, S. J. (2008). When can subscores have value? Journal of Educational and Behavioral Statistics, 33, 204–229.
Kingston, N., & Nash, B. (2011). Formative assessment: A meta-analysis and a call for research. Educational Measurement: Issues and Practice, 30, 28–37.
Sinharay, S. (2010). How often do subscores have added value? Results from operational and simulated data. Journal of Educational Measurement, 47, 150–174.
Sinharay, S., Haberman, S. J., & Puhan, G. (2007). Subscores based on classical test theory: To report or not to report. Educational Measurement: Issues and Practice, 26, 21–28.