Differential Item Functioning Analysis of the aReading & aMath Item Banks
FastBridge is committed to providing psychometrically sound assessments that yield valid and fair results for all students. One component of this commitment is evaluating how test items function in specific groups defined by gender and race/ethnicity. Items that perform differently in two groups may indicate bias, although a definitive determination of bias requires expert review and judgment. This report describes the results from the statistical analysis of bias in the FastBridge aReading and aMath item banks.
Overall, more than 99% of the aReading and aMath items showed no statistical bias across Kindergarten through Grade 8. Among the aReading items, eight met the criteria for statistical bias; half reflected gender bias and the other half reflected Hispanic-white bias. Among the aMath items, eight met the criteria for bias; half reflected gender bias and the other half were split across the minority-white group comparisons. With so few items exhibiting statistical bias, we can be confident that FastBridge aReading and aMath scores provide fair and equitable assessment of overall reading and math skills.
Differential Item Functioning
Method
aReading and aMath items were evaluated for statistical bias using a method called differential item functioning (DIF). This method estimates the probability that students with the same overall ability will answer an item correctly. The probability of item success is computed across the full ability range and averaged at the group level. Thus, DIF indexes the overall performance difference between two groups of students (e.g., female and male) on each item after matching the groups on ability. An item is considered fair, or free from bias, when there is no difference across the groups.
A well-researched statistical model, logistic regression, was used to evaluate DIF. In the logistic regression model, the log odds of answering an item correctly is a function of ability (θ) and group membership, G, coded 0 for the reference group and 1 for the focal group:

ln[p / (1 − p)] = b0 + b1θ + b2G
An item is flagged if the b2 coefficient is statistically significant. However, statistical significance depends heavily on sample size: with large samples, even very small differences can be significant, erroneously indicating DIF where none of practical consequence exists. For this reason, researchers also use effect sizes to identify DIF items. In the logistic model, the effect size is the improvement in model fit, measured by the change in R2 between the reduced model without the group variable, G, and the full model that includes it. Effect sizes are categorized into three levels: negligible, moderate, or large (Jodoin & Gierl, 2001).
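The procedure described above can be sketched as follows. This is an illustrative sketch, not FastBridge's implementation: the function names and the Newton-Raphson fitting routine are assumptions, and it evaluates only uniform DIF. The effect-size cutoffs (0.035 and 0.07 on the Nagelkerke ΔR² scale) follow Jodoin and Gierl (2001).

```python
import numpy as np

def fit_logistic(X, y, iters=25):
    """Fit a logistic regression by Newton-Raphson and return its log-likelihood."""
    X = np.column_stack([np.ones(len(y)), X])   # prepend an intercept column
    b = np.zeros(X.shape[1])
    for _ in range(iters):
        p = 1.0 / (1.0 + np.exp(-X @ b))
        W = p * (1.0 - p)                        # observation weights
        H = X.T @ (X * W[:, None]) + 1e-8 * np.eye(X.shape[1])  # tiny ridge for stability
        b = b + np.linalg.solve(H, X.T @ (y - p))
    p = np.clip(1.0 / (1.0 + np.exp(-X @ b)), 1e-12, 1 - 1e-12)
    return np.sum(y * np.log(p) + (1.0 - y) * np.log(1.0 - p))

def nagelkerke_r2(ll_model, ll_null, n):
    """Nagelkerke pseudo-R^2 relative to the intercept-only model."""
    cox_snell = 1.0 - np.exp(2.0 * (ll_null - ll_model) / n)
    return cox_snell / (1.0 - np.exp(2.0 * ll_null / n))

def uniform_dif_effect(theta, group, y):
    """Change in R^2 when group membership G is added to the ability-only model."""
    n = len(y)
    ll_null = fit_logistic(np.empty((n, 0)), y)                 # intercept only
    ll_reduced = fit_logistic(theta[:, None], y)                # b0 + b1*theta
    ll_full = fit_logistic(np.column_stack([theta, group]), y)  # b0 + b1*theta + b2*G
    return nagelkerke_r2(ll_full, ll_null, n) - nagelkerke_r2(ll_reduced, ll_null, n)

def classify_dif(delta_r2):
    """Jodoin & Gierl (2001) effect-size categories."""
    if delta_r2 < 0.035:
        return "negligible"
    if delta_r2 < 0.07:
        return "moderate"
    return "large"
```

Because the full model nests the reduced model, ΔR² is never negative; an item is flagged only when the change is at least moderate, regardless of the significance of b2.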
Data
The source data were all aReading and aMath test administrations from the fall of 2019. To be included in the DIF analyses, the student record had to indicate the student’s gender and race/ethnicity. Tables 1 and 2 describe the aReading and aMath samples used to conduct the DIF analyses. In most grades, the item scores from more than 50,000 students were used. The samples were representative of the national population, as evidenced by median national percentiles near 50.
Table 1. Descriptive statistics of aReading DIF sample
| Grade | Count | Mean Scale Score | SD Scale Score | Median Percentile |
|-------|--------|------------------|----------------|-------------------|
| K | 19,445 | 393.3 | 25.7 | 45 |
| 1 | 37,070 | 435.9 | 30.2 | 45 |
| 2 | 82,916 | 470.0 | 29.8 | 47 |
| 3 | 81,415 | 492.8 | 26.3 | 54 |
| 4 | 83,057 | 506.0 | 24.3 | 56 |
| 5 | 84,437 | 515.9 | 23.9 | 51 |
| 6 | 73,720 | 523.5 | 24.3 | 55 |
| 7 | 60,291 | 529.3 | 25.5 | 54 |
| 8 | 57,923 | 534.8 | 26.0 | 43 |
Table 2. Descriptive statistics of aMath DIF sample
| Grade | Count | Mean Scale Score | SD Scale Score | Median Percentile |
|-------|--------|------------------|----------------|-------------------|
| K | 7,988 | 178.7 | 6.8 | 47 |
| 1 | 34,313 | 189.2 | 8.1 | 43 |
| 2 | 75,012 | 198.8 | 8.0 | 53 |
| 3 | 72,800 | 205.7 | 7.6 | 57 |
| 4 | 73,072 | 210.4 | 8.2 | 56 |
| 5 | 72,527 | 215.9 | 9.9 | 54 |
| 6 | 59,602 | 219.5 | 11.2 | 56 |
| 7 | 50,305 | 222.4 | 12.1 | 51 |
| 8 | 46,966 | 224.2 | 12.3 | 47 |
Results
Because these are adaptive tests, some items are administered to too few students to support DIF analysis. About 40% of the administered aReading items and about 55% of the administered aMath items did not have sufficient numbers to conduct DIF. To maximize the number of items with enough data, adjacent grades were combined.
DIF was evaluated for every item with scores from at least 250 students in each group. Simulation research on the power to detect uniform DIF with logistic regression shows that adequate power can be achieved with at least 250 students per group (Jodoin & Gierl, 2001). Table 3 summarizes the DIF results.
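The sample-size screen described above can be sketched as follows. The data layout, (item_id, group) pairs with groups labeled "ref" and "focal", is a hypothetical simplification of the actual administration records, and the function name is illustrative.

```python
from collections import Counter

def items_eligible_for_dif(records, min_per_group=250):
    """Return the items with at least `min_per_group` scored responses in both
    the reference and the focal group.

    `records` is an iterable of (item_id, group) pairs, where group is
    "ref" or "focal".
    """
    counts = Counter(records)                    # tally responses per (item, group)
    item_ids = {item for item, _ in counts}
    return sorted(
        item for item in item_ids
        if counts[(item, "ref")] >= min_per_group
        and counts[(item, "focal")] >= min_per_group
    )
```

Items that fall below the threshold in either group are simply excluded from that comparison; combining adjacent grades, as described above, raises per-item counts and lets more items clear the screen.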
Table 3. Total number of items and number of DIF items by grade level and test
| Grade | Focal Group | aReading: Total Items | aReading: Favors Reference | aReading: Favors Focal | aMath: Total Items | aMath: Favors Reference | aMath: Favors Focal |
|-------|-------------|----------------------|----------------------------|------------------------|--------------------|--------------------------|----------------------|
| K - 1 | Asian | 153 | -- | -- | 116 | -- | -- |
| K - 1 | Af. Amer. | 152 | -- | -- | 110 | -- | -- |
| K - 1 | Hispanic | 148 | -- | -- | 119 | -- | 1 |
| K - 1 | Male | 214 | -- | -- | 164 | -- | -- |
| 2 - 3 | Asian | 222 | -- | -- | 191 | 1 | -- |
| 2 - 3 | Af. Amer. | 233 | -- | -- | 195 | -- | 1 |
| 2 - 3 | Hispanic | 234 | -- | -- | 198 | -- | -- |
| 2 - 3 | Male | 267 | -- | -- | 236 | -- | -- |
| 4 - 5 | Asian | 187 | -- | -- | 235 | -- | 1 |
| 4 - 5 | Af. Amer. | 241 | -- | -- | 233 | -- | -- |
| 4 - 5 | Hispanic | 253 | -- | -- | 231 | -- | -- |
| 4 - 5 | Male | 276 | 2 | -- | 291 | 1 | -- |
| 6 - 8 | Asian | 195 | -- | -- | 246 | -- | -- |
| 6 - 8 | Af. Amer. | 241 | -- | -- | 275 | -- | -- |
| 6 - 8 | Hispanic | 254 | 3 | 1 | 270 | -- | -- |
| 6 - 8 | Male | 294 | 2 | -- | 320 | 3 | -- |
Table 3 shows, for each comparison, the total number of items with enough data to conduct DIF, along with the focal group. The reference group for each race/ethnicity comparison was whites; females constituted the reference group in the male/female comparison. Items counted in the Favors Reference column showed DIF favoring the reference group; that is, the reference group was more likely to answer the item correctly after matching on ability. Items counted in the Favors Focal column were easier for the focal group.
None of the items in any comparison, in reading or math, displayed large DIF, and none of the aReading items displayed even moderate DIF in Kindergarten through Grade 3. Among the aReading items, two favored females in Grades 4-5 and two in Grades 6-8; in Grades 6-8, three items favored whites over Hispanics and one favored Hispanics over whites. Among the aMath items, one item in K-1 favored Hispanics over whites. In Grades 2-3, one item favored whites over Asians and one favored African-Americans over whites. In Grades 4-5, one item favored Asians over whites and one favored females over males. In Grades 6-8, three items favored females.
Collectively, fewer than one-half of one percent of the items displayed moderate DIF; more than 99% functioned equivalently across groups. This is strong evidence that the aReading and aMath item banks function fairly for all key demographic groups.
DIF is a necessary but not sufficient condition for item bias. The determination of bias also requires expert judgment. As part of our effort to provide content that is fair and valid, FastBridge researchers conduct DIF analyses on a regular basis. Items flagged as demonstrating DIF may be deactivated. Where appropriate, items may be revised and recalibrated for reintroduction into the item bank.
References
Jodoin, M. G., & Gierl, M. J. (2001). Evaluating Type I error and power rates using an effect size measure with the logistic regression procedure for DIF detection. Applied Measurement in Education, 14(4), 329-349.