Differential Item Functioning Analysis of the aReading & aMath Item Banks
FastBridge is committed to providing psychometrically sound assessments that yield valid and fair results for all students. One component of this commitment is evaluating how our test items function in specific groups defined by gender and race/ethnicity. Items that perform differently in two groups may indicate bias, although a definitive determination of bias requires expert review and judgment. This report describes the results of a statistical bias analysis of the FastBridge aReading and aMath item banks.
Overall, more than 99% of the aReading and aMath items showed no statistical bias across Kindergarten through Grade 8. Among the aReading items, eight met the criteria for statistical bias; half reflected gender bias and the other half reflected Hispanic-white bias. Among the aMath items, eight met the criteria for bias; half reflected gender bias and the other half were evenly split across minority-white group comparisons. With so few items exhibiting statistical bias, we can be confident that FastBridge aReading and aMath scores provide a fair and equitable assessment of overall reading and math skills.
Differential Item Functioning
Method
aReading and aMath items were evaluated for statistical bias using a method called differential item functioning (DIF). This method compares, across groups, the probability that students with the same overall ability will answer an item correctly. The probability of item success is computed across the full ability range and averaged at the group level. Thus, DIF indexes the overall performance difference between two groups of students (e.g., female and male) on each item after matching the groups on ability. An item is considered fair, or free from bias, when there is no difference between the groups.
A well-researched statistical model, logistic regression, was used to evaluate DIF. In the logistic regression model, the log odds of answering an item correctly are a function of ability (θ) and group membership, G, coded 0 for the reference group and 1 for the focal group:

ln[P / (1 − P)] = b0 + b1θ + b2G
An item is considered biased if the b2 coefficient is statistically significant. However, statistical significance depends heavily on sample size: with large samples, even very small differences can reach significance and erroneously signal DIF when it is not present. For this reason, researchers also use effect sizes to identify DIF items. In the logistic model, the effect size is the improvement in model fit, measured by the change in R² between the reduced model without the group variable, G, and the full model that includes it. Effect sizes are categorized into three levels: negligible, moderate, or large (Jodoin & Gierl, 2001).
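As a rough illustration of this procedure, the sketch below simulates responses to one item, fits the reduced (ability-only) and full (ability plus group) logistic models by maximum likelihood, and computes the change in Nagelkerke R². The simulated coefficients, sample size, and data are hypothetical, not FastBridge's production values.

```python
import numpy as np
from scipy.optimize import minimize

def fit_logit(X, y):
    """Fit a logistic regression by maximum likelihood; return the log-likelihood."""
    def nll(b):
        z = X @ b
        # Negative Bernoulli log-likelihood with a logit link, written stably.
        return np.sum(np.logaddexp(0.0, z)) - y @ z
    res = minimize(nll, np.zeros(X.shape[1]), method="BFGS")
    return -res.fun

def nagelkerke_r2(ll_model, ll_null, n):
    """Cox-Snell R^2 rescaled to a 0-1 range (Nagelkerke)."""
    cox_snell = 1.0 - np.exp(2.0 * (ll_null - ll_model) / n)
    return cox_snell / (1.0 - np.exp(2.0 * ll_null / n))

def dif_effect_size(theta, group, correct):
    """Delta R^2 between the reduced (theta-only) and full (theta + G) models."""
    n = len(correct)
    ones = np.ones(n)
    ll_null = fit_logit(ones[:, None], correct)                       # intercept only
    ll_reduced = fit_logit(np.column_stack([ones, theta]), correct)   # b0 + b1*theta
    ll_full = fit_logit(np.column_stack([ones, theta, group]), correct)  # + b2*G
    return (nagelkerke_r2(ll_full, ll_null, n)
            - nagelkerke_r2(ll_reduced, ll_null, n))

# Simulated item that is harder for the focal group at equal ability (uniform DIF).
rng = np.random.default_rng(0)
n = 2000
theta = rng.normal(size=n)
group = rng.integers(0, 2, size=n).astype(float)   # 0 = reference, 1 = focal
logit_p = 0.2 + 1.0 * theta - 0.8 * group          # b2 = -0.8 injects DIF
correct = (rng.random(n) < 1.0 / (1.0 + np.exp(-logit_p))).astype(float)

delta_r2 = dif_effect_size(theta, group, correct)
# Commonly cited Jodoin & Gierl (2001) guidelines:
# < 0.035 negligible, 0.035-0.07 moderate, > 0.07 large.
if delta_r2 < 0.035:
    label = "negligible"
elif delta_r2 <= 0.07:
    label = "moderate"
else:
    label = "large"
```

Because the full model nests the reduced model, ΔR² is non-negative at the optimum; with the injected group effect it is clearly positive here.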
Data
The source data were all aReading and aMath test administrations from the fall of 2019. To be included in the DIF analyses, a student's record had to indicate gender and race/ethnicity. Tables 1 and 2 describe the aReading and aMath samples used in the DIF analyses. In most grades, item scores from more than 50,000 students were used. The samples were representative of the national population, as evidenced by median national percentiles near 50.
Table 1. Descriptive statistics of aReading DIF sample

Grade   Count    Scale Score Mean   Scale Score SD   Median National Percentile
K       19,445   393.3              25.7             45
1       37,070   435.9              30.2             45
2       82,916   470.0              29.8             47
3       81,415   492.8              26.3             54
4       83,057   506.0              24.3             56
5       84,437   515.9              23.9             51
6       73,720   523.5              24.3             55
7       60,291   529.3              25.5             54
8       57,923   534.8              26.0             43
Table 2. Descriptive statistics of aMath DIF sample

Grade   Count    Scale Score Mean   Scale Score SD   Median National Percentile
K       7,988    178.7              6.8              47
1       34,313   189.2              8.1              43
2       75,012   198.8              8.0              53
3       72,800   205.7              7.6              57
4       73,072   210.4              8.2              56
5       72,527   215.9              9.9              54
6       59,602   219.5              11.2             56
7       50,305   222.4              12.1             51
8       46,966   224.2              12.3             47
Results
Because these are adaptive tests, some items are administered to too few students to support DIF analysis. About 40% of the administered aReading items and about 55% of the administered aMath items lacked sufficient sample sizes. To maximize the number of items with enough data to conduct DIF, adjacent grades were combined.
DIF was evaluated for every item with scores from at least 250 students in each group. Simulation research shows that logistic regression achieves adequate power to detect uniform DIF with at least 250 students per group (Jodoin & Gierl, 1999). Table 3 summarizes the DIF results.
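The sample-size screen described above can be sketched as a simple count of responses per item and group; the data frame layout, column names, and counts below are hypothetical.

```python
import pandas as pd

# Hypothetical response log: one row per (student, item), with the student's group.
responses = pd.DataFrame({
    "item_id": ["i1"] * 600 + ["i2"] * 300,
    "group":   (["ref"] * 300 + ["focal"] * 300)    # i1: 300 per group
             + (["ref"] * 280 + ["focal"] * 20),    # i2: too few focal students
})

MIN_PER_GROUP = 250  # power threshold from Jodoin & Gierl (1999)

# Count responses per item and group, then keep items meeting the threshold in both.
counts = responses.groupby(["item_id", "group"]).size().unstack(fill_value=0)
eligible = counts[(counts["ref"] >= MIN_PER_GROUP)
                  & (counts["focal"] >= MIN_PER_GROUP)]
```

Here only `i1` survives the screen; `i2` has just 20 focal-group responses and is excluded from the DIF analysis.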
Table 3. Total number of items and number of DIF items by grade level and test

                       aReading                               aMath
Grade   Focal Group    Total   Favors Ref.   Favors Focal    Total   Favors Ref.   Favors Focal
K-1     Asian          153     -             -               116     -             -
K-1     Af. Amer.      152     -             -               110     -             -
K-1     Hispanic       148     -             -               119     -             1
K-1     Male           214     -             -               164     -             -
2-3     Asian          222     -             -               191     1             -
2-3     Af. Amer.      233     -             -               195     -             1
2-3     Hispanic       234     -             -               198     -             -
2-3     Male           267     -             -               236     -             -
4-5     Asian          187     -             -               235     -             1
4-5     Af. Amer.      241     -             -               233     -             -
4-5     Hispanic       253     -             -               231     -             -
4-5     Male           276     2             -               291     1             -
6-8     Asian          195     -             -               246     -             -
6-8     Af. Amer.      241     -             -               275     -             -
6-8     Hispanic       254     3             1               270     -             -
6-8     Male           294     2             -               320     3             -
Table 3 shows, for each comparison, the total number of items with enough data to conduct DIF, along with the focal groups. The reference group for each race/ethnicity comparison was white students; females were the reference group in the gender comparison. Items counted in the "Favors Reference" column showed DIF favoring the reference group; that is, after matching on ability, the reference group was more likely to answer the item correctly. Items counted in the "Favors Focal" column were easier for the focal group.
None of the items in any comparison in reading or math displayed large DIF. Among the aReading items, none displayed even moderate DIF in Kindergarten through Grade 3. Two items favored females in Grades 4-5 and two more in Grades 6-8; also in Grades 6-8, three items favored whites over Hispanics and one favored Hispanics. Among the aMath items, one item in Grades K-1 favored Hispanics over whites; in Grades 2-3, one item favored African-Americans over whites and one favored whites over Asians. In Grades 4-5, one item favored Asians over whites and one favored females over males. In Grades 6-8, three items favored females.
Collectively, fewer than one-half of one percent of the items displayed moderate DIF; the remaining items, more than 99%, functioned equivalently across groups. This is strong evidence that the aReading and aMath item banks function fairly for all key demographic groups.
DIF is a necessary but not sufficient condition for item bias; the determination of bias also requires expert judgment. As part of our effort to provide content that is fair and valid, FastBridge researchers conduct DIF analyses on a regular basis. Items flagged as demonstrating DIF may be deactivated. Where appropriate, items may be revised, recalibrated, and reintroduced into the item bank.
References
Jodoin, M. G., & Gierl, M. J. (2001). Evaluating Type I error and power rates using an effect size measure with the logistic regression procedure for DIF detection. Applied Measurement in Education, 14, 329-349.