During the construction of psychological tests, developers must determine the effectiveness of each individual test item.
This process of evaluating each item is called Item Analysis.
With commercially developed tests, item analysis is done on a representative sample of the intended test audience before the test is released.
Item Analysis can identify which items should be reworded (or dropped), and tells us how good a job each individual item does in predicting the overall score.
Item Analysis can tell us:
How difficult an individual item is.
How good a job a particular item does in discriminating between high and low performance on the test.
How do we determine what constitutes high or low performance on a psychological test?
There are two ways:
Criterion-Referenced (domain-referenced) Testing: High and low performance is determined by the test developer, who compares test performance to a set list of objectives or standards.
Norming Distributions: High or low performance on the test is determined by comparing an individual score to the score distribution of a representative sample of test takers.
The item analysis procedures vary, depending upon which criterion procedure is used.
Test Validity: Does the test actually measure what it intends to measure?
Item Validity: Does the specific test item correlate with what you are trying to measure?
We can assess item validity with respect to both internal and external criteria.
External Criterion: Data from outside the test that we expect to correlate in some meaningful way with our test items.
For achievement tests, an external criterion might be your overall grade in that subject area.
For aptitude tests, an external criterion might be your supervisor's rating of your job performance.
To determine the validity of an individual test item, we correlate the scores on that test item with the external criterion from the domain of interest.
Formula for the point-biserial coefficient for determining item validity (4.1 in text):
$r_{pb} = \frac{\bar{Y}_p - \bar{Y}}{s} \sqrt{\frac{n_t n_p}{(n_t - n_p)(n_t - 1)}}$
$n_t$ = total number of test takers
$n_p$ = number of test takers who get the item correct
$\bar{Y}_p$ = the average of the external criterion scores of people who get that test item correct
$\bar{Y}$ = the average of the external criterion scores of all test takers
$s$ = the standard deviation of the external criterion scores
Example:
30 people take a test to measure job aptitude.
On question 7, 17 people get the answer right.
The average external criterion score for the entire group is 75, and the average external criterion score for the individuals who got question 7 right is 80.
The standard deviation of the external criterion is 10.
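The example can be worked through with equation 4.1. The following is a minimal sketch in Python; the function name `point_biserial` and its parameter names are illustrative choices, not taken from the text.

```python
import math

def point_biserial(n_t, n_p, mean_p, mean_all, s):
    """Point-biserial coefficient as written in equation 4.1.

    n_t      -- total number of test takers
    n_p      -- number of test takers who got the item correct
    mean_p   -- mean criterion score of those who got the item correct
    mean_all -- mean criterion score of all test takers
    s        -- standard deviation of the criterion scores
    """
    if not 0 < n_p < n_t:
        # The square-root term is undefined if everyone or no one passes the item.
        raise ValueError("n_p must be strictly between 0 and n_t")
    return ((mean_p - mean_all) / s) * math.sqrt((n_t * n_p) / ((n_t - n_p) * (n_t - 1)))

# Values from the question 7 example.
r_pb = point_biserial(n_t=30, n_p=17, mean_p=80, mean_all=75, s=10)
print(round(r_pb, 2))
```

For question 7 this gives (80 - 75) / 10 times the square root of 510 / 377, or about .58.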
The higher the correlation (closer to 1), the more accurate (or valid) the test item is.
Test items with 0 or a very low correlation (< .2) should be reworded or removed from the test.
The point-biserial coefficient formula can also be used with an internal criterion such as the total score on the test itself.
Another way to measure internal consistency (the relationship between performance on an individual item and performance on the entire test) is to examine the performance of the low and high performers on an item.
Split test takers into three groups, based on overall test performance:
Top 25% or 27% (depends upon sample size; if there are more than thirty test takers, use 27%)
Middle group (the remaining test takers, roughly 50%)
Bottom 25% or 27% (the same size as the top group)
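The grouping step can be sketched in Python as follows. The function name `split_high_low`, the default fraction of 0.27, and the way ties are handled (test takers with equal scores keep their original order) are assumptions made for illustration; the scores are hypothetical.

```python
def split_high_low(total_scores, fraction=0.27):
    """Return (high, low) lists of test-taker indices, ranked by total score."""
    n = len(total_scores)
    group_size = round(n * fraction)
    # Rank test takers from highest to lowest total score.
    ranked = sorted(range(n), key=lambda i: total_scores[i], reverse=True)
    high = ranked[:group_size]        # top scorers
    low = ranked[n - group_size:]     # bottom scorers
    return high, low

# Hypothetical total scores for 200 test takers.
scores = list(range(200))
high, low = split_high_low(scores)
print(len(high), len(low))
```

With 200 test takers and a fraction of 0.27, each group contains 54 people; everyone else falls in the middle group.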
We need to know two things:
The number of people in the high and low groups.
The number of people in the high and low groups who get a particular answer right.
Assume 200 people take the test.
Our top group is the highest 54 scorers.
Our bottom group is the lowest 54 scorers.
On question 17, 48 of the top group and 37 of the bottom group got the question correct.
What is the item difficulty and the item discriminability of this question?
Item difficulty and item discriminability are both measures of internal consistency.
Item Difficulty Index (4.2): $p = \frac{U_p + L_p}{U + L}$
Item Discrimination Index (4.3): $D = \frac{U_p - L_p}{U}$
$U_p$ = number of high performers who got the question right
$L_p$ = number of low performers who got the question right
$U$ = number of high performers
$L$ = number of low performers
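Applying equations 4.2 and 4.3 to the question 17 example can be sketched in Python as follows. The function and parameter names are illustrative, not from the text.

```python
def item_difficulty(up, lp, n_high, n_low):
    """Item difficulty index p (equation 4.2)."""
    return (up + lp) / (n_high + n_low)

def item_discrimination(up, lp, n_high):
    """Item discrimination index D (equation 4.3); assumes equal-sized groups."""
    return (up - lp) / n_high

# Question 17: 48 of 54 high performers and 37 of 54 low performers were correct.
p = item_difficulty(up=48, lp=37, n_high=54, n_low=54)
d = item_discrimination(up=48, lp=37, n_high=54)
print(round(p, 2), round(d, 2))
```

For question 17 this gives p = 85 / 108, or about .79, and D = 11 / 54, or about .20.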
Item difficulty is a measure of overall difficulty of the test item. The lower the p, the more difficult a particular item is.
Item discrimination tells us how good a job a question does in separating high and low performers.
It is more important for an item to discriminate well than for it to be difficult.
Recommended Difficulty Index for various test items:
Number of Options (k)    Optimum Difficulty Index
2 (True-False)           .85
4                        .74
Open-Ended               .50
The higher the value of D (up to 1), the better the job of separating high and low performance.
If D = 1, this means all of the high performance group and none in the lower performance group get a particular question right.
In practice, D rarely (if ever) equals 1.
An item has an acceptable level of discrimination if D >= .30.
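p and D are not independent of each other.
When nearly everyone answers an item correctly, there is little room for the high and low groups to differ, so D is necessarily small.
Discrimination indexes less than .30 are therefore sometimes acceptable if we have a very high p value.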
Other factors may affect item difficulty and discriminability and are sometimes analyzed as part of a comprehensive item analysis.
If gender, age, ethnicity, or socioeconomic status is theorized to possibly affect test performance, then statistical indexes of Differential Item Functioning can be calculated.
The same procedures discussed earlier are used, but we divide our test takers into the various groupings of interest before calculating difficulty and discrimination indexes.
Remember, measures of Internal Consistency are not indexes of Validity. Validity can only be assessed by comparing test performance with some external criterion.
In general, as the overall D increases for an exam, the variability of the test scores and the ability to separate high and low performers increase.
In general, as p increases, your test average will increase as well. (High p = easy test)
Calculating Discrimination of Criterion-Referenced Test Items
We separate our high and low performance groups by our criterion standard (rather than by a proportional separation).
Those above the criterion cutoff become the higher group, and those below the cutoff score become the lower group.
Item Difficulty is calculated the same way as before (equation 4.2)
Item Discrimination Index (4.4): $D = \frac{U_p}{U} - \frac{L_p}{L}$
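Because the groups above and below the cutoff are usually of different sizes, equation 4.4 compares proportions rather than raw counts. The following is a minimal Python sketch; the function name and the numbers in the example are hypothetical, chosen only to illustrate the arithmetic.

```python
def criterion_discrimination(up, n_high, lp, n_low):
    """Discrimination index for criterion-referenced items (equation 4.4)."""
    if n_high == 0 or n_low == 0:
        raise ValueError("both groups must contain at least one test taker")
    # Proportion correct above the cutoff minus proportion correct below it.
    return up / n_high - lp / n_low

# Hypothetical data: 40 test takers score above the cutoff and 20 below it.
# 36 of the upper group and 8 of the lower group answer the item correctly.
d_criterion = criterion_discrimination(up=36, n_high=40, lp=8, n_low=20)
print(round(d_criterion, 2))
```

With these hypothetical numbers, D = 36 / 40 - 8 / 20 = .90 - .40 = .50.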
Analyzing Distractor Options
When we want a complete understanding of how our individual test items are working during the test, we can analyze more than just the overall performance on an item.
Distractor analysis involves analyzing the wrong answers given to a particular test question.
The D index tells us about the distractors as a whole.
A high positive D means the distractors are highly effective with the low performance group.
A high negative D means the high performance group is being fooled by the distractors a great deal more than the low performance group.
However, more must be done if we are interested in increasing the overall effectiveness of our distractor set.
Simple Distractor Analysis :
Keep your high and low reference groups.
For each group, analyze their wrong choices, and note the associated probabilities.
Option    High Performers    Low Performers
A.        (+)                (+)
B.        5%                 19%
C.        12%                6%
D.        3%                 3%
Now, based upon your analysis, you can decide to reword or completely rework non-effective distractors.
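One way to read the table is to compare, for each wrong option, how often each group chose it. The sketch below is only an illustration: it assumes that option A, marked (+), is the correct answer, and the three-way labelling rule is a simplification introduced here, not a procedure given in the text.

```python
# Proportion of each group choosing each distractor (from the table).
# Option A is assumed to be the correct answer and is left out.
distractors = {
    "B": (0.05, 0.19),  # (high performers, low performers)
    "C": (0.12, 0.06),
    "D": (0.03, 0.03),
}

def review_distractor(high_rate, low_rate):
    """Label a distractor by which group it attracts more (illustrative rule)."""
    if low_rate > high_rate:
        return "draws low performers more"
    if high_rate > low_rate:
        return "draws high performers more"
    return "no difference between groups"

report = {option: review_distractor(high, low)
          for option, (high, low) in distractors.items()}
for option, label in report.items():
    print(option, label)
```

Under this reading, option B behaves as a distractor should, option C attracts more high performers than low performers and may need rewording, and option D does not separate the groups at all.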
Item Characteristic Curves
Another way to analyze the effectiveness of test items graphically is to construct an item characteristic curve.
This graph shows the percent correct on a particular test question as a function of the total test scores.
In general, performance on any good test item should increase as the overall ability of the test takers increases.
A steep positive slope indicates good discriminability.
The intercept of this curve gives an overall measure of item difficulty.
The lower the y-intercept, the greater the overall difficulty of the test item.
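The points of such a curve can be computed by grouping test takers by total score and taking the proportion who answered the item correctly in each group. The following is a minimal Python sketch; the function name and the data are hypothetical, and real data would usually be binned into score ranges before plotting.

```python
from collections import defaultdict

def item_characteristic_points(total_scores, item_correct):
    """Return sorted (total score, proportion correct) pairs for one item."""
    counts = defaultdict(lambda: [0, 0])  # score -> [number correct, number of takers]
    for score, correct in zip(total_scores, item_correct):
        counts[score][0] += int(correct)
        counts[score][1] += 1
    return [(score, right / takers)
            for score, (right, takers) in sorted(counts.items())]

# Hypothetical data: total test score and item result (1 = correct) per test taker.
total_scores = [1, 1, 2, 2, 2, 3, 3, 4, 4, 4]
item_correct = [0, 0, 0, 1, 0, 1, 1, 1, 1, 1]
points = item_characteristic_points(total_scores, item_correct)
for score, proportion in points:
    print(score, round(proportion, 2))
```

In this hypothetical data set the proportion correct rises from 0 at the lowest total score to 1 at the highest, the pattern expected of an item that discriminates well.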
Item Response Theory
Item response theory is an extension of item characteristic curves.
In item response theory, the proportion of correct responses to a particular test item is plotted as a function of the (estimated) true ability of individuals.
The equations vary with the specific method of estimating true ability.
Remember that for any psychological test or measurement,
Score = True Score + Measurement Error
By using these estimates of true ability, we can construct a sample-independent item characteristic curve.
The advantage is that we do not expect the curve to change from sample to sample.