Item Analysis

 

 

During the construction of psychological tests, developers must determine the effectiveness of each individual test item.

 

This process of evaluating each item is called Item Analysis.

 

       With commercially developed tests, item analysis is done on a representative sample of the intended test audience before the test is released.

 

       Item Analysis can identify which items should be reworded (or dropped), and tells us how good a job each individual item does in predicting the overall score.

 

Item analysis can tell us:

How difficult an individual item is.

How well a particular item discriminates between high and low performance on the test.

 

 

How do we determine what constitutes high or low performance on a psychological test?

 

 

There are two ways:

 

Criterion-Referenced (Domain-Referenced) Testing: High and low performance is determined by the test developer, who compares test performance to a set list of objectives or standards.

 

Mastery Test: A type of criterion-referenced test designed to measure a limited range of cognitive skills. The total score is the percentage of correct answers.

 

 

Norming Distributions: High or low performance on the test is determined by comparing an individual score to the score distribution of a representative sample of test takers.

 

 

 

The item analysis procedures vary depending on which of these two approaches is used.

 

 

                           Item Validity

 

Test Validity: Does the test actually measure what it intends to measure?

 

Item Validity: Does the specific test item correlate with what you are trying to measure?

 

We can assess item validity with respect to both internal and external criteria.

 

External Criterion: Data from outside the test which we expect to correlate in some meaningful way with our test items.

 

For achievement tests, an external criterion might be your overall grade in that subject area.

 

For aptitude tests, an external criterion might be your supervisor's rating of your job performance.

 

To determine the validity of an individual test item, we correlate the scores on that test item with the external criterion from the domain of interest.

 

 

 

Item validity and External Criterion

 

 

 

 

Formula for the point-biserial coefficient for determining item validity (4.1 in text):

 

rpb  =  ( (Yp - Y) / s )  *  sqrt( (nt * np) / ((nt - np)(nt - 1)) )

 

nt = total number of test takers

 

np = Number of test takers who get the item correct

 

Yp = the average of the external criterion scores of the people who get that test item correct

             

Y  = the overall average of the external criterion score

 

s = the standard deviation of the external criterion scores

 

 

Example :

 

A. 30 people take a test to measure job aptitude. On question 7, 17 people get the answer right. The average external criterion score for the entire group is 75, and the average external criterion score for the individuals who got question 7 right is 80. The standard deviation of the external criterion is 10.
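The example above can be worked through with formula 4.1. The following is a minimal Python sketch, not a reference implementation; the function and argument names are illustrative and do not come from the text.

```python
import math

def point_biserial(mean_correct, mean_all, sd, n_total, n_correct):
    """Point-biserial item-criterion correlation, following formula 4.1."""
    # Square-root term: (nt * np) / ((nt - np)(nt - 1))
    size_term = math.sqrt(
        (n_total * n_correct) / ((n_total - n_correct) * (n_total - 1))
    )
    return (mean_correct - mean_all) / sd * size_term

# Question 7: nt = 30, np = 17, Yp = 80, Y = 75, s = 10
r_pb = point_biserial(mean_correct=80, mean_all=75, sd=10,
                      n_total=30, n_correct=17)
print(round(r_pb, 2))  # 0.58
```

With these numbers the item-criterion correlation comes out at roughly .58.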

 

 

Interpreting the item-criterion correlation

 

 

The higher the correlation (the closer to 1), the more accurate (or valid) the test item is.

 

Test items with 0 or a very low correlation (<.2) should be reworded or removed from the test.

 

The point-biserial coefficient formula can also be used with an internal criterion such as the total score on the test itself.

 

Another way to measure internal consistency (the relationship between performance on an individual item and performance on the entire test) is to examine the performance of the low and high performers on an item.

 

Split test takers into three groups, based on overall test performance :

 

Top 25% or 27% (depending on sample size; with more than thirty test takers, use 27%)

Middle 50% (or 46% when the outer groups are 27% each)

Bottom 25% or 27% (the same proportion as the top group)
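The split described above might be sketched in Python as follows. This is only an illustration: the function name, the rounding rule, and the made-up scores are assumptions, and ties at the group boundaries are simply broken by sort order.

```python
def split_by_total_score(total_scores, fraction=0.27):
    """Return (high, low) lists of test-taker indices ranked by total score."""
    n = len(total_scores)
    k = round(n * fraction)  # size of each outer group (assumed rounding rule)
    # Indices of test takers, best total score first
    ranked = sorted(range(n), key=lambda i: total_scores[i], reverse=True)
    return ranked[:k], ranked[n - k:]

scores = list(range(200))  # hypothetical total scores for 200 test takers
high, low = split_by_total_score(scores)
print(len(high), len(low))  # 54 54
```

Everyone not placed in either list forms the middle group.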

 

 

Measuring Internal Consistency of Items

 

 

We need to know two things:

 

Number of people in high and low groups.

 

Number of people in high and low groups who get a particular answer right.

 

 

Assume 200 people take the test.

 

Our top group is the highest 54 scorers.

 

Our bottom group is the lowest 54 scorers.

 

On question 17, 48 of the top group and 37 of the bottom group got the question correct.

 

What are the item difficulty and the item discriminability of this question?

 

Item difficulty and item discriminability are both measures of internal consistency.

 

 

Calculating Internal Consistency Measures

 

 

Item Difficulty Index (4.2) :    p = (Up + Lp ) / (U + L)

 

 

Item discrimination Index (4.3)  :   D = (Up – Lp ) / U

 

 

Up = Number of high performers who got question right

Lp = Number of low performers who got question right

U = Number of high performers

L = Number of Low performers
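Applying formulas 4.2 and 4.3 to question 17 from the example (200 test takers, top and bottom groups of 54 each) could look like the sketch below. The function and argument names are illustrative only.

```python
def item_difficulty(high_correct, low_correct, n_high, n_low):
    """Item difficulty index p = (Up + Lp) / (U + L), formula 4.2."""
    return (high_correct + low_correct) / (n_high + n_low)

def item_discrimination(high_correct, low_correct, n_high):
    """Item discrimination index D = (Up - Lp) / U, formula 4.3."""
    return (high_correct - low_correct) / n_high

# Question 17: Up = 48, Lp = 37, U = L = 54
p = item_difficulty(high_correct=48, low_correct=37, n_high=54, n_low=54)
d = item_discrimination(high_correct=48, low_correct=37, n_high=54)
print(round(p, 2), round(d, 2))  # 0.79 0.2
```

So question 17 is a fairly easy item (p of about .79) that separates the two groups only weakly (D of about .20).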

 

Item difficulty is a measure of overall difficulty of the test item.  The lower the p, the more difficult a particular item is.

 

Item discrimination tells us how good a job a question does in separating high and low performers.

 

It is more important for an item to discriminate well than it is for it to be difficult.

 

Recommended Difficulty Index for various test items:

Number of Options (k)      Optimum Difficulty Index
2 (True-False)             .85
4                          .74
Open-Ended                 .50

 

 

Interpreting the Item discrimination Index

 

 

The higher the value of D (up to 1), the better the job of separating high and low performance.

 

 

If D = 1, this means all of the high performance group and none in the lower performance group get a particular question right.

 

 

D rarely, if ever, equals 1.

 

 

An item has an acceptable level of discrimination if D >= .30

 

 

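p and D are not independent of each other: when p is very high or very low, there is little room for the high and low groups to differ, so D is necessarily small.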

 

 

Discrimination indexes less than .30 are sometimes acceptable if we have a very high p value.

 

 

 

Other factors may affect item difficulty and discriminability and are sometimes analyzed as part of a comprehensive item analysis.

 

 

If gender, age, ethnicity, or socioeconomic status is theorized to possibly affect test performance, then statistical indexes of Differential Item Functioning can be calculated.

 

 

The same procedures discussed earlier are used, but we divide our test takers into the various groupings of interest before calculating difficulty and discrimination Indexes.
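As a rough illustration of this subgroup approach, the sketch below applies formulas 4.2 and 4.3 separately within each grouping. The group labels and all counts are hypothetical, and this is a simple side-by-side comparison rather than a formal Differential Item Functioning statistic.

```python
def difficulty_and_discrimination(high_correct, low_correct, n_high, n_low):
    """Return (p, D) for one item, following formulas 4.2 and 4.3."""
    p = (high_correct + low_correct) / (n_high + n_low)
    d = (high_correct - low_correct) / n_high
    return p, d

# Hypothetical counts for one item, tallied separately within each subgroup
subgroup_counts = {
    "group_1": dict(high_correct=20, low_correct=10, n_high=25, n_low=25),
    "group_2": dict(high_correct=15, low_correct=12, n_high=25, n_low=25),
}

indexes = {name: difficulty_and_discrimination(**counts)
           for name, counts in subgroup_counts.items()}
for name, (p, d) in indexes.items():
    print(name, round(p, 2), round(d, 2))
```

A large gap in p or D between subgroups on the same item would be a reason to look at that item more closely.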

 

 

Remember, measures of Internal Consistency are not indexes of Validity.  Validity can only be assessed by comparing test performance with some external criterion.

 

In general, as the overall D increases for an exam, the variability and the ability to separate high and low performers increases.

 

In general, as p increases, your test average will increase as well.  (High p = easy test)

 

 

 

Calculating Discrimination of Criterion Referenced Test Items

 

 

We separate our high and low performance groups by our criterion standard (rather than by a proportional separation).

Those above the criterion cutoff score become the higher group, and those below the cutoff score become the lower group.

Item difficulty is calculated the same way as before (equation 4.2).

 

 

 

Item Discrimination Index :  D = Up/U  -  Lp/L  (4.4)
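Formula 4.4 compares the proportion correct in each group, which matters here because the groups above and below the cutoff are usually not the same size. A minimal sketch, with made-up counts and illustrative names:

```python
def criterion_discrimination(high_correct, n_high, low_correct, n_low):
    """D = Up/U - Lp/L (formula 4.4) for groups split at a criterion cutoff."""
    return high_correct / n_high - low_correct / n_low

# Hypothetical item: 40 test takers above the cutoff (36 correct),
# 10 test takers below the cutoff (4 correct)
d = criterion_discrimination(high_correct=36, n_high=40,
                             low_correct=4, n_low=10)
print(round(d, 2))  # 0.5
```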

 

 

 

Analyzing Distractor Options

 

 

When we are keenly interested in complete understanding of how our individual test items are working during the test, we can analyze more than just the overall performance on an item.

 

 

Distractor analysis involves analyzing the wrong answers given to a particular test question.

 

The D factor tells us about the distractors as a whole.

 

A high positive D means the distractors are highly effective with the low performance group.

 

A high negative D means the high performance group is being fooled by the distractors a great deal more than the low performance group.

 

However, more must be done if we are interested in increasing the overall effectiveness of our distractor set.

 

 

 

Simple Distractor Analysis :

 

Keep your high and low reference groups.

 

For each group, analyze their wrong choices, and note the associated probabilities.

 

 

Option      High Performers      Low Performers
A.          (+)                  (+)
B.          5%                   19%
C.          12%                  6%
D.          3%                   3%

 

 

 

Now, based upon your analysis, you can decide to reword or completely rework ineffective distractors.
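The tally behind a table like the one above could be sketched as follows. The response lists are hypothetical: they assume 100 test takers in each group and are chosen so the wrong-answer rates match the table, with A as the correct answer. The flagging rule (a distractor chosen more often by high performers than by low performers) is one possible screen, not a rule from the text.

```python
from collections import Counter

def option_rates(choices):
    """Proportion of a group choosing each answer option."""
    counts = Counter(choices)
    n = len(choices)
    return {option: counts[option] / n for option in sorted(counts)}

# Hypothetical responses; "A" is the correct answer
high_choices = ["A"] * 80 + ["B"] * 5 + ["C"] * 12 + ["D"] * 3
low_choices = ["A"] * 72 + ["B"] * 19 + ["C"] * 6 + ["D"] * 3

high_rates = option_rates(high_choices)
low_rates = option_rates(low_choices)

# Flag distractors that attract high performers more than low performers
flagged = [option for option in high_rates
           if option != "A" and high_rates[option] > low_rates.get(option, 0)]
print(flagged)  # ['C']
```

Here option C draws more high performers than low performers, so it may be worth rewording, while option D attracts almost no one in either group.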

 

 

 

Item Characteristic Curves

 

Another way to graphically analyze the effectiveness of test items is to construct an Item characteristic curve.

 

This graph shows the percent correct on a particular test question as a function of the total test scores.

 

In general, performance on any good test item should increase as the overall ability of the test takers increases.

 

A steep positive slope indicates good discriminability.

 

The intercept of this curve gives an overall measure of item difficulty.

 

The lower the y-intercept, the greater the overall difficulty of the test item.
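The points of an empirical item characteristic curve can be computed by grouping test takers into total-score bands and taking the proportion correct in each band. The sketch below assumes a band width of 10 points and uses made-up records; both are illustrative choices.

```python
from collections import defaultdict

def item_characteristic_points(records, band_width=10):
    """records: (total_score, item_correct) pairs.

    Returns sorted (band_start, proportion_correct) points for plotting.
    """
    right = defaultdict(int)
    seen = defaultdict(int)
    for total_score, item_correct in records:
        band = (total_score // band_width) * band_width
        seen[band] += 1
        right[band] += int(item_correct)
    return [(band, right[band] / seen[band]) for band in sorted(seen)]

# Hypothetical (total score, answered this item correctly) records
records = [(12, False), (15, False), (18, True), (24, False), (27, True),
           (31, True), (36, True), (38, False), (42, True), (47, True)]
points = item_characteristic_points(records)
for band, proportion in points:
    print(band, round(proportion, 2))
```

In this made-up data the proportion correct rises with each score band, which is the pattern expected of a good item.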

 

 

 

                     Item Response Theory

 

 

Item response theory is an extension of item characteristic curves.

 

In item response theory, the proportion of correct responses to a particular test item is plotted as a function of the (estimated) true ability of individuals.

 

The equations vary with the specific method of estimating true ability.
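As one concrete illustration, a commonly used form is the two-parameter logistic model. It is an assumption here, since the text does not commit to a specific equation, and the parameter names (a for discrimination, b for difficulty, theta for estimated ability) and values are illustrative.

```python
import math

def two_parameter_logistic(theta, a, b):
    """Probability of a correct response under an assumed 2PL model.

    theta: estimated true ability; a: discrimination (slope);
    b: difficulty (the ability at which the probability is .50).
    """
    return 1.0 / (1.0 + math.exp(-a * (theta - b)))

# Hypothetical item with a = 1.5 and b = 0.0
for theta in (-2.0, 0.0, 2.0):
    print(theta, round(two_parameter_logistic(theta, a=1.5, b=0.0), 3))
```

A larger a gives a steeper curve (better discrimination), and a larger b shifts the curve toward higher ability (a harder item).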

 

Remember that for any psychological test or measurement,

 

Score  =   True Score  + Measurement Error

 

By using these estimates of true ability, we can construct a sample-independent item characteristic curve.

 

The advantage is we don’t expect the curve to change from sample to sample.