Measuring Job Performance


Decisions to retain, promote, or fire people have to be made in every organization.

What is the most accurate way to make these decisions?

Historically, these decisions were made through patronage and nepotism, where personal relationships within the organization had a large impact on promotion and firing decisions.

While personal relationships are still important today, many industries try to systematically assess worker performance, in order to increase productivity (and ultimately, profits).

Three Major Classes of Worker Productivity Measures:

Production Counts

Personnel Data

Judgmental Methods

Production Counts

Production counts attempt to measure what a worker produces on the job.

The worker with the higher production count is assumed to be the better worker.


Various possible production counts:

Lawyer: Number of billable hours in a month.

Factory Worker: Number of parts assembled in a day.

Clerical Worker: Number of keystrokes made in a day.

However, for many job titles, it is not clear how productivity should be measured in terms of production counts.

This is particularly true for management positions or other professional jobs.


Other difficulties with the use of production counts:

The production count method might change the behavior of the worker.

For example, suppose you measure the productivity of your sales force by measuring the number of sales calls made each day.

If the sales people are aware of this measurement, this might predispose them towards shorter, more numerous calls, which may not be the most effective way of dealing with clients.

Computerized monitoring systems can produce more job-related stress in certain individuals and can actually decrease overall job performance.

The Hawthorne Effect: When people know their behavior is being monitored, they will change their behavior to create a favorable impression.



Another difficulty with production counts is that the production count may be controlled by factors other than worker performance.

For example, if we take a production count of assembly line workers, their production count is largely controlled by the speed of the assembly line.

If there is little variability in the production count measure, there will be no way to reliably distinguish different levels of worker performance. (Restricted Range problem)
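The restricted range problem can be illustrated with a small sketch. All numbers below are hypothetical: `potential` is what each worker could assemble if unconstrained, and `LINE_CAP` is an assumed assembly line speed that limits the recorded count.

```python
from math import sqrt

def pearson(xs, ys):
    """Plain Pearson correlation between two equal-length lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)

# Hypothetical parts per day each worker could assemble if unconstrained.
potential = [40, 45, 50, 55, 60, 65, 70, 75]

# Hypothetical line speed: nobody can record more than 50 parts per day.
LINE_CAP = 50
observed = [min(p, LINE_CAP) for p in potential]

# Eight workers collapse into only a few distinct recorded counts.
distinct_levels = len(set(observed))

# The recorded count tracks true ability less than perfectly.
r_capped = pearson(potential, observed)

print(observed, distinct_levels, round(r_capped, 2))
```

In this sketch, six of the eight workers receive the same recorded count, so the production count cannot separate them even though their underlying ability differs.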

Due to these problems with production counts, they are seldom used as the sole measure of job performance.


Personnel Data

Another way of attempting to objectively measure job performance is to use information from the employee’s personnel file.

Training session attendance and performance, outside education sought, suggestions made to improve productivity, number of complaints made against the employee, and number of work-related accidents are all pieces of information that might be found in a personnel file.

The most commonly used personnel index, however, is employee absenteeism.

The assumption is that the employee who works eight hours a day, day in and day out, will be more productive (and cost the company less in health insurance) than an employee who is frequently absent.

Although it seems simple enough, how absenteeism is defined will have a major impact on employee ratings.


Absenteeism as a measure of job performance

Landy and Farr (1983) have identified over 40 different operational definitions of absenteeism.

Total number of days missed, average length of absence, frequency of absence, and the division of absences into voluntary and involuntary are just a few ways to define absenteeism.

How this concept is defined will greatly affect job performance ratings, since the different measures are usually only moderately related to one another.
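A minimal sketch of why the choice of definition matters, using made-up absence records for three hypothetical workers. Each number is the length in days of one absence spell; the three functions are only three of the many possible definitions.

```python
# Hypothetical absence records: each number is the length in days of one absence spell.
absences = {
    "Worker A": [10],             # one long absence
    "Worker B": [1, 1, 1, 1, 1],  # many one-day absences
    "Worker C": [3, 3],
}

def total_days(spells):
    """Total number of days missed."""
    return sum(spells)

def frequency(spells):
    """Number of separate absences."""
    return len(spells)

def average_length(spells):
    """Average length of an absence, in days."""
    return sum(spells) / len(spells) if spells else 0.0

def rank_worst_first(measure):
    """Order workers from most to least absent under one definition."""
    return sorted(absences, key=lambda w: measure(absences[w]), reverse=True)

by_total = rank_worst_first(total_days)
by_frequency = rank_worst_first(frequency)
by_length = rank_worst_first(average_length)

print(by_total)      # worst first by total days missed
print(by_frequency)  # worst first by number of absences
print(by_length)     # worst first by average length
```

With these records, Worker A looks worst by total days missed but best by frequency of absence, so the same personnel file supports opposite conclusions.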

The second major problem with using absenteeism is that it does not seem to be a normally distributed work performance variable.

The majority of workers miss very few days of work each year, while a small minority of workers are frequently absent.

Although measuring absenteeism will allow you to distinguish between the two groups, it will give you very little information about the majority of employees.

Also, measures of absence are largely unrelated on a year-to-year basis (low reliability).

Work Samples as measures of Job Performance

Having people perform their normal job tasks while being supervised (tested) is an additional way of judging job performance.

This is the approach adopted by the U.S. military to help assess job performance (it also uses personnel data and judgement data).


In 1980, the Joint-Service Job Performance Measurement/Enlistment Standards Project (JPM) began large-scale, hands-on testing of the specific tasks associated with each job title.

These are tests of maximal performance, usually under timed conditions.

Although the study showed good validity for using these work sample measures, this technique is not necessarily feasible for the majority of industry to use.

This is a time-intensive testing procedure, and most businesses cannot devote the days or weeks it would take to test all of their employees.

The Construct of Job Performance

Work samples, measures of absenteeism, and production counts all capture only a part of what we consider to be total job performance.

Recently, research efforts have focused on exactly what good job performance is.

Two major Factors of the Job Performance Construct

1. Performance on specific individual tasks that are part of each worker's job description.

2. Behaviors which are necessary for the organization to function smoothly: cooperation and communication skills.

The second category is not captured in any of the three "objective" measures of assessing job performance.

This is one of the primary reasons for the overwhelming use of subjective measures of job performance within industry.


Judgmental measures are used about 80% of the time as the sole or major determinant of job performance.

Two general types of judgements can be made:

Rankings, in which workers are compared to one another and rank ordered.

Ratings, in which a worker’s performance is compared to some set standard.

Ranking Techniques

Forced Distribution: The workforce is divided into categories: High Performance, Average Performance, Low Performance. The distribution is forced in that only a small percentage of workers can receive high or low rankings.

The forced distribution helps to solve the problem of supervisors who like to rate the vast majority of workers at the highest level.
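A minimal sketch of a forced distribution, assuming the supervisor already has an overall score for each worker. The 20% shares for the high and low categories are hypothetical placeholders, not values taken from these notes.

```python
def forced_distribution(scores, high_share=0.2, low_share=0.2):
    """Sort workers by score and force fixed shares into High and Low.

    scores: dict mapping worker name -> overall performance score.
    high_share / low_share: hypothetical caps, not values from the notes.
    """
    ranked = sorted(scores, key=scores.get, reverse=True)
    n = len(ranked)
    n_high = int(n * high_share)
    n_low = int(n * low_share)
    categories = {}
    for position, worker in enumerate(ranked):
        if position < n_high:
            categories[worker] = "High Performance"
        elif position >= n - n_low:
            categories[worker] = "Low Performance"
        else:
            categories[worker] = "Average Performance"
    return categories

# Hypothetical supervisor scores that bunch up near the top of the scale.
scores = {"W1": 9.8, "W2": 9.7, "W3": 9.6, "W4": 9.5, "W5": 9.4,
          "W6": 9.3, "W7": 9.2, "W8": 9.1, "W9": 9.0, "W10": 8.9}
result = forced_distribution(scores)

for worker, category in result.items():
    print(worker, category)
```

Even though every raw score here is near the top of the scale, only two workers can land in the high category, which is the point of forcing the distribution.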




Full Ranking: Instead of sorting workers into general categories, you do a complete rank ordering of all employees, so that no two workers are at the same level of job performance.


Pair-Comparison Method: The supervisor rank orders workers by comparing each worker to every other worker, forcing the supervisor to make relative judgements. Although reasonably effective for a small number of employees, this method becomes cumbersome and unreliable as the number of employees increases.
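The growth in workload can be made concrete: comparing every worker with every other worker takes n(n - 1)/2 judgements. The sketch below counts those comparisons and ranks workers by the number of comparisons each one wins. The `quality` values and the `better_of` function are hypothetical stand-ins for the supervisor's judgement.

```python
from itertools import combinations

def comparisons_needed(n_workers):
    """Number of distinct pairs among n workers: n * (n - 1) / 2."""
    return n_workers * (n_workers - 1) // 2

def pair_comparison_ranking(workers, better_of):
    """Rank workers by how many head-to-head comparisons each one wins.

    better_of(a, b) stands in for the supervisor's judgement and must
    return whichever of the two workers is judged the better performer.
    """
    wins = {w: 0 for w in workers}
    for a, b in combinations(workers, 2):
        wins[better_of(a, b)] += 1
    return sorted(workers, key=lambda w: wins[w], reverse=True)

# Hypothetical stand-in for the supervisor's judgement.
quality = {"Ana": 3, "Ben": 1, "Cy": 4, "Dee": 2}
ranking = pair_comparison_ranking(
    list(quality), lambda a, b: a if quality[a] > quality[b] else b
)

# Judgements required for work groups of different sizes.
workload = {n: comparisons_needed(n) for n in (5, 20, 50)}

print(ranking)
print(workload)
```

A group of 5 needs 10 comparisons, but a group of 50 needs 1,225, which is why the method becomes cumbersome for larger groups.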


Comparison of Ranking Methods:

The best method depends upon how much effort you want to put into the rankings, and upon how you want to use the rankings:

For deciding who to lay off, a forced distribution may be most appropriate.


Rating Scales

The single most common way of evaluating worker performance.

Graphic Rating Scales: The supervisor makes a direct judgement about the quality of each worker's performance on a specific response scale.

Different Types of Response Scales:

Continuous Scales: A score is computed by measuring the distance of the supervisor's mark from one end of the scale.

Verbally Anchored Scales: A small number of discrete categories, "anchored" at either end by labels that mark the range of abilities measured. These scales can vary in the specificity of the verbal anchors.

Numeric Scales: Verbal anchors that contain a numerical range within each category.

Graphic Scales are simple to use, and allow for computation of scores to compare workers on overall job performance.

Problems with Graphic Rating Scales:


Rating Scales sometimes have ambiguous (not precisely defined) anchors.

Different supervisors will use the same graphic scales in slightly different ways.

Comparisons between workers rated by the same supervisor are more valid than comparisons between workers rated by different supervisors.

One way to get around the ambiguity inherent in graphic rating scales is to use Behavior Based Scales, in which specific work related behaviors are assessed.


Mixed Standard Scale (MSS): Good, average, and poor performance are assessed with respect to specific job-related behaviors.

A number of different items are used to assess each performance dimension.

For example, an MSS for police officers might measure the dimensions of Judgement, Relations with Others, and Job Knowledge.

The advantages of the MSS are that its items refer to concrete, observable behavior and that they require relatively simple judgements on the part of the supervisor.


Behaviorally Anchored Rating Scales (BARS): Similar to graphic rating scales, but use specific behaviors to anchor the scale.

The development of BARS requires extensive input from supervisors in order to determine which behaviors are task relevant and assess some important aspect of job performance.

The care taken in developing the BARS helps to reduce across-supervisor variability.

Behavioral Observation Scales (BOS): The BOS was developed by Latham & Wexley (1977), who believed that both graphic rating scales and BARS require supervisors to make vague judgements.

The BOS is a list of "critical" behaviors which the supervisor has to rate in terms of frequency.

Items indicate either desired or undesirable aspects of work performance:

Worker never needs her/his work to be double-checked _______

Worker misses workdays ________
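One way such frequency ratings could be combined into a single score is sketched below. The 5-point scale, the item wording, and the reverse-scoring of undesirable items are illustrative assumptions, not a description of the published BOS scoring procedure.

```python
# Hypothetical 5-point frequency scale: 1 = almost never ... 5 = almost always.
SCALE_MAX = 5

# Each item is flagged as describing desired or undesired behavior.
items = {
    "Work does not need to be double-checked": "desired",
    "Misses workdays": "undesired",
}

def bos_score(ratings):
    """Sum frequency ratings, reverse-scoring undesired behaviors."""
    total = 0
    for item, rating in ratings.items():
        if not 1 <= rating <= SCALE_MAX:
            raise ValueError(f"rating out of range for {item!r}: {rating}")
        if items[item] == "undesired":
            # Frequent undesired behavior should lower the total score.
            rating = SCALE_MAX + 1 - rating
        total += rating
    return total

reliable_worker = bos_score({
    "Work does not need to be double-checked": 5,
    "Misses workdays": 1,
})
unreliable_worker = bos_score({
    "Work does not need to be double-checked": 2,
    "Misses workdays": 4,
})

print(reliable_worker, unreliable_worker)
```

Reverse-scoring keeps the direction of every item the same, so a higher total always means better observed performance.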


Comparing Job Performance Rating Systems

Remarkably, thirty years of research has failed to produce one job rating measure that is vastly superior to any other.

Graphic rating scales, which have been widely criticized for their ambiguity and difficulty of use, do about as good a job as any other rating system.

Sadly, who is doing the rating seems to matter more than which rating system is being used.

Current research in this topic has in fact shifted somewhat to examining the psychological thought process of the individuals filling out the rating form, rather than focusing on the rating system itself.


The Validity of Criterion Measures

Do these judgmental ratings do a good job of assessing job performance?

Since many organizations use only a single measure of job performance, it is difficult to find converging or diverging evidence which indicates how well judgement ratings assess job performance.

Some researchers suggest using inter-rater reliability to estimate validity. Statistically this is a sound idea, but for practical reasons it is generally not a good one.

Some researchers state that the more precisely you define job performance, the more accurate a measuring device you can create (construct validity).


Common Rater Errors


  1. Halo Errors: Because of a general impression of a worker, the supervisor discriminates little when rating this worker on different work-related behaviors.
  2. Leniency Errors: A supervisor has a general tendency to rate all workers higher, or all workers lower.
  3. Range Restriction Errors: A supervisor fails to use the entire response range available, making it difficult to draw fine distinctions between the work performance of similar workers.
  4. Memory Distortions: A supervisor may find it difficult to remember all the work-related behavior of a particular worker that she has observed since the previous rating period. This is especially true when a supervisor is responsible for rating a large number of employees.
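The first three errors leave traces in the ratings themselves. The sketch below computes simple descriptive indicators for each supervisor. The ratings, the 1 to 5 scale, and the three indicators are hypothetical illustrations, not established diagnostic tests.

```python
from statistics import mean, pstdev

SCALE_MIDPOINT = 3  # hypothetical 1-5 rating scale

# ratings[supervisor][worker] = scores on three hypothetical dimensions.
ratings = {
    "Supervisor 1": {"W1": [5, 5, 5], "W2": [5, 4, 5], "W3": [4, 5, 5]},
    "Supervisor 2": {"W1": [1, 3, 5], "W2": [2, 4, 3], "W3": [5, 1, 2]},
}

def rater_summary(by_worker):
    """Descriptive indicators that hint at common rater errors."""
    all_scores = [s for scores in by_worker.values() for s in scores]
    return {
        # Leniency: average rating sits well above (or below) the midpoint.
        "leniency": mean(all_scores) - SCALE_MIDPOINT,
        # Range restriction: ratings cover only a narrow part of the scale.
        "spread": pstdev(all_scores),
        # Halo: little variation across dimensions within each worker.
        "within_worker_spread": mean([pstdev(s) for s in by_worker.values()]),
    }

summary = {sup: rater_summary(workers) for sup, workers in ratings.items()}

for supervisor, indicators in summary.items():
    print(supervisor, {k: round(v, 2) for k, v in indicators.items()})
```

In this made-up data, Supervisor 1 shows all three patterns at once: a high average, a narrow spread, and almost no difference between dimensions for any one worker. Such indicators can only flag a pattern; a group of workers who are all genuinely excellent would produce the same numbers.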


Measuring the Economic Value of Good Job Performance

Schmidt, Hunter, McKenzie & Muldrow (1979) have developed a way to quantify the value of good job performance.

They developed an estimate of the standard deviation of job performance, expressed in dollars, in order to calculate the value of increased worker productivity.

This method assesses the difference, in dollar terms, between the value of an average worker (a worker at the 50th percentile) and the value of an exceptional worker (a worker at the 85th percentile).

The calculation of this statistic requires a series of judgements from several supervisors familiar with the specific job.
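A minimal sketch of the arithmetic, assuming each supervisor supplies two dollar estimates for the same job. The dollar figures are invented for illustration. The 85th percentile is used because, in a normal distribution, it lies roughly one standard deviation above the mean, so the 85th-minus-50th difference approximates one standard deviation of performance in dollars.

```python
from statistics import mean

# Hypothetical dollar estimates from three supervisors for one job:
# (yearly value of a 50th-percentile worker, value of an 85th-percentile worker).
estimates = [
    (40_000, 52_000),
    (38_000, 47_000),
    (42_000, 57_000),
]

def sd_in_dollars(pairs):
    """Average the supervisors' 85th-minus-50th percentile differences."""
    return mean([high - average for average, high in pairs])

sd_y = sd_in_dollars(estimates)
print(sd_y)
```

Averaging across several supervisors is meant to smooth out individual judgement errors; the resulting figure is only as good as the estimates that go into it.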

Part of their development of this statistic is self-serving. Developers of job performance rating systems must be able to convince the consumer (industry) that their rating systems can have a long-term economic value in hopes that the customer will buy their product.