Abstract:
Alderson (2005) suggests that diagnostic tests should identify strengths and weaknesses in learners'
use of language, focus on specific elements rather than global abilities, and provide detailed feedback
to stakeholders. However, rating scales used in performance assessment have been repeatedly
criticized for being imprecise, for using impressionistic terminology (Fulcher, 2003; Upshur &
Turner, 1999; Mickan, 2003), and for often resulting in holistic assessments (Weigle, 2002).
The aim of this study was to develop a theoretically based, empirically developed rating scale
and to evaluate whether such a scale functions more reliably and validly in a diagnostic writing
context than a pre-existing scale with less specific descriptors, of the kind typically used in proficiency
tests. The existing scale is used in the Diagnostic English Language Needs Assessment (DELNA)
administered to first-year students at the University of Auckland. The study was undertaken in two
phases. During Phase 1, 601 writing scripts were subjected to a detailed analysis using discourse
analytic measures. The results of this analysis were used as the basis for the development of the new
rating scale. Phase 2 involved the validation of this empirically developed scale. For this purpose, ten trained
raters applied both sets of descriptors to the rating of 100 DELNA writing scripts. A quantitative
comparison of rater behavior was undertaken using FACETS, a many-facet Rasch measurement
program. Questionnaires and interviews were also used to elicit the raters' perceptions of the
efficacy of the two scales.
The results indicate that rater reliability and candidate discrimination were generally higher, and that
raters were better able to distinguish between different aspects of writing ability, when the more
detailed, empirically developed descriptors were used. The interviews and questionnaires showed
that most raters preferred the empirically developed descriptors because these provided more
guidance during the rating process. The findings are discussed in terms of their implications for rater
training and rating scale development, as well as for score reporting in the context of diagnostic
assessment.