Shiken: JALT Testing & Evaluation SIG Newsletter
Vol. 12 No. 1 Jan. 2008 (p. 7 - 13) [ISSN 1881-5537]


Some new proposals and responses in ascertaining the reliability
and validity of Japanese university entrance exams


Michael Guest (University of Miyazaki)


Abstract

    One can easily find criticisms of Japan's university entrance English exams. Claims of a lack of reliability and/or validity are legion, leading to a widespread view that poorly designed or ill-considered university entrance exams are to blame for outdated and unproductive pedagogical practices in high schools (McVeigh, 2001; Gorsuch, 1998; Chujo, 2006). Most foundational among the critical research is that of Brown and Yamashita (1995), with follow-up research and proposals from Brown (1996, 2000, 2002), Kikuchi (2006), and Ichige (2006). But could it be that some of these viewpoints and interpretations are based upon notions of validity and reliability that do not do justice to the parameters surrounding university entrance exams in Japan? Could some of these criticisms have failed to note the bigger picture? Are some out of date, missing the point, or even contradictory?
    This two-part paper seeks to re-address the validity and reliability of Japanese university entrance exams by introducing considerations and variables hitherto unnoted or underappreciated in the critical literature.

Keywords: language testing, test reliability, validity, entrance examinations, Japanese universities


Foreground

    In many previous analyses of Japanese university entrance exams, researchers have made claims that do not take the following facts into consideration:
  1. Differing test roles and functions – There are significant differences between the National Center Test (Daigaku Nyuushi Senta Shiken in Japanese) and the many second-stage university tests (Niji Shiken in Japanese). The former is standardized and nationwide; the latter are local and more variable. The former does not (in most cases) confer entry into a given university; rather, it usually serves only as an indicator that helps candidates narrow down which second-stage exams to sit for. The second-stage exams are generally those that ultimately confer acceptance (there are other ways of gaining admission to university, but the above represents the norm).
        Previously, some comments and criticisms regarding Japanese university entrance exams have lumped these two types of tests together. In many respects, when undertaking a critique or analysis, these two exams should be separated (as in Kikuchi, 2006; and Ichige, 2006). Cases in which the tests have some common features or appear similar in quality are duly noted in this paper.

[ p. 7 ]

  2. Number of examinees – Over 500,000 students sat for the National Center Test in 2007 (as reported in the Daily Yomiuri newspaper, Jan. 20th, 2007). The results are expected to be thoroughly fair and objective, and they must also be calculated within a week. This means that subjective and productive features of language (skills such as writing and speaking) would be difficult to test under such circumstances, as would performance-based skills. It also means that the test must be machine readable, which in turn demands some type of forced-choice response format.

  3. Increase in complexity, length, and stress – Entrance exams are very stressful and time-consuming events for all concerned. English is only one of several subjects tested (albeit a core subject on the National Center Test and many second-stage exams). Adding skills like speaking and writing, or utilizing communicative approaches, even if they were feasible in terms of time, numbers, and objectivity, would simply add to the number of skills examinees are required to display, making the experience that much more stressful. This approach would also give an even stronger advantage to those who have experience in English-speaking countries, jeopardizing the quest for fairness and balance.
        However, since fewer examinees sit for the second-stage exams, these tests have more liberty to include questions of translation, exposition, short essay writing, and other productive, non-forced-choice formats, although the number of examinees, the speed required in marking, and the demands for objectivity are still substantial.

    Brown (2002, p. 97) noted that both the type and purpose of Japanese university entrance exams have not been made clear, and that without understanding test type or purpose, and whether an exam is norm- or criterion-referenced, determining validity and reliability will be either misguided or meaningless. I would suggest that the reason test type and purpose have not been widely noted as such is because 1) any such pronouncement would be made in Japanese, and 2) the test type and purpose are generally already well understood by both those administering and those taking the exams. With this in mind, I would propose the following as common-sense answers to the questions of Japanese university entrance exam test type, purpose, and referencing:
  1. Test type – Both the National Center and second-stage tests might best be thought of as quasi-aptitude tests with placement functions (albeit with a more generalist perspective than most aptitude tests). After all, the primary instrumental goal of the tests is to measure whether examinees display potential for more advanced study of academic English at a Japanese university. These tests are forward-looking. They are not proficiency or achievement tests that confer an award upon, or otherwise sum up, previous study or coursework.

[ p. 8 ]

  2. Test purpose – Results of the National Center Test aim to stratify candidates in terms of which second-stage exams they can or should sit for. Results of second-stage exams allow or confer placement into an appropriate university.

  3. Referencing – The question as to whether the tests are norm- or criterion-referenced has also been raised by Brown (2002, p. 98). There are in fact elements of both. The National Center Test does not have any set pass/fail criterion, but if a certain standard is not reached, students are unlikely to be able to sit for certain universities' entrance exams. Second-stage exams would better be classified as norm-referenced simply because admission will usually be based upon the top-ranking examinee results without regard to any set pass/fail criterion. If 100 seats are open for a given department at a university, the 100 top-scoring examinees (ranked by each university's own criterion regarding the relative weight of the two tests in the total) will generally gain the seats, regardless of their actual scores (as sketched below). The generalized focus of both tests also places them more firmly in the norm-referenced camp, which is consistent with the tests' placement function (McNamara, 2000, p. 63).
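To make this norm-referenced selection mechanism concrete, the following minimal sketch (in Python) ranks candidates by a weighted total of their two scores and admits the top 100. The 40/60 weighting and all names here are hypothetical illustrations, not any university's actual formula; the point is only that admission depends on rank within the pool, never on reaching a fixed score.

    # A minimal sketch of norm-referenced selection as described above.
    # The weighting (40% Center Test, 60% second-stage) is hypothetical;
    # each university sets its own relative weighting of the two tests.
    CENTER_WEIGHT = 0.4
    SECOND_STAGE_WEIGHT = 0.6
    SEATS = 100  # open seats in a given department

    def combined_score(center, second_stage):
        """Weighted total, used only to rank candidates against one another."""
        return CENTER_WEIGHT * center + SECOND_STAGE_WEIGHT * second_stage

    def admit(candidates):
        """Admit the top-ranked candidates, regardless of absolute scores.

        candidates: a list of (name, center_score, second_stage_score) tuples.
        No pass/fail cutoff is applied: admission depends solely on rank
        within this year's pool, which is what makes the procedure
        norm-referenced rather than criterion-referenced.
        """
        ranked = sorted(candidates,
                        key=lambda c: combined_score(c[1], c[2]),
                        reverse=True)
        return [name for name, _, _ in ranked[:SEATS]]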

Some responses to standard questions regarding Japanese university entrance exam validity

    Now let us consider four frequently stated concerns about the validity of Japanese entrance exams. Considering the aforementioned factors, I believe inquiry into National Center Test and second-stage exam validity and reliability should take on a very different focus from that of most previous research. With this in mind, some pertinent questions related to the tests' validity are addressed below.

    Measuring validity involves assessing a number of factors. Several of these, based upon a combination of criteria found in Hughes (1989), Brown (1996), Bachman & Palmer (1996), and McNamara (2000), are presented below in question form.

[ p. 9 ]


1. Do the National Center and second-stage tests actually measure what they purport to measure?

    Even though they play different roles in the university admissions process, the two entrance exam types have a common purpose – that is, they serve a largely placement function, and both exist primarily as measures of academic English skill and aptitude for entering Japanese universities. Therefore, it would likely invalidate such a test if it were to include communicative tasks of the type that Ichige (2006), for one, seems to call for. After all, such a practical, interactive, real-world focus would no longer be consistent with the tests' extant purpose. However, if individual universities seek to measure candidate proficiency through extended interviews or other non-standardized means, then validity is maintained.
    Moreover, as we have noted, the primary skill that both tests focus upon is reading, which seems legitimate not only in light of the administrative constraints but also because reading forms the most integral part of the Japanese high-school English curriculum (Rausch, 2000). A wider focus would only serve to increase the difficulty level as well as the variation from high school curricula.

2. Does the test involve direct or indirect testing?

    Hughes (1989, p. 17), for one, argues that a language test is more valid if it involves direct testing – meaning that if the goal is to measure a specific skill or function, then the test should contain problems or tasks which demand the use of that skill or a clear understanding of that function. However, since the goal of Japanese university entrance exams is nothing more than the placement of examinees based upon "general" English skills, it would follow that the tasks and problems should be more of the indirect variety – that is, a wide variety of generalized texts and tasks/problems which measure holistic skills.
    Interestingly, Brown & Yamashita (1995) criticize university entrance exams precisely for having too much variety in terms of item type (p. 20), claiming that a number of differing item types leads to changing directions and shifting gears and determines only "testwiseness", as opposed to tests that contain few item types, which are presumed to focus more upon "content". But if the ability to display a wider variety of English skills – which a wide range of item types would demand – determines only testwiseness, then testwiseness is not necessarily a bad thing, and is certainly a better indicator of both test validity and reliability. Tellingly, Brown (2000, p. 6) later seems to call for a wider range of cognitive skills to be measured.

[ p. 10 ]


3. What is the balance between objectivity and subjective scoring on the two exams?

    As stated earlier, strict uniformity in scoring is essential for the National Center Test (and is expected of, though less essential for, the second-stage exams). This also serves an administrative need to present university entrance exams as "objective", although factors beyond uniform scoring actually realize objectivity. What this means is that both tests must balance competing constraints: the generalized need to test holistically while still allowing for discrete-item-like clarity and precision in answering and scoring. It is a given that there will always be some tension in maintaining this balance. One cannot demand the measuring of a wide variety of skills, both productive and receptive (to increase construct validity), while also demanding that strict objectivity in scoring be maintained; it seems one quality will always be maintained at the expense of the other. Therefore, maintaining an overall balance between the two becomes a worthy goal.
    Oddly, Brown and Yamashita (1995) conclude that a focus upon receptive skills represents ". . . an endorsement of a discrete-point and receptive view of language teaching" (p. 27), without considering the constraints that the National Center Test in particular faces in this regard. This criticism might be more appropriate if these exams were achievement tests, but in fact a test with a placement function can hardly be said to "endorse" a certain view of language teaching. Rather, its function is to stratify examinees, and in the case of university entrance exams, to do so efficiently and objectively. Unintentional "endorsement" may be seen in terms of washback onto high school pedagogy, but even claims of such washback have been widely questioned (Watanabe, 1996; Guest, 2000; Mulvey, 2001; Stout, 2003).

4. What is the balance between integrated and discrete questions or tasks on the National Center and second-stage exams?

   Integrated questions are those which measure holistic skills and general abilities, as opposed to discrete questions, which focus on minutiae and limited-range items (Hughes, 1989, p. 19). The main problem with too great a focus on discrete-item questions is that they increase the random element of a test – that is, the examinee must have a specific piece of knowledge in order to successfully answer the question. If the examinee happens to have memorized or mastered that particular item or pattern, they will receive credit; if they have not, they will get no credit, even though their general language knowledge or skill might be quite high. Therefore, discrete-item-based tasks are likely to reflect only a very narrow area of language skill (a point illustrated in the simulation sketch below).

[ p. 11 ]

    A focus upon individual lexical items or specific grammar patterns tends to fall into this category. If a test is very specific in its goals and has a clear and set criterion, then one might be able to justify more discrete-item questions; indeed, the specific skills measured by a very narrowly defined criterion would render an integrative approach invalid. However, a multiple-choice item is not necessarily also a discrete-point item (Oller, 1979), and the two should not be conflated, as they seem to be in Brown and Yamashita (p. 9). Therefore, it behooves more generalized tests, such as both types of Japanese university entrance exams, to have more integrative items as an indicator of validity. An integrative approach also generally warrants more passage-dependent tasks, so as to allow the examinee to display holistic comprehension of a complete passage (as opposed to a narrow item within that passage). Therefore, criticisms that more passage-dependent items somehow threaten test validity and reliability (Brown and Yamashita, p. 28) would seem to contradict a call for a more integrative approach.
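The "random element" of discrete-item testing described above can be made concrete with a small simulation – a hypothetical sketch, not drawn from any actual exam data. An examinee of fixed general ability, who knows a constant 70% of a large item domain, will still score differently from sitting to sitting of a short discrete-item test, depending purely on which items happen to appear.

    import random

    random.seed(42)  # fixed seed so the illustration is reproducible

    KNOWN_FRACTION = 0.7  # the examinee "knows" 70% of the item domain
    TEST_LENGTH = 20      # a short test of independent discrete items

    def discrete_test_score():
        """Score on one sitting: each item is either known (credit) or not.

        Because credit hinges entirely on whether each specific sampled
        item was memorized, the score varies even though ability is fixed.
        """
        return sum(random.random() < KNOWN_FRACTION for _ in range(TEST_LENGTH))

    # Simulate many sittings for the same examinee of constant ability.
    scores = [discrete_test_score() for _ in range(10000)]
    mean = sum(scores) / len(scores)
    print(f"mean: {mean:.1f}/{TEST_LENGTH}, range: {min(scores)}-{max(scores)}")
    # Typically prints a mean near 14 but a wide range across sittings:
    # the random element that integrative, passage-level tasks dilute.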
    Brown and Yamashita (p.27) also argue that topic-dependent readings decrease test validity, since knowledge of the topic at hand may allow examinees to correctly identify answers without applying actual English reading skills. But since any extended text will be to some extent topic-dependent, the ideal would be to vary topics considerably. It would also be incumbent upon researchers to note whether tasks and questions measure topical knowledge per se (i.e., specialized lexis, content-oriented questions) or general language skills.
    A related factor in establishing validity, one that has not been discussed in any of the literature thus far, is the genre of the texts. Certainly, having a variety of texts – from stories and dialogues to narratives, scientific reports, opinion pieces, and visuals and graphics – means testing a wider variety of learner reading schemas, approaches, and perspectives, and thereby encompassing a broader spectrum of authentic English.
    If a large number of texts are used (as on the National Center Test), they do not have to be tailored to a specific audience, given the test's wide, general scope. After all, the exams seek to test examinees' overall aptitude for English, not proficiency in a specific subject that happens to be presented through an English text. Having a wide range of genres, voices, and perspectives would help avoid topic-dependency, as well as criticisms of too narrow a focus.

Closing

    This section of the paper is not designed as an argument as to whether Japanese university entrance exams meet standards of validity. It is, rather, an attempt to clarify some of the variables which have hitherto been underappreciated, misunderstood, or overlooked, and which need to be reconsidered before claims of validity or invalidity can be made.
    In the next issue, we will address Japanese university entrance exam reliability.

[ p. 12 ]


References

Bachman, L. & Palmer, A. (1996). Language testing in practice. New York: Oxford University Press.

Brown, J.D. (1996). Testing in Language Programs. Upper Saddle River, NJ: Prentice Hall Regents.

Brown, J.D. (2000). University entrance examinations: Strategies for creating positive washback on English language teaching in Japan. Shiken: JALT Testing & Evaluation SIG Newsletter, 3 (2), 4-8. Retrieved from http://jalt.org/test/bro_5.htm

Brown, J.D. (2002). English language entrance examinations: A progress report. In A. S. Mackenzie & T. Newfields (Eds.), Curriculum Innovation, Testing and Evaluation: Proceedings of the 1st Annual JALT Pan-SIG Conference (pp. 94-104). Retrieved from http://jalt.org/pansig/2002/HTML/Brown.htm

Brown, J.D. & Yamashita, S. (1995). English language entrance exams at Japanese universities: What do we know about them? JALT Journal, 17 (1), 7-30.

Gorsuch, G. (1998). Yakudoku EFL instruction in two Japanese high school classrooms: An exploratory study. JALT Journal, 20 (1), 6-32.

Hughes, A. (1989). Testing for language teachers. Cambridge: Cambridge University Press.

Ichige, Y. (2006). Validity of center examinations for assessment of communicative ability. On Cue, 14 (2), 13-22.

Kikuchi, K. (2006). Revisiting English entrance examinations at Japanese universities after a decade. JALT Journal, 28 (1), 77-96.

McNamara, T. (2000). Language testing. New York: Oxford University Press.

McVeigh, B. (2001). Higher education, apathy and post-meritocracy. The Language Teacher, 25 (10), 29-32.

Mulvey, B. (2001). The role and influence of Japan's university entrance exams: A reassessment. The Language Teacher, 25 (7), 11-15.

Oller, J. W. (1979). Language tests at school: A pragmatic approach. London: Longman.

Rausch, A. (2000). Readiness of Japanese teachers of English for incorporating strategies instruction in the English curriculum. Explorations in Teacher Education, 8 (2), 12-17.

Stout, M. (2003). Not guilty as charged: Do the university entrance exams in Japan affect what is taught? ETJ Journal, 4 (1), 1-7.

Watanabe, Y. (1996). Does grammar translation come from the entrance exams? Preliminary findings from classroom-based research. Language Testing, 13 (3), 318-333.


[ p. 13 ]