Possible answers to the nine questions about testing/assessment that appeared in the March 2010 issue of this newsletter are given below.
Readers might recognize this as Cohen's d, one common effect size measure. Another way to calculate effect size, known as Glass's delta, is to divide the difference between the mean scores of an experimental group and a control group by the standard deviation of the control group. The Pearson r is also a common effect size index; a good introduction to it is provided by Ferguson (2009). Additional effect size indices have been suggested by Hedges (1981), Hedges and Olkin (1985), and Rosnow, Rosenthal, and Rubin (2000).
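For readers who would like to see the arithmetic, here is a minimal sketch of these two indices in Python. The scores are invented and the variable names are purely illustrative:

    import statistics

    # Invented scores for an experimental group and a control group
    experimental = [72, 85, 78, 90, 88, 76, 81, 84]
    control = [70, 74, 68, 80, 77, 72, 69, 75]

    mean_diff = statistics.mean(experimental) - statistics.mean(control)

    # Cohen's d: mean difference divided by the pooled standard deviation
    n1, n2 = len(experimental), len(control)
    var1, var2 = statistics.variance(experimental), statistics.variance(control)
    pooled_sd = (((n1 - 1) * var1 + (n2 - 1) * var2) / (n1 + n2 - 2)) ** 0.5
    cohens_d = mean_diff / pooled_sd

    # Glass's delta: mean difference divided by the control group's standard deviation
    glass_delta = mean_diff / statistics.stdev(control)

    print(f"Cohen's d     = {cohens_d:.2f}")
    print(f"Glass's delta = {glass_delta:.2f}")

With the same mean difference, the two indices diverge whenever the two groups' standard deviations differ, which is precisely why Glass's delta is preferred when the treatment is expected to change score variability.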
Further reading:
Carson, C. (n.d.). The effective use of effect size indices in institutional research. Retrieved March 14, 2010 from http://www.keene.edu/ir/effect_size.pdf
Cohen, J. (1988). Statistical power analysis for the behavioral sciences (2nd ed.). Hillsdale, NJ: Erlbaum.
Cortina, J. M., & Nouri, H. (2000). Effect size for ANOVA designs. Thousand Oaks, CA: Sage Publications.
Effect size. (2010, March 10). Wikipedia: The free encyclopedia. Retrieved March 9, 2010 from http://en.wikipedia.org/wiki/Effect_size
Ferguson, C. J. (2009). An effect size primer: A guide for clinicians and researchers. Professional Psychology: Research and Practice, 40(5), 532-538. DOI: 10.1037/a0015808
Graziano, A. M., & Raulin, M. L. (2000). Online glossary to Research methods: A process of inquiry (4th ed.). Retrieved March 11, 2010 from http://web.squ.edu.om/med-Lib/MED_CD/E_CDs/SPSS/glossary/glosse.htm
Hedges, L. V. (1981). Distribution theory for Glass's estimator of effect size and related estimators. Journal of Educational and Behavioral Statistics, 6(2), 107-128. DOI: 10.3102/10769986006002107
Hedges, L. V., & Olkin, I. (1985). Statistical methods for meta-analysis. San Diego, CA: Academic Press.
Levine, T. R., & Hullett, C. R. (2002). Eta squared, partial eta squared, and misreporting of effect size in communication research. Human Communication Research, 28(4), 612-625. Retrieved March 14, 2010 from www.informaworld.com/index/912219870.pdf
Morris, S. B. (2008). Estimating effect sizes from pretest-posttest-control group designs. Organizational Research Methods, 11(2), 364-386. DOI: 10.1177/1094428106291059
Rosnow, R. L., Rosenthal, R., & Rubin, D. B. (2000). Contrasts and correlations in effect-size estimation. Psychological Science, 11(6), 446-453. DOI: 10.1111/1467-9280.00287
U.S. Department of Education Institute of Education Sciences & What Works Clearinghouse. (2008). WWC standards (Version 1): Improvement index. Retrieved March 9, 2010 from http://ies.ed.gov/ncee/wwc/references/iDocViewer/Doc.aspx?docId=20&tocId=4
Valentine, J. C., & Cooper, H. (2003). Effect size substantive interpretation guidelines: Issues in the interpretation of effect sizes. Washington, DC: What Works Clearinghouse. Retrieved March 14, 2010 from http://ies.ed.gov/ncee/wwc/pdf/essig.pdf
Wilkinson, L., & APA Task Force on Statistical Inference. (1999). Statistical methods in psychology journals: Guidelines and explanations. American Psychologist, 54(8), 594-604. Retrieved March 14, 2010 from http://www.loyola.edu/library/ref/articles/Wilkinson.pdf
The basket method dates from at least the 1970s and was used by AT&T's Assessment Center (Byham, 1970). In 2003 a variant of it was used to help establish CEFR guidelines. One common form of the basket method could be described as a modified Angoff method in which raters make yes/no decisions as to whether a given performance fulfills a specified criterion.
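To make the yes/no logic concrete, here is a minimal sketch in Python of one way a cut score can be derived from such judgments: each rater decides, item by item, whether a borderline candidate would satisfy the criterion, and the recommended cut score is the sum of the per-item proportions of "yes" judgments. The ratings below are invented, and the sketch illustrates only the general approach, not the exact procedure used in the CEFR project mentioned above.

    # Invented yes/no judgments: rows = raters, columns = test items
    # 1 = "yes, a borderline candidate would fulfill the criterion on this item"
    ratings = [
        [1, 1, 0, 1, 0, 1, 1, 0, 1, 1],  # Rater A
        [1, 0, 0, 1, 1, 1, 0, 0, 1, 1],  # Rater B
        [1, 1, 1, 1, 0, 1, 1, 0, 0, 1],  # Rater C
    ]

    n_raters = len(ratings)
    n_items = len(ratings[0])

    # Proportion of "yes" judgments for each item
    item_props = [sum(r[i] for r in ratings) / n_raters for i in range(n_items)]

    # Recommended cut score = expected number of items a borderline candidate passes
    cut_score = sum(item_props)
    print(f"Recommended cut score: {cut_score:.1f} out of {n_items} items")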
Further reading:
Angoff, W. H. (1971, 1984). Scales, norms, and equivalent scores. In R. L. Thorndike (Ed.), Educational measurement (2nd ed.) (pp. 508-600). Washington, DC: American Council on Education. Retrieved March 14, 2010 from http://www.ets.org/portal/site/ets/menuitem.c988ba0e5dd572bada20bc47c3921509/?vgnextoid=78c5c2f348b46010VgnVCM10000022f95190RCRD&vgnextchannel=dcb3be3a864f4010VgnVCM10000022f95190RCRD
Byham, W. C. (1970, July/August). Assessment centers for spotting future managers. Harvard Business Review, 59, 150-167.
Cizek, G. J., & Bunch, M. B. (Eds.). (2007). Standard setting: A guide to establishing and evaluating performance standards on tests (new edition). Thousand Oaks, CA: Sage Publications.
Cross, L. H., Impara, J. C., & Frary, R. B. (1984). A comparison of three methods for establishing minimum standards on the National Teacher Examinations. Journal of Educational Measurement, 21(2), 113-129.
George, S., Haque, M. S., & Oyebode, F. (2006). Standard setting: Comparison of two methods. BMC Medical Education, 6(46). DOI: 10.1186/1472-6920-6-46
Kaftandjieva, F. (2009). Basket procedure: The breadbasket or the basket case of standard setting methods? In N. Figueras & J. Noijons (Eds.), Linking to the CEFR levels: Research perspectives (pp. 21-34). Arnhem: CITO/EALTA. Retrieved March 11, 2010 from http://www.coe.int/t/dg4/linguistic/EALTA_PublicatieColloquium2009.pdf
Rock, D. A., Davies, E. L., & Werts, C. (1980). An empirical comparison of judgmental approaches to standard setting procedures (Research report #0-7). Princeton, NJ: Educational Testing Service.
3 Q: What is the university entrance exam item below probably attempting to measure? How could this item be improved?
Further reading:
Bothell, T. W. (2001). 14 rules for writing multiple-choice questions. Retrieved March 20, 2010 from http://testing.byu.edu/.../14%20Rules%20for%20Writing%20Multiple-Choice%20Questions.pdf
Christensen, C. A. (2005). The role of orthographic-motor integration in the production of creative and well-structured written text for students in secondary school. Educational Psychology, 25(5), 441-453. DOI: 10.1080/01443410500042076
Gray, R. (2004). Grammar correction in ESL/EFL writing classes may not be effective. The Internet TESL Journal, 10(11). Retrieved March 15, 2010 from http://iteslj.org/Techniques/Gray-WritingCorrection.html
Haladyna, T. M., Downing, S. M., & Rodriguez, M. C. (2002). A review of multiple-choice item-writing guidelines for classroom assessment. Applied Measurement in Education, 15(3), 309-334.
Kehoe, J. (1995). Writing multiple-choice test items. Practical Assessment, Research & Evaluation, 4(9). Retrieved March 20, 2010 from http://PAREonline.net/getvn.asp?v=4&n=9
Truscott, J. (1996). The case against grammar correction in L2 writing classes. Language Learning, 46(2), 327-369. Retrieved March 15, 2010 from http://hss.nthu.edu.tw/~fl/faculty/John/Grammar_Correction_in_L2_Writing_Class.pdf
Setting aside the issue of whether or not there is actually a difference between the "[dis]agree" and "strongly [dis]agree" response options, if this were the only item in the survey seeking to measure student confidence, respondents would probably tend to give inflated responses, since it is easier to agree with (or express ambivalence about) survey items than it is to disagree with them. A well-designed survey should either use neutrally worded items or counterbalance the previous item with one that has a different nuance, as in this example:
[Example items 1 and 2, one with agreement-based and one with veracity-based response options, appear here in the original article.]
Notice that these two items differ not only in nuance but also in their response descriptors: Example 1 uses an agreement-based descriptor scale, whereas Example 2 uses a veracity-based one. When several similar questions are used to measure the same construct, varying the descriptors and nuances in this way may enhance the robustness and range of the overall instrument.
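One practical consequence of such counterbalancing is that the negatively nuanced item normally has to be reverse-coded before the two items are combined into a single confidence score, so that a high value means the same thing on both. Here is a minimal sketch of that step, assuming a five-point scale and invented responses:

    SCALE_MAX = 5  # five-point Likert-type scale

    # Invented responses from one student: item 1 is positively worded,
    # item 2 is the counterbalanced, negatively nuanced item
    item1 = 4  # e.g., "agree" on the agreement-based scale
    item2 = 2  # e.g., "mostly untrue" on the veracity-based scale

    # Reverse-code the negatively nuanced item so that a high value
    # indicates high confidence on both items
    item2_reversed = (SCALE_MAX + 1) - item2

    confidence_score = (item1 + item2_reversed) / 2
    print(confidence_score)  # -> 4.0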
Further reading:
Draper, S. W. (2009, December 23). The Hawthorne, Pygmalion, placebo and other effects of expectation: Some notes. Retrieved March 15, 2010 from http://www.psy.gla.ac.uk/~steve/hawth.html#Preface
Jones, R. A. (1981). Self-fulfilling prophecies: Social, psychological, and physiological effects of expectancies. Hillsdale, NJ: Psychology Press.
Mizumoto, A., & Takeuchi, O. (2009). Comparing frequency and trueness scale descriptors in a Likert scale questionnaire on language learning strategies. JLTA Journal, 12, 116-130.
Van Bennekom, F. (2007). How question format affects survey analysis. Retrieved March 16, 2010 from http://www.greatbrook.com/survey_question.htm
Zdep, S. M., & Irvine, S. H. (1970). A reverse Hawthorne effect in educational evaluation. Journal of School Psychology, 8, 85-95.
Let us consider a Japanese university entrance exam as an example. If over 99% of the exam applicants are ethnic Japanese, there is probably little value in seeing how Japanese and non-Japanese perform differently on the test. Likewise, if over 99% of the applicants are between the ages of 17 and 20, there may be little rationale for exploring how age differences affect performance. One non-construct variable that probably should be explored, however, is gender. If an EFL exam claims to measure only "language proficiency" yet males and females perform very differently on it, we are left with questions that merit exploration. Within the given population, do men and women actually differ in language proficiency? Or is there some sort of test bias that disadvantages one gender? To answer those questions and ascertain the differential validity of the exam, many different kinds of evidence would need to be examined.
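Although the article does not prescribe a particular analysis, one common way to gather such evidence at the item level is a Mantel-Haenszel differential item functioning (DIF) check: examinees are first matched on total score, and the odds of answering a given item correctly are then compared for the two gender groups within each score band. The sketch below uses invented data and illustrative field names only; a common odds ratio far from 1.0 flags an item for closer review.

    from collections import defaultdict

    def mantel_haenszel_odds_ratio(records, item):
        """Common odds ratio for one item, stratifying examinees by total score.

        records: dicts with keys 'group' ('M' or 'F'), 'total' (matching score),
        and 'items' (dict mapping item name to 0/1 correctness).
        """
        strata = defaultdict(lambda: {"A": 0, "B": 0, "C": 0, "D": 0, "N": 0})
        for r in records:
            s = strata[r["total"]]
            correct = r["items"][item]
            if r["group"] == "M":          # reference group
                s["A" if correct else "B"] += 1
            else:                          # focal group
                s["C" if correct else "D"] += 1
            s["N"] += 1

        num = sum(s["A"] * s["D"] / s["N"] for s in strata.values() if s["N"])
        den = sum(s["B"] * s["C"] / s["N"] for s in strata.values() if s["N"])
        return num / den if den else float("inf")

    # Tiny invented data set: two score bands, four examinees in each
    examinees = [
        {"group": "M", "total": 8, "items": {"q1": 1}},
        {"group": "M", "total": 8, "items": {"q1": 1}},
        {"group": "F", "total": 8, "items": {"q1": 1}},
        {"group": "F", "total": 8, "items": {"q1": 0}},
        {"group": "M", "total": 5, "items": {"q1": 1}},
        {"group": "M", "total": 5, "items": {"q1": 0}},
        {"group": "F", "total": 5, "items": {"q1": 1}},
        {"group": "F", "total": 5, "items": {"q1": 0}},
    ]
    print(mantel_haenszel_odds_ratio(examinees, "q1"))  # -> 3.0 with these data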
Further reading:
Clason, D. L., & Dormody, T. J. (1994). Analyzing data measured by individual Likert-type items. Journal of Agricultural Education, 35(4), 31-35. Retrieved March 16, 2010 from http://pubs.aged.tamu.edu/jae/pdf/Vol35/35-04-31.pdf
Further reading:
Marczyk, G., DeMatteo, D., & Festinger, D. (2005). Essentials of research design and methodology. New York: John Wiley & Sons.
Marion, R. (2004). The whole art of deduction: Defining variables and formulating hypotheses. Retrieved March 27, 2010 from http://sahs.utmb.edu/pellinore/intro_to_research/wad/vars_hyp.htm
Further reading:
Genesee, F., & Upshur, J. A. (1996). Classroom-based evaluation in second language education (Cambridge Language Education). New York: Cambridge University Press.
Test rubric: Problems associated with rubrics. (2002). In S. A. Mousavi, An encyclopedic dictionary of language testing (3rd ed.) (pp. 755-757). Taipei: Tung Hua Book Company.
A: The correct answer is (B). In truncation, the extreme top and/or bottom scores of a test are removed from consideration. Truncation may also occur if the observation period is shorter than the events under investigation, such as in a mortality study.
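As a concrete illustration of the first sense of truncation, the sketch below (Python, with invented scores and cut points) drops the extreme scores and shows how the observed spread shrinks; any correlations computed from such a truncated sample will likewise tend to be attenuated:

    import statistics

    # Invented test scores (0-100)
    scores = [12, 25, 33, 41, 47, 52, 58, 63, 69, 74, 80, 88, 95, 99]

    # Truncate: remove scores below 30 or above 90 from consideration
    truncated = [s for s in scores if 30 <= s <= 90]

    print(f"Full sample:      n={len(scores)},  SD={statistics.stdev(scores):.1f}")
    print(f"Truncated sample: n={len(truncated)}, SD={statistics.stdev(truncated):.1f}")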
Further reading:
Mandel, M. (2007). Censoring and truncation - Highlighting the differences. The American Statistician, 61(4), 321-324. DOI: 10.1198/000313007X247049
Acknowledgements
Many thanks to Lars Molloy, Ed Schaeffer, and Chris Weaver for feedback on this article. The responsibility for any errors herein rests with the author.