Research Development Project

 

Historical and Contemporary views of Phenomenon of Interest

 

Cheating on Tests

Introduction: As defined in Merriam-Webster's Collegiate Dictionary (2003), cheating in the context of testing is the “depriv[ing] of something valuable by use of deceit or fraud and violat[ing] rules dishonestly”. According to Cizek (1999), “cheating means using fraudulent means to project oneself as possessing knowledge perpetrated by violating the rules”.
               Cizek (1999) stated that cheating on tests dates back to the Garden of Eden, when Adam was deceived into eating the forbidden fruit (p. 127). He observed that cheating occurs because of the desire for good grades. He also noted the nonchalant attitude of the general public toward those who cheat, as if cheaters' actions were inconsequential, and he attributed this nonchalance to the switch from norm-referenced to fixed-standard evaluation systems.
               The earliest forms of testing, and with them concern about examination security, date back to ancient China. “Examinees developed strategies to aid them in answering questions, including creating miniature books of notes, writing passages on undergarments, copying notes on back of fans and hiring impersonators” (Cohen & Wollack, 2006, p. 361).

Classification: Cohen and Wollack (2006) classified cheating as individual or collaborative. Collaborative cheating is facilitated by computerized adaptive testing (CAT) because test items are used over a longer period of time; individual cheating behaviors, on the other hand, are more common with paper-and-pencil tests. Examples of individual cheating include the use of unauthorized materials, copying answers from a neighbor, and plagiarism, while collaborative cheating includes an examinee allowing another to copy, examinees agreeing to share answers, and attempts by a group of test takers to capture the item bank. What many people do not realize is that teachers and administrators also play a significant role in cheating. According to Cohen and Wollack, teachers' and administrators' alleged cheating behaviors include “changing students’ answers after the test, teaching to the test, illegal coaching during a test, using live exams as practice exams, reading off answers during a test, providing extra time, disallowing low-achieving students from testing” (p. 369).

            Cheating can also be classified, in ascending order of magnitude, as “cheating by taking, giving, or receiving information, cheating through the use of forbidden materials or information, and cheating by circumventing the process of assessment” (Cizek, 1999, p. 39).

             

 

Preventing & Detecting Cheating in Paper-and-Pencil Tests

            Despite research evidence that cheating is on the increase, few culprits are caught and fewer still are punished for this behavior. As a consequence, the behavior has been greatly reinforced in students.

            By contrast, measures taken by the ancient Chinese government to prevent cheating on civil service examinations included securing testing rooms, stripping cheaters of all previously earned certificates, and, in extreme cases, beheading examinees found guilty (Cohen & Wollack, 2006).

More recently, Cizek (1999) described ways of preventing cheating in paper-and-pencil tests, especially those in multiple-choice format.

Cizek discussed preventive measures for paper-and-pencil tests under three main headings: classroom, institutional, and educators' perspectives. In the classroom, the mechanisms are improved test administration conditions, attentive proctoring, documented or communicated information about academic honesty, designing good tests, controlled testing situations, and so on. The institutional perspective includes established policies, procedures, and honor codes. The educators' perspective entails avoiding the common practice of teaching too closely to the test.

The prevalent cheating behaviors with paper-and-pencil tests are answer copying and impersonation. Item over-exposure was not a major threat in paper-and-pencil testing because a number of techniques prevented it, namely shortened testing windows, alternate forms, item replacement, and minimizing opportunities for item pre-knowledge. Generally, the more complex forms of cheating were not perpetrated with paper-and-pencil tests; fortunately, most cheating occurred during testing and could be caught by effective proctoring.

            Cohen and Wollack (2006) reported countermeasures including observation, which involves using proctors or test administrators who are informed about cheating behaviors. Electronic countermeasures include the use of video surveillance systems, screening with handheld metal detectors, using specialized equipment such as retinal scans or fingerprinting to verify candidates' identities, and checking candidates' information against a database for authenticity. Psychometric measures involve statistical and data-analytic approaches.

               Selected-response formats are more prone to cheating, especially answer copying. As a consequence, most studies of cheating behavior have centered on detecting answer copying in the multiple-choice format. Cizek (1999) classified the methods of detecting answer copying on tests as observational and statistical. He noted that observation, though subjective, remains the most widely endorsed method (p. 133). Statistical methods were applied to cases of answer copying and impersonation triggered by observed evidence. He referenced two statistical models for detecting cheating given by Saupe (1960), namely chance and empirical.
               Early research by Bird (1927, 1929) and Crawford (1930) relied on the empirical approach to detecting answer copying. In the empirical technique, the mean error similarity of pairs of suspected cheaters was compared to that of a norm group. The norm group consisted of a sample of examinees who were geographically separated or seated in nonadjacent locations during testing. The main idea was to compare the distribution of identical errors of the suspected cheaters with that of the norm group, which consisted of examinees who could not have cheated because of their seating positions. Statistical significance was determined using confidence intervals or hypothesis testing. Cheating was corroborated when the mean identical errors of the suspected cheaters exceeded the upper confidence limit of the norm group or when the p-value was smaller than a given alpha level of significance.
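
To make the empirical approach concrete, the sketch below is my own Python illustration (not code from Bird or Crawford): it counts a pair's identical wrong answers and compares a suspected pair against an upper confidence limit computed from norm-group pairs. The function names and data are hypothetical.

    # Minimal sketch of the empirical (norm-group) approach described above.
    import numpy as np

    def identical_errors(resp_a, resp_b, key):
        """Count items both examinees answered wrong with the same chosen option."""
        a, b, k = np.asarray(resp_a), np.asarray(resp_b), np.asarray(key)
        return int(np.sum((a != k) & (b != k) & (a == b)))

    def norm_group_upper_limit(norm_pairs, key, z=1.96):
        """Upper confidence limit of identical errors among non-adjacent ('honest') pairs."""
        counts = np.array([identical_errors(a, b, key) for a, b in norm_pairs])
        return counts.mean() + z * counts.std(ddof=1)

    # Hypothetical usage: flag the suspected pair if it exceeds the norm-group limit.
    # flagged = identical_errors(suspect1, suspect2, key) > norm_group_upper_limit(norm_pairs, key)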

In his 1929 article, Bird improved upon his 1927 study. The study showed that forewarning students about the consequences of cheating reduced, but did not eliminate, the behavior. This buttressed the point that punishing the behavior, in the spirit of deterrence, would be more effective.

The technique introduced by Crawford (1930) was significant because it made it possible to determine who copied and who was copied from. The method entailed using test-retest administration to determine consistency of performance and backing the evidence with previous academic records (pp. 779-780). By this method, the examinee with the greater discrepancy in performance was regarded as the one who copied.

In 1974, Angoff introduced and evaluated eight empirical answer-copying detection indices, labeled “A” through “H”. These indices involved several independent and dependent variables. Based on validity and reliability studies, index B was identified as the most viable answer-copying detection technique. Index B used as its dependent variable the number of identical wrong responses of the suspected examinees, and a standardized test statistic was computed for comparison. Angoff noted that the two best indices, “B” and “H”, have been in use at Educational Testing Service (p. 49).
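
A hedged sketch of the standardized-statistic idea behind Angoff-style indices follows; it is my own illustration, and the exact variables Angoff used for index B differ in detail. Here a norm group of presumably honest pairs is used to regress identical wrong responses on the pairs' numbers of wrong responses, and the suspected pair's residual is standardized.

    # Illustrative only: a regression-based standardized statistic, not Angoff's exact formula.
    import numpy as np

    def b_style_statistic(norm_wrong, norm_identical, suspect_wrong, suspect_identical):
        slope, intercept = np.polyfit(norm_wrong, norm_identical, 1)    # norm-group regression
        residuals = np.asarray(norm_identical) - (slope * np.asarray(norm_wrong) + intercept)
        expected = slope * suspect_wrong + intercept
        return (suspect_identical - expected) / residuals.std(ddof=1)   # large values are suspicious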

               However, the chance methods, which are based on the assumption that identical errors follow a known statistical distribution, are preferred and in current use. In the chance methods, the mean identical errors of suspected cheaters are compared to the value given by an assumed statistical distribution of identical errors. The first such method was derived by Dickenson (1945); other prominent chance methods were Anikeef's (1954) and Saupe's (1960).

            Dickenson (1945) derived the number of ways students could respond to multiple-choice items. By this technique, given N answer choices, the probable identical error (IE) was defined as the ratio of the number of distracters (N - 1) to the number of different combinations of answers (N²). The range of probable percentages of identical errors varied from zero to twice the mean of identical errors; examinees whose average identical errors fell outside this range were flagged as cheaters.
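
The calculation, as summarized above, can be written in a few lines. This is my own sketch, the variable names are hypothetical, and scaling the chance rate by the number of items both examinees missed is my reading of the method.

    # Dickenson-style chance bounds: with N choices the probable identical-error rate is (N - 1) / N**2,
    # and the chance range runs from zero to twice the expected identical errors.
    def dickenson_bounds(n_choices, n_common_wrong_items):
        p_identical = (n_choices - 1) / n_choices**2       # chance rate of a shared wrong answer
        expected_ie = p_identical * n_common_wrong_items   # expected identical errors by chance
        return expected_ie, 2 * expected_ie                # (mean, upper limit of the chance range)

    # Hypothetical example: 5-option items, a pair with 20 items both answered wrong.
    # mean_ie, upper_ie = dickenson_bounds(5, 20)   # counts above upper_ie would be flagged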

            Anikeef's index for detecting answer copying in multiple-choice tests assumed that, for any given number of wrong answers (n) shared by a pair of examinees, the number of identical wrong answers follows a binomial distribution. The method was tested on an experimental group and found effective in detecting collaborative behavior not observed during the test; however, it was shown to be least effective in detecting answer copying from several sources (p. 177).
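
Under the binomial assumption just described, a critical value for the number of identical wrong answers follows directly; the sketch below is my own illustration, with the match probability taken as 1/N for N-option items.

    # Binomial screen in the spirit of Anikeef's index.
    from scipy.stats import binom

    def anikeef_critical_value(n_common_wrong, n_options, alpha=0.05):
        """Smallest identical-wrong count that would be improbable (< alpha) by chance."""
        p = 1.0 / n_options
        return int(binom.ppf(1 - alpha, n_common_wrong, p)) + 1

    # Hypothetical example: a pair sharing 15 wrong answers on 4-option items.
    # threshold = anikeef_critical_value(15, 4)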

            The most recent statistical methods for detecting answer copying in multiple-choice tests rely on Item Response Theory (IRT), such as g2, derived by Frary et al. (1977), and Wollack's ω (1997). Wollack (1997) regarded the earlier approaches as classical techniques and concluded that they depended largely on the performance of all examinees. The classical techniques, although useful, were inefficient because they relied on sampling distributions, identical wrong responses, and the performance of other examinees. With IRT, more powerful techniques have emerged that are promising for large-scale testing. For instance, Wollack's cheating index (ω) compares not only the identical wrong answers of suspected cheaters but also their identical right answers. In addition, by this technique the paper of the copier is compared directly with that of the source.
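
To illustrate the IRT-based idea, the following is a simplified sketch of an ω-style index of my own construction: given the model-implied probability that the copier would match the source's response on each item (e.g., from a nominal response model evaluated at the copier's ability estimate), the index is the standardized difference between observed and expected matches. The calibrated probabilities are taken as given inputs here.

    # Simplified omega-style copying index; p_match would come from a calibrated IRT model.
    import numpy as np

    def omega_index(matches, p_match):
        """matches: 0/1 per item (copier response == source response); p_match: model probability of a match."""
        matches, p_match = np.asarray(matches, float), np.asarray(p_match, float)
        expected = p_match.sum()
        sd = np.sqrt(np.sum(p_match * (1 - p_match)))
        return (matches.sum() - expected) / sd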

All said, I am of the view that researchers should focus more on preventing cheating than on observing and detecting the behavior. After all, prevention is better than cure.

 

Preventing Cheating in Computer Adaptive Testing

            As a result of the increased use of computers in testing, which can be traced back to the 1960s (Brzezinski, 1984), and the prevalence of computerized adaptive testing (CAT) since the early 1990s, more advanced forms of cheating have emerged. Unfortunately, some cheating behaviors occur before the test, in the forms of item pre-knowledge and memorization, and even those perpetrated during the test are not observable. This cheating is facilitated by a security problem of CAT: item exposure, which occurs as a result of the prolonged use of test items over time. Depending on the relationship between the trait distribution of the examinees and the information structure of the item bank, different items are exposed at differing rates (http://EdRes.org/scripts/cat).

The Sage Handbook of Quantitative Methodology for the Social Sciences (2004) reported that CAT undermines answer copying because examinees are tested with different questions. However, a major threat to test validity since CAT was introduced is item security and item pool usage. In IRT-based CAT, certain items are selected repeatedly while others are not, creating the twin problems of item over-exposure and an underutilized item bank. CAT also brought the problem of underestimation and overestimation of ability, which occurs because the ability estimate is driven by the first few items administered.

            Computer-based tests are prone to item over-exposure because testing windows are longer and tests are administered multiple times per window and several times a year. Moreover, every time a test is given, there is a chance that examinees will share test content with others who have yet to take it; item exposure is thus facilitated when items are used frequently. The extent of item over-exposure is assessed using three criteria: item exposure rates, efficiency of bank utilization, and average item overlap percentage. Exposure is monitored by computing these three statistics, which are determined to a large extent by the item selection method (Cohen & Wollack, 2006).
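
The three statistics named above can be computed from a simple record of which items each examinee saw. The sketch below is my own illustration using a hypothetical 0/1 administration matrix (rows are examinees, columns are items in the bank); normalizing the overlap by the average test length is my assumption.

    # Item exposure rates, bank utilization, and average pairwise item overlap.
    import numpy as np

    def exposure_statistics(admin):
        admin = np.asarray(admin, dtype=float)
        n_examinees = admin.shape[0]
        exposure_rates = admin.sum(axis=0) / n_examinees           # per-item exposure rate
        bank_utilization = float(np.mean(admin.sum(axis=0) > 0))   # share of the bank ever used
        test_length = admin.sum(axis=1).mean()                     # average test length
        shared = admin @ admin.T                                   # items shared by each pair of examinees
        upper = np.triu_indices(n_examinees, k=1)
        avg_overlap = shared[upper].mean() / test_length           # average overlap proportion
        return exposure_rates, bank_utilization, avg_overlap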

            The techniques applied to control item over-exposure in CAT can be classified as randomization, stratification, and probabilistic methods.

Test security reaches its maximum in CAT when every item has an equal probability of being administered as is the case with randomization techniques (e.g. Bergstrom, Lunz, & Gershon, 1992; Kingsbury & Zara, 1989). By randomization, items to be administered are selected from a set of targeted items.  However, the randomization techniques are less precise in estimating ability and require more items in the item bank.
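
One common randomization rule picks at random from the few best candidates near the current ability estimate, so no single item is administered deterministically. The sketch below is my own illustration of that idea, not code from the cited studies.

    # Randomesque selection: choose randomly among the k available items closest in difficulty to theta_hat.
    import numpy as np

    rng = np.random.default_rng(0)

    def randomesque_pick(difficulties, theta_hat, available, k=5):
        idx = np.array(sorted(available))
        order = idx[np.argsort(np.abs(np.asarray(difficulties)[idx] - theta_hat))]
        return int(rng.choice(order[:k]))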

Under stratification, items are divided into homogeneous sets based on the item discrimination parameters, and items are then selected from each set based on their difficulty levels (e.g., Chang & Ying, 1999; Chang et al., 2001).
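
A minimal sketch of this idea, assuming a simple a-stratified design (low-discrimination strata used early in the test, high-discrimination strata late), is given below; the function and variable names are my own.

    # a-stratified selection: within the current stratum, pick the item with difficulty nearest theta_hat.
    import numpy as np

    def a_stratified_pick(a_params, b_params, theta_hat, stage, n_strata, administered):
        a, b = np.asarray(a_params), np.asarray(b_params)
        strata = np.array_split(np.argsort(a), n_strata)   # strata ordered from low to high discrimination
        pool = [int(i) for i in strata[stage] if i not in administered]
        return min(pool, key=lambda i: abs(b[i] - theta_hat))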

As illustrated in the Sage Handbook of Quantitative Methodology for the Social Sciences (2004), the maximum-information item selection approach involves selecting the item with maximum Fisher information. By this approach, items with higher discrimination are selected; as a result, highly discriminating items are over-exposed while others are never selected. In this situation it is recommended that items with lower discrimination also be used to prevent over-exposure.

Although the maximum-information approach leads to more accurate estimation of examinees' ability, it has the potential to over-expose items with desirable psychometric properties. Also in the handbook, Chang noted that one security strategy in CAT would be to set a minimum for the item-sharing and item-pooling indices; a table of lower bounds for item overlap rates could also serve as a guide for practitioners in evaluating test security.
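
For reference, the selection rule being criticized can be sketched under a two-parameter logistic model, where the Fisher item information at theta is a² · P · (1 - P); this is my own minimal illustration.

    # Maximum-information selection under a 2PL model (the rule that tends to over-expose high-a items).
    import numpy as np

    def fisher_info_2pl(a, b, theta):
        p = 1.0 / (1.0 + np.exp(-a * (theta - b)))
        return a**2 * p * (1 - p)

    def max_info_pick(a_params, b_params, theta_hat, administered):
        info = fisher_info_2pl(np.asarray(a_params, float), np.asarray(b_params, float), theta_hat)
        info[list(administered)] = -np.inf    # never re-administer an item
        return int(np.argmax(info))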

            The most popular probabilistic approach to controlling item exposure was introduced by Sympson and Hetter (1985, 1997). Van der Linden (2003) reported that the Sympson-Hetter technique laid the foundation for probabilistic control of item exposure. In this technique, items are selected before the exposure-control constraints are applied, so setting the control parameters requires long simulations. Van der Linden's paper reviewed this approach to item exposure control, especially its demerits; alternatives were derived by adjusting the control parameters used in the Sympson-Hetter technique, and simulation studies were then conducted to evaluate their performance. The data used for the simulations were based on the Law School Admission Test (LSAT). Two of the alternatives were shown to be viable in estimating ability.

Van der Linden and Veldkamp (2004) introduced another method of probabilistic control of item exposure. A simulation study was carried out using Law School Admission Test (LSAT) data; the technique was shown to have negligible impact on the statistical properties of the ability estimator, and the results showed that the method was effective in controlling item exposure. In general, probabilistic methods assign selection probabilities based on the frequency with which items are selected for a particular subgroup under the maximum-information criterion.
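
The general shape of the probabilistic filter can be sketched as follows. This is my own simplified illustration of a Sympson-Hetter-style rule, in which the nominated item is actually administered only with probability k[i], its exposure-control parameter, and the k values are assumed to have been set in advance by simulation.

    # Sympson-Hetter-style probabilistic exposure filter (simplified sketch).
    import numpy as np

    rng = np.random.default_rng(1)

    def exposure_filtered_pick(info, k, administered):
        """info: item information at the current theta; k: exposure-control probabilities."""
        order = np.argsort(info)[::-1]                                  # most informative first
        remaining = [int(i) for i in order if i not in administered]
        for i in remaining:
            if rng.random() < k[i]:                                     # pass the exposure filter
                return i
        return remaining[0]                                             # fallback: best remaining item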

A cheating strategy to which CAT is immune was discussed by Gerson (1995). He showed that students who tried to outsmart the adaptive nature of CAT by purposely answering initial items incorrectly in order to get easier questions failed woefully. By this strategy the examinee takes a set of items not targeted to his or her ability, thus defeating the purpose of the test. The consequences of this strategy for the examinee and the test developer were examined: the strategy negatively affected the examinee's ability estimate and the standard error of measurement (SEM).

Indeed, controlling item exposure is essential because it nips cheating in the bud. It is worthwhile to explore and incorporate item-exposure control techniques in operational CAT because, in the long run, doing so will pay off in preserving the reliability of scores and the validity of decisions made with test results.

 

Detecting Cheating in Computer Adaptive Testing

            Cheating on tests often conjures up images of students as the culprits; teachers are rarely mentioned, even though their role can have far-reaching consequences. There are many ways a teacher can influence cheating behavior, yet research on this is scarce. Some of the few available studies focus on detecting the teacher's role via unusual score gains and response patterns (Jacob & Levitt, 2003). Some teachers aid cheating by teaching to the test and other inappropriate test-preparation practices (Mehrens, 1991).

            With IRT, cheating behavior is detected by comparing the response pattern with the underlying response model. Person-fit methods involve detecting response patterns for which the IRT model does a poor job of predicting the probability of success on an item. Examples of behaviors producing person misfit include item pre-knowledge and cheating (copying answers from a neighbor, using prohibited aids, etc.).
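
One standard person-fit statistic that captures this idea is the standardized log-likelihood (lz). The sketch below, under a 2PL model, is my own illustration rather than the specific index used in any of the studies cited here; strongly negative values flag response patterns the model predicts poorly.

    # lz person-fit statistic under a 2PL model.
    import numpy as np

    def lz_statistic(x, a, b, theta):
        """x: 0/1 item responses; a, b: item parameters; theta: the examinee's ability estimate."""
        x, a, b = (np.asarray(v, dtype=float) for v in (x, a, b))
        p = 1.0 / (1.0 + np.exp(-a * (theta - b)))
        loglik = np.sum(x * np.log(p) + (1 - x) * np.log(1 - p))
        expected = np.sum(p * np.log(p) + (1 - p) * np.log(1 - p))
        variance = np.sum(p * (1 - p) * np.log(p / (1 - p)) ** 2)
        return (loglik - expected) / np.sqrt(variance)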

McLeod and Lewis (1999) applied a fit index to detect items in the pool that had been memorized. In general, the index was most effective when sources had higher ability estimates, when the percentage of beneficiaries was higher, and when there were more sources. Because test items were used over time, sources combined item lists and provided them to beneficiaries, who memorized the items and were able to answer them correctly on the test. Memorizing item lists from more sources resulted in higher ability estimates. Low-ability test takers were helped most by regular sources, while high-ability test takers were helped most by proficient sources.

Impara, Kingsbury, and colleagues (2005), working with Caveon, analyzed data-forensic methods such as aberrance, collusion, and score volatility. Examinees' misconduct on CAT was detected by checking for unusual item response patterns or response times. The analyses examined four types of test-security risk, namely collusion, cheating, piracy, and volatile retakes. The approach was illustrated with an educational assessment program that used CAT and was shown to be effective.

In his article, Segall (2002) used an item response model to estimate and characterize the item-preview and score-gain distributions observed in on-demand high-stakes testing programs. A simulation study evaluated the performance of a Markov Chain Monte Carlo (MCMC) procedure for three levels of compromise: zero, moderate, and severe. The simulation showed that the MCMC procedure provided information on the extent of test compromise and its impact on the test scores of a small group of examinees when some items are known to be secure.

Item response models are efficient and easily applied in detecting cheaters on multiple-choice tests. They save one the trouble of sampling, since the response patterns of suspected cheaters are compared against the model for violations. Moreover, cheating is detected for particular individuals independently of the response patterns of the other individuals who took the test, as was required with the classical techniques.

                                                                                                                                                        

Cheating in Practice: GRE Cases

               Cizek (1999) acknowledged that testing organizations such as the Educational Testing Service (ETS) for the most part take precautionary measures when cheating incidents occur by invalidating the scores. All test takers benefit when ETS can assure admissions decision makers of the validity of test scores. He reported that a committee (Crocker, Geisinger, Loyd, & Webb, 1994) with responsibility for test security at ETS responded to cheating by withholding scores, investigating and collecting evidence, and penalizing according to the gravity of the offense if the examinee was found guilty.

For instance, a cheating case was reported with the CAT-GRE on August 6, 2002, when it was discovered that a number of Asian-language web sites had posted questions from the operational GRE General Test. As a result, ETS retired the CAT GRE General Test and reintroduced paper-and-pencil forms in China, Hong Kong, Taiwan, and Korea (www.ets.org, August 20, 2002):

 

 ETS is undertaking the change at the request of the GRE Board, the policy setting body of the examination, following an investigation that uncovered a number of Asian-language web sites offering questions from live versions of the computer-based GRE General Test. The Web sites included both questions and answers illegally obtained by test takers who memorize and reconstruct questions and share them with other test takers. The web sites are located in China and Korea, and easily accessed in Hong Kong and Taiwan (p. 127).
 
               According to James Livingood, a GRE Client Relations Specialist in the Higher Education Division, ETS used a split-test administration to restore the validity of scores for students in China (including Hong Kong), Korea, and Taiwan. The General Test is thus offered in two parts: the Analytical Writing section is offered on computer, while the Verbal and Quantitative sections are offered as paper-based tests. Test takers are required to take both the computer-based and paper-based parts of the GRE General Test in the same testing year, but must take the computer-based Analytical Writing portion first.
               A choice of two paper-based administrations will be offered during the 2007-08 testing year, spaced out so as to minimize congestion in the computer-based test (CBT) centers. 
               Also, the paper-based Verbal and Quantitative sections are retired from use after each administration, thereby removing the unfair advantage some past students gained by memorizing questions in advance of the test. The Analytical Writing section is administered on computer; test takers word-process their essay responses so that the responses can be reviewed by ETS Essay Similarity Detection (ESD) software. The software helps ensure that every scored essay reflects original critical thinking and analytical writing (e-mail, September 30, 2007).

            Another case of organized collaborative cheating on the GRE was reported in the Sage Handbook of Quantitative Methodology for the Social Sciences. Kaplan Educational Centers tested the exposure of GRE-CAT items by sending its employees to memorize items. Kaplan discovered that, within a short period of time, its employees' memorized items overlapped, indicating that most of the items had been over-exposed. As a result, ETS retired the items and introduced new ones. Again, this showed that the security weakness of CAT resulted from the prolonged window of testing (Davey & Nering, 2002).

            Not surprisingly, these techniques worked because the item selection algorithms made it possible for a group of people to reproduce a large chunk of the item bank by adopting response styles that would expose items of varying difficulty levels. It is also possible that nearly all of the items designed to measure high-ability examinees would be exposed, particularly if there were very few such items in the bank (Cohen & Wollack, 2006).

 

 

 

 

 

 

 

 

 

Annotated References

Angoff, W. H. (1974). The development of statistical indices for detecting cheaters. Journal of the American Statistical Association, 69(345), 44-49.

Anikeef, A. M. (1954). Index of collaboration for test administrators. Journal of Applied Psychology, 38(3), 174-177.

Bergstrom, B. A., Lunz, M. E., & Gershon, R. C. (1992). Altering the level of difficulty in computer adaptive testing. Applied Measurement in Education, 5, 137-149.

Bird, C. (1927). The detection of cheating on objective examinations. School and Society, 25(635), 261-262.

Bird, C. (1929). An improved method of detecting cheating in objective examinations. Journal of Educational Research, 19(5), 341-348.

Brzezinski, E. J. (1984). Microcomputers and testing: Where are we and how did we get there? Educational Measurement: Issues and Practice, 3(2).

Chang, H. H., & Ying, Z. (1999). a-Stratified multistage computerized adaptive testing. Applied Psychological Measurement, 23, 211-222.

Chang, H. H., Qian, J., & Ying, Z. (2001). a-Stratified multistage computerized adaptive testing with b blocking. Applied Psychological Measurement, 20, 213-229.

Cizek, G. J. (1999). Cheating on tests: How to do it, detect it, and prevent it. Mahwah, NJ: Lawrence Erlbaum Associates.

Cohen, A. S., & Wollack, J. A. (2006). Test administration, security, scoring, and reporting. In R. L. Brennan (Ed.), Educational measurement (4th ed., pp. 356-386).

Crawford, C. C. (1930). Dishonesty in objective tests. School Review, 38(10), 776-781.

Dickenson, H. P. (1945). Identical errors and deception. Journal of Educational Research, 38(7), 534-542.

Frary, R. B., Tideman, T. N., & Watts, T. M. (1977). Indices of cheating on multiple-choice tests. Journal of Educational Statistics, 2, 235-256.

Gerson, R. (1995). Does cheating on CAT pay: Not! Paper presented at the annual meeting of the American Educational Research Association, San Francisco, CA.

Hetter, R. D., & Sympson, J. B. (1997). Item exposure control in CAT-ASVAB. In W. A. Sands, B. K. Waters, & J. R. McBride (Eds.), Computerized adaptive testing: From inquiry to operation (pp. 141-144). Washington, DC: American Psychological Association.

Impara, J., Kingsbury, G., et al. (2005). Detecting cheating in computer adaptive tests using data forensics. Paper presented at the annual meeting of the National Council on Measurement in Education and the National Association of Test Directors, Montreal, Canada.

Jacob, B. A., & Levitt, S. D. (2003a). Catching cheating teachers: The results of an unusual experiment in implementing theory. In W. G. Gale & J. R. Pack (Eds.), Brookings-Wharton papers on urban affairs (pp. 185-220). Washington, DC: Brookings Institution Press.

Jacob, B. A., & Levitt, S. D. (2003b). Rotten apples: An investigation of the prevalence and predictors of teacher cheating. Quarterly Journal of Economics, 118(3), 843-877.

Kaplan, D. (Ed.). (2004). The Sage handbook of quantitative methodology for the social sciences (pp. 117-133). Thousand Oaks, CA: Sage.

Kingsbury, G. G., & Zara, A. R. (1989). Procedures for selecting items for computerized adaptive tests. Applied Measurement in Education, 2, 359-375.

McLeod, L. D., & Lewis, C. (1999). Detecting item memorization in the CAT environment. Applied Measurement in Education, 23(2), 147-160.

Mehrens, W. A. (1991, April). Defensible/indefensible instructional preparation for high stakes achievement tests: An exploratory trialogue. Paper presented at the annual meeting of the American Educational Research Association, Chicago.

Merriam-Webster. (2003). Merriam-Webster's collegiate dictionary (11th ed.). Springfield, MA: Merriam-Webster.

Saupe, J. L. (1960). An empirical model for the corroboration of suspected cheating on multiple-choice tests. Educational and Psychological Measurement, 20(3), 475-490.

Segall, D. O. (2002). An item response model for characterizing test compromise. Journal of Educational and Behavioral Statistics, 27, 163-179.

Van der Linden, W. J. (2003). Some alternatives to Sympson-Hetter item-exposure control in computerized adaptive testing. Journal of Educational and Behavioral Statistics, 28, 249-265.

Van der Linden, W. J., & Veldkamp, B. P. (2004). Constraining item-exposure rates in computerized adaptive testing with shadow tests. Journal of Educational and Behavioral Statistics, 29, 273-291.

Wollack, J. A. (1997). A nominal response model approach for detecting answer copying. Applied Psychological Measurement, 21(4), 307-320.


 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

Following Pages

Pre-Post Statements of Interest

 

 

 

 

 

 

 

 

 

 

 

 

 

 

PERSONAL STATEMENT
Re: Ph.D. Application for Fall, 2007 (Ifeoma Chika Iyioke)

 

            I am writing this statement to express my desire to study Measurement and Quantitative Methods at the doctoral level, having obtained a master's degree in Applied Statistics from Michigan State University. Upon admission into the program, I would like to specialize in Measurement (i.e., applications of quantitative methods to issues relating to classroom and large-scale assessment, instrument development, etc.).

            My readiness to advance further is premised on a very solid undergraduate education, during which I remained in the top 20% of my class while majoring in computer science and statistics. That attainment was bolstered further by my master's program here at MSU.

            However, my undergraduate experience lacked practical content. As a practically oriented person, I always seek the application and implementation of concepts and theories in real-life situations. Thankfully, MSU had the answer. My Applied Statistics study has put me on a great career path. I am particularly proud that I could study as well as teach in a dynamic educational setting. I gained far more practical experience than I had ever wished for. To cap it off, I was thrilled by my strong performance on the master's degree exam and by the marked improvement in my grade point average. Given the progressive improvement in my performance, I am very confident that my desire for a career in educational measurement reflects conscientious and determined progress toward my professional goal. My experiences both as a graduate student and as a teaching assistant really sharpened my intellectual and professional interests.

            Educational researchers are effectively utilized in the testing industry and in state and national departments of education, where program evaluation and development are key activities. But having experienced teaching both as a postgraduate and as a graduate teaching assistant (please see the attached resume), I have no doubt that I am academically inclined. My guiding philosophy is that excellence is attained through honest labor. Consequently, with strong self-motivation and a great drive for acquiring knowledge, my immediate aspiration is to acquire the knowledge and skills that will prepare me for a faculty position in which I can both teach and do research. I have no doubt that a career in Measurement and Quantitative Methods will help me attain great heights in academia.
            The Measurement specialty of the Department will enable me to acquire skills in large-scale assessment and program evaluation, a glimpse of which I experienced not too long ago. As part of the requirements for my master's program, I participated in a sampling survey project whose goal was to develop a short test to measure the writing literacy of undergraduates in Natural Science or in Arts and Letters. We were able to estimate the proportion of students who are writing literate. For my part, I was responsible for determining the sampling plan and the precision and cost required. The result had a margin of error of .04 at 95% confidence.

It would be a dream come true to actually study in a college that serves as a model for other programs of professional education. I just read that the 2007 U.S. News and World Report annual survey ranked the college's elementary and secondary education graduate programs as the best in the nation for the twelfth consecutive year, and that the college had eight programs ranked in the top 10. I have long known that the college is nationally and internationally known for its research on teaching and learning.

That colossal reputation is reflected in the level of research and the quality of the MQM faculty. For lack of space, I will name just two. I am particularly interested in Professor Mark Reckase's areas of expertise, which include computer applications to testing, assessment using performance tasks, and standard setting. Computerized adaptive tests are still not effectively utilized in a developing country such as Nigeria (the country I would focus on). I want to acquire more skills in statistical methodology and computer programming that I could apply to solving problems of examination malpractice and standard setting. Professor Sharif Shakrani's research interests also appeal to me, in particular the analysis of the effects of national and state accountability systems on student achievement and the use of research in setting educational policy. I intend to research issues affecting education in Nigeria, such as the effects of national and state systems on student achievement and performance, especially in science and technology.

In addition, numerous MQM graduate students are researching areas of study that are exceedingly exciting to me, including Raymond Mapuranga's work on large-scale assessments and multiple matrix sampling and Raj Subedi's work on computer-based testing and multilevel modeling.

            A Ph.D. degree in education will be the icing on the cake in my dream of contributing to solving the educational problems of Nigeria. In particular, pursuing a career in Measurement and Quantitative Methods will definitely help me reach my professional goal of embarking on a lifetime of teaching and research on educational issues.

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

End of Semester Statement of Interest

 

Introduction: Let me start by expressing my sincere gratitude to David Wong, Ph.D., the professor for this class. Without his guidance I wouldn’t have been able to accomplish the semester long research development project.  He really put me on track and set the stage for exploring my phenomenon of interest.

 At the outset, it was quite a challenge to choose a phenomenon of interest. Much as it may have been easier for some of my classmates who have had both practical and life experiences, it was not the same for me. More importantly, I did not want to choose a topic just for the sake of completing the class requirements; I really wanted a phenomenon that I could carry forward into an apprenticeship and possibly a dissertation. To start with, it was difficult to narrow down my interest, but thankfully discussions with fellow graduate students, notably D. Subedi, A. Wyse, and Y. Jia, helped me narrow the scope of my phenomenon of interest. Still, I was not surprised at how difficult it was for me to come up with a phenomenon of interest: although I had always wanted to specialize in cheating on large-scale assessments, I had no idea about specific areas of specialization. I also did not realize how much one could achieve by seeking help from fellow students and faculty members.

After jumping the hurdle of specifying a phenomenon of interest, the next challenge was searching for available resources. I started out by merely searching the MSU library for articles and almost gave up hope because I could not find much on the topic. I had thought I could accomplish the task without seeking information from colleagues and faculty members. This quagmire was resolved by Dr. Wong's suggestions, which inspired me to communicate more with experts in my field of study. I was really surprised at the e-mail exchanges that followed from my inquiries and at how much I accomplished afterwards.

 Through the recommendations of Dr. E. Roeber (MQM professor), Dr. J. Martineau (Michigan Department of Education), and A. Wyse, to mention but a few, I was able to get in touch with experts on cheating, especially Dr. J. Wollack (University of Wisconsin-Madison), who gave me resources that I never knew were available. I also had not realized that there was an organization, Caveon Test Security, that specialized in detecting and preventing cheating; I contacted its president, Dr. J. Fremer, who also gave me some resources.

 

Interest(s): I am particularly an advocate of psychological and educational testing because I have had quite a good record with classroom testing and have probably been reinforced to continue in academic endeavors. However, a testing experience that remains indelible in my mind is a high-stakes achievement test I took in my country (Nigeria), administered by the Joint Admissions and Matriculation Board (JAMB). The test is meant for all candidates seeking admission into a university. Since its inception, the JAMB examination has been conducted in a paper-and-pencil multiple-choice format. As a consequence, it has been riddled with all forms of malpractice, and result cancellations are rampant; yet it still remains the criterion for selecting individuals for higher education. When it was time for me to take the test, I barely made the cutoff score for admission into the university in spite of my efforts. In my undergraduate program, I was not surprised to see that many of the top scorers on the JAMB exam (who apparently cheated) turned out to be among the lowest-performing students.

The evolution of computerized adaptive testing (CAT) aroused my interest. I am of the view that this mode of test administration could help control the tendency of unqualified candidates to cheat their way into institutions of higher learning. Cheating has been a cankerworm in the educational sector, and I believe the security potential of computerized testing could curb these anomalies and the invalidity of using test results for admission purposes.

My interest in preventing and detecting cheating was born out of a desire to participate in ensuring “fairness” in educational testing. I never knew that fairness was just the tip of the iceberg, considering the negative consequences cheating has on the technical quality of a test. In the course of researching my RDP paper, I realized that exam malpractice not only makes test scores unreliable but also invalidates the decisions made with the results.

I am pleasantly surprised to find that I have acquired a fair amount of statistical skill that could be used in tackling the problem of cheating. In the course of my exploration I came to understand that statistical techniques such as confidence intervals and hypothesis testing can be applied to detecting cheating behavior. Interestingly, for the first time in my life, I realized that theoretical concepts I had learned (hypothesis testing and confidence intervals) could actually be used as tools for solving real-life problems such as cheating. This pro-seminar class really clarified my interest and defined my potential for applying statistical techniques to educational problems. I gained a wealth of knowledge about the different empirical techniques used in detecting and preventing cheating in the different modes of test administration, namely paper-and-pencil, computer-based, and computerized adaptive testing. I am gratified that I am beginning to transfer my statistical knowledge to real-life problems, which previously seemed a far cry from reality.

I really acquired a breadth of knowledge while working through the series of assignments during the semester. I am determined to be among the educational experts combating the problem of cheating in educational testing. I am particularly happy that I can now engage with the techniques for detecting and preventing cheating in both paper-based and computer-based tests. As a matter of fact, before the pro-seminar class I had no idea how to pursue my interest in cheating. I am convinced that, unless something drastic happens, I will follow the path paved by this seminar class.

  When I look back at how much I have improved over the semester, I am encouraged and hopeful that I can accomplish my lifelong dream of becoming a renowned academic.


 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

FOLLOWING PAGES

 

RDP ANNOTATIONS (TOTAL = 20)

 

ARRANGED CHRONOLOGICALLY

 

TWO PER PAGE (EXCEPT TWO)

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

Bird, C. (1927). The detection of cheating on objective examinations. School and          Society, 25(635), 261-262.

 

Bird introduced the first empirical method of detecting answer copying in objective examinations. Four examinees were identified by proctors as exhibiting cheating behavior, and the examination papers of the suspected copiers and sources were compared for identical errors. Twenty papers were selected at random, and each was compared with the papers of the four suspects for identical errors. On average, the paired comparisons with the twenty papers yielded 4 identical errors. Compared with the identical errors in the paired comparisons between copier and source, it was obvious that their errors could not have occurred by chance. To increase the power of detection, the random sample was increased to 100; comparisons of the 100 pairs of examinees yielded a mean identical error of 4.35 and a measurement error of .16. An upper confidence limit was determined, and again it was quite obvious that the four examinees had cheated.

            This research gave me an intuitive idea of how to apply statistical techniques to detecting cheating. It is also noteworthy that this particular study cannot be generalized to all objective tests, because the number of identical errors varies with the number of questions, the difficulty of the items, and their subtlety.

 

 

 

 

 

Bird, C. (1929). An Improved method of detecting cheating in objective        examinations. Journal of Educational Research, 19(5), 341-348.

 

In this article, Bird applied an empirical and experimental technique to detecting answer copying. Statistical detection was triggered by reports from proctors. There were two groups of examinees: the experimental group was forewarned about the method of detection, while the control group was given only minimal instruction to focus on their own papers. Two forms of an objective test were given, and the examinees took the test in two different locations. The study compared the distribution of identical errors of reported cheaters with those of geographically separated examinees who could not have cheated. A 68% upper confidence limit was computed for the geographically separated paired comparisons and for the suspected paired comparisons. Paired comparisons with identical errors exceeding the average upper confidence limit of the geographically separated comparisons were confirmed as cheating. In all, 29 examinees were corroborated as having cheated on the examination.

        I am really fascinated by this method of detecting cheating by comparing the distribution of identical errors of honest examinees with those of suspected cheaters. I am particularly interested in the sampling considerations in determining the chance distribution of identical errors, namely ensuring different locations, letter-grade combinations, and comparison with others in different locations with the same grade as the neighbor.

 

Crawford, C.C. (1930). Dishonesty in objective tests. School Review, 38(10),       776-781.

 

In a nutshell, this technique of detecting answer copying involved: counting the number of errors and the number of identical errors of suspected cheaters; determining the quotient of the number of identical errors divided by the number of errors for the pair of suspected cheaters; computing the corresponding percentages for papers chosen at random from examinees not sitting together; comparing the identical errors of the random sample with those of the suspected cheaters; and finally determining the statistical significance of the difference between the identical errors of the cheaters and those of the random sample (pp. 778-779). This method was used to corroborate reported cases of answer copying. Suspected examinees' identical errors were compared with the average identical errors of the random sample using confidence intervals. The test-retest method was used to identify the “copier” and the “copied”: the examinee with the greater discrepancy in performance over the two test administrations was regarded as the one who copied. Past academic records also helped in making decisions about the copier.

        Crawford really detailed the method of detecting cheaters through identical-error analysis; indeed, he left little to be desired in his treatment of the method.

 

 

 

 

 

 

 

Dickenson, H.P. (1945). Identical errors and deception. Journal of Educational         Research, 38(7), 534-542.

 

This was the first chance method of detecting answer copying. Dickenson derived the number of ways students could respond to multiple-choice items. Given N choices and two examinees, by applying probability principles the number of different combinations of answers is N², and the number of possible identical pairs of responses is N. The identical error (IE) rate was given as the ratio of the number of distracters (N - 1) to the number of different combinations of answers (N²). Using this formula, a table was generated containing the average percentages for different numbers of answer choices and the deception range of identical errors (p. 535). The range of probable percentages of identical errors varied from zero to twice the mean of identical errors. These results were shown to be consistent when applied to a hypothetical test. A formula was given for the upper bound of identical errors, IEm = 2(T1P1 + T2P2 + ... + TnPn), where T1, T2, ..., Tn represent the numbers of items of each type and P1, P2, ..., Pn represent the corresponding mean numbers of errors for those items.

            The distribution with which the cheaters are compared was obtained by an intuitive application of the combination principle of probability. It was really interesting to learn that one could apply combinatorics to detecting cheating behavior.

Anikeef, A.M. (1954). Index of collaboration for test administrators. Journal of          Applied Psychology, 38(3), 174-177.

 

This paper discussed a chance index for detecting answer copying in multiple-choice tests. By this method, the distracters selected by the suspected cheaters were compared for similarity. It was assumed that, given any number of common wrong answers (n) by a pair of examinees, the number of identical wrong answers follows a binomial distribution with mean np and variance npq, where p is the probability that the same distracter was selected given an incorrect response. For a multiple-choice item with N options, p is the reciprocal of N. A table was generated at different confidence levels for different numbers of observed wrong responses, showing the number of identical errors, for a given number of wrong responses, needed to establish collaboration. The method was tested on an experimental group and was found to be effective in detecting collaborative behavior not observed during the test, but least effective in detecting answer copying from several sources (p. 177).

It is really interesting that statistical distributions such as the binomial and the normal could be applied in quantifying and detecting cheating especially in large scale assessments.

 

 

 

 

 

 

 

 

 

 

Angoff, W.H.(1974). The development of Statistical indices for detecting         cheaters. Journal of American Statistical Association, 69(345), 44-49.

 

This paper introduced indices for detecting answer copying based on both identical wrong and identical right answers. Chance distributions of identical errors were based on pairs of “honest” examinees and were used as a norm for detecting future cheaters (p. 44). The norm group consisted of three samples taken from SAT data. Angoff introduced and evaluated eight answer-copying indices, labeled A to H. Index B was identified as the most viable answer-copying detection technique; it used the number of wrong responses as the independent variable and the number of identical wrong responses of the suspected examinees as the dependent variable. Using the B index, a standardized statistic was computed to compare the identical errors of the norm group with those of suspected pairs of examinees. It was highlighted that the two best indices, “B and H[,] have been in operational use at Educational Testing Service” (p. 49).

            Angoff's method took the detection of answer copying a step further through a more detailed analysis of the answers of suspected examinees, and it was shown to be more powerful in detecting cheaters.

Frary, R.B., Tideman, T.N., & Watts, T.M. (1977). Indices of cheating on         multiple-choice tests. Journal of Educational Statistics, 2, 235-256.

 

This paper investigated techniques for detecting answer copying on multiple-choice tests based on correct and wrong responses. Two indices, denoted g1 and g2, were developed and compared. The item statistics were independent of the examinees' abilities, and the expected number of identical responses for a pair of examinees depended only on those two examinees, unlike the earlier detection methods in which the expected identical errors (as opposed to identical responses) depended on the performance of other examinees on the test. g2 was shown to be the more reliable technique because it had higher detection power and better control of false detections. By this technique, one examinee is identified as the copier and the other as the source; the source's responses are treated as fixed, and the indices are standardized statistics.

This technique was more efficient than the earlier approaches for detecting answer copying because one does not have to go through the rigorous process of obtaining a random sample with which to compare.

 

 

 

 

 

 

 

 

 

 

 

 

Kingsbury, G.G., & Zara, A.R. (1989). Procedures for selecting items for      computerized adaptive tests. Applied Measurement in Education, 2,   359-375.

 

This study evaluated classical item selection techniques in computerized adaptive testing, including pre-structured and variable-step-size procedures. The stradaptive test was the most efficient of the fixed-step-size techniques. The variable-step-size procedures, such as the maximum-information and Bayesian methods, were shown to be more precise in estimating ability, although they were inefficient when searching large item pools for the maximally informative or minimal-posterior-variance item to administer. Three efficient techniques were introduced based on their ability to eliminate item search delay: self-adapted testing allows the examinee to choose the desired difficulty stratum, while the testlet technique handles context effects. Finally, constrained computerized adaptive testing (C-CAT) procedures were added, to be used in conjunction with the classical approaches for maintaining high measurement precision and short test length in applied testing.

            This study really detailed the different item selection techniques and their relative merits and demerits. It is very helpful to juxtapose these techniques when evaluating their worth.

Bergstrom, B.A., Lunz, M.E., & Gershon, R.C.(1992). Altering the level of          difficulty in computer adaptive testing. Applied Measurement in           Education, 5, 137-149.

 

In this paper, the impact of test difficulty and length on examinee ability estimation in computerized adaptive testing was considered. It was shown that changing the difficulty level of the test does not affect the estimate of ability and that the change in the number of items required for a given standard error of measurement is negligible. This result was obtained from a study of 225 examinees randomly assigned to three test conditions, hard, medium, and easy, based on difficulty levels of 50%, 60%, and 70%, respectively.

This study is important for recalibration and for the design of computerized adaptive tests that can estimate examinee ability efficiently.

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

Gerson, R. (1995) Does Cheating on CAT pay: Not!

        Paper presented at the annual meeting of American Educational         Research Association, San Francisco, CA.

 

This article examined how cheating behavior occurs in computerized adaptive testing when an examinee purposely answers initial items incorrectly in order to get easier questions and subsequently corrects the wrong answers. By this strategy the examinee takes a set of items not targeted to his or her ability, thus defeating the purpose of the test. The consequences of this strategy for the examinee and the test developer were examined; the strategy negatively affected the examinee's ability estimate and the standard error of measurement (SEM).

Computerized adaptive tests could be fairer if decisions were based on the number of items answered correctly within the allotted time instead of the number answered without mistakes. Time could be used to discriminate among individuals even for a norm-referenced test. Allowing review in computerized adaptive tests would be reinforcing for examinees and could also improve ability estimates.

Wollack, J.A. (1997). A nominal response model approach to detect            answer copying. Applied Psychological Measurement, 21, 307-320.

 

 

Wollack introduced an item response theory-based statistic (ω) for detecting answer copying. The statistic was computed from estimated examinee ability and item parameters. The source's responses were held fixed, and the copier's responses were compared with them for identical responses. The statistic was computed as the difference between the observed number of identical responses and the sum over items of the probabilities of an identical response, divided by the standard deviation.

 He compared this statistic to the best classical test theory statistic (g2). The comparison was based on copying conditions, test lengths, and sample sizes. ω was effective in controlling false detection, and its power of detection depended on test length and the percentage of items copied. Power of detection was increased when a seating chart was provided that indicated potential copiers and sources to be investigated. Based on these results, ω was adopted as the more useful technique for detecting answer copying.

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

McLeod, L.D., & Lewis, C. (1999). Detecting item memorization in the CAT            environment. Applied Measurement in Education, 23(2), 147-160.

 

This research applied a fit index to detect items in the pool that had been memorized. In general, the index was most effective when sources had higher ability estimates, when the percentage of beneficiaries was higher, and when there were more sources. Because test items were used over time, sources combined item lists and provided them to beneficiaries, who memorized the items and were able to answer them correctly on the test. Memorizing item lists from more sources resulted in higher ability estimates. Low-ability test takers were helped most by regular sources, while high-ability test takers were helped most by highly proficient sources.

            This work is interesting and emphasizes the need for test developers to work out strategies for securing test items as well as replacing exposed items. It is better to focus on controlling item exposure rather than on specific cheating behaviors, because new behaviors will always emerge.

Cizek, G.J. (1999).Cheating on Tests: How to do it, detect it, and prevent it.          Mahwah, N.J: Lawrence Erlbaum Associates.

 

 

Cizek classified cheating in educational and psychological testing as: “cheating by taking, giving, or receiving information, cheating through the use of forbidden materials or information, and cheating by circumventing the process of assessment” (p.39). Examples of these classes included: answer copying, use of books or notes, and impersonation respectively.

He emphasized that selected-response test formats are more prone to cheating and supported this with ample empirical research. The methods of detecting answer copying on tests were classified as observational and statistical. He highlighted that observation, although subjective, remains the most acclaimed method (p. 133). Statistical methods are applied to cases of answer copying and impersonation triggered by observation. He referenced two statistical models for detecting cheating given by Saupe (1960), namely chance and empirical. The demerits of the statistical methods described include limited applicability, inability to detect complex strategies, and relevance mostly to large-scale testing.

Cizek reported ways of preventing cheating in three broad subheadings namely: classroom, institutional and educators’ perspectives. In the classroom situations the mechanisms included: improved test administration conditions, attentive proctoring, documented or communicated information about academic honesty, designing good tests, controlled testing situations etc. The institutional perspective included: established policies, procedures & Honor codes. The Educators’ perspective included avoiding the general practice of teaching to the test. 

I gained a wealth of information about cheating, especially how it is perpetrated, perceived, prevented, and detected. The book gives a detailed literature review of the phenomenon of cheating.


Chang, H.H., Qian, J., & Ying, Z. (2001). a-Stratified multistage computer adaptive testing. Applied Psychological Measurement, 23, 211-222.

 

This paper improved on the computerized adaptive testing item selection procedure introduced by Chang & Ying (1999). Chang and Ying's approach grouped the items in the bank by their discriminating power, giving k groups arranged in ascending order of discrimination. At each of the k stages of item selection, the item with difficulty closest to the current estimate of the examinee's ability was administered. The technique assumed that item discrimination is independent of difficulty, although in practice the two are shown to be positively correlated.

Under the current technique, items were first grouped into M difficulty levels, and the items within each difficulty level were then grouped into k discrimination levels. Finally, k strata were formed by pooling items of the same discrimination level across the M difficulty levels. This technique therefore distributed items of varying difficulty within each discrimination level and provided a better match of items across examinees. At each of the k stages of item selection, the item closest to the current ability estimate of the examinee was administered. Simulation studies conducted with a retired item bank from the GRE quantitative test showed that this technique controls item exposure.
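A simplified sketch of this stratified selection logic is given below, assuming a and b are NumPy arrays of discrimination and difficulty parameters. The block sizes, helper names, and selection rule are illustrative, not the authors' implementation.

import numpy as np

def build_strata(a, b, k):
    # Items are first blocked by difficulty (b); within each block they are split
    # into k slices by discrimination (a); stratum j pools the j-th slice of every block.
    order_by_b = np.argsort(b)
    n_blocks = max(1, len(b) // k)
    blocks = np.array_split(order_by_b, n_blocks)
    strata = [[] for _ in range(k)]
    for block in blocks:
        by_a = block[np.argsort(a[block])]                 # low to high discrimination
        for j, piece in enumerate(np.array_split(by_a, k)):
            strata[j].extend(piece.tolist())
    return [np.array(s, dtype=int) for s in strata]

def pick_item(stratum, b, theta_hat, used):
    # Within the current stratum, administer the unused item whose difficulty
    # is closest to the provisional ability estimate.
    candidates = [i for i in stratum if i not in used]
    return min(candidates, key=lambda i: abs(b[i] - theta_hat))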

This technique has the potential to solve some of the problems associated with computerized adaptive testing, namely content balancing, item overexposure, and item selection in constrained CAT.


Segall, D.O. (2002). An item response model for characterizing test compromise. Journal of Educational and Behavioral Statistics, 27, 163-179.

 

In this article an item response model was used to estimate and characterize the item-preview and score-gain distributions observed in on-demand high-stakes testing programs. A simulation study evaluated the performance of a Markov chain Monte Carlo (MCMC) procedure under three levels of compromise: zero, moderate, and severe. The simulation showed that, when some items are known to be secure, the MCMC procedure provides information on the extent of test compromise and its impact on the test scores of a small group of examinees.
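To make the score-gain idea concrete, here is a small Monte Carlo illustration of the quantity being modeled. This sketch is not Segall's MCMC estimation procedure; the item probabilities and the previewed subset are hypothetical.

import numpy as np

def simulate_score_gain(p_correct, preview_mask, n_rep=10_000, seed=0):
    # Compare number-correct scores with and without the benefit of previewing
    # a known subset of items.
    # p_correct: model-implied probability of a correct answer on each item.
    # preview_mask: boolean array marking items assumed to have been previewed.
    rng = np.random.default_rng(seed)
    honest = rng.random((n_rep, len(p_correct))) < p_correct
    with_preview = honest | np.asarray(preview_mask, dtype=bool)
    return with_preview.sum(axis=1) - honest.sum(axis=1)

gains = simulate_score_gain(np.full(30, 0.6), np.arange(30) < 6)
print(gains.mean(), np.percentile(gains, 95))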

The item response model is an efficient technique for detecting cheaters on multiple-choice tests. It saves one the trouble of sampling, since the response patterns of suspected cheaters are compared against the model for violations.

Van der Linden, W.J. (2003). Some alternatives to Sympson-Hetter item-exposure control in computerized adaptive testing. Journal of Educational and Behavioral Statistics, 28, 249-265.

 

The Sympson-Hetter technique laid the foundation for probabilistic control of item exposure. Under this technique, items are selected before the exposure-control constraint is applied, so setting the control parameters requires lengthy simulations. This paper reviewed this approach to item-exposure control, especially its demerits. Alternatives were derived by modifying the control parameters used in the Sympson-Hetter technique, and simulation studies based on Law School Admission Test (LSAT) data were conducted to evaluate their performance. Two of the alternatives were shown to be viable for estimating ability.
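A minimal sketch of the Sympson-Hetter mechanism that the paper takes as its starting point is shown below. The function names and the simplified update rule are illustrative; the actual calibration iterates over repeated CAT simulations.

import numpy as np

def administer(selected_item, K, rng):
    # After an item is SELECTED, it is actually ADMINISTERED only with
    # probability K[item]; otherwise the next-best item would be tried.
    return rng.random() < K[selected_item]

def adjust_K(select_prob, r_max):
    # One pass of the iterative adjustment run in simulation: cap the
    # administration probability of items selected more often than the
    # target exposure rate r_max (simplified statement of the update rule).
    select_prob = np.asarray(select_prob, dtype=float)
    return np.where(select_prob > r_max, r_max / np.maximum(select_prob, 1e-9), 1.0)

print(adjust_K([0.60, 0.15, 0.02], r_max=0.20))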

Controlling item exposure is certainly a worthwhile effort, although it may compromise the precision of ability estimates. Besides, in this age of high-speed Internet, one may never be able to detect all cheating behaviors.


Kaplan, D. (Ed.). (2004). The SAGE handbook of quantitative methodology for the social sciences (4th ed.), 117-133.

 

In this chapter CAT was discussed and shown to measure an individual's ability with more precision, because items whose difficulty is commensurate with the examinee's ability are administered. CAT has many advantages over the paper-and-pencil format, namely new question formats, faster data analysis, and faster scoring and score reporting. A major threat to test validity since CAT was introduced concerns item security and item pool usage: under IRT-based CAT certain items are selected frequently while others are rarely used, creating the twin problems of item overexposure and an underutilized item bank. Hua-Hua Chang reported that Kaplan Educational Centers tested the exposure of items in the GRE-CAT, and in another case it was discovered on August 6, 2002 that a number of Asian-language web sites had posted questions from the operational GRE General Test.

Models for CAT include the three-parameter, two-parameter, and one-parameter logistic models. Lord's maximum-information item selection exposes items at differing rates, and the max-I approach resulted in both underestimation and overestimation of ability. A security strategy in CAT would be to set a minimum for the item-sharing and item-pooling indices; a table of lower bounds for item overlap rates served as a guide for practitioners in evaluating test security.
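Two of the security diagnostics mentioned here, item exposure rates and the average item overlap rate, can be computed directly from administration records. A toy sketch follows; the data and function are hypothetical, and the overlap calculation assumes fixed-length tests.

import numpy as np
from itertools import combinations

def exposure_and_overlap(administrations, bank_size):
    # administrations: one set of item indices per examinee (items actually given).
    counts = np.zeros(bank_size)
    for items in administrations:
        for i in items:
            counts[i] += 1
    exposure = counts / len(administrations)               # per-item exposure rate
    overlaps = [len(s & t) / len(s) for s, t in combinations(administrations, 2)]
    return exposure, float(np.mean(overlaps))              # average item overlap rate

# Toy example: an 8-item bank, three examinees, fixed-length 4-item tests.
admins = [{0, 1, 2, 5}, {0, 2, 3, 7}, {1, 2, 4, 5}]
exposure, avg_overlap = exposure_and_overlap(admins, bank_size=8)
print(exposure, avg_overlap)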

Controlling item exposure is a key step toward validating the decisions made from examinees' test scores in computer adaptive testing.

Van der Linden, W.J., & Veldkamp, B.P. (2004). Constraining item-exposure rates in computerized adaptive testing with shadow tests. Journal of Educational and Behavioral Statistics, 29, 273-291.

 

This paper considered a technique of probabilistic constraints imposed on items that have a tendency to be overexposed. The approach is efficient because the constraints are applied before items are selected: for each examinee, eligible items are retained in the pool while ineligible ones are removed. A simulation study was carried out using the Law School Admission Test (LSAT). The technique was shown to have negligible impact on the statistical properties of the ability estimator, and the results showed that the method is effective in controlling item exposure.
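A rough sketch of the eligibility idea described above is given below. The probability rule, names, and numbers are illustrative assumptions, not the authors' derivation.

import numpy as np

def draw_eligibility(exposure_rates, r_max, rng):
    # Before an examinee is tested, each currently over-exposed item is declared
    # ineligible with some probability so that its long-run exposure moves back
    # toward the target r_max.
    exposure_rates = np.asarray(exposure_rates, dtype=float)
    p_ineligible = np.clip(1.0 - r_max / np.maximum(exposure_rates, 1e-9), 0.0, 1.0)
    return rng.random(len(exposure_rates)) >= p_ineligible    # True = stays eligible

rng = np.random.default_rng(1)
running_rates = np.array([0.45, 0.10, 0.30, 0.02])
print(draw_eligibility(running_rates, r_max=0.20, rng=rng))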

            Cheaters are unrelenting and always improvise means of outwitting the test developers. This approach is an excellent way of controlling cheating behavior.


Impara, J., Caveon, L., Kingsbury, G., et al. (2005). Detecting cheating in computer adaptive tests using data forensics. Paper presented at the annual meeting of the National Council on Measurement in Education and the National Association of Test Directors, Montreal, Canada.

 

This paper analyzed data forensic methods such as aberrance, collusion, and score volatility. Examinee misconduct on CAT was detected by checking for unusual item response patterns or response times. The analyses examined four types of test security risk, namely collusion, cheating, piracy, and volatile retakes. The technique was illustrated with an educational assessment program that used computerized adaptive testing (CAT) and was shown to be effective.
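A toy sketch of two such forensic checks, score volatility and response-time aberrance, is shown below. The cutoffs and helper functions are hypothetical illustrations, not the proprietary indices used in the paper.

import numpy as np

def flag_volatile_retake(score_first, score_retake, sem, k=3.0):
    # Score-volatility check: flag a retake whose gain exceeds k standard errors
    # of measurement; the threshold k is an assumption, not a published cutoff.
    return (score_retake - score_first) > k * sem

def flag_speed_aberrance(response_times, floor_seconds=3.0, max_fraction=0.25):
    # Response-time aberrance check: flag an examinee who answers an implausibly
    # large share of items faster than a floor time; both constants are assumptions.
    times = np.asarray(response_times, dtype=float)
    return float(np.mean(times < floor_seconds)) > max_fraction

print(flag_volatile_retake(score_first=210, score_retake=265, sem=12))
print(flag_speed_aberrance([2.1, 1.8, 40.0, 2.5, 3.3, 55.0, 1.2, 2.0]))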

Although data forensics was effective in detecting unusual response patterns, the ultimate goal should be to incorporate exposure-control measures in CAT. It is essential that test developers change course and focus on item security rather than on the behavior of individual test takers.

Cohen, A.S., & Wollack, J.A. (2006). Test administration, security, scoring, and reporting. In R.L. Brennan (Ed.), Educational measurement (4th ed., pp. 356-386).

 

In this article it was mentioned that the earliest forms of testing and subsequently concern about examination security dates back to ancient China. “Examinees developed strategies to aid them in answering questions, including creating miniature books of notes, writing passages on undergarments, copying notes on back of fans and hiring impersonators”(p. 361).

Teachers and administrators’ alleged cheating behaviors included: “changing students’ answers after the test, teaching to the test, illegal coaching during a test, using live exams as practice exams, reading off answers during a test, providing extra time, disallowing low-achieving students from testing” (p. 369).

Technological advancement and improved psychometric theories have made test administration, scoring, and reporting more efficient, but they have also brought about more security concerns. People are motivated to cheat because of “factors such as technology, widespread testing, and increased stakes associated with tests” (p. 356). Technology-based cheating strategies include communicating answers through text messaging; sharing test material electronically via websites, e-mail, or chat rooms; and using calculators, pagers, voice recorders, personal digital assistants (PDAs), concealed video cameras, cell phones, and iPods to scan and copy test questions. On the basis of sophistication, cheating can be classified as high-tech or low-tech.

The authors classified cheating into individual and collaborative based on its impact on the testing program. Collaborative cheating is enhanced in computer adaptive testing (CAT), where the item selection procedure is such that a team of individuals can reproduce a large chunk of the item bank. Item overexposure in CAT was most common with the maximum-information item selection method. Countermeasures for curbing cheating include human observation and prevention, electronic measures, and psychometric measures.

            The importance of maintaining test security cannot be over-emphasized since it is at the core of reliability of test data and validity of decisions made with them.


FOLLOWING PAGES

ANNOTATED GLOSSARY

ARRANGED ALPHABETICALLY


Annotated Glossary

 

A

Aberrance                                 : unusual response patterns or time

 

Aberrance rate                         : proportion of test items that indicate cheating

 

Aberrance score                      : statistic that compares observed responses with those expected under a response model

 

Aberrance Threshold              : percentage of aberrance that indicates cheating

 

Adaptive testing                      : technique of testing where test items are selected according to the examinee's estimated ability

 

Alpha                                         : type I error rate

 

Alternative hypothesis            : statement of effect

 

Average item overlap            : percentage of items common to two test takers at                                                      the same ability level

 

Average item overlap rate    : average of the item overlaps of all pairs of                                                      examinees.

 

B

Bank Width                          : number of items in the item bank

 

b blocking                               : grouping items by difficulty to prevent exposure

 

Bayesian                                  : item selection method where items that most reduce the posterior variance are chosen

 

 

C

CAT                                     : computer based test where items are tailored to                                                   examinees’ ability

 

C-CAT                                     : constrained item selection method in CAT

 

Chance distribution               : distribution of identical errors of examinees                                                      seated in nonadjacent locations

 

Chance Methods                  : technique of detecting cheating by comparison of                                                    unusual responses to a known distribution

Cheating                                 : violating rules of testing

 

Cheating index                       : statistical index for detecting cheating

 

Collaboration                         : dissemination of information amongst test takers                                                     while testing

 

Collusion                                 : examinees’ sharing answers when testing

 

Countermeasures                  : techniques used to detect and prevent cheating

 

 

D

Deception                              : cheating by answer copying

 

Detection technique             : quantitative comparison of identical errors of                                                     cheaters with those of the chance distribution

 

Difficulty level                        : probability of a correct response

 

Dishonesty                              : irregularities during examination

 

 

E

Efficiency of Bank Utilization : percentage of items in the test bank that were                                                     administered to at least one examinee

 

Empirical Methods                 : technique of detecting cheating by comparison of                                                      unusual to chance response patterns

 

Exposure rate                         : percentage of tests administered in which an                                                     item was included

 

 

F

Fit Index                               : tracks unusual response patterns 

 

 

H

High score aberrance rate  : proportion of passed tests that are unusual

 

High-score Threshold          : Percentile above which test score is “high”

 

Honor codes                         : Institutional communal prevention of cheating

 

Hypothesis testing                 : procedure for testing a research question

 

I

Identical errors                      : the number of identical wrong responses between two examinees

 

Index of Collaboration         : determining within specified levels of confidence                                                    whether answer was shared between examinees

 

Individual security breach    : one individual working independently to cheat

 

Item Response Theory (IRT)     : statistical framework underlying CAT where items are selected by their psychometric properties

 

Item exposure                        : tendency of over selecting some items in the pool

 

Item exposure control           : constraining item selection

 

Item exposure rate                : the proportion of times that an item is selected to be administered in a CAT

 

Iterative adjustment              : repetitive setting of exposure control parameters

 

Item overlap rate                  : ratio of the expected number of overlapping items                                                    encountered by two randomly sampled examinees

 

Item selection                         : choosing an item to be administered in CAT

 

Item sharing (Xa)                     : the number of common items shared by a group                                                    of a test takers. X2 has a hypergeometric distribution

 

Item Sharing index                : the expected value of item sharing

 

Item pooling (Ya)                   : the number of items shared by a test taker with the other a test takers who have already taken the test; Y1 = X2, but Ya ≠ Xa+1 for a ≥ 2

 

Item pooling index                : the expected value of item pooling

 

 

L

 

Low score aberrance rate   : proportion of failed tests that are aberrant

 

 

 

 

 

M

MCMC                             : a simulation-based sampling procedure for estimating parameters

Mean identical errors    : the number of item choices less one, divided by the                                             square of the number of answer choices

 

Mean score                      : average test score

 

Memorization                : prior knowledge about questions on a test

 

N

Null hypothesis                : statement of no effect

 

O

Objective Tests                : format in which answer choices are supplied

 

Observational methods : using proctors or administrators to track suspicious                                             behavior or obtain physical evidence of cheating

 

P

Piracy                                : unauthorized use of published test instrument

 

Psychometric                   : entail the use of statistical and data analysis                                              approaches to prevent or detect cheating

 

 

S

 

Score volatility                 : extreme variance in scores of an examinee

 

Standard Error of Measurement (SEM) : measure of the deviation of observed ability from true ability

                                                  

Shadow tests                  : set of items that satisfy exposure constraints 

 

Statistical evidence        : identical errors exceeding the upper limit of confidence

 

Statistical significance    : the number of identical errors in proximate papers in                                              excess of the upper limits of chance distribution

 

Statistical indices             : techniques for detecting cheating

 

Stratification                     : dividing test items into more homogeneous subgroups based on a common criterion

 

Sympson-Hetter method : probabilistic technique of item exposure control

 

 

T

Type 1 error                           : false positives

 

Type 2 error                           : false negatives

 

Test administrator                 : personnel in charge of proctoring the exam

 

Test compromise                  : pre-knowledge of items on a test

 

Testlets                                   : pre-structured test segments

                                

Test Security                          : controlling item exposure in CAT

 

 

V

Validity                                   : evidence that inferences about a person's knowledge are accurate


FOLLOWING PAGES

 

SCHOLARS, ORGANIZATIONS, PUBLICATIONS (JOURNALS) & CONFERENCES


Scholars

Cizek, G., Ph.D.

Cizek's research interests include standard setting, testing policy, classroom assessment, and cheating on tests. He is the author of more than 250 journal articles, book chapters, conference papers, and other publications. His work is important because it is at the core of the validity of inferences made from test scores, which is especially essential in high-stakes examinations.

 

Fremer, J., Ph.D

Fremer's research interest is in developing strategies to deter and detect test cheating. John Fremer is President and one of the founders of Caveon Test Security, a company that helps test program sponsors, testing agencies, and others improve security practices in all phases of test development, administration, reporting, and score use. John has 40 years of experience in the field of testing. I value his practical experience with cheating issues.

 

McLeod, L., Ph.D

She is an expert in psychometrics, patient-reported outcomes, health status and quality-of-life assessment, and health-outcomes strategy. She has over 10 years of experience in instrument development and has conducted many psychometric evaluations of both paper-and-pencil and computer-administered instruments. She has contributed to research in computerized adaptive testing and item response theory. Her work on detecting item memorization in computer adaptive tests is particularly important for preventing cheating in large-scale assessment.

 

Stocking, M., Ph.D

Her research is on item exposure control in computer adaptive tests, an area in which she has done extensive work. Her work is important because controlling item exposure helps prevent cheating behavior and therefore improves the validity of inferences made from test scores.

 

Van der Linden, W., Ph.D

His research interests include test theory applications to behavioral measurement, computerized adaptive testing, optimal test assembly, response-time models, statistical procedures for test equating, and statistical decision theory and its applications to standard setting, selection, classification, mastery, and placement decisions. His work on exposure control in computer adaptive tests is particularly important in preventing cheating by memorization of items in the item bank.

 

Wollack, J., Ph.D.

Wollack's research is on test security, with specific interest in the detection of cheating. He is also interested in the impact of violating assumptions underlying common measurement models. His work on test security, especially detecting cheating, is significant because cheating reduces the reliability of test data and the validity of decisions made from them.

Organizations

 

American Educational Research Association (AERA)

 

American Psychological Association (APA)

 

Association of Test Publishers (ATP)

 

Association for Assessment in Counseling (AAC)

 

Educational Testing Service (ETS)

 

The Psychological Corporation/Harcourt Assessment

 

National Association of Test Directors (NATD)

 

National Council on Measurement in Education (NCME)

 

 

Publications (Journals)

 

Applied Psychological Measurement

 

Applied Measurement in Education

 

British Journal of Mathematical and Statistical Psychology

 

Educational Measurement

 

Educational Measurement: Issues and Practice.

 

Educational and Psychological Measurement

 

Journal of Educational Measurement

 

Journal of Educational and Behavioral Statistics

 

Journal of American Statistical Association

 

Journal of Applied Psychology

 

Journal of Educational Research

 

Journal of Educational Statistics

 

Psychometrika

 

 

Conferences

 

American Educational Research Association conference

 

National Council on Measurement in Education conference

 

National and international testing conference