GOR 2001 - content
Document: http://kiwi.uni-psych.gwdg.de/congress/gor-2001/contrib/contrib/preckel-franzis/preckel-franzis
Main Author: Preckel, Franzis
Co-Authors: Thiemann, Helge
Institution: Psychologisches Institut IV, Westfälische Wilhelms-Universität Münster
Contribution Title: Testing intellectual giftedness on the Web: Development of a new Figural Matrices Test online versus paper-and-pencil-version.
Authors Email: preckfr@psy.uni-muenster.de
Abstract (English; version: 25/06/2002 - 07:47)
A new figural matrices test for the assessment of intellectual giftedness will be presented. Item construction was based on a rule taxonomy that takes recent empirical findings on the difficulty of item components into account. These components are number and type of rules and drawing features. The construction of the answer alternatives was also guided by the rule taxonomy which allows the validation of the theoretical assumptions of item construction by means of error and distractor analysis. The test was presented as an online-version (www.begabungsprofil.de) and as a paper-and-pencil-version. Participants were members of the Mensa Society, pupils of special schools or programs for the intellectually gifted and other volunteers. The collected data were analyzed by classical test theory, by modern item response theory (1PL- and 2PL-models), and by rough set analysis referring to the rule taxonomy. The item- and test-properties of the online-version and of the paper-and-pencil-version will be compared. Limitations and possibilities of the collection of psychometric test data via internet will be discussed.
Article (version: 25/06/2002 - 07:47)
1. Introduction
Computer assisted testing (CAT) offers the advantages of increased standardization and economy of test administration. Online testing via the internet can be seen as a special case of computer assisted testing that brings new opportunities and risks for psychometric assessment. Wilhelm and McKnight (2000) stress the lack of person-mediated communication (e.g. comprehension of the test instructions) and the reduction in experimental control (e.g. self-selection of the participants, use of auxiliary means) as the two main differences between CAT and online testing. Numerous tests can be found on the internet, but many of them are at best suitable for amusement and cannot be considered valid assessment tools. In this paper we examine the equivalence of a paper-and-pencil (P&P) version and an online version of a newly developed figural matrices test for the assessment of intellectual giftedness (HBMT, HochbegabungsMatrizenTest).
To our knowledge, no assessment tool is currently available in Germany that is suitable for the assessment of high intellectual ability. Existing tests suffer from ceiling effects, and the reliability of the measurement is not guaranteed for high-scoring subjects. Quantitative differences between average and high intellectual ability are best identified with item material loading high on the g-factor (Robinson & Janos, 1987; Rost, 2000). In factor-analytic studies, figural matrices tasks show mean loadings of .80 on the g-factor (Jensen, 1998). Several different research approaches demonstrate that figural matrices tests measure processes that are central to analytic intelligence (e.g. Carroll, 1993; Marshalek, Lohman & Snow, 1983). Therefore this item type seems to be especially suitable for differentiating between average and high intellectual ability.
The equivalence of CAT- and P&P-versions can be examined under the aspects of quantitative and qualitative equivalence (van de Vijver, 1994). Whereas qualitative equivalence refers to construct validity, quantitative equivalence mainly addresses the question of whether the norm data of one test version can be applied to the other test version. A meta-analytic study on the equivalence of CAT- and P&P-cognitive ability tests by Mead and Drasgow (1993), which included 159 studies, found that for timed power tests (tests without time constraints or with very generous time allocations) both aspects of equivalence are met if the CAT-version is an exact copy of the P&P-version. Wilhelm and McKnight (2000) examined the equivalence of P&P- and online-tests of deductive reasoning tasks. Their results showed that reliable and valid data could be collected with online-tests for the item type under study. In addition, the influence of test modality was examined by applying a mixture distribution Rasch-model to the whole dataset (P&P- and online-data). No significant correlation was found between modality and class membership of the participants, indicating "that the administration method has no substantial influence on the answer patterns participants generate under different administration methods ..." (Wilhelm & McKnight, 2000, pp.13-14).
Item construction of the HBMT was based on a rule taxonomy. Therefore, the equivalence of the two test modalities of the HBMT can be examined under a qualitative aspect referring to the internal structure of the data. The internal structure is defined by the generating principles underlying item construction. Examining the quantitative equivalence of online- and P&P-tests proves to be problematic, because comparability of the samples working on the different versions can be ensured only in a setting that is at least quasi-experimental.
2. Rule-based item-construction
According to Carpenter, Just and Shell's (1990) processing theory of matrix completion tasks, two cognitive processes are relevant for task solution: correspondence finding, which is based on abstraction capacity, and goal management, which involves working memory capacity. Abstraction capacity is assumed to be equally required across matrix completion tasks (Embretson, 1995). The working memory load is influenced by task complexity, which is determined by three features: the number, the kind and the distinctiveness of the sub-problems (Carpenter et al., 1990; Vodegel Matzen, 1994; Embretson, 1998). Carpenter et al. (1990) show that problem solvers build up a complete representation of all elements of a task by successively working on the different sub-problems. With a growing number of sub-problems, working memory load and task difficulty increase, because the problem solver has to decompose the task into sub-problems and manage the various solutions.
The item construction of the HBMT was mainly based on the findings of Carpenter et al. (1990) and on studies referring to their research results (Embretson, 1998; Vodegel Matzen, 1994). All of these studies find that the number of sub problems or rules is the most important determinant of task difficulty. Tasks with multiple rules require the problem solver to identify which figural elements of the task are determined by the same rule. This process of correspondence finding is made more difficult by ambiguous stimuli. Drawing features like overlay or fusion of figural elements cause ambiguity in the distinctiveness of figural elements. Table 1 presents the types of rules and the drawing features used in the HBMT-tasks.
3. Method
Twenty-six matrix completion tasks were generated, each containing two to five rules. Each item consists of 16 entries arranged in a 4x4 matrix, with the target entry in the lower right corner left blank. Each item has nine answer alternatives: the correct solution, seven distractors and the alternative "no alternative correct", which is meant to prevent a falsificatory exclusion strategy (Gittler, 1989). The distractors were built systematically and contain one or more rule omissions (1, ..., a-1, where a = number of rules). The items were roughly ordered with respect to the number of rules. Before working on the 26 items, the participants worked on 11 trial tasks in which all rules were explained in detail. No time constraints were imposed. Data were collected with a paper-and-pencil (P&P) version and with an online-version. Both versions were supplemented by questionnaires on demographic data. The adaptation for the online-version preserved the similarity between both test versions as far as possible.
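The rule-omission principle for distractors can be illustrated with a minimal sketch. It only enumerates which rules a candidate distractor could leave out (between 1 and a-1 of the a rules); the rule labels are hypothetical, and the text does not state how the seven distractors per item were selected from such candidates.

```python
from itertools import combinations


def omission_patterns(rules):
    """Return every way to omit between 1 and len(rules) - 1 of the given rules."""
    patterns = []
    for k in range(1, len(rules)):
        patterns.extend(combinations(rules, k))
    return patterns


# Hypothetical rule labels for a three-rule item (not the actual HBMT rule names).
example_rules = ["rule_A", "rule_B", "rule_C"]
for omitted in omission_patterns(example_rules):
    print("candidate distractor omits:", ", ".join(omitted))
```

For an item with a rules this yields 2^a - 2 candidate patterns, so a two-rule item offers only two such patterns while a five-rule item offers thirty.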
4. Results
Item difficulties show a mean of .59 for the P&P-version and of .69 for the online-version. Item discrimination (point-biserial) has a mean of .36 for both test versions. The item slopes (standardized slope of the item characteristic curve in a two-parameter IRT model) show a mean of .64 for the P&P-version and .72 for the online-version. An internal consistency (Cronbach's alpha) of .83 was estimated for both versions. The split-half reliability (equal-length Spearman-Brown) is .81 for the P&P-version and .83 for the online-version. The results (raw scores) in both versions were correlated with various criteria. Correlations are shown in table 3.
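For readers unfamiliar with the two reliability coefficients, the following sketch shows how they are commonly computed. The response matrix is a toy example and not HBMT data; the function names are chosen for illustration only.

```python
from statistics import pvariance


def cronbach_alpha(scores):
    """Cronbach's alpha for a list of per-person lists of item scores."""
    k = len(scores[0])
    item_vars = [pvariance([person[i] for person in scores]) for i in range(k)]
    total_var = pvariance([sum(person) for person in scores])
    return k / (k - 1) * (1 - sum(item_vars) / total_var)


def spearman_brown(r_half):
    """Equal-length Spearman-Brown correction of a half-test correlation."""
    return 2 * r_half / (1 + r_half)


# Toy data: four persons, three dichotomous items (illustrative only).
toy_scores = [[1, 1, 1], [1, 1, 0], [1, 0, 0], [0, 0, 0]]
print(cronbach_alpha(toy_scores))
print(spearman_brown(0.5))
```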
The size of the correlations was checked for differences between both samples by Fisher-Z-transformation. The only significant difference between both samples was found for the correlation of sex and score (Fisher's Z=.286, p=.00). Whereas there is no correlation between sex and score in the P&P-sample, the 81 women who took part in the online-test tended to score lower. Research shows that there are no differences between men and women in the aptitude to solve matrix completion tasks (e.g. Mills & Tissot, 1995). Therefore, the finding that women in the internet sample scored lower is not representative of the item type under study. It is unclear how to explain this result, and numerous confounding factors might play a role here.
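A common form of the Fisher-Z comparison of two correlations from independent samples is sketched below. The correlations and sample sizes in the usage lines are hypothetical, since the values from table 3 are not reproduced in the text, and the sketch is not necessarily the exact procedure used for the reported statistic.

```python
from math import atanh, erfc, sqrt


def compare_correlations(r1, n1, r2, n2):
    """Test statistic z and two-sided p for H0: rho1 == rho2 (independent samples)."""
    z1, z2 = atanh(r1), atanh(r2)  # Fisher's Z-transformation
    se = sqrt(1 / (n1 - 3) + 1 / (n2 - 3))
    z = (z1 - z2) / se
    p = erfc(abs(z) / sqrt(2))  # two-sided p from the standard normal
    return z, p


# Hypothetical values, for illustration only.
z, p = compare_correlations(r1=0.05, n1=200, r2=-0.20, n2=350)
print(round(z, 2), round(p, 4))
```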
Psychometric quality was also assessed by fitting alternative IRT-models to the data. The fit of different models can be estimated by comparing the geometric mean of the marginal maximum likelihood of the different models (Rost, 1996). Differences between the models for both samples were not very striking, but the 2PL-model tends to fit the data better than the 1PL-model. The 3PL-model does not improve model fit compared to the 2PL-model. In addition, model fit can be assessed by information criteria, so-called penalty functions, which take the number of estimated parameters (AIC-index) and also the sample size (BIC-index) into account. The lower the index value, the better the model fits the data.
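The models and indices mentioned here can be written down compactly. The sketch below uses the standard logistic item characteristic curve (the 1PL-model fixes the slope a to one common value, the 3PL-model adds a guessing parameter c) and the usual definitions of AIC and BIC; the log-likelihoods and parameter counts in the usage lines are hypothetical and do not come from the HBMT data.

```python
from math import exp, log


def icc(theta, a, b, c=0.0):
    """Probability of a correct answer: slope a, difficulty b, guessing c (0 for 1PL/2PL)."""
    return c + (1 - c) / (1 + exp(-a * (theta - b)))


def aic(log_lik, n_params):
    """Akaike information criterion: penalizes the number of estimated parameters."""
    return -2 * log_lik + 2 * n_params


def bic(log_lik, n_params, n_obs):
    """Bayesian information criterion: the penalty also grows with sample size."""
    return -2 * log_lik + n_params * log(n_obs)


# Hypothetical log-likelihoods for two models fitted to the same 300 persons.
print(aic(-4100.0, 26), bic(-4100.0, 26, 300))  # e.g. a 1PL-type model
print(aic(-4060.0, 52), bic(-4060.0, 52, 300))  # e.g. a 2PL-type model
```

Because the BIC penalty per parameter is ln(n) instead of 2, it favours the more parsimonious model more strongly than the AIC once the sample exceeds about eight persons.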
The rank correlation (Spearman) for the 2PL-item difficulties between both versions is r=.94, which indicates that both versions have a comparable order of item difficulties.
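As an illustration of how such a rank correlation is obtained, the following sketch ranks two lists of item difficulties (with average ranks for ties) and correlates the ranks. The difficulty values are hypothetical, not the estimated HBMT parameters.

```python
def average_ranks(values):
    """Ranks starting at 1; tied values receive the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def pearson(x, y):
    """Product-moment correlation of two equally long lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sum((a - mx) ** 2 for a in x) ** 0.5
    sy = sum((b - my) ** 2 for b in y) ** 0.5
    return cov / (sx * sy)


def spearman(x, y):
    """Spearman rank correlation: Pearson correlation of the ranks."""
    return pearson(average_ranks(x), average_ranks(y))


# Hypothetical 2PL difficulties of five items in two test versions.
pp_difficulty = [-1.2, -0.4, 0.3, 1.1, 2.0]
online_difficulty = [-1.0, -0.5, 0.6, 0.2, 1.8]
print(spearman(pp_difficulty, online_difficulty))
```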
To assess the impact of the generating item structure on item difficulty, a multiple regression with item difficulty of the 2PL-model as dependent variable and frequencies of each rule type and a variable for drawing features as independent variables was calculated. The 2PL-item difficulties are normally distributed. The regression yielded a significant effect of item structure for both the P&P-version, F(6, 18)= 6.20, p<.00, and the online-version, F(6, 18)= 8.02, p<.00. The multiple correlation is R=.82 for the P&P-version and R=.85 for the online-version (corrected R²: P&P .57, online .64), indicating that item difficulty can be predicted well by the generating item structure.
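The relation between the reported multiple correlations, the corrected R² and the F-values can be checked with the standard formulas. The sketch assumes 6 predictors and 25 observations, which is inferred here from the reported degrees of freedom (6, 18) and is not stated explicitly in the text.

```python
def adjusted_r2(multiple_r, n_obs, n_predictors):
    """Corrected (adjusted) R-squared from the multiple correlation R."""
    r2 = multiple_r ** 2
    return 1 - (1 - r2) * (n_obs - 1) / (n_obs - n_predictors - 1)


def f_from_r(multiple_r, n_obs, n_predictors):
    """F statistic of the overall regression test, computed from R."""
    r2 = multiple_r ** 2
    return (r2 / n_predictors) / ((1 - r2) / (n_obs - n_predictors - 1))


# n_obs = 25 and n_predictors = 6 are assumptions derived from F(6, 18).
for label, r in [("P&P", 0.82), ("online", 0.85)]:
    print(label, round(adjusted_r2(r, 25, 6), 2), round(f_from_r(r, 25, 6), 2))
```

Starting from the rounded R values, the results land close to, but not exactly on, the reported figures, which is what one would expect if the published values were computed from unrounded correlations.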
5. Discussion
The psychometric properties of the online-version proved to be satisfactory compared to those of the P&P-version. Item discrimination, reliability and convergent validity are high in both versions. The high correlation between the estimated 2PL-item parameters of both versions indicates that the order of item difficulties is nearly equal. The item difficulty can be predicted very well by the generating item structure. According to the results of Mead and Drasgow (1993), the absence of time constraints and maximal similarity (identity apart from modality) between test versions contribute to the equivalence of the CAT- and P&P-version of a test. Both conditions were fulfilled for the online- and P&P-HBMT. Moreover, participants received identical written instructions and trial tasks in which all possible item components, such as rules or rule direction, were explained and practised. Even if the participants of the online-version did not read the instructions carefully, they had to work on the eleven example tasks before working on the scored tasks. Therefore, it is likely that they were familiar with the task demands when working on the test. In addition, the good results of the online sample indicate that the participants understood the instructions.
Explaining the item components in advance has several advantages. First, test fairness is enhanced because participants have comparable conditions with respect to their knowledge of task principles. Second, the rule-based item construction and the training of item components ensure that the correct solution can be identified unambiguously. Last but not least, it can be argued that the knowledge and training of the task components circumscribe the (mixture of) latent factor(s) responsible for the answers participants produce. This leads to a reduction of error variance and improves the interpretation of results.
Nevertheless, the absolute difference in item difficulty proved to be problematic. The participants of the online-version had a significantly higher score than the participants of the P&P-version (Mann-Whitney-U=30423.5, Z=-6.21, p<.00). This can be explained by sampling effects and the test-taking situation: the online-sample was highly motivated and self-selected, and it is reasonable to assume that many of the online participants have some experience in (online) test taking. Moreover, there was no experimental control over the use of auxiliary means. The aim of collecting large amounts of data from gifted samples via the internet was not achieved in the study presented here. Although over 20,000 different computers accessed the online-test, only 358 persons worked through the whole test. The presented test was quite time consuming, and although it could be completed offline, the time investment and the unchanged item type placed great demands on the motivation and patience of the participants. Shorter and more varied tasks seem to be more promising for online-assessment.
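The Mann-Whitney comparison of the two samples can be sketched as follows. The sketch counts pairwise wins to obtain U and uses the normal approximation without tie correction, which may differ from the procedure of the statistics package actually used; the raw scores in the usage lines are hypothetical.

```python
from math import sqrt


def mann_whitney_u(xs, ys):
    """U for the first sample and its z under the normal approximation (no tie correction)."""
    u = 0.0
    for x in xs:
        for y in ys:
            if x > y:
                u += 1.0
            elif x == y:
                u += 0.5
    n1, n2 = len(xs), len(ys)
    mean_u = n1 * n2 / 2
    sd_u = sqrt(n1 * n2 * (n1 + n2 + 1) / 12)
    return u, (u - mean_u) / sd_u


# Hypothetical raw scores (number of items solved), for illustration only.
pp_scores = [12, 14, 15, 17, 18, 20]
online_scores = [15, 18, 19, 21, 22, 24]
print(mann_whitney_u(pp_scores, online_scores))
```

A negative z for the first sample means that its scores tend to be lower than those of the second sample.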
6. References
Andersen, E. B. (1973). A goodness of fit test for the Rasch model. Psychometrika, 38, 123-140.
Carpenter, P. A., Just, M. A., & Shell, P. (1990). What one intelligence test measures: A theoretical account of the processing in the Raven Progressive Matrices Test. Psychological Review, 97, 404-431.
Carroll, J. B. (1993). Human cognitive abilities: A survey of factor-analytic studies. Cambridge, MA: Cambridge University Press.
Embretson, S. (1995). The role of working memory capacity and general control processes in intelligence. Intelligence, 20, 169-189.
Embretson, S. (1998). A cognitive design system approach to generating valid tests: Application to abstract reasoning. Psychological Methods, 3, 380-396.
Gittler, G. (1990). Dreidimensionaler Würfeltest. Ein Rasch-skalierter Test zur Messung des räumlichen Vorstellungsvermögens. Testmanual. Beltz Test.
Hornke, L. F., & Habon, M. (1984). Regelgeleitete Konstruktion und Evaluation von nicht-verbalen Denkaufgaben. Wehrpsychologische Untersuchungen, 19, 1-153.
Jensen, A. (1998). The g-factor: The science of mental ability. Westport, CT: Praeger.
Marshalek, B., Lohman, D. F., & Snow, R. E. (1983). The complexity continuum in the radex and hierarchical models of intelligence. Intelligence, 7, 107-127.
Mead, A. D., & Drasgow, F. (1993). Equivalence of computerized and paper-and-pencil cognitive ability tests: A meta-analysis. Psychological Bulletin, 114, 449-458.
Mills, C. J., & Tissot, S. L. (1995). Identifying academic potential in students from under-represented populations: Is using the Ravens Progressive Matrices a good idea? Gifted Child Quarterly, 39, 209-217.
Raven, J. C. (1962). Advanced Progressive Matrices, Set II. London: H. K. Lewis.
Robinson, N. M., & Janos, P. M. (1987). The contribution of intelligence tests to the understanding of special children. In J. D. Day & J. B. Borkowski (Eds.), Intelligence and exceptionality: New directions for theory, assessment, and instructional practices (pp. 21-56). Ablex Publishing Corporation.
Rost, D. H. (Hrsg.) (2000). Hochbegabte und hochleistende Jugendliche. Münster: Waxmann.
Rost, J. (1996). Lehrbuch Testtheorie, Testkonstruktion. Göttingen: Huber.
Vodegel Matzen, L. B. L., van der Molen, M. W., & Dudink, C. M. (1994). Error analysis of Raven test performance. Personality and Individual Differences, 16, 433-445.
Ward, J., & Fitzpatrick, F. (1973). Characteristics of matrices items. Perceptual and Motor Skills, 36, 987-993.
Wilhelm, O., & McKnight, P. E. (2000). Ability and achievement testing on the world wide web. Unpublished Manuscript. University of Mannheim, University of Washington.