The question of replicability
Are the Egyptian findings an artefact of a biased composition of the control group?
Facts v. formal logic
Sampling problems: how serious?
Test selection and administration
Statistical difficulties: are they real?
Summary and conclusion
Author: M. I. SOUEIF
Pages: 35 to 43
Creation Date: 1977/01/01
In their methodological commentary Fletcher and Satz stated the following: "... of several more recent cross-cultural neuropsychological studies, only the Egyptian study has produced evidence for impairment in higher adaptive cortical functions attributable to prolonged cannabis use". The authors proceeded to cite studies carried out in Jamaica, Greece and Costa Rica that "failed to show any differences between chronic cannabis users and controls on measures of higher integrative adaptive functions". Fletcher and Satz are, thus, raising the question of replicability which is, undoubtedly, basic in the evaluation of empirical research. The quest for obtaining the same findings, however, is based on certain assumptions which, if not explicitly shown to be fulfilled, can generate ambiguity followed by confusion. An assumption of major importance in this respect is the use of identical tools of research and procedures of administration, and the comparability of subjects reported upon in the works to be compared. But this was not the case in the three studies cited above. When the same tools are not used in the works to be considered, the least that should be provided is a demonstration of reliable and meaningful relationships between different instruments that are supposed to assess the same area of functioning. The burden lies with Fletcher and Satz to provide statistical estimates of such relationships as a first step to justify their critique of the Egyptian study. In recent publications (Soueif, 1975 b; 1976 a) we tried to shed light on some inherent differences between the tests we used and those utilized in two Jamaican studies (Bowman and Pihl, 1973; Rubin and Comitas, 1974), making it clear how such differences might contribute to the disparities between the Jamaican findings and ours.
However, we could also show that where tested functions were thought to be the same and the testees were comparable (regarding level of education and residency) the Jamaican findings did not differ from the corresponding Egyptian results.
Recently, Dr. V.K. Varma, who has been collaborating with Prof. N. N. Wig, within the context of a WHO sponsored feasibility study, decided to try a replication of our research on groups of long-term heavy cannabis takers and matched controls in India. To ensure use of identical tools we provided them (at the request of Dr. D. Cameron of WHO) with a complete set of our tests together with a fully detailed description of the instructions we had followed in the administration of the tests. Following is an excerpt from a recent personal communication received from Dr. Varma: "... I have been interested in cannabis research especially with regard to the psycho-social effects of long-term, heavy cannabis use and in this connexion I collaborated with Prof. N. N. Wig . . . in a WHO-sponsored feasibility study. Earlier we have had occasion to obtain literature regarding your previous publications and the psychological tools used by you. In a small study conducted by us, we also found significant difference between long-term, heavy users and matched controls, on I.Q., memory quotient, time perception etc. Thus your study as well as ours seems to stand in contrast with the more recent reports coming from Jamaica, Greece and Costa Rica."
Fletcher and Satz underlined the fact that our control group had a better standing than users on literacy. We admit that it would have been preferable to have the two groups, the experimentals and controls, exactly equated on this parameter. However, the argument the two authors presented, based on that fact, does not seem convincing. To begin with, the way Fletcher and Satz paraphrase our first hypothesis does not do justice to our original formulation. Following is the way they put it: "The first of Soueif's 3 hypotheses states that deficits in brain function are associated only with higher literacy levels". But our original formulation reads as follows: "The lower the level of literacy the smaller the size of function deficit associated with cannabis taking" (Soueif, 1976 a). It would be readily seen that whereas the first statement suggested dichotomy the latter implies gradation. The former formulation makes it easy for Fletcher and Satz to present their argument which would run something like:
Soueif's users are either literates or illiterates;
Soueif associates brain function deficits with literates only;
In Soueif's work literacy favours controls;
His study takes into account the fact that better test scores go with literacy;
Therefore, the fact that literate controls show better test results than users is an artefact of the original better standing of the controls;
Therefore, test results have nothing to do with cannabis consumption.
But this line of argument does not take into account the totality of our results. We have been able to demonstrate that the discrepancies between takers and non-takers are more apparent when the compared groups are literate (significant differences favouring controls on ten out of sixteen test variables); they also appear, though not so strongly, among semi-literates (significant differences on five test variables), and disappear among illiterates. Two comments are warranted here: (a) that our original formulation, suggesting gradation rather than dichotomy in literacy, seems more adequate than Fletcher and Satz's paraphrasing in accounting for the data presented, and (b) that the excess in the number of high school subjects (44) and university graduates (6) in our controls does not account for all the data we presented. Objective test scores favoured controls, not only in the case of literates, but also in semi-literates (those who knew how to read and write but went no further up the educational ladder). At the same time, test scores did not differentiate between users and non-users among illiterates.
Fletcher and Satz, apparently intending to add strength to their argument, state the following: "The potential influence of literacy differences on the over-all direction of the results between the groups is highlighted by the fact that no significant differences were apparent when literacy levels favoured the user group (in the Upper Egyptian analysis)". This seems hinting that we were arguing against any influence of literacy level on test results, which is not the case. On the contrary, we have emphasized such influence in the first of our six specific predictions (Soueif, 1976 a). The important point, here, is whether our findings, regarding the differences between objective test scores obtained by users and non-users at different levels of "literacy-illiteracy" should or should not be considered an artefact of a biased composition of our control group. The basic position defended by Fletcher and Satz implies that no brain function deficit is associated with cannabis consumption. Had this been the case we would have expected another result in the Upper Egyptian analysis different from what we obtained. With literacy favouring takers that much (of the takers 62.3 per cent were illiterates, and 29.6 per cent were semi-literates; of the controls 72.5 per cent were illiterates and 16.3 per cent were semi-literates) we would have expected takers to obtain significantly better scores on the tests than controls. The fact is, however, that on 13 out of 16 test variables there were no significant differences between takers and non-takers. We propose to account for this fact as follows: hypothetically (according to our basic position) low test scores are expected to be associated with drug use; at the same time high scores are expected to correlate with high literacy; the end result, as represented by the obtained scores, would be that the two tendencies would negate each other showing no reliable differences between takers and non-takers. 
Such interpretation gains more support from the fact that in a number of cases we found that literate takers would score better on our tests than illiterate non-takers (cf. tables IV, V and VI, Soueif, 1971). This should not, however, confuse the main issue, that when groups of takers and controls are equated (or approximately equated) for literacy, controls earn significantly better scores than takers.
Fletcher and Satz raise logical objections against the "relationship suggested by Soueif ... that deficits are more likely to emerge with younger users". It should first be noted that this is a fact which was empirically demonstrated twice: with heavy takers (who took the drug over 30 times a month) and with moderates (30 times or less per month), and in both cases the experimentals were compared with age-equated controls (Soueif, 1976 a). We should, therefore, try to explain it rather than brush it aside. A tentative explanation has already been suggested (Soueif, 1976 b). The explanation is based on two key concepts: the "initial level of proficiency", and the "level of arousal or activation" (Martin, 1973). The former concept (acting as an intervening variable) furnishes an antecedent condition which can be utilized to help prediction, while the latter plays the role of a hypothetical construct containing surplus meaning which helps to anchor our findings in whatever relevant knowledge exists at the next lower level in the explanatory hierarchy (neurophysiology in this case) (MacCorquodale and Meehl, 1953). What Fletcher and Satz find as creating a conceptual difficulty is paralleled by more or less similar findings already reported in the clinical literature. Binder, using Thurstone's test battery for the assessment of Primary Mental Abilities, tested a group of young (mean age 29.2 years) and a group of old schizophrenics (mean age 53.6 years) and compared them with matched groups of normals. What is relevant in Binder's results is that he found that older schizophrenics differed much less from normal controls than did younger patients when compared with their matched controls (Binder, 1956). This finding is analogous, though emerging in another area of study, to ours, concerning age as a moderating variable in determining brain dysfunction.
Fletcher and Satz mention "other serious sampling problems ... in the Egyptian study". One such problem is the fact that our subjects were derived from the prison population. The authors proceed to say, "it may be that the results with this sample are not generalizable to the Egyptian population". We never claimed that our subjects were selected to constitute a representative sample of the Egyptian population. In fact we explicitly cautioned the reader against straightforward generalization of our findings. Following is an example of such a cautionary note: "At the present stage of its development, however, our hypothesis seems to raise more questions than answers. For one thing, it would be ill-advised to generalize such a formula to other samples of subjects ... and other cultural settings on an a priori basis. Cross-validating studies are needed to define the limits beyond which the formula should not be extended" (Soueif, 1976 a).
Fletcher and Satz count, as another serious sampling problem, that consumption of other narcotics in the user group was uncontrolled (30 per cent used opium), and that this was among the facts which favoured the non-user group. This implies that the subgroup who used opium over and above cannabis should have earned consistently lower test scores than cannabis takers who did not take opium. In this way the grand averages obtained by the group of takers as a whole would be pulled down, thus standing in a worse position relative to the averages obtained by controls. But this is not the case. First of all such deductions should not be made on an a priori basis. And the truth is that we made an analysis of the relevant data comparing the medians obtained by those who took opium and those who did not. On eleven out of sixteen test variables the differences were not statistically significant. Moreover, on three out of the five differentiating tests those who took opium plus cannabis obtained significantly better scores than "cannabis only" users! The three tests were the following: Part K of the GATB (t = 2.69), Bender Gestalt, Copy (t = 11.00), and Bender Gestalt, Recall (t = 4.32). The remaining two tests, which favoured the "cannabis only" group, were Part C of the GATB and "Length estimation, irrespective of direction of discrepancy" (Soueif, 1975 a; 1977). Many comments could be made here, but the most relevant is that, with such a profile of findings, one cannot conclude that opium taking systematically worsened the performance of our experimentals on objective tests compared with the performance of the controls.
Fletcher and Satz raise the problem of possible floor effects of our psychometric tools. In the case of the Egyptian study such criticism is irrelevant, for the simple reason that the floor effect phenomenon acquires significance when the tests utilized are of the power type, and the test items are rather difficult. This implies that the test items should have the nature of problems and the required performance should be of the nature of problem solving. But this is not the case in our study. Six test variables of ours were very much nearer to the pure speed type. Anastasi makes the following distinction: "A pure speed test is one in which individual differences depend entirely on speed of performance. Such a test is constructed from items of uniformly low difficulty ..." (Anastasi, 1976, p. 122). The remaining tests, though they did not emphasize speed, required subjects to perform very simple tasks. In fact we were quite aware of the rightness of this point from the beginning, and we stated that explicitly in our early reports. Following is an example: "In addition to considerations of reliability and validity the choice of tests was influenced by two main factors: (a) they should be widely known among psychologists in Egypt and abroad, and (b) should be as culture free as possible, so that they would serve to make possible future comparisons across cultural researches" (Soueif, 1971). Anastasi mentioned skewness of the distribution of test scores as an important criterion to see whether the test utilized had a high floor for the testees (Anastasi, 1976, p. 202). The reason for such skewness is that a large number of the testees would be lumped together towards the lower end of the scale (viz. earning very low scores). But this is not the main trend featuring in the distributions we have been presenting.
We have provided tables of means, standard deviations, medians and interquartile deviations for both takers and controls among urbans, rurals derived from Upper Egypt, rurals drawn from Lower Egypt, literates, semi-literates and illiterates on 16 test variables (Soueif, 1971). Inspection of these tables shows that only in very few cases did skewness feature. (Incidentally, those were the cases where we decided to compare between median scores rather than means.) One table did not present any skewed distribution (that of rurals derived from Lower Egypt), one showed one incident of skewness (that of literates), and four showed 4 skewed distributions each. Moreover, a close examination of the tables (Soueif, 1971; 1976 a) shows that the phenomenon of "the differential association" does not capitalize on these few cases of skewness. In actual fact, neither in the case of literacy nor in that of age (as moderator variables), did any of the tests which delineated differential association show skewness. In residency two tests only, Mark Making and Bender Gestalt, yielded skewed distributions (among Upper Egyptians).
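The skewness criterion discussed above can be illustrated with a minimal sketch. It assumes the moment coefficient of skewness as the index (the original reports do not state which index was inspected), and both sets of scores are invented for illustration; they are not data from the study.

```python
from statistics import fmean

def skewness(scores):
    """Moment coefficient of skewness, m3 / m2**1.5 (population moments)."""
    n = len(scores)
    mean = fmean(scores)
    m2 = sum((x - mean) ** 2 for x in scores) / n
    m3 = sum((x - mean) ** 3 for x in scores) / n
    return m3 / m2 ** 1.5

# Hypothetical scores (assumed values, not taken from the published tables).
floored = [0, 0, 0, 0, 1, 1, 1, 2, 3, 6]     # many testees lumped at the low end
spread_out = [2, 3, 4, 4, 5, 5, 6, 6, 7, 8]  # symmetric, no lumping at the floor

print("floored:", round(skewness(floored), 2))        # clearly positive
print("spread out:", round(skewness(spread_out), 2))  # zero for a symmetric set
```

A test with a high floor would be expected to behave like the first set, with a long tail above a cluster of very low scores; a distribution like the second gives no such indication.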
Another point, still, to demonstrate that our test results did not suffer from a floor effect hinges upon an examination of the spread (or scatter) of test scores (Anastasi, 1976, p. 203). Had the tests been markedly difficult for illiterates, rurals and older people compared with literates, urbans and younger subjects respectively systematic changes in the variances of test scores should have emerged accordingly. We would have expected estimated variances to be very much less in the illiterates than in the semi-literates and the literates. The reason is that since a test is very difficult for illiterates it would lump them together and no individual differences would show. By the same token variances would be very much less among rurals than among semi-rurals and urbans; and among older than younger subjects. But inspection of the tables (Soueif, 1971) shows that this is not the case. In many cases the estimated variances are almost identical in the comparable groups.
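The comparison of spread described above might be sketched as follows. The ratio of the larger to the smaller sample variance is used here only as a convenient descriptive index, which is an assumption of this sketch rather than the procedure followed in the original reports, and all scores are hypothetical.

```python
from statistics import variance

def variance_ratio(group_a, group_b):
    """Ratio of the larger sample variance to the smaller one (always >= 1)."""
    var_a, var_b = variance(group_a), variance(group_b)
    return max(var_a, var_b) / min(var_a, var_b)

# Hypothetical scores (assumed values, for illustration only).
literates = [3, 5, 6, 8, 9, 11, 12, 14]
illiterates_comparable = [4, 5, 7, 8, 10, 11, 13, 14]  # spread similar to literates
illiterates_floored = [0, 0, 1, 0, 1, 0, 0, 1]         # lumped together by a hard test

print(round(variance_ratio(literates, illiterates_comparable), 2))  # close to 1
print(round(variance_ratio(literates, illiterates_floored), 2))     # very large
```

A ratio near 1 corresponds to the "almost identical" variances mentioned in the text; a floor effect would be expected to produce a ratio far above 1.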
The objections raised by Fletcher and Satz against our testing procedure seem irrelevant. The tests we used were all of the objective and structured type. Even the four cards of the Rorschach inkblots and the six cards of the Bender Gestalt test were treated as objective tools. We had nothing to do with the content of the responses subjects gave to the Rorschach cards. We simply read the stopwatch to mark the number of seconds elapsing between exposing the subject to each card and the start of verbal response. This means that the score was simply a measure of speed of response and no dynamic interpretation was given to it. There is nothing subjective in this method of handling test scores. The Bender Gestalt test was used in the modified way described by Shapiro, Field and Post (1957). The scoring system devised by Miss Löfving was adopted. The system provided a five-point rating scale of correctness for deriving a score for copying and a score for recall (Shapiro, Post, Löfving and Inglis, 1956). This system leaves very little room for subjective judgement, a fact testified by the achievement of a high degree of agreement between scorers independently scoring the test during their training. The remaining 14 test variables were almost completely objective and structured. Relevant here is the following statement made by Scheier: "There is general agreement that objective tests deal with a person's behaviour as contrasted with what he says about his behaviour, and that all observers will agree on the score to be assigned a given person's performance on an objective test" (Scheier, 1958). 
This fact considered, plus the fact that test instructions were very much structured, and that our testers were given an intensive training, lasting for three consecutive months, for the sole purpose of getting them to be highly standardized in the way they administered the tests (and the interviews), we think that very little scope was left for the experimenter bias to distort the subject's test performance systematically.
It is very difficult to understand why Fletcher and Satz would ask for a grand mean for the whole group of takers v. a grand mean for the whole group of non-takers (on each test variable) while they acknowledge that "such findings would undoubtedly be distorted". Indeed such grand means, if they mean anything, can be readily calculated from our published tables where we stated means, standard deviations and numbers of subjects for each subgroup (cf. tables 1 and 2, Soueif, 1975 b). We did, however, present an analysis of variance between takers and non-takers irrespective of literacy and residence. F ratios were statistically significant in 9 test variables. The level of significance was very high on 6 out of those 9 tests, going far beyond 0.001 (Soueif, 1975 b). It is very unlikely that levels of significance of such order would be an artefact of difference of variances between the compared groups. McNemar makes it clear "that extreme differences in variance ... do not greatly disrupt the F test as a basis for judging significance in the analysis of variance". McNemar proceeds to say, "If the investigator wishes to have some assurance that he is not risking the making of the Type I error more often than his chosen level for judging significance, he may wish to adopt a somewhat more rigorous level: requiring a computed F to reach the 0.01 level provides a very safe base for claiming significance at the 0.02 level" (McNemar, 1969, p. 288).
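As a sketch of how a grand mean (and the corresponding standard deviation) can be recovered from subgroup means, standard deviations and numbers of subjects, the following may be helpful. It assumes that the tabulated standard deviations use the n - 1 denominator; the subgroup rows are invented and do not reproduce the published tables.

```python
from math import sqrt

def combine(subgroups):
    """Combine (n, mean, sd) rows into the overall n, grand mean and grand SD.

    Assumes each sd is a sample SD computed with the n - 1 denominator.
    """
    total_n = sum(n for n, _, _ in subgroups)
    grand_mean = sum(n * m for n, m, _ in subgroups) / total_n
    # Total sum of squares = within-subgroup part + between-subgroup part.
    within = sum((n - 1) * sd ** 2 for n, _, sd in subgroups)
    between = sum(n * (m - grand_mean) ** 2 for n, m, _ in subgroups)
    grand_sd = sqrt((within + between) / (total_n - 1))
    return total_n, grand_mean, grand_sd

# Hypothetical subgroup rows for one test variable: (n, mean, sd).
takers = [(120, 41.0, 9.5), (200, 37.5, 10.2), (330, 33.0, 9.8)]
n, mean, sd = combine(takers)
print(n, round(mean, 2), round(sd, 2))
```

The grand mean is simply the subgroup means weighted by the numbers of subjects, which is why its value depends heavily on the composition of the whole group.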
It is true that we reported different degrees of freedom for each variable, a fact necessitated by keenness to reject every single case that showed violation of test instructions for any reason. Only in the case of the Trail Making Test, which required subjects to be able to read numbers, did we have to spare a rather sizeable number of our subjects. Apart from that, the degrees of freedom fluctuated within narrow limits, and there is no reason to believe that essentially different sets of subjects constituted groups for each statistical analysis. Inspection of the detailed tables (Soueif, 1971) shows that such belief would be unfounded.
The questions posed about the relationship between the magnitude of probability values (p-values) used to establish statistical significance and the size of the samples can be presented in a misleading way if not formulated properly. "It is usual and convenient for experimenters to take 5 per cent as a standard level of significance . . ." (Fisher, 1951, p. 13). "A probability of 0.01 or smaller will be regarded very significant and will simply mean that the hypothesis being tested will be rejected with greater confidence than when the probability is between 0.05 and 0.01" (Edwards, 1956, p. 28). That the magnitude of the probability value used to establish statistical significance tends to increase as the sample size becomes larger is true. But it does not follow that the research worker should avoid using large samples altogether. What is certainly required is that the student should consider with great care the problem of defining the level of significance he is ready to accept. Fisher states clearly that, "It is open to the experimenter to be more or less exacting in respect of the smallness of the probability he would require before he would be willing to admit that his observations have demonstrated a positive result. It is obvious that an experiment would be useless of which no possible result would satisfy him" (1951, p. 13). A cautious investigator might be willing to set his acceptable level of significance at 0.01, or at an even tighter p-value, instead of 0.05. When doing so the researcher should take into account a number of points, the main ones being the dangers of committing a type II error (viz. accepting a null hypothesis as true while it is actually false), and of not permitting faint regularities to appear by insisting on accepting nothing less than very high levels of significance. The implication is that there is nothing wrong with large samples as such.
On the contrary, they may provide the researcher with a chance to reveal a faint but lawful relationship. This is one of the methodological gains one gets from carrying out large scale surveys and it is very important when new areas of research are trodden upon (Edwards, 1954). The newly discovered relationship, once identified, might encourage some investigator to make the necessary effort to brush aside much of the noise which keeps it faint at which time the relationship would prove important in terms of meaningfulness and statistical significance. The value of the Egyptian study, however, does not hinge upon low probability values. Close inspection of the relevant tables (Soueif, 1971; 1976 a) shows that out of 73 significant t-tests 30 tests had a p-value far beyond 0.001, and 22 went beyond 0.01, in addition to 21 coming out with p-values between 0.05 and 0.01. And out of 39 significant F ratios, 37 had a p-value far beyond 0.001, one at 0.01 and one at 0.05.
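The relationship between sample size and attained significance discussed above can be illustrated with a minimal sketch. It uses a normal approximation to the two-sample test and an invented standardized difference of 0.2 standard deviations; neither is taken from the Egyptian data. The second part merely re-adds the counts of significant results quoted in the text.

```python
from math import sqrt
from statistics import NormalDist

def two_sided_p(effect_size, n_per_group):
    """Two-sided p-value (normal approximation) for a difference between two
    group means of `effect_size` standard deviations, n_per_group per group."""
    z = effect_size * sqrt(n_per_group / 2)
    return 2 * (1 - NormalDist().cdf(abs(z)))

# The same faint difference (0.2 SD, an assumed figure) at growing sample sizes.
for n in (25, 100, 400):
    print(n, round(two_sided_p(0.2, n), 4))

# Counts of significant results as quoted in the text.
t_counts = {"far beyond 0.001": 30, "beyond 0.01": 22, "0.05 to 0.01": 21}
f_counts = {"far beyond 0.001": 37, "at 0.01": 1, "at 0.05": 1}
t_total = sum(t_counts.values())  # 73
f_total = sum(f_counts.values())  # 39
share_beyond_01 = (t_counts["far beyond 0.001"] + t_counts["beyond 0.01"]) / t_total
print(t_total, f_total, round(share_beyond_01, 2))
```

Under these assumptions the same faint difference fails to reach the 0.05 level with 25 subjects per group and passes the 0.01 level with 400, which is the sense in which a large survey may reveal a faint but lawful relationship.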
To conclude, the points raised by Fletcher and Satz against the Egyptian study are either unconvincing or irrelevant. Not one of the three studies mentioned by those authors, viz. the Greek, the Jamaican or the Costa Rican, could be considered an adequate replication of the Egyptian project, regarding the tests administered and the procedure of administration. Yet, preliminary reports on the only study where such replication was undertaken (in India) show that our main findings were reproduced. We have shown that our main findings could not be an artefact of a slightly better standing of controls on literacy. We have also shown that, strange as it may seem, our finding that cannabis use among youngsters is correlated with more brain function deficit than it is among old users is paralleled by an analogous finding in the clinical literature on young compared with old schizophrenics. We have shown that the fact that some Egyptian users took opium over and above cannabis did not worsen their test performance compared with the performance given by "cannabis only" users. The question of a possible floor effect on our tests was shown to be either irrelevant or unfounded. It was also made clear that the criticism made against our testing procedure was ill founded. And, finally, the question as to the relationship between the size of the probability values used to establish statistical significance and the size of the sample was discussed. When the majority of the results reach a level of statistical significance far beyond 0.01 it is highly improbable that such results are simply artefacts of the big size of the sample.
A constructive suggestion that seems to impose itself at this point is that a proper replication of the Egyptian study (with a smaller sample, maybe, but with the main independent variables represented) would be worth undertaking. Because the relationship between the tests and the psychological functions thought to be measured by such tests is rather complicated, the same instruments should be used in the replicated study. The criteria for takers and controls should be the same as those adhered to in the original work. Criticism raised under such conditions would be methodologically justified and the probability is that it will lead to creative answers.
Anastasi, A., Psychological testing, New York, Macmillan, 4th ed., 1976.
Binder, A., Schizophrenic intellectual impairment: Uniform or differential? J. Abn. Soc. Psychol., 52, 11-18, 1956.
Bowman, M. and R.O. Pihl, Cannabis: Psychological effects of chronic heavy use, Psychopharmacologia (Berlin), 29, 159-169, 1973.
Edwards, A.L., Experiments: Their planning and execution, Handbook of social psychology, G. Lindzey ed., Cambridge 42, Mass., Addison-Wesley, I, 259-288, 1954.
_____, Experimental design in psychological research, New York, Rinehart, 1956.
Fisher, R.A., The design of experiments, London, Oliver and Boyd, 6th ed., 1951.
MacCorquodale, K. and P.E. Meehl, Hypothetical constructs and intervening variables, Readings in the philosophy of science, New York, Appleton-Century-Crofts, Inc., 596-611, 1953.
McNemar, Q., Psychological statistics, New York, J. Wiley, 4th ed., 1969.
Martin, I., Somatic reactivity: Interpretation, Handbook of abnormal psychology, H.J. Eysenck ed., London, Pitman Med., 333-361, 1973.
Rubin, V. and L. Comitas, Effects of chronic smoking of cannabis in Jamaica. A report by the Research Institute for the Study of Man to the Centre for Studies of Narcotic and Drug Abuse, National Institute of Mental Health, Contract No. HSM-42-70-97, 1973 (mimeographed).
Scheier, I.H., What is an objective test? Psychol. Rep., 4, 147-157, 1958.
Shapiro, M.B., J. Field and F. Post, An enquiry into the determinants of differentiation between elderly "organic" and "non-organic" psychiatric patients on the Bender Gestalt Test, J. Ment. Sci., 103, 364-374, 1957.
_____, F. Post, B. Löfving and J. Inglis, Memory function in psychiatric patients over sixty, some methodological and diagnostic implications, J. Ment. Sci., 102, 233-245, 1956.
The Egyptian study: A reply to Fletcher and Satz 43
Soueif, M.I., The use of Cannabis in Egypt: A behavioural study, Bulletin on Narcotics, XXIII: 4, 17-28, 1971.
_____, (a) Psychomotor and cognitive deficits associated with long- and short-term cannabis consumption: Comparison of research findings and discussion of selected extrapolations, Proceedings of the Third International Cannabis Conference, organised by the Institute for the Study of Drug Dependence at the Ciba Foundation, London, P.H. Connell and N. Dorn eds., London, Churchill Livingstone, 25-44, 1975.
_____, (b) Chronic cannabis users: Further analysis of objective test results, Bulletin on Narcotics, XXVII: 4, 1-26, 1975.
_____, (a) Some determinants of psychological deficits associated with chronic cannabis consumption, Bulletin on Narcotics, XXVIII: 1, 25-42, 1976.
_____, (b) The differential association between chronic cannabism and impairment of psychological functions: Towards a theoretical framework, I.C.A.A.; Papers presented at the 6th International Institute on the Prevention and Treatment of Drug Dependence; Hamburg, Germany, 106-118, 28 June-2 July 1976.
_____, Cannabis users vs. "cannabis plus opium" users: a comparative study of some correlates, 1977 (in press).