The Carlson Test continues to support Astrology
by Robert Currey
Shawn Carlson’s renowned study, “A Double-Blind Test of Astrology” (1985), comprised five separate tests to evaluate astrology. Only two, termed Test #1 and Test #2, survived; the others were rejected by Carlson due to artefacts. Published in the scientific journal Nature, the study, which engaged over a hundred participants and up to twenty-eight astrologers, found no support for the practice of natal astrology. In 2009, Professor Suitbert Ertel refuted Carlson’s negative conclusion by showing that, when the sample is treated as a whole, the astrologers successfully matched psychological profiles with birth charts in the two valid tests to a significant level (p = .05 in Test #1; p = .04, ES = .10 in Test #2). This paper re-examines the results of the Carlson test with a multivariate regression analysis, presented in a graph that combines the astrologers’ success rate at different rating levels with the frequencies of their evaluations. It shows strong agreement among the astrologers in rating correct matches above false matches, with a high level of consistency (r = .57), despite the complexity of the test. The findings challenge previous interpretations by Dean et al. in Understanding Astrology (2022) and reinforce the need for more nuanced assessments when testing the validity of astrological claims.
The Carlson Test continues to support astrology
Shawn Carlson’s famous “A Double-Blind Test of Astrology” (1985) is a series of five tests, although Carlson rejected three of them (we refer to the surviving tests as Test #1 and Test #2). The study was sponsored and guided by senior ‘fellows’ of a sceptical group, then known as the Committee for the Scientific Investigation of Claims of the Paranormal (CSICOP), and accepted by the prestigious science journal Nature (Kurtz 2006). The Carlson study is unique and notable as there were over one hundred subjects and it involved the collaboration of up to twenty-eight astrologers.
The original conclusion, rejecting astrology as practised by astrologers, received worldwide publicity. However, two decades later, Professor Suitbert Ertel (2009) pointed out that the published data demonstrated that the astrologers were able to match charts with psychological profiles in both of the valid tests to a significant level (p = .054 & p = .037).
Dean et al. claim a test rejected by Carlson is valid
Dean and his co-authors are quite fair in reporting Ertel’s results. However, they consider that both Ertel and I in my subsequent review of the experiment (Currey, 2011) should have included one of Carlson’s rejected tests:
“Both [Currey and Ertel] ignore Carlson’s first test, which did not involve the CPI [California Psychological Inventory] and thus remains valid.” (UA, 2023 p. 574-575).
Dean is referring to a test where the subjects were unable to identify their own horoscope to a significant level in a three-way choice that included two alternative horoscopes from other subjects with the same Sun Sign. However, Carlson and his sceptical associates ruled out this test because the subjects were also unable to identify their own (self-reported) psychological profile (CPI) in a three-way choice that included two random alternatives.
“We conclude that subject selection of astrologically derived information is a poor test of astrology”. (Carlson 1985 p.425)
“We cannot use the result [of the subject self-selections] to rule against the astrological hypothesis, because the test subjects were also unable to select their own CPI profile at a better-than-chance level. … If subjects cannot recognize accurate descriptions of themselves at a significant level, then the experiment would show a null result no matter how well astrology worked.” (Carlson, 1985 pp.424-425)
Dean implies that the failure of this self-selection test was due to the complex nature of the CPI. However, this is not correct, since the astrologers, who did not know the subjects, were able to blind-match CPIs with horoscopes to a significant level (Ertel, 2009), thereby succeeding where the sample group had failed to self-select from the CPI results.
The problem was that the sample group was too homogeneous and immature. Being mostly adolescents (“70% college students”) recruited on and around the Berkeley campus, they were inexperienced in life. Without the required self-knowledge and motivation, it was too much to expect subjects to differentiate their chart from two others with the same Sun sign!
Dean et al. then undermine their argument by claiming that self-selection with the CPI can be done! They cite a test result where 69% of college students identified their CPI from an inverted one – the antithesis of their personality. Even then, the results were still hit and miss. “The senior and graduate students could reliably select their own profiles while the sophomore students could not.” (Greene 1979)
While the older students were capable of self-identification through two-way comparisons of CPIs with their opposites, younger students were not. This suggests that the failure of self-identification in the Carlson test was not caused by the CPI itself, contrary to the UA authors’ claim. Instead, the underlying issues were the nature of the sample group and a far more demanding test design: a three-way comparison between very similar profiles.
The UA authors then refer to a classroom test conducted by Wyman and Vyse (2008) as “a near-replication” of the Carlson test using the NEO personality inventory (UA p.575). However, they overlooked the flaws presented in a review by McRitchie (2015). These included: the sample size of Wyman and Vyse’s test was about half of Carlson’s and lacked statistical power; no astrologers were involved; there were sampling errors; the computer-generated horoscope analyses were tampered with by extracting key parts; and, importantly, the experimenters circumvented their double-blind protocols, which could have allowed the subjects’ prior stated beliefs to bias the outcome.
Astrologers were consistent in rating chart matches despite the challenging task
Returning to Carlson’s 1985 study, in his legitimate first test (Test #1), astrologers were asked to rank a blind match for closeness of fit between three CPIs reported by three individuals and the birth chart of only one of them. Although Carlson could not find any statistical significance in evaluating the astrologers’ choices individually, Professor Ertel (2009) found that when the first and second choices were both considered (accounting for near misses), the successful matching showed marginal significance (p = .054; ES = .15).
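Ertel’s pooling of first and second choices can be illustrated with a one-tailed binomial test: if ties in quality are ignored, the chance of the correct chart falling within an astrologer’s top two of three choices is 2/3. The sketch below, in Python, uses hypothetical counts (not Carlson’s actual figures) purely to show the shape of the calculation:

```python
from math import comb

def binom_sf(k, n, p):
    """Upper-tail binomial probability: P(X >= k) for X ~ Bin(n, p)."""
    return sum(comb(n, i) * p**i * (1 - p)**(n - i) for i in range(k, n + 1))

# Hypothetical illustration (not Carlson's actual counts): suppose the
# astrologers' first OR second choice was correct in 85 of 116 trials.
n_trials = 116
hits = 85
p_chance = 2 / 3  # chance of the true chart landing in the top two of three

p_value = binom_sf(hits, n_trials, p_chance)  # one-tailed
```

If SciPy is available, `scipy.stats.binomtest(hits, n_trials, p_chance, alternative='greater')` gives the same tail probability without the hand-rolled sum.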
At least one astrologer found it unfairly hard to rank matches because the subjects’ profiles were so similar, and a tie was not permitted. Teresa Hamilton, astrologer and psychologist, who resigned from the test, stated: “I was given some of these charts (CPI Profiles) to match myself, and noticed immediately that the three profiles were often quite similar.” (Hamilton, 1986).
The final and most forensic legitimate test (Test #2) overcame Hamilton’s problem. Astrologers were asked to rate the similarity between each birth chart and three CPI profiles on a scale of 1 to 10 (10 being the choice of best fit). Out of every set of three CPIs, only one was a correct match with the birth chart. This matching assessment was conducted a painstaking 308 times by each of up to twenty astrologer participants.
Carlson’s conclusion upheld the null hypothesis. However, Ertel (2009) shows how the rating results had been improperly divided into three smaller groups, which hid the significance, resulting in a Type II error – a false negative. When Ertel merged the data, the analysis shows that the astrologers rated the charts successfully (p = .037; ES = .10, one-tailed). A further flaw is that Dean et al. reported a non-significant two-tailed test statistic. However, a two-tailed test is not appropriate here, since the astrological claim under test was directional: only better-than-chance matching counts in its favour, so a one-tailed test applies.
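The dilution effect that Ertel identified can be demonstrated numerically: a pooled sample can cross the significance threshold while each of its subsets, tested separately, does not. A minimal sketch with invented counts (chance level 1/3, as when one CPI in three is authentic):

```python
from math import comb

def binom_sf(k, n, p):
    """Upper-tail binomial probability: P(X >= k) for X ~ Bin(n, p)."""
    return sum(comb(n, i) * p**i * (1 - p)**(n - i) for i in range(k, n + 1))

# Invented counts for illustration: three subsets of 100 trials,
# 39 hits each, against a chance hit rate of 1/3 (expected: ~33 hits).
subset_ps = [binom_sf(39, 100, 1/3) for _ in range(3)]

# Each subset alone looks non-significant, but the pooled sample
# (117 hits in 300 trials) falls below the .05 threshold.
pooled_p = binom_sf(117, 300, 1/3)
```

The same hit rate (39%) that fails to reach significance three times over in small groups is clearly significant once the full sample size restores the statistical power.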
Figure 1 Results from Carlson’s Test #2, the astrologers’ ratings of fit, showing the improperly divided data in three histograms (left) published in Nature and the merged data in a scatter plot (right) from UA p. 574
Figure 1 from Understanding Astrology (2022 p.574) depicts three histograms (on the left) from Carlson’s Test #2. They show how he divided the match rating results into three small subsets by improperly using the ranking positions from his previous (Test #1) experiment (even though the sample sizes were different). The graph on the right shows the properly merged results, which the UA authors have redrawn from my review (2011 p.16) as a scatter plot. The plotted values are the percentages of successful astrologer matches between natal charts and CPIs at each rating level (from 1 to 10). Dean concludes that while the plots confirm the positive trendline and a medium effect size (r = .385), the results are not significant:
“The percentage of hits tends to increase as the fit improves from 1 to 10 (r = .385), but it was not even weakly significant (p = .27).” (UA 2022 p.574)
The problem with the authors’ calculation is the false assumption that each datum point should be treated equally, as if there were only ten authentic ratings – one at each level on the graph. In fact, there are one hundred authentic ratings. The plot points in Figure 1 show the frequencies of authentic matches from all 308 choices at each of the ten rating levels. However, these frequencies are not accounted for in the UA calculations.
The reason why the match frequency is such a vital metric in assessing the results is that, for example, the 51 ratings at level 8 out of 10 (10 being best fit) had been treated as a single datum point, carrying the same weight as the 4 ratings at level 10 (best fit) and the 18 ratings at level 1 (worst fit).
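The difference between the two treatments can be made concrete with a frequency-weighted correlation, where each rating level contributes in proportion to how often it was used. In the sketch below, only the frequencies quoted above (18 at level 1, 51 at level 8, 4 at level 10) are from the text; the remaining frequencies and all the hit percentages are invented for illustration. Passing unit weights reproduces the equal-weight treatment used in UA:

```python
from math import sqrt

def weighted_pearson(x, y, w):
    """Pearson correlation of x and y with observation weights w."""
    sw = sum(w)
    mx = sum(wi * xi for wi, xi in zip(w, x)) / sw
    my = sum(wi * yi for wi, yi in zip(w, y)) / sw
    cov = sum(wi * (xi - mx) * (yi - my) for wi, xi, yi in zip(w, x, y)) / sw
    vx = sum(wi * (xi - mx) ** 2 for wi, xi in zip(w, x)) / sw
    vy = sum(wi * (yi - my) ** 2 for wi, yi in zip(w, y)) / sw
    return cov / sqrt(vx * vy)

ratings = list(range(1, 11))                      # rating levels 1..10
hit_pct = [20, 25, 22, 30, 28, 35, 33, 40, 38, 50]  # invented hit percentages
freqs = [18, 20, 25, 30, 40, 45, 50, 51, 25, 4]     # ratings given per level
# (18, 51 and 4 are the frequencies quoted in the text; the rest are invented)

unweighted = weighted_pearson(ratings, hit_pct, [1] * 10)  # UA's treatment
weighted = weighted_pearson(ratings, hit_pct, freqs)       # frequency-weighted
```

Whether the weighted coefficient comes out higher or lower depends entirely on where the frequencies concentrate, which is exactly why ignoring them misrepresents the data.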
Figure 2 A more accurate, multivariate evaluation of the Carlson chart-matching test results plotted as a bubble chart that includes the scores of near and far misses.
In Figure 2, I redrew Dean’s scatter plot as a multivariate bubble graph with the same axes to include the effects of near misses. The figure shows the number of authentic matches that appeared at each level of certainty chosen by the astrologers, expressed as a percentage of the total matches. This corresponds to the calculation “Hits/All Hits %”, which is the y-axis in Figure 1. Unlike Dean’s scatter plot, the bubble diagram more accurately represents the greater weight of astrologer choices toward the best-fit score (10). This weight is represented by the size of the bubbles and their labelled percentage numbers. The corresponding linear regression trendline is shown as the dark red dashed line. The evaluated results appear in the lower right-hand corner.
The correlation coefficient confirms a high effect size (r = .57, compared with Dean’s r = .385) between the proportion of authentic matches and the astrologers’ rating of the match. The upward slope of the trendline confirms the positive correlation: the astrologers tended to assign lower ratings (below 5 out of 10) to false matches and higher ratings (above 5 out of 10) to true matches. The coefficient of determination (R² = .3221) measures how well the model fits the data and, in this context, confirms the consistency of the astrologers in their ability to rate matches.
While the effect size for this result is remarkably strong, the p-value of the correlation can be misleading for ranked testing, because it evaluates only the goodness of fit of the trendline rather than the objective success of the astrologers. That success is confirmed by Ertel’s (2009) probability calculation (p = .037), using Kendall’s tau for ranked results. So even though the blind matching test with CPIs was extremely difficult, the astrologers’ certainty in their judgement correlates strongly with their success in separating true from false horoscopes.
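Kendall’s tau, the ranked statistic Ertel applied, can be sketched in a few lines: it counts concordant and discordant pairs rather than fitting a line, which is why it suits rating data of this kind. The values below are invented for illustration, not drawn from the Carlson data:

```python
from itertools import combinations

def kendall_tau(x, y):
    """Kendall's tau-a: (concordant - discordant) / total number of pairs."""
    concordant = discordant = 0
    for (x1, y1), (x2, y2) in combinations(zip(x, y), 2):
        s = (x1 - x2) * (y1 - y2)
        if s > 0:
            concordant += 1
        elif s < 0:
            discordant += 1
    n = len(x)
    return (concordant - discordant) / (n * (n - 1) / 2)

# Invented example: astrologers' ratings vs. an authenticity indicator
# (1 = authentic match, 0 = false match).
ratings = [2, 9, 4, 8, 1, 10, 3, 7]
authentic = [0, 1, 0, 1, 0, 1, 0, 1]
tau = kendall_tau(ratings, authentic)  # 16/28, about 0.57, for this example
```

SciPy’s `scipy.stats.kendalltau` computes the tau-b variant, which additionally adjusts the denominator for ties; with heavily tied data such as a 0/1 authenticity column, that adjustment matters.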