Thursday, November 12, 2015

Neyman does science, part 2

In part one of this series, we discussed the different philosophical viewpoints of Neyman and Fisher on the purposes of statistics. Neyman had a behavioral, decision based view: the purpose of statistical inference is to select one of several possible decisions, enumerated before the data have been collected. To Fisher, and to Bayesians, the purpose of statistical inference is related to the quantification of evidence and rational belief. I agree with Fisher on this issue, and I was curious how Neyman -- with his pre-data inferential philosophy -- would actually tackle a problem with real data. In this second part of the series, we examine Neyman's team's analysis of the data from the Whitetop weather modification experiment in the 1960s.

First: Get the data!

I have saved the data in Table 2 in Neyman et al. (1969) online in a text format. It can be loaded into R using the following code (available as a gist):

## R code to get data and make plots
## You may have to install the devtools and RCurl packages first

The code will also regenerate the panels from Neyman et al.'s Figure 2.

Project Whitetop

Project Whitetop, a weather modification experiment performed during the summers in 1960-1964, was one of the first meticulously randomized, large-scale experiments of its kind. Before the experiment began, every day was designated as a "seed" day or a "non-seed" day (control).  The designation was kept secret until the last moment. Every day in the morning, the experimenters would determine whether the conditions were good for seeding. If there were westerly winds and "high precipitable water" in Little Rock, Arkansas and Columbia, Missouri, then the day was designated as an "experimental" day, and the envelope containing the seed instructions was opened. On seeded days, an airplane dumped silver iodide into the clouds around West Plains, Missouri. The area at the center of the concentric circles in the figure below shows the experimental area.

Figure 1 from Neyman et al (1969) overlaid on a modern map. Original caption reads "Approximate map of the region around the Project Whitetop target. Solid circles mark the location of rain gages used for the evaluation. The radii of the concentric circles are multiples of 30 miles; the letters A, B, C, D, E, and F designate the region within the inner circle and the regions within the successive rings, respectively. For example, region B is the area bounded by the 30 mile (inner) circle and the 60 mile (second) circle. Additionally, the area within the outermost circle is designated as 'entire' (Tables 1 and 2)."

The original analysis as reported by Neyman et al (1969) was of the change in precipitation in the hour when the seeded plume was overhead. There appeared to be an unexpected decrease in the precipitation due to the seeding ("some" p<0.01, as it was reported by Neyman et al).

Neyman's team was interested in assessing the effect of seeding at longer time scales (24 hours) and at greater distances (up to 180 miles from the experimental area). In their minds, these longer-term, larger-range effects were much more interesting from a policy perspective.

What did Neyman think makes a good analysis?

In the same year, Neyman, Scott, and Wells wrote a paper outlining statistical inference for weather modification experiments ("Statistics in meteorology", 1969). The paper is important because it lays out what we might expect from Neyman's Whitetop analysis. They describe power as related to the notion of an "informative" experiment, and briefly mention the "optimal" class of tests that will be used to analyse the Whitetop data. The critical role of power is emphasized: 
[The] rational planning of a rain stimulation experiment must emphasize the question whether, with this design, with this proposed duration and with this particular statistical test, the probability of detecting the effect of treatment that one wishes to detect is 0.2, or 0.5, or 0.8, etc. In other words, in experimentation with weather control, it is of paramount importance to estimate the power of the statistical test to be used on the data that may be provided by the contemplated design of the experiment. (p. 123)
And again on page 124: "[T]he power of the test to be used in the evaluation of a rain stimulation experiment is of prime importance." This makes it explicit: the optimality of the test is interesting, but the power of the test with reference to the experimental design is critical. An optimal test can give an uninformative experiment, if the design is bad (e.g., low sample size). It should not be a surprise that Neyman emphasizes the importance of ensuring the test is worthwhile before one undertakes it. This is especially critical if one faces interpreting null results ($p>\alpha$), as Neyman points out elsewhere.
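To make the point about design concrete, here is a minimal sketch of how power depends on sample size. It is not Neyman et al.'s method: I have swapped in an ordinary two-sample t test via R's `power.t.test`, and the standardized effect size (0.2), the group sizes, and the normality assumption are all hypothetical choices of mine.

```r
## Sketch only: power of a two-sided, two-sample t test at alpha = 0.05.
## ASSUMPTIONS (not from Neyman et al.): normal data with sd = 1, a
## hypothetical true difference of 0.2 sd, and hypothetical group sizes.
n_per_group <- c(25, 100, 400)
effect_size <- 0.2

power <- sapply(n_per_group, function(n) {
  power.t.test(n = n, delta = effect_size, sd = 1, sig.level = 0.05)$power
})
names(power) <- paste0("n = ", n_per_group)

## Probability of detecting the assumed effect at each sample size
round(power, 2)
```

Under these assumptions the same test goes from nearly uninformative to reasonably informative purely as a function of the design, which is the sense in which the power, not the optimality, is what matters.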

Neyman et al's Whitetop analysis

Here we discuss the analysis in Neyman et al. (1969) ("Areal Spread of the Effect of Cloud Seeding at the Whitetop Experiment"). In order to examine the effect of distance, they decided to use data from the 174 rain gages within 180 miles of the experimental area. The goal, according to Neyman et al, was two-fold:
Specifically, an effort was made to determine (i) the differences in the 24-hour precipitation amounts at different distances from the center of the Whitetop target, averaged over the 102 days with seeding and over the 96 experimental days without seeding, and (ii) the probability (P) of obtaining such differences, or larger, purely through unavoidable chance variation. (pp. 1445-1446)
Of critical interest to us is how P (the p value) is interpreted later on. The figure above (overlaid on the map) shows how Neyman et al. divided the area into six concentric regions. For each region A-F, the percent change in precipitation was computed, along with a two-tailed p value. Note that a positive change was expected, and only the discovery of the negative effect in this same data set would lead one to look for a negative effect.

Before we look at the analysis results themselves, let me emphasize that I am not concerned with whether Neyman et al are correct; what I'm interested in is how they use statistics to support their case. Of special interest are some ideas that never appear in this paper. These include:
  • An $\alpha$ level
  • Power
  • Error rates
  • A significant or nonsignificant result (other than the already-mentioned, previously-published $p<.01$ result with the same data, which is described as "significant")
  • Pre-determined decisions
With that in mind, we can look at the results, which I have combined from their Table 2 and Figure 2. The first panel shows the effect of the seeding on days that are "wet" (that is, given rain occurred, how much did it rain?). The second panel shows the effect of seeding on all days.
Recreation of top panel from Neyman et al's (1969) Figure 2. See bottom panel for original caption.
Recreation of bottom panel from Neyman et al's (1969) Figure 2. Note that these points are not independent. Original caption reads: "Average daily precipitation versus average distance from the target center. (Top) Precipitation averaged per wet day; (bottom) precipitation averaged per day, wet or dry. In each case the upper curve represents experimental days not seeded, the middle curve represents experimental days seeded, and the lower curve represents the 267 days of June, July, and August 1960-64, which were not classified as experimental." I have not added the lower line, since these numbers are not included in the tables (and they are irrelevant).

Neyman et al refer primarily to the results for all days. This is how Neyman et al describe the results:
The estimate of the average seeding effect in the entire region is a 21-percent loss of rain. In the absence of a real effect, chance alone could produce such an estimate of loss, or a larger one, about once in 15 independent trials. (p 1447)
Note the switch from a two-sided p value (p=0.13, lower panel) to a one-sided p value! This occurs without any a priori mention of this hypothesis, aside from the fact that it was found in these same data previously; there is no mention of any pre-determined decision criterion. In fact, they note that this magnitude of negative effect is supported by no "intelligible theory" (p. 1447). This is a purely evidential use of the one-sided p value. They continue:
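As a rough check on where "about once in 15" comes from, the arithmetic can be sketched in R. I am assuming the one-sided value is simply half the two-sided one, which holds for a symmetric test statistic when the observed effect lies in the tested direction; that is my assumption, not a computation described in the paper.

```r
## Two-sided p value for all days, entire region (lower panel, Table 2)
p_two_sided <- 0.13

## "About once in 15 independent trials", expressed as a probability
p_quoted <- 1 / 15

## ASSUMPTION: symmetric null distribution, so one-sided p = two-sided p / 2
p_one_sided <- p_two_sided / 2

## Compare the halved two-sided value with the quoted "once in 15"
c(one_sided = p_one_sided, quoted = p_quoted, trials = 1 / p_one_sided)
```

The halved two-sided value and the quoted figure agree to within rounding, which is why I read the "once in 15" statement as a one-sided p value.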
From the point of view of the question as to whether the current state of weather modification technology justifies its use for alleviating water shortages, [the data] appear decisive. As already mentioned, the Whitetop experiment was conducted in a locality where summer precipitation is critical. In fact, the possibilities of increases due to seeding as modest as 5 to 10 percent have been mentioned as something to be hoped for. When instead of such gains the experimental results show losses averaging 20 percent over an area of some 100,000 square miles, then even the slightest possibility that these losses were caused by seeding must be considered as disqualifying the underlying technology. Actually, the evidence in support of the causal relation between seeding and loss of rain appears quite strong. (p 1447; emphasis mine)
The only mention of a decision is related to the evidence -- that is, it is decisive -- not any pre-planned decision. Moreover, the interpretation is concerned with the possibility, on a graded scale, that the effect is real and negative, and the evidence is "quite strong" -- again, a graded way of referring to the strength of the evidence.

Almost immediately, Battan (1969) responded to Neyman et al, saying what most readers of this blog post were probably thinking on seeing the mediocre p values Neyman et al used as evidence:
The two-tailed significance levels in the two tables are not so small as to make it self-evident that the rainfall differences were caused by seeding. Several hypotheses might be offered to explain effects of seeding downwind of the seeding area, but no plausible hypothesis has been offered to explain effects upwind and to the side to distances of 180 miles. (p. 618)
In the clearest sign that their use of statistical inference was evidential and post-data, rather than decision-based and pre-data, Neyman et al responded:
Battan is certainly entitled to his opinion that "significance levels . . . are not so small as to make it self-evident that the rainfall differences were caused by seeding." In fact, we agree about the lack of self-evidence. But, if there is anything in the contention that a gain in the rainfall of 5 to 10 percent is worth talking about, then a 20 percent loss, experienced over a vast area of some 100,000 square miles, must be a disaster. In these conditions, the odds of 14 to 1 that this loss was caused by seeding do not appear negligible to us. We feel that it is imperative that the general public and the government be informed of the situation. (p. 618, emphasis mine)
In a stunning misuse of statistics, Neyman et al have confused a p value with posterior odds. Not just any p value; this was a p value from a post hoc one-tailed test. The need for a post-data mode of statistical inference is so great that Neyman -- who was famed for his pre-data theory of statistical inference -- is forced into a basic fallacy when responding to a critic. To me, this is quite remarkable.
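The "14 to 1" figure appears to be nothing more than the "once in 15" tail probability rewritten as odds. A minimal sketch of that arithmetic (the variable names are mine, and the reading of how the odds were obtained is my inference from the two quotes):

```r
## "About once in 15 independent trials" (Neyman et al., p. 1447)
p_one_sided <- 1 / 15

## Re-expressing the tail probability as "odds": (1 - p) / p
odds <- (1 - p_one_sided) / p_one_sided
odds
```

The result is 14, matching the quoted odds. But $(1-p)/p$ is only a re-expression of the probability of data at least this extreme given no effect. Posterior odds that the loss was caused by seeding would also require a prior and the probability of the data under the alternative, neither of which enters this calculation.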

Wrapping up

Fisher and Neyman disagreed about the philosophy of statistical inference. To Fisher (and indeed to almost all scientists), statistical inference was post-data and evidential. Neyman, however, had a pre-data, behavioural view of statistical inference. Neyman's viewpoint (which, unfortunately, has stuck around a long time in training of scientists using Type I and Type II errors) is not one that is conducive to science. Neyman himself, when doing science, appears to have had a post-data mind-set.

In part 3 of the series, I will look at the aftermath of Neyman's Whitetop analysis, and how Neyman eventually abandoned the conclusions.
