## Sunday, February 23, 2014

### Bayes factor t tests, part 2: Two-sample tests

In the previous post, I introduced the logic of Bayes factors for one-sample designs by means of a simple example. In this post, I will give more detail about the models and assumptions used by the BayesFactor package, and also how to do simple analyses of two-sample designs.
See the previous posts for background.

## Bayesian t tests

Frisby and Clatworthy (1975) [DASL] presented random dot stereograms to 78 participants. Each participant was asked to report how long it took them to fuse the random dot stereogram. Of the 78 participants, 35 were given extra visual information about the target image, in the form of a 3D model. The remaining participants were given no such information.
The question of interest is whether the visual information affected the fusion times for the random dot stereograms. We begin by loading the BayesFactor package and reading the data into R.
# Load the BayesFactor package
library(BayesFactor)

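# Read in the data from the Data and Story Library
# (dataURL stands in for the DASL data file address, which is missing from this listing)
randDotStereo = read.table(dataURL, header = FALSE, skip = 33, nrows = 78)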
colnames(randDotStereo) = c("fuseTime", "condition")
randDotStereo$logFuseTime = log(randDotStereo$fuseTime)

Because the response times are right-skewed, we perform a conventional logarithmic transformation. All plots and tests will be on the transformed data. The figures below show summaries of the data, in the form of a box plot and a plot with means and standard errors.

The condition in which participants were given visual information about the target image in the random dot stereogram yields slightly faster fusion times, on average. Before exploring the results of a Bayes factor analysis, we first compute the classical t test for comparison.
classical.test = t.test(logFuseTime ~ condition, data = randDotStereo, var.equal = TRUE)
classical.test

##
##  Two Sample t-test
##
## data:  logFuseTime by condition
## t = 2.319, df = 76, p-value = 0.02308
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  0.06077 0.80034
## sample estimates:
## mean in group NV mean in group VV
##            1.820            1.389

The classical t test yields a statistically significant result at $\alpha=0.05$, with $p=0.0231$. The standardized effect size is $d=0.5277$ (95% CI: $[0.072, 0.98]$). This would typically be interpreted (incorrectly) as indicating strong enough evidence against the null hypothesis to reject it. To truly evaluate the evidence against the null hypothesis, we need a Bayes factor; to compute a Bayes factor, we need to outline two hypotheses.
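
The standardized effect size quoted above can be recovered from the reported t statistic. This is a minimal sketch that assumes group sizes of 43 (NV) and 35 (VV), which follow from the 78 participants and the 35 who received visual information; the variable names are illustrative.

```r
# Summary values taken from the classical test output above
t_stat <- 2.319
n_nv <- 78 - 35  # participants without visual information
n_vv <- 35       # participants with visual information

# For an equal-variance two-sample t test, d = t * sqrt(1/n1 + 1/n2)
d <- t_stat * sqrt(1 / n_nv + 1 / n_vv)
round(d, 2)  # 0.53
```

The small difference from 0.5277 comes from using the rounded t statistic.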

### Finding two hypotheses to compare

A t test is typically formulated to test the plausibility of the null hypothesis; however, in order to determine the plausibility of the null hypothesis, we must compare it to something. Formulating the null hypothesis is easy; it states simply that there is no effect, implying that $\delta=0$ (where $\delta$ is the true standardized effect size).
We will compare the null hypothesis to an alternative, so we must decide what standardized effect sizes would be plausible if the null hypothesis were false. The BayesFactor package has a few built-in “default” named settings. The figure below shows three alternative hypotheses as prior distributions over the true effect size $\delta$.

These priors all have the same shape; they differ only in their scale, denoted by $r$. The three named defaults are given in the following table.

| Prior name | $r$ scale |
|---|---|
| \*medium | $\sqrt{2}/2$ |
| wide | $1$ |
| ultrawide | $\sqrt{2}$ |

“Medium”, indicated by a star, is the default. The scale controls how large, on average, the expected true effect sizes are. For a particular scale $r$, 50% of the prior probability for the true effect size lies within the interval $(-r,r)$. For the default scale of “medium”, 50% of the prior effect sizes are within the range $(-0.7071,0.7071)$. Increasing $r$ increases the sizes of expected effects; decreasing $r$ decreases the size of the expected effects.
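
The 50% statement can be checked numerically. The sketch below assumes that the prior on $\delta$ is a Cauchy distribution centered at 0 with scale $r$, as in the JZS prior used by the package; it needs only base R, and the variable names are illustrative.

```r
# Named prior scales from the table above
r_scales <- c(medium = sqrt(2) / 2, wide = 1, ultrawide = sqrt(2))

# Prior probability that the true effect size lies in (-r, r) under a Cauchy(0, r) prior
mass_within_r <- pcauchy(r_scales, scale = r_scales) - pcauchy(-r_scales, scale = r_scales)
mass_within_r  # 0.5 for every scale
```

Because $\pm r$ are the quartiles of a Cauchy distribution with scale $r$, the answer does not depend on which scale is chosen.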

### Performing the test

We can now compute a Bayes factor using the BayesFactor package. To compute a two-sample t test, we use the ttestBF function. The formula argument indicates the dependent variable on the left of the ~, and the grouping variable on the right. The data argument is used to pass the data frame containing the dependent and independent variables as columns.
bf = ttestBF(formula = logFuseTime ~ condition, data = randDotStereo)
bf

## Bayes factor analysis
## --------------
## [1] Alt., r=0.707 : 2.322 ±0.06%
##
## Against denominator:
##   Null, mu1-mu2 = 0
## ---
## Bayes factor type: BFindepSample, JZS

The Bayes factor analysis with the default prior settings yields a Bayes factor of 2.3224 in favor of the alternative. What does this mean?
1. The data are 2.3224 times as probable under the alternative as under the null.
2. The relative odds in favor of the alternative, against the null, must change by a factor of 2.3224 in light of the data. For instance, if we held even odds before seeing the data, then our posterior odds would be 2.3224:1 for the alternative. This is regarded as weak evidence, though it does slightly favor the alternative.
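
The updating rule in the second point is a single multiplication. Here is a small sketch using the reported Bayes factor; the skeptical prior odds of 1:4 are a made-up illustration, not a value from the analysis.

```r
bf10 <- 2.3224  # Bayes factor reported above

# Posterior odds = Bayes factor * prior odds
prior_odds <- c(even = 1, skeptical = 1 / 4)  # hypothetical prior odds for the alternative
posterior_odds <- bf10 * prior_odds

# Convert odds to the posterior probability of the alternative
posterior_prob <- posterior_odds / (1 + posterior_odds)
round(posterior_prob, 2)  # even: 0.70, skeptical: 0.37
```

Someone who started out skeptical would still consider the null more probable after seeing these data, which is one way to see why a Bayes factor of this size counts as weak evidence.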
If we regarded the significant $p$ value to be an indication that there was “enough” evidence to reject the null hypothesis, we might be surprised by this result. Do the classical result and the Bayesian result contradict one another? Yes, and no. If we interpret the classical $p$ value as a measure of the weight of the evidence — a common, though incorrect, interpretation — then the results contradict. However, we must remember that classical statistics has no formal notion of the weight of evidence. The significance test is merely a test with a fixed error rate of $\alpha$. Interpreted correctly, then, the results do not contradict one another. But correctly interpreted, “significance” is an uninteresting concept; the Bayesian notion of the strength of evidence is more in line with the interests of researchers.

### A one-sided test

In the random dot stereogram data, the two-sided hypothesis we chose might be inappropriate. It seems reasonable to ask the question “How strong is the evidence that giving visual information about the random dot stereogram yields faster responses?” if we thought that having visual information of the picture in the stereogram helped participants fuse the image. This question suggests testing the hypothesis that $\delta>0$ versus $\delta=0$.
For the one-sided test, we use the positive (or negative) portion of the prior distribution. Indicating that we want to restrict our alternative to a range is done using the nullInterval argument:
bf.signed = ttestBF(formula = logFuseTime ~ condition, data = randDotStereo,
nullInterval = c(0, Inf))
bf.signed

## Bayes factor analysis
## --------------
## [1] Alt., r=0.707 0<d<Inf    : 4.567   ±0.06%
## [2] Alt., r=0.707 !(0<d<Inf) : 0.07747 ±0.06%
##
## Against denominator:
##   Null, mu1-mu2 = 0
## ---
## Bayes factor type: BFindepSample, JZS

(see also the previous blog post.) The BayesFactor package helpfully returns the Bayes factor for the selected interval against the null and the Bayes factor for the complement of the selected interval against the null. The Bayes factor for the one-sided alternative that the effect size is positive, against the null, is 4.5673. Because the effect size in the data is consistent with the true effect size being positive, the weight of evidence is greater relative to the two-sided test.
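
To see where the one-sided value comes from, the Bayes factor can be approximated from the t statistic alone. This is a sketch under stated assumptions, not the package's implementation. It assumes the JZS setup (a Cauchy prior with scale $r$ on $\delta$), under which the t statistic follows a noncentral t distribution given $\delta$, and group sizes of 43 and 35. The helper name `jzs_bf` is hypothetical. Restricting the alternative to an interval means integrating over that interval and renormalizing the prior by its mass there.

```r
# Approximate JZS Bayes factor (alternative vs. point null) from a two-sample t statistic.
# jzs_bf is an illustrative helper, not a BayesFactor function.
jzs_bf <- function(t_stat, n1, n2, r = sqrt(2) / 2, lower = -Inf, upper = Inf) {
  df <- n1 + n2 - 2
  n_eff <- n1 * n2 / (n1 + n2)
  # Likelihood of the observed t for a given delta, weighted by the Cauchy prior
  integrand <- function(delta) {
    dt(t_stat, df, ncp = delta * sqrt(n_eff)) * dcauchy(delta, scale = r)
  }
  marginal <- integrate(integrand, lower, upper)$value
  # Renormalize the truncated prior so it integrates to 1 over (lower, upper)
  prior_mass <- pcauchy(upper, scale = r) - pcauchy(lower, scale = r)
  (marginal / prior_mass) / dt(t_stat, df)
}

bf_two_sided <- jzs_bf(2.319, 43, 35)
bf_positive  <- jzs_bf(2.319, 43, 35, lower = 0)
bf_negative  <- jzs_bf(2.319, 43, 35, upper = 0)

# The two-sided Bayes factor is the equal-weight average of the one-sided ones,
# because the symmetric prior puts half its mass on each side of zero
c(bf_two_sided, bf_positive, bf_negative, (bf_positive + bf_negative) / 2)
```

If these assumptions hold, the results should land close to the values reported by the package, up to rounding of the t statistic and numerical integration error. The averaging relationship can also be checked directly from the printed output: $(4.567 + 0.07747)/2 \approx 2.322$.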

### Comparing other hypotheses

Suppose that we were uninterested in testing the point-null hypothesis that $\delta=0$. We might, for instance, care more about whether the effect size is positive versus negative. In the case of the random dot stereograms, one hypothesis might state that the visual information actually slows fusion times, and the other that visual information speeds fusion times. We already have enough information to compute such a Bayes factor from the analyses above.
Recall that the Bayes factor is the ratio of the probability of the data under the two hypotheses; a hypothesis is favored to the extent that it specifies that the observed data is more probable. In bf.signed, we have two Bayes factors, both against the same denominator model: the null hypothesis. In order to compute a third Bayes factor comparing the positive effect sizes to the negative ones, all we have to do is divide:
\[ \begin{eqnarray*} &&\left.\frac{\mbox{Probability of the data if}~\delta>0}{\mbox{Probability of the data if}~\delta=0}\middle/\frac{\mbox{Probability of the data if}~\delta<0}{\mbox{Probability of the data if}~\delta=0}\right.\\ &=&\frac{\mbox{Probability of the data if}~\delta>0}{\mbox{Probability of the data if}~\delta<0} \end{eqnarray*} \]
The identical denominators cancel, leaving our new Bayes factor. This is easy using the BayesFactor package. Since bf.signed contains the two Bayes factors of interest, we just divide the first element by the second element:
bf.pos.vs.neg = bf.signed[1]/bf.signed[2]
bf.pos.vs.neg

## Bayes factor analysis
## --------------
## [1] Alt., r=0.707 0<d<Inf : 58.96 ±0.09%
##
## Against denominator:
##   Alternative, r = 0.707106781186548, mu1-mu2 =/= 0 !(0<d<Inf)
## ---
## Bayes factor type: BFindepSample, JZS

Note that the numerator is now restricted to positive effect sizes, and the denominator is restricted to negative effect sizes. The Bayes factor is 58.9575, indicating that positive effect sizes are strongly favored over negative ones.
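
The division can also be checked by hand from the rounded values printed in the one-sided analysis; the variable names here are illustrative.

```r
bf_pos_vs_null <- 4.5673   # delta > 0 against the null
bf_neg_vs_null <- 0.07747  # delta < 0 against the null

# The common denominator (the null) cancels in the ratio
bf_pos_vs_neg <- bf_pos_vs_null / bf_neg_vs_null
round(bf_pos_vs_neg, 1)  # 59
```

The small discrepancy from 58.9575 comes only from rounding the two inputs.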
The same logic can be used to perform other tests, as well; consider testing a positive, but small ($0<\delta\leq.2$), effect, against a positive and medium-or-larger sized effect ($\delta>.2$):
## Each of these will contain two Bayes factors, the first of which is of
## interest
bf.negl = ttestBF(formula = logFuseTime ~ condition, data = randDotStereo, nullInterval = c(0,
0.2))
bf.nonnegl = ttestBF(formula = logFuseTime ~ condition, data = randDotStereo,
nullInterval = c(0.2, Inf))

bf.nonnegl[1]/bf.negl[1]

## Bayes factor analysis
## --------------
## [1] Alt., r=0.707 0.2<d<Inf : 1.917 ±0.09%
##
## Against denominator:
##   Alternative, r = 0.707106781186548, mu =/= 0 0<d<0.2
## ---
## Bayes factor type: BFindepSample, JZS

The Bayes factor of 1.9168 is extremely weak evidence against the small-effect-size hypothesis.

### The effect of the prior

Whenever we do a Bayesian analysis, we must use a prior distribution. In the BayesFactor package, a flexible family of priors is provided for you. An analysis specifies the spread of possible true values of $\delta$ under the alternative hypothesis. Predictions for data under the alternative are obtained by combining the uncertainty in the true value with the uncertainty from sampling; thus, a reasonable prior distribution is critical because the prior distribution makes the alternative testable through its predictions.
Reasonable predictions for data are necessary for reasonable inference, and a reasonable prior is necessary for reasonable predictions. Given a reasonable prior, evaluations of the evidence must be based on Bayes' theorem, and hence the Bayes factor (see “What is a Bayes factor?”). The problem is that while we might have a sense that a particular prior is “reasonable”, others might be reasonable as well. We might, for instance, be uncertain whether the “medium” or “wide” prior scale is more reasonable for the t test outlined above. It is interesting, therefore, to understand how the Bayes factor changes with the prior scale.
The figure below shows the predictions for the observed effect size for four hypotheses: the null hypothesis, and the three named alternative scales used in the BayesFactor package. The vertical line shows the effect size observed in the random dot stereogram experiment.

The alternative hypotheses (“medium”, “wide”, and “ultrawide”) show increasingly spread out predictions for the observed effect sizes, which follow from the fact that the corresponding priors on the true effect size $\delta$ are increasingly wide. The predictions under the null hypothesis, on the other hand, are more concentrated around the effect size of 0. The prior scales affect the Bayes factor because a hypothesis that has more spread out predictions will have lower density on any one prediction. This is a natural and desirable consequence of probability theory: a hypothesis that predicts everything predicts nothing.
The figure below shows how the Bayes factor for the random dot stereogram data is related to the prior scale. The scales vary by a factor of about 16, a substantial range. Although the Bayes factor is certainly sensitive to these changes, even changes of this large magnitude do not markedly change the Bayes factor. Any one of these chosen prior scales would lead to the conclusion that there is equivocal evidence for an effect.

There are two main trends to understand: how the Bayes factor changes as the prior scale becomes small, and how it changes as the prior scale becomes very large.
• As the prior scale approaches 0, the alternative looks more and more like the null. The Bayes factor approaches 1.
• As the prior scale approaches $\infty$, the alternative predicts larger and larger observed values. The Bayes factor eventually favors the null to arbitrary degree.
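
The second trend can be explored numerically from the t statistic alone. This is again a sketch under assumptions rather than the package's own code: it assumes the JZS setup (Cauchy prior with scale $r$ on $\delta$, noncentral t sampling distribution) and group sizes of 43 and 35. The helper `bf_for_scale` and the scale labeled `huge` are hypothetical.

```r
# Illustrative helper (not part of BayesFactor): two-sided JZS Bayes factor from a t statistic
bf_for_scale <- function(r, t_stat = 2.319, n1 = 43, n2 = 35) {
  df <- n1 + n2 - 2
  n_eff <- n1 * n2 / (n1 + n2)
  # Marginal density of t under the alternative: average over the Cauchy prior on delta
  integrand <- function(delta) {
    dt(t_stat, df, ncp = delta * sqrt(n_eff)) * dcauchy(delta, scale = r)
  }
  integrate(integrand, -Inf, Inf)$value / dt(t_stat, df)
}

r_grid <- c(medium = sqrt(2) / 2, wide = 1, ultrawide = sqrt(2), huge = 10)
bf_by_scale <- sapply(r_grid, bf_for_scale)
bf_by_scale  # shrinks as the scale grows past the size of the observed effect

# With the package and the original data frame, the scale is set via rscale, e.g.
# ttestBF(formula = logFuseTime ~ condition, data = randDotStereo, rscale = "wide")
```

Very small scales are left out of the grid on purpose: the prior then becomes a narrow spike at zero, which a general-purpose integrator can miss without extra care.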
It is important to emphasize that the sensitivity of the Bayes factor to the prior is not a “bad” or “undesirable” feature of the Bayes factor. The sensitivity to the prior is simply a sensitivity to predictions of what data are plausible under the hypotheses. There is no question that inferences should be sensitive to predictions. The only question is how sensitive they should be, and Bayes' theorem gives us a definitive answer: the Bayes factor is exactly as sensitive as Bayes' theorem requires.
For more on the sensitivity of the BayesFactor t tests to the prior, see this shiny app, which demonstrates the effect of the prior on data predictions and the Bayes factor. See also Felix Schönbrodt's recent post on the topic, and the associated shiny app.

In the next post — part 3 in the t test series — I will discuss using BayesFactor to sample from posterior distributions.

#### 1 comment:

1. If you're trying to provide evidence for the null hypothesis, is it reasonable to choose the prior scale that yields the Bayes factor that most favors the alternative, as a sort of "worst case" result for the null?