Assuming that the trials described in Reports B [6] and C [7] were conducted and reported independently, how likely is the close agreement shown in table 3? This depends on the variability of mNDI: the greater the variability, the less likely the agreement. Unfortunately, as noted above, the different sources of information about variability (p values and SDs) within each of Reports B and C are inconsistent. Thus, to estimate the probability that the percentage in each replication report would agree as well as it did with the percentage in Report A [5], we assumed that the reported means (mNDIs) are correct, and made three different assumptions about variability to obtain estimates of the variances of the mNDIs. Because the assumption that the reported variability measures were in fact SDs led to the implausible significance levels shown in tables 1 and 2, and because of their great divergence from SDs in other studies (Figure 1), we have not considered this possibility.

According to Assumption 1, what were called "SDs" in the replication reports were actually correct values for the SEs. The variance of mNDI, var(mNDI), is SE^{2}.

According to Assumption 2, the relation between SD and mean is given by the top fitted line in Figure 1: SD = 1.59 × Mean, the line fitted to the data from the bronchitis groups. (This seems more plausible than Assumption 1.) Thus we used this equation with the mNDI values reported in Reports B and C to estimate the SDs, and from these and the sample sizes, var(mNDI).
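As a concrete illustration of the Assumption 2 calculation, a group's SD can be estimated from the fitted relation and converted to the variance of the group mean. This is a minimal sketch; the numbers in the example are hypothetical, not those reported in any of the studies.

```python
def var_mndi(mean, n, slope=1.59):
    """Variance of a group's mean NDI under Assumption 2:
    SD is estimated as slope * mean, and var(mean) = SD**2 / n."""
    sd = slope * mean      # SD inferred from the fitted line SD = 1.59 * Mean
    return sd ** 2 / n     # variance of the group mean

# Hypothetical example: mean mNDI of 20.0 in a group of 25 subjects.
print(var_mndi(20.0, 25))  # (1.59 * 20.0)**2 / 25, approximately 40.4496
```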

According to Assumption 3, the reported significance levels are correct. However, whereas we need exact p values, we are given only upper bounds (p < 0.03 for Report B, p < 0.02 for Report C). We therefore used the next smaller round value (p = 0.02 for Report B, p = 0.01 for Report C); because a smaller p value implies a smaller SE, and hence less variability, this biases the result towards higher probabilities of close agreement. Assuming that the p values were obtained from one-tailed t tests, this enabled us to estimate the SEs and, from these, the values of var(mNDI). We obtained these SE estimates using three different methods, and report results from the method that produced the highest probabilities of agreement. The three methods were: (a) a conventional t test, assuming that the true SDs differ from those reported by a common factor; (b) a conventional t test, assuming that the true SDs are equal; and (c) the Welch modified two-sample t test, assuming that the true SDs differ from those reported by a common factor. Because of the bias mentioned above, the estimated probabilities of close agreement based on Assumption 3 are upper bounds.
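The back-calculation at the heart of Assumption 3 can be sketched as follows: given an assumed exact one-tailed p value and the difference between the group means, recover the implied SE of that difference. For simplicity this sketch uses the Gaussian quantile as a large-degrees-of-freedom stand-in for the t quantile (the analysis described above used t tests), and all numbers in the example are hypothetical.

```python
from statistics import NormalDist

def se_from_p(p, mean_s, mean_p):
    """SE of (mean_s - mean_p) implied by a one-tailed p value,
    using the Gaussian quantile as a large-df approximation to t."""
    z = NormalDist().inv_cdf(1.0 - p)    # critical value that yields p
    return abs(mean_s - mean_p) / z      # SE = |difference| / critical value

# Hypothetical example: group means 30.0 vs 20.0 with p = 0.02.
se = se_from_p(0.02, 30.0, 20.0)
print(round(se, 3))
```

From the SE one then obtains var(mNDI) for use in the simulation, as with the other assumptions.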

Estimation of the probability also requires us to assume the form of the distribution of mNDI; we made our estimates assuming both Gaussian and gamma distributions consistent with the reported means and with the inferred values of var(mNDI). For each of the two replication studies and each assumption, the desired probability was estimated as follows. Given the mean, inferred variance, and assumed distributional form of mNDI for the supplement and placebo groups of subjects in that study, we generated a sample of 1,000,000 pairs of mNDI values (mNDI_{s} for the supplement group and mNDI_{p} for the placebo group). Next, we obtained the ratio of the two numbers in each pair, mNDI_{s}/mNDI_{p}. We then determined what proportion of these simulated ratios were at least as close to the ratio in Report A as the ratio observed in that replication study was.
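The simulation step just described can be sketched as follows. This is a minimal illustration covering only the Gaussian case, with made-up means, variances, and ratios rather than the values inferred from the reports; the sample size is reduced for speed.

```python
import random

def prob_agreement(mean_s, var_s, mean_p, var_p,
                   ratio_original, ratio_observed,
                   n=1_000_000, seed=1):
    """Proportion of simulated mNDI_s/mNDI_p ratios at least as close
    to the original (Report A) ratio as the observed ratio was."""
    rng = random.Random(seed)
    target = abs(ratio_observed - ratio_original)
    hits = 0
    for _ in range(n):
        mndi_s = rng.gauss(mean_s, var_s ** 0.5)  # supplement group
        mndi_p = rng.gauss(mean_p, var_p ** 0.5)  # placebo group
        if abs(mndi_s / mndi_p - ratio_original) <= target:
            hits += 1
    return hits / n

# Hypothetical numbers only; n is reduced here for speed (the analysis
# described above used 1,000,000 pairs per study).
print(prob_agreement(30.0, 9.0, 20.0, 4.0,
                     ratio_original=1.48, ratio_observed=1.50,
                     n=100_000))
```

The gamma-distributed case follows the same pattern, with the two draws replaced by gamma variates whose shape and scale match the same means and variances.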

The probability that both of two independent replication studies would agree at least as well as they did with the original is the product of their separate probabilities. The estimated probability is therefore the product of the two proportions, one for each replication study. Under Assumption 1, the estimated probabilities for Gaussian and gamma distributions are 0.00203 and 0.00204, respectively; under Assumption 2 they are both 0.00019; and under Assumption 3 the upper bounds are 0.00066 and 0.00065, respectively.

Even under Assumption 1, which is implausible given the variability of NDI in other studies shown in Figure 1, the means and variabilities of these two sets of data make the observed closeness of agreement extremely unlikely. Under both of the other assumptions the chance of such good agreement is vanishingly small: too good to be true. These results also show that our conclusion is insensitive to the choice of distribution.