See the introduction for an overview. Also see other analyses of this data. See the textbook for a full discussion. See STAN analysis with default prior. See INLA analysis.

Load the libraries:

library(ggplot2)
library(rstan, quietly=TRUE)
library(reshape2)
library(dplyr,quietly=TRUE)

Data

Load up and look at the data:

data(pulp, package="faraway")
summary(pulp)

     bright     operator
 Min.   :59.8   a:5     
 1st Qu.:60.0   b:5     
 Median :60.5   c:5     
 Mean   :60.4   d:5     
 3rd Qu.:60.7           
 Max.   :61.0

ggplot(pulp, aes(x=operator, y=bright))+geom_point(position = position_jitter(width=0.1, height=0.0))

Fitting the Model

Set up STAN to use multiple cores. Set the random number seed for reproducibility.

rstan_options(auto_write = TRUE)
options(mc.cores = parallel::detectCores())
set.seed(123)

We need the STAN command file pcprior.stan. We have used uninformative priors for the overall mean and error SD but an exponential prior on the random effect SD. This is a penalised complexity prior.

Prepare data in a format consistent with the command file. Needs to be a list. Can’t use the word operator since this is reserved for system use in STAN. Need to compute lambda for scaling of the PC prior.

lambda <- -log(0.01)/(3*sd(pulp$bright))
pulpdat <- list(N=nrow(pulp),J=length(unique(pulp$operator)),response=pulp$bright,predictor=as.numeric(pulp$operator),lambda=lambda)

Break the fitting process into three steps:

rt <- stanc(file="pcprior.stan")
sm <- stan_model(stanc_ret = rt, verbose=FALSE)
system.time(fit <- sampling(sm, data=pulpdat))

   user  system elapsed 
  0.295   0.083   4.929

By default, we use 2000 iterations but repeated with independent starts 4 times giving 4 chains. We can thin but do not by default. The warmup period is half the number of observations (which is very conservative in this instance).

Diagnostics

Diagnostics to check the convergence are worthwhile.

traceplot(fit, pars="mu")

We see the traces of the four chains overlaid in different colors. The chains appear roughly stationary.

The similar plots can be produced for the two variance terms although note that STAN uses the standard deviations (which we prefer). Here is the group (operator) SD:

traceplot(fit, pars="sigmaalpha")

This looks acceptable. We expect that the distribution will be asymmetric so this is no concern. The chains stay away from zero (or close to it). Here’s the same plot for the error SD.

traceplot(fit, pars="sigmaepsilon")

Again this looks satisfactory.

Output summaries

Examine the output:

fit

Inference for Stan model: pcprior.
4 chains, each with iter=2000; warmup=1000; thin=1; 
post-warmup draws per chain=1000, total post-warmup draws=4000.

              mean se_mean   sd  2.5%   25%   50%   75% 97.5% n_eff Rhat
eta[1]       -0.41    0.04 0.73 -1.90 -0.88 -0.39  0.06  1.03   319 1.01
eta[2]       -0.86    0.04 0.80 -2.47 -1.40 -0.84 -0.34  0.66   376 1.01
eta[3]        0.56    0.04 0.74 -0.97  0.10  0.57  1.04  1.98   443 1.00
eta[4]        0.69    0.03 0.76 -0.80  0.21  0.68  1.18  2.18   543 1.01
mu           60.40    0.02 0.17 60.00 60.30 60.40 60.50 60.76   128 1.02
sigmaalpha    0.27    0.01 0.17  0.03  0.16  0.24  0.35  0.68   330 1.01
sigmaepsilon  0.36    0.00 0.07  0.25  0.31  0.35  0.40  0.54   521 1.01
a[1]         60.30    0.00 0.14 60.00 60.21 60.30 60.39 60.57  1829 1.00
a[2]         60.19    0.00 0.16 59.88 60.08 60.18 60.29 60.50  1323 1.00
a[3]         60.55    0.00 0.15 60.26 60.45 60.54 60.64 60.83  1308 1.00
a[4]         60.58    0.00 0.15 60.29 60.47 60.58 60.67 60.88  1337 1.00
yhat[1]      60.30    0.00 0.14 60.00 60.21 60.30 60.39 60.57  1829 1.00
yhat[2]      60.30    0.00 0.14 60.00 60.21 60.30 60.39 60.57  1829 1.00
yhat[3]      60.30    0.00 0.14 60.00 60.21 60.30 60.39 60.57  1829 1.00
yhat[4]      60.30    0.00 0.14 60.00 60.21 60.30 60.39 60.57  1829 1.00
yhat[5]      60.30    0.00 0.14 60.00 60.21 60.30 60.39 60.57  1829 1.00
yhat[6]      60.19    0.00 0.16 59.88 60.08 60.18 60.29 60.50  1323 1.00
yhat[7]      60.19    0.00 0.16 59.88 60.08 60.18 60.29 60.50  1323 1.00
yhat[8]      60.19    0.00 0.16 59.88 60.08 60.18 60.29 60.50  1323 1.00
yhat[9]      60.19    0.00 0.16 59.88 60.08 60.18 60.29 60.50  1323 1.00
yhat[10]     60.19    0.00 0.16 59.88 60.08 60.18 60.29 60.50  1323 1.00
yhat[11]     60.55    0.00 0.15 60.26 60.45 60.54 60.64 60.83  1308 1.00
yhat[12]     60.55    0.00 0.15 60.26 60.45 60.54 60.64 60.83  1308 1.00
yhat[13]     60.55    0.00 0.15 60.26 60.45 60.54 60.64 60.83  1308 1.00
yhat[14]     60.55    0.00 0.15 60.26 60.45 60.54 60.64 60.83  1308 1.00
yhat[15]     60.55    0.00 0.15 60.26 60.45 60.54 60.64 60.83  1308 1.00
yhat[16]     60.58    0.00 0.15 60.29 60.47 60.58 60.67 60.88  1337 1.00
yhat[17]     60.58    0.00 0.15 60.29 60.47 60.58 60.67 60.88  1337 1.00
yhat[18]     60.58    0.00 0.15 60.29 60.47 60.58 60.67 60.88  1337 1.00
yhat[19]     60.58    0.00 0.15 60.29 60.47 60.58 60.67 60.88  1337 1.00
yhat[20]     60.58    0.00 0.15 60.29 60.47 60.58 60.67 60.88  1337 1.00
lp__          5.56    0.12 2.45  0.09  4.12  5.97  7.40  9.28   410 1.01

Samples were drawn using NUTS(diag_e) at Thu Feb 11 10:57:22 2016.
For each parameter, n_eff is a crude measure of effective sample size,
and Rhat is the potential scale reduction factor on split chains (at 
convergence, Rhat=1).

We are not interested in the yhat values as these duplicate the a values. In bigger datasets, there might be a lot of these so we can select which parameters we view:

print(fit, pars=c("mu","sigmaalpha","sigmaepsilon","a"))

Inference for Stan model: pcprior.
4 chains, each with iter=2000; warmup=1000; thin=1; 
post-warmup draws per chain=1000, total post-warmup draws=4000.

              mean se_mean   sd  2.5%   25%   50%   75% 97.5% n_eff Rhat
mu           60.40    0.02 0.17 60.00 60.30 60.40 60.50 60.76   128 1.02
sigmaalpha    0.27    0.01 0.17  0.03  0.16  0.24  0.35  0.68   330 1.01
sigmaepsilon  0.36    0.00 0.07  0.25  0.31  0.35  0.40  0.54   521 1.01
a[1]         60.30    0.00 0.14 60.00 60.21 60.30 60.39 60.57  1829 1.00
a[2]         60.19    0.00 0.16 59.88 60.08 60.18 60.29 60.50  1323 1.00
a[3]         60.55    0.00 0.15 60.26 60.45 60.54 60.64 60.83  1308 1.00
a[4]         60.58    0.00 0.15 60.29 60.47 60.58 60.67 60.88  1337 1.00

Samples were drawn using NUTS(diag_e) at Thu Feb 11 10:57:22 2016.
For each parameter, n_eff is a crude measure of effective sample size,
and Rhat is the potential scale reduction factor on split chains (at 
convergence, Rhat=1).

We see the posterior mean, SE and SD of the samples. We see some quantiles from which we could construct a 95% credible interval (for example). The n_eff is a rough measure of the sample size taking into account the correlation in the samples. The effective sample sizes for the three primary parameters is not large although adequate enough for most purposes. In this case, we could easily run the chains for much longer if we were concerned. The Rhat statistic is known as the potential scale reduction factor. Values much greater than one indicate that additional samples would significantly improve the inference. In this case, the factors are all close to one so we feel no inclination to draw more samples.

Posterior Distributions

We can use extract to get at various components of the STAN fit. We plot the posterior densities for the SDs:

postsig <- rstan::extract(fit, pars=c("sigmaalpha","sigmaepsilon"))
ref <- melt(postsig,value.name="bright")
ggplot(data=ref,aes(x=bright, color=L1))+geom_density()+guides(color=guide_legend(title="SD"))

We see that the error SD can be localized much more than the operator SD. We can also look at the operator effects:

opre <- rstan::extract(fit, pars="a")
ref <- melt(opre, value.name="bright")
ref[,2] <- (LETTERS[1:4])[ref[,2]]
ggplot(data=ref,aes(x=bright, color=Var2))+geom_density()+guides(color=guide_legend(title="operator"))

We see that the four operator distributions overlap.

Package version info

sessionInfo()

R version 3.2.3 (2015-12-10)
Platform: x86_64-apple-darwin13.4.0 (64-bit)
Running under: OS X 10.11.3 (El Capitan)

locale:
[1] en_GB.UTF-8/en_GB.UTF-8/en_GB.UTF-8/C/en_GB.UTF-8/en_GB.UTF-8

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
[1] dplyr_0.4.3    reshape2_1.4.1 rstan_2.9.0    ggplot2_2.0.0 

loaded via a namespace (and not attached):
 [1] Rcpp_0.12.3      knitr_1.12.3     magrittr_1.5     munsell_0.4.2    colorspace_1.2-6 R6_2.1.2        
 [7] stringr_1.0.0    plyr_1.8.3       tools_3.2.3      parallel_3.2.3   grid_3.2.3       gtable_0.1.2    
[13] DBI_0.3.1        htmltools_0.3    yaml_2.1.13      digest_0.6.9     assertthat_0.1   gridExtra_2.0.0 
[19] formatR_1.2.1    codetools_0.2-14 inline_0.3.14    evaluate_0.8     rmarkdown_0.9.2  labeling_0.3    
[25] stringi_1.0-1    scales_0.3.0     stats4_3.2.3

STAN for One Way Anova with a random effect using a PC prior

Julian Faraway

11 February 2016