[1] "Created: Wed Apr 1 16:39:44 2015"
See the introduction for an overview.
library(lme4)
library(ggplot2)
options(digits=5,show.signif.stars=FALSE)
Load in and summarize the data:
data(psid, package="faraway")
head(psid)
age educ sex income year person
1 31 12 M 6000 68 1
2 31 12 M 5300 69 1
3 31 12 M 5200 70 1
4 31 12 M 6900 71 1
5 31 12 M 7500 72 1
6 31 12 M 8000 73 1
summary(psid)
age educ sex income year
Min. :25.0 Min. : 3.0 F:732 Min. : 3 Min. :68.0
1st Qu.:28.0 1st Qu.:10.0 M:929 1st Qu.: 4300 1st Qu.:73.0
Median :34.0 Median :12.0 Median : 9000 Median :78.0
Mean :32.2 Mean :11.8 Mean : 13575 Mean :78.6
3rd Qu.:36.0 3rd Qu.:13.0 3rd Qu.: 18050 3rd Qu.:84.0
Max. :39.0 Max. :16.0 Max. :180000 Max. :90.0
person
Min. : 1.0
1st Qu.:20.0
Median :42.0
Mean :42.4
3rd Qu.:63.0
Max. :85.0
Construct some plots:
library(dplyr)
psid20 <- filter(psid, person <= 20)
ggplot(psid20, aes(x=year, y=income))+geom_line()+facet_wrap(~ person)
ggplot(psid20, aes(x=year, y=income+100, group=person)) +geom_line()+facet_wrap(~ sex)+scale_y_log10()
Fit the textbook model:
psid$cyear <- psid$year-78
mmod <- lmer(log(income) ~ cyear*sex +age+educ+(cyear|person),psid)
print(summary(mmod), corr=FALSE)
Linear mixed model fit by REML ['lmerMod']
Formula: log(income) ~ cyear * sex + age + educ + (cyear | person)
Data: psid
REML criterion at convergence: 3819.8
Scaled residuals:
Min 1Q Median 3Q Max
-10.231 -0.213 0.079 0.415 2.825
Random effects:
Groups Name Variance Std.Dev. Corr
person (Intercept) 0.2817 0.531
cyear 0.0024 0.049 0.19
Residual 0.4673 0.684
Number of obs: 1661, groups: person, 85
Fixed effects:
Estimate Std. Error t value
(Intercept) 6.6742 0.5433 12.28
cyear 0.0853 0.0090 9.48
sexM 1.1503 0.1213 9.48
age 0.0109 0.0135 0.81
educ 0.1042 0.0214 4.86
cyear:sexM -0.0263 0.0122 -2.15
anova(mmod)
Analysis of Variance Table
Df Sum Sq Mean Sq F value
cyear 1 61.6 61.6 131.76
sex 1 45.8 45.8 97.92
age 1 0.0 0.0 0.00
educ 1 10.9 10.9 23.41
cyear:sex 1 2.2 2.2 4.62
As expected, we get parameter estimates but no p-values or other inferential output.
Compute parametric bootstrap confidence intervals:
confint(mmod, method="boot")
2.5 % 97.5 %
sd_(Intercept)|person 0.448599 0.6177670
cor_cyear.(Intercept)|person -0.073465 0.4363037
sd_cyear|person 0.039640 0.0583016
sigma 0.660695 0.7104605
(Intercept) 5.585035 7.7072513
cyear 0.068175 0.1032825
sexM 0.920625 1.3845276
age -0.015304 0.0378592
educ 0.062602 0.1491223
cyear:sexM -0.050078 -0.0037928
The confidence interval for the correlation between the slope and intercept random effects contains zero, so there is some indication that this correlation is not needed. The interval for the year-by-sex interaction only just excludes zero, so the evidence for this interaction is not strong.
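For illustration (this refit is not part of the original analysis), we could drop that correlation by giving the intercept and slope independent random effects and then inspect the estimated variance components:
# Sketch: refit with uncorrelated intercept and slope random effects
mmodi <- lmer(log(income) ~ cyear*sex + age + educ + (1|person) + (0+cyear|person), psid)
VarCorr(mmodi)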
We can test the year-by-sex interaction term in the model using the Kenward-Roger (KR) adjustment:
library(pbkrtest)
psid$cyear <- psid$year-78
mmod <- lmer(log(income) ~ cyear*sex +age+educ+(cyear|person),psid, REML=FALSE)
mmodr <- lmer(log(income) ~ cyear + sex +age+educ+(cyear|person),psid, REML=FALSE)
KRmodcomp(mmod,mmodr)
F-test with Kenward-Roger approximation; computing time: 0.30 sec.
large : log(income) ~ cyear + sex + age + educ + (cyear | person) + cyear:sex
small : log(income) ~ cyear + sex + age + educ + (cyear | person)
stat ndf ddf F.scaling p.value
Ftest 4.61 1.00 81.33 1 0.035
The p-value is very close to that obtained from the t-test of this term in the textbook output, although it differs slightly from the bootstrap confidence interval result.
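As a cross-check on the KR approximation, pbkrtest can also compare the same nested models with a parametric bootstrap; a minimal sketch (nsim=500 is an arbitrary illustrative choice):
# Sketch: parametric bootstrap test of the cyear:sex interaction;
# more simulations would give a more precise p-value
set.seed(123)
PBmodcomp(mmod, mmodr, nsim=500)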
Using RLRsim is not possible here because the model has multiple random effects per group. Instead, we can implement the parametric bootstrap directly to test whether there is any variation in the slope random effects.
First we compute the observed likelihood ratio test (LRT) statistic:
mmod <- lmer(log(income) ~ cyear*sex +age+educ+(cyear|person),psid, REML=FALSE)
rmod <- lmer(log(income) ~ cyear*sex +age+educ+(1|person),psid, REML=FALSE)
(olrt <- 2*(logLik(mmod)-logLik(rmod)))
'log Lik.' 142.3 (df=10)
Now we resample 1000 times. Notice that the simulated response is already on the log scale, so we should not take logs again:
lrstat <- numeric(1000)
for(i in 1:1000){
  # simulate a new response from the null (random intercept only) model
  rincome <- unlist(simulate(rmod))
  # refit both models to the simulated response and store the LRT statistic
  bmod <- refit(mmod,rincome)
  smod <- refit(rmod,rincome)
  lrstat[i] <- 2*(logLik(bmod)-logLik(smod))
}
mean(lrstat > olrt)
[1] 0
None of the 1000 simulated LRT statistics exceeded the observed value, so there is strong evidence of variation in the slopes.
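When the Monte Carlo p-value comes out exactly zero, a common adjustment is to add one to the numerator and denominator, which gives an upper bound rather than a p-value of zero; a minimal sketch:
# Sketch: add-one correction so the estimated p-value is never exactly zero
(sum(lrstat >= olrt) + 1)/(length(lrstat) + 1)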
We can also obtain approximate t- and F-tests using the Satterthwaite method in lmerTest:
library(lmerTest)
mmod <- lmer(log(income) ~ cyear*sex +age+educ+(cyear|person),psid)
print(summary(mmod), corr=FALSE)
Linear mixed model fit by REML t-tests use Satterthwaite approximations
to degrees of freedom [merModLmerTest]
Formula: log(income) ~ cyear * sex + age + educ + (cyear | person)
Data: psid
REML criterion at convergence: 3819.8
Scaled residuals:
Min 1Q Median 3Q Max
-10.231 -0.213 0.079 0.415 2.825
Random effects:
Groups Name Variance Std.Dev. Corr
person (Intercept) 0.2817 0.531
cyear 0.0024 0.049 0.19
Residual 0.4673 0.684
Number of obs: 1661, groups: person, 85
Fixed effects:
Estimate Std. Error df t value Pr(>|t|)
(Intercept) 6.6742 0.5433 81.2000 12.28 < 2e-16
cyear 0.0853 0.0090 78.9000 9.48 1.2e-14
sexM 1.1503 0.1213 81.8000 9.48 8.0e-15
age 0.0109 0.0135 80.8000 0.81 0.421
educ 0.1042 0.0214 80.7000 4.86 5.7e-06
cyear:sexM -0.0263 0.0122 78.0000 -2.15 0.035
anova(mmod)
Analysis of Variance Table of type 3 with Satterthwaite
approximation for degrees of freedom
Sum Sq Mean Sq NumDF DenDF F.value Pr(>F)
cyear 65.0 65.0 1 78.0 139.1 < 2e-16
sex 42.0 42.0 1 81.8 89.9 8.0e-15
age 0.3 0.3 1 80.8 0.7 0.421
educ 11.0 11.0 1 80.7 23.6 5.7e-06
cyear:sex 2.2 2.2 1 78.0 4.6 0.035
We can also test the random effects:
rand(mmod)
Analysis of Random effects Table:
Chi.sq Chi.DF p.value
cyear:person 145 2 <2e-16
However, we prefer the parametric bootstrap result because the asymptotic distribution of the LRT is unreliable when variance parameters are tested on the boundary of the parameter space, as they are here.
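For reference, a commonly used asymptotic correction for this boundary problem takes the null distribution of the LRT for dropping the random slope (and its correlation with the intercept) to be an equal mixture of chi-squared distributions with 1 and 2 degrees of freedom; a minimal sketch using the observed statistic computed above:
# Sketch: boundary-corrected asymptotic p-value (standard mixture
# approximation, not part of the original analysis)
0.5*pchisq(as.numeric(olrt), 1, lower.tail=FALSE) + 0.5*pchisq(as.numeric(olrt), 2, lower.tail=FALSE)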
We now fit a Bayesian version of the model using MCMCglmm. We use the default priors with an unstructured covariance matrix for the random effects, consisting of the intercept and slope for each person:
library(MCMCglmm)
library(lattice)  # provides xyplot for the MCMC diagnostic plots
bmod <- MCMCglmm(log(income) ~ cyear*sex+age+educ, random=~us(1+cyear):person, data=psid, verbose=FALSE)
xyplot(bmod$VCV)
The diagnostics appear satisfactory. Notice that the covariance term between slope and intercept appears twice (identically). We can investigate this correlation:
yrpercor <- bmod$VCV[,2]/sqrt(bmod$VCV[,1]*bmod$VCV[,4])
ggplot(data.frame(corr=yrpercor), aes(x=corr)) + geom_density() + xlab("Correlation")
2*mean(yrpercor < 0)
[1] 0.15
It would not be unreasonable to drop the correlation, allowing a simplified model:
bmod <- MCMCglmm(log(income) ~ cyear*sex+age+educ, random=~idh(1+cyear):person, data=psid, verbose=FALSE)
xyplot(bmod$VCV)
Again the diagnostics appear satisfactory. Check the summary output:
summary(bmod)
Iterations = 3001:12991
Thinning interval = 10
Sample size = 1000
DIC: 3595.4
G-structure: ~idh(1 + cyear):person
post.mean l-95% CI u-95% CI eff.samp
(Intercept).person 0.28924 0.18870 0.3831 1000
cyear.person 0.00249 0.00164 0.0037 1195
R-structure: ~units
post.mean l-95% CI u-95% CI eff.samp
units 0.468 0.434 0.499 1000
Location effects: log(income) ~ cyear * sex + age + educ
post.mean l-95% CI u-95% CI eff.samp pMCMC
(Intercept) 6.633254 5.558990 7.806761 1182 <0.001
cyear 0.085062 0.067315 0.104848 1000 <0.001
sexM 1.153738 0.933499 1.410487 909 <0.001
age 0.010706 -0.015798 0.041155 1148 0.428
educ 0.108139 0.066923 0.150871 1248 <0.001
cyear:sexM -0.025771 -0.049605 -0.000359 1000 0.036
We see that the age term may not be important, but the other fixed-effect terms make stronger contributions. Both random-effect variances appear appreciable. We can check the posteriors, first for the intercept variation:
ref <- data.frame(bmod$VCV)
colnames(ref) <- c("intercept","slope","error")
ggplot(ref, aes(x=intercept))+geom_density()
and now for the slope:
ggplot(ref, aes(x=slope))+geom_density()
and finally for the error term:
ggplot(ref, aes(x=error))+geom_density()
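If numerical interval summaries are preferred to density plots, highest posterior density intervals can be read off directly with coda's HPDinterval; a minimal sketch:
# Sketch: 95% HPD intervals for the variance components and fixed effects
HPDinterval(bmod$VCV)
HPDinterval(bmod$Sol)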
Check the versions of R and the packages used:
sessionInfo()
R version 3.1.1 (2014-07-10)
Platform: x86_64-apple-darwin13.1.0 (64-bit)
locale:
[1] en_GB.UTF-8/en_GB.UTF-8/en_GB.UTF-8/C/en_GB.UTF-8/en_GB.UTF-8
attached base packages:
[1] stats graphics grDevices utils datasets methods base
other attached packages:
[1] MCMCglmm_2.21 ape_3.1-4 coda_0.16-1 lattice_0.20-29
[5] lmerTest_2.0-20 pbkrtest_0.4-1 MASS_7.3-33 dplyr_0.4.1
[9] ggplot2_0.9.3.1 lme4_1.1-7 Rcpp_0.11.3 Matrix_1.1-4
[13] knitr_1.6 rmarkdown_0.2.68
loaded via a namespace (and not attached):
[1] assertthat_0.1 bitops_1.0-6 boot_1.3-11
[4] caTools_1.16 cluster_1.15.2 colorspace_1.2-4
[7] corpcor_1.6.7 DBI_0.3.1 dichromat_2.0-0
[10] digest_0.6.4 evaluate_0.5.3 formatR_0.10
[13] Formula_1.1-1 gdata_2.13.3 gplots_2.13.0
[16] grid_3.1.1 gtable_0.1.2 gtools_3.3.1
[19] Hmisc_3.14-3 htmltools_0.2.4 KernSmooth_2.23-12
[22] labeling_0.2 latticeExtra_0.6-26 lazyeval_0.1.10
[25] magrittr_1.0.1 minqa_1.2.3 munsell_0.4.2
[28] nlme_3.1-117 nloptr_1.0.4 numDeriv_2012.9-1
[31] parallel_3.1.1 plyr_1.8.1 proto_0.3-10
[34] RColorBrewer_1.0-5 reshape2_1.2.2 scales_0.2.3
[37] splines_3.1.1 stringr_0.6.2 survival_2.37-7
[40] tensorA_0.36 tools_3.1.1 yaml_2.1.11