[1] "Created: Wed Sep 3 16:41:03 2014"
The main thread of the analysis uses a single split of the data into training and testing sets. The purpose of this page is to show that the results do not depend on this choice of split.
See featlcdb.Rmd for how the data feat.rda was prepared for this analysis.
load("feat.rda")
source("funcs.R")
require(MASS,quietly=TRUE)
require(nnet,quietly=TRUE)
require(ggplot2,quietly=TRUE)
require(rpart,quietly=TRUE)
require(rpart.plot,quietly=TRUE)
require(xtable,quietly=TRUE)
require(kernlab,quietly=TRUE)
require(randomForest,quietly=TRUE)
nrep <- 100
We will compute 100 random splits of the data (1/3 test and 2/3 training) and recompute the classification rates for both sets of measures.
ourres <- array(NA, c(4,5,nrep))
richres <- array(NA, c(4,5,nrep))
set.seed(123)
for(irep in 1:nrep){
n <- nrow(cmdb)
isel <- sample(1:n,round(n/3))
trains <- cmdb[-isel,]
tests <- cmdb[isel,]
predform <- "shov + maxdiff + dscore + log(totvar) + log(quadvar) + log(famp) + log(fslope) + log(outl) + gscore + lsd + nudlog(gtvar) + rm.amplitude + rm.beyond1std + rm.fpr20 +rm.fpr35 + rm.fpr50 + rm.fpr80 + log(rm.maxslope) + rm.mad +asylog(rm.medbuf) + rm.pairslope + log(rm.peramp) + log(rm.pdfp) + rm.skew + log(rm.kurtosis)+ rm.std + dublog(rm.rcorbor)"
preds <- c("shov","maxdiff","dscore","totvar","quadvar","famp","fslope","outl","gscore","lsd","gtvar","rm.amplitude","rm.beyond1std","rm.fpr20","rm.fpr35","rm.fpr50","rm.fpr80","rm.maxslope","rm.mad","rm.medbuf","rm.pairslope","rm.peramp","rm.pdfp","rm.skew","rm.kurtosis","rm.std","rm.rcorbor")
tpredform <- paste(preds,collapse="+")
out <- knit_child("childfeat.Rmd",quiet=TRUE)
ourres[,,irep] <- cmat
predform <- "rm.amplitude + rm.beyond1std + rm.fpr20 +rm.fpr35 + rm.fpr50 + rm.fpr80 + log(rm.maxslope) + rm.mad +asylog(rm.medbuf) + rm.pairslope + log(rm.peramp) + log(rm.pdfp) + rm.skew + log(rm.kurtosis)+ rm.std + dublog(rm.rcorbor)"
preds <- c("rm.amplitude","rm.beyond1std","rm.fpr20","rm.fpr35","rm.fpr50","rm.fpr80","rm.maxslope","rm.mad","rm.medbuf","rm.pairslope","rm.peramp","rm.pdfp","rm.skew","rm.kurtosis","rm.std","rm.rcorbor")
tpredform <- paste(preds,collapse="+")
out <- knit_child("childfeat.Rmd",quiet=TRUE)
richres[,,irep] <- cmat
}
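The child document childfeat.Rmd (not shown here) fits the five classifiers on the training set and leaves the matrix of classification rates in cmat. As a minimal sketch of the kind of computation it performs, here is one cell of that matrix; the response column name type is an assumption for illustration only, not taken from childfeat.Rmd:
## Sketch only: one classifier, one classification rate. The response
## column `type` is a hypothetical stand-in for the actual label column.
sketch_rate <- function(trains, tests, tpredform){
  fmla <- as.formula(paste("type ~", tpredform))
  fit <- randomForest(fmla, data = trains)   # one of the five classifiers
  mean(predict(fit, tests) == tests$type)    # overall ("All") rate
}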
Here is the mean of the results for the Richards set, with all rates expressed as percentages:
md <- apply(richres,c(1,2),mean)
dimnames(md) <- list(c("All","TranNoTran","Tranonly","Hierarch"),c("LDA","RPart","SVM","NN","Forest"))
round(100*md,1)
LDA RPart SVM NN Forest
All 57.0 59.6 66.5 63.7 67.4
TranNoTran 74.5 78.5 79.8 75.0 82.2
Tranonly 54.3 56.9 64.9 59.4 64.0
Hierarch 55.8 59.2 64.4 58.2 66.0
Here are the mean results using our set of measures:
md <- apply(ourres,c(1,2),mean)
dimnames(md) <- list(c("All","TranNoTran","Tranonly","Hierarch"),c("LDA","RPart","SVM","NN","Forest"))
round(100*md,1)
LDA RPart SVM NN Forest
All 75.4 72.3 80.4 78.8 80.8
TranNoTran 91.4 89.4 92.7 92.2 92.7
Tranonly 67.8 64.4 74.1 70.3 73.4
Hierarch 76.5 72.8 80.7 78.1 80.0
Now consider the mean difference between the classification rates using our set and the Richards set:
md <- apply(ourres-richres,c(1,2),mean)
dimnames(md) <- list(c("All","TranNoTran","Tranonly","Hierarch"),c("LDA","RPart","SVM","NN","Forest"))
round(100*md,1)
LDA RPart SVM NN Forest
All 18.5 12.7 13.9 15.1 13.4
TranNoTran 16.9 10.9 12.9 17.2 10.6
Tranonly 13.5 7.4 9.2 10.9 9.3
Hierarch 20.7 13.7 16.3 19.8 14.0
and here is the SD of the differences across the random splits:
smd <- apply(ourres-richres,c(1,2),sd)
dimnames(smd) <- list(c("All","TranNoTran","Tranonly","Hierarch"),c("LDA","RPart","SVM","NN","Forest"))
round(100*smd,3)
LDA RPart SVM NN Forest
All 1.109 1.410 0.984 1.128 1.004
TranNoTran 1.030 1.247 0.978 0.989 1.011
Tranonly 1.741 2.007 1.223 1.597 1.329
Hierarch 1.088 1.791 1.050 1.247 1.104
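These SDs describe the split-to-split variability of a single difference; the Monte Carlo standard error of the mean differences reported above is smaller by a factor of sqrt(nrep). A quick check of that standard error, reusing the smd just computed:
round(100*smd/sqrt(nrep),3)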
We see that the difference between the two sets of measures varies little from split to split: the SDs of the differences (one to two percentage points) are small relative to the mean differences themselves (roughly 7 to 21 points). Although the classification rates vary somewhat with the split used, this does not change the main conclusion that our measures are clearly better. We could compute additional splits, or use n-fold crossvalidation for a more accurate estimate of the classification rates, but there is little to gain from this since we only want to show that our measures are valuable. The classification rates will also change when the methods are applied to new data, since we have used an intentionally biased sample (containing more transients) for this demonstration.
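For completeness, here is a minimal sketch of the n-fold crossvalidation mentioned above, using tenfold CV with a single classifier and the last tpredform set in the loop; the response column type is again a hypothetical stand-in for the actual label column, and the full analysis would run all five classifiers via childfeat.Rmd:
## Sketch only: tenfold crossvalidation with one classifier.
k <- 10
folds <- sample(rep(1:k, length.out = nrow(cmdb)))   # random fold assignment
cvrate <- sapply(1:k, function(i){
  trains <- cmdb[folds != i,]
  tests <- cmdb[folds == i,]
  fit <- randomForest(as.formula(paste("type ~", tpredform)), data = trains)
  mean(predict(fit, tests) == tests$type)
})
round(100*mean(cvrate),1)   # CV estimate of the classification rate (%)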