[1] "Created: Wed Sep  3 16:41:03 2014"

The main thread of the analysis uses a single split of the data into testing and training sets. The purpose of this page is to show that the results do not depend on this particular choice of split.

See featlcdb.Rmd for how the data feat.rda was prepared for this analysis.

load("feat.rda")
source("funcs.R")
require(MASS,quietly=TRUE)
require(nnet,quietly=TRUE)
require(ggplot2,quietly=TRUE)
require(rpart,quietly=TRUE)
require(rpart.plot,quietly=TRUE)
require(xtable,quietly=TRUE)
require(kernlab,quietly=TRUE)
require(randomForest,quietly=TRUE)
nrep <- 100

We will compute 100 random splits of the data (1/3 test and 2/3 training) and recompute the classification rates for both sets of measures.

ourres <- array(NA, c(4,5,nrep))
richres <- array(NA, c(4,5,nrep))
set.seed(123)
for(irep in 1:nrep){
    # random split: 1/3 test, 2/3 training
    n <- nrow(cmdb)
    isel <- sample(1:n, round(n/3))
    trains <- cmdb[-isel,]
    tests <- cmdb[isel,]
    # predictor set for our measures (includes the Richards rm.* measures as well)
    predform <- "shov + maxdiff + dscore + log(totvar) + log(quadvar) + log(famp) + log(fslope) + log(outl) + gscore + lsd + nudlog(gtvar) + rm.amplitude + rm.beyond1std + rm.fpr20 + rm.fpr35 + rm.fpr50 + rm.fpr80 + log(rm.maxslope) + rm.mad + asylog(rm.medbuf) + rm.pairslope + log(rm.peramp) + log(rm.pdfp) + rm.skew + log(rm.kurtosis) + rm.std + dublog(rm.rcorbor)"
    preds <- c("shov","maxdiff","dscore","totvar","quadvar","famp","fslope","outl","gscore","lsd","gtvar","rm.amplitude","rm.beyond1std","rm.fpr20","rm.fpr35","rm.fpr50","rm.fpr80","rm.maxslope","rm.mad","rm.medbuf","rm.pairslope","rm.peramp","rm.pdfp","rm.skew","rm.kurtosis","rm.std","rm.rcorbor")
    tpredform <- paste(preds, collapse="+")
    out <- knit_child("childfeat.Rmd", quiet=TRUE)  # child fits the classifiers and leaves the rates in cmat
    ourres[,,irep] <- cmat
    # predictor set for the Richards (rm.*) measures only
    predform <- "rm.amplitude + rm.beyond1std + rm.fpr20 + rm.fpr35 + rm.fpr50 + rm.fpr80 + log(rm.maxslope) + rm.mad + asylog(rm.medbuf) + rm.pairslope + log(rm.peramp) + log(rm.pdfp) + rm.skew + log(rm.kurtosis) + rm.std + dublog(rm.rcorbor)"
    preds <- c("rm.amplitude","rm.beyond1std","rm.fpr20","rm.fpr35","rm.fpr50","rm.fpr80","rm.maxslope","rm.mad","rm.medbuf","rm.pairslope","rm.peramp","rm.pdfp","rm.skew","rm.kurtosis","rm.std","rm.rcorbor")
    tpredform <- paste(preds, collapse="+")
    out <- knit_child("childfeat.Rmd", quiet=TRUE)
    richres[,,irep] <- cmat
}
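
As an aside, apply() retains the dimnames of the margins it keeps, so the row and column labels reassigned before each table below could instead be attached to the result arrays once. A small sketch (labs and meths are just helper names introduced here):

labs <- c("All","TranNoTran","Tranonly","Hierarch")
meths <- c("LDA","RPart","SVM","NN","Forest")
dimnames(ourres) <- dimnames(richres) <- list(labs,meths,NULL)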

Here is the mean of the results for the Richards set (all expressed as percentages):

md <- apply(richres,c(1,2),mean)
dimnames(md) <- list(c("All","TranNoTran","Tranonly","Hierarch"),c("LDA","RPart","SVM","NN","Forest"))
round(100*md,1)
            LDA RPart  SVM   NN Forest
All        57.0  59.6 66.5 63.7   67.4
TranNoTran 74.5  78.5 79.8 75.0   82.2
Tranonly   54.3  56.9 64.9 59.4   64.0
Hierarch   55.8  59.2 64.4 58.2   66.0

Here is the mean of the results using our set of measures:

md <- apply(ourres,c(1,2),mean)
dimnames(md) <- list(c("All","TranNoTran","Tranonly","Hierarch"),c("LDA","RPart","SVM","NN","Forest"))
round(100*md,1)
            LDA RPart  SVM   NN Forest
All        75.4  72.3 80.4 78.8   80.8
TranNoTran 91.4  89.4 92.7 92.2   92.7
Tranonly   67.8  64.4 74.1 70.3   73.4
Hierarch   76.5  72.8 80.7 78.1   80.0

Now consider the mean difference between the classification rates using our set and the Richards set:

md <- apply(ourres-richres,c(1,2),mean)
dimnames(md) <- list(c("All","TranNoTran","Tranonly","Hierarch"),c("LDA","RPart","SVM","NN","Forest"))
round(100*md,1)
            LDA RPart  SVM   NN Forest
All        18.5  12.7 13.9 15.1   13.4
TranNoTran 16.9  10.9 12.9 17.2   10.6
Tranonly   13.5   7.4  9.2 10.9    9.3
Hierarch   20.7  13.7 16.3 19.8   14.0

Here is the SD of the difference across the 100 splits:

smd <- apply(ourres-richres,c(1,2),sd)
dimnames(smd) <- list(c("All","TranNoTran","Tranonly","Hierarch"),c("LDA","RPart","SVM","NN","Forest"))
round(100*smd,3)
             LDA RPart   SVM    NN Forest
All        1.109 1.410 0.984 1.128  1.004
TranNoTran 1.030 1.247 0.978 0.989  1.011
Tranonly   1.741 2.007 1.223 1.597  1.329
Hierarch   1.088 1.791 1.050 1.247  1.104
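
Since each mean above is based on nrep = 100 splits, the standard error of a mean difference is a tenth of the corresponding SD, i.e. around 0.1 to 0.2 percentage points. A quick check, reusing smd and nrep from above:

round(100*smd/sqrt(nrep),3)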

We see that the difference between the two sets of measures varies little across splits: the SDs above are one to two percentage points, so the standard errors are far smaller than the mean differences of roughly 7 to 21 points. Although the classification rates themselves vary somewhat with the split used, this does not change the main conclusion that our measures are clearly better. We could compute additional splits, or use n-fold cross-validation to estimate the classification rates more accurately, but there is little to gain from doing so since we only wish to show that our measures are valuable. In any case, the classification rates will change when applied to new data, because we have used an intentionally biased sample (containing a higher proportion of transients) for this demonstration.
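
For reference, here is a minimal sketch of how such a k-fold cross-validation loop could be set up in base R. The fold count k = 10, the folds vector and the loop variable are illustrative choices, not part of the original analysis; childfeat.Rmd would be reknit for each fold just as in the splitting loop above.

k <- 10
# assign each case to one of k near-equal, random folds
folds <- sample(rep(1:k, length.out=nrow(cmdb)))
for(i in 1:k){
    trains <- cmdb[folds != i,]  # train on the other k-1 folds
    tests <- cmdb[folds == i,]   # test on the held-out fold
    # set predform/preds, knit childfeat.Rmd and collect cmat as above,
    # then average the rates over the k folds
}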