---
title: Does it matter if we use other splits of the data?
author: Julian Faraway
output:
  html_document:
    toc: true
    theme: cosmo
---

```{r global_options, include=FALSE}
library(knitr)
opts_chunk$set(comment=NA, fig.path='/tmp/Figs/', warning=FALSE, message=FALSE,
               quiet=TRUE, progress=FALSE)
```

```{r echo=FALSE}
paste("Created:", date())
```

The main thread of the analysis uses a single split of the data into testing and training. The purpose of this page is to show that the results do not depend on this choice of split. See [featlcdb.Rmd](featlcdb.Rmd) for how the data [feat.rda](feat.rda) was prepared for this analysis.

```{r}
load("feat.rda")
source("funcs.R")
require(MASS, quietly=TRUE)
require(nnet, quietly=TRUE)
require(ggplot2, quietly=TRUE)
require(rpart, quietly=TRUE)
require(rpart.plot, quietly=TRUE)
require(xtable, quietly=TRUE)
require(kernlab, quietly=TRUE)
require(randomForest, quietly=TRUE)
nrep <- 100
```

We will compute `r nrep` random splits of the data (1/3 test and 2/3 training) and recompute the classification rates for both sets of measures.
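The split mechanics used in the loop below can be sketched in isolation. This is a minimal illustration on a made-up data frame (`toydat` is a stand-in for the `cmdb` data used in the analysis):

```{r}
# Sketch of the 1/3 test / 2/3 training split on a toy data frame
set.seed(1)
toydat <- data.frame(x = rnorm(30), y = rnorm(30))
n <- nrow(toydat)
isel <- sample(1:n, round(n/3))   # indices of the test third
trains <- toydat[-isel, ]         # remaining two-thirds for training
tests  <- toydat[isel, ]
c(train = nrow(trains), test = nrow(tests))
```

Each repetition draws a fresh `isel`, so every split assigns a different random third of the rows to the test set.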
```{r}
ourres <- array(NA, c(4,5,nrep))
richres <- array(NA, c(4,5,nrep))
set.seed(123)
```

```{r}
for(irep in 1:nrep){
  n <- nrow(cmdb)
  isel <- sample(1:n, round(n/3))
  trains <- cmdb[-isel,]
  tests <- cmdb[isel,]
  predform <- "shov + maxdiff + dscore + log(totvar) + log(quadvar) + log(famp) + log(fslope) + log(outl) + gscore + lsd + nudlog(gtvar) + rm.amplitude + rm.beyond1std + rm.fpr20 + rm.fpr35 + rm.fpr50 + rm.fpr80 + log(rm.maxslope) + rm.mad + asylog(rm.medbuf) + rm.pairslope + log(rm.peramp) + log(rm.pdfp) + rm.skew + log(rm.kurtosis) + rm.std + dublog(rm.rcorbor)"
  preds <- c("shov","maxdiff","dscore","totvar","quadvar","famp","fslope","outl",
             "gscore","lsd","gtvar","rm.amplitude","rm.beyond1std","rm.fpr20",
             "rm.fpr35","rm.fpr50","rm.fpr80","rm.maxslope","rm.mad","rm.medbuf",
             "rm.pairslope","rm.peramp","rm.pdfp","rm.skew","rm.kurtosis",
             "rm.std","rm.rcorbor")
  tpredform <- paste(preds, collapse="+")
  out <- knit_child("childfeat.Rmd", quiet=TRUE)
  ourres[,,irep] <- cmat
  predform <- "rm.amplitude + rm.beyond1std + rm.fpr20 + rm.fpr35 + rm.fpr50 + rm.fpr80 + log(rm.maxslope) + rm.mad + asylog(rm.medbuf) + rm.pairslope + log(rm.peramp) + log(rm.pdfp) + rm.skew + log(rm.kurtosis) + rm.std + dublog(rm.rcorbor)"
  preds <- c("rm.amplitude","rm.beyond1std","rm.fpr20","rm.fpr35","rm.fpr50",
             "rm.fpr80","rm.maxslope","rm.mad","rm.medbuf","rm.pairslope",
             "rm.peramp","rm.pdfp","rm.skew","rm.kurtosis","rm.std","rm.rcorbor")
  tpredform <- paste(preds, collapse="+")
  out <- knit_child("childfeat.Rmd", quiet=TRUE)
  richres[,,irep] <- cmat
}
```

Here is the mean of the Richards set results (all expressed as percentages):

```{r}
md <- apply(richres, c(1,2), mean)
dimnames(md) <- list(c("All","TranNoTran","Tranonly","Hierarch"),
                     c("LDA","RPart","SVM","NN","Forest"))
round(100*md, 1)
```

Here is the mean result using our measures:

```{r}
md <- apply(ourres, c(1,2), mean)
dimnames(md) <- list(c("All","TranNoTran","Tranonly","Hierarch"),
                     c("LDA","RPart","SVM","NN","Forest"))
round(100*md, 1)
```

Now consider the
mean difference between the classification rates using our set and the Richards set:

```{r}
md <- apply(ourres - richres, c(1,2), mean)
dimnames(md) <- list(c("All","TranNoTran","Tranonly","Hierarch"),
                     c("LDA","RPart","SVM","NN","Forest"))
round(100*md, 1)
```

and here is the SD of the difference:

```{r}
smd <- apply(ourres - richres, c(1,2), sd)
dimnames(smd) <- list(c("All","TranNoTran","Tranonly","Hierarch"),
                     c("LDA","RPart","SVM","NN","Forest"))
round(100*smd, 3)
```

We see that the difference between the two sets of measures varies little across splits. Although the classification rates vary somewhat depending on the split used, this does not change the main conclusion: our measures are clearly better. We could compute additional splits, or use n-fold cross-validation to estimate the classification rates more accurately, but there is no great interest in doing so since we only want to show that our measures are valuable. The classification rate will change when the methods are applied to new data, since we have used an intentionally biased sample (containing more transients) for this demonstration.
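As a rough check that `r nrep` splits are enough to pin down the mean differences, note that the Monte Carlo standard error of each cell's mean is its SD divided by the square root of the number of splits (treating splits as independent, which the overlapping training sets make only approximately true). A sketch with made-up per-split differences (in the analysis above, one cell of `ourres - richres` would play this role):

```{r}
# Monte Carlo standard error of a mean difference over nrep random splits
nrep <- 100
set.seed(2)
diffs <- rnorm(nrep, mean = 0.05, sd = 0.02)  # illustrative per-split differences
se <- sd(diffs)/sqrt(nrep)                    # standard error of the mean
round(c(mean = mean(diffs), se = se), 4)
```

With an SD of a few percentage points and 100 splits, the standard error is an order of magnitude smaller than the mean differences seen above, which is why the conclusion is stable across splits.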