---
title: Does it matter if we use other splits of the data?
author: Julian Faraway
output:
  html_document:
    toc: true
    theme: cosmo
---

```{r global_options, include=FALSE}
library(knitr)
opts_chunk$set(comment=NA, fig.path='/tmp/Figs/', warning=FALSE, message=FALSE,
               quiet=TRUE, progress=FALSE)
```

```{r echo=FALSE}
paste("Created:", date())
```

The main thread of the analysis uses a single split of the data into testing and training. The purpose of this page is to show that the results do not depend on this choice of split. See [featlcdb.Rmd](featlcdb.Rmd) for how the data [feat.rda](feat.rda) was prepared for this analysis.

```{r}
load("feat.rda")
source("funcs.R")
require(MASS, quietly=TRUE)
require(nnet, quietly=TRUE)
require(ggplot2, quietly=TRUE)
require(rpart, quietly=TRUE)
require(rpart.plot, quietly=TRUE)
require(xtable, quietly=TRUE)
require(kernlab, quietly=TRUE)
require(randomForest, quietly=TRUE)
nrep <- 100
```

We will compute `r nrep` random splits of the data (1/3 test and 2/3 training) and recompute the classification rates for both sets of measures.
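The split mechanics used in the loop below can be sketched in isolation. This is a minimal illustration on a made-up data frame (`toydat` is a stand-in for the `cmdb` data used in the analysis):

```{r}
# Sketch of the 1/3 test / 2/3 training split on a toy data frame
set.seed(1)
toydat <- data.frame(x = rnorm(30), y = rnorm(30))
n <- nrow(toydat)
isel <- sample(1:n, round(n/3))   # indices of the test third
trains <- toydat[-isel, ]         # remaining two-thirds for training
tests  <- toydat[isel, ]
c(train = nrow(trains), test = nrow(tests))
```

Each repetition draws a fresh `isel`, so every split assigns a different random third of the rows to the test set.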
```{r}
ourres <- array(NA, c(4,5,nrep))
richres <- array(NA, c(4,5,nrep))
set.seed(123)
```

```{r}
for(irep in 1:nrep){
  n <- nrow(cmdb)
  isel <- sample(1:n, round(n/3))
  trains <- cmdb[-isel,]
  tests <- cmdb[isel,]
  predform <- "shov + maxdiff + dscore + log(totvar) + log(quadvar) + log(famp) + log(fslope) + log(outl) + gscore + lsd + nudlog(gtvar) + rm.amplitude + rm.beyond1std + rm.fpr20 + rm.fpr35 + rm.fpr50 + rm.fpr80 + log(rm.maxslope) + rm.mad + asylog(rm.medbuf) + rm.pairslope + log(rm.peramp) + log(rm.pdfp) + rm.skew + log(rm.kurtosis) + rm.std + dublog(rm.rcorbor)"
  preds <- c("shov","maxdiff","dscore","totvar","quadvar","famp","fslope","outl",
             "gscore","lsd","gtvar","rm.amplitude","rm.beyond1std","rm.fpr20",
             "rm.fpr35","rm.fpr50","rm.fpr80","rm.maxslope","rm.mad","rm.medbuf",
             "rm.pairslope","rm.peramp","rm.pdfp","rm.skew","rm.kurtosis",
             "rm.std","rm.rcorbor")
  tpredform <- paste(preds, collapse="+")
  out <- knit_child("childfeat.Rmd", quiet=TRUE)
  ourres[,,irep] <- cmat
  predform <- "rm.amplitude + rm.beyond1std + rm.fpr20 + rm.fpr35 + rm.fpr50 + rm.fpr80 + log(rm.maxslope) + rm.mad + asylog(rm.medbuf) + rm.pairslope + log(rm.peramp) + log(rm.pdfp) + rm.skew + log(rm.kurtosis) + rm.std + dublog(rm.rcorbor)"
  preds <- c("rm.amplitude","rm.beyond1std","rm.fpr20","rm.fpr35","rm.fpr50",
             "rm.fpr80","rm.maxslope","rm.mad","rm.medbuf","rm.pairslope",
             "rm.peramp","rm.pdfp","rm.skew","rm.kurtosis","rm.std","rm.rcorbor")
  tpredform <- paste(preds, collapse="+")
  out <- knit_child("childfeat.Rmd", quiet=TRUE)
  richres[,,irep] <- cmat
}
```

Here is the mean of the Richards set results (all expressed as percentages):

```{r}
md <- apply(richres, c(1,2), mean)
dimnames(md) <- list(c("All","TranNoTran","Tranonly","Hierarch"),
                     c("LDA","RPart","SVM","NN","Forest"))
round(100*md, 1)
```

Here is the mean result using our measures:

```{r}
md <- apply(ourres, c(1,2), mean)
dimnames(md) <- list(c("All","TranNoTran","Tranonly","Hierarch"),
                     c("LDA","RPart","SVM","NN","Forest"))
round(100*md, 1)
```

Now consider the
mean difference between the classification rates using our set and the Richards set:

```{r}
md <- apply(ourres - richres, c(1,2), mean)
dimnames(md) <- list(c("All","TranNoTran","Tranonly","Hierarch"),
                     c("LDA","RPart","SVM","NN","Forest"))
round(100*md, 1)
```

and here is the SD of the difference:

```{r}
smd <- apply(ourres - richres, c(1,2), sd)
dimnames(smd) <- list(c("All","TranNoTran","Tranonly","Hierarch"),
                     c("LDA","RPart","SVM","NN","Forest"))
round(100*smd, 3)
```

We see that the difference between the two sets of measures varies little across splits. Although the classification rates vary somewhat depending on the split used, this does not change the main conclusion: our measures are clearly better. We could compute additional splits, or use n-fold cross-validation to estimate the classification rates more accurately, but there is no great interest in doing so since we only want to show that our measures are valuable. The classification rate will change when the methods are applied to new data, since we have used an intentionally biased sample (containing more transients) for this demonstration.
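As a rough check that `r nrep` splits are enough to pin down the mean differences, note that the Monte Carlo standard error of each cell's mean is its SD divided by the square root of the number of splits (treating splits as independent, which the overlapping training sets make only approximately true). A sketch with made-up per-split differences (in the analysis above, one cell of `ourres - richres` would play this role):

```{r}
# Monte Carlo standard error of a mean difference over nrep random splits
nrep <- 100
set.seed(2)
diffs <- rnorm(nrep, mean = 0.05, sd = 0.02)  # illustrative per-split differences
se <- sd(diffs)/sqrt(nrep)                    # standard error of the mean
round(c(mean = mean(diffs), se = se), 4)
```

With an SD of a few percentage points and 100 splits, the standard error is an order of magnitude smaller than the mean differences seen above, which is why the conclusion is stable across splits.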