`[1] "Created: Thu Sep 4 11:00:09 2014"`

See compfeat.html for how the data featbig.rda was prepared for this analysis.

```
load("featbig.rda")
source("funcs.R")
require(MASS)
require(nnet)
require(ggplot2)
require(rpart)
require(rpart.plot)
require(xtable)
require(kernlab)
require(randomForest)
```

Randomly divide available data into two equally sized sets:

```
set.seed(123)
n <- nrow(cmdbig)
isel <- sample(1:n,round(n/2))
trainset <- cmdbig[-isel,]
testset <- cmdbig[isel,]
cmat <- rep(NA,5)
```

Set up the predictors that we will use. Note that I have transformed some of the variables as seen in the setup.

```
predform <- "rm.amplitude + rm.beyond1std + rm.fpr20 +rm.fpr35 + rm.fpr50 + rm.fpr80 + log(rm.maxslope) + rm.mad +asylog(rm.medbuf) + rm.pairslope + log(rm.peramp) + log(rm.pdfp) + rm.skew + log(rm.kurtosis)+ rm.std + dublog(rm.rcorbor)"
preds <- c("rm.amplitude","rm.beyond1std","rm.fpr20","rm.fpr35","rm.fpr50","rm.fpr80","rm.maxslope","rm.mad","rm.medbuf","rm.pairslope","rm.peramp","rm.pdfp","rm.skew","rm.kurtosis","rm.std","rm.rcorbor")
tpredform <- paste(preds,collapse="+")
trform <- as.formula(paste("type ~",predform))
```

Linear Discriminant analysis using the default options.

We produce the cross-classification between predicted and observed class. Note that the default priors are the proportions found in the training set.

```
ldamod <- lda(formula=trform , data=trainset)
pv <- predict(ldamod, testset)
cm <- xtabs( ~ pv$class + testset$type)
```

This table shows the predicted type in the rows by the actual type in the columns.

`print(xtable(cm,digits=0,caption="Actual"),type="html",caption.placement="top")`

nv | tr | |
---|---|---|

nv | 20664 | 7729 |

tr | 4496 | 17111 |

Same as above but now expressed a percentage within each column:

`print(xtable(round(100*prop.table(cm, 2)),digits=0,caption="Actual"),type="html",caption.placement="top")`

nv | tr | |
---|---|---|

nv | 82 | 31 |

tr | 18 | 69 |

The overall classification rate is 0.7555.

```
roz <- rpart(trform ,data=trainset)
rpart.plot(roz,type=1,extra=1)
```

```
pv <- predict(roz,newdata=testset,type="class")
cm <- xtabs( ~ pv + testset$type)
```

This table shows the predicted type in the rows by the actual type in the columns.

`print(xtable(cm,digits=0,caption="Actual"),type="html",caption.placement="top")`

nv | tr | |
---|---|---|

nv | 20791 | 5920 |

tr | 4369 | 18920 |

Same as above but now expressed a percentage within each column:

`print(xtable(round(100*prop.table(cm, 2)),digits=0,caption="Actual"),type="html",caption.placement="top")`

nv | tr | |
---|---|---|

nv | 83 | 24 |

tr | 17 | 76 |

The overall classification rate is 0.7942.

Use the default choice of setting from the *kernlab* R package for this:

`svmod <- ksvm(trform, data=trainset)`

`Using automatic sigma estimation (sigest) for RBF or laplace kernel `

```
pv <- predict(svmod, testset)
cm <- xtabs( ~ pv + testset$type)
```

This table shows the predicted type in the rows by the actual type in the columns.

`print(xtable(cm,digits=0,caption="Actual"),type="html",caption.placement="top")`

nv | tr | |
---|---|---|

nv | 22307 | 4506 |

tr | 2853 | 20334 |

Same as above but now expressed a percentage within each column:

`print(xtable(round(100*prop.table(cm, 2)),digits=0,caption="Actual"),type="html",caption.placement="top")`

nv | tr | |
---|---|---|

nv | 89 | 18 |

tr | 11 | 82 |

The overall classification rate is 0.8528.

Use the multinom() function from the *nnet* R package. Might work better with some scaling.

```
svmod <- multinom(trform, data=trainset, trace=FALSE, maxit=1000, decay=5e-4)
pv <- predict(svmod, testset)
cm <- xtabs( ~ pv + testset$type)
```

This table shows the predicted type in the rows by the actual type in the columns.

`print(xtable(cm,digits=0,caption="Actual"),type="html",caption.placement="top")`

nv | tr | |
---|---|---|

nv | 20577 | 7012 |

tr | 4583 | 17828 |

Same as above but now expressed a percentage within each column:

`print(xtable(round(100*prop.table(cm, 2)),digits=0,caption="Actual"),type="html",caption.placement="top")`

nv | tr | |
---|---|---|

nv | 82 | 28 |

tr | 18 | 72 |

The overall classification rate is 0.7681.

Use the *randomForest* package with the default settings:

```
tallform <- as.formula(paste("type ~",tpredform))
fmod <- randomForest(tallform, data=na.omit(trainset))
pv <- predict(fmod, newdata=na.omit(testset))
cm <- xtabs( ~ pv + na.omit(testset)$type)
```

This table shows the predicted type in the rows by the actual type in the columns.

`print(xtable(cm,digits=0,caption="Actual"),type="html",caption.placement="top")`

nv | tr | |
---|---|---|

nv | 21684 | 3785 |

tr | 3476 | 21055 |

Same as above but now expressed a percentage within each column:

`print(xtable(round(100*prop.table(cm, 2)),digits=0,caption="Actual"),type="html",caption.placement="top")`

nv | tr | |
---|---|---|

nv | 86 | 15 |

tr | 14 | 85 |

The overall classification rate is 0.8548.

Percentage correctly classified:

```
names(cmat) <- c("LDA","RP","SVM","NN","RF")
round(cmat*100,1)
```

```
LDA RP SVM NN RF
75.5 79.4 85.3 76.8 85.5
```

```
predform <- "shov + maxdiff + dscore + log(totvar) + log(quadvar) + log(famp) + log(fslope) + log(outl) + gscore + lsd + nudlog(gtvar) + rm.amplitude + rm.beyond1std + rm.fpr20 +rm.fpr35 + rm.fpr50 + rm.fpr80 + log(rm.maxslope) + rm.mad +asylog(rm.medbuf) + rm.pairslope + log(rm.peramp) + log(rm.pdfp) + rm.skew + log(rm.kurtosis)+ rm.std + dublog(rm.rcorbor)"
preds <- c("shov","maxdiff","dscore","totvar","quadvar","famp","fslope","outl","gscore","lsd","gtvar","rm.amplitude","rm.beyond1std","rm.fpr20","rm.fpr35","rm.fpr50","rm.fpr80","rm.maxslope","rm.mad","rm.medbuf","rm.pairslope","rm.peramp","rm.pdfp","rm.skew","rm.kurtosis","rm.std","rm.rcorbor")
tpredform <- paste(preds,collapse="+")
trform <- as.formula(paste("type ~",predform))
```

Linear Discriminant analysis using the default options.

We produce the cross-classification between predicted and observed class. Note that the default priors are the proportions found in the training set.

```
ldamod <- lda(formula=trform , data=trainset)
pv <- predict(ldamod, testset)
cm <- xtabs( ~ pv$class + testset$type)
```

This table shows the predicted type in the rows by the actual type in the columns.

`print(xtable(cm,digits=0,caption="Actual"),type="html",caption.placement="top")`

nv | tr | |
---|---|---|

nv | 24747 | 834 |

tr | 413 | 24006 |

Same as above but now expressed a percentage within each column:

`print(xtable(round(100*prop.table(cm, 2)),digits=0,caption="Actual"),type="html",caption.placement="top")`

nv | tr | |
---|---|---|

nv | 98 | 3 |

tr | 2 | 97 |

The overall classification rate is 0.9751.

```
roz <- rpart(trform ,data=trainset)
rpart.plot(roz,type=1,extra=1)
```