[1] "Created: Thu Sep  4 11:00:09 2014"

See compfeat.html for how the data featbig.rda was prepared for this analysis.

load("featbig.rda")
source("funcs.R")
require(MASS)
require(nnet)
require(ggplot2)
require(rpart)
require(rpart.plot)
require(xtable)
require(kernlab)
require(randomForest)

Define the training and test sets

Randomly divide available data into two equally sized sets:

set.seed(123)
n <- nrow(cmdbig)
isel <- sample(1:n,round(n/2))
trainset <- cmdbig[-isel,]
testset <- cmdbig[isel,]
cmat <- rep(NA,5)

Richards measures alone

Set up the predictors that we will use. Note that I have transformed some of the variables as seen in the setup.

predform <- "rm.amplitude  + rm.beyond1std + rm.fpr20 +rm.fpr35 + rm.fpr50 + rm.fpr80 + log(rm.maxslope) + rm.mad +asylog(rm.medbuf) + rm.pairslope + log(rm.peramp) + log(rm.pdfp) + rm.skew + log(rm.kurtosis)+ rm.std + dublog(rm.rcorbor)"
preds <- c("rm.amplitude","rm.beyond1std","rm.fpr20","rm.fpr35","rm.fpr50","rm.fpr80","rm.maxslope","rm.mad","rm.medbuf","rm.pairslope","rm.peramp","rm.pdfp","rm.skew","rm.kurtosis","rm.std","rm.rcorbor")
tpredform <- paste(preds,collapse="+")
trform <- as.formula(paste("type ~",predform))

LDA

Linear Discriminant analysis using the default options.

We produce the cross-classification between predicted and observed class. Note that the default priors are the proportions found in the training set.

ldamod <- lda(formula=trform , data=trainset)
pv <- predict(ldamod, testset)
cm <- xtabs( ~ pv$class + testset$type)

This table shows the predicted type in the rows by the actual type in the columns.

print(xtable(cm,digits=0,caption="Actual"),type="html",caption.placement="top")
Actual
nv tr
nv 20664 7729
tr 4496 17111

Same as above but now expressed a percentage within each column:

print(xtable(round(100*prop.table(cm, 2)),digits=0,caption="Actual"),type="html",caption.placement="top")
Actual
nv tr
nv 82 31
tr 18 69

The overall classification rate is 0.7555.

Recursive Partitioning

roz <- rpart(trform ,data=trainset)
rpart.plot(roz,type=1,extra=1)

plot of chunk unnamed-chunk-14

pv <- predict(roz,newdata=testset,type="class")
cm <- xtabs( ~ pv + testset$type)

This table shows the predicted type in the rows by the actual type in the columns.

print(xtable(cm,digits=0,caption="Actual"),type="html",caption.placement="top")
Actual
nv tr
nv 20791 5920
tr 4369 18920

Same as above but now expressed a percentage within each column:

print(xtable(round(100*prop.table(cm, 2)),digits=0,caption="Actual"),type="html",caption.placement="top")
Actual
nv tr
nv 83 24
tr 17 76

The overall classification rate is 0.7942.

Support Vector Machines

Use the default choice of setting from the kernlab R package for this:

svmod <- ksvm(trform, data=trainset)
Using automatic sigma estimation (sigest) for RBF or laplace kernel 
pv <- predict(svmod, testset)
cm <- xtabs( ~ pv + testset$type)

This table shows the predicted type in the rows by the actual type in the columns.

print(xtable(cm,digits=0,caption="Actual"),type="html",caption.placement="top")
Actual
nv tr
nv 22307 4506
tr 2853 20334

Same as above but now expressed a percentage within each column:

print(xtable(round(100*prop.table(cm, 2)),digits=0,caption="Actual"),type="html",caption.placement="top")
Actual
nv tr
nv 89 18
tr 11 82

The overall classification rate is 0.8528.

Neural Net

Use the multinom() function from the nnet R package. Might work better with some scaling.

svmod <- multinom(trform, data=trainset, trace=FALSE, maxit=1000, decay=5e-4)
pv <- predict(svmod, testset)
cm <- xtabs( ~ pv + testset$type)

This table shows the predicted type in the rows by the actual type in the columns.

print(xtable(cm,digits=0,caption="Actual"),type="html",caption.placement="top")
Actual
nv tr
nv 20577 7012
tr 4583 17828

Same as above but now expressed a percentage within each column:

print(xtable(round(100*prop.table(cm, 2)),digits=0,caption="Actual"),type="html",caption.placement="top")
Actual
nv tr
nv 82 28
tr 18 72

The overall classification rate is 0.7681.

Random Forest

Use the randomForest package with the default settings:

tallform <- as.formula(paste("type ~",tpredform))
fmod <- randomForest(tallform, data=na.omit(trainset))
pv <- predict(fmod, newdata=na.omit(testset))
cm <- xtabs( ~ pv + na.omit(testset)$type)

This table shows the predicted type in the rows by the actual type in the columns.

print(xtable(cm,digits=0,caption="Actual"),type="html",caption.placement="top")
Actual
nv tr
nv 21684 3785
tr 3476 21055

Same as above but now expressed a percentage within each column:

print(xtable(round(100*prop.table(cm, 2)),digits=0,caption="Actual"),type="html",caption.placement="top")
Actual
nv tr
nv 86 15
tr 14 85

The overall classification rate is 0.8548.

Summary of Classification Rate Performance

Percentage correctly classified:

names(cmat) <- c("LDA","RP","SVM","NN","RF")
round(cmat*100,1)
 LDA   RP  SVM   NN   RF 
75.5 79.4 85.3 76.8 85.5 

Richards+Our set of measures

predform <- "shov + maxdiff + dscore + log(totvar) + log(quadvar) + log(famp) + log(fslope) + log(outl) + gscore + lsd + nudlog(gtvar) + rm.amplitude  + rm.beyond1std + rm.fpr20 +rm.fpr35 + rm.fpr50 + rm.fpr80 + log(rm.maxslope) + rm.mad +asylog(rm.medbuf) + rm.pairslope + log(rm.peramp) + log(rm.pdfp) + rm.skew + log(rm.kurtosis)+ rm.std + dublog(rm.rcorbor)"
preds <- c("shov","maxdiff","dscore","totvar","quadvar","famp","fslope","outl","gscore","lsd","gtvar","rm.amplitude","rm.beyond1std","rm.fpr20","rm.fpr35","rm.fpr50","rm.fpr80","rm.maxslope","rm.mad","rm.medbuf","rm.pairslope","rm.peramp","rm.pdfp","rm.skew","rm.kurtosis","rm.std","rm.rcorbor")
tpredform <- paste(preds,collapse="+")
trform <- as.formula(paste("type ~",predform))

LDA

Linear Discriminant analysis using the default options.

We produce the cross-classification between predicted and observed class. Note that the default priors are the proportions found in the training set.

ldamod <- lda(formula=trform , data=trainset)
pv <- predict(ldamod, testset)
cm <- xtabs( ~ pv$class + testset$type)

This table shows the predicted type in the rows by the actual type in the columns.

print(xtable(cm,digits=0,caption="Actual"),type="html",caption.placement="top")
Actual
nv tr
nv 24747 834
tr 413 24006

Same as above but now expressed a percentage within each column:

print(xtable(round(100*prop.table(cm, 2)),digits=0,caption="Actual"),type="html",caption.placement="top")
Actual
nv tr
nv 98 3
tr 2 97

The overall classification rate is 0.9751.

Recursive Partitioning

roz <- rpart(trform ,data=trainset)
rpart.plot(roz,type=1,extra=1)

plot of chunk unnamed-chunk-45

pv <- predict(roz,newdata=testset,type="class")
cm <- xtabs( ~ pv + testset$type)

This table shows the predicted type in the rows by the actual type in the columns.

print(xtable(cm,digits=0,caption="Actual"),type="html",caption.placement="top")
Actual
nv tr
nv 21170 1391
tr 3990 23449

Same as above but now expressed a percentage within each column:

print(xtable(round(100*prop.table(cm, 2)),digits=0,caption="Actual"),type="html",caption.placement="top")
Actual
nv tr
nv 84 6
tr 16 94

The overall classification rate is 0.8924.

Support Vector Machines

Use the default choice of setting from the kernlab R package for this:

svmod <- ksvm(trform, data=trainset)
Using automatic sigma estimation (sigest) for RBF or laplace kernel 
pv <- predict(svmod, testset)
cm <- xtabs( ~ pv + testset$type)

This table shows the predicted type in the rows by the actual type in the columns.

print(xtable(cm,digits=0,caption="Actual"),type="html",caption.placement="top")
Actual
nv tr
nv 24538 182
tr 622 24658

Same as above but now expressed a percentage within each column:

print(xtable(round(100*prop.table(cm, 2)),digits=0,caption="Actual"),type="html",caption.placement="top")
Actual
nv tr
nv 98 1
tr 2 99

The overall classification rate is 0.9839.

Neural Net

Use the multinom() function from the nnet R package. Might work better with some scaling.

svmod <- multinom(trform, data=trainset, trace=FALSE, maxit=1000, decay=5e-4)
pv <- predict(svmod, testset)
cm <- xtabs( ~ pv + testset$type)

This table shows the predicted type in the rows by the actual type in the columns.

print(xtable(cm,digits=0,caption="Actual"),type="html",caption.placement="top")
Actual
nv tr
nv 24699 398
tr 461 24442

Same as above but now expressed a percentage within each column:

print(xtable(round(100*prop.table(cm, 2)),digits=0,caption="Actual"),type="html",caption.placement="top")
Actual
nv tr
nv 98 2
tr 2 98

The overall classification rate is 0.9828.

Random Forest

Use the randomForest package with the default settings:

tallform <- as.formula(paste("type ~",tpredform))
fmod <- randomForest(tallform, data=na.omit(trainset))
pv <- predict(fmod, newdata=na.omit(testset))
cm <- xtabs( ~ pv + na.omit(testset)$type)

This table shows the predicted type in the rows by the actual type in the columns.

print(xtable(cm,digits=0,caption="Actual"),type="html",caption.placement="top")
Actual
nv tr
nv 23994 420
tr 1166 24420

Same as above but now expressed a percentage within each column:

print(xtable(round(100*prop.table(cm, 2)),digits=0,caption="Actual"),type="html",caption.placement="top")
Actual
nv tr
nv 95 2
tr 5 98

The overall classification rate is 0.9683.

Summary of Classification Rate Performance

Percentage correctly classified:

names(cmat) <- c("LDA","RP","SVM","NN","RF")
round(cmat*100,1)
 LDA   RP  SVM   NN   RF 
97.5 89.2 98.4 98.3 96.8 

Comparison of performance

cmatc <- rbind(cmatRICH,cmatGPR)
dimnames(cmatc) <- list(c("Richards","Ours+Rich"),  c("LDA","RP","SVM","NN","RF"))
round(cmatc*100,1)
           LDA   RP  SVM   NN   RF
Richards  75.5 79.4 85.3 76.8 85.5
Ours+Rich 97.5 89.2 98.4 98.3 96.8