Division of 100k set into training and test with evaluation of both sets of measures

Define the training and test sets
Richards measures alone
Richards+Our set of measures
Comparison of performance

[1] "Created: Thu Sep  4 11:00:09 2014"

See compfeat.html for how the data featbig.rda was prepared for this analysis.

load("featbig.rda")
source("funcs.R")
require(MASS)
require(nnet)
require(ggplot2)
require(rpart)
require(rpart.plot)
require(xtable)
require(kernlab)
require(randomForest)

Define the training and test sets

Randomly divide available data into two equally sized sets:

set.seed(123)
n <- nrow(cmdbig)
isel <- sample(1:n,round(n/2))
trainset <- cmdbig[-isel,]
testset <- cmdbig[isel,]
cmat <- rep(NA,5)

Richards measures alone

Set up the predictors that we will use. Note that I have transformed some of the variables as seen in the setup.

predform <- "rm.amplitude  + rm.beyond1std + rm.fpr20 +rm.fpr35 + rm.fpr50 + rm.fpr80 + log(rm.maxslope) + rm.mad +asylog(rm.medbuf) + rm.pairslope + log(rm.peramp) + log(rm.pdfp) + rm.skew + log(rm.kurtosis)+ rm.std + dublog(rm.rcorbor)"
preds <- c("rm.amplitude","rm.beyond1std","rm.fpr20","rm.fpr35","rm.fpr50","rm.fpr80","rm.maxslope","rm.mad","rm.medbuf","rm.pairslope","rm.peramp","rm.pdfp","rm.skew","rm.kurtosis","rm.std","rm.rcorbor")
tpredform <- paste(preds,collapse="+")
trform <- as.formula(paste("type ~",predform))

LDA

Linear Discriminant analysis using the default options.

We produce the cross-classification between predicted and observed class. Note that the default priors are the proportions found in the training set.

ldamod <- lda(formula=trform , data=trainset)
pv <- predict(ldamod, testset)
cm <- xtabs( ~ pv$class + testset$type)

This table shows the predicted type in the rows by the actual type in the columns.

print(xtable(cm,digits=0,caption="Actual"),type="html",caption.placement="top")

Actual
	nv	tr
nv	20664	7729
tr	4496	17111

Same as above but now expressed a percentage within each column:

print(xtable(round(100*prop.table(cm, 2)),digits=0,caption="Actual"),type="html",caption.placement="top")

Actual
	nv	tr
nv	82	31
tr	18	69

The overall classification rate is 0.7555.

Recursive Partitioning

roz <- rpart(trform ,data=trainset)
rpart.plot(roz,type=1,extra=1)

plot of chunk unnamed-chunk-14

pv <- predict(roz,newdata=testset,type="class")
cm <- xtabs( ~ pv + testset$type)

This table shows the predicted type in the rows by the actual type in the columns.

print(xtable(cm,digits=0,caption="Actual"),type="html",caption.placement="top")

Actual
	nv	tr
nv	20791	5920
tr	4369	18920

Same as above but now expressed a percentage within each column:

print(xtable(round(100*prop.table(cm, 2)),digits=0,caption="Actual"),type="html",caption.placement="top")

Actual
	nv	tr
nv	83	24
tr	17	76

The overall classification rate is 0.7942.

Support Vector Machines

Use the default choice of setting from the kernlab R package for this:

svmod <- ksvm(trform, data=trainset)

Using automatic sigma estimation (sigest) for RBF or laplace kernel

pv <- predict(svmod, testset)
cm <- xtabs( ~ pv + testset$type)

This table shows the predicted type in the rows by the actual type in the columns.

print(xtable(cm,digits=0,caption="Actual"),type="html",caption.placement="top")

Actual
	nv	tr
nv	22307	4506
tr	2853	20334

Same as above but now expressed a percentage within each column:

print(xtable(round(100*prop.table(cm, 2)),digits=0,caption="Actual"),type="html",caption.placement="top")

Actual
	nv	tr
nv	89	18
tr	11	82

The overall classification rate is 0.8528.

Neural Net

Use the multinom() function from the nnet R package. Might work better with some scaling.

svmod <- multinom(trform, data=trainset, trace=FALSE, maxit=1000, decay=5e-4)
pv <- predict(svmod, testset)
cm <- xtabs( ~ pv + testset$type)

This table shows the predicted type in the rows by the actual type in the columns.

print(xtable(cm,digits=0,caption="Actual"),type="html",caption.placement="top")

Actual
	nv	tr
nv	20577	7012
tr	4583	17828

Same as above but now expressed a percentage within each column:

print(xtable(round(100*prop.table(cm, 2)),digits=0,caption="Actual"),type="html",caption.placement="top")

Actual
	nv	tr
nv	82	28
tr	18	72

The overall classification rate is 0.7681.

Random Forest

Use the randomForest package with the default settings:

tallform <- as.formula(paste("type ~",tpredform))
fmod <- randomForest(tallform, data=na.omit(trainset))
pv <- predict(fmod, newdata=na.omit(testset))
cm <- xtabs( ~ pv + na.omit(testset)$type)

This table shows the predicted type in the rows by the actual type in the columns.

print(xtable(cm,digits=0,caption="Actual"),type="html",caption.placement="top")

Actual
	nv	tr
nv	21684	3785
tr	3476	21055

Same as above but now expressed a percentage within each column:

print(xtable(round(100*prop.table(cm, 2)),digits=0,caption="Actual"),type="html",caption.placement="top")

Actual
	nv	tr
nv	86	15
tr	14	85

The overall classification rate is 0.8548.

Summary of Classification Rate Performance

Percentage correctly classified:

names(cmat) <- c("LDA","RP","SVM","NN","RF")
round(cmat*100,1)

 LDA   RP  SVM   NN   RF 
75.5 79.4 85.3 76.8 85.5

Richards+Our set of measures

predform <- "shov + maxdiff + dscore + log(totvar) + log(quadvar) + log(famp) + log(fslope) + log(outl) + gscore + lsd + nudlog(gtvar) + rm.amplitude  + rm.beyond1std + rm.fpr20 +rm.fpr35 + rm.fpr50 + rm.fpr80 + log(rm.maxslope) + rm.mad +asylog(rm.medbuf) + rm.pairslope + log(rm.peramp) + log(rm.pdfp) + rm.skew + log(rm.kurtosis)+ rm.std + dublog(rm.rcorbor)"
preds <- c("shov","maxdiff","dscore","totvar","quadvar","famp","fslope","outl","gscore","lsd","gtvar","rm.amplitude","rm.beyond1std","rm.fpr20","rm.fpr35","rm.fpr50","rm.fpr80","rm.maxslope","rm.mad","rm.medbuf","rm.pairslope","rm.peramp","rm.pdfp","rm.skew","rm.kurtosis","rm.std","rm.rcorbor")
tpredform <- paste(preds,collapse="+")
trform <- as.formula(paste("type ~",predform))

LDA

Linear Discriminant analysis using the default options.

We produce the cross-classification between predicted and observed class. Note that the default priors are the proportions found in the training set.

ldamod <- lda(formula=trform , data=trainset)
pv <- predict(ldamod, testset)
cm <- xtabs( ~ pv$class + testset$type)

This table shows the predicted type in the rows by the actual type in the columns.

print(xtable(cm,digits=0,caption="Actual"),type="html",caption.placement="top")

Actual
	nv	tr
nv	24747	834
tr	413	24006

Same as above but now expressed a percentage within each column:

print(xtable(round(100*prop.table(cm, 2)),digits=0,caption="Actual"),type="html",caption.placement="top")

Actual
	nv	tr
nv	98	3
tr	2	97

The overall classification rate is 0.9751.

Recursive Partitioning

roz <- rpart(trform ,data=trainset)
rpart.plot(roz,type=1,extra=1)

plot of chunk unnamed-chunk-45

pv <- predict(roz,newdata=testset,type="class")
cm <- xtabs( ~ pv + testset$type)

This table shows the predicted type in the rows by the actual type in the columns.

print(xtable(cm,digits=0,caption="Actual"),type="html",caption.placement="top")

Actual
	nv	tr
nv	21170	1391
tr	3990	23449

Same as above but now expressed a percentage within each column:

print(xtable(round(100*prop.table(cm, 2)),digits=0,caption="Actual"),type="html",caption.placement="top")

Actual
	nv	tr
nv	84	6
tr	16	94

The overall classification rate is 0.8924.

Support Vector Machines

Use the default choice of setting from the kernlab R package for this:

svmod <- ksvm(trform, data=trainset)

Using automatic sigma estimation (sigest) for RBF or laplace kernel

pv <- predict(svmod, testset)
cm <- xtabs( ~ pv + testset$type)

This table shows the predicted type in the rows by the actual type in the columns.

print(xtable(cm,digits=0,caption="Actual"),type="html",caption.placement="top")

Actual
	nv	tr
nv	24538	182
tr	622	24658

Same as above but now expressed a percentage within each column:

print(xtable(round(100*prop.table(cm, 2)),digits=0,caption="Actual"),type="html",caption.placement="top")

Actual
	nv	tr
nv	98	1
tr	2	99

The overall classification rate is 0.9839.

Neural Net

Use the multinom() function from the nnet R package. Might work better with some scaling.

svmod <- multinom(trform, data=trainset, trace=FALSE, maxit=1000, decay=5e-4)
pv <- predict(svmod, testset)
cm <- xtabs( ~ pv + testset$type)

This table shows the predicted type in the rows by the actual type in the columns.

print(xtable(cm,digits=0,caption="Actual"),type="html",caption.placement="top")

Actual
	nv	tr
nv	24699	398
tr	461	24442

Same as above but now expressed a percentage within each column:

print(xtable(round(100*prop.table(cm, 2)),digits=0,caption="Actual"),type="html",caption.placement="top")

Actual
	nv	tr
nv	98	2
tr	2	98

The overall classification rate is 0.9828.

Random Forest

Use the randomForest package with the default settings:

tallform <- as.formula(paste("type ~",tpredform))
fmod <- randomForest(tallform, data=na.omit(trainset))
pv <- predict(fmod, newdata=na.omit(testset))
cm <- xtabs( ~ pv + na.omit(testset)$type)

This table shows the predicted type in the rows by the actual type in the columns.

print(xtable(cm,digits=0,caption="Actual"),type="html",caption.placement="top")

Actual
	nv	tr
nv	23994	420
tr	1166	24420

Same as above but now expressed a percentage within each column:

print(xtable(round(100*prop.table(cm, 2)),digits=0,caption="Actual"),type="html",caption.placement="top")

Actual
	nv	tr
nv	95	2
tr	5	98

The overall classification rate is 0.9683.

Summary of Classification Rate Performance

Percentage correctly classified:

names(cmat) <- c("LDA","RP","SVM","NN","RF")
round(cmat*100,1)

 LDA   RP  SVM   NN   RF 
97.5 89.2 98.4 98.3 96.8

Comparison of performance

cmatc <- rbind(cmatRICH,cmatGPR)
dimnames(cmatc) <- list(c("Richards","Ours+Rich"),  c("LDA","RP","SVM","NN","RF"))
round(cmatc*100,1)

           LDA   RP  SVM   NN   RF
Richards  75.5 79.4 85.3 76.8 85.5
Ours+Rich 97.5 89.2 98.4 98.3 96.8

Division of 100k set into training and test with evaluation of both sets of measures

Julian Faraway

Define the training and test sets

Richards measures alone

LDA

Recursive Partitioning

Support Vector Machines

Neural Net

Random Forest

Summary of Classification Rate Performance

Richards+Our set of measures

LDA

Recursive Partitioning

Support Vector Machines

Neural Net

Random Forest

Summary of Classification Rate Performance

Comparison of performance