[1] "Created: Thu Sep  4 12:22:07 2014"

See compfeatrev5.Rmd and compfeatnew5.Rmd for how the revised versions of the original data files featrev5.rda and featnew5.rda, respectively, were prepared for this analysis.

source("funcs.R")
require(MASS)
require(nnet)
require(ggplot2)
require(rpart)
require(rpart.plot)
require(xtable)
require(kernlab)
require(randomForest)

Training and test sets

load("featrev5.rda")
trainset <- cmd5
load("featnew5.rda")
testset <- cmd5
cmat <- rep(NA,5)
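
Before fitting anything, it is worth a quick check that both objects loaded as cmd5 have the expected structure and that the class labels are present (output not shown):

dim(trainset)          # one row per object; columns are the features plus the type label
dim(testset)
table(trainset$type)   # class counts; these proportions become the default LDA priors later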

Richards set predictors

Set up the predictors that we will use throughout. Note that some of the variables are transformed in the formula below; the asylog() and dublog() transforms (and nudlog(), used later) are not base R functions and are presumably defined in funcs.R, sourced above.

predform <- "rm.amplitude  + rm.beyond1std + rm.fpr20 +rm.fpr35 + rm.fpr50 + rm.fpr80 + log(rm.maxslope) + rm.mad +asylog(rm.medbuf) + rm.pairslope + log(rm.peramp) + log(rm.pdfp) + rm.skew + log(rm.kurtosis)+ rm.std + dublog(rm.rcorbor)"
preds <- c("rm.amplitude","rm.beyond1std","rm.fpr20","rm.fpr35","rm.fpr50","rm.fpr80","rm.maxslope","rm.mad","rm.medbuf","rm.pairslope","rm.peramp","rm.pdfp","rm.skew","rm.kurtosis","rm.std","rm.rcorbor")
tpredform <- paste(preds,collapse="+")
trform <- as.formula(paste("type ~",predform))

LDA

Linear discriminant analysis (LDA) using the default options.

We produce the cross-classification between predicted and observed class. Note that the default priors are the proportions found in the training set.
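
If different priors were wanted, for example equal weight on the five classes, lda() accepts a prior argument; a sketch (not used here):

ldamod.flat <- lda(trform, data=trainset, prior=rep(1,5)/5)   # equal prior probability for each class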

ldamod <- lda(formula=trform , data=trainset)
pv <- predict(ldamod, testset)
cm <- xtabs( ~ pv$class + testset$type)

This table shows the predicted type in the rows by the actual type in the columns.

print(xtable(cm,digits=0,caption="Actual"),type="html",caption.placement="top")
                        Actual
Predicted    agn blazar     cv  flare     sn
agn           61      6      1     10      6
blazar         3      8      2      1      5
cv             1      4     55      3     33
flare          1      0      7     18      1
sn            87     14     21     13    197

Same as above but now expressed as a percentage within each column:

print(xtable(round(100*prop.table(cm, 2)),digits=0,caption="Actual"),type="html",caption.placement="top")
                        Actual
Predicted    agn blazar     cv  flare     sn
agn           40     19      1     22      2
blazar         2     25      2      2      2
cv             1     12     64      7     14
flare          1      0      8     40      0
sn            57     44     24     29     81

The overall classification rate is 0.6075.
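
The rate is the proportion of test cases on the diagonal of the confusion matrix, here 339/558 = 0.6075. Presumably each method's rate is also stored in the cmat vector set up earlier, for the summary below; a sketch of that bookkeeping (assumed, not shown in the original code):

cr <- sum(diag(cm)) / sum(cm)   # proportion correctly classified; 339/558 for the LDA fit
cmat[1] <- cr                   # slot 1 assumed to hold the LDA rate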

Recursive Partitioning

roz <- rpart(trform ,data=trainset)
rpart.plot(roz,type=1,extra=1)

[Figure: classification tree fitted by rpart, drawn with rpart.plot]
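
The tree above is grown with rpart's default complexity parameter (cp = 0.01). The cross-validated error for a range of cp values can be inspected, and the tree pruned back if a simpler fit is preferred; a sketch (not run here):

printcp(roz)                        # cross-validated error by complexity parameter
# roz.small <- prune(roz, cp=0.02)  # e.g. prune back to a smaller tree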

pv <- predict(roz,newdata=testset,type="class")
cm <- xtabs( ~ pv + testset$type)

This table shows the predicted type in the rows by the actual type in the columns.

print(xtable(cm,digits=0,caption="Actual"),type="html",caption.placement="top")
                        Actual
Predicted    agn blazar     cv  flare     sn
agn           89      7      0      2      9
blazar        16      6      1      0      6
cv             4      6     51     19     24
flare          0      0      3     11      3
sn            44     13     31     13    200

Same as above but now expressed as a percentage within each column:

print(xtable(round(100*prop.table(cm, 2)),digits=0,caption="Actual"),type="html",caption.placement="top")
                        Actual
Predicted    agn blazar     cv  flare     sn
agn           58     22      0      4      4
blazar        10     19      1      0      2
cv             3     19     59     42     10
flare          0      0      3     24      1
sn            29     41     36     29     83

The overall classification rate is 0.6398.

Support Vector Machines

Use the default settings from the kernlab R package for this:

svmod <- ksvm(trform, data=trainset)
Using automatic sigma estimation (sigest) for RBF or laplace kernel 
pv <- predict(svmod, testset)
cm <- xtabs( ~ pv + testset$type)
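
For a factor response, ksvm() is believed to default to C-classification with a radial basis kernel, C = 1, and sigma chosen automatically by sigest(); making those assumed defaults explicit would look like this (a sketch, equivalent to the call above under that assumption):

svmod.explicit <- ksvm(trform, data=trainset, type="C-svc", kernel="rbfdot", C=1)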

This table shows the predicted type in the rows by the actual type in the columns.

print(xtable(cm,digits=0,caption="Actual"),type="html",caption.placement="top")
                        Actual
Predicted    agn blazar     cv  flare     sn
agn          119     11      0      3      6
blazar         0      7      1      0      9
cv             2      2     57      9     24
flare          1      0      1     20      1
sn            31     12     27     13    202

Same as above but now expressed as a percentage within each column:

print(xtable(round(100*prop.table(cm, 2)),digits=0,caption="Actual"),type="html",caption.placement="top")
                        Actual
Predicted    agn blazar     cv  flare     sn
agn           78     34      0      7      2
blazar         0     22      1      0      4
cv             1      6     66     20     10
flare          1      0      1     44      0
sn            20     38     31     29     83

The overall classification rate is 0.7258.

Neural Net

Use the multinom() function from the nnet R package. This might work better if the predictors were rescaled.

nnmod <- multinom(trform, data=trainset, trace=FALSE, maxit=1000, decay=5e-4)
pv <- predict(nnmod, testset)
cm <- xtabs( ~ pv + testset$type)
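
Following up on the scaling remark, one option would be to standardise the transformed design matrix and refit; a rough sketch, untested on these data (object names are illustrative):

mf <- model.frame(trform, trainset)                      # applies the transforms, drops incomplete rows
X  <- scale(model.matrix(attr(mf, "terms"), mf)[,-1])    # standardised predictors (intercept dropped)
nn.sc <- multinom(type ~ ., data=data.frame(type=model.response(mf), X),
                  trace=FALSE, maxit=1000, decay=5e-4)
mft <- model.frame(trform, testset)
Xt  <- scale(model.matrix(attr(mft, "terms"), mft)[,-1],
             center=attr(X, "scaled:center"), scale=attr(X, "scaled:scale"))
pv.sc <- predict(nn.sc, newdata=data.frame(Xt))          # test predictors on the training scale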

This table shows the predicted type in the rows by the actual type in the columns.

print(xtable(cm,digits=0,caption="Actual"),type="html",caption.placement="top")
                        Actual
Predicted    agn blazar     cv  flare     sn
agn          113     12      1      3      7
blazar         2     10      1      2      9
cv             3      4     55      2     27
flare          1      0      2     24      4
sn            34      6     27     14    195

Same as above but now expressed as a percentage within each column:

print(xtable(round(100*prop.table(cm, 2)),digits=0,caption="Actual"),type="html",caption.placement="top")
                        Actual
Predicted    agn blazar     cv  flare     sn
agn           74     38      1      7      3
blazar         1     31      1      4      4
cv             2     12     64      4     11
flare          1      0      2     53      2
sn            22     19     31     31     81

The overall classification rate is 0.7115.

Random Forest

Use the randomForest package with the default settings:

tallform <- as.formula(paste("type ~",tpredform))        # untransformed predictors
fmod <- randomForest(tallform, data=na.omit(trainset))   # randomForest does not accept missing values
pv <- predict(fmod, newdata=na.omit(testset))
cm <- xtabs( ~ pv + na.omit(testset)$type)
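
Beyond the confusion matrix, the fitted forest also ranks the predictors by mean decrease in Gini impurity, which is worth a look (output not shown):

importance(fmod)     # mean decrease in Gini impurity for each predictor
varImpPlot(fmod)     # the same ranking as a dot chart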

This table shows the predicted type in the rows by the actual type in the columns.

print(xtable(cm,digits=0,caption="Actual"),type="html",caption.placement="top")
                        Actual
Predicted    agn blazar     cv  flare     sn
agn          107      8      0      3      1
blazar         6      6      1      1     12
cv             3      3     61      6     24
flare          1      0      2     23      0
sn            36     15     22     12    205

Same as above but now expressed as a percentage within each column:

print(xtable(round(100*prop.table(cm, 2)),digits=0,caption="Actual"),type="html",caption.placement="top")
                        Actual
Predicted    agn blazar     cv  flare     sn
agn           70     25      0      7      0
blazar         4     19      1      2      5
cv             2      9     71     13     10
flare          1      0      2     51      0
sn            24     47     26     27     85

The overall classification rate is 0.7204.

Summary of Classification Rate Performance

Percentage correctly classified:

names(cmat) <- c("LDA","RP","SVM","NN","RF")
round(cmat*100,1)
 LDA   RP  SVM   NN   RF 
60.8 64.0 72.6 71.1 72.0 
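
These rates are presumably kept so that they can be compared with the second predictor set at the end; a sketch of that bookkeeping (the name matches the comparison code below):

cmatRICH <- cmat   # classification rates for the Richards-only predictors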

Richards + our set of measures

predform <- "shov + maxdiff + dscore + log(totvar) + log(quadvar) + log(famp) + log(fslope) + log(outl) + gscore + lsd + nudlog(gtvar) + rm.amplitude  + rm.beyond1std + rm.fpr20 +rm.fpr35 + rm.fpr50 + rm.fpr80 + log(rm.maxslope) + rm.mad +asylog(rm.medbuf) + rm.pairslope + log(rm.peramp) + log(rm.pdfp) + rm.skew + log(rm.kurtosis)+ rm.std + dublog(rm.rcorbor)"
preds <- c("shov","maxdiff","dscore","totvar","quadvar","famp","fslope","outl","gscore","lsd","gtvar","rm.amplitude","rm.beyond1std","rm.fpr20","rm.fpr35","rm.fpr50","rm.fpr80","rm.maxslope","rm.mad","rm.medbuf","rm.pairslope","rm.peramp","rm.pdfp","rm.skew","rm.kurtosis","rm.std","rm.rcorbor")
tpredform <- paste(preds,collapse="+")
trform <- as.formula(paste("type ~",predform))

LDA

Linear discriminant analysis using the default options.

We produce the cross-classification between predicted and observed class. Note that the default priors are the proportions found in the training set.

ldamod <- lda(formula=trform , data=trainset)
pv <- predict(ldamod, testset)
cm <- xtabs( ~ pv$class + testset$type)

This table shows the predicted type in the rows by the actual type in the columns.

print(xtable(cm,digits=0,caption="Actual"),type="html",caption.placement="top")
                        Actual
Predicted    agn blazar     cv  flare     sn
agn          127     13      2      7     13
blazar         3     13      1      0      4
cv             2      0     53      2     22
flare          1      0      6     22      2
sn            20      6     24     14    201

Same as above but now expressed as a percentage within each column:

print(xtable(round(100*prop.table(cm, 2)),digits=0,caption="Actual"),type="html",caption.placement="top")
                        Actual
Predicted    agn blazar     cv  flare     sn
agn           83     41      2     16      5
blazar         2     41      1      0      2
cv             1      0     62      4      9
flare          1      0      7     49      1
sn            13     19     28     31     83

The overall classification rate is 0.7455.

Recursive Partitioning

roz <- rpart(trform ,data=trainset)
rpart.plot(roz,type=1,extra=1)

[Figure: classification tree for the combined predictor set, drawn with rpart.plot]

pv <- predict(roz,newdata=testset,type="class")
cm <- xtabs( ~ pv + testset$type)

This table shows the predicted type in the rows by the actual type in the columns.

print(xtable(cm,digits=0,caption="Actual"),type="html",caption.placement="top")
                        Actual
Predicted    agn blazar     cv  flare     sn
agn           93      8      2      8      8
blazar         0      8      1      0      3
cv             0      3     37      0      8
flare          9      0      5     21      6
sn            51     13     41     16    217

Same as above but now expressed as a percentage within each column:

print(xtable(round(100*prop.table(cm, 2)),digits=0,caption="Actual"),type="html",caption.placement="top")
                        Actual
Predicted    agn blazar     cv  flare     sn
agn           61     25      2     18      3
blazar         0     25      1      0      1
cv             0      9     43      0      3
flare          6      0      6     47      2
sn            33     41     48     36     90

The overall classification rate is 0.6738.

Support Vector Machines

Use the default settings from the kernlab R package for this:

svmod <- ksvm(trform, data=trainset)
Using automatic sigma estimation (sigest) for RBF or laplace kernel 
pv <- predict(svmod, testset)
cm <- xtabs( ~ pv + testset$type)

This table shows the predicted type in the rows by the actual type in the columns.

print(xtable(cm,digits=0,caption="Actual"),type="html",caption.placement="top")
                        Actual
Predicted    agn blazar     cv  flare     sn
agn          125     12      0      3      8
blazar         2     13      1      0      3
cv             2      0     62      2     19
flare          1      0      3     26      2
sn            23      7     20     14    210

Same as above but now expressed as a percentage within each column:

print(xtable(round(100*prop.table(cm, 2)),digits=0,caption="Actual"),type="html",caption.placement="top")
                        Actual
Predicted    agn blazar     cv  flare     sn
agn           82     38      0      7      3
blazar         1     41      1      0      1
cv             1      0     72      4      8
flare          1      0      3     58      1
sn            15     22     23     31     87

The overall classification rate is 0.7814.

Neural Net

Use the multinom() function from the nnet R package. As before, this might work better if the predictors were rescaled.

nnmod <- multinom(trform, data=trainset, trace=FALSE, maxit=1000, decay=5e-4)
pv <- predict(nnmod, testset)
cm <- xtabs( ~ pv + testset$type)

This table shows the predicted type in the rows by the actual type in the columns.

print(xtable(cm,digits=0,caption="Actual"),type="html",caption.placement="top")
                        Actual
Predicted    agn blazar     cv  flare     sn
agn          134     13      0      4      8
blazar         3     13      0      0      3
cv             0      1     51      0     23
flare          3      0      7     30      5
sn            13      5     28     11    203

Same as above but now expressed as a percentage within each column:

print(xtable(round(100*prop.table(cm, 2)),digits=0,caption="Actual"),type="html",caption.placement="top")
                        Actual
Predicted    agn blazar     cv  flare     sn
agn           88     41      0      9      3
blazar         2     41      0      0      1
cv             0      3     59      0     10
flare          2      0      8     67      2
sn             8     16     33     24     84

The overall classification rate is 0.7724.

Random Forest

Use the randomForest package with the default settings:

tallform <- as.formula(paste("type ~",tpredform))
fmod <- randomForest(tallform, data=na.omit(trainset))
pv <- predict(fmod, newdata=na.omit(testset))
cm <- xtabs( ~ pv + na.omit(testset)$type)

This table shows the predicted type in the rows by the actual type in the columns.

print(xtable(cm,digits=0,caption="Actual"),type="html",caption.placement="top")
                        Actual
Predicted    agn blazar     cv  flare     sn
agn          123      7      0      2      3
blazar         4     12      2      1      3
cv             2      2     58      1      9
flare          1      0      4     29      3
sn            23     11     22     12    224

Same as above but now expressed as a percentage within each column:

print(xtable(round(100*prop.table(cm, 2)),digits=0,caption="Actual"),type="html",caption.placement="top")
                        Actual
Predicted    agn blazar     cv  flare     sn
agn           80     22      0      4      1
blazar         3     38      2      2      1
cv             1      6     67      2      4
flare          1      0      5     64      1
sn            15     34     26     27     93

The overall classification rate is 0.7993.

Summary of Classification Rate Performance

Percentage correctly classified:

names(cmat) <- c("LDA","RP","SVM","NN","RF")
round(cmat*100,1)
 LDA   RP  SVM   NN   RF 
74.6 67.4 78.1 77.2 79.9 
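
As before, these rates are presumably saved under the name used in the comparison below:

cmatGPR <- cmat    # classification rates for the combined predictor set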

Comparison of performance

cmatc <- rbind(cmatRICH,cmatGPR)
dimnames(cmatc) <- list(c("Richards","Ours+Rich"),  c("LDA","RP","SVM","NN","RF"))
round(cmatc*100,1)
           LDA   RP  SVM   NN   RF
Richards  60.8 64.0 72.6 71.1 72.0
Ours+Rich 74.6 67.4 78.1 77.2 79.9