[1] "Created: Wed Sep  3 16:21:25 2014"

The purpose of this analysis is to establish a performance baseline for our measures. I have deliberately chosen the default versions of five different classification methods:

  1. LDA
  2. Recursive Partitioning
  3. SVM
  4. NNET
  5. Random Forest

It is quite possible that better results could be obtained by tuning these methods or by using different classification methods altogether. That is worth trying, but my main objective is the generation of new features which I hope will improve on the Richards features. I need to use the same classification methods for both sets of features in order to make a fair comparison.

See featlcdb.Rmd for how the data feat.rda was prepared for this analysis.

load("feat.rda")
source("funcs.R")
require(MASS)
require(nnet)
require(ggplot2)
require(rpart)
require(rpart.plot)
require(xtable)
require(kernlab)
require(randomForest)

Richards set predictors

Set up the predictors that we will use throughout. Note that some of the variables are transformed, as seen in the formula below.

predform <- "rm.amplitude  + rm.beyond1std + rm.fpr20 +rm.fpr35 + rm.fpr50 + rm.fpr80 + log(rm.maxslope) + rm.mad +asylog(rm.medbuf) + rm.pairslope + log(rm.peramp) + log(rm.pdfp) + rm.skew + log(rm.kurtosis)+ rm.std + dublog(rm.rcorbor)"
preds <- c("rm.amplitude","rm.beyond1std","rm.fpr20","rm.fpr35","rm.fpr50","rm.fpr80","rm.maxslope","rm.mad","rm.medbuf","rm.pairslope","rm.peramp","rm.pdfp","rm.skew","rm.kurtosis","rm.std","rm.rcorbor")
tpredform <- paste(preds,collapse="+")
trform <- as.formula(paste("type ~",predform))
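
As a quick sanity check (not part of the original analysis), the transformed formula can be compared against the raw predictor list; asylog() and dublog() are custom transforms, presumably defined in funcs.R, and are not reproduced here.

# Illustrative check: the transformed formula should reference exactly the same variables as preds
setequal(all.vars(as.formula(paste("~", predform))), preds)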

Creation of model formulae and subsetting of the training and testing sets

Formula for classification of all types:

allform <- as.formula(paste("type ~",predform))

Transients vs. non-variables:

trains$vtype <- ifelse(trains$type == "nv","nv","tr")
trains$vtype <- factor(trains$vtype)
tests$vtype <- ifelse(tests$type == "nv","nv","tr")
tests$vtype <- factor(tests$vtype)
vtform <- as.formula(paste("vtype ~",predform))
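
A quick look at the resulting two-class balance can be useful before fitting; purely illustrative, output not shown:

# Class balance of the transient/non-variable response
table(trains$vtype)
table(tests$vtype)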

Transients only:

trains$ttype <- trains$type
trains$ttype[trains$ttype == "nv"] <- NA
trains$ttype <- factor(trains$ttype)
tests$ttype <- tests$type
tests$ttype[tests$ttype == "nv"] <- NA
tests$ttype <- factor(tests$ttype)
trform <- as.formula(paste("ttype ~",predform))
cmat <- matrix(NA,nrow=4,ncol=5)
dimnames(cmat) <- list(c("All","TranNoTran","Tranonly","Hierarch"),c("LDA","RPart","SVM","NN","Forest"))

All types

LDA

Linear discriminant analysis using the default options.

We produce the cross-classification between predicted and observed class. Note that the default priors are the proportions found in the training set.

ldamod <- lda(allform, data=trains)
pv <- predict(ldamod, tests)
cm <- xtabs( ~ pv$class + tests$type)
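
Since the default priors are the training-set proportions, a fit with equal priors could be obtained by passing the prior argument to lda(); a sketch, not used in this baseline (the object name is illustrative):

# Illustrative only: refit with equal priors across the eight classes
ldaeq <- lda(allform, data=trains, prior=rep(1/nlevels(trains$type), nlevels(trains$type)))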

This table shows the predicted type in the rows by the actual type in the columns.

print(xtable(cm,digits=0,caption="Actual"),type="html",caption.placement="top")
Actual
agn blazar cv downes flare nv rrlyr sn
agn 2 0 0 0 0 1 0 0
blazar 9 19 11 15 0 9 0 7
cv 0 4 81 34 2 10 1 22
downes 0 1 7 12 1 5 9 4
flare 1 0 7 8 9 10 1 1
nv 32 15 25 66 11 489 65 89
rrlyr 0 0 3 0 0 18 26 4
sn 0 0 11 0 1 16 1 65

Same as above but now expressed as a percentage within each column:

print(xtable(round(100*prop.table(cm, 2)),digits=0,caption="Actual"),type="html",caption.placement="top")
Actual
agn blazar cv downes flare nv rrlyr sn
agn 5 0 0 0 0 0 0 0
blazar 20 49 8 11 0 2 0 4
cv 0 10 56 25 8 2 1 11
downes 0 3 5 9 4 1 9 2
flare 2 0 5 6 38 2 1 1
nv 73 38 17 49 46 88 63 46
rrlyr 0 0 2 0 0 3 25 2
sn 0 0 8 0 4 3 1 34

The overall classification rate is 0.5669.
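
The rate quoted here is presumably the proportion of correctly classified test cases, i.e. the diagonal of the confusion matrix over its total, stored in cmat with one row per task and one column per classifier. A sketch of the bookkeeping (the exact code is not shown in this document):

# Overall classification rate = correct predictions / total predictions (illustrative)
cmat["All","LDA"] <- sum(diag(cm))/sum(cm)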

Recursive Partitioning

roz <- rpart(allform, data=trains)
rpart.plot(roz,type=1,extra=1)

[Figure: classification tree for the all-types model, drawn with rpart.plot]
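
One quick diagnostic for the default tree is its complexity-parameter table, which shows how the cross-validated error changes with tree size; illustrative only, not part of the baseline:

# Complexity-parameter table of the fitted tree (illustrative)
printcp(roz)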

pv <- predict(roz,newdata=tests,type="class")
cm <- xtabs( ~ pv + tests$type)

This table shows the predicted type in the rows by the actual type in the columns.

print(xtable(cm,digits=0,caption="Actual"),type="html",caption.placement="top")
Actual
agn blazar cv downes flare nv rrlyr sn
agn 0 0 0 0 0 0 0 0
blazar 0 0 0 0 0 0 0 0
cv 5 26 95 53 4 20 3 27
downes 0 0 0 0 0 0 0 0
flare 0 0 0 0 0 0 0 0
nv 31 7 26 73 16 497 47 80
rrlyr 0 1 2 3 0 12 53 3
sn 8 5 22 6 4 29 0 82

Same as above but now expressed as a percentage within each column:

print(xtable(round(100*prop.table(cm, 2)),digits=0,caption="Actual"),type="html",caption.placement="top")
Actual
agn blazar cv downes flare nv rrlyr sn
agn 0 0 0 0 0 0 0 0
blazar 0 0 0 0 0 0 0 0
cv 11 67 66 39 17 4 3 14
downes 0 0 0 0 0 0 0 0
flare 0 0 0 0 0 0 0 0
nv 70 18 18 54 67 89 46 42
rrlyr 0 3 1 2 0 2 51 2
sn 18 13 15 4 17 5 0 43

The overall classification rate is 0.5863.

Support Vector Machines

Use the default choice of settings from the kernlab R package for this:

svmod <- ksvm(allform, data=trains)
Using automatic sigma estimation (sigest) for RBF or laplace kernel 
pv <- predict(svmod, tests)
cm <- xtabs( ~ pv + tests$type)
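
The automatic sigma estimation above chooses the RBF kernel width from the data; if explicit control were wanted, ksvm() also accepts the kernel parameters and cost directly. A sketch with purely illustrative values:

# Illustrative only: explicit RBF width and cost instead of the defaults
svmod2 <- ksvm(allform, data=trains, kernel="rbfdot", kpar=list(sigma=0.05), C=10)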

This table shows the predicted type in the rows by the actual type in the columns.

print(xtable(cm,digits=0,caption="Actual"),type="html",caption.placement="top")
Actual
agn blazar cv downes flare nv rrlyr sn
agn 11 2 1 5 0 5 2 3
blazar 0 14 4 8 0 2 0 1
cv 1 4 95 37 3 11 2 23
downes 0 4 6 10 0 3 4 0
flare 0 0 0 0 3 4 0 0
nv 31 9 19 66 17 508 18 61
rrlyr 0 1 0 4 0 8 77 1
sn 1 5 20 5 1 17 0 103

Same as above but now expressed as a percentage within each column:

print(xtable(round(100*prop.table(cm, 2)),digits=0,caption="Actual"),type="html",caption.placement="top")
Actual
agn blazar cv downes flare nv rrlyr sn
agn 25 5 1 4 0 1 2 2
blazar 0 36 3 6 0 0 0 1
cv 2 10 66 27 12 2 2 12
downes 0 10 4 7 0 1 4 0
flare 0 0 0 0 12 1 0 0
nv 70 23 13 49 71 91 17 32
rrlyr 0 3 0 3 0 1 75 1
sn 2 13 14 4 4 3 0 54

The overall classification rate is 0.6621.

Neural Net

Use the multinom() function from the nnet R package. It might work better with some scaling of the predictors (see the sketch below).

nnmod <- multinom(allform, data=trains, trace=FALSE, maxit=1000, decay=5e-4)
pv <- predict(nnmod, tests)
cm <- xtabs( ~ pv + tests$type)
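
One simple version of the scaling idea mentioned above is to centre and scale the raw predictors (using the training means and standard deviations for both sets) and refit on the untransformed formula. This is a sketch only, with illustrative object names, and it drops the log transforms used elsewhere:

# Illustrative only: refit the multinomial model on centred and scaled raw predictors
sc <- scale(trains[, preds])                  # training means and standard deviations
strains <- trains; strains[, preds] <- sc
stests <- tests
stests[, preds] <- scale(tests[, preds], center=attr(sc, "scaled:center"), scale=attr(sc, "scaled:scale"))
rawform <- as.formula(paste("type ~", tpredform))
nnsc <- multinom(rawform, data=strains, trace=FALSE, maxit=1000, decay=5e-4)
pvsc <- predict(nnsc, stests)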

This table shows the predicted type in the rows by the actual type in the columns.

print(xtable(cm,digits=0,caption="Actual"),type="html",caption.placement="top")
Actual
agn blazar cv downes flare nv rrlyr sn
agn 13 2 0 7 0 5 0 2
blazar 0 12 4 7 0 1 0 2
cv 1 10 82 39 1 10 2 18
downes 1 1 16 11 1 4 3 8
flare 0 0 0 1 4 4 0 0
nv 28 12 29 56 12 508 10 95
rrlyr 1 1 0 14 4 5 88 0
sn 0 1 14 0 2 21 0 67

Same as above but now expressed as a percentage within each column:

print(xtable(round(100*prop.table(cm, 2)),digits=0,caption="Actual"),type="html",caption.placement="top")
Actual
agn blazar cv downes flare nv rrlyr sn
agn 30 5 0 5 0 1 0 1
blazar 0 31 3 5 0 0 0 1
cv 2 26 57 29 4 2 2 9
downes 2 3 11 8 4 1 3 4
flare 0 0 0 1 17 1 0 0
nv 64 31 20 41 50 91 10 49
rrlyr 2 3 0 10 17 1 85 0
sn 0 3 10 0 8 4 0 35

The overall classification rate is 0.6331.

Random Forest

Use the randomForest package with the default settings:

tallform <- as.formula(paste("type ~",tpredform))
fmod <- randomForest(tallform, data=trains)
pv <- predict(fmod, newdata=tests)
cm <- xtabs( ~ pv + tests$type)
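
Random forests also report variable importance, which could be examined to see which of the Richards features carry most of the signal; illustrative only, not part of the baseline:

# Variable importance from the fitted forest (illustrative)
importance(fmod)
varImpPlot(fmod)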

This table shows the predicted type in the rows by the actual type in the columns.

print(xtable(cm,digits=0,caption="Actual"),type="html",caption.placement="top")
Actual
agn blazar cv downes flare nv rrlyr sn
agn 21 1 0 5 0 8 0 2
blazar 4 16 3 7 1 5 0 4
cv 2 7 82 35 3 11 1 17
downes 0 1 15 16 0 8 6 2
flare 0 0 0 2 9 3 0 1
nv 15 7 18 61 8 496 17 52
rrlyr 0 1 1 5 0 7 78 1
sn 2 6 26 4 3 20 1 113

Same as above but now expressed as a percentage within each column:

print(xtable(round(100*prop.table(cm, 2)),digits=0,caption="Actual"),type="html",caption.placement="top")
Actual
agn blazar cv downes flare nv rrlyr sn
agn 48 3 0 4 0 1 0 1
blazar 9 41 2 5 4 1 0 2
cv 5 18 57 26 12 2 1 9
downes 0 3 10 12 0 1 6 1
flare 0 0 0 1 38 1 0 1
nv 34 18 12 45 33 89 17 27
rrlyr 0 3 1 4 0 1 76 1
sn 5 15 18 3 12 4 1 59

The overall classification rate is 0.6702.

Transients vs. non-variables

LDA

Linear discriminant analysis using the default options.

We produce the cross-classification between predicted and observed class. Note that the default priors are the proportions found in the training set.

ldamod <- lda(vtform, data=trains)
pv <- predict(ldamod, tests)
cm <- xtabs( ~ pv$class + tests$vtype)

This table shows the predicted type in the rows by the actual type in the columns.

print(xtable(cm,digits=0,caption="Actual"),type="html",caption.placement="top")
Actual
nv tr
nv 414 171
tr 144 511

Same as above but now expressed as a percentage within each column:

print(xtable(round(100*prop.table(cm, 2)),digits=0,caption="Actual"),type="html",caption.placement="top")
Actual
nv tr
nv 74 25
tr 26 75

The overall classification rate is 0.746.

Recursive Partitioning

roz <- rpart(vtform, data=trains)
rpart.plot(roz,type=1,extra=1)