[1] "Created: Thu Sep 4 12:22:07 2014"
See compfeatrev5.Rmd for how the revised version of the original data featrev5.rda was prepared for this analysis. See also compfeatnew5.Rmd for how the revised version of the original data featnew5.rda was prepared for this analysis.
source("funcs.R")
require(MASS)
require(nnet)
require(ggplot2)
require(rpart)
require(rpart.plot)
require(xtable)
require(kernlab)
require(randomForest)
load("featrev5.rda")
trainset <- cmd5
load("featnew5.rda")
testset <- cmd5
cmat <- rep(NA,5)
Set up the predictors that we will use throughout. Note that I have transformed some of the variables as seen in the setup.
predform <- "rm.amplitude + rm.beyond1std + rm.fpr20 +rm.fpr35 + rm.fpr50 + rm.fpr80 + log(rm.maxslope) + rm.mad +asylog(rm.medbuf) + rm.pairslope + log(rm.peramp) + log(rm.pdfp) + rm.skew + log(rm.kurtosis)+ rm.std + dublog(rm.rcorbor)"
preds <- c("rm.amplitude","rm.beyond1std","rm.fpr20","rm.fpr35","rm.fpr50","rm.fpr80","rm.maxslope","rm.mad","rm.medbuf","rm.pairslope","rm.peramp","rm.pdfp","rm.skew","rm.kurtosis","rm.std","rm.rcorbor")
tpredform <- paste(preds,collapse="+")
trform <- as.formula(paste("type ~",predform))
Linear Discriminant analysis using the default options.
We produce the cross-classification between predicted and observed class. Note that the default priors are the proportions found in the training set.
ldamod <- lda(formula=trform , data=trainset)
pv <- predict(ldamod, testset)
cm <- xtabs( ~ pv$class + testset$type)
This table shows the predicted type in the rows by the actual type in the columns.
print(xtable(cm,digits=0,caption="Actual"),type="html",caption.placement="top")
agn | blazar | cv | flare | sn | |
---|---|---|---|---|---|
agn | 61 | 6 | 1 | 10 | 6 |
blazar | 3 | 8 | 2 | 1 | 5 |
cv | 1 | 4 | 55 | 3 | 33 |
flare | 1 | 0 | 7 | 18 | 1 |
sn | 87 | 14 | 21 | 13 | 197 |
Same as above but now expressed a percentage within each column:
print(xtable(round(100*prop.table(cm, 2)),digits=0,caption="Actual"),type="html",caption.placement="top")
agn | blazar | cv | flare | sn | |
---|---|---|---|---|---|
agn | 40 | 19 | 1 | 22 | 2 |
blazar | 2 | 25 | 2 | 2 | 2 |
cv | 1 | 12 | 64 | 7 | 14 |
flare | 1 | 0 | 8 | 40 | 0 |
sn | 57 | 44 | 24 | 29 | 81 |
The overall classification rate is 0.6075.
roz <- rpart(trform ,data=trainset)
rpart.plot(roz,type=1,extra=1)
pv <- predict(roz,newdata=testset,type="class")
cm <- xtabs( ~ pv + testset$type)
This table shows the predicted type in the rows by the actual type in the columns.
print(xtable(cm,digits=0,caption="Actual"),type="html",caption.placement="top")
agn | blazar | cv | flare | sn | |
---|---|---|---|---|---|
agn | 89 | 7 | 0 | 2 | 9 |
blazar | 16 | 6 | 1 | 0 | 6 |
cv | 4 | 6 | 51 | 19 | 24 |
flare | 0 | 0 | 3 | 11 | 3 |
sn | 44 | 13 | 31 | 13 | 200 |
Same as above but now expressed a percentage within each column:
print(xtable(round(100*prop.table(cm, 2)),digits=0,caption="Actual"),type="html",caption.placement="top")
agn | blazar | cv | flare | sn | |
---|---|---|---|---|---|
agn | 58 | 22 | 0 | 4 | 4 |
blazar | 10 | 19 | 1 | 0 | 2 |
cv | 3 | 19 | 59 | 42 | 10 |
flare | 0 | 0 | 3 | 24 | 1 |
sn | 29 | 41 | 36 | 29 | 83 |
The overall classification rate is 0.6398.
Use the default choice of setting from the kernlab R package for this:
svmod <- ksvm(trform, data=trainset)
Using automatic sigma estimation (sigest) for RBF or laplace kernel
pv <- predict(svmod, testset)
cm <- xtabs( ~ pv + testset$type)
This table shows the predicted type in the rows by the actual type in the columns.
print(xtable(cm,digits=0,caption="Actual"),type="html",caption.placement="top")
agn | blazar | cv | flare | sn | |
---|---|---|---|---|---|
agn | 119 | 11 | 0 | 3 | 6 |
blazar | 0 | 7 | 1 | 0 | 9 |
cv | 2 | 2 | 57 | 9 | 24 |
flare | 1 | 0 | 1 | 20 | 1 |
sn | 31 | 12 | 27 | 13 | 202 |
Same as above but now expressed a percentage within each column:
print(xtable(round(100*prop.table(cm, 2)),digits=0,caption="Actual"),type="html",caption.placement="top")
agn | blazar | cv | flare | sn | |
---|---|---|---|---|---|
agn | 78 | 34 | 0 | 7 | 2 |
blazar | 0 | 22 | 1 | 0 | 4 |
cv | 1 | 6 | 66 | 20 | 10 |
flare | 1 | 0 | 1 | 44 | 0 |
sn | 20 | 38 | 31 | 29 | 83 |
The overall classification rate is 0.7258.
Use the multinom() function from the nnet R package. Might work better with some scaling.
svmod <- multinom(trform, data=trainset, trace=FALSE, maxit=1000, decay=5e-4)
pv <- predict(svmod, testset)
cm <- xtabs( ~ pv + testset$type)
This table shows the predicted type in the rows by the actual type in the columns.
print(xtable(cm,digits=0,caption="Actual"),type="html",caption.placement="top")
agn | blazar | cv | flare | sn | |
---|---|---|---|---|---|
agn | 113 | 12 | 1 | 3 | 7 |
blazar | 2 | 10 | 1 | 2 | 9 |
cv | 3 | 4 | 55 | 2 | 27 |
flare | 1 | 0 | 2 | 24 | 4 |
sn | 34 | 6 | 27 | 14 | 195 |
Same as above but now expressed a percentage within each column:
print(xtable(round(100*prop.table(cm, 2)),digits=0,caption="Actual"),type="html",caption.placement="top")
agn | blazar | cv | flare | sn | |
---|---|---|---|---|---|
agn | 74 | 38 | 1 | 7 | 3 |
blazar | 1 | 31 | 1 | 4 | 4 |
cv | 2 | 12 | 64 | 4 | 11 |
flare | 1 | 0 | 2 | 53 | 2 |
sn | 22 | 19 | 31 | 31 | 81 |
The overall classification rate is 0.7115.
Use the randomForest package with the default settings:
tallform <- as.formula(paste("type ~",tpredform))
fmod <- randomForest(tallform, data=na.omit(trainset))
pv <- predict(fmod, newdata=na.omit(testset))
cm <- xtabs( ~ pv + na.omit(testset)$type)
This table shows the predicted type in the rows by the actual type in the columns.
print(xtable(cm,digits=0,caption="Actual"),type="html",caption.placement="top")
agn | blazar | cv | flare | sn | |
---|---|---|---|---|---|
agn | 107 | 8 | 0 | 3 | 1 |
blazar | 6 | 6 | 1 | 1 | 12 |
cv | 3 | 3 | 61 | 6 | 24 |
flare | 1 | 0 | 2 | 23 | 0 |
sn | 36 | 15 | 22 | 12 | 205 |
Same as above but now expressed a percentage within each column:
print(xtable(round(100*prop.table(cm, 2)),digits=0,caption="Actual"),type="html",caption.placement="top")
agn | blazar | cv | flare | sn | |
---|---|---|---|---|---|
agn | 70 | 25 | 0 | 7 | 0 |
blazar | 4 | 19 | 1 | 2 | 5 |
cv | 2 | 9 | 71 | 13 | 10 |
flare | 1 | 0 | 2 | 51 | 0 |
sn | 24 | 47 | 26 | 27 | 85 |
The overall classification rate is 0.7204.
Percentage correctly classified:
names(cmat) <- c("LDA","RP","SVM","NN","RF")
round(cmat*100,1)
LDA RP SVM NN RF
60.8 64.0 72.6 71.1 72.0
predform <- "shov + maxdiff + dscore + log(totvar) + log(quadvar) + log(famp) + log(fslope) + log(outl) + gscore + lsd + nudlog(gtvar) + rm.amplitude + rm.beyond1std + rm.fpr20 +rm.fpr35 + rm.fpr50 + rm.fpr80 + log(rm.maxslope) + rm.mad +asylog(rm.medbuf) + rm.pairslope + log(rm.peramp) + log(rm.pdfp) + rm.skew + log(rm.kurtosis)+ rm.std + dublog(rm.rcorbor)"
preds <- c("shov","maxdiff","dscore","totvar","quadvar","famp","fslope","outl","gscore","lsd","gtvar","rm.amplitude","rm.beyond1std","rm.fpr20","rm.fpr35","rm.fpr50","rm.fpr80","rm.maxslope","rm.mad","rm.medbuf","rm.pairslope","rm.peramp","rm.pdfp","rm.skew","rm.kurtosis","rm.std","rm.rcorbor")
tpredform <- paste(preds,collapse="+")
trform <- as.formula(paste("type ~",predform))
Linear Discriminant analysis using the default options.
We produce the cross-classification between predicted and observed class. Note that the default priors are the proportions found in the training set.
ldamod <- lda(formula=trform , data=trainset)
pv <- predict(ldamod, testset)
cm <- xtabs( ~ pv$class + testset$type)
This table shows the predicted type in the rows by the actual type in the columns.
print(xtable(cm,digits=0,caption="Actual"),type="html",caption.placement="top")
agn | blazar | cv | flare | sn | |
---|---|---|---|---|---|
agn | 127 | 13 | 2 | 7 | 13 |
blazar | 3 | 13 | 1 | 0 | 4 |
cv | 2 | 0 | 53 | 2 | 22 |
flare | 1 | 0 | 6 | 22 | 2 |
sn | 20 | 6 | 24 | 14 | 201 |
Same as above but now expressed a percentage within each column:
print(xtable(round(100*prop.table(cm, 2)),digits=0,caption="Actual"),type="html",caption.placement="top")
agn | blazar | cv | flare | sn | |
---|---|---|---|---|---|
agn | 83 | 41 | 2 | 16 | 5 |
blazar | 2 | 41 | 1 | 0 | 2 |
cv | 1 | 0 | 62 | 4 | 9 |
flare | 1 | 0 | 7 | 49 | 1 |
sn | 13 | 19 | 28 | 31 | 83 |
The overall classification rate is 0.7455.
roz <- rpart(trform ,data=trainset)
rpart.plot(roz,type=1,extra=1)
pv <- predict(roz,newdata=testset,type="class")
cm <- xtabs( ~ pv + testset$type)
This table shows the predicted type in the rows by the actual type in the columns.
print(xtable(cm,digits=0,caption="Actual"),type="html",caption.placement="top")
agn | blazar | cv | flare | sn | |
---|---|---|---|---|---|
agn | 93 | 8 | 2 | 8 | 8 |
blazar | 0 | 8 | 1 | 0 | 3 |
cv | 0 | 3 | 37 | 0 | 8 |
flare | 9 | 0 | 5 | 21 | 6 |
sn | 51 | 13 | 41 | 16 | 217 |
Same as above but now expressed a percentage within each column:
print(xtable(round(100*prop.table(cm, 2)),digits=0,caption="Actual"),type="html",caption.placement="top")
agn | blazar | cv | flare | sn | |
---|---|---|---|---|---|
agn | 61 | 25 | 2 | 18 | 3 |
blazar | 0 | 25 | 1 | 0 | 1 |
cv | 0 | 9 | 43 | 0 | 3 |
flare | 6 | 0 | 6 | 47 | 2 |
sn | 33 | 41 | 48 | 36 | 90 |
The overall classification rate is 0.6738.
Use the default choice of setting from the kernlab R package for this:
svmod <- ksvm(trform, data=trainset)
Using automatic sigma estimation (sigest) for RBF or laplace kernel
pv <- predict(svmod, testset)
cm <- xtabs( ~ pv + testset$type)
This table shows the predicted type in the rows by the actual type in the columns.
print(xtable(cm,digits=0,caption="Actual"),type="html",caption.placement="top")
agn | blazar | cv | flare | sn | |
---|---|---|---|---|---|
agn | 125 | 12 | 0 | 3 | 8 |
blazar | 2 | 13 | 1 | 0 | 3 |
cv | 2 | 0 | 62 | 2 | 19 |
flare | 1 | 0 | 3 | 26 | 2 |
sn | 23 | 7 | 20 | 14 | 210 |
Same as above but now expressed a percentage within each column:
print(xtable(round(100*prop.table(cm, 2)),digits=0,caption="Actual"),type="html",caption.placement="top")
agn | blazar | cv | flare | sn | |
---|---|---|---|---|---|
agn | 82 | 38 | 0 | 7 | 3 |
blazar | 1 | 41 | 1 | 0 | 1 |
cv | 1 | 0 | 72 | 4 | 8 |
flare | 1 | 0 | 3 | 58 | 1 |
sn | 15 | 22 | 23 | 31 | 87 |
The overall classification rate is 0.7814.
Use the multinom() function from the nnet R package. Might work better with some scaling.
svmod <- multinom(trform, data=trainset, trace=FALSE, maxit=1000, decay=5e-4)
pv <- predict(svmod, testset)
cm <- xtabs( ~ pv + testset$type)
This table shows the predicted type in the rows by the actual type in the columns.
print(xtable(cm,digits=0,caption="Actual"),type="html",caption.placement="top")
agn | blazar | cv | flare | sn | |
---|---|---|---|---|---|
agn | 134 | 13 | 0 | 4 | 8 |
blazar | 3 | 13 | 0 | 0 | 3 |
cv | 0 | 1 | 51 | 0 | 23 |
flare | 3 | 0 | 7 | 30 | 5 |
sn | 13 | 5 | 28 | 11 | 203 |
Same as above but now expressed a percentage within each column:
print(xtable(round(100*prop.table(cm, 2)),digits=0,caption="Actual"),type="html",caption.placement="top")
agn | blazar | cv | flare | sn | |
---|---|---|---|---|---|
agn | 88 | 41 | 0 | 9 | 3 |
blazar | 2 | 41 | 0 | 0 | 1 |
cv | 0 | 3 | 59 | 0 | 10 |
flare | 2 | 0 | 8 | 67 | 2 |
sn | 8 | 16 | 33 | 24 | 84 |
The overall classification rate is 0.7724.
Use the randomForest package with the default settings:
tallform <- as.formula(paste("type ~",tpredform))
fmod <- randomForest(tallform, data=na.omit(trainset))
pv <- predict(fmod, newdata=na.omit(testset))
cm <- xtabs( ~ pv + na.omit(testset)$type)
This table shows the predicted type in the rows by the actual type in the columns.
print(xtable(cm,digits=0,caption="Actual"),type="html",caption.placement="top")
agn | blazar | cv | flare | sn | |
---|---|---|---|---|---|
agn | 123 | 7 | 0 | 2 | 3 |
blazar | 4 | 12 | 2 | 1 | 3 |
cv | 2 | 2 | 58 | 1 | 9 |
flare | 1 | 0 | 4 | 29 | 3 |
sn | 23 | 11 | 22 | 12 | 224 |
Same as above but now expressed a percentage within each column:
print(xtable(round(100*prop.table(cm, 2)),digits=0,caption="Actual"),type="html",caption.placement="top")
agn | blazar | cv | flare | sn | |
---|---|---|---|---|---|
agn | 80 | 22 | 0 | 4 | 1 |
blazar | 3 | 38 | 2 | 2 | 1 |
cv | 1 | 6 | 67 | 2 | 4 |
flare | 1 | 0 | 5 | 64 | 1 |
sn | 15 | 34 | 26 | 27 | 93 |
The overall classification rate is 0.7993.
Percentage correctly classified:
names(cmat) <- c("LDA","RP","SVM","NN","RF")
round(cmat*100,1)
LDA RP SVM NN RF
74.6 67.4 78.1 77.2 79.9
cmatc <- rbind(cmatRICH,cmatGPR)
dimnames(cmatc) <- list(c("Richards","Ours+Rich"), c("LDA","RP","SVM","NN","RF"))
round(cmatc*100,1)
LDA RP SVM NN RF
Richards 60.8 64.0 72.6 71.1 72.0
Ours+Rich 74.6 67.4 78.1 77.2 79.9