Comparison of classification methods on the five-class data

Training and test sets
Richards set predictors
Richards+Our set of measures
Comparison of performance

[1] "Created: Thu Sep  4 12:22:07 2014"

See compfeatrev5.Rmd and compfeatnew5.Rmd for how the revised versions of the original data, featrev5.rda and featnew5.rda respectively, were prepared for this analysis.

source("funcs.R")
require(MASS)
require(nnet)
require(ggplot2)
require(rpart)
require(rpart.plot)
require(xtable)
require(kernlab)
require(randomForest)

Training and test sets

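load("featrev5.rda")   # provides the object cmd5
trainset <- cmd5
load("featnew5.rda")   # overwrites cmd5 with the new data
testset <- cmd5
cmat <- rep(NA,5)      # will hold the overall classification rate of each of the five methods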

Richards set predictors

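Set up the predictors used throughout this section. Some variables are transformed in the model formula below with log(), asylog() and dublog(); the last two are not base R functions and are presumably defined in funcs.R. The untransformed variable names are kept in preds for use by the random forest.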

predform <- "rm.amplitude  + rm.beyond1std + rm.fpr20 +rm.fpr35 + rm.fpr50 + rm.fpr80 + log(rm.maxslope) + rm.mad +asylog(rm.medbuf) + rm.pairslope + log(rm.peramp) + log(rm.pdfp) + rm.skew + log(rm.kurtosis)+ rm.std + dublog(rm.rcorbor)"
preds <- c("rm.amplitude","rm.beyond1std","rm.fpr20","rm.fpr35","rm.fpr50","rm.fpr80","rm.maxslope","rm.mad","rm.medbuf","rm.pairslope","rm.peramp","rm.pdfp","rm.skew","rm.kurtosis","rm.std","rm.rcorbor")
tpredform <- paste(preds,collapse="+")
trform <- as.formula(paste("type ~",predform))
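Since the transformed formula and the plain predictor list are typed out separately, it is worth checking that they name the same variables. A minimal sketch, using a shortened three-variable version of the strings above (the shortening is for illustration only):

```r
# Shortened versions of predform and preds, for illustration only
predform <- "rm.amplitude + log(rm.maxslope) + rm.skew"
preds <- c("rm.amplitude", "rm.maxslope", "rm.skew")

trform <- as.formula(paste("type ~", predform))
tallform <- as.formula(paste("type ~", paste(preds, collapse = "+")))

# all.vars() returns variable names only, ignoring functions such as log()
same_vars <- setequal(all.vars(trform), all.vars(tallform))
same_vars
```

The same check applies to the full strings: a FALSE result would mean a variable was added to one list but not the other.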

LDA

Linear discriminant analysis using the default options.

We cross-classify the predicted class against the observed class on the test set. Note that the default priors are the class proportions found in the training set.
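To illustrate the remark about priors, here is a small sketch on the built-in iris data (not the data used in this analysis). It shows that lda() takes its default priors from the training proportions and that equal priors can be requested through the prior argument instead.

```r
library(MASS)

# Unbalanced training subset of the built-in iris data (illustration only)
train <- iris[1:120, ]          # 50 setosa, 50 versicolor, 20 virginica

fit_default <- lda(Species ~ ., data = train)
fit_default$prior               # equals the training proportions

# Equal priors can be requested explicitly
fit_equal <- lda(Species ~ ., data = train, prior = rep(1/3, 3))
fit_equal$prior
```

With unbalanced classes such as ours, the choice of prior shifts predictions towards or away from the larger classes, so it is worth stating which one was used.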

ldamod <- lda(formula=trform , data=trainset)
pv <- predict(ldamod, testset)
cm <- xtabs( ~ pv$class + testset$type)

This table shows the predicted type in the rows by the actual type in the columns.

print(xtable(cm,digits=0,caption="Actual"),type="html",caption.placement="top")

Actual
	agn	blazar	cv	flare	sn
agn	61	6	1	10	6
blazar	3	8	2	1	5
cv	1	4	55	3	33
flare	1	0	7	18	1
sn	87	14	21	13	197

Same as above, but now expressed as a percentage within each column:

print(xtable(round(100*prop.table(cm, 2)),digits=0,caption="Actual"),type="html",caption.placement="top")

Actual
	agn	blazar	cv	flare	sn
agn	40	19	1	22	2
blazar	2	25	2	2	2
cv	1	12	64	7	14
flare	1	0	8	40	0
sn	57	44	24	29	81

The overall classification rate is 0.6075.
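The code that computes the overall classification rate is not shown in this report. A minimal sketch of one way to obtain it, assuming it is the proportion of test cases on the diagonal of the cross-classification; the matrix is typed in from the LDA table above and the storage in cmat[1] is my assumption about how the summary vector gets filled:

```r
classes <- c("agn", "blazar", "cv", "flare", "sn")
# LDA confusion matrix copied from the table above: rows = predicted, columns = actual
cm <- matrix(c(61,  6,  1, 10,   6,
                3,  8,  2,  1,   5,
                1,  4, 55,  3,  33,
                1,  0,  7, 18,   1,
               87, 14, 21, 13, 197),
             nrow = 5, byrow = TRUE,
             dimnames = list(predicted = classes, actual = classes))

# Overall rate: correctly classified cases (the diagonal) over all test cases
rate <- sum(diag(cm)) / sum(cm)
round(rate, 4)                   # 0.6075

# Per-class rate: the diagonal of the column-percentage table
round(100 * diag(prop.table(cm, 2)))

# Store the result in the first slot of the summary vector (assumed usage)
cmat <- rep(NA, 5)
cmat[1] <- rate
```

The per-class rates make clear that the overall figure is dominated by the large sn and agn classes.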

Recursive Partitioning

roz <- rpart(trform ,data=trainset)
rpart.plot(roz,type=1,extra=1)

pv <- predict(roz,newdata=testset,type="class")
cm <- xtabs( ~ pv + testset$type)

This table shows the predicted type in the rows by the actual type in the columns.

print(xtable(cm,digits=0,caption="Actual"),type="html",caption.placement="top")

Actual
	agn	blazar	cv	flare	sn
agn	89	7	0	2	9
blazar	16	6	1	0	6
cv	4	6	51	19	24
flare	0	0	3	11	3
sn	44	13	31	13	200

Same as above, but now expressed as a percentage within each column:

print(xtable(round(100*prop.table(cm, 2)),digits=0,caption="Actual"),type="html",caption.placement="top")

Actual
	agn	blazar	cv	flare	sn
agn	58	22	0	4	4
blazar	10	19	1	0	2
cv	3	19	59	42	10
flare	0	0	3	24	1
sn	29	41	36	29	83

The overall classification rate is 0.6398.

Support Vector Machines

Use the default settings of ksvm() from the kernlab R package:

svmod <- ksvm(trform, data=trainset)

Using automatic sigma estimation (sigest) for RBF or laplace kernel

pv <- predict(svmod, testset)
cm <- xtabs( ~ pv + testset$type)

This table shows the predicted type in the rows by the actual type in the columns.

print(xtable(cm,digits=0,caption="Actual"),type="html",caption.placement="top")

Actual
	agn	blazar	cv	flare	sn
agn	119	11	0	3	6
blazar	0	7	1	0	9
cv	2	2	57	9	24
flare	1	0	1	20	1
sn	31	12	27	13	202

Same as above, but now expressed as a percentage within each column:

print(xtable(round(100*prop.table(cm, 2)),digits=0,caption="Actual"),type="html",caption.placement="top")

Actual
	agn	blazar	cv	flare	sn
agn	78	34	0	7	2
blazar	0	22	1	0	4
cv	1	6	66	20	10
flare	1	0	1	44	0
sn	20	38	31	29	83

The overall classification rate is 0.7258.
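The message printed by ksvm() above indicates that the width sigma of the RBF kernel was estimated from the data rather than set by us. To my understanding the estimate relies on a random subsample, so repeated fits can differ slightly. A hedged sketch on the built-in iris data (not our data) of how the chosen value could be read off and then fixed for a repeatable fit:

```r
library(kernlab)

set.seed(1)                                  # sigest() samples the data
fit <- ksvm(Species ~ ., data = iris)        # defaults: RBF kernel, automatic sigma
sigma_auto <- kpar(kernelf(fit))$sigma       # the automatically chosen sigma

# Refit with that sigma fixed explicitly, so no estimation takes place
fit_fixed <- ksvm(Species ~ ., data = iris, kernel = "rbfdot",
                  kpar = list(sigma = sigma_auto))
sigma_fixed <- kpar(kernelf(fit_fixed))$sigma
```

Recording sigma alongside the classification rate would make the SVM row of the summary reproducible.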

Neural Net

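Use the multinom() function from the nnet R package, which fits a multinomial log-linear model via a neural network without a hidden layer. It might work better if the predictors were scaled first.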

nnmod <- multinom(trform, data=trainset, trace=FALSE, maxit=1000, decay=5e-4)
pv <- predict(nnmod, testset)
cm <- xtabs( ~ pv + testset$type)

This table shows the predicted type in the rows by the actual type in the columns.

print(xtable(cm,digits=0,caption="Actual"),type="html",caption.placement="top")

Actual
	agn	blazar	cv	flare	sn
agn	113	12	1	3	7
blazar	2	10	1	2	9
cv	3	4	55	2	27
flare	1	0	2	24	4
sn	34	6	27	14	195

Same as above, but now expressed as a percentage within each column:

print(xtable(round(100*prop.table(cm, 2)),digits=0,caption="Actual"),type="html",caption.placement="top")

Actual
	agn	blazar	cv	flare	sn
agn	74	38	1	7	3
blazar	1	31	1	4	4
cv	2	12	64	4	11
flare	1	0	2	53	2
sn	22	19	31	31	81

The overall classification rate is 0.7115.
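The scaling mentioned at the start of this section was not tried here. A minimal sketch of how it might be done, on the built-in iris data rather than ours: the centring and scaling constants come from the training set only and are then applied unchanged to the test set. Note that with log() terms in the formula, as in trform, the scaling would have to be applied after the transformation, since standardised values can be negative.

```r
library(nnet)

set.seed(1)
idx <- sample(nrow(iris), 100)               # illustrative train/test split
train <- iris[idx, ]
test <- iris[-idx, ]

# Centring and scaling constants are taken from the training set only
num <- sapply(train, is.numeric)
ctr <- colMeans(train[, num])
sds <- apply(train[, num], 2, sd)

train_s <- train
test_s <- test
train_s[, num] <- scale(train[, num], center = ctr, scale = sds)
test_s[, num] <- scale(test[, num], center = ctr, scale = sds)

fit <- multinom(Species ~ ., data = train_s, trace = FALSE,
                maxit = 1000, decay = 5e-4)
pv <- predict(fit, test_s)
mean(pv == test_s$Species)                   # test-set classification rate
```

Scaling matters here because the decay penalty acts on the weights, whose size depends on the units of each predictor.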

Random Forest

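Use the randomForest package with the default settings. The untransformed predictors are used here, and rows with missing values are dropped from both sets with na.omit():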

tallform <- as.formula(paste("type ~",tpredform))
fmod <- randomForest(tallform, data=na.omit(trainset))
pv <- predict(fmod, newdata=na.omit(testset))
cm <- xtabs( ~ pv + na.omit(testset)$type)

This table shows the predicted type in the rows by the actual type in the columns.

print(xtable(cm,digits=0,caption="Actual"),type="html",caption.placement="top")

Actual
	agn	blazar	cv	flare	sn
agn	107	8	0	3	1
blazar	6	6	1	1	12
cv	3	3	61	6	24
flare	1	0	2	23	0
sn	36	15	22	12	205

Same as above, but now expressed as a percentage within each column:

print(xtable(round(100*prop.table(cm, 2)),digits=0,caption="Actual"),type="html",caption.placement="top")

Actual
	agn	blazar	cv	flare	sn
agn	70	25	0	7	0
blazar	4	19	1	2	5
cv	2	9	71	13	10
flare	1	0	2	51	0
sn	24	47	26	27	85

The overall classification rate is 0.7204.
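One detail of the random forest code above deserves care: na.omit() applied to a whole data frame drops a row if any column is missing, including columns that are not predictors. A small sketch with a hypothetical toy data frame shows the difference between that and checking only the model variables:

```r
# Toy data (hypothetical): 'unused' is not a model variable
df <- data.frame(type = factor(c("a", "b", "a", "b")),
                 x = c(1, NA, 3, 4),
                 unused = c(NA, 1, 1, 1))

n_all_cols <- nrow(na.omit(df))              # drops rows with an NA in any column

# Restrict the check to the variables that appear in the formula
vars <- all.vars(type ~ x)
keep <- complete.cases(df[, vars])
n_model_cols <- nrow(df[keep, ])

c(n_all_cols, n_model_cols)
```

Whichever filter is used, the predictions and the observed types must come from the same filtered rows, as is done above by applying na.omit(testset) in both places.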

Summary of Classification Rate Performance

Percentage correctly classified:

names(cmat) <- c("LDA","RP","SVM","NN","RF")
round(cmat*100,1)

 LDA   RP  SVM   NN   RF 
60.8 64.0 72.6 71.1 72.0

Richards+Our set of measures

predform <- "shov + maxdiff + dscore + log(totvar) + log(quadvar) + log(famp) + log(fslope) + log(outl) + gscore + lsd + nudlog(gtvar) + rm.amplitude  + rm.beyond1std + rm.fpr20 +rm.fpr35 + rm.fpr50 + rm.fpr80 + log(rm.maxslope) + rm.mad +asylog(rm.medbuf) + rm.pairslope + log(rm.peramp) + log(rm.pdfp) + rm.skew + log(rm.kurtosis)+ rm.std + dublog(rm.rcorbor)"
preds <- c("shov","maxdiff","dscore","totvar","quadvar","famp","fslope","outl","gscore","lsd","gtvar","rm.amplitude","rm.beyond1std","rm.fpr20","rm.fpr35","rm.fpr50","rm.fpr80","rm.maxslope","rm.mad","rm.medbuf","rm.pairslope","rm.peramp","rm.pdfp","rm.skew","rm.kurtosis","rm.std","rm.rcorbor")
tpredform <- paste(preds,collapse="+")
trform <- as.formula(paste("type ~",predform))

LDA

Linear Discriminant analysis using the default options.

We produce the cross-classification between predicted and observed class. Note that the default priors are the proportions found in the training set.

ldamod <- lda(formula=trform , data=trainset)
pv <- predict(ldamod, testset)
cm <- xtabs( ~ pv$class + testset$type)

This table shows the predicted type in the rows by the actual type in the columns.

print(xtable(cm,digits=0,caption="Actual"),type="html",caption.placement="top")

Actual
	agn	blazar	cv	flare	sn
agn	127	13	2	7	13
blazar	3	13	1	0	4
cv	2	0	53	2	22
flare	1	0	6	22	2
sn	20	6	24	14	201

Same as above, but now expressed as a percentage within each column:

print(xtable(round(100*prop.table(cm, 2)),digits=0,caption="Actual"),type="html",caption.placement="top")

Actual
	agn	blazar	cv	flare	sn
agn	83	41	2	16	5
blazar	2	41	1	0	2
cv	1	0	62	4	9
flare	1	0	7	49	1
sn	13	19	28	31	83

The overall classification rate is 0.7455.

Recursive Partitioning

roz <- rpart(trform ,data=trainset)
rpart.plot(roz,type=1,extra=1)

pv <- predict(roz,newdata=testset,type="class")
cm <- xtabs( ~ pv + testset$type)

This table shows the predicted type in the rows by the actual type in the columns.

print(xtable(cm,digits=0,caption="Actual"),type="html",caption.placement="top")

Actual
	agn	blazar	cv	flare	sn
agn	93	8	2	8	8
blazar	0	8	1	0	3
cv	0	3	37	0	8
flare	9	0	5	21	6
sn	51	13	41	16	217

Same as above, but now expressed as a percentage within each column:

print(xtable(round(100*prop.table(cm, 2)),digits=0,caption="Actual"),type="html",caption.placement="top")

Actual
	agn	blazar	cv	flare	sn
agn	61	25	2	18	3
blazar	0	25	1	0	1
cv	0	9	43	0	3
flare	6	0	6	47	2
sn	33	41	48	36	90

The overall classification rate is 0.6738.

Support Vector Machines

Use the default settings of ksvm() from the kernlab R package:

svmod <- ksvm(trform, data=trainset)

Using automatic sigma estimation (sigest) for RBF or laplace kernel

pv <- predict(svmod, testset)
cm <- xtabs( ~ pv + testset$type)

This table shows the predicted type in the rows by the actual type in the columns.

print(xtable(cm,digits=0,caption="Actual"),type="html",caption.placement="top")

Actual
	agn	blazar	cv	flare	sn
agn	125	12	0	3	8
blazar	2	13	1	0	3
cv	2	0	62	2	19
flare	1	0	3	26	2
sn	23	7	20	14	210

Same as above, but now expressed as a percentage within each column:

print(xtable(round(100*prop.table(cm, 2)),digits=0,caption="Actual"),type="html",caption.placement="top")

Actual
	agn	blazar	cv	flare	sn
agn	82	38	0	7	3
blazar	1	41	1	0	1
cv	1	0	72	4	8
flare	1	0	3	58	1
sn	15	22	23	31	87

The overall classification rate is 0.7814.

Neural Net

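Use the multinom() function from the nnet R package, which fits a multinomial log-linear model via a neural network without a hidden layer. It might work better if the predictors were scaled first.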

nnmod <- multinom(trform, data=trainset, trace=FALSE, maxit=1000, decay=5e-4)
pv <- predict(nnmod, testset)
cm <- xtabs( ~ pv + testset$type)

This table shows the predicted type in the rows by the actual type in the columns.

print(xtable(cm,digits=0,caption="Actual"),type="html",caption.placement="top")

Actual
	agn	blazar	cv	flare	sn
agn	134	13	0	4	8
blazar	3	13	0	0	3
cv	0	1	51	0	23
flare	3	0	7	30	5
sn	13	5	28	11	203

Same as above, but now expressed as a percentage within each column:

print(xtable(round(100*prop.table(cm, 2)),digits=0,caption="Actual"),type="html",caption.placement="top")

Actual
	agn	blazar	cv	flare	sn
agn	88	41	0	9	3
blazar	2	41	0	0	1
cv	0	3	59	0	10
flare	2	0	8	67	2
sn	8	16	33	24	84

The overall classification rate is 0.7724.

Random Forest

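Use the randomForest package with the default settings. The untransformed predictors are used here, and rows with missing values are dropped from both sets with na.omit():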

tallform <- as.formula(paste("type ~",tpredform))
fmod <- randomForest(tallform, data=na.omit(trainset))
pv <- predict(fmod, newdata=na.omit(testset))
cm <- xtabs( ~ pv + na.omit(testset)$type)

This table shows the predicted type in the rows by the actual type in the columns.

print(xtable(cm,digits=0,caption="Actual"),type="html",caption.placement="top")

Actual
	agn	blazar	cv	flare	sn
agn	123	7	0	2	3
blazar	4	12	2	1	3
cv	2	2	58	1	9
flare	1	0	4	29	3
sn	23	11	22	12	224

Same as above, but now expressed as a percentage within each column:

print(xtable(round(100*prop.table(cm, 2)),digits=0,caption="Actual"),type="html",caption.placement="top")

Actual
	agn	blazar	cv	flare	sn
agn	80	22	0	4	1
blazar	3	38	2	2	1
cv	1	6	67	2	4
flare	1	0	5	64	1
sn	15	34	26	27	93

The overall classification rate is 0.7993.

Summary of Classification Rate Performance

Percentage correctly classified:

names(cmat) <- c("LDA","RP","SVM","NN","RF")
round(cmat*100,1)

 LDA   RP  SVM   NN   RF 
74.6 67.4 78.1 77.2 79.9

Comparison of performance

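# cmatRICH and cmatGPR are not created in the code shown above; they are assumed
# to be copies of cmat saved after each predictor set was evaluated
cmatc <- rbind(cmatRICH,cmatGPR)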
dimnames(cmatc) <- list(c("Richards","Ours+Rich"),  c("LDA","RP","SVM","NN","RF"))
round(cmatc*100,1)

           LDA   RP  SVM   NN   RF
Richards  60.8 64.0 72.6 71.1 72.0
Ours+Rich 74.6 67.4 78.1 77.2 79.9
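Because the code that fills cmatRICH and cmatGPR is not shown, the sketch below rebuilds the comparison from the confusion matrices printed earlier. The counts are the diagonal sums of those tables and 558 is the total of each table; the gain row is an addition of mine, not part of the original output.

```r
methods <- c("LDA", "RP", "SVM", "NN", "RF")
n_test <- 558   # total count in each confusion matrix above

# Correctly classified counts: the diagonal sums of the confusion matrices above
cmatRICH <- c(339, 357, 405, 397, 402) / n_test
cmatGPR  <- c(416, 376, 436, 431, 446) / n_test

cmatc <- rbind(cmatRICH, cmatGPR)
dimnames(cmatc) <- list(c("Richards", "Ours+Rich"), methods)
round(cmatc * 100, 1)

# Gain in percentage points from adding our measures
gain <- round(100 * (cmatc["Ours+Rich", ] - cmatc["Richards", ]), 1)
gain
```

Every method improves when our measures are added: LDA gains the most (13.8 points) and recursive partitioning the least (3.4 points).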

Julian Faraway