[1] "Created: Wed Sep 3 16:26:14 2014"
We use only the Random Forest method. We eliminate features sequentially to see whether prediction accuracy can be improved.
As the name suggests, the Random Forest generates trees randomly, so repeating these calculations will probably give a slightly different result. We could avoid this reproducibility issue by setting the random number seed or (better) by increasing the number of trees generated, at the cost of a substantially longer computation. However, it may be better to accept that this procedure does not produce an exact, definitive ordering of the features: close neighbours in the ranking may perform about as well as each other, and with a slightly different dataset the ordering might come out slightly differently. Nevertheless, we can be confident that the model-based features are generally superior to the sample-based ones. So when we develop new features for new problems, we will tend to do better with a modelling rather than a sample-based approach.
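To illustrate the reproducibility point, here is a minimal base-R sketch (the seed value is arbitrary): fixing the seed makes a random computation repeat exactly, and the same `set.seed()` call before a `randomForest()` fit would do the same for the forest.

```r
# Fixing the RNG seed makes a random computation exactly repeatable.
set.seed(2014)
a <- sample(100, 10)
set.seed(2014)
b <- sample(100, 10)
identical(a, b)   # TRUE
# The same idea applies before a randomForest() call; alternatively a
# larger ntree damps (but does not remove) the run-to-run variation.
```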
See featlcdb.Rmd for how the data feat.rda was prepared for this analysis.
load("feat.rda")
source("funcs.R")
require(randomForest)
require(ggplot2)
Set up the predictors that we will use throughout. Note that monotone transformations of the variables make no difference to a partitioning method. I have not used data_num, as this is the number of observations in the lightcurve file (it might be considered cheating to use this).
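A minimal sketch of the monotone-invariance point (made-up numbers): any split of the form x > t corresponds exactly to log(x) > log(t), so a tree built on a monotone transform of a variable partitions the data identically.

```r
# A split at threshold t on x induces the same partition as a split
# at log(t) on log(x), because log is strictly increasing.
x <- c(0.5, 1.2, 3.4, 7.8, 20.1)
t <- 3
identical(x > t, log(x) > log(t))   # TRUE
```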
I have also excluded the location variables (ra, dec, longitude, latitude) because it is difficult to use them with the non-variable objects, and it is questionable whether using them makes sense anyway.
preds <- c("shov","maxdiff","dscore","totvar","quadvar","famp","fslope","outl","gscore","lsd","gtvar","rm.amplitude","rm.beyond1std","rm.fpr20","rm.fpr35","rm.fpr50","rm.fpr80","rm.maxslope","rm.mad","rm.medbuf","rm.pairslope","rm.peramp","rm.pdfp","rm.skew","rm.kurtosis","rm.std","rm.rcorbor")
npreds <- length(preds)
Apply random forest to all predictors and compute the training and testing classification rates:
tpredform <- paste(preds,collapse="+")
tallform <- as.formula(paste("type ~",tpredform))
fmod <- randomForest(tallform, data=trains)
pv <- predict(fmod)
cm <- xtabs( ~ pv + trains$type)
nomrate <- sum(diag(cm))/sum(cm)
pv <- predict(fmod, newdata=tests)
cm <- xtabs( ~ pv + tests$type)
truerate <- sum(diag(cm))/sum(cm)
rr <- c(nomrate,truerate)
names(rr) <- c("train","test")
rr
train test
0.8121 0.8024
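A toy base-R example (invented labels) of how the classification rate is read off the confusion matrix built by xtabs(): correct predictions sit on the diagonal.

```r
# Toy confusion matrix: 3 of the 5 predictions sit on the diagonal.
pv    <- factor(c("a","a","b","b","b"))   # predicted classes
truth <- factor(c("a","b","b","b","a"))   # true classes
cm <- xtabs(~ pv + truth)
sum(diag(cm)) / sum(cm)   # 0.6
```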
Look at the importance of each predictor
importance(fmod)
MeanDecreaseGini
shov 94.29
maxdiff 67.61
dscore 98.67
totvar 121.73
quadvar 157.22
famp 174.92
fslope 133.99
outl 59.37
gscore 121.54
lsd 66.81
gtvar 99.16
rm.amplitude 63.30
rm.beyond1std 27.19
rm.fpr20 29.83
rm.fpr35 29.75
rm.fpr50 29.87
rm.fpr80 28.54
rm.maxslope 47.59
rm.mad 33.49
rm.medbuf 29.88
rm.pairslope 21.59
rm.peramp 46.28
rm.pdfp 39.42
rm.skew 46.81
rm.kurtosis 37.74
rm.std 80.26
rm.rcorbor 10.23
Drop predictors sequentially, removing the least important at each stage and recording the classification rates:
crm <- matrix(NA,npreds,2)
vpreds <- preds
vnames <- rep("",npreds)
for(i in 1:(npreds-1)){
vimp <- importance(fmod)   # importance scores from the previous fit
vari <- which.min(vimp)    # index of the least important predictor
vpreds <- vpreds[-vari]    # drop it from the active set
vnames[i] <- rownames(vimp)[vari]
tpredform <- paste(vpreds,collapse="+")
tallform <- as.formula(paste("type ~",tpredform))
fmod <- randomForest(tallform, data=trains)
pv <- predict(fmod)        # out-of-bag predictions on the training set
cm <- xtabs( ~ pv + trains$type)
nomrate <- sum(diag(cm))/sum(cm)
pv <- predict(fmod, newdata=tests)
cm <- xtabs( ~ pv + tests$type)
truerate <- sum(diag(cm))/sum(cm)
crm[i,] <- c(nomrate,truerate)
}
nomrate <- mean(trains$type == 'nv')  # baseline: always predict the majority class, nv
truerate <- mean(tests$type == 'nv')
crm[npreds,] <- c(nomrate,truerate)
vnames[npreds] <- vpreds
dimnames(crm) <- list(vnames,c("train","test"))
crm
train test
rm.rcorbor 0.8105 0.8081
rm.pairslope 0.8133 0.8137
rm.beyond1std 0.8117 0.8056
rm.fpr20 0.8117 0.8089
rm.fpr80 0.8109 0.8065
rm.fpr35 0.8129 0.8073
rm.medbuf 0.8113 0.8105
rm.mad 0.8125 0.8073
rm.fpr50 0.8133 0.8089
rm.kurtosis 0.8198 0.8129
rm.pdfp 0.8121 0.8113
rm.peramp 0.8153 0.8121
rm.maxslope 0.8141 0.8032
rm.skew 0.8073 0.8032
outl 0.8065 0.8048
rm.amplitude 0.7992 0.8000
lsd 0.7972 0.7968
maxdiff 0.7931 0.7968
dscore 0.7903 0.7911
rm.std 0.7867 0.7823
gtvar 0.7794 0.7685
quadvar 0.7798 0.7629
gscore 0.7540 0.7484
shov 0.6815 0.6871
fslope 0.6306 0.6524
totvar 0.4940 0.4806
famp 0.4706 0.4500
Plot the rates (solid = train, dashed = test):
crm <- data.frame(crm)
p <- ggplot(data=crm, aes(x=1:nrow(crm), y=train)) + ylab("Classification rate") + xlab("")
p <- p + geom_line()
p <- p + scale_x_continuous(breaks=1:nrow(crm),labels=row.names(crm))
p <- p + theme(axis.text.x=element_text(angle=90, hjust=1, vjust=0.5),axis.text = element_text(size = rel(1.2)), axis.title = element_text(size = rel(1.2)))
p <- p + geom_line(aes(y=test),linetype="dashed")
p
Now collapse all the variable types into a single class to consider the binary problem of separating variable from non-variable objects:
trains$vtype <- ifelse(trains$type == "nv","nv","tr")
trains$vtype <- factor(trains$vtype)
tests$vtype <- ifelse(tests$type == "nv","nv","tr")
tests$vtype <- factor(tests$vtype)
tpredform <- paste(preds,collapse="+")
tallform <- as.formula(paste("vtype ~",tpredform))
fmod <- randomForest(tallform, data=trains)
pv <- predict(fmod)
cm <- xtabs( ~ pv + trains$vtype)
nomrate <- sum(diag(cm))/sum(cm)
pv <- predict(fmod, newdata=tests)
cm <- xtabs( ~ pv + tests$vtype)
truerate <- sum(diag(cm))/sum(cm)
rr <- c(nomrate,truerate)
names(rr) <- c("train","test")
rr
train test
0.9282 0.9194
Look at the importance of each predictor
importance(fmod)
MeanDecreaseGini
shov 29.288
maxdiff 16.330
dscore 95.589
totvar 65.896
quadvar 186.157
famp 207.285
fslope 149.558
outl 22.457
gscore 138.073
lsd 42.311
gtvar 42.732
rm.amplitude 27.800
rm.beyond1std 10.821
rm.fpr20 11.571
rm.fpr35 11.521
rm.fpr50 12.636
rm.fpr80 13.075
rm.maxslope 21.738
rm.mad 13.150
rm.medbuf 10.378
rm.pairslope 7.825
rm.peramp 17.798
rm.pdfp 18.634
rm.skew 15.014
rm.kurtosis 16.761
rm.std 28.604
rm.rcorbor 2.108
Drop predictors sequentially, removing the least important at each stage and recording the classification rates:
crm <- matrix(NA,npreds,2)
vpreds <- preds
vnames <- rep("",npreds)
for(i in 1:(npreds-1)){
vimp <- importance(fmod)
vari <- which.min(vimp)
vpreds <- vpreds[-vari]
vnames[i] <- rownames(vimp)[vari]
tpredform <- paste(vpreds,collapse="+")
tallform <- as.formula(paste("vtype ~",tpredform))
fmod <- randomForest(tallform, data=trains)
pv <- predict(fmod)
cm <- xtabs( ~ pv + trains$vtype)
nomrate <- sum(diag(cm))/sum(cm)
pv <- predict(fmod, newdata=tests)
cm <- xtabs( ~ pv + tests$vtype)
truerate <- sum(diag(cm))/sum(cm)
crm[i,] <- c(nomrate,truerate)
}
nomrate <- mean(trains$vtype == 'nv')  # baseline: always predict the majority class, nv
truerate <- mean(tests$vtype == 'nv')
crm[npreds,] <- c(nomrate,truerate)
vnames[npreds] <- vpreds
dimnames(crm) <- list(vnames,c("train","test"))
crm
train test
rm.rcorbor 0.9266 0.9210
rm.medbuf 0.9274 0.9210
rm.pairslope 0.9254 0.9194
rm.beyond1std 0.9258 0.9242
rm.fpr20 0.9286 0.9226
rm.fpr35 0.9266 0.9250
rm.fpr80 0.9246 0.9202
rm.mad 0.9262 0.9210
rm.skew 0.9246 0.9194
rm.fpr50 0.9262 0.9218
maxdiff 0.9282 0.9194
rm.pdfp 0.9258 0.9202
rm.kurtosis 0.9254 0.9194
rm.peramp 0.9230 0.9218
rm.maxslope 0.9234 0.9218
rm.amplitude 0.9246 0.9194
rm.std 0.9246 0.9218
outl 0.9206 0.9129
gtvar 0.9177 0.9056
shov 0.9129 0.8984
totvar 0.9089 0.8960
lsd 0.9032 0.8879
dscore 0.8964 0.8774
gscore 0.8786 0.8710
fslope 0.8730 0.8589
quadvar 0.8149 0.8024
famp 0.4706 0.4500
Plot the rates (solid = train, dashed = test):
crm <- data.frame(crm)
p <- ggplot(data=crm, aes(x=1:nrow(crm), y=train)) + ylab("Classification rate") + xlab("")
p <- p + geom_line()
p <- p + scale_x_continuous(breaks=1:nrow(crm),labels=row.names(crm))
p <- p + theme(axis.text.x=element_text(angle=90, hjust=1))
p <- p + geom_line(aes(y=test),linetype="dashed")
p
Now restrict attention to classifying the variable types only, marking the non-variable objects as missing:
trains$ttype <- trains$type
trains$ttype[trains$ttype == "nv"] <- NA
trains$ttype <- factor(trains$ttype)
tests$ttype <- tests$type
tests$ttype[tests$ttype == "nv"] <- NA
tests$ttype <- factor(tests$ttype)
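The factor() calls after setting "nv" to NA are doing real work: assigning NA leaves the now-empty "nv" level in place, which randomForest() would reject as an empty class in the response. A small sketch with made-up levels:

```r
# Assigning NA does not drop the level; refactoring does.
x <- factor(c("nv", "sn", "rr"))
x[x == "nv"] <- NA
levels(x)          # "nv" "rr" "sn" -- empty level still present
levels(factor(x))  # "rr" "sn"
```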
tpredform <- paste(preds,collapse="+")
tallform <- as.formula(paste("ttype ~",tpredform))
fmod <- randomForest(tallform, data=na.omit(trains))
pv <- predict(fmod)
jta <- na.omit(trains$ttype)
cm <- xtabs( ~ pv + jta)
nomrate <- sum(diag(cm))/sum(cm)
pv <- predict(fmod, newdata=na.omit(tests))
jtr <- na.omit(tests$ttype)
cm <- xtabs( ~ pv + jtr)
truerate <- sum(diag(cm))/sum(cm)
rr <- c(nomrate,truerate)
names(rr) <- c("train","test")
rr
train test
0.7380 0.7375
Look at the importance of each predictor
importance(fmod)
MeanDecreaseGini
shov 86.664
maxdiff 52.388
dscore 48.037
totvar 77.518
quadvar 44.996
famp 35.571
fslope 47.521
outl 55.025
gscore 45.107
lsd 39.766
gtvar 67.844
rm.amplitude 48.880
rm.beyond1std 20.191
rm.fpr20 20.972
rm.fpr35 22.102
rm.fpr50 21.534
rm.fpr80 20.414
rm.maxslope 29.069
rm.mad 28.607
rm.medbuf 28.343
rm.pairslope 15.248
rm.peramp 37.634
rm.pdfp 33.015
rm.skew 35.188
rm.kurtosis 29.814
rm.std 59.894
rm.rcorbor 8.678
Drop predictors sequentially, removing the least important at each stage and recording the classification rates:
crm <- matrix(NA,npreds,2)
vpreds <- preds
vnames <- rep("",npreds)
for(i in 1:(npreds-1)){
vimp <- importance(fmod)
vari <- which.min(vimp)
vpreds <- vpreds[-vari]
vnames[i] <- rownames(vimp)[vari]
tpredform <- paste(vpreds,collapse="+")
tallform <- as.formula(paste("ttype ~",tpredform))
fmod <- randomForest(tallform, data=na.omit(trains))
pv <- predict(fmod)
cm <- xtabs( ~ pv + jta)
nomrate <- sum(diag(cm))/sum(cm)
pv <- predict(fmod, newdata=na.omit(tests))
cm <- xtabs( ~ pv + jtr)
truerate <- sum(diag(cm))/sum(cm)
crm[i,] <- c(nomrate,truerate)
}
nomrate <- mean(jta == 'sn')  # baseline: always predict the most common variable type, sn
truerate <- mean(jtr == 'sn')
crm[npreds,] <- c(nomrate,truerate)
vnames[npreds] <- vpreds
dimnames(crm) <- list(vnames,c("train","test"))
crm
train test
rm.rcorbor 0.7380 0.7346
rm.pairslope 0.7372 0.7390
rm.fpr80 0.7365 0.7390
rm.beyond1std 0.7319 0.7361
rm.fpr20 0.7418 0.7434
rm.fpr50 0.7342 0.7317
rm.fpr35 0.7350 0.7449
rm.mad 0.7327 0.7331
rm.medbuf 0.7456 0.7405
rm.maxslope 0.7342 0.7331
rm.kurtosis 0.7365 0.7375
rm.pdfp 0.7365 0.7361
famp 0.7380 0.7419
lsd 0.7342 0.7346
rm.skew 0.7319 0.7449
rm.peramp 0.7319 0.7331
quadvar 0.7304 0.7390
dscore 0.7182 0.7317
outl 0.7281 0.7258
rm.amplitude 0.7098 0.7243
gscore 0.7060 0.6979
fslope 0.6725 0.7053
maxdiff 0.6489 0.6716
gtvar 0.5918 0.6026
shov 0.5407 0.5381
totvar 0.3199 0.3402
rm.std 0.2620 0.2815
Plot the rates (solid = train, dashed = test):
crm <- data.frame(crm)
p <- ggplot(data=crm, aes(x=1:nrow(crm), y=train)) + ylab("Classification rate") + xlab("")
p <- p + geom_line()
p <- p + scale_x_continuous(breaks=1:nrow(crm),labels=row.names(crm))
p <- p + theme(axis.text.x=element_text(angle=90, hjust=1))
p <- p + geom_line(aes(y=test),linetype="dashed")
p