---
title: Fitting GPR and generation of the measures
author: Julian Faraway
output:
  html_document:
    toc: true
    theme: cosmo
---

```{r global_options, include=FALSE}
library(knitr)
opts_chunk$set(comment=NA, fig.path='/tmp/Figs/', warning=FALSE, message=FALSE)
```

```{r echo=FALSE}
paste("Created:",date())
```

```{r}
load("lcdb.rda")
source("funcs.R")
GPR <- TRUE
```

Set the kernel type to squared exponential and set the inverse width:

```{r}
kerntype <- "exponential"
wvec <- 2e-4
```

## Definition of measures

The measures can be classified into groups:

1. **Whole-curve measures**
    - *meds*: median magnitude (mag)
    - *iqr*: interquartile range of mag
    - *shov*: mean of absolute differences of successive observed magnitudes
    - *maxdiff*: maximum difference in magnitudes
    - *dscore*: mean value of phi((mag - median)/sderr)
    - all the *Richards measures*
2. **Fitted-curve measures** (smooth the data first, then compute)
    - *totvar*: total variation of the fitted curve
    - *quadvar*: quadratic variation of the fitted curve
    - *famp*: range of the fitted curve
    - *fslope*: maximum derivative of the fitted curve
    - *trend*: linear trend of the fitted curve
3. **Residual-from-fit measures**
    - *outl*: maximum studentized residual from the smoothed fit
    - *std*: SD of the residuals
    - *skewres*: skewness of the residuals
    - *shapwilk*: Shapiro-Wilk statistic of the residuals (a test for normality)
4. **Cluster measures** (based on groups of up to 4 measurements within 30 minutes)
    - *lsd*: fit the means within the groups (up to 4 measurements); take the logged SD of the residuals from this fit
    - *gscore*: mean value of phi((mag - mean(mag))/sd) based on the clusters
    - *mdev*: maximum absolute residual from this fit
    - *gtvar*: total variation of the curve based on the group means, scaled by the range of observation
5. **Other** (cannot be used for discriminant purposes)
    - *wander*: mean movement of the object in (ra, dec) space
    - *moveloc*: range of motion in (ra, dec) space
    - *nobs*: number of measurements in the group

We have not used all these measures for classification purposes.
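To make the group-1 definitions concrete, here is an illustrative sketch computing a few of the whole-curve measures on a toy magnitude vector. This is not the implementation in `funcs.R`; the toy data and the value of `sderr` are invented, and `phi` in the *dscore* definition is assumed here to be the standard normal density (`dnorm`) — substitute `pnorm` if it denotes the CDF.

```{r}
# Illustrative sketch only, not the funcs.R implementation.
mag   <- c(17.2, 17.5, 17.1, 17.8, 17.3, 17.6)  # toy magnitudes
sderr <- 0.2                                    # assumed measurement error SD

meds    <- median(mag)               # *meds*: median magnitude
iqrm    <- IQR(mag)                  # *iqr*: interquartile range
shov    <- mean(abs(diff(mag)))      # *shov*: mean |diff| of successive mags
maxdiff <- max(abs(diff(mag)))       # *maxdiff*: largest |diff| in magnitudes
# *dscore*: phi taken as the standard normal density here (an assumption)
dscore  <- mean(dnorm((mag - meds)/sderr))
```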
Some, such as *nobs*, we have not used because they would distort classification performance, as explained in the article. Other measures were failed experiments: they did not incrementally improve classification performance. These are not reported in the article but are kept here for future reference.

```{r}
firstdate <- 53464
daterange <- 2764
detection.limit <- 20.5
```

```{r child="meascomp.Rmd", eval=TRUE}
```

## Numerical summary of the measures

```{r}
summary(cmdb)
```

## Plots of the measures

```{r}
mnames <- names(cmdb)
for(i in 1:(ncol(cmdb)-2)){
  vname <- mnames[i]
  tranf <- functran[[match(vname, names(functran))]]
  y <- sapply(cmdb[,i], tranf)
  ylab <- ifelse(tranf == "identity", mnames[i], paste0(tranf, "(", mnames[i], ")"))
  # plot the transformed values, matching the transformation named in ylab
  plot(y ~ cmdb$type, ylab=ylab, xlab="Type")
}
```

## Split data into training and test

Split the data into a training (2/3) and a test (1/3) sample in the same way as before. The split will not be identical to the one used for the Richards measures, because not all the Richards statistics can be computed on the NV group, which results in some objects being discarded from that calculation.

```{r}
set.seed(123)
n <- nrow(cmdb)
isel <- sample(1:n, round(n/3))
trains <- cmdb[-isel,]
tests <- cmdb[isel,]
```

There are `r round(n/3)` observations in the test set and `r n - round(n/3)` observations in the training set. Save for future use.

```{r}
save(cmdb, trains, tests, file="feat.rda")
```
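As a quick sanity check, the same 2/3 training, 1/3 test split logic can be exercised on synthetic data (a sketch; the data frame here is invented and stands in for `cmdb`):

```{r}
# Illustrative check of the split logic on toy data, not on cmdb itself.
set.seed(123)
toy <- data.frame(x = rnorm(30), type = rep(c("a", "b"), 15))
n_toy <- nrow(toy)
isel_toy  <- sample(1:n_toy, round(n_toy/3))  # indices for the test set
toy_train <- toy[-isel_toy, ]
toy_test  <- toy[isel_toy, ]
stopifnot(nrow(toy_test) == round(n_toy/3))             # 1/3 in the test set
stopifnot(nrow(toy_train) + nrow(toy_test) == n_toy)    # a partition: no rows lost
```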