Introduction to R for Python Users

Default packages
Multiplying a list
Namespace clashes
Call by reference versus value
Pipes
Regression

Default packages

R attaches several packages by default; without them, far fewer functions would be available. sessionInfo() lists them under "attached base packages":

sessionInfo()

R version 3.4.1 (2017-06-30)
Platform: x86_64-apple-darwin15.6.0 (64-bit)
Running under: macOS Sierra 10.12.6

Matrix products: default
BLAS: /Library/Frameworks/R.framework/Versions/3.4/Resources/lib/libRblas.0.dylib
LAPACK: /Library/Frameworks/R.framework/Versions/3.4/Resources/lib/libRlapack.dylib

locale:
[1] en_GB.UTF-8/en_GB.UTF-8/en_GB.UTF-8/C/en_GB.UTF-8/en_GB.UTF-8

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

loaded via a namespace (and not attached):
 [1] compiler_3.4.1  backports_1.1.1 magrittr_1.5    rprojroot_1.2  
 [5] tools_3.4.1     htmltools_0.3.6 yaml_2.1.14     Rcpp_0.12.13   
 [9] stringi_1.1.5   rmarkdown_1.6   knitr_1.17      stringr_1.2.0  
[13] digest_0.6.12   evaluate_0.10.1
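As a rough sketch of how to inspect this yourself, the attached packages can also be read off the search path. The variable names attached and pkgs are illustrative only, and the file name "report.Rmd" is a made-up example.

```r
# List everything currently attached to the search path
attached = search()
print(attached)

# Keep only the package entries and strip the "package:" prefix
pkgs = sub("^package:", "", grep("^package:", attached, value = TRUE))
print(pkgs)

# A package that is installed but not attached can still be used with ::
# (tools ships with R; "report.Rmd" is just an example file name)
ext = tools::file_ext("report.Rmd")
print(ext)
```

The exact list depends on how R was started, so the output may differ from the session shown above.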

Multiplying a list

You cannot multiply a list by two:

x = list(1,2,3)
x * 2

Error in x * 2: non-numeric argument to binary operator

but there is no problem for a vector, because arithmetic on vectors is elementwise:

x = c(1,2,3)
x * 2

[1] 2 4 6
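If the data really does have to live in a list, one possible workaround is to apply the multiplication element by element, or to flatten the list first. This is a minimal sketch; doubled_list and doubled_vec are names chosen here for illustration.

```r
x = list(1, 2, 3)

# Apply the multiplication to each element; the result is still a list
doubled_list = lapply(x, function(v) v * 2)
str(doubled_list)

# Or flatten to a numeric vector first, then use vectorised arithmetic
doubled_vec = unlist(x) * 2
print(doubled_vec)
```

Python users may recall that `[1, 2, 3] * 2` repeats the list instead of raising an error; in R, repetition is done with rep().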

Namespace clashes

Availability of functions depends on which packages are loaded:

select()

Error in select(): could not find function "select"

MASS has a select function (used for smoothing parameter selection in ridge regression):

library(MASS)
select

function (obj) 
UseMethod("select")
<bytecode: 0x7fc0be4d2510>
<environment: namespace:MASS>

But dplyr also has a select:

library(dplyr)
select

function (.data, ...) 
{
    UseMethod("select")
}
<environment: namespace:dplyr>

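This now masks the original select rather than overwriting it: dplyr was attached later, so its version is found first. If you want the MASS version back, you need to qualify the name with the package: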

MASS::select

function (obj) 
UseMethod("select")
<bytecode: 0x7fc0be4d2510>
<environment: namespace:MASS>

Qualifying a name with its package in this way is what Python does all the time to avoid this namespace problem.
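The same masking behaviour can be reproduced without installing anything. The sketch below assumes nothing beyond base R: it defines a hypothetical mean function in the global environment that masks base::mean, and then reaches the original with the :: operator.

```r
# Hypothetical user-defined function that masks base::mean
mean = function(x) "masked"

# The unqualified name now finds the version in the global environment
masked_result = mean(c(1, 2, 3))
print(masked_result)

# The package-qualified name still reaches the original
base_result = base::mean(c(1, 2, 3))
print(base_result)

# Show every place on the search path that defines "mean"
print(find("mean"))

# Remove the masking definition so mean() refers to base::mean again
rm(mean)
```

Calling conflicts() lists every name that is defined in more than one place on the search path, which can help track down clashes like the one between MASS and dplyr.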

Call by reference versus value

Set y = x and then edit x:

y = x
x[3] = 10
x

[1]  1  2 10

This does not change y, because R copies the vector when x is modified:

y

[1] 1 2 3

Write a function to add one to the vector:

addone = function(x) {
  x = x + 1
  x
}
addone(x)

[1]  2  3 11

This does not change x, because the function works on its own local copy:

x

[1]  1  2 10
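To keep the result of such a function, the usual approach is to assign the return value back. When shared, mutable state is genuinely needed, an environment behaves more like a Python object passed by reference. The following is a sketch only; counter and bump are hypothetical names introduced for illustration.

```r
x = c(1, 2, 10)
addone = function(x) {
  x + 1
}

# Reassign to keep the modified vector
x = addone(x)
print(x)

# Environments are not copied when passed to a function,
# so changes made inside the function are visible to the caller
counter = new.env()
counter$value = 0
bump = function(env) {
  env$value = env$value + 1
  invisible(env)
}
bump(counter)
bump(counter)
print(counter$value)
```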

Pipes

With the pipe operator %>% (available here because dplyr is loaded), the output from the previous stage is the input to the next stage, passed as its first argument:

x %>% mean

[1] 4.3333

iris %>% group_by(Species) %>% summarise(mean(Sepal.Length), mean(Sepal.Width))

# A tibble: 3 x 3
     Species `mean(Sepal.Length)` `mean(Sepal.Width)`
      <fctr>                <dbl>               <dbl>
1     setosa                5.006               3.428
2 versicolor                5.936               2.770
3  virginica                6.588               2.974
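To make the mechanics concrete, here is a toy pipe built in base R. The operator %then% is hypothetical and is not how magrittr implements %>%; in particular it only accepts a bare function on the right-hand side, whereas %>% also handles calls with extra arguments such as group_by(Species).

```r
# Hypothetical toy operator: pass the left-hand side to the function on the right
`%then%` = function(lhs, rhs) rhs(lhs)

x = c(1, 2, 10)

# Reads left to right: take x, then square-root it, then sum
piped = x %then% sqrt %then% sum

# The equivalent nested call reads inside out
nested = sum(sqrt(x))

print(identical(piped, nested))
```

The piped form puts the steps in the order they are carried out, which is the main readability argument for pipes.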

Regression

gala = read.table("http://people.bath.ac.uk/jjf23/data/gala.dat",header=TRUE)
head(gala)

             Species Endemics  Area Elevation Nearest Scruz Adjacent
Baltra            58       23 25.09       346     0.6   0.6     1.84
Bartolome         31       21  1.24       109     0.6  26.3   572.33
Caldwell           3        3  0.21       114     2.8  58.7     0.78
Champion          25        9  0.10        46     1.9  47.4     0.18
Coamano            2        1  0.05        77     1.9   1.9   903.82
Daphne.Major      18       11  0.34       119     8.0   8.0     1.84

Summary output is reasonably compact:

lmod = lm(Species ~ Area + Elevation + Nearest + Scruz + Adjacent, gala)
summary(lmod)


Call:
lm(formula = Species ~ Area + Elevation + Nearest + Scruz + Adjacent, 
    data = gala)

Residuals:
    Min      1Q  Median      3Q     Max 
-111.68  -34.90   -7.86   33.46  182.58 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)  7.06822   19.15420    0.37   0.7154
Area        -0.02394    0.02242   -1.07   0.2963
Elevation    0.31946    0.05366    5.95  3.8e-06
Nearest      0.00914    1.05414    0.01   0.9932
Scruz       -0.24052    0.21540   -1.12   0.2752
Adjacent    -0.07480    0.01770   -4.23   0.0003

Residual standard error: 61 on 24 degrees of freedom
Multiple R-squared:  0.766, Adjusted R-squared:  0.717 
F-statistic: 15.7 on 5 and 24 DF,  p-value: 6.84e-07
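The printed summary is meant for reading, but the same quantities can be pulled out programmatically. The sketch below uses the built-in mtcars data instead of gala so that it runs without a network connection; the choice of mpg, wt and hp is an arbitrary assumption made for illustration.

```r
# Fit a small linear model on a built-in data set (stand-in for gala)
lmod = lm(mpg ~ wt + hp, data = mtcars)

# Named vector of coefficient estimates
print(coef(lmod))

# The summary object stores the pieces shown in the printed output
s = summary(lmod)
print(s$coefficients)  # estimates, standard errors, t values, p-values
print(s$r.squared)     # multiple R-squared
print(s$sigma)         # residual standard error

# Confidence intervals for the coefficients (95% by default)
print(confint(lmod))
```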

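Introducing a perfectly collinear predictor is flagged in a noticeable way: its coefficient is reported as NA and the summary notes the singularity, while the rest of the fit is unchanged: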

lmod = lm(Species ~ Area + Elevation + Nearest + Scruz + Adjacent + I(Area + Adjacent), gala)
summary(lmod)


Call:
lm(formula = Species ~ Area + Elevation + Nearest + Scruz + Adjacent + 
    I(Area + Adjacent), data = gala)

Residuals:
    Min      1Q  Median      3Q     Max 
-111.68  -34.90   -7.86   33.46  182.58 

Coefficients: (1 not defined because of singularities)
                   Estimate Std. Error t value Pr(>|t|)
(Intercept)         7.06822   19.15420    0.37   0.7154
Area               -0.02394    0.02242   -1.07   0.2963
Elevation           0.31946    0.05366    5.95  3.8e-06
Nearest             0.00914    1.05414    0.01   0.9932
Scruz              -0.24052    0.21540   -1.12   0.2752
Adjacent           -0.07480    0.01770   -4.23   0.0003
I(Area + Adjacent)       NA         NA      NA       NA

Residual standard error: 61 on 24 degrees of freedom
Multiple R-squared:  0.766, Adjusted R-squared:  0.717 
F-statistic: 15.7 on 5 and 24 DF,  p-value: 6.84e-07
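One way to detect the dropped term in code, rather than by reading the printout, is to look for NA coefficients. This sketch again substitutes the built-in mtcars data for gala, and the names lmod2 and na_terms are illustrative.

```r
# Add a predictor that is an exact linear combination of two others
lmod2 = lm(mpg ~ wt + hp + I(wt + hp), data = mtcars)

# Coefficients that could not be estimated are reported as NA
na_terms = names(which(is.na(coef(lmod2))))
print(na_terms)

# The rank of the fit is smaller than the number of coefficients
print(lmod2$rank)
print(length(coef(lmod2)))

# alias() reports which linear dependencies were found
print(alias(lmod2))
```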