Chapter 2 - Estimation

Load in the packages:

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import scipy as sp
import statsmodels.api as sm
import statsmodels.formula.api as smf

2.6 Example

Read in the Galapagos data and check the first few lines:

In [2]:
gala = pd.read_csv("data/gala.csv", index_col=0)
gala.head()
Out[2]:
           Species  Endemics   Area  Elevation  Nearest  Scruz  Adjacent
Baltra          58        23  25.09        346      0.6    0.6      1.84
Bartolome       31        21   1.24        109      0.6   26.3    572.33
Caldwell         3         3   0.21        114      2.8   58.7      0.78
Champion        25         9   0.10         46      1.9   47.4      0.18
Coamano          2         1   0.05         77      1.9    1.9    903.82

Drop the Endemics variable permanently from the data frame:

In [3]:
gala.drop('Endemics', axis=1, inplace=True)
gala.head()
Out[3]:
           Species   Area  Elevation  Nearest  Scruz  Adjacent
Baltra          58  25.09        346      0.6    0.6      1.84
Bartolome       31   1.24        109      0.6   26.3    572.33
Caldwell         3   0.21        114      2.8   58.7      0.78
Champion        25   0.10         46      1.9   47.4      0.18
Coamano          2   0.05         77      1.9    1.9    903.82

Fit a linear model:

In [4]:
lmod = smf.ols(formula='Species ~ Area + Elevation + Nearest + Scruz + Adjacent', data=gala).fit()
lmod.summary()
Out[4]:
                            OLS Regression Results
==============================================================================
Dep. Variable:                Species   R-squared:                       0.766
Model:                            OLS   Adj. R-squared:                  0.717
Method:                 Least Squares   F-statistic:                     15.70
Date:                Wed, 26 Sep 2018   Prob (F-statistic):           6.84e-07
Time:                        11:13:58   Log-Likelihood:                -162.54
No. Observations:                  30   AIC:                             337.1
Df Residuals:                      24   BIC:                             345.5
Df Model:                           5
Covariance Type:            nonrobust
==============================================================================
                coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
Intercept     7.0682     19.154      0.369      0.715     -32.464      46.601
Area         -0.0239      0.022     -1.068      0.296      -0.070       0.022
Elevation     0.3195      0.054      5.953      0.000       0.209       0.430
Nearest       0.0091      1.054      0.009      0.993      -2.166       2.185
Scruz        -0.2405      0.215     -1.117      0.275      -0.685       0.204
Adjacent     -0.0748      0.018     -4.226      0.000      -0.111      -0.038
==============================================================================
Omnibus:                       12.683   Durbin-Watson:                   2.476
Prob(Omnibus):                  0.002   Jarque-Bera (JB):               13.498
Skew:                           1.136   Prob(JB):                      0.00117
Kurtosis:                       5.374   Cond. No.                     1.90e+03
==============================================================================


Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[2] The condition number is large, 1.9e+03. This might indicate that there are
strong multicollinearity or other numerical problems.

The warnings can be ignored for now. The first appears with every OLS fit, since the standard errors always assume the error covariance is correctly specified. The second, the large condition number, can signal a genuine multicollinearity or numerical problem.
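
As a quick check on where that condition number comes from, we can compute the 2-norm condition number of the design matrix that statsmodels built from the formula; it should be close to the 1.90e+03 reported above:

np.linalg.cond(lmod.model.exog)  # design matrix from the formula; expect roughly 1.9e+03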

Compute the least squares estimates directly using the formula betahat = (X'X)^(-1) X'y:

In [5]:
X = gala.iloc[:, 1:].copy()   # predictors only; copy so gala itself is not modified
X.insert(0, 'intercept', 1)   # prepend a column of ones for the intercept
XtXi = np.linalg.inv(X.T @ X)
np.dot(XtXi @ X.T, gala['Species'])
Out[5]:
array([ 7.06822071, -0.02393834,  0.31946476,  0.00914396, -0.24052423,
       -0.07480483])

A somewhat more efficient and numerically stable way to do the calculation is to solve the normal equations directly rather than forming the inverse explicitly:

In [6]:
np.linalg.solve(X.T @ X, np.dot(X.T, gala['Species']))
Out[6]:
array([ 7.06822071, -0.02393834,  0.31946476,  0.00914396, -0.24052423,
       -0.07480483])
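
NumPy also provides np.linalg.lstsq, which solves the least squares problem via a singular value decomposition and is more robust when X'X is ill-conditioned. A minimal sketch; the extra return values are the residual sum of squares, the rank of X, and its singular values:

beta, rss, rank, sv = np.linalg.lstsq(X, gala['Species'], rcond=None)
beta   # should match the estimates above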

2.9 QR decomposition

Computation using the QR decomposition. Factor the design matrix as X = QR, where Q has orthonormal columns and R is upper triangular; the estimates then solve the triangular system R betahat = f, where f = Q'y:

In [7]:
q, r = np.linalg.qr(X)
f = np.dot(q.T, gala['Species'])
f
Out[7]:
array([-466.84219318,  381.40557435,  256.25047255,    5.40764552,
       -119.49834019,  257.69436853])

This function from scipy exploits the fact that r is upper triangular and solves the system by back-substitution; np.linalg.solve does not take advantage of this structure.

In [8]:
sp.linalg.solve_triangular(r, f)
Out[8]:
array([ 7.06822071, -0.02393834,  0.31946476,  0.00914396, -0.24052423,
       -0.07480483])
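
As a sanity check on the factorization, q should have orthonormal columns and the two factors should reconstruct the design matrix; both comparisons below should print True:

print(np.allclose(q.T @ q, np.eye(X.shape[1])))  # columns of q are orthonormal
print(np.allclose(q @ r, X))                     # QR reproduces X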

2.10 Identifiability

Add a predictor which is a linear combination of other predictors. In contrast to R, which would drop the aliased predictor, statsmodels still reports an estimate for it (the default fit is computed with a pseudoinverse). The second warning in the output indicates that the design matrix is singular (as indeed it is).

In [9]:
gala['Adiff'] = gala['Area'] - gala['Adjacent']
lmod = smf.ols(formula='Species ~ Area + Elevation + Nearest + Scruz + Adjacent + Adiff', data=gala).fit()
lmod.summary()
Out[9]:
                            OLS Regression Results
==============================================================================
Dep. Variable:                Species   R-squared:                       0.766
Model:                            OLS   Adj. R-squared:                  0.717
Method:                 Least Squares   F-statistic:                     15.70
Date:                Wed, 26 Sep 2018   Prob (F-statistic):           6.84e-07
Time:                        11:13:58   Log-Likelihood:                -162.54
No. Observations:                  30   AIC:                             337.1
Df Residuals:                      24   BIC:                             345.5
Df Model:                           5
Covariance Type:            nonrobust
==============================================================================
                coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
Intercept     7.0682     19.154      0.369      0.715     -32.464      46.601
Area         -0.0409      0.018     -2.236      0.035      -0.079      -0.003
Elevation     0.3195      0.054      5.953      0.000       0.209       0.430
Nearest       0.0091      1.054      0.009      0.993      -2.166       2.185
Scruz        -0.2405      0.215     -1.117      0.275      -0.685       0.204
Adjacent     -0.0578      0.016     -3.511      0.002      -0.092      -0.024
Adiff         0.0170      0.007      2.340      0.028       0.002       0.032
==============================================================================
Omnibus:                       12.683   Durbin-Watson:                   2.476
Prob(Omnibus):                  0.002   Jarque-Bera (JB):               13.498
Skew:                           1.136   Prob(JB):                      0.00117
Kurtosis:                       5.374   Cond. No.                     4.67e+15
==============================================================================


Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[2] The smallest eigenvalue is 2.45e-24. This might indicate that there are
strong multicollinearity problems or that the design matrix is singular.
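
One way to confirm the singularity is to compare the rank of the design matrix with its number of columns; with Adiff included the matrix should have 7 columns but rank 6:

Xs = lmod.model.exog                    # design matrix including Adiff
np.linalg.matrix_rank(Xs), Xs.shape[1]  # expect (6, 7)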

2.11 Orthogonality

Orthogonality example

In [10]:
odor = pd.read_csv("data/odor.csv")
odor.head()
Out[10]:
   odor  temp  gas  pack
0    66    -1   -1     0
1    39     1   -1     0
2    43    -1    1     0
3    49     1    1     0
4    58    -1    0    -1

The covariance matrix of the predictors is diagonal:

In [11]:
odor.iloc[:,1:].cov()
Out[11]:
          temp       gas      pack
temp  0.571429  0.000000  0.000000
gas   0.000000  0.571429  0.000000
pack  0.000000  0.000000  0.571429
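
The same structure appears in the cross-product matrix X'X when we build the design matrix by hand as before; the off-diagonal entries should all be zero, which is what makes the estimates below independent of which other predictors are in the model:

Xo = odor.iloc[:, 1:].copy()   # the three predictors
Xo.insert(0, 'intercept', 1)   # prepend a column of ones
Xo.T @ Xo                      # expect a diagonal matrix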

LS estimates with all three predictors

In [12]:
lmod = smf.ols(formula='odor ~ temp + gas + pack', data=odor).fit()
lmod.params
Out[12]:
Intercept    15.200
temp        -12.125
gas         -17.000
pack        -21.375
dtype: float64

Covariance of the parameter estimates

In [13]:
np.round(lmod.cov_params(),2)
Out[13]:
           Intercept   temp    gas   pack
Intercept      86.46   -0.0   -0.0   -0.0
temp           -0.00  162.1   -0.0    0.0
gas            -0.00   -0.0  162.1    0.0
pack           -0.00    0.0    0.0  162.1

See that the estimates do not change when a predictor is dropped from the model:

In [14]:
lmod = smf.ols(formula='odor ~ gas + pack', data=odor).fit()
lmod.params
Out[14]:
Intercept    15.200
gas         -17.000
pack        -21.375
dtype: float64
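
Although the estimates are unchanged, the standard errors are not: dropping temp changes the residual sum of squares and the residual degrees of freedom used to estimate the error variance. A quick comparison of the two fits:

lmod3 = smf.ols(formula='odor ~ temp + gas + pack', data=odor).fit()
lmod2 = smf.ols(formula='odor ~ gas + pack', data=odor).fit()
print(lmod3.bse)   # standard errors, full model
print(lmod2.bse)   # standard errors, reduced model
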
In [15]:
%load_ext version_information
%version_information pandas, numpy, matplotlib, seaborn, scipy, patsy, statsmodels
Out[15]:
Software     Version
Python       3.7.0 64bit [Clang 4.0.1 (tags/RELEASE_401/final)]
IPython      6.5.0
OS           Darwin 17.7.0 x86_64 i386 64bit
pandas       0.23.4
numpy        1.15.1
matplotlib   2.2.3
seaborn      0.9.0
scipy        1.1.0
patsy        0.5.0
statsmodels  0.9.0
Wed Sep 26 11:13:58 2018 BST