Errata: *February 15, 2019*

Thank you for purchasing *Practical Data Science with R*. We'll update this list as necessary. Thank you!

*Unless otherwise noted, corrections are pending in all formats.*

## Page 12, Listing 1.2

Start the listing with the line `creditdata <- d`.

## Page 26 Listing 2.8, page 30 Listing 2.11, and any other use of H2DB

*This correction hasn't been made.*

Starting with version 1.4, the H2DB database requires that the path specified for the database files be explicit (starting with "./", "/", or "~/"). The book examples were written using 1.3 versions of H2DB, which did not enforce this. The fix is to edit the db definition XML files and the R connection commands to make sure you use explicit paths. Use:

`jdbc:h2:./H2DB;...`

and not: `jdbc:h2:H2DB;...`

## Page 86, Table 5.1, second row, second column

The random forests cross-reference should be to section 9.1.2.

## Page 120, Section 6.2.1 Using categorical features

*This correction hasn't been made.*

It is a good idea to harden the look-up `vTab[pos,]` to `vTab[as.character(pos),]` to defend against the irregular nature of R's indexing operator (which would fail in this case if we were to have `pos` as a logical instead of a string).

## Page 120, Using categorical features, Listing 6.4

Replace `pPosWna <- (naTab/sum(naTab))[pos]` with `pPosWna <- (naTab/sum(naTab))[as.character(pos)]` to get around bad indexing issues due to R's treatment of logical types and expansion of vector lengths.

## Pages 125-138, Section 6.3 Building models using many variables

We strongly recommend splitting your training set into two pieces, using one piece only for the construction of single-variable models, and the other, disjoint portion of the training data only for overall model construction. The issue is that any row of data examined during single-variable model construction is no longer exchangeable with even test data (let alone future data). Models trained on rows used to build the variable encodings tend to overestimate effect sizes of the sub-models (or treated variables), underestimate degrees of freedom, and get significances wrong. We think the KNN model in this section happened to perform okay due to the aggressive variable pruning heuristic in listing 6.11.

A rerun of chapter 6 with the stricter data separation can be found here (including a newer run automating many of the steps using the vtreat library; files, of course, are in the course downloads).
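The recommended split can be sketched as follows (a minimal illustration in Python rather than the book's R; the function name and sizes are ours, not the book's):

```python
import random

def calibration_train_split(rows, calibration_fraction=0.5, seed=2019):
    """Split rows into a calibration set (used only to design single-variable
    models / variable encodings) and a disjoint set (used only to train the
    overall model), so no row is used for both steps."""
    rng = random.Random(seed)
    shuffled = list(rows)
    rng.shuffle(shuffled)
    n_cal = int(len(shuffled) * calibration_fraction)
    return shuffled[:n_cal], shuffled[n_cal:]
```

The variable encodings are fit on the first piece only and the model sees only the second, so the encoded variables stay exchangeable with held-out data.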

## Page 126, Section 6.2.3 Using cross-validation

First paragraph: "cross-reference" should be "cross-validation."

## Page 126, Listing 6.11 Basic variable selection

The second "Run through categorical variables and pick based on a deviance improvement" should read "Run through numerical variables and pick based on a deviance improvement."

## Page 126, Text and Listing 6.11 Basic variable selection

The "2 times" in listing 6.11 is because statistical deviance is traditionally written as `-2*(log(p|model) - log(p|saturatedModel))`. We are using a "delta deviance," or improvement in deviance, given by `(-2*(log(p|nullModel) - log(p|saturatedModel))) - (-2*(log(p|model) - log(p|saturatedModel)))`. We wrote this as `2*(log(p|model) - baseRateCheck)`, and it is a deviance-style metric (not an AIC-style metric as claimed in the text, which got behind by one revision).

The text above listing 6.11 on page 126 claims we are going to use an AIC-like metric. The original implementation did this by subtracting a variable entropy estimate from each `liCheck` in the `catVars` list and subtracting 1 from each numeric variable (a placeholder). This turned out not to be useful for the data set at hand (this is the sort of thing you do want to try variations of), and evidently we removed it without completely updating the text.

Corrected text, listings, and results are now here and here.
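As a concrete check of the delta-deviance formula, here is a small sketch (in Python rather than the book's R; the function names are ours) for a model that predicts a single probability:

```python
import math

def log_likelihood(p, outcomes):
    """log(P(outcomes | model)) for a model predicting probability p of TRUE."""
    return sum(math.log(p) if y else math.log(1.0 - p) for y in outcomes)

def deviance_improvement(p_model, outcomes):
    """2*(log(p|model) - log(p|nullModel)); the saturated-model terms
    cancel when the two deviances are subtracted."""
    base_rate = sum(outcomes) / len(outcomes)
    return 2.0 * (log_likelihood(p_model, outcomes)
                  - log_likelihood(base_rate, outcomes))
```

Predicting exactly the base rate gives an improvement of zero; a prediction that fits the data worse than the base rate gives a negative value.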

## Page 130, How decision tree models work

In the text explaining the decision tree, `predVar126 < -0.002810871` should be `predVar126 < 0.07366888`.

## Page 156, Section 7.1.5 Reading the model summary and characterizing coefficient quality

Normality of errors is desirable, but not necessary, for linear regression (see Wikipedia and "What are the key assumptions of linear regression?"). We had previously said the errors must be normally distributed (and we certainly did not mean to imply that the `y` or `x` values need to be normally distributed).

## Page 157, Section 7.2.1 Understanding logistic regression

Unfortunate typo: the sigmoid function is `s(z) = 1/(1+exp(-z))` (not `1/(1+exp(z))`).
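A quick sanity check of the corrected form (sketched in Python rather than R):

```python
import math

def sigmoid(z):
    """The corrected sigmoid: s(z) = 1/(1 + exp(-z))."""
    return 1.0 / (1.0 + math.exp(-z))

# The corrected form is increasing in z and maps 0 to 0.5; the typo'd
# form 1/(1 + exp(z)) is decreasing, which is why the sign matters.
```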

## Page 180, Section 8.1.2, Preparing the data

*This correction hasn't been made.*

After calling `scale()`, always clear the attributes from the result with `attr(., "scaled:center") = NULL; attr(., "scaled:scale") = NULL`; otherwise `prcomp`'s `predict(princ, .)` does not always equal the desired projection `. %*% princ$rotation`.
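The attribute quirk itself is R-specific, but the projection the text intends can be illustrated outside of R. A small numpy sketch (assuming numpy is available) of "centered data times the rotation":

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=(20, 3))

# PCA by hand: center the data, then take the right singular vectors
# as the rotation (analogous to princ$rotation from R's prcomp()).
center = x.mean(axis=0)
xc = x - center
_, _, vt = np.linalg.svd(xc, full_matrices=False)
rotation = vt.T

# The desired projection: centered data times the rotation.
projection = xc @ rotation

# The rotation is orthogonal, so the projection round-trips exactly,
# confirming it is the coordinate change we intend.
recovered = projection @ rotation.T
```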

## Page 206, Section 8.2.3 Mining associations with the arules package, Listing 8.20

`interestMeasure` now uses the argument `measure=` instead of `method=` (which now crashes).

## Page 214, Listing 9.1 Preparing Spambase data and evaluating the performance of decision trees

*This correction hasn't been made.*

Bug: F1 is the harmonic mean of precision and recall: `2*precision*recall/(precision+recall)`. See page 96, Section 5.2.1 "F1" for the correct definition.
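The corrected formula as a one-line check (a Python sketch, not the book's R code):

```python
def f1_score(precision, recall):
    """F1: the harmonic mean of precision and recall."""
    return 2.0 * precision * recall / (precision + recall)
```

F1 equals 1 only when both precision and recall are 1, and is never more than their arithmetic mean.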

## Page 224, Footnote 5

Our description of "heteroscedasticity" is wrong and (unintentionally) conflates a number of different negative issues. Our intent was to expand on a simple definition such as "In statistics, a collection of random variables is heteroscedastic if there are sub-populations that have different variabilities from others" (taken from Wikipedia: Heteroscedasticity). Unfortunately we pushed this too far (roughly saying "it is bad if errors correlate with the y's" when we meant to say "it is bad if errors correlate with the unobserved ideal y's which don't already have the error term added in," i.e., with functions of the x's).
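That simple definition can be made concrete with a tiny simulation (a Python sketch with made-up numbers): errors whose scale grows with x produce sub-populations with very different variabilities.

```python
import random
import statistics

rng = random.Random(42)
xs = [rng.uniform(1.0, 10.0) for _ in range(20000)]
# Error scale grows with x: a heteroscedastic collection of observations.
errors = [rng.gauss(0.0, x) for x in xs]

low = [e for x, e in zip(xs, errors) if x < 3.0]    # small-x sub-population
high = [e for x, e in zip(xs, errors) if x > 8.0]   # large-x sub-population
```

The two sub-populations have clearly different spreads, which is all the definition requires; by itself it says nothing about errors correlating with the y's.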

The correct thing to do is state that regression (and in particular the diagnostics of a regression) depends on a number of assumptions about the data. Important properties include (paraphrased from Gelman, Carlin, Stern, Dunson, Vehtari, and Rubin, "Bayesian Data Analysis," 3rd edition, pp. 369-376): model structure, linearity of the expected value of y as a function of the x's, bias of error terms, normality of error terms, independence of observations, and constancy of variance. For more, see "What are the key assumptions of linear regression?".

Also, the frequentist version of regression is commonly said to make no assumptions on the distribution of data or parameters (only on the distribution of errors). This is not quite true, as even the frequentist analysis depends on independence of observations (which is a fact about the x's in addition to the y's and errors). And a Bayesian treatment may need to assume some form of priors on one or more of these to get well-behaved detailed posterior estimates.

We will correct later editions of the book, and hope our error does not overly distract from the important lesson that you must be aware of modeling assumptions. We do note with some amusement how rarely "Bayesian Data Analysis" 3rd edition pp. 369-376 uses the term "heteroscedasticity" even though these are the pages tagged with this term in the index (these authors rightly emphasizing concepts over naming).

## Page 230, Listing 8.17 examining the size distribution

Remove the `binwidth=1` argument.

## Page 387, sqldf index entry

(Not an error) Another important cross-reference for sqldf is page 327, where we show an option that prevents an R crash on OSX.

© 2018 Manning Publications Co. All rights reserved.