STT592-002: Intro. to Statistical Learning
RESAMPLING METHODS (Chapter 05)
Disclaimer: This PPT is modified based on IOM 530: Intro. to Statistical Learning. "Some of the figures in this presentation are taken from 'An Introduction to Statistical Learning, with applications in R' (Springer, 2013) with permission from the authors: G. James, D. Witten, T. Hastie and R. Tibshirani."

Outline
- Cross Validation
  - The Validation Set Approach
  - Leave-One-Out Cross Validation
  - K-fold Cross Validation
  - Bias-Variance Trade-off for k-fold Cross Validation
  - Cross Validation on Classification Problems

What are resampling methods?
- Tools that involve repeatedly drawing samples from a training set and refitting a model of interest on each sample, in order to obtain more information about the fitted model.
- Model Assessment: estimate test error rates.
- Model Selection: select the appropriate level of model flexibility.
- They are computationally expensive! But these days we have powerful computers.
- Two resampling methods: Cross Validation and Bootstrapping.

Simulation study for resampling techniques
- Data: 1, 3, 4, 6, 7, 9, 12, 15, 17, 20.
- Compare three techniques on these data: the validation-set method, K-fold CV (e.g. k = 5), and LOO-CV.

5.1.1 Typical Approach: The Validation Set Approach
- Goal: find the set of variables that gives the lowest test (not training) error rate.

- With a large data set, we can randomly split the data into training and validation (testing) sets.
- Use the training set to build each model, then choose the model with the lowest error rate on the validation data.

Example: Auto Data
- Suppose that we want to predict mpg from horsepower.
- Two models:
  - mpg ~ horsepower
  - mpg ~ horsepower + horsepower^2
- Which model gives a better fit?
- Randomly split the Auto data set into training data (196 obs.) and validation data (196 obs.).

- Fit both models using the training data set.
- Then evaluate both models using the validation data set.
- The model with the lowest validation (testing) MSE is the winner!

Results: Auto Data
- Left: validation error rate for a single split.
- Right: the validation method repeated 10 times, each time with a new random split.
- There is a lot of variability among the MSEs. Not good! We need more stable methods!

The Validation Set Approach

- Advantages: simple; easy to implement.
- Disadvantages:
  - The validation MSE can be highly variable.
  - Only a subset of the observations is used to fit the model (the training data). Statistical methods tend to perform worse when trained on fewer observations.
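A minimal sketch of the validation-set approach, using synthetic data in place of the Auto set (the constant-mean baseline, the data-generating formula, and the seed are illustrative choices, not from the slides):

```python
import random

# Hypothetical synthetic data standing in for the Auto set:
# a noisy linear relationship y = 3 + 2*x + noise.
random.seed(1)
data = [(x / 10, 3 + 2 * (x / 10) + random.gauss(0, 1)) for x in range(100)]

# Randomly split into a training half and a validation half.
random.shuffle(data)
train, valid = data[:50], data[50:]

def fit_mean(pairs):
    """Baseline model: always predict the training mean of y."""
    m = sum(y for _, y in pairs) / len(pairs)
    return lambda x: m

def fit_linear(pairs):
    """Least-squares fit of y = a + b*x (closed form for one predictor)."""
    n = len(pairs)
    mx = sum(x for x, _ in pairs) / n
    my = sum(y for _, y in pairs) / n
    b = sum((x - mx) * (y - my) for x, y in pairs) / sum((x - mx) ** 2 for x, _ in pairs)
    a = my - b * mx
    return lambda x: a + b * x

def mse(model, pairs):
    return sum((y - model(x)) ** 2 for x, y in pairs) / len(pairs)

# Fit on the training half only; the model with the lower validation MSE wins.
mse_mean = mse(fit_mean(train), valid)
mse_linear = mse(fit_linear(train), valid)
```

Because the split is random, rerunning with a different seed gives a different validation MSE, which is exactly the instability the slides criticize.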

5.1.2 Leave-One-Out Cross Validation (LOOCV)

Figure 5.3 (ISLR): a schematic display of LOOCV. A set of n data points is repeatedly split into a training set (shown in blue) containing all but one observation, and a validation set (shown in beige) that contains only that observation. The test error is estimated by averaging the n resulting MSEs.

- This method is similar to the Validation Set Approach, but it tries to address the latter's disadvantages.
- For each suggested model, do:
  - Split the data set of size n into a training set of size n - 1 (blue) and a validation set of size 1 (beige).
  - Fit the model on all observations except (x1, y1), and make a prediction ŷ1 for the excluded observation. Since (x1, y1) was not used in the fitting process, MSE_1 = (y1 - ŷ1)^2 provides an approximately unbiased estimate of the test error. Even so, it is a poor estimate on its own, because it is highly variable, being based on a single observation.
  - Repeat the procedure by holding out (x2, y2), training on the other n - 1 observations {(x1, y1), (x3, y3), ..., (xn, yn)}, and computing MSE_2 = (y2 - ŷ2)^2; and so forth.
  - Repeating this process n times produces n squared errors, MSE_1, ..., MSE_n. The LOOCV estimate of the test MSE is the average of these n test error estimates:
    CV(n) = (1/n) * sum_{i=1}^{n} MSE_i.   (5.1)

LOOCV vs. the Validation Set Approach
- LOOCV has less bias: we repeatedly fit the statistical learning method using training data that contains n - 1 obs., i.e. almost all the data set is used each time.
- LOOCV produces a less variable MSE: the validation approach produces a different MSE each time it is applied, due to randomness in the splitting process, while performing LOOCV multiple times will always yield the same results, because we split off 1 obs. at a time with no randomness.
- LOOCV is computationally intensive (a disadvantage): we fit each model n times!
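The LOOCV loop and equation (5.1) can be sketched as follows, again on synthetic data with a simple closed-form linear fit (both are illustrative choices):

```python
import random

# Hypothetical data: y roughly linear in x, with modest noise.
random.seed(2)
data = [(float(x), 1 + 0.5 * x + random.gauss(0, 0.3)) for x in range(20)]

def fit_linear(pairs):
    """Least-squares fit of y = a + b*x (closed form for one predictor)."""
    n = len(pairs)
    mx = sum(x for x, _ in pairs) / n
    my = sum(y for _, y in pairs) / n
    b = sum((x - mx) * (y - my) for x, y in pairs) / sum((x - mx) ** 2 for x, _ in pairs)
    a = my - b * mx
    return lambda x: a + b * x

# LOOCV: leave each observation out in turn, fit on the other n-1,
# and record the squared error on the held-out point.
errors = []
for i in range(len(data)):
    held_x, held_y = data[i]
    model = fit_linear(data[:i] + data[i + 1:])
    errors.append((held_y - model(held_x)) ** 2)

cv_n = sum(errors) / len(errors)   # CV(n) = (1/n) * sum of MSE_i, eq. (5.1)
```

Note there is no randomness in the splits: rerunning this loop on the same data always gives the same CV(n).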

5.1.3 k-fold Cross Validation

Figure 5.5 (ISLR): a schematic display of 5-fold CV. A set of n observations is randomly split into five non-overlapping groups. Each of these fifths acts in turn as a validation set (shown in beige), with the remainder as a training set (shown in blue). The test error is estimated by averaging the five resulting MSE estimates.

- LOOCV is computationally intensive (the model has to be refit n times), so we can run k-fold Cross Validation instead.
- With k-fold Cross Validation, we randomly divide the set of observations into K groups, or folds, of approximately equal size (e.g. K = 5, or K = 10, etc.).
- The first fold is treated as a validation set: the model is fit on the remaining K - 1 folds, and the mean squared error, MSE_1, is computed on the observations in the held-out fold (i.e. we see how good the predictions are on the left-out part).
- This procedure is repeated K times; each time, a different fold of observations is treated as the validation set. The process results in K estimates of the test error: MSE_1, MSE_2, ..., MSE_K.
- The k-fold CV estimate of the test error is computed by averaging these values:
  CV(k) = (1/k) * sum_{i=1}^{k} MSE_i.   (5.3)
- It is not hard to see that LOOCV is a special case of k-fold CV in which k = n.
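A sketch of the k-fold procedure and equation (5.3). The fold assignment by shuffled index striding and the linear model are illustrative choices:

```python
import random

# Hypothetical data with a linear trend.
random.seed(3)
data = [(float(x), 2 * x + random.gauss(0, 0.5)) for x in range(30)]

def fit_linear(pairs):
    """Least-squares fit of y = a + b*x (closed form for one predictor)."""
    n = len(pairs)
    mx = sum(x for x, _ in pairs) / n
    my = sum(y for _, y in pairs) / n
    b = sum((x - mx) * (y - my) for x, y in pairs) / sum((x - mx) ** 2 for x, _ in pairs)
    a = my - b * mx
    return lambda x: a + b * x

# Randomly partition the indices into K non-overlapping folds.
K = 5
idx = list(range(len(data)))
random.shuffle(idx)
folds = [idx[i::K] for i in range(K)]

fold_mses = []
for fold in folds:
    held = set(fold)
    train = [data[i] for i in idx if i not in held]   # fit on the other K-1 folds
    model = fit_linear(train)
    fold_mses.append(sum((data[i][1] - model(data[i][0])) ** 2 for i in fold) / len(fold))

cv_k = sum(fold_mses) / K   # CV(k) = (1/k) * sum of the K fold MSEs, eq. (5.3)
```

Setting K = len(data) in this code recovers LOOCV as the special case k = n.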

Auto Data: LOOCV vs. K-fold CV
- Left: the LOOCV error curve.
- Right: 10-fold CV run many times; the figure shows the slightly different CV error curves (each run uses a different random split).
- LOOCV is a special case of k-fold CV, where k = n.
- Both are stable, but LOOCV is more computationally intensive!

Auto Data: Validation Set Approach vs. K-fold CV Approach
- Left: the Validation Set Approach. Right: the 10-fold Cross Validation approach.
- Indeed, 10-fold CV is more stable!

K-fold Cross Validation on Three Simulated Data Sets
- Blue: true test MSE. Black: LOOCV MSE. Orange: 10-fold CV MSE.
- Refer to Chapter 2 (Figs. 2.9, 2.10, and 2.11) for the corresponding top graphs.

THE BIAS-VARIANCE TRADE-OFF
Let's go back to Chapter 02 (Statistical Learning): where does the error come from?
Disclaimer: this part is modified based on Dr. Hung-yi Lee, http://speech.ee.ntu.edu.tw/~tlkagk/courses_ML17.html

Estimator: bias + variance
- From the training data, we find f̂, which is an estimator of the true function f.

Bias and variance of an estimator
- Example: estimate the mean of a variable x; assume the mean of x is μ and the variance of x is σ².
- Estimator of the mean: sample N points x_1, ..., x_N and compute m = (1/N) Σ x_i.
- In general m ≠ μ, but E[m] = (1/N) Σ E[x_i] = μ, so m is an unbiased estimator of μ.
- The variance of the estimator depends on the number of samples: Var[m] = σ²/N. With smaller N the estimates scatter widely around μ; with larger N they concentrate near μ.
- Estimator of the variance: first compute m = (1/N) Σ x_i, then s² = (1/N) Σ (x_i − m)².
- s² is a biased estimator of σ²: E[s²] = ((N − 1)/N) σ² ≠ σ², although the bias shrinks as N increases.
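These two facts (the sample mean m is unbiased for μ, while s² = (1/N) Σ (x_i − m)² has expectation ((N − 1)/N) σ²) are easy to check numerically. A Monte Carlo sketch, assuming μ = 0 and σ² = 1; N and the repetition count are arbitrary choices:

```python
import random

# Draw many samples of size N from a distribution with mean 0 and
# variance 1, and average the two estimators across repetitions.
random.seed(0)
N, reps = 5, 20000
mean_estimates, var_estimates = [], []
for _ in range(reps):
    xs = [random.gauss(0, 1) for _ in range(N)]
    m = sum(xs) / N
    s2 = sum((x - m) ** 2 for x in xs) / N   # note: divide by N, not N - 1
    mean_estimates.append(m)
    var_estimates.append(s2)

avg_m = sum(mean_estimates) / reps    # close to mu = 0: unbiased
avg_s2 = sum(var_estimates) / reps    # close to (N-1)/N * sigma^2 = 0.8: biased
```

With N = 5 the average of s² settles near 0.8, not 1, which is exactly the (N − 1)/N bias; dividing by N − 1 instead would remove it.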

Variance of f̂
- Fit f̂ from 100 different samples, under models of increasing flexibility:
  y = b + w·x_cp
  y = b + w1·x_cp + w2·(x_cp)^2 + w3·(x_cp)^3
  y = b + w1·x_cp + w2·(x_cp)^2 + w3·(x_cp)^3 + w4·(x_cp)^4 + w5·(x_cp)^5
- The degree-1 model y = b + w·x_cp shows small variance across samples; the degree-5 model shows large variance.
- A simpler model is less influenced by the sampled data; the degree-5 fit is the extreme case in the other direction.

Bias of f̂
- Bias: if we average all the fitted f̂ to get f̄ = E[f̂], is f̄ close to the true f?
- In the figures: the black curve is the true function, the red curves are 5000 fitted f̂, and the blue curve is the average of the 5000 fits, shown for degree 1, degree 3, and degree 5.
- The simple model y = b + w·x_cp has large bias; the flexible degree-5 model has small bias.

What to do with large bias? Diagnosis:
- If your model cannot even fit the training examples, then you have large bias: underfitting.
- If you can fit the training data but have large error on the testing data, then you probably have large variance: overfitting.
- For large bias, redesign your model: add more features as input, or use a more complex model.

What to do with large variance?
- More data: very effective, but not always practical (compare fits from 10 examples with fits from 100 examples).
- Regularization: may increase bias.

Bias vs. Variance
- Observed error = error from bias + error from variance.
- Horizontal axis: model complexity. Moving right, bias falls (large bias → small bias) while variance rises (small variance → large variance): underfitting on the left, overfitting on the right.
- Consequently, the average error on testing data comes from both "bias" and "variance": a more complex model does not always lead to better performance on testing data.

Cross Validation
- The testing set often has a public part (whose error you can see) and a private part (which you cannot).
- Using the results on the public testing data to tune your model is not recommended: you are making the public set better than the private set.
- Instead, carve a validation set out of the training set and compare models on it, e.g. Model 1: Err = 0.9, Model 2: Err = 0.7, Model 3: Err = 0.5.

- If you instead tune on the public testing set, the private-set error can exceed the value you saw (Err > 0.5).

N-fold Cross Validation
- Split the training set into N folds (here N = 3), rotate which fold is used for validation, and average the errors:

  Split (fold roles)      Model 1   Model 2   Model 3
  Train / Train / Val       0.2       0.4       0.4
  Train / Val / Train       0.4       0.5       0.5
  Val / Train / Train       0.3       0.6       0.3
  Avg Err                   0.3       0.5       0.4

- Choose the model with the lowest average validation error (here Model 1), then evaluate it once on the public and private testing sets.
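The fold-rotation selection shown in the table above can be sketched as follows. The three candidate models (constant mean, linear fit, 1-nearest neighbour) and the data are hypothetical stand-ins for the slide's Models 1-3:

```python
import random

# Hypothetical data: a noisy linear trend.
random.seed(4)
data = [(i / 5, 1.0 + 2.0 * (i / 5) + random.gauss(0, 0.5)) for i in range(30)]

def fit_mean(train):
    m = sum(y for _, y in train) / len(train)
    return lambda x: m

def fit_linear(train):
    n = len(train)
    mx = sum(x for x, _ in train) / n
    my = sum(y for _, y in train) / n
    b = sum((x - mx) * (y - my) for x, y in train) / sum((x - mx) ** 2 for x, _ in train)
    a = my - b * mx
    return lambda x: a + b * x

def fit_1nn(train):
    # Predict the y of the nearest training x (a flexible, high-variance model).
    return lambda x: min(train, key=lambda p: abs(p[0] - x))[1]

candidates = {"mean": fit_mean, "linear": fit_linear, "1-nn": fit_1nn}

# Three folds; each model is scored on every fold and the errors averaged.
idx = list(range(len(data)))
random.shuffle(idx)
folds = [idx[i::3] for i in range(3)]

avg_err = {}
for name, fit in candidates.items():
    errs = []
    for fold in folds:
        held = set(fold)
        model = fit([data[i] for i in idx if i not in held])
        errs.append(sum((data[i][1] - model(data[i][0])) ** 2 for i in fold) / len(fold))
    avg_err[name] = sum(errs) / len(errs)

best = min(avg_err, key=avg_err.get)   # lowest average validation error wins
```

Only after `best` is chosen would you touch the testing set, and only once.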

5.1.4 Bias-Variance Trade-off for k-fold CV
- Putting aside that LOOCV is more computationally intensive than k-fold CV: which is better, LOOCV or k-fold CV?
- LOOCV has less bias than k-fold CV (when k < n).
- But LOOCV has higher variance than k-fold CV (when k < n).
- Thus, there is a trade-off between what to use.
- Conclusion: we tend to use k-fold CV with K = 5 or K = 10. These are the magical Ks: it has been empirically shown that they yield test error rate estimates that suffer neither from excessively high bias, nor from very high variance.

5.1.5 Cross Validation on Classification Problems

- So far, we have been dealing with CV on regression problems.
- We can use cross validation in a classification situation in a similar manner:
  - Divide the data into K parts.
  - Hold out one part; fit using the remaining data and compute the error rate on the held-out data.
  - Repeat K times.
  - The CV error rate is the average over the K errors we have computed.

CV to Choose the Order of a Polynomial
- The data set used is simulated (refer to Fig. 2.13). The purple dashed line is the Bayes boundary. Bayes error rate: 0.133.

- Linear logistic regression (degree 1) is not able to fit the Bayes decision boundary. Error rate: 0.201.
- Quadratic logistic regression does better than linear. Error rate: 0.197.
- Using cubic and quartic predictors, the accuracy of the model improves. Error rates: 0.160 and 0.162.

CV to Choose the Order: Logistic Regression and KNN
- Brown: test error. Blue: training error. Black: 10-fold CV error.
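The classification CV recipe (K parts, 0/1 error rate on each held-out part, average over folds) can be sketched with a simple 1-nearest-neighbour classifier on hypothetical one-dimensional two-class data:

```python
import random

# Hypothetical two-class data: class 0 centered at 0, class 1 centered at 3.
random.seed(5)
data = ([(random.gauss(0, 1), 0) for _ in range(25)] +
        [(random.gauss(3, 1), 1) for _ in range(25)])

def fit_1nn(train):
    # Classify x by the label of the nearest training point.
    return lambda x: min(train, key=lambda p: abs(p[0] - x))[1]

# Randomly partition the indices into K folds.
K = 5
idx = list(range(len(data)))
random.shuffle(idx)
folds = [idx[i::K] for i in range(K)]

fold_err = []
for fold in folds:
    held = set(fold)
    clf = fit_1nn([data[i] for i in idx if i not in held])
    wrong = sum(clf(data[i][0]) != data[i][1] for i in fold)   # 0/1 loss
    fold_err.append(wrong / len(fold))

cv_error = sum(fold_err) / K   # CV error rate: average over the K folds
```

The only change from the regression version is the loss: a misclassification count replaces the squared error.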

The Bootstrap
- The bootstrap is a widely applicable and extremely powerful statistical tool.
- It can be used to quantify the uncertainty associated with a given estimator or statistical learning method.
- Ideally, we could repeatedly obtain independent data sets from the population. In practice this will not work, because for real data we can NOT generate new samples from the original population.
- Instead, we obtain distinct data sets by repeatedly sampling observations from the original data set.
- The sampling is performed with replacement, which means that the same observation can occur more than once in a bootstrap data set.

The Bootstrap: Examples #1 and #2
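A sketch of the bootstrap in this spirit: estimate the standard error of the sample mean for the toy data from the earlier simulation study, by resampling with replacement (the number of bootstrap data sets B is an arbitrary choice):

```python
import random

# The toy data used earlier in this deck.
random.seed(6)
observed = [1, 3, 4, 6, 7, 9, 12, 15, 17, 20]

# Draw B bootstrap data sets, each the same size as the original,
# sampled WITH replacement, and record the mean of each.
B = 2000
boot_means = []
for _ in range(B):
    resample = random.choices(observed, k=len(observed))
    boot_means.append(sum(resample) / len(resample))

# The spread of the bootstrap means estimates the standard error of the mean.
grand = sum(boot_means) / B
se_boot = (sum((m - grand) ** 2 for m in boot_means) / (B - 1)) ** 0.5
```

Because the sampling is with replacement, some observations appear several times in a given resample and others not at all; that is what makes the B data sets differ.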

STT450-550: Statistical Data Mining — Toy examples (R):

set.seed(100); sample(1:3, 3, replace = TRUE)
set.seed(15);  sample(1:3, 3, replace = TRUE)
set.seed(594); sample(1:3, 3, replace = TRUE)
set.seed(500); sample(1:3, 3, replace = TRUE)
set.seed(200); sample(1:3, 3, replace = TRUE)

# Each call draws a sample of size 3 from {1, 2, 3} with replacement
# (the bootstrap sampling scheme); set.seed makes the draw reproducible.

5-fold CV for Time Series Data
Blogs:
- https://stats.stackexchange.com/questions/14099/using-k-fold-cross-validation-for-time-series-model-selection
- http://francescopochetti.com/pythonic-cross-validation-time-series-pandas-scikit-learn/
Papers:
- https://www.sciencedirect.com/science/article/pii/S0020025511006773
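The linked posts discuss why plain k-fold CV leaks future information for time series: a random fold can place future observations in the training set. A dependency-free sketch of the forward-chaining alternative they describe (the function name and equal split sizes are illustrative choices):

```python
# Forward-chaining splits for time-ordered data: each training window
# contains only observations that PRECEDE the validation window, so the
# model never sees the future.
def forward_chain_splits(n, k):
    """Yield (train_indices, test_indices) for k expanding-window splits."""
    fold = n // (k + 1)
    for i in range(1, k + 1):
        train = list(range(0, i * fold))                    # everything so far
        test = list(range(i * fold, min((i + 1) * fold, n)))  # the next block
        yield train, test

# Example: 10 time steps, 4 splits; the training window grows each round.
splits = list(forward_chain_splits(10, 4))
```

Averaging the validation errors over these splits plays the role of CV(k), while respecting the arrow of time.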