This lesson is still being designed and assembled (Pre-Alpha version)

Applied Machine Learning With Tidymodels

Introduction

Overview

Teaching: 0 min
Exercises: 0 min
Questions
  • What is the purpose of this course?

Objectives
  • We are here to learn how to implement machine learning models with tidymodels.

What is Machine Learning?

Machine learning has received a lot of attention in recent years. It is a field of study that gives computers the ability to learn without being explicitly programmed. It is a subset of artificial intelligence (AI) and is often used to make predictions or classifications. Machine learning is used in many fields, including medicine, finance, and marketing.

Under the hood, machine learning algorithms are just a series of mathematical equations trying to find patterns in data. Once these patterns are represented in the form of a model, they can be used to make predictions or classifications on new data.

What is a Model?

A model is some sort of mathematical or logical system used to represent a real-world phenomenon. For example, a linear regression model is a mathematical equation that represents the relationship between two variables. The model can then be used to make predictions about what will happen in the future.

Models generally have some assumptions or rules that they assume are true. For example, a linear regression model assumes that the relationship between two variables is linear. If this assumption is not true, then the model will not be accurate.

When trying to model a phenomenon, it is assumed that there is some underlying process that generates the data. The model is then used to represent this process. This process or relationship is often referred to as the “structure” of the problem. As well as structure, there is also “noise” in the data. This is the random variation that is not explained by the model. Teh aim of a model is to capture the structure of the data and ignore the noise.

Parameters and Hyperparameters

A model has two types of parameters: parameters and hyperparameters. Parameters are the values that are estimated from the data. For example, in a linear regression model, the parameters are the intercept and slope. Hyperparameters are the values that are set by the user. For example, in a linear regression model, the hyperparameters are the number of variables to include in the model and the type of regression to use.

We have to be careful when choosing hyperparameters because they can have a big impact on the performance of the model. For example, if we choose too many variables to include in the model, then it will be overfit to the data and will not generalize well to new data. If we choose too few variables, then it will be underfit to the data and will not capture the structure of the data. The choice of model itself can also be considered to be a hyperparameter, so we have to be careful when choosing the model as well and make sure that it is appropriate for the problem at hand.

What is a Workflow?

A workflow is a series of steps that are performed in order to achieve a goal. In machine learning, a workflow is a series of steps that are performed in order to build a model. The steps in a workflow can be broken down into three categories: data preparation, model building, and model evaluation.

In this course we are going to use the tidymodels framework to build our models. The tidymodels framework is a collection of packages that are designed to work together to make it easier to build models. The packages in the tidymodels framework are:

As well as these packages, we will also be using the tidyverse packages for data manipulation and visualization.

Example of a Workflow

Why use Tidymodels?

The tidymodels framework is designed to make it easier to build models. It does this by providing a consistent interface for all of the packages in the framework. This means that you can use the same code to build different types of models. For example, you can use the same code to build a linear regression model and a random forest model. This makes it easier to compare different types of models and choose the best one for your problem.

Loading Data

library(tidyverse)
library(tidymodels)
library(mlbench)
library(caret)

data(BostonHousing)
housing <- BostonHousing |>
    as_tibble()

housing |>
    glimpse()
Rows: 506
Columns: 14
$ crim    <dbl> 0.00632, 0.02731, 0.02729, 0.03237, 0.06905, 0.02985, 0.08829, 0.14455, 0.21124, 0.17004, 0.2248…
$ zn      <dbl> 18.0, 0.0, 0.0, 0.0, 0.0, 0.0, 12.5, 12.5, 12.5, 12.5, 12.5, 12.5, 12.5, 0.0, 0.0, 0.0, 0.0, 0.0…
$ indus   <dbl> 2.31, 7.07, 7.07, 2.18, 2.18, 2.18, 7.87, 7.87, 7.87, 7.87, 7.87, 7.87, 7.87, 8.14, 8.14, 8.14, …
$ chas    <fct> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
$ nox     <dbl> 0.538, 0.469, 0.469, 0.458, 0.458, 0.458, 0.524, 0.524, 0.524, 0.524, 0.524, 0.524, 0.524, 0.538…
$ rm      <dbl> 6.575, 6.421, 7.185, 6.998, 7.147, 6.430, 6.012, 6.172, 5.631, 6.004, 6.377, 6.009, 5.889, 5.949…
$ age     <dbl> 65.2, 78.9, 61.1, 45.8, 54.2, 58.7, 66.6, 96.1, 100.0, 85.9, 94.3, 82.9, 39.0, 61.8, 84.5, 56.5,…
$ dis     <dbl> 4.0900, 4.9671, 4.9671, 6.0622, 6.0622, 6.0622, 5.5605, 5.9505, 6.0821, 6.5921, 6.3467, 6.2267, …
$ rad     <dbl> 1, 2, 2, 3, 3, 3, 5, 5, 5, 5, 5, 5, 5, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, …
$ tax     <dbl> 296, 242, 242, 222, 222, 222, 311, 311, 311, 311, 311, 311, 311, 307, 307, 307, 307, 307, 307, 3…
$ ptratio <dbl> 15.3, 17.8, 17.8, 18.7, 18.7, 18.7, 15.2, 15.2, 15.2, 15.2, 15.2, 15.2, 15.2, 21.0, 21.0, 21.0, …
$ b       <dbl> 396.90, 396.90, 392.83, 394.63, 396.90, 394.12, 395.60, 396.90, 386.63, 386.71, 392.52, 396.90, …
$ lstat   <dbl> 4.98, 9.14, 4.03, 2.94, 5.33, 5.21, 12.43, 19.15, 29.93, 17.10, 20.45, 13.27, 15.71, 8.26, 10.26…
$ medv    <dbl> 24.0, 21.6, 34.7, 33.4, 36.2, 28.7, 22.9, 27.1, 16.5, 18.9, 15.0, 18.9, 21.7, 20.4, 18.2, 19.9, …

Key Points

  • tidymodels is a collection of packages that are designed to work together to make it easier to build models.


Machine Learning Basics

Overview

Teaching: 0 min
Exercises: 0 min
Questions
  • Why do we split data into training and testing sets?

  • When do we use validation sets?

  • What is cross validation?

  • How do we evaluate a model?

  • What is bias and variance?

Objectives
  • Revise basic concepts in machine learning

Test/Train Split

Creating a model is all well and good, but how do we know that it is actually going to be useful in the real world? Essentially, we are asking how well does our model generalise to new data? To get an estimate of this, we can split our data into two sets: a training set and a test set. We can then train our model on the training set and evaluate it on the test set. This is known as the test/train split.

What proportion of the data should be used for training?

Discuss in small groups what proportion of data should be used for training. What are the advantages and disadvantages of using more or less data for training? If you have a lot of data, is it better to use more or less for training?

Test/Train Split

Validation Set

Let’s say we have two different models and we want to know which one to choose from, or maybe even two versions of the same type of model but with different hyperparameters, how do we know which one is best? At first this seems like a simple question, we just evaluate them on the test set and choose the one with the best performance. However, this is not a good idea. The test set is used to evaluate the final model, and so we should not use it to choose the model. If we do, we are likely to overfit to the test set and our model will not generalise well to new data.

Instead, we can split our data into three sets: a training set, a validation set and a test set. We can then train our models on the training set, evaluate them on the validation set and choose the best one. We can then evaluate the final model on the test set.

Test/Validation/Train Split

Cross Validation

Cross validation is a more advanced method for estimating the performance of a model. It is useful when we have a small dataset and we want to get a more accurate estimate of the performance of our model. It is also useful when we want to tune the hyperparameters of our model.

The basic idea is to split the data into k folds. We then train our model on k-1 folds and evaluate it on the remaining fold. We repeat this process k times, each time using a different fold for evaluation. We can then average the results to get an estimate of the performance of our model.

Cross Validation

Model Evaluation

There are many different ways to evaluate a model. The most common are error, accuracy, confusion matrix, sensitivity and specificity. Which one you use depends on the type of problem you are trying to solve.

Error

For regression tasks, the performance of a model is usually determined by the error. There are many different types of error, but the most common is the mean squared error (MSE). This is simply the mean of the squared differences between the true values and the predicted values. The reason for squaring the differences is that it penalises large errors more than small errors.

Accuracy

Accuracy is the simplest metric for evaluating a model. For classification tasks, it is simply the proportion of predictions that are correct. For regression tasks, it is the proportion of predictions that are within a certain threshold of the true value. For example, if we are trying to predict the price of a house, we might say that the prediction is correct if it is within £10,000 of the true value.

Confusion Matrix

A confusion matrix is used to visualise the performance of a classification model. It shows the number of true positives, true negatives, false positives and false negatives. It is useful for understanding where a model is going wrong. For example, if a model is predicting a lot of false positives, it might be useful to look at the false positives and see if there is a pattern. Perhaps the model is predicting a lot of false positives for a particular class, in which case it might be useful to look at the data for that class and see if there is a problem with the data.

Confusion matrices can be extended for multiclass problems. For example, if we have three classes, we can create a 3x3 matrix where the rows represent the true classes and the columns represent the predicted classes.

Confusion Matrix

Sensitivity and Specificity

Sensitivity and specificity are useful metrics for evaluating models when the classes are imbalanced. For example, if we are trying to predict whether a patient has a disease or not, we might have a dataset where 99% of the patients do not have the disease and only 1% do. If we were to create a model that always predicted that the patient did not have the disease, it would be 99% accurate, but it would not be very useful. Sensitivity and specificity are useful metrics for evaluating models in these situations.

Both of these can be considered as conditional probabilities. Sensitivity is the probability that the model predicts a positive result given that the true result is positive. Specificity is the probability that the model predicts a negative result given that the true result is negative. That is to say, sensitivity is the probability of a true positive and specificity is the probability of a true negative.

Plotting sensitivity and specificity against one another is often used to produce a receiver operating characteristic (ROC) curve. The area under the ROC curve (AUC) is a useful metric for evaluating models, but is not covered in this short course.

Bias and Variance

Bias and variance are useful concepts for understanding how a model will perform on new data. Bias is the difference between the expected value of the predictions and the true value. Variance is the variability of the predictions. A model with high bias will tend to underfit the data, whereas a model with high variance will tend to overfit the data.

Thinking back to the ideas of structure and noise, we can see that bias is related to structure and variance is related to noise. A model with high bias will tend to not capture all of the structure and a model with high variance will tend to have a lot of noise.

Bias and Variance

Key Points

  • Testing sets are used to estimate a models performance on new data

  • Validation sets are used to choose between different models or hyperparameters

  • Cross validation is a more advanced method for estimating the performance of a model

  • Accuracy is a simple metric for evaluating a model

  • Confusion matrices are a more detailed way of evaluating a model

  • Precision and recall are useful metrics for evaluating models when the classes are imbalanced

  • Bias and variance are useful concepts for understanding how a model will perform on new data


Data Preparation

Overview

Teaching: 0 min
Exercises: 0 min
Questions
  • How do you split your data?

  • What is a recipe?

Objectives
  • Learn how to split your data.

  • Learn how to create a recipe.

Splitting Your Data

Test/Train Split

The first thing that you need to do when you have your data is to split it into a training and testing set. This is so that you can train your model on one set of data and then test it on another set of data. This is important because if you train your model on all of your data then you won’t have any data left to test it on. This means that you won’t be able to tell if your model is overfitting to your data.

housing_split <- housing |> 
  initial_split(prop = 0.8)

housing_train <- training(housing_split)

housing_test <- testing(housing_split)

housing_train
# A tibble: 404 × 14
      crim    zn indus chas    nox    rm   age   dis   rad   tax ptratio     b lstat  medv
     <dbl> <dbl> <dbl> <fct> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>   <dbl> <dbl> <dbl> <dbl>
 1  0.207      0 27.7  0     0.609  5.09  98    1.82     4   711    20.1 318.  29.7    8.1
 2 11.8        0 18.1  0     0.718  6.82  76.5  1.79    24   666    20.2  48.4 22.7    8.4
 3  0.142      0  6.91 0     0.448  6.17   6.6  5.72     3   233    17.9 383.   5.81  25.3
 4  0.0883     0 10.8  0     0.413  6.42   6.6  5.29     4   305    19.2 384.   6.72  24.2
 5  0.127      0  6.91 0     0.448  6.77   2.9  5.72     3   233    17.9 385.   4.84  26.6
 6  0.0918     0  4.05 0     0.51   6.42  84.1  2.65     5   296    16.6 396.   9.04  23.6
 7  0.0178    95  1.47 0     0.403  7.14  13.9  7.65     3   402    17   384.   4.45  32.9
 8  4.67       0 18.1  0     0.713  5.98  87.9  2.58    24   666    20.2  10.5 19.0   12.7
 9  0.141      0 13.9  0     0.437  5.79  58    6.32     4   289    16   397.  15.8   20.3
10  0.0689     0  2.46 0     0.488  6.14  62.2  2.60     3   193    17.8 397.   9.45  36.2
# ℹ 394 more rows
# ℹ Use `print(n = ...)` to see more rows

Validation Splits

You can also use rsample to create validation splits. These are useful if you want to tune your model hyperparameters. You can use vfold_cv to create a validation split. This will create a number of folds (5 by default) and then repeat this a number of times (3 by default). You can then use these folds to tune your model (as wel will see in a later episode).

housing_folds <- vfold_cv(housing_train, v = 5, repeats = 3)

housing_folds
#  5-fold cross-validation repeated 3 times 
# A tibble: 15 × 3
   splits           id      id2  
   <list>           <chr>   <chr>
 1 <split [323/81]> Repeat1 Fold1
 2 <split [323/81]> Repeat1 Fold2
 3 <split [323/81]> Repeat1 Fold3
 4 <split [323/81]> Repeat1 Fold4
 5 <split [324/80]> Repeat1 Fold5
 6 <split [323/81]> Repeat2 Fold1
 7 <split [323/81]> Repeat2 Fold2
 8 <split [323/81]> Repeat2 Fold3
 9 <split [323/81]> Repeat2 Fold4
10 <split [324/80]> Repeat2 Fold5
11 <split [323/81]> Repeat3 Fold1
12 <split [323/81]> Repeat3 Fold2
13 <split [323/81]> Repeat3 Fold3
14 <split [323/81]> Repeat3 Fold4
15 <split [324/80]> Repeat3 Fold5

Preprocessing Your Data

Preprocessing your data should take you a really really really long time. Machine learning has a ‘garbage in = garbage out’ philosophy, which is to say that if you don’t put useful information into your model then you won’t get useful insights from your model.

Data preprocessing starts before you collect your data. When planning experiments or surveys it is important to have a clear question in mind and ensure that you collect all of the data that you need. Machine learning is often synonimised with ‘Big Data’ and therefore comes with the misconception that giving your model more data will always help it improve. Giving your model more data points (more observations) is beneficial to model performance, but there is an effect called the ‘Hughes Phenomynon” that says giving your model more variables will help to a point, but then having to deal with all of these variables will decrease the power of your model and will have a negative effect on performance.

Once you have collected your data, you might find that some (or all of) it needs transforming before it can be fed into a model. This can be done by hand, or there are ways to automate the process. To start investigating our data we can use the skimr package to create a short summary of each variable.

What Sort of Processing Might This (or a Similar) Dataset Need?

In small groups or on your own, look at the summary informations and discuss what sort of processing this data, or a similar data set might need. Remember that we will be applying our model to unseen data and this may have different properties to the data before you.

# library(skimr)

skim(housing_train)
── Data Summary ────────────────────────
                           Values       
Name                       housing_train
Number of rows             404          
Number of columns          14           
_______________________                 
Column type frequency:                  
  factor                   1            
  numeric                  13           
________________________                
Group variables            None         

── Variable type: factor ─────────────────────────────────────────────────────────────────────────────────
  skim_variable n_missing complete_rate ordered n_unique top_counts   
1 chas                  0             1 FALSE          2 0: 375, 1: 29

── Variable type: numeric ────────────────────────────────────────────────────────────────────────────────
   skim_variable n_missing complete_rate    mean      sd        p0      p25     p50     p75    p100 hist 
 1 crim                  0             1   3.42    7.55    0.00632   0.0861   0.263   3.69   73.5   ▇▁▁▁▁
 2 zn                    0             1  10.4    22.5     0         0        0       3.12  100     ▇▁▁▁▁
 3 indus                 0             1  11.4     6.95    0.74      5.19     9.69   18.1    27.7   ▇▆▁▇▁
 4 nox                   0             1   0.556   0.117   0.389     0.453    0.538   0.624   0.871 ▇▇▅▃▁
 5 rm                    0             1   6.28    0.717   3.56      5.88     6.19    6.62    8.78  ▁▂▇▂▁
 6 age                   0             1  69.0    28.3     2.9      45.8     79.1    94.1   100     ▂▂▂▂▇
 7 dis                   0             1   3.78    2.11    1.13      2.10     3.17    5.14   12.1   ▇▅▂▁▁
 8 rad                   0             1   9.56    8.74    1         4        5      24      24     ▇▂▁▁▃
 9 tax                   0             1 409.    170.    187       280.     330     666     711     ▇▇▃▁▇
10 ptratio               0             1  18.5     2.20   12.6      17       19.1    20.2    22     ▁▃▃▅▇
11 b                     0             1 357.     90.4     0.32    375.     391.    396.    397.    ▁▁▁▁▇
12 lstat                 0             1  12.7     7.22    1.73      7.09    11.2    16.9    38.0   ▇▇▃▂▁
13 medv                  0             1  22.7     9.35    5        17.0     21.2    25      50     ▂▇▃▁▁

In ‘tidymodels’, we can use the ‘recipes’ package to create data preprocessing steps that can be pre-appended to the model. This means that when we give data to the model, the model will first use this recipe to transform the new observations and will then feed this tranformed data into the model and use that to make predictions.

What are the Advantages/Disadvantages of Using Data Preprocessors Over Manually Processing Data?

What are the advantages and disadvantages of using data preprocessing as part of a workflow are compared to processing new data by hand.

In this section we are going to build an extremely basic recipe to preprocess our data. You can create a recipe using the recipe function. This takes two arguments, the first is a formula stating what variable you are going to be predicting and what variables you would like to use as predictors (the . syntax is shorthand for saying “use all other variables as predictors”) and the second is the data that you would like to use to create the data preprocessor.

Why are We Only Using the Training Data?

Why are we only using the training data to create our data preprocessor and not the full data set? What would be the problems (if any) of including the testing data in the training of the preprocessor?

housing_rec <- recipe(medv ~ ., data = housing_train)

housing_rec
── Recipe ───────────────────────────────────────────────────────────────────────────────────────

── Inputs 
Number of variables by role
outcome:    1
predictor: 13

At the moment, the only information encoded in this recipe is what variable we are trying to predict and what variables we are using for prediction. One fairly standard transformation we may like to apply to our model is normalisation (or centering and scaling), to do this we create a recipe “step”. We pipe our recipe object into the step and it will create a sequence of instructions written in orser that we will apply to our data (just the same as piping multiple dplyr commands together). The function for normalising data is step_normalize, however we also need to tell it what variables we would like it to normalise. To do this you can explicitly type the name of the variable, use any of the dplyr selector functions or use some custom selector functions given by recipes. Here I have use a recipes selector function called all_numeric_predictors() that will transform any function registered as a predictor that is also numeric. You could also try using all_numeric() that will transform all variables, whether they are predictors or not.

housing_rec <- recipe(medv ~ ., data = housing_train) |>
  step_normalize(all_numeric_predictors())

housing_rec
── Recipe ───────────────────────────────────────────────────────────────────────────────────────

── Inputs 
Number of variables by role
outcome:    1
predictor: 13

── Operations 
• Centering and scaling for: all_numeric_predictors()

There is a problem with this though. If we have one numeric variable where all of the entries are the same, then this will have zero variance and so when we try and normalise it we will get a divide by zero error. On top of this, a variable where all of the entries are the same is not useful for training a model, so we should probably remove them. To do this we can use a near zero variance filter, this is specified using the step_nzv function and this time we can apply it to all predictors as it also isn’t useful if we have a categorical variable that only contains one category. Note: we have put our near zero variance filter before the normalisation step as we would like the filter to be applied first.

housing_rec <- recipe(medv ~ ., data = housing_train) |>
  step_nzv(all_predictors()) |> 
  step_normalize(all_numeric_predictors())

housing_rec
── Recipe ───────────────────────────────────────────────────────────────────────────────────────

── Inputs 
Number of variables by role
outcome:    1
predictor: 13

── Operations 
• Sparse, unbalanced variable filter on: all_predictors()
• Centering and scaling for: all_numeric_predictors()

There are lots of different recipe steps that you can apply (you can see them by typing recipes::step_ and scrolling through the auto complete) and we can’t cover all of them here. So we will just give another example of how you might go through your data preprocessing. Looking at the outputs of our skim function, we can see that the variable crim has a heavy positive skew. This might indicate that a log transform could be useful here. Let’s compare the raw value of crim with its log transform.

housing_train |>
  select(crim) |> 
  mutate(log_crim = log(crim)) |> 
  pivot_longer(cols = everything()) |> 
  ggplot() +
  aes(
    x = value
  ) +
  geom_histogram() +
  facet_wrap(
    ~ name,
    scales = "free",
    ncol = 1
  )

Comparison of Treated Variables

Looking at this graph, it’s likely the log transform of crim will be more useful than the raw values, so rather than manually transforming the data using dplyr::mutate we can add a recipe step to do it for us.

housing_rec <- recipe(medv ~ ., data = housing_train) |>
  step_log(crim) |> 
  step_nzv(all_predictors()) |> 
  step_normalize(all_numeric_predictors())

housing_rec
── Recipe ───────────────────────────────────────────────────────────────────────────────────────

── Inputs 
Number of variables by role
outcome:    1
predictor: 13

── Operations 
• Log transformation on: crim
• Sparse, unbalanced variable filter on: all_predictors()
• Centering and scaling for: all_numeric_predictors()

Most of the steps above are for dealing with numeric data, but we commonly have to deal with categorical data too. The most common way of dealing with categorical variables is through dummy variables. This means that if we have a categorical variable with $k$ categories, we create $k - 1$ new variables, one for each category. If an observation is in category $i$, then the $(i-1)$th dummy variable is set to 1 and all other dummy variables are set to 0. If the observation is in the first category though, we set all of the observations to 0, this is called the baseline. This means that we can use these dummy variables as predictors in our model.

Extracting Dummy Variables

To do this we can use the step_dummy function selecting all_factor_predictors(). This will create dummy variables for all of the categorical predictor variables in our data set.

housing_rec <- recipe(medv ~ ., data = housing_train) |>
  step_log(crim) |> 
  step_nzv(all_predictors()) |> 
  step_normalize(all_numeric_predictors()) |> 
  step_dummy(all_factor_predictors())

housing_rec
── Recipe ───────────────────────────────────────────────────────────────────────────────────────

── Inputs 
Number of variables by role
outcome:    1
predictor: 13

── Operations 
• Log transformation on: crim
• Sparse, unbalanced variable filter on: all_predictors()
• Centering and scaling for: all_numeric_predictors()
• Dummy variables from: all_factor_predictors()

At this point we should investigate every variable individually and decide what processing this variable will need, both from a machine learning point of view but also using your domain knowledge of where the data came from. There are step specifications for most, if not all, dplyr functions, there are extra packages that contain specifications for lots of different preprocessing steps and it is also possible to write your own.

Perform Your Own Preprocessing

Spend a few minutes now adding a new step to this recipe to perform some data transformation task. This could be as simple as applying a log transform to another variable or you could also look at problems such as missing data (you could look at removing or imputing missing data).

All we have done so far is tell the preprocessor what we would like to do, we haven’t actually fit it yet. We can do this by calling the prep function at the end, which will create a trained data preprocessor that (given the code below) will log the crim variable, not remove any columns (as all of the data has non-zero variance), will subtract the mean value of the training data set from each numeric predictor and will scale each numeric predictor by the standard deviation of the training set.

housing_rec <- recipe(medv ~ ., data = housing_train) |>
  step_log(crim) |> 
  step_nzv(all_predictors()) |>
  step_normalize(all_numeric_predictors()) |> 
  step_dummy(all_factor_predictors()) |> 
  prep()

housing_rec
── Recipe ───────────────────────────────────────────────────────────────────────────────────────

── Inputs 
Number of variables by role
outcome:    1
predictor: 13

── Training information 
Training data contained 404 data points and no incomplete rows.

── Operations 
• Log transformation on: crim | Trained
• Sparse, unbalanced variable filter removed: <none> | Trained
• Centering and scaling for: crim, zn, indus, nox, rm, age, dis, rad, tax, ... | Trained
• Dummy variables from: chas | Trained

To extract the transformed training data from the preprocessor we can use the juice function. This can be useful if you want to retrospectively see if there are any anomolies etc. included in the training data.

juice(housing_rec)
# A tibble: 404 × 14
      crim      zn  indus    nox     rm    age    dis    rad     tax ptratio     b  lstat  medv chas_X1
     <dbl>   <dbl>  <dbl>  <dbl>  <dbl>  <dbl>  <dbl>  <dbl>   <dbl>   <dbl> <dbl>  <dbl> <dbl>   <dbl>
 1 -0.960   0.0643 -0.732 -1.28  -0.609 -1.70   1.34  -0.632 -0.364    0.197 0.445 -0.625  22         0
 2  1.31   -0.487   1.04   1.24  -1.13   1.11  -1.07   1.69   1.55     0.791 0.452  1.66   12.3       0
 3  0.675  -0.487   1.25   0.418  2.37   0.978 -0.819 -0.517 -0.0185  -1.72  0.159 -1.24   50         0
 4 -0.789  -0.487   0.262 -1.03  -0.615 -1.16   0.373 -0.517 -0.0483   0.105 0.443 -0.486  20.3       0
 5  2.03   -0.487   1.04   1.18  -1.23   1.11  -1.09   1.69   1.55     0.791 0.452  2.50    5         0
 6 -1.26   -0.487  -0.861 -0.360 -0.584 -0.335  0.913 -0.517 -1.09     0.791 0.430 -0.283  18.5       0
 7  1.35   -0.487   1.04   1.18   0.253  1.07  -0.975  1.69   1.55     0.791 0.400  0.629  13.1       0
 8  0.0779  0.395  -1.04   0.781  1.32   0.813 -0.876 -0.517 -0.848   -2.50  0.356 -0.625  36.5       0
 9 -1.09    3.04   -1.13  -1.37  -0.741 -1.36   1.40  -0.632 -0.412   -1.08  0.452 -0.328  19.4       0
10 -0.738   0.0643 -0.467 -0.282 -0.593 -1.07   0.834 -0.517 -0.567   -1.49  0.384  0.433  21.7       0
# ℹ 394 more rows
# ℹ Use `print(n = ...)` to see more rows

To transform new data using the data preprocessor we can use the bake function.

bake(housing_rec, housing_test)
# A tibble: 102 × 14
     crim      zn  indus    nox     rm    age     dis    rad    tax ptratio       b  lstat  medv chas_X1
    <dbl>   <dbl>  <dbl>  <dbl>  <dbl>  <dbl>   <dbl>  <dbl>  <dbl>   <dbl>   <dbl>  <dbl> <dbl>   <dbl>
 1 -0.334  0.0643 -0.467 -0.282  0.117  0.910  1.27   -0.517 -0.567 -1.49    0.406   1.09   15         0
 2  0.378 -0.487  -0.428 -0.161 -0.526 -1.42   0.371  -0.632 -0.591  1.16    0.345  -0.837  23.1       0
 3  0.348 -0.487  -0.428 -0.161 -0.703  1.11   0.175  -0.632 -0.591  1.16    0.427   1.01   14.5       0
 4  0.504 -0.487  -0.428 -0.161 -0.504  0.469  0.124  -0.632 -0.591  1.16   -1.30    2.10   13.2       0
 5  0.418 -0.487  -0.428 -0.161 -0.866  0.935  0.0259 -0.632 -0.591  1.16    0.0465  0.800  13.1       0
 6 -0.720 -0.487  -0.748 -0.498 -0.663 -0.270 -0.173  -0.517 -0.758  0.334   0.246  -0.165  20         0
 7 -1.30   2.82   -1.19  -1.11   0.434 -1.69   0.810  -0.748 -0.919 -0.0772  0.439  -1.15   30.8       0
 8 -0.596 -0.487  -0.608 -0.939  0.688 -2.37   0.965  -0.748 -1.03  -0.260   0.330  -1.08   26.6       0
 9 -0.614 -0.487  -0.608 -0.939 -0.331 -1.04   0.965  -0.748 -1.03  -0.260   0.372  -0.424  21.2       0
10 -0.278 -0.487  -0.608 -0.939 -1.31   0.946  1.04   -0.748 -1.03  -0.260   0.452   2.53   14.4       0
# ℹ 92 more rows
# ℹ Use `print(n = ...)` to see more rows

When training a model with tidymodels we don’t actually have to transform (or bake) the data ourselves, we can add this preproceesor to the model worflow and it will automatically be applied when we predict using the model workflow.

Key Points

  • You can use rsample to split your data into training and testing sets.

  • You can use recipes to create a recipe to preprocess your data.


Building a Simple Workflow

Overview

Teaching: 0 min
Exercises: 0 min
Questions
  • How do we make a workflow using tidymodels?

Objectives
  • Create a linear regression workflow using tidymodels.

Creating a Workflow

As discussed in the introduction, a workflow is a series of steps that are executed in order to accomplish a task, in our case this is to train or make predictions using a model. To initialise a workflow in tidymodels is very simple:

housing_wf <- workflow()

housing_wf
══ Workflow ══════════════════════════════════════════════════════════════════════════════════════════════
Preprocessor: None
Model: None

Here you can see that our workflow object has two slots: a preprocessor and a model. We made a preprocessor object in the last session (our recipe) so we can easily add this to our workflow, we can see our workflow now in the workflow summary and the different preproessing steps involved.

housing_wf <- workflow() |> 
  add_recipe(housing_rec)

housing_wf
══ Workflow ══════════════════════════════════════════════════════════════════════════════════════════════
Preprocessor: Recipe
Model: None

── Preprocessor ──────────────────────────────────────────────────────────────────────────────────────────
4 Recipe Steps

• step_log()
• step_nzv()
• step_normalize()
• step_dummy()

Creating a Model

The next thing we need to do is to create a model object to add to the workflow. This is the power of tidymodels as we can create a model object without having to worry about the underlying code. We can create a model object using the linear_reg() function from the parsnip package. This function creates a model object that is a linear regression model.

housing_lin_reg <- linear_reg(
  mode = "regression",
  engine = "lm"
)

housing_lin_reg
Linear Regression Model Specification (regression)

Computational engine: lm 

Here you can see we have created a linear regression model object. The mode argument specifies that we want to create a regression model and the engine argument specifies that we want to use the lm function from the stats package to fit the model. For each type of model there are many different implementations of that model (generally different packages implement them slightly differently) and the engine argument allows us to specify which implementation we want to use which can make quite a big difference in some circumstances.

We can also add this to our model and that’s it! We have a workflow that we can use to train a model.

housing_wf <- workflow() |> 
  add_recipe(housing_rec) |> 
  add_model(housing_lin_reg)

housing_wf
══ Workflow ══════════════════════════════════════════════════════════════════════════════════════════════
Preprocessor: Recipe
Model: linear_reg()

── Preprocessor ──────────────────────────────────────────────────────────────────────────────────────────
4 Recipe Steps

• step_log()
• step_nzv()
• step_normalize()
• step_dummy()

── Model ─────────────────────────────────────────────────────────────────────────────────────────────────
Linear Regression Model Specification (regression)

Computational engine: lm 

Fitting a Workflow

Now that we have a workflow we can use it to fit a model. To do this we use the fit() function. This function takes two arguments: the workflow and the data. The data is the data that we want to use to train the model. In this case we want to use the training data that we created in the last session. You can see the coefficients of the model in the output below.

housing_fit <- housing_wf |> 
  fit(housing_train)

housing_fit
══ Workflow [trained] ══════════════════════════════════════════════════════════════════════
Preprocessor: Recipe
Model: linear_reg()

── Preprocessor ────────────────────────────────────────────────────────────────────────────
4 Recipe Steps

• step_log()
• step_nzv()
• step_normalize()
• step_dummy()

── Model ───────────────────────────────────────────────────────────────────────────────────

Call:
stats::lm(formula = ..y ~ ., data = data)

Coefficients:
(Intercept)         crim           zn        indus          nox           rm          age  
    22.4329       0.8177       0.5474       0.1569      -2.2445       3.1357       0.1204  
        dis          rad          tax      ptratio            b        lstat      chas_X1  
    -2.4630       1.4247      -1.6586      -2.0529       1.1670      -3.9530       3.2286 

Predicting with a Workflow

Now that we have a fitted model we can use it to make predictions. To do this we use the predict() function. This function takes two arguments: the fitted model and the data. The data is the data that we want to use to make predictions. In this case we want to use the test data that we created in a previous session. You can see the predictions of the model in the output below.

housing_res <- housing_fit |> 
  predict(housing_train)

housing_res
# A tibble: 404 × 1
   .pred
   <dbl>
 1 21.3 
 2 12.0 
 3 39.5 
 4 22.5 
 5  9.16
 6 18.8 
 7 20.3 
 8 35.9 
 9 24.0 
10 20.4 
# ℹ 394 more rows
# ℹ Use `print(n = ...)` to see more rows

We can plot our predictions against the true values to see how well our model is doing.

housing_train |> 
  select(medv) |> 
  bind_cols(housing_res) |> 
  ggplot() +
  aes(
    x = medv,
    y = .pred
  ) +
  geom_point(
    alpha = 0.6
  ) +
  geom_abline(
    slope = 1,
    colour = "blue",
    lty = 2
  ) +
  labs(
    x = "True Value",
    y = "Predicted Value"
  ) +
  coord_equal() +
  theme_bw()

True vs. Predicted Values For Linear Regression Model

Key Points

  • A workflow is a series of steps that are executed in order to accomplish a task.


Hyperparameter Tuning

Overview

Teaching: 0 min
Exercises: 0 min
Questions
  • How do we tune the hyperparameters of a model?

Objectives
  • Create a random forest model using tidymodels.

Hyperparameter Tuning

Linear regression is considered to be a simpler model both due to its constraints and that it has no hyperparameters to tune. If you remember from a previous session, models have parameters and hyperparameters. Parameters are learnt by the model from the data but hyperparameters have to be set by us. Luckily, tidymodels has a lot of tools that we can use for finding the best hyperparameters and justifying the choices we make.

Hyperparameter tuning is performed in tidymodels using a package called dials. We are going to demonstrate how to use dials by training a random forest model, we aren’t going to go into the details of how random forests work but you can see we will be able to train one anyway! When it comes to applying this in the real world, you should always research the model you are using and the basics on how it works to see if it appropriate for your use case.

Identify the Hyperparameters

Use ?rand_forest to look at the help documentation for creating a random forest model. What are the hyperparameters for the model?

Firstly, lets create a new workflow object and add our recipe to it. Note how this is exactly the same as before and we are reusing the same preprocessor

housing_wf_rfrst <- workflow() |> 
  add_recipe(housing_rec)
══ Workflow ════════════════════════════════════════════════════════════════════════════════
Preprocessor: Recipe
Model: None

── Preprocessor ────────────────────────────────────────────────────────────────────────────
4 Recipe Steps

• step_log()
• step_nzv()
• step_normalize()
• step_dummy()

Now we can create our model. We are going to use the rand_forest() function from the parsnip package to create a random forest model object. Rather than setting the values of the hyperparameters, we are going to set them equal to the tune() function. This lets tidymodels know that we want to tune these hyperparameters and that this isn’t the final version of the model. You can see that this time we haven’t selected an engine to use but it has defaulted to the ranger package.

housing_rfrst <- rand_forest(
  mode = "regression",
  mtry = tune(),
  trees = tune(),
  min_n = tune()
)

housing_rfrst
Random Forest Model Specification (regression)

Main Arguments:
  mtry = tune()
  trees = tune()
  min_n = tune()

Computational engine: ranger 

We can now add this model to our workflow like before.

housing_wf_rfrst <- workflow() |> 
  add_recipe(housing_rec) |> 
  add_model(housing_rfrst)

housing_wf_rfrst
══ Workflow ════════════════════════════════════════════════════════════════════════════════
Preprocessor: Recipe
Model: rand_forest()

── Preprocessor ────────────────────────────────────────────────────────────────────────────
4 Recipe Steps

• step_log()
• step_nzv()
• step_normalize()
• step_dummy()

── Model ───────────────────────────────────────────────────────────────────────────────────
Random Forest Model Specification (regression)

Main Arguments:
  mtry = tune()
  trees = tune()
  min_n = tune()

Computational engine: ranger

Next we are going to perform cross validation to select the best combination of hyperparameters to use to train the model and to do this we need to decide the combinations to test. Thankfully, tidymodels can do this for us. We can use the extract_parameter_set_dials() to extract the parameters that need tuning from the model as well as what type of parameters they are.

tuning_params <- housing_rfrst |> 
  extract_parameter_set_dials()

 tuning_params 
Collection of 3 parameters for tuning

 identifier  type    object
       mtry  mtry nparam[?]
      trees trees nparam[+]
      min_n min_n nparam[+]

Model parameters needing finalization:
   # Randomly Selected Predictors ('mtry')

See `?dials::finalize` or `?dials::update.parameters` for more information.

You can see in the output that some hyperparameters need finalization. The tidymodels package is good at guessing different types of parameters but sometimes it needs some extra context. For example, the mtry parameter is the number of randomly selected predictors to use at each split which is dependent in the data set we are using (smaller datasets won’t need as many) and so to get a set of reasonable values we can show the model some example data and it can use this to estimate a good parameter range. We do this using the finalize() function, passing through our training data set.

tuning_params <- housing_rfrst |> 
  extract_parameter_set_dials() |> 
  finalize(housing_train)

tuning_params
Collection of 3 parameters for tuning

 identifier  type    object
       mtry  mtry nparam[+]
      trees trees nparam[+]
      min_n min_n nparam[+]

Now we can start generating sets of values to test, the simplest method for doing this is using a tuning grid. To create one, a predefined number of values are selected for each hyperparameter and every combination of these values is tested. We do this using the grid_regular() function, passing through the tuning parameters we extracted from the model and the number of values we want to test for each hyperparameter as ‘levels’. There are other, arguably more sophisticated, methods for generating tuning grids, you may wish to try using grid_max_entropy() or grid_latin_hypercube() instead (hint: if you wish to use these you will need to replace the levels argument with size). We are going to use 100 values for each hyperparameter.

tuning_grid <- grid_regular(
  tuning_params,
  levels = 100
)

tuning_grid
# A tibble: 54,600 × 3
    mtry trees min_n
   <int> <int> <int>
 1     1     1     2
 2     2     1     2
 3     3     1     2
 4     4     1     2
 5     5     1     2
 6     6     1     2
 7     7     1     2
 8     8     1     2
 9     9     1     2
10    10     1     2
# ℹ 54,590 more rows
# ℹ Use `print(n = ...)` to see more rows

Now that we have the combinations of hyperparameters that we would like to test we can use the tune_grid() function to train a model with each of these hyperparameter combinations. Rather than passing tune_grid() housing_train we have passed it housing_folds which has the specification for our cross validation sets.

housing_rfrst_tuning <- housing_wf_rfrst |> 
  tune_grid(
    housing_folds,
    tuning_params
  )

housing_rfrst_tuning
# Tuning results
# 5-fold cross-validation repeated 3 times 
# A tibble: 15 × 5
   splits           id      id2   .metrics          .notes          
   <list>           <chr>   <chr> <list>            <list>          
 1 <split [323/81]> Repeat1 Fold1 <tibble [20 × 7]> <tibble [0 × 3]>
 2 <split [323/81]> Repeat1 Fold2 <tibble [20 × 7]> <tibble [0 × 3]>
 3 <split [323/81]> Repeat1 Fold3 <tibble [20 × 7]> <tibble [0 × 3]>
 4 <split [323/81]> Repeat1 Fold4 <tibble [20 × 7]> <tibble [0 × 3]>
 5 <split [324/80]> Repeat1 Fold5 <tibble [20 × 7]> <tibble [0 × 3]>
 6 <split [323/81]> Repeat2 Fold1 <tibble [20 × 7]> <tibble [0 × 3]>
 7 <split [323/81]> Repeat2 Fold2 <tibble [20 × 7]> <tibble [0 × 3]>
 8 <split [323/81]> Repeat2 Fold3 <tibble [20 × 7]> <tibble [0 × 3]>
 9 <split [323/81]> Repeat2 Fold4 <tibble [20 × 7]> <tibble [0 × 3]>
10 <split [324/80]> Repeat2 Fold5 <tibble [20 × 7]> <tibble [0 × 3]>
11 <split [323/81]> Repeat3 Fold1 <tibble [20 × 7]> <tibble [0 × 3]>
12 <split [323/81]> Repeat3 Fold2 <tibble [20 × 7]> <tibble [0 × 3]>
13 <split [323/81]> Repeat3 Fold3 <tibble [20 × 7]> <tibble [0 × 3]>
14 <split [323/81]> Repeat3 Fold4 <tibble [20 × 7]> <tibble [0 × 3]>
15 <split [324/80]> Repeat3 Fold5 <tibble [20 × 7]> <tibble [0 × 3]>

This is hard for us to interpret and so we can use the collect_metrics() function to extract the metrics we are interested in.

housing_rfrst_tuning |> 
  collect_metrics()
# A tibble: 20 × 9
    mtry trees min_n .metric .estimator  mean     n std_err .config              
   <int> <int> <int> <chr>   <chr>      <dbl> <int>   <dbl> <chr>                
 1     8  1089    26 rmse    standard   3.35     15  0.186  Preprocessor1_Model01
 2     8  1089    26 rsq     standard   0.869    15  0.0131 Preprocessor1_Model01
 3    11  1384    14 rmse    standard   3.25     15  0.180  Preprocessor1_Model02
 4    11  1384    14 rsq     standard   0.875    15  0.0122 Preprocessor1_Model02
 5     4   309     4 rmse    standard   3.13     15  0.188  Preprocessor1_Model03
 6     4   309     4 rsq     standard   0.889    15  0.0124 Preprocessor1_Model03
 7     6   466    38 rmse    standard   3.56     15  0.195  Preprocessor1_Model04
 8     6   466    38 rsq     standard   0.857    15  0.0144 Preprocessor1_Model04
 9     8  1638    18 rmse    standard   3.26     15  0.186  Preprocessor1_Model05
10     8  1638    18 rsq     standard   0.876    15  0.0127 Preprocessor1_Model05
11     5  1449    33 rmse    standard   3.51     15  0.194  Preprocessor1_Model06
12     5  1449    33 rsq     standard   0.863    15  0.0144 Preprocessor1_Model06
13     3   692    24 rmse    standard   3.61     15  0.197  Preprocessor1_Model07
14     3   692    24 rsq     standard   0.862    15  0.0151 Preprocessor1_Model07
15    12    74    11 rmse    standard   3.24     15  0.178  Preprocessor1_Model08
16    12    74    11 rsq     standard   0.876    15  0.0118 Preprocessor1_Model08
17    10  1824     8 rmse    standard   3.19     15  0.180  Preprocessor1_Model09
18    10  1824     8 rsq     standard   0.880    15  0.0120 Preprocessor1_Model09
19     2   981    30 rmse    standard   4.06     15  0.201  Preprocessor1_Model10
20     2   981    30 rsq     standard   0.833    15  0.0178 Preprocessor1_Model10

We can also quickly create a plot to visualise the results.

autoplot(housing_rfrst_tuning)

Hyperparameter Tuning Plot

To select the “best” model we can use the select_best() function. We pass these all of the tuning results and the metric we are interested in optimising. In this case we are interested in the model with the highest R-squared.

best_params <- housing_rfrst_tuning |> 
  select_best("rsq")

best_params
# A tibble: 1 × 4
   mtry trees min_n .config              
  <int> <int> <int> <chr>                
1     4   309     4 Preprocessor1_Model03

Now we need set the parameters in our workflow model to be the best parameters we have idenified above. If we take a look at our workflow object, it looks like this

housing_wf_rfrst
══ Workflow ════════════════════════════════════════════════════════════════════════════════
Preprocessor: Recipe
Model: rand_forest()

── Preprocessor ────────────────────────────────────────────────────────────────────────────
4 Recipe Steps

• step_log()
• step_nzv()
• step_normalize()
• step_dummy()

── Model ───────────────────────────────────────────────────────────────────────────────────
Random Forest Model Specification (regression)

Main Arguments:
  mtry = tune()
  trees = tune()
  min_n = tune()

Computational engine: ranger 

We can then set the hyperparameters using the finalize_workflow() function. We pass this the workflow object and the best parameters we have identified.

housing_wf_rfrst |> 
  finalize_workflow(best_params)
══ Workflow ════════════════════════════════════════════════════════════════════════════════
Preprocessor: Recipe
Model: rand_forest()

── Preprocessor ────────────────────────────────────────────────────────────────────────────
4 Recipe Steps

• step_log()
• step_nzv()
• step_normalize()
• step_dummy()

── Model ───────────────────────────────────────────────────────────────────────────────────
Random Forest Model Specification (regression)

Main Arguments:
  mtry = 4
  trees = 309
  min_n = 4

Computational engine: ranger 

As well as finalizing the workflow, we can fit the model to the training data using the fit() function like before and now we have a fully tuned and trained model.

housing_rfrst_fit <- housing_wf_rfrst |> 
  finalize_workflow(best_params) |> 
  fit(housing_train)


housing_rfrst_fit
══ Workflow [trained] ══════════════════════════════════════════════════════════════════════
Preprocessor: Recipe
Model: rand_forest()

── Preprocessor ────────────────────────────────────────────────────────────────────────────
4 Recipe Steps

• step_log()
• step_nzv()
• step_normalize()
• step_dummy()

── Model ───────────────────────────────────────────────────────────────────────────────────
Ranger result

Call:
 ranger::ranger(x = maybe_data_frame(x), y = y, mtry = min_cols(~4L,      x), num.trees = ~309L, min.node.size = min_rows(~4L, x),      num.threads = 1, verbose = FALSE, seed = sample.int(10^5,          1)) 

Type:                             Regression 
Number of trees:                  309 
Sample size:                      404 
Number of independent variables:  13 
Mtry:                             4 
Target node size:                 4 
Variable importance mode:         none 
Splitrule:                        variance 
OOB prediction error (MSE):       9.67719 
R squared (OOB):                  0.8873346 

Just like before, we can then use the predict() function to generate predictions and then plot the results, as we have seen in the previous section.

housing_rfrst_res <- housing_rfrst_fit |> 
  predict(housing_train)

housing_train |> 
  select(medv) |> 
  bind_cols(housing_res) |> 
  ggplot() +
  aes(
    x = medv,
    y = .pred
  ) +
  geom_point(
    alpha = 0.6
  ) +
  geom_abline(
    slope = 1,
    colour = "blue",
    lty = 2
  ) +
  labs(
    x = "True Value",
    y = "Predicted Value"
  ) +
  coord_equal() +
  theme_bw()

True vs Predicted Plot For Random Forest

This is not the most robust of tests, but to look at how much better our tuned random forest model is compared to the linear regression model we can compare these truth vs. prediction plots.

Comparison of True vs Predicted Plots

Comparing Models

Which model do you think is performing better? Do you think you can be confident of your answer only using the plot above? Who or why not?

Key Points

  • Hyperparameters are the settings of a model that we have to set ourselves.


After Training a Model

Overview

Teaching: 0 min
Exercises: 0 min
Questions
  • How do we test a model?

  • How do we save and load a model?

Objectives
  • Test a model on the testing data.

  • Save and load a model.

Testing a Model

Up until this point we have not touched our testing data, this is good as we should not touch it until after we have performed model selection and training. Once you have a model that you are happy with then you can see how it performs on the testing data to get a idea as to how it will perform in the real world. For this example, we are going to pretend that we have decided we want to use our linear regression model.

So far we have been using the fit() function to fit our model. However, tidymodels has another function called last_fit() which will fit the model to the training data and then test it on the testing data. We can use this function just like we used fit(), but instead of passing in the training data (housing_train) we pass in the training and testing data in the form of the split object we made earlier (housing_split).

housing_lin_reg_last_fit <- housing_wf |> 
  last_fit(housing_split)

housing_lin_reg_last_fit
# Resampling results
# Manual resampling 
# A tibble: 1 × 6
  splits            id               .metrics         .notes   .predictions       .workflow 
  <list>            <chr>            <list>           <list>   <list>             <list>    
1 <split [404/102]> train/test split <tibble [2 × 4]> <tibble> <tibble [102 × 4]> <workflow>

This is an object that contains our trained workflow, we can access the workflow by

housing_lin_reg_last_fit |> 
  extract_workflow()
══ Workflow [trained] ══════════════════════════════════════════════════════════════════════
Preprocessor: Recipe
Model: linear_reg()

── Preprocessor ────────────────────────────────────────────────────────────────────────────
4 Recipe Steps

• step_log()
• step_nzv()
• step_normalize()
• step_dummy()

── Model ───────────────────────────────────────────────────────────────────────────────────

Call:
stats::lm(formula = ..y ~ ., data = data)

Coefficients:
(Intercept)         crim           zn        indus          nox           rm          age  
   22.51650      0.83052      0.96745      0.35432     -2.15434      2.73170     -0.04878  
        dis          rad          tax      ptratio            b        lstat      chas_X1  
   -2.72384      1.74809     -2.04659     -2.02828      0.99723     -4.26584      2.57621  

But the last_fit() function, as well as fitting the model, will evaluate your model on the testing data and provide you with metrics on how well it performed. We can access these metrics by using the collect_metrics() function.

housing_lin_reg_last_fit |> 
  collect_metrics()
# A tibble: 2 × 4
  .metric .estimator .estimate .config             
  <chr>   <chr>          <dbl> <chr>               
1 rmse    standard       3.83  Preprocessor1_Model1
2 rsq     standard       0.791 Preprocessor1_Model1

Is this model any good?

Looking at these metrics, do you think that the model we have trained is any good? Why or why not?

We can also check whether there are any “structural errors” in our model by comparing plots of true and predicted values.

housing_lin_reg_test_res <- housing_lin_reg_last_fit |> 
  extract_workflow() |> 
  predict(housing_test)

housing_lin_reg_test_res
# A tibble: 102 × 1
   .pred
   <dbl>
 1 20.4 
 2 21.7 
 3 18.9 
 4 16.2 
 5 13.9 
 6 11.4 
 7  8.39
 8 22.8 
 9 34.4 
10 25.4 
# ℹ 92 more rows
# ℹ Use `print(n = ...)` to see more rows
housing_test |> 
  select(medv) |> 
  bind_cols(housing_lin_reg_test_res) |> 
  ggplot() +
  aes(
    x = medv,
    y = .pred
  ) +
  geom_point(
    alpha = 0.6
  ) +
  geom_abline(
    slope = 1,
    colour = "blue",
    lty = 2
  ) +
  labs(
    x = "True Value",
    y = "Predicted Value"
  ) +
  coord_equal() +
  theme_bw()

Test Comparison Plot

Is this model any good?

Looking at this plot, do you think that the model we have trained is any good? Why or why not?

Saving a Model

Once you have a model that you are happy with, you can save it to disk so that you can use it later. To do this we use the saveRDS() function. The first argument is the object you want to save, the second argument is the path to the file you want to save it to. In this case we are going to save our linear regression model last fit object to a file called model.rds.

saveRDS(
  housing_lin_reg_last_fit,
  "model.rds"
)

We can then read in this model again using the readRDS() function.

model <- readRDS("model.rds")

model |> 
  extract_workflow()
══ Workflow [trained] ══════════════════════════════════════════════════════════════════════
Preprocessor: Recipe
Model: linear_reg()

── Preprocessor ────────────────────────────────────────────────────────────────────────────
4 Recipe Steps

• step_log()
• step_nzv()
• step_normalize()
• step_dummy()

── Model ───────────────────────────────────────────────────────────────────────────────────

Call:
stats::lm(formula = ..y ~ ., data = data)

Coefficients:
(Intercept)         crim           zn        indus          nox           rm          age  
   22.51650      0.83052      0.96745      0.35432     -2.15434      2.73170     -0.04878  
        dis          rad          tax      ptratio            b        lstat      chas_X1  
   -2.72384      1.74809     -2.04659     -2.02828      0.99723     -4.26584      2.57621

Key Points

  • We can use last_fit() to test a model on the testing data.

  • We can use saveRDS() and readRDS() to save and load a model.


Full Walkthrough For Classification

Overview

Teaching: 0 min
Exercises: 0 min
Questions
  • Can we apply what we have learnt to a new dataset?

Objectives
  • Apply what we have learnt to a new dataset.

data <- iris |> 
  as_tibble()

data_split <- initial_split(data, prop = 0.8, strata = Species)
data_train <- training(data_split)
data_test <- testing(data_split)

data_folds <- vfold_cv(data_train, repeats = 5, strata = Species)

data_rec <- recipe(Species ~ ., data = data_train) |> 
  step_normalize(all_numeric()) |> 
  prep()

data_knn <- nearest_neighbor(
  mode = "classification",
  neighbors = tune(),
  weight_func = tune(),
  dist_power = tune()
)

data_wf <- workflow() |> 
  add_recipe(data_rec) |> 
  add_model(data_knn)

tuning_params <- data_knn |> 
  extract_parameter_set_dials()

data_tuning_grid <- grid_latin_hypercube(
  tuning_params,
  size = 100
)

data_tune <- tune_grid(
  data_wf,
  data_folds,
  tuning_params
)

data_tune |> 
  collect_metrics()

autoplot(data_tune)

best_params <- data_tune |> 
  select_best("roc_auc")

data_last_fit <- data_wf |> 
  finalize_workflow(best_params) |> 
  last_fit(data_split)

data_last_fit |> 
  collect_metrics()

iris_test_res <- data_test |> 
  select(Species) |> 
  bind_cols(
    data_last_fit |> 
      extract_workflow() |> 
      predict(data_test)
  ) |> 
  transmute(
    ref = Species,
    pred = .pred_class 
  )

library(caret)

confusionMatrix(data = iris_test_res$pred, reference = iris_test_res$ref)
Confusion Matrix and Statistics

            Reference
Prediction   setosa versicolor virginica
  setosa         10          0         0
  versicolor      0          8         2
  virginica       0          2         8

Overall Statistics
                                          
               Accuracy : 0.8667          
                 95% CI : (0.6928, 0.9624)
    No Information Rate : 0.3333          
    P-Value [Acc > NIR] : 2.296e-09       
                                          
                  Kappa : 0.8             
                                          
 Mcnemar's Test P-Value : NA              

Statistics by Class:

                     Class: setosa Class: versicolor Class: virginica
Sensitivity                 1.0000            0.8000           0.8000
Specificity                 1.0000            0.9000           0.9000
Pos Pred Value              1.0000            0.8000           0.8000
Neg Pred Value              1.0000            0.9000           0.9000
Prevalence                  0.3333            0.3333           0.3333
Detection Rate              0.3333            0.2667           0.2667
Detection Prevalence        0.3333            0.3333           0.3333
Balanced Accuracy           1.0000            0.8500           0.8500

Key Points

  • We can apply what we have learnt to a new dataset.