< meta name ="author " content ="Tiffany Timbers, Trevor Campbell, and Melissa Lee Foreword by Roger Peng " />
- < meta name ="date " content ="2022-03-02 " />
+ < meta name ="date " content ="2022-07-05 " />
< meta name ="viewport " content ="width=device-width, initial-scale=1 " />
< meta name ="apple-mobile-web-app-capable " content ="yes " />
@@ -623,7 +623,7 @@ <h2><span class="header-section-number">6.4</span> Randomness and seeds</h2>
The trick is that in R—and other programming languages—randomness
is not actually random! Instead, R uses a < em > random number generator</ em > that
produces a sequence of numbers that
- are completely determined by a
+ are completely determined by a
< em > seed value</ em > . Once you set the seed value
using the < code > set.seed</ code > function, everything after that point may < em > look</ em > random,
but is actually totally reproducible. As long as you pick the same seed
@@ -634,27 +634,27 @@ <h2><span class="header-section-number">6.4</span> Randomness and seeds</h2>
we call < code > set.seed</ code > , and pass it any integer as an argument.
Here, we pass in the number < code > 1</ code > .</ p >
< div class ="sourceCode " id ="cb300 "> < pre class ="sourceCode r "> < code class ="sourceCode r "> < span id ="cb300-1 "> < a href ="classification2.html#cb300-1 " aria-hidden ="true " tabindex ="-1 "> </ a > < span class ="fu "> set.seed</ span > (< span class ="dv "> 1</ span > )</ span >
- < span id ="cb300-2 "> < a href ="classification2.html#cb300-2 " aria-hidden ="true " tabindex ="-1 "> </ a > random_numbers < span class ="ot "> <-</ span > < span class ="fu "> sample</ span > (< span class ="dv "> 0</ span > < span class ="sc "> :</ span > < span class ="dv "> 9</ span > , < span class ="dv "> 10</ span > , < span class ="at "> replace=</ span > < span class ="cn "> TRUE</ span > )</ span >
- < span id ="cb300-3 "> < a href ="classification2.html#cb300-3 " aria-hidden ="true " tabindex ="-1 "> </ a > random_numbers </ span > </ code > </ pre > </ div >
+ < span id ="cb300-2 "> < a href ="classification2.html#cb300-2 " aria-hidden ="true " tabindex ="-1 "> </ a > random_numbers1 < span class ="ot "> <-</ span > < span class ="fu "> sample</ span > (< span class ="dv "> 0</ span > < span class ="sc "> :</ span > < span class ="dv "> 9</ span > , < span class ="dv "> 10</ span > , < span class ="at "> replace=</ span > < span class ="cn "> TRUE</ span > )</ span >
+ < span id ="cb300-3 "> < a href ="classification2.html#cb300-3 " aria-hidden ="true " tabindex ="-1 "> </ a > random_numbers1 </ span > </ code > </ pre > </ div >
< pre > < code > ## [1] 8 3 6 0 1 6 1 2 0 4</ code > </ pre >
- < p > You can see that < code > random_numbers </ code > is a list of 10 numbers
+ < p > You can see that < code > random_numbers1 </ code > is a vector of 10 numbers
from 0 to 9 that, from all appearances, looks random. If
we run the < code > sample</ code > function again, we will
get a fresh batch of 10 numbers that also look random.</ p >
- < div class ="sourceCode " id ="cb302 "> < pre class ="sourceCode r "> < code class ="sourceCode r "> < span id ="cb302-1 "> < a href ="classification2.html#cb302-1 " aria-hidden ="true " tabindex ="-1 "> </ a > random_numbers < span class ="ot "> <-</ span > < span class ="fu "> sample</ span > (< span class ="dv "> 0</ span > < span class ="sc "> :</ span > < span class ="dv "> 9</ span > , < span class ="dv "> 10</ span > , < span class ="at "> replace=</ span > < span class ="cn "> TRUE</ span > )</ span >
- < span id ="cb302-2 "> < a href ="classification2.html#cb302-2 " aria-hidden ="true " tabindex ="-1 "> </ a > random_numbers </ span > </ code > </ pre > </ div >
+ < div class ="sourceCode " id ="cb302 "> < pre class ="sourceCode r "> < code class ="sourceCode r "> < span id ="cb302-1 "> < a href ="classification2.html#cb302-1 " aria-hidden ="true " tabindex ="-1 "> </ a > random_numbers2 < span class ="ot "> <-</ span > < span class ="fu "> sample</ span > (< span class ="dv "> 0</ span > < span class ="sc "> :</ span > < span class ="dv "> 9</ span > , < span class ="dv "> 10</ span > , < span class ="at "> replace=</ span > < span class ="cn "> TRUE</ span > )</ span >
+ < span id ="cb302-2 "> < a href ="classification2.html#cb302-2 " aria-hidden ="true " tabindex ="-1 "> </ a > random_numbers2 </ span > </ code > </ pre > </ div >
< pre > < code > ## [1] 4 9 5 9 6 8 4 4 8 8</ code > </ pre >
< p > If we want to force R to produce the same sequences of random numbers,
we can simply call the < code > set.seed</ code > function again with the same argument
value.</ p >
< div class ="sourceCode " id ="cb304 "> < pre class ="sourceCode r "> < code class ="sourceCode r "> < span id ="cb304-1 "> < a href ="classification2.html#cb304-1 " aria-hidden ="true " tabindex ="-1 "> </ a > < span class ="fu "> set.seed</ span > (< span class ="dv "> 1</ span > )</ span >
- < span id ="cb304-2 "> < a href ="classification2.html#cb304-2 " aria-hidden ="true " tabindex ="-1 "> </ a > random_numbers < span class ="ot "> <-</ span > < span class ="fu "> sample</ span > (< span class ="dv "> 0</ span > < span class ="sc "> :</ span > < span class ="dv "> 9</ span > , < span class ="dv "> 10</ span > , < span class ="at "> replace=</ span > < span class ="cn "> TRUE</ span > )</ span >
- < span id ="cb304-3 "> < a href ="classification2.html#cb304-3 " aria-hidden ="true " tabindex ="-1 "> </ a > random_numbers </ span > </ code > </ pre > </ div >
+ < span id ="cb304-2 "> < a href ="classification2.html#cb304-2 " aria-hidden ="true " tabindex ="-1 "> </ a > random_numbers1_again < span class ="ot "> <-</ span > < span class ="fu "> sample</ span > (< span class ="dv "> 0</ span > < span class ="sc "> :</ span > < span class ="dv "> 9</ span > , < span class ="dv "> 10</ span > , < span class ="at "> replace=</ span > < span class ="cn "> TRUE</ span > )</ span >
+ < span id ="cb304-3 "> < a href ="classification2.html#cb304-3 " aria-hidden ="true " tabindex ="-1 "> </ a > random_numbers1_again </ span > </ code > </ pre > </ div >
< pre > < code > ## [1] 8 3 6 0 1 6 1 2 0 4</ code > </ pre >
- < div class ="sourceCode " id ="cb306 "> < pre class ="sourceCode r "> < code class ="sourceCode r "> < span id ="cb306-1 "> < a href ="classification2.html#cb306-1 " aria-hidden ="true " tabindex ="-1 "> </ a > random_numbers < span class ="ot "> <-</ span > < span class ="fu "> sample</ span > (< span class ="dv "> 0</ span > < span class ="sc "> :</ span > < span class ="dv "> 9</ span > , < span class ="dv "> 10</ span > , < span class ="at "> replace=</ span > < span class ="cn "> TRUE</ span > )</ span >
- < span id ="cb306-2 "> < a href ="classification2.html#cb306-2 " aria-hidden ="true " tabindex ="-1 "> </ a > random_numbers </ span > </ code > </ pre > </ div >
+ < div class ="sourceCode " id ="cb306 "> < pre class ="sourceCode r "> < code class ="sourceCode r "> < span id ="cb306-1 "> < a href ="classification2.html#cb306-1 " aria-hidden ="true " tabindex ="-1 "> </ a > random_numbers2_again < span class ="ot "> <-</ span > < span class ="fu "> sample</ span > (< span class ="dv "> 0</ span > < span class ="sc "> :</ span > < span class ="dv "> 9</ span > , < span class ="dv "> 10</ span > , < span class ="at "> replace=</ span > < span class ="cn "> TRUE</ span > )</ span >
+ < span id ="cb306-2 "> < a href ="classification2.html#cb306-2 " aria-hidden ="true " tabindex ="-1 "> </ a > random_numbers2_again </ span > </ code > </ pre > </ div >
< pre > < code > ## [1] 4 9 5 9 6 8 4 4 8 8</ code > </ pre >
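The reproducibility shown above can also be checked programmatically rather than by eye. The following is a minimal sketch in base R; the variable names (`first_batch`, `second_batch`, and their `_again` counterparts) are illustrative and are not taken from the book.

```r
# Seed the generator, then draw two batches of 10 numbers from 0 to 9.
set.seed(1)
first_batch  <- sample(0:9, 10, replace = TRUE)
second_batch <- sample(0:9, 10, replace = TRUE)

# Reset the generator to the same seed and draw two batches again.
set.seed(1)
first_batch_again  <- sample(0:9, 10, replace = TRUE)
second_batch_again <- sample(0:9, 10, replace = TRUE)

# identical() compares the vectors element by element, in order.
identical(first_batch, first_batch_again)    # TRUE
identical(second_batch, second_batch_again)  # TRUE
```

Because the seed fixes the whole sequence, the match holds batch by batch: the first draw after seeding always equals the first draw after reseeding, and likewise for the second.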
- < p > And if we choose
+ < p > Notice that after resetting the seed, we get the same two sequences of numbers in the same order: < code > random_numbers1 </ code > and < code > random_numbers1_again </ code > are identical, as are < code > random_numbers2 </ code > and < code > random_numbers2_again </ code > . And if we choose
a different value for the seed—say, 4235—we
obtain a different sequence of random numbers.</ p >
< div class ="sourceCode " id ="cb308 "> < pre class ="sourceCode r "> < code class ="sourceCode r "> < span id ="cb308-1 "> < a href ="classification2.html#cb308-1 " aria-hidden ="true " tabindex ="-1 "> </ a > < span class ="fu "> set.seed</ span > (< span class ="dv "> 4235</ span > )</ span >
@@ -822,7 +822,7 @@ <h3><span class="header-section-number">6.5.2</span> Preprocess the data</h3>
our test data does not influence any aspect of our model training. Once we have
created the standardization preprocessor, we can then apply it separately to both the
training and test data sets.</ p >
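To make the idea concrete, here is a minimal base-R sketch of standardizing with constants learned from the training data only. It does not use the recipe framework, and the two small data frames and their values are hypothetical stand-ins, not the cancer data.

```r
# Hypothetical toy data; the values are illustrative only.
train <- data.frame(Smoothness = c(1, 2, 3, 4, 5))
test  <- data.frame(Smoothness = c(2, 6))

# Learn the centering and scaling constants from the training data only.
train_mean <- mean(train$Smoothness)
train_sd   <- sd(train$Smoothness)

# Apply those same constants separately to the training and test data.
train_scaled <- (train$Smoothness - train_mean) / train_sd
test_scaled  <- (test$Smoothness - train_mean) / train_sd
```

The test values never enter the computation of `train_mean` or `train_sd`, so the test set has no influence on the preprocessing. As a consequence, the scaled test data will generally not have mean 0 and standard deviation 1.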
- < p > Fortunately, the < code > recipe</ code > framework from < code > tidymodels</ code > helps us handle
+ < p > Fortunately, the < code > recipe</ code > framework from < code > tidymodels</ code > helps us handle
this properly. Below we construct and prepare the recipe using only the training
data (due to < code > data = cancer_train</ code > in the first line).</ p >
< div class ="sourceCode " id ="cb320 "> < pre class ="sourceCode r "> < code class ="sourceCode r "> < span id ="cb320-1 "> < a href ="classification2.html#cb320-1 " aria-hidden ="true " tabindex ="-1 "> </ a > cancer_recipe < span class ="ot "> <-</ span > < span class ="fu "> recipe</ span > (Class < span class ="sc "> ~</ span > Smoothness < span class ="sc "> +</ span > Concavity, < span class ="at "> data =</ span > cancer_train) < span class ="sc "> |></ span > </ span >
@@ -918,8 +918,7 @@ <h3><span class="header-section-number">6.5.5</span> Compute the accuracy</h3>
the table of predicted labels and correct labels, using the < code > conf_mat</ code > function:</ p >
< div class ="sourceCode " id ="cb327 "> < pre class ="sourceCode r "> < code class ="sourceCode r "> < span id ="cb327-1 "> < a href ="classification2.html#cb327-1 " aria-hidden ="true " tabindex ="-1 "> </ a > confusion < span class ="ot "> <-</ span > cancer_test_predictions < span class ="sc "> |></ span > </ span >
< span id ="cb327-2 "> < a href ="classification2.html#cb327-2 " aria-hidden ="true " tabindex ="-1 "> </ a > < span class ="fu "> conf_mat</ span > (< span class ="at "> truth =</ span > Class, < span class ="at "> estimate =</ span > .pred_class)</ span >
- < span id ="cb327-3 "> < a href ="classification2.html#cb327-3 " aria-hidden ="true " tabindex ="-1 "> </ a > </ span >
- < span id ="cb327-4 "> < a href ="classification2.html#cb327-4 " aria-hidden ="true " tabindex ="-1 "> </ a > confusion</ span > </ code > </ pre > </ div >
+ < span id ="cb327-3 "> < a href ="classification2.html#cb327-3 " aria-hidden ="true " tabindex ="-1 "> </ a > confusion</ span > </ code > </ pre > </ div >
< pre > < code > ## Truth
## Prediction M B
## M 39 6
@@ -991,7 +990,7 @@ <h3><span class="header-section-number">6.5.6</span> Critically analyze performa
< div id ="tuning-the-classifier " class ="section level2 " number ="6.6 ">
< h2 > < span class ="header-section-number "> 6.6</ span > Tuning the classifier</ h2 >
< p > The vast majority of predictive models in statistics and machine learning have
- < em > parameters</ em > . A < em > parameter</ em >
+ < em > parameters</ em > . A < em > parameter</ em >
is a number you have to pick in advance that determines
some aspect of how the model behaves. For example, in the < span class ="math inline "> \(K\)</ span > -nearest neighbors
classification algorithm, < span class ="math inline "> \(K\)</ span > is a parameter that we have to pick
@@ -1107,7 +1106,7 @@ <h3><span class="header-section-number">6.6.1</span> Cross-validation</h3>
## 3 <split [341/85]> Fold3
## 4 <split [341/85]> Fold4
## 5 <split [342/84]> Fold5</ code > </ pre >
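To illustrate what a set of train/validation splits like the one above represents, here is a rough base-R sketch that assigns rows to five folds and then treats each fold in turn as the validation set. It is an assumption-laden simplification: it uses a hypothetical data set of 20 rows, does not call any tidymodels function, and performs no stratification, so the fold sizes will not match the output shown above.

```r
set.seed(1)

n_rows  <- 20  # hypothetical number of training rows
n_folds <- 5

# Give each row a fold label 1..5, then shuffle the labels at random.
fold_id <- sample(rep(seq_len(n_folds), length.out = n_rows))

# Each fold serves once as the validation set; the rest is used for training.
for (k in seq_len(n_folds)) {
  validation_rows <- which(fold_id == k)
  training_rows   <- which(fold_id != k)
  cat("Fold", k, ": train on", length(training_rows),
      "rows, validate on", length(validation_rows), "rows\n")
}
```

Every row lands in exactly one validation set, which is why each observation contributes to the accuracy estimate exactly once across the five splits.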
- < p > Then, when we create our data analysis workflow, we use the < code > fit_resamples</ code > function
+ < p > Then, when we create our data analysis workflow, we use the < code > fit_resamples</ code > function
instead of the < code > fit</ code > function for training. This runs cross-validation on each
train/validation split.</ p >
< div class ="sourceCode " id ="cb335 "> < pre class ="sourceCode r "> < code class ="sourceCode r "> < span id ="cb335-1 "> < a href ="classification2.html#cb335-1 " aria-hidden ="true " tabindex ="-1 "> </ a > < span class ="co "> # recreate the standardization recipe from before </ span > </ span >
@@ -1134,7 +1133,7 @@ <h3><span class="header-section-number">6.6.1</span> Cross-validation</h3>
## 3 <split [341/85]> Fold3 <tibble [2 × 4]> <tibble [0 × 1]>
## 4 <split [341/85]> Fold4 <tibble [2 × 4]> <tibble [0 × 1]>
## 5 <split [342/84]> Fold5 <tibble [2 × 4]> <tibble [0 × 1]></ code > </ pre >
- < p > The < code > collect_metrics</ code > function is used to aggregate the < em > mean</ em > and < em > standard error</ em >
+ < p > The < code > collect_metrics</ code > function is used to aggregate the < em > mean</ em > and < em > standard error</ em >
of the classifier’s validation accuracy across the folds. You will find results
related to the accuracy in the row with < code > accuracy</ code > listed under the < code > .metric</ code > column.
You should consider the mean (< code > mean</ code > ) to be the estimated accuracy, while the standard