# Chapter 14 Naive Bayes Classification

There exist multiple penguin species throughout Antarctica, including the *Adelie*, *Chinstrap*, and *Gentoo*.
When encountering one of these penguins on an Antarctic trip, we might *classify* its species

\[Y = \begin{cases} A & \text{ Adelie} \\ C & \text{ Chinstrap} \\ G & \text{ Gentoo} \\ \end{cases}\]

by examining various physical characteristics, such as whether the penguin weighs more than the average 4200g,

\[X_{1} = \begin{cases} 1 & \text{ above-average weight} \\ 0 & \text{ below-average weight} \\ \end{cases}\]

as well as measurements of the penguin’s bill

\[\begin{split} X_{2} & = \text{ bill length (in mm)} \\ X_{3} & = \text{ flipper length (in mm)} \\ \end{split}\]

The `penguins_bayes`

data, originally made available by Gorman, Williams, and Fraser (2014) and distributed by Horst, Hill, and Gorman (2020), contains the above species and feature information for a sample of 344 Antarctic penguins:

```
# Load packages
library(bayesrules)
library(tidyverse)
library(e1071)
library(janitor)
# Load data
data(penguins_bayes)
<- penguins_bayes penguins
```

Among these penguins, 152 are Adelies, 68 are Chinstraps, and 124 are Gentoos. We’ll assume throughout that the proportional breakdown of these species in our dataset reflects the species breakdown in the wild. That is, our prior assumption about any new penguin is that it’s most likely an Adelie and least likely a Chinstrap:

```
%>%
penguins tabyl(species)
species n percent152 0.4419
Adelie 68 0.1977
Chinstrap 124 0.3605 Gentoo
```

Before proceeding with our analysis of this data, take a quick quiz.

Explain why neither the Normal nor logistic regression model would be appropriate for classifying a penguin’s species \(Y\) from its physical characteristics \(X\).

Since species \(Y\) is *categorical*, Normal regression isn’t an appropriate tool for understanding its relationship with physical characteristics \(X\).
And though logistic regression is (currently) our only tool for modeling and classifying categorical response variables, it too is inappropriate here.
Recall that with a Bernoulli data structure, the logistic regression model is limited to classifying *binary* response variables.
Yet our species variable \(Y\) has *three* categories.
Luckily, there’s a tool for everything.
In Chapter 14 we’ll explore an alternative to logistic regression: **naive Bayes classification**.
Relative to Bayesian logistic regression, naive Bayes classification has a few advantages:

- it can classify categorical response variables \(Y\) with two
*or more*categories; - it doesn’t require much theory beyond Bayes’ Rule; and
- it’s computationally efficient, i.e., doesn’t require MCMC simulation.

Sounds fantastic. But as you might expect, these benefits don’t come without a cost. We’ll observe throughout this chapter that naive Bayes classification is called “naive” for a reason.

- Explore the inner workings of naive Bayes classification.
- Implement naive Bayes classification in R.
- Develop strategies to evaluate the quality of naive Bayes classifications.

## 14.1 Classifying one penguin

We’ll start our naive Bayes classification with just a single penguin. Suppose an Antarctic researcher comes across a penguin that weighs less than 4200g with a 195mm-long flipper and 50mm-long bill. Our goal is to help this researcher identify the species of this penguin, Adelie, Chinstrap, or Gentoo.

### 14.1.1 One categorical predictor

Let’s begin by considering just the information we have on the *categorical* predictor \(X_1\): the penguin weighs less than 4200g, i.e., is not `above_average_weight`

.
Take the following quiz to first build some intuition.^{64}

Based on the plot below, for which species is a below-average weight most likely?

```
ggplot(penguins %>% drop_na(above_average_weight),
aes(fill = above_average_weight, x = species)) +
geom_bar(position = "fill")
```

```
This is an untitled chart with no subtitle or caption.
It has x-axis 'species' with labels Adelie, Chinstrap and Gentoo.
It has y-axis 'count' with labels 0.00, 0.25, 0.50, 0.75 and 1.00.
There is a legend indicating fill is used to show above_average_weight, with 2 levels:
0 shown as black fill and
1 shown as brilliant blue fill.
The chart is a bar chart with 6 vertical bars.
Bar 1 is centered horizontally at Adelie, and spans vertically from 0.17 to 1 with fill colour black which maps to above_average_weight = 0.
Bar 2 is centered horizontally at Adelie, and spans vertically from 0 to 0.17 with fill colour brilliant blue which maps to above_average_weight = 1.
Bar 3 is centered horizontally at Chinstrap, and spans vertically from 0.1 to 1 with fill colour black which maps to above_average_weight = 0.
Bar 4 is centered horizontally at Chinstrap, and spans vertically from 0 to 0.1 with fill colour brilliant blue which maps to above_average_weight = 1.
Bar 5 is centered horizontally at Gentoo, and spans vertically from 0.95 to 1 with fill colour black which maps to above_average_weight = 0.
Bar 6 is centered horizontally at Gentoo, and spans vertically from 0 to 0.95 with fill colour brilliant blue which maps to above_average_weight = 1.
These are stacked and shown as propotions of above_average_weight.
```

Notice in Figure 14.1 that Chinstrap penguins are the most likely to be below average weight.
Yet before declaring this penguin to be a Chinstrap, we should consider that this is the *rarest* species to encounter in the wild.
That is, we have to think like *Bayesians* by combining the information from our *data* with our *prior* information about species membership to construct a posterior model for our penguin’s species.
The naive Bayes classification approach to this task is nothing more than a direct appeal to the tried-and-true Bayes’ Rule from Chapter 2.
In general, to calculate the posterior probability that our penguin is species \(Y = y\) in light of its weight status \(X_1 = x_1\), we can plug into

\[\begin{equation} f(y \; | \; x_1) = \frac{\text{ prior} \cdot \text{likelihood}}{\text{ normalizing constant }} = \frac{f(y)L(y \; | \; x_1)}{f(x_1)} \tag{14.1} \end{equation}\]

where, by the Law of Total Probability,

\[\begin{equation} \begin{split} f(x_1) & = \sum_{\text{all } y'} f(y')L(y' \; | \; x_1) \\ & = f(y' = A)L(y'=A|x_1) + f(y' = C)L(y'=C|x_1) + f(y' = G)L(y'=G|x_1).\\ \end{split} \tag{14.2} \end{equation}\]

A table which breaks down the `above_average_weight`

weight status by `species`

provides the necessary information to complete this Bayesian calculation:

```
%>%
penguins select(species, above_average_weight) %>%
na.omit() %>%
tabyl(species, above_average_weight) %>%
adorn_totals(c("row", "col"))
0 1 Total
species 126 25 151
Adelie 61 7 68
Chinstrap 6 117 123
Gentoo 193 149 342 Total
```

In fact, we can directly calculate the posterior model of our penguin’s species from this table. For example, notice that among the 193 penguins that are below average weight, 126 are Adelies. Thus, there’s a roughly 65% posterior chance that this penguin is an Adelie:

\[f(y = A \; | \; x_1 = 0) = \frac{126}{193} \approx 0.6528 .\]

Let’s confirm this result by plugging information from our `tabyl`

above into Bayes’ Rule (14.1).
This tedious step is not to annoy, but to practice for generalizations we’ll have to make in more complicated settings.
First, our prior information about species membership indicates that Adelies are the most common and Chinstraps the least:

\[\begin{equation} f(y = A) = \frac{151}{342}, \;\;\;\;\; f(y = C) = \frac{68}{342}, \;\;\;\;\; f(y = G) = \frac{123}{342} . \tag{14.3} \end{equation}\]

Further, the likelihoods demonstrate that below-average weight is *most* common among Chinstrap penguins.
For example, 89.71% of Chinstraps but only 4.88% of Gentoos are below average weight:

\[\begin{split} L(y = A \; | \; x_1 = 0) & = \frac{126}{151} \approx 0.8344 \\ L(y = C \; | \; x_1 = 0) & = \frac{61}{68} \approx 0.8971 \\ L(y = G \; | \; x_1 = 0) & = \frac{6}{123} \approx 0.0488 \\ \end{split}\]

Plugging these priors and likelihoods into (14.2), the *total* probability of observing a below-average weight penguin across all species is

\[f(x_1 = 0) = \frac{151}{342} \cdot \frac{126}{151} + \frac{68}{342} \cdot \frac{61}{68} + \frac{123}{342} \cdot \frac{6}{123} = \frac{193}{342} .\]

Finally, by Bayes’ Rule we can confirm that there’s a 65% posterior chance that this penguin is an Adelie:

\[\begin{split} f(y = A \; | \; x_1 = 0) & = \frac{f(y = A) L(y = A \; | \; x_1 = 0)}{f(x_1 = 0)} \\ & = \frac{(151/342) \cdot (126/151)}{193/342} \\ & \approx 0.6528 . \\ \end{split}\]

Similarly, we can show that

\[f(y = C \; | \; x_1 = 0) \approx 0.3161 \;\;\;\; \text{ and } \;\;\;\; f(y = G \; | \; x_1 = 0) \approx 0.0311 . \]

All in all, the posterior probability that this penguin is an Adelie is more than double that of the other two species. Thus, our naive Bayes classification, based on our prior information and the penguin’s below-average weight alone, is that this penguin is an Adelie. Though below-average weight is relatively less common among Adelies than among Chinstraps, the final classification was pushed over the edge by the fact that Adelies are far more common.

### 14.1.2 One quantitative predictor

Let’s ignore the penguin’s weight for now and classify its species using only the fact that it has a 50mm-long bill.
Take the following quiz to build some intuition.^{65}

Based on Figure 14.2, among which species is a 50mm-long bill the most common?

```
ggplot(penguins, aes(x = bill_length_mm, fill = species)) +
geom_density(alpha = 0.7) +
geom_vline(xintercept = 50, linetype = "dashed")
```

```
This is an untitled chart with no subtitle or caption.
It has x-axis 'bill_length_mm' with labels 40, 50 and 60.
It has y-axis 'density' with labels 0.00, 0.05 and 0.10.
There is a legend indicating fill is used to show species, with 3 levels:
Adelie shown as black fill,
Chinstrap shown as brilliant blue fill and
Gentoo shown as strong orange yellow fill.
It has 2 layers.
Layer 1 is a density graph that VI can not process.
Layer 1 has alpha set to 0.7.
Layer 2 is a vline graph that VI can not process.
Layer 2 has linetype set to dashed.
```

Notice from the plot that a 50mm-long bill would be *extremely* long for an Adelie penguin.
Though the distinction in bill lengths is less dramatic among the Chinstrap and Gentoo, Chinstrap bills tend to be a tad longer.
In particular, a 50mm-long bill is fairly average for Chinstraps, though slightly long for Gentoos.
Thus, again, our *data* points to our penguin being a Chinstrap.
And again, we must weigh this data against the fact that Chinstraps are the rarest of these three species.
To this end, we can appeal to Bayes’ Rule to make a naive Bayes classification of \(Y\) from information that bill length \(X_2 = x_2\):

\[\begin{equation} f(y \; | \; x_2) = \frac{f(y)L(y \; | \; x_2)}{f(x_2)} = \frac{f(y)L(y \; | \; x_2)}{\sum_{\text{all } y'} f(y')L(y' \; | \; x_2)} . \tag{14.4} \end{equation}\]

Our question for you is this: What is the likelihood that we would observe a 50mm-long bill if the penguin is an Adelie, \(L(y = A | x_2 = 50)\)?
Unlike the penguin’s *categorical* weight status (\(X_1\)), its bill length is *quantitative*.
Thus, we can’t simply calculate \(L(y = A \; | \; x_2 = 50)\) from a table of `species`

vs `bill_length_mm`

.
Further, we haven’t assumed a model for data \(X_2\) from which to define likelihood function \(L(y|x_2)\).
This is where one “naive” part of naive Bayes classification comes into play.
The naive Bayes method typically assumes that any quantitative predictor, here \(X_2\), is **continuous** and **conditionally Normal**.
That is, within each species, bill lengths \(X_2\) are Normally distributed with possibly different means \(\mu\) and standard deviations \(\sigma\):

\[\begin{split} X_2 \; | \; (Y = A) & \sim N(\mu_A, \sigma_A^2) \\ X_2 \; | \; (Y = C) & \sim N(\mu_C, \sigma_C^2) \\ X_2 \; | \; (Y = G) & \sim N(\mu_G, \sigma_G^2) \\ \end{split} .\]

Though this is a naive blanket assumption to make for all quantitative predictors, it turns out to be reasonable for bill length \(X_2\). Reexamining Figure 14.2, notice that within each species, bill lengths are roughly bell shaped around different means and with slightly different standard deviations. Thus, we can tune the Normal model for each species by setting its \(\mu\) and \(\sigma\) parameters to the observed sample means and standard deviations in bill lengths within that species. For example, we’ll tune the Normal model of bill lengths for the Adelie species to have mean and standard deviation \(\mu_A = 38.8\)mm and \(\sigma_A = 2.66\)mm:

```
# Calculate sample mean and sd for each Y group
%>%
penguins group_by(species) %>%
summarize(mean = mean(bill_length_mm, na.rm = TRUE),
sd = sd(bill_length_mm, na.rm = TRUE))
# A tibble: 3 × 3
species mean sd<fct> <dbl> <dbl>
1 Adelie 38.8 2.66
2 Chinstrap 48.8 3.34
3 Gentoo 47.5 3.08
```

Plotting the tuned Normal models for each species confirms that this naive Bayes assumption isn’t *perfect* – it’s a bit more idealistic than the density plots of the raw data in Figure 14.2.
But it’s fine enough to continue.

```
ggplot(penguins, aes(x = bill_length_mm, color = species)) +
stat_function(fun = dnorm, args = list(mean = 38.8, sd = 2.66),
aes(color = "Adelie")) +
stat_function(fun = dnorm, args = list(mean = 48.8, sd = 3.34),
aes(color = "Chinstrap")) +
stat_function(fun = dnorm, args = list(mean = 47.5, sd = 3.08),
aes(color = "Gentoo")) +
geom_vline(xintercept = 50, linetype = "dashed")
```

```
This is an untitled chart with no subtitle or caption.
It has x-axis 'bill_length_mm' with labels 40, 50 and 60.
It has y-axis 'y' with labels 0.00, 0.05, 0.10 and 0.15.
There is a legend indicating colour is used to show species, with 3 levels:
Adelie shown as black colour,
Chinstrap shown as brilliant blue colour and
Gentoo shown as strong orange yellow colour.
It has 4 layers.
Layer 1 is a function graph that VI can not process.
Layer 2 is a function graph that VI can not process.
Layer 3 is a function graph that VI can not process.
Layer 4 is a vline graph that VI can not process.
Layer 4 has linetype set to dashed.
```

Recall that this Normality assumption provides the mechanism we need to evaluate the likelihood of observing a 50mm-long bill among each of the three species, \(L(y \; | \; x_2 = 50)\).
Connecting back to Figure 14.3, these likelihoods correspond to the Normal density curve heights at a bill length of 50 mm.
Thus, observing a 50mm-long bill is slightly more likely among Chinstrap than Gentoo, and *highly* unlikely among Adelie penguins.
More specifically, we can calculate the likelihoods using `dnorm()`

.

```
# L(y = A | x_2 = 50)
dnorm(50, mean = 38.8, sd = 2.66)
1] 2.12e-05
[
# L(y = C | x_2 = 50)
dnorm(50, mean = 48.8, sd = 3.34)
1] 0.112
[
# L(y = G | x_2 = 50)
dnorm(50, mean = 47.5, sd = 3.08)
1] 0.09317 [
```

We now have everything we need to plug into Bayes’ Rule (14.4). Weighting the likelihoods by the prior probabilities of each species (à la (14.2)), the marginal pdf of observing a penguin with a 50mm-long bill is

\[f(x_2 = 50) = \frac{151}{342} \cdot 0.0000212 + \frac{68}{342} \cdot 0.112 + \frac{123}{342} \cdot 0.09317 = 0.05579 .\]

Thus, the posterior probability that our penguin is an Adelie is

\[f(y = A \; | \; x_2 = 50) = \frac{(151/342) \cdot 0.0000212}{0.05579} \approx 0.0002 .\]

Similarly, we can calculate

\[f(y = C \; | \; x_2 = 50) \approx 0.3992 \;\;\;\; \text{ and } \;\;\;\; f(y = G \; | \; x_2 = 50) \approx 0.6006 . \]

It follows that our naive Bayes classification, based on our prior information and penguin’s bill length alone, is that this penguin is a Gentoo – it has the highest posterior probability. Though a 50mm-long bill is relatively less common among Gentoo than among Chinstrap, the final classification was pushed over the edge by the fact that Gentoo are much more common in the wild.

### 14.1.3 Two predictors

We’ve now made two naive Bayes classifications of our penguin’s species, one based solely on the fact that our penguin has below-average weight and the other based solely on its 50mm-long bill (in addition to our prior information).
And these classifications disagree: we classified the penguin as Adelie in the former analysis and Gentoo in the latter.
This discrepancy indicates that there’s room for improvement in our naive Bayes classification method.
In particular, instead of relying on any one predictor alone, we can incorporate *multiple* predictors into our classification process.

Consider the information that our penguin has a bill length of \(X_2 = 50\)mm *and* a flipper length of \(X_3 = 195\)mm.
Either one of these measurements alone might lead to a misclassification.
Just as it’s tough to distinguish between the Chinstrap and Gentoo penguins based on their bill lengths alone, it’s tough to distinguish between the Chinstrap and Adelie penguins based on their flipper lengths alone (Figure 14.4).

```
ggplot(penguins, aes(x = bill_length_mm, fill = species)) +
geom_density(alpha = 0.6)
ggplot(penguins, aes(x = flipper_length_mm, fill = species)) +
geom_density(alpha = 0.6)
```

```
This is an untitled chart with no subtitle or caption.
It has x-axis '' with labels .
It has y-axis '' with labels .
It has 2 layers.
Layer 1 is a drawgrob graph that VI can not process.
Layer 1 has xmin set to 0.
Layer 1 has xmax set to 0.772727272727273.
Layer 1 has ymin set to 0.
Layer 1 has ymax set to 1.
Layer 2 is a drawgrob graph that VI can not process.
Layer 2 has xmin set to 0.772727272727273.
Layer 2 has xmax set to 1.
Layer 2 has ymin set to 0.
Layer 2 has ymax set to 1.
```

BUT the species are fairly distinguishable when we *combine* the information about bill and flipper lengths.
Our penguin with a 50mm-long bill and 195mm-long flipper, represented at the intersection of the dashed lines in Figure 14.5, now lies squarely among the Chinstrap observations:

```
ggplot(penguins,
aes(x = flipper_length_mm, y = bill_length_mm, color = species)) +
geom_point()
```

```
This is an untitled chart with no subtitle or caption.
It has x-axis 'flipper_length_mm' with labels 180, 200 and 220.
It has y-axis 'bill_length_mm' with labels 30, 40, 50 and 60.
There is a legend indicating colour is used to show species, with 3 levels:
Adelie shown as black colour,
Chinstrap shown as brilliant blue colour and
Gentoo shown as strong orange yellow colour.
It has 2 layers.
Layer 1 is a set of 342 points.
Layer 1 has size set to 0.5.
Layer 2 is a segment graph that VI can not process.
Layer 2 has colour set to black.
Layer 2 has linetype set to dashed.
```

Let’s use naive Bayes classification to balance this data with our prior information on species membership. To calculate the posterior probability that the penguin is species \(Y = y\), we can adjust Bayes’ Rule to accommodate our two predictors, \(X_2 = x_2\) and \(X_3 = x_3\):

\[\begin{equation} f(y \; | \; x_2, x_3) = \frac{f(y) L(y \; | \; x_2, x_3)}{\sum_{\text{all } y'} f(y')L(y' \; | \; x_2, x_3)} . \tag{14.5} \end{equation}\]

This presents yet another new twist: How can we calculate the likelihood function that incorporates *two* variables, \(L(y \; | \; x_2, x_3)\)?
This is where yet another “naive” assumption creeps in.
Naive Bayes classification assumes that predictors are **conditionally independent**, thus

\[L(y \; | \; x_2, x_3) = f(x_2, x_3 \; | \; y) = f(x_2 \; | \; y)f(x_3 \; | \; y) .\]

In words, within each species, we assume that the length of a penguin’s bill has no relationship to the length of its flipper.
Mathematically and computationally, this assumption makes the naive Bayes algorithm *efficient* and manageable.
However, it might also make it *wrong*.
Revisit Figure 14.5.
Within each species, flipper length and bill length appear to be positively correlated, *not* independent.
Yet, we’ll naively roll with the imperfect independence assumption, and thus the possibility that our classification accuracy might be weakened.

Combined then, the multivariable naive Bayes model assumes that our two predictors are Normal and conditionally independent. We already tuned this Normal model for bill length \(X_2\) in Section 14.1.2. Similarly, we can tune the species-specific Normal models for flipper length \(X_3\) to match the corresponding sample means and standard deviations:

```
# Calculate sample mean and sd for each Y group
%>%
penguins group_by(species) %>%
summarize(mean = mean(flipper_length_mm, na.rm = TRUE),
sd = sd(flipper_length_mm, na.rm = TRUE))
# A tibble: 3 × 3
species mean sd<fct> <dbl> <dbl>
1 Adelie 190. 6.54
2 Chinstrap 196. 7.13
3 Gentoo 217. 6.48
```

Accordingly, we can evaluate the likelihood of observing a 195-mm flipper length for each of the three species, \(L(y \; | \; x_3 = 195) = f(x_3 = 195 \; | \; y)\):

```
# L(y = A | x_3 = 195)
dnorm(195, mean = 190, sd = 6.54)
1] 0.04554
[
# L(y = C | x_3 = 195)
dnorm(195, mean = 196, sd = 7.13)
1] 0.05541
[
# L(y = G | x_3 = 195)
dnorm(195, mean = 217, sd = 6.48)
1] 0.0001934 [
```

For each species, we now have the likelihood of observing a bill length of \(X_2 = 50\) mm (Section 14.1.2), the likelihood of observing a flipper length of \(X_3 = 195\) mm (above), and the prior probability (14.3).
Combined, the likelihoods of observing a 50mm-long bill *and* a 195mm-long flipper for each species \(Y = y\), weighted by the prior probability of the species are:

\[\begin{split} f(y' = A)L(y' = A \; | \; x_2 = 50, x_3 = 195) & = \frac{151}{342} \cdot 0.0000212 \cdot 0.04554 \\ f(y' = C)L(y' = C \; | \; x_2 = 50, x_3 = 195) & = \frac{68}{342} \cdot 0.112 \cdot 0.05541 \\ f(y' = G)L(y' = G \; | \; x_2 = 50, x_3 = 195) & = \frac{123}{342} \cdot 0.09317 \cdot 0.0001934 \\ \end{split}\]

with a sum of

\[\sum_{\text{all } y'}f(y')L(y' \; | \; x_2 = 50, x_3 = 195) \approx 0.001241 .\]

And plugging into Bayes’ Rule (14.5), the posterior probability that the penguin is an Adelie is

\[f(y = A | x_2 = 50, x_3 = 195) = \frac{\frac{151}{342} \cdot 0.0000212 \cdot 0.04554}{0.001241} \approx 0.0003 .\]

Similarly, we can calculate

\[\begin{split} f(y = C | x_2 = 50, x_3 = 195) & \approx 0.9944 \\ f(y = G | x_2 = 50, x_3 = 195) & \approx 0.0052 . \end{split}\]

In conclusion, our penguin is *almost certainly* a Chinstrap.
Though we didn’t come to this conclusion using any physical characteristic alone, together they paint a pretty clear picture.

**Naive Bayes classification**

Let \(Y\) denote a *categorical* response variable with two or more categories and \((X_1,X_2,\dots,X_p)\) be a set of \(p\) possible predictors of \(Y\).
Then the posterior probability that a new case with observed predictors \((X_1,X_2,\ldots,X_p) = (x_1,x_2,\ldots,x_p)\) belongs to class \(Y = y\) is

\[f(y | x_1, x_2, \ldots, x_p) = \frac{f(y)L(y \; | \; x_1, x_2, \ldots, x_p)}{\sum_{\text{all } y' }f(y')L(y' \; | \; x_1, x_2, \ldots, x_p)} .\]

Naive Bayes classification makes the following naive assumptions about the likelihood function of \(Y\):

Predictors \(X_i\) are

**conditionally independent**, thus\[L(y \; | \; x_1, x_2, \ldots, x_p) = \prod_{i=1}^p f(x_i \; | \; y)\; .\]

For

**categorical predictors**\(X_i\), the conditional pmf\[f(x_i \; | \; y) = P(X_i = x_i \; | \; Y = y)\]

is estimated by the sample proportion of observed \(Y = y\) cases for which \(X_i = x_i\).

For

**quantitative predictors**\(X_i\), the conditional pmf / pdf \(f(x_i \; | \; y)\) is defined by a Normal model:\[X_i | (Y = y) \sim N(\mu_{iy}, \sigma^2_{iy}).\]

That is, within each category \(Y = y\), we assume that \(X_i\) is Normally distributed around some mean \(\mu_{iy}\) with standard deviation \(\sigma_{iy}\). Further, we estimate \(\mu_{iy}\) and \(\sigma_{iy}\) by the sample mean and standard deviation of the observed \(X_i\) values in category \(Y = y\).

## 14.2 Implementing & evaluating naive Bayes classification

That was nice, but we needn’t do all of this work by hand.
To implement naive Bayes classification in R, we’ll use the `naiveBayes()`

function in the **e1071** package (Meyer et al. 2021).
As with `stan_glm()`

, we feed `naiveBayes()`

the `data`

and a formula indicating which data variables to use in the analysis.
Yet, since naive Bayes calculates prior probabilities directly from the data and implementation doesn’t require MCMC simulation, we don’t have to worry about providing information regarding prior models or Markov chains.
Below we build two naive Bayes classification algorithms, one using `bill_length_mm`

alone and one that also incorporates `flipper_length_mm`

:

```
<- naiveBayes(species ~ bill_length_mm, data = penguins)
naive_model_1 <- naiveBayes(species ~ bill_length_mm + flipper_length_mm,
naive_model_2 data = penguins)
```

Let’s apply both of these to classify `our_penguin`

that we studied throughout Section 14.1:

`<- data.frame(bill_length_mm = 50, flipper_length_mm = 195) our_penguin `

Beginning with `naive_model_1`

, the `predict()`

function returns the posterior probabilities of each species along with a final classification.
This classification follows a simple rule: classify the penguin as whichever species has the highest posterior probability.
The results of this process are similar to those that we got “by hand” in Section 14.1.2, just off a smidge due to some rounding error.
Essentially, based on its bill length alone, our best guess is that this penguin is a Gentoo:

```
predict(naive_model_1, newdata = our_penguin, type = "raw")
Adelie Chinstrap Gentoo1,] 0.000169 0.3978 0.602
[predict(naive_model_1, newdata = our_penguin)
1] Gentoo
[: Adelie Chinstrap Gentoo Levels
```

And just as we concluded in Section 14.1.3, if we take into account *both* its bill length and flipper length, our best guess is that this penguin is a Chinstrap:

```
predict(naive_model_2, newdata = our_penguin, type = "raw")
Adelie Chinstrap Gentoo1,] 0.0003446 0.9949 0.004787
[predict(naive_model_2, newdata = our_penguin)
1] Chinstrap
[: Adelie Chinstrap Gentoo Levels
```

We can similarly apply our naive Bayes models to classify any number of penguins.
As with logistic regression, we’ll take two common approaches to evaluating the *accuracy* of these classifications:

- construct
**confusion matrices**which compare the*observed*species of our sample penguins to their naive Bayes species*classifications*; and - for a better sense of how well our naive Bayes models classify
*new*penguins, calculate**cross-validated**estimates of classification accuracy.

To begin with the first approach, we classify each of the `penguins`

using both `naive_model_1`

and `naive_model_2`

and store these in `penguins`

as `class_1`

and `class_2`

:

```
<- penguins %>%
penguins mutate(class_1 = predict(naive_model_1, newdata = .),
class_2 = predict(naive_model_2, newdata = .))
```

The classification results are shown below for four randomly sampled penguins, contrasted against the actual `species`

of these penguins.
For the last two penguins, the two models produce the *same* classifications (Adelie) and these classifications are correct.
For the first two penguins, the two models lead to *different* classifications.
In both cases, `naive_model_2`

is correct.

```
set.seed(84735)
%>%
penguins sample_n(4) %>%
select(bill_length_mm, flipper_length_mm, species, class_1, class_2) %>%
rename(bill = bill_length_mm, flipper = flipper_length_mm)
# A tibble: 4 × 5
bill flipper species class_1 class_2 <dbl> <int> <fct> <fct> <fct>
1 47.5 199 Chinstrap Gentoo Chinstrap
2 40.9 214 Gentoo Adelie Gentoo
3 41.3 194 Adelie Adelie Adelie
4 38.5 190 Adelie Adelie Adelie
```

Of course, it remains to be seen whether `naive_model_2`

outperforms `naive_model_1`

overall.
To this end, the confusion matrices below summarize the models’ classification accuracy across all penguins in our sample.
Use these to complete the following quiz.^{66}

- In
`naive_model_1`

, as what other species is the Chinstrap most commonly misclassified? - Which model is better at classifying Chinstrap penguins?
- Which model has a higher overall accuracy rate?

```
# Confusion matrix for naive_model_1
%>%
penguins tabyl(species, class_1) %>%
adorn_percentages("row") %>%
adorn_pct_formatting(digits = 2) %>%
adorn_ns()
species Adelie Chinstrap Gentoo95.39% (145) 0.00% (0) 4.61% (7)
Adelie 5.88% (4) 8.82% (6) 85.29% (58)
Chinstrap 6.45% (8) 4.84% (6) 88.71% (110)
Gentoo
# Confusion matrix for naive_model_2
%>%
penguins tabyl(species, class_2) %>%
adorn_percentages("row") %>%
adorn_pct_formatting(digits = 2) %>%
adorn_ns()
species Adelie Chinstrap Gentoo96.05% (146) 2.63% (4) 1.32% (2)
Adelie 7.35% (5) 86.76% (59) 5.88% (4)
Chinstrap 0.81% (1) 0.81% (1) 98.39% (122) Gentoo
```

With these prompts in mind, let’s examine the two confusion matrices.
One quick observation is that `naive_model_2`

does a better job across the board.
Not only are its classification accuracy rates for *each* of the Adelie, Chinstrap, and Gentoo species higher than in `naive_model_1`

, its *overall* accuracy rate is also higher.
The `naive_model_2`

correctly classifies 327 (146 \(+\) 59 \(+\) 122) of the 344 total penguins (95%), whereas `naive_model_1`

correctly classifies only 261 (76%).
Where `naive_model_2`

enjoys the *greatest* improvement over `naive_model_1`

is in the classification of Chinstrap penguins.
In `naive_model_1`

, only 9% of Chinstraps are correctly classified, with a whopping 85% being misclassified as Gentoo.
At 87%, the classification accuracy rate for Chinstraps is *much* higher in `naive_model_2`

.

Finally, for due diligence, we can utilize 10-fold **cross-validation** to evaluate and compare how well our naive Bayes classification models classify *new* penguins, not just those in our sample.
We do so using the `naive_classification_summary_cv()`

function in the **bayesrules** package:

```
set.seed(84735)
<- naive_classification_summary_cv(
cv_model_2 model = naive_model_2, data = penguins, y = "species", k = 10)
```

The `cv_model_2$folds`

object contains the classification accuracy rates for each of the 10 folds whereas `cv_model_2$cv`

averages the results across all 10 folds:

```
$cv
cv_model_2
species Adelie Chinstrap Gentoo96.05% (146) 2.63% (4) 1.32% (2)
Adelie 7.35% (5) 86.76% (59) 5.88% (4)
Chinstrap 0.81% (1) 0.81% (1) 98.39% (122) Gentoo
```

The accuracy rates in this cross-validated confusion matrix are comparable to those in the non-cross-validated confusion matrix above.
This implies that our naive Bayes model appears to perform nearly as well on *new* penguins as it does on the original penguin sample that we used to build this model.

## 14.3 Naive Bayes vs logistic regression

Given the *three* penguin species categories, our classification analysis above required naive Bayes classification – taking logistic regression to this task wasn’t even an option.
However, in scenarios with a *binary* categorical response variable \(Y\), both logistic regression and naive Bayes are viable classification approaches.
In fact, we encourage you to revisit and apply naive Bayes to the rain classification example from Chapter 13.
In such binary settings, the question then becomes *which* classification tool to use.
Both naive Bayes and logistic regression have their pros and cons.
Though naive Bayes is certainly computationally efficient, it also makes some very rigid and often inappropriate assumptions about data structures.
You might have also noted that we lose some nuance with naive Bayes.
Unlike the logistic regression model with

\[\log\left(\frac{\pi}{1 - \pi}\right) = \beta_0 + \beta_1 X_1 + \cdots + \beta_k X_p ,\]

naive Bayes lacks regression coefficients \(\beta_i\). Thus, though naive Bayes can turn information about predictors \(X\) into classifications of \(Y\), it does so without much illumination of the relationships among these variables.

Whether naive Bayes or logistic regression is the right tool for a binary classification job depends upon the situation. In general, if the rigid naive Bayes assumptions are inappropriate or if you care about the specific connections between \(Y\) and \(X\) (i.e., you don’t simply want a set of classifications), you should use logistic regression. Otherwise, naive Bayes might be just the thing. Better yet, don’t choose! Try out and learn from both tools.

## 14.4 Chapter summary

Naive Bayes classification is a handy tool for classifying categorical response variables \(Y\) with *two or more* categories.
Letting \((X_1,X_2,\dots,X_p)\) be a set of \(p\) possible predictors of \(Y\), naive Bayes calculates the posterior probability of each category membership via Bayes’ Rule:

\[f(y | x_1, x_2, \ldots, x_p) = \frac{f(y)L(y \; | \; x_1, x_2, \ldots, x_p)}{\sum_{\text{all } y' }f(y')L(y' \; | \; x_1, x_2, \ldots, x_p)} .\]

In doing so, it makes some very naive assumptions about the data model from which we define the likelihood \(L(y \; | \; x_1, x_2, \ldots, x_p)\): the predictors \(X_i\) are **conditionally independent** and the values of *quantitative* predictors \(X_i\) vary Normally within each category \(Y = y\).
When appropriate, these simplifying assumptions make the naive Bayes model computationally efficient and straightforward to apply.
Yet when these simplifying assumptions are violated (which is common), the naive Bayes model can produce misleading classifications.

## 14.5 Exercises

### 14.5.1 Conceptual exercises

**Exercise 14.1 (Naive)**Why is naive Bayes classification called

*naive*?

**Exercise 14.2 (Which model?)**For each scenario below, indicate whether you could classify \(Y\) by \(X\) using logistic regression, naive Bayes classification, or both.

- We want to classify a person’s political affiliation \(Y\) (Democrat, Republican, or Other) by their age \(X\).
- We want to classify whether a person owns a car \(Y\) (yes or no) by their age \(X\).
- We want to classify whether a person owns a car \(Y\) (yes or no) by their location \(X\) (urban, suburban, or rural).

**Exercise 14.3 (Pros and cons)**Every modeling technique has some pros and cons.

- Name one pro of naive Bayes classification.
- Name one con of naive Bayes classification.

### 14.5.2 Applied exercises

**Exercise 14.4 (Fake news: exclamation points)**The

`fake_news`

data in the **bayesrules**package contains information about 150 news articles, some real news and some fake news. In the next exercises, our goal will be to develop a model that helps us classify an article’s

`type`

, real or fake, using a variety of predictors. To begin, let’s consider whether an article’s title has an exclamation point (`title_has_excl`

).
- Construct and discuss a visualization of the relationship between article
`type`

and`title_has_excl`

. - Suppose a new article is posted online and its title does
*not*have an exclamation point. Utilize naive Bayes classification to calculate the posterior probability that the article is real. Do so from scratch, without using`naiveBayes()`

with`predict()`

. - Check your work to part b using
`naiveBayes()`

with`predict()`

.

**Exercise 14.5 (Fake news: title length)**Consider another possible predictor of article type: the number of words in the title.

- Construct and discuss a visualization of the relationship between article
`type`

and the number of`title_words`

. - In using naive Bayes classification to classify an article’s
`type`

based on its`title_words`

, we assume that the number of`title_words`

are conditionally Normal. Do you think this is a fair assumption in this analysis? - Suppose a new article is posted online and its title has 15 words. Utilize naive Bayes classification to calculate the posterior probability that the article is real. Do so from scratch, without using
`naiveBayes()`

with`predict()`

. - Check your work to part c using
`naiveBayes()`

with`predict()`

.

**Exercise 14.6 (Fake news: title length and negative sentiment)**Of course, we can use more than one feature to help us classify whether an article is real or fake. Here, let’s consider both an article’s title length (

`title_words`

) and the percent of words in the article that have a `negative`

sentiment.
- Construct and discuss a visualization of the relationship between article
`type`

and`negative`

sentiment. - Construct a visualization of the relationship of article
`type`

with both`title_words`

and`negative`

. - Suppose a new article is posted online – it has a 15-word title and 6% of its words have negative associations. Utilize naive Bayes classification to calculate the posterior probability that the article is real. Do so from scratch, without using
`naiveBayes()`

with`predict()`

. - Check your work to part c using
`naiveBayes()`

with`predict()`

.

**Exercise 14.7 (Fake news: three predictors)**Suppose a new article is posted online – it has a 15-word title, 6% of its words have negative associations, and its title

*doesn’t*have an exclamation point. Based on these three features, utilize naive Bayes classification to calculate the posterior probability that the article is real. Do so using

`naiveBayes()`

with `predict()`

.
**Exercise 14.8 (Fake news: let’s pick a model)**We’ve now tried four different naive Bayes classification models of article

`type`

. In this exercise you’ll evaluate and compare the performance of these four models.
model | formula |
---|---|

`news_model_1` |
`type ~ title_has_excl` |

`news_model_2` |
`type ~ title_words` |

`news_model_3` |
`type ~ title_words + negative` |

`news_model_4` |
`type ~ title_words + negative + title_has_excl` |

- Construct a cross-validated confusion matrix for
`news_model_1`

. - Interpret each percentage in the confusion matrix.
- Similarly, construct cross-validated confusion matrices for the other three models.
- If our goal is to best detect when an article is
*fake*, which of the four models should we use?

**Exercise 14.9 (Logistic vs naive)**Naive Bayes isn’t the only approach we can take to classifying real vs fake news. Since article

`type`

is *binary*, we could also approach this task using

**Bayesian logistic regression**.

- Starting from weakly informative priors, use
`stan_glm()`

to simulate the posterior logistic regression model of news`type`

by all three predictors:`title_words`

,`negative`

, and`title_has_excl`

. - Perform and discuss some MCMC diagnostics.
- Construct and discuss a
`pp_check()`

. - Using a probability cutoff of 0.5, obtain cross-validated estimates of the sensitivity and specificity for the logistic regression model.
- Compare the cross-validated metrics of the logistic regression model to those of
`naive_model_4`

. Which model is better at detecting fake news?

### 14.5.3 Open-ended exercises

**Exercise 14.10 (Open exercise: climate change)**The

`pulse_of_the_nation`

data in the **bayesrules**package contains responses of 1000 adults to a wide-ranging survey. In one question, respondents were asked about their belief in

`climate_change`

, selecting from the following options: climate change is `Not Real at All`

, `Real but not Caused by People`

, or `Real and Caused by People`

. In this open-ended exercise, complete a naive Bayesian analysis of a person’s belief in `climate_change`

, selecting a variety of possible predictors that are of interest to you.
**Exercise 14.11 (Open exercise: vaccine safety)**The

`pulse_of_the_nation`

survey also asked respondents about their agreement with the statement that `vaccines_are_safe`

, selecting from the following options: `Neither Agree nor Disagree`

, `Somewhat Agree`

, `Somewhat Disagree`

, `Strongly Agree`

or `Strongly Disagree`

. In this open-ended exercise, complete a naive Bayesian analysis of belief in the safety of vaccines, selecting a variety of possible predictors that are of interest to you.
### References

*PLoS ONE*9(3) (e90081). https://doi.org/10.1371/journal.pone.0090081.

*Palmerpenguins: Palmer Archipelago (Antarctica) Penguin Data*. https://cran.r-project.org/web/packages/palmerpenguins/index.html.

*E1071: Misc Functions of the Department of Statistics, Probability Theory Group (Formerly: E1071), TU Wien*. https://CRAN.R-project.org/package=e1071.