Our previous discussion of naive Bayes led us to the problem of overfitting, specifically in dealing with rare words for text classification. We investigated this problem a bit more formally in the context of probabilistic modeling and discussed maximum likelihood, maximum a posteriori, and Bayesian methods for parameter estimation. With a Bernoulli model for word presence in mind, we looked at the toy problem of estimating the bias of a coin from observed flips. We saw that general principles from probabilistic modeling reproduced our intuitive parameter estimates from last week, and, furthermore, justified the hack of smoothing counts to avoid overfitting. See Chapter 2 of Bishop for more details.

Some notes on probabilistic inference:

- Under maximum likelihood inference we define the “best” parameter values as those for which the observed data are most probable:

$$

\hat{\theta}_{ML} = \mathrm{argmax}_{\theta} ~ p(D | \theta).

$$

Employing this framework for the coin-flipping example reproduces the intuitive relative frequency estimate from last week: \(\hat{\theta}_{ML} = {n \over N} \), where \( n \) is the number of heads observed in \( N \) flips. Unfortunately, as we saw for spam classification, maximum likelihood estimates can lead to overfitting (e.g., when \( n = 0 \) or \( n = N \), the estimate assigns zero probability to one of the two outcomes).

- In a subtle but important distinction, maximum a posteriori (MAP) inference formulates the problem as one of identifying the most probable parameters given the observed data:

$$

\hat{\theta}_{MAP} = \mathrm{argmax}_{\theta} ~ p(\theta | D) = \mathrm{argmax}_{\theta} ~ p(D | \theta) p(\theta),

$$

where the right-hand side follows from an application of Bayes’ rule. Using a conjugate beta prior \( \mathrm{Beta}(\theta; \alpha, \beta) \) for the coin-flipping example reproduces the “smoothed” estimate: \(\hat{\theta}_{MAP} = {n + \alpha - 1 \over N + \alpha + \beta - 2} \), where \( \alpha \) and \( \beta \) act as pseudocounts for the number of heads and tails seen before any actual data are observed. This allows us to address the overfitting problem by specifying the shape of the prior distribution via \( \alpha \) and \( \beta \). Setting \( \alpha \) and \( \beta \) to 1 corresponds to a uniform prior distribution and recovers the maximum likelihood estimate above, while larger values of \( \alpha \) and \( \beta \) bias our estimates more heavily towards the prior.

- Bayesian inference dispenses with the idea of point estimates for the parameters altogether, in favor of keeping full distributions over all unknown quantities:

$$

p(\theta | D) = {p(D | \theta) p(\theta) \over p(D)}

$$

In contrast to MAP estimation, this requires calculation of the normalizing constant \( p(D) \), often referred to as the marginal likelihood or evidence. Fortunately, the choice of a conjugate prior distribution reduces the potentially difficult task of calculating the evidence to simple algebra. For example, the posterior distribution for the coin-flipping example with a beta prior is simply a beta distribution with updated hyperparameters, \( \mathrm{Beta}(\theta; n + \alpha, N - n + \beta) \). Likewise, the predictive distribution \( p(x | D) \), which provides the probability of future outcomes given the observed data, amounts to a simple ratio of beta normalizing constants; for the coin it reduces to \( p(\text{heads} | D) = {n + \alpha \over N + \alpha + \beta} \).
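
As a concrete illustration, all three estimates for the coin-flipping example reduce to one-line formulas. The counts and hyperparameters below are hypothetical values chosen for the sketch:

```python
# Coin-flipping example: n heads observed out of N flips,
# with a Beta(alpha, beta) prior. Hypothetical counts for illustration.
n, N = 7, 10            # observed heads / total flips
alpha, beta = 2.0, 2.0  # pseudocounts for heads and tails

# Maximum likelihood: relative frequency.
theta_ml = n / N

# MAP: mode of the Beta posterior (formula assumes alpha, beta > 1).
theta_map = (n + alpha - 1) / (N + alpha + beta - 2)

# Bayesian: the full posterior is Beta(n + alpha, N - n + beta);
# the predictive probability of heads is its mean.
p_heads = (n + alpha) / (N + alpha + beta)

print(theta_ml, theta_map, p_heads)
```

Note that the MAP estimate and the predictive probability both lie between the relative frequency and the prior mean, with the prior's pull controlled by the magnitude of the pseudocounts.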

Returning to naive Bayes for text classification, any of the above methods may be used to estimate the many (independent) parameter values in our model. Maximum likelihood estimation has no tunable hyperparameters, whereas MAP estimation and Bayesian inference require the specification of a prior distribution that, in practice, is often tuned for optimal generalization error. Regardless of which method we choose, however, naive Bayes still models features as conditionally independent given the class label.
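
To make the connection back to text classification concrete, here is a minimal sketch of estimating smoothed per-word Bernoulli parameters with MAP estimation; the toy corpus, function name, and default pseudocounts are my own illustrative choices, not anything fixed by the notes:

```python
from collections import defaultdict

def train_bernoulli_nb(docs, labels, alpha=2.0, beta=2.0):
    """MAP-estimate per-(class, word) Bernoulli parameters.

    Each theta[c][w] uses a Beta(alpha, beta) prior, giving
    (count + alpha - 1) / (n_docs_in_class + alpha + beta - 2);
    the mode formula assumes alpha, beta > 1.
    """
    vocab = sorted({w for d in docs for w in d})
    counts = defaultdict(lambda: defaultdict(int))  # counts[c][w]
    class_totals = defaultdict(int)                 # docs per class
    for words, c in zip(docs, labels):
        class_totals[c] += 1
        for w in set(words):  # Bernoulli model: presence, not frequency
            counts[c][w] += 1
    theta = {
        c: {w: (counts[c][w] + alpha - 1) / (total + alpha + beta - 2)
            for w in vocab}
        for c, total in class_totals.items()
    }
    return theta, vocab

# Toy corpus: each document is a list of words.
docs = [["cheap", "pills"], ["meeting", "today"], ["cheap", "meeting"]]
labels = ["spam", "ham", "ham"]
theta, vocab = train_bernoulli_nb(docs, labels)
# "pills" never appears in a ham document, yet its smoothed
# estimate stays strictly positive, avoiding the zero-count problem.
assert theta["ham"]["pills"] > 0
```

With maximum likelihood (effectively `alpha = beta = 1` here), the unseen word "pills" would get probability zero under ham and veto that class outright; the pseudocounts are exactly the smoothing hack the probabilistic framing justifies.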

During the second part of class, we looked at a Python script to fetch email data from an IMAP server as well as some shell scripting to parse raw email data, and concluded with an introduction to Python.