Summary
Having had decent success in applying k-nearest neighbors (kNN) to hand-written digit recognition last week, we discussed several modifications and extensions of kNN. With text classification — e.g. spam filtering — as a motivating example, we then revisited some of the problems in applying kNN to high-dimensional problems, namely the curse of dimensionality. This led us to a simple alternative, naive Bayes for classification (Chapter 6, Segaran). In the latter part of class we explored some shell scripting to apply a primitive naive Bayes spam classifier to the Enron email data. Details follow.

kNN: Modifications & Extensions
As we saw last week, kNN did a reasonably good job at hand-written digit recognition — we achieved roughly 96% accuracy in classifying a data set of ~7000 16×16 pixel images. One of the niceties of this data set is that there are a relatively large number of training samples and pixel values are constrained to lie on the same scale (between -1.0 and +1.0). However, in many classification problems one has a heterogeneous set of features that may not be in the same range, which can cause problems when using kNN — e.g. if we’re classifying individuals and one feature is height (in meters) while the other is weight (in kilograms), our distance metric will be much more sensitive to differences in weight. (Likewise, measuring weight in pounds would increase the scale on this dimension even further.)
A simple pre-processing step to deal with such issues is to perform feature standardization, transforming all features to mean-zero, variance-one. An alternative, and more expensive option, is to fit an optimal scaling for each feature, as Segaran discusses. With this approach, we introduce a scaling factor alpha for each feature and use an optimization method — e.g. simulated annealing, which we’ve not yet discussed — to search over possible scalings, maximizing accuracy on test data.
We also discussed weighted kNN, in which we weight training examples by their distance to the query point when calculating predicted output values. The idea here is that if we’re looking at our 5 nearest neighbors, 3 of which are quite close and 2 far away, we might want to consider the labels of the 3 close points differently than the 2 distant ones. Several popular choices for weighting functions are: a hard threshold (corresponding to a ball of a fixed radius around the query point), a linearly-decreasing threshold, inverse weights, and Gaussian weights.
Curse of dimensionality

Next we discussed the performance of kNN in high-dimensional problems, which led to the curse of dimensionality. The basic intuition is the following: as we increase the dimensionality of the feature space for kNN, we need exponentially many more training examples to get reliable estimates of the class density throughout the feature space. There are several ways to see this point: one can imagine chunking the feature space into small hypercubes of fixed size, and see that the number of such cubes grows exponentially with the dimension of the feature space; alternatively, one can look at the ratio of the volume of a hypersphere to a hypercube in D dimensions, which rapidly approaches zero. The takeaway — that in high-dimensional spaces you have to go quite far to find points “near” you — is a bit tough to intuit given that we’re used to relatively low-dimensional spaces (e.g. 2 or 3). Nonetheless, this difficulty renders kNN relatively useless for high-dimensional problems such as document classification.
Naive Bayes

To address such problems, we discussed naive Bayes (NB) methods for classification. Although a relatively simple approach (and somewhat of a misnomer), NB is a surprisingly successful method for text classification in which documents are modeled as a set of independent words, and Bayes’ rule is used to calculate the probably that a document is spam given the words that appear in it.
As a brief review of the calculus of probabilities, we worked through a classic — and somewhat counter-intuitive — application of Bayes’ rule. The toy example we studied was one of a reliable medical test for a rare disease (Wiggins, 2006). We saw that the low false-positive rate on a large fraction of the population results in a large absolute number of false-positives, rendering the test results uncertain. We discussed a simple solution to the problem — specifying an arbitrary size for the population and working with absolute numbers — as well as the formal solution of “inverting” probabilities via Bayes’ rule.
To adapt the above to spam filtering (Horvitz, et. al., 1998; Graham, 2002), we merely replaced the patients’ health with the document class and the test results with the presence or absence of words in the document. “Training” then amounted to counting the the occurrence of words in documents of both the spam and legitimate mail (ham) classes.
P(spam|word) = P(word|spam) * P(spam) / P(word)
The “naive” assumption here is that we can treat words as occurring independently. While this is certainly untrue — e.g. the probability of the word “posterior” occurring in a document is increased if the word “Bayes” is known to be present in the document — NB often performs extremely well in practice. Key to the success of NB is its ability to scale to very high dimensions, such as the ~600 kilowords in the English dictionary, a direct result of treating features independently and thus avoiding the curse of dimensionality (which arises from considering a joint distribution over all features).
Application: Enron Email Data
As a somewhat primitive application of NB, we looked at classifying spam in the Enron email data set using a single word. This gave us an excuse to write our first Bash shell script, available on the github repository. Using grep (grep tutorial; regular expression cheatsheet) and wc, we counted the number of ham and spam messages in which a given word appears, and then implemented Bayes’ rule with bc to calculate the probability that a new message containing that word is spam. (If you’re feeling adventurous, check out NB a la awk.)
$ ./enron_naive_bayes.sh money
1500 spam examples
3672 ham examples
194 spam examples containing money
50 ham examples containing money
P(spam) = .2900
P(ham) = .7100
P(money|spam) = .1293
P(money|ham) = .0136
P(spam|money) = .7957
The above shows us that, from the given training sample, it’s relatively likely that a future email containing the word money is spam. Note, however, that this (frequentist) method for estimating probabilities can overfit, causing problems for rare words. For example, none of the spam messages in the training data contain the word “enron”, and so this naive version of NB predicts that any email containing “enron” must be legitimate mail. This leaves us susceptible to a simple attack on our spam filter in which rare words cause misclassifications — e.g. all a spammer must do to reach our inbox is include the name of our company in their message.
$ ./enron_naive_bayes.sh enron
1500 spam examples
3672 ham examples
0 spam examples containing enron
1478 ham examples containing enron
P(spam) = .2900
P(ham) = .7100
P(enron|spam) = 0
P(enron|ham) = .4025
P(spam|enron) = 0
A simple hack to fix this is to avoid zero counts by adding one to all word counts. This has the effect of replacing any zeros with small non-zero values, allowing for a finite probability that the message is spam. While this seems reasonable enough, in future lectures we’ll see that we can justify this modification as a “more Bayesian” version of NB by taking a Bayesian interpretation of probability.