Skip to content

Homework 02

14-Oct-09

Here’s the second homework, Homework 02.

The first problem is an implementation of k-nearest neighbors, applied to digit and spam classification. The second explores the curse of dimensionality via two simple simulations. The third is a number of exercises focusing on command line utilities (grep, wc, sort, uniq, tr).

See the included pdf file for a detailed description.

Lecture 03

13-Oct-09

Summary

Having had decent success in applying k-nearest neighbors (kNN) to hand-written digit recognition last week, we discussed several modifications and extensions of kNN. With text classification — e.g. spam filtering — as a motivating example, we then revisited some of the problems in applying kNN to high-dimensional problems, namely the curse of dimensionality. This led us to a simple alternative, naive Bayes for classification (Chapter 6, Segaran). In the latter part of class we explored some shell scripting to apply a primitive naive Bayes spam classifier to the Enron email data. Details follow.

spam_text

kNN: Modifications & Extensions

As we saw last week, kNN did a reasonably good job at hand-written digit recognition — we achieved roughly 96% accuracy in classifying a data set of ~7000 16×16 pixel images. One of the niceties of this data set is that there are a relatively large number of training samples and pixel values are constrained to lie on the same scale (between -1.0 and +1.0). However, in many classification problems one has a heterogeneous set of features that may not be in the same range, which can cause problems when using kNN — e.g. if we’re classifying individuals and one feature is height (in meters) while the other is weight (in kilograms), our distance metric will be much more sensitive to differences in weight. (Likewise, measuring weight in pounds would increase the scale on this dimension even further.)

A simple pre-processing step to deal with such issues is to perform feature standardization, transforming all features to mean-zero, variance-one. An alternative, and more expensive option, is to fit an optimal scaling for each feature, as Segaran discusses. With this approach, we introduce a scaling factor alpha for each feature and use an optimization method — e.g. simulated annealing, which we’ve not yet discussed — to search over possible scalings, maximizing accuracy on test data.

We also discussed weighted kNN, in which we weight training examples by their distance to the query point when calculating predicted output values. The idea here is that if we’re looking at our 5 nearest neighbors, 3 of which are quite close and 2 far away, we might want to consider the labels of the 3 close points differently than the 2 distant ones. Several popular choices for weighting functions are: a hard threshold (corresponding to a ball of a fixed radius around the query point), a linearly-decreasing threshold, inverse weights, and Gaussian weights.

Curse of dimensionality

Illustration of a sphere inscribed in a cube.

Next we discussed the performance of kNN in high-dimensional problems, which led to the curse of dimensionality. The basic intuition is the following: as we increase the dimensionality of the feature space for kNN, we need exponentially many more training examples to get reliable estimates of the class density throughout the feature space. There are several ways to see this point: one can imagine chunking the feature space into small hypercubes of fixed size, and see that the number of such cubes grows exponentially with the dimension of the feature space; alternatively, one can look at the ratio of the volume of a hypersphere to a hypercube in D dimensions, which rapidly approaches zero. The takeaway — that in high-dimensional spaces you have to go quite far to find points “near” you — is a bit tough to intuit given that we’re used to relatively low-dimensional spaces (e.g. 2 or 3). Nonetheless, this difficulty renders kNN relatively useless for high-dimensional problems such as document classification.

Naive Bayes

To address such problems, we discussed naive Bayes (NB) methods for classification. Although a relatively simple approach (and somewhat of a misnomer), NB is a surprisingly successful method for text classification in which documents are modeled as a set of independent words, and Bayes’ rule is used to calculate the probably that a document is spam given the words that appear in it.

As a brief review of the calculus of probabilities, we worked through a classic — and somewhat counter-intuitive — application of Bayes’ rule. The toy example we studied was one of a reliable medical test for a rare disease (Wiggins, 2006). We saw that the low false-positive rate on a large fraction of the population results in a large absolute number of false-positives, rendering the test results uncertain. We discussed a simple solution to the problem — specifying an arbitrary size for the population and working with absolute numbers — as well as the formal solution of “inverting” probabilities via Bayes’ rule.

To adapt the above to spam filtering (Horvitz, et. al., 1998; Graham, 2002), we merely replaced the patients’ health with the document class and the test results with the presence or absence of words in the document. “Training” then amounted to counting the the occurrence of words in documents of both the spam and legitimate mail (ham) classes.

P(spam|word) = P(word|spam) * P(spam) / P(word)

The “naive” assumption here is that we can treat words as occurring independently. While this is certainly untrue — e.g. the probability of the word “posterior” occurring in a document is increased if the word “Bayes” is known to be present in the document — NB often performs extremely well in practice. Key to the success of NB is its ability to scale to very high dimensions, such as the ~600 kilowords in the English dictionary, a direct result of treating features independently and thus avoiding the curse of dimensionality (which arises from considering a joint distribution over all features).

Application: Enron Email Data

As a somewhat primitive application of NB, we looked at classifying spam in the Enron email data set using a single word. This gave us an excuse to write our first Bash shell script, available on the github repository. Using grep (grep tutorial; regular expression cheatsheet) and wc, we counted the number of ham and spam messages in which a given word appears, and then implemented Bayes’ rule with bc to calculate the probability that a new message containing that word is spam. (If you’re feeling adventurous, check out NB a la awk.)

$ ./enron_naive_bayes.sh money
1500 spam examples
3672 ham examples
194 spam examples containing money
50 ham examples containing money
P(spam) = .2900
P(ham) = .7100
P(money|spam) = .1293
P(money|ham) = .0136
P(spam|money) = .7957

The above shows us that, from the given training sample, it’s relatively likely that a future email containing the word money is spam. Note, however, that this (frequentist) method for estimating probabilities can overfit, causing problems for rare words. For example, none of the spam messages in the training data contain the word “enron”, and so this naive version of NB predicts that any email containing “enron” must be legitimate mail. This leaves us susceptible to a simple attack on our spam filter in which rare words cause misclassifications — e.g. all a spammer must do to reach our inbox is include the name of our company in their message.

$ ./enron_naive_bayes.sh enron
1500 spam examples
3672 ham examples
0 spam examples containing enron
1478 ham examples containing enron
P(spam) = .2900
P(ham) = .7100
P(enron|spam) = 0
P(enron|ham) = .4025
P(spam|enron) = 0

A simple hack to fix this is to avoid zero counts by adding one to all word counts. This has the effect of replacing any zeros with small non-zero values, allowing for a finite probability that the message is spam. While this seems reasonable enough, in future lectures we’ll see that we can justify this modification as a “more Bayesian” version of NB by taking a Bayesian interpretation of probability.

Lecture 02

21-Sep-09

Summary

The first hour of our second lecture covered a brief review of least-squares regression with a linear model (from last week), discussed a simple adaptation of this to least-squares classification (Chapter 2, Hastie), and introduced k-nearest neighbors for classification (Chapters 2 & 13, Hastie; Chapter 8 Segaran; Section 2.5.2 Bishop).

As a motivating example, we look at the problem of recognizing handwritten digits, using the zip code data provided by Hastie, et. al.:

difficult digits

handwritten digits

In the second hour we covered a very basic introduction to the Bash shell, including simple navigation, reading/combining file contents, redirection & pipes, special characters & quotation/escaping, and just a bit of pattern matching with grep. I’ve posted a summary of the commands we covered to the course github repository.

The following gives a bit more detail on the methods we covered.

Linear Regression

In reviewing simple linear regression, I emphasized that while this method is quite simple it contains most of the important ingredients in the data-driven approach to modeling:

  1. Get data (skipped this for now)
  2. Represent data numerically (skipped as well, for now let’s say x and y values fell into our laps)
  3. Choose a model: here, a linear model (linear in the parameters)
  4. Choose a loss function: sum-of-squared loss
  5. Derive an optimization method: here we can optimize exactly, as we have a quadratic function of the unknown parameters (take derivative, set to zero)
  6. Assess performance: we looked at cross-validation to estimate generalization error, training on some percentage of the data and testing on the remaining data

Linear Classification

Next we adapted our least-squares solution to a classification setting, in which our output variable was categorical — say 0 or 1 — instead of real-valued, as in regression. Our results for parameter estimates carried through from regression, where labels are predicted by thresholding the predicted y value about 0.5. In examining this approach with synthetic two-class data from two-dimensional Gaussians, we found that a globally linear function is often overly restrictive. I briefly mentioned kernel methods — which we’ll discuss in future lectures — as a method of fitting non-linear decision boundaries via implicit mappings to transformed versions of the feature space.) We also saw that squared loss is quite an aggressive measure of classification performance.

linear least-squares classification on synthetic data

linear least-squares classification on synthetic data

k-Nearest Neighbors

Subsequently we examined the k-nearest neighbors (kNN) approach for classification, a nonparametric and highly local method. The idea behind kNN is simple enough: given training data (input vectors and their corresponding labels) and a query point (an input vector without a label), we predict the label by averaging the labels of the k closest training points to the query point. While there’s a long history to the analysis of and theory behind kNN (e.g. Fix and Hodges, 1951Cover and Hart, 1967), the intuition is that points that are “close” in the feature space have similar labels. Thus, as opposed to fitting a globally linear function, we fit a locally constant function.

While this simple technique is often a decent starting point for classification problems, there are a number of issues that arise. First, we need to store the entire training set, as we never obtain any compressed representation of the data (e.g. via parameters); this runs the risk of being expensive to store. In addition, evaluation on query points grows with the size of the training set: naively we need to calculate the distance from the query point to all training points, although we can do significantly better with more advanced data structures, e.g. kd-trees (see Andrew Moore’s tutorial references; implementations in MATLAB/C++, SciPy, R). This is an illustrative example of the impact of algorithms on practical prediction tasks: kd-trees scale logarithmically with the number of training points, whereas the naive method scales linearly.

With a large number of features, kNN also suffers from the curse of dimensionality (Bellman, 1961), wherein the number of training examples we need to get reasonable coverage of the feature space grows exponentially with dimension. Finally, selecting the number of neighbors (the value of k) should be done via cross-validation; too small a value of k results in overfitting while too large a value of k underfits.

k-nearest neighbors (k=1, k=20)

k-nearest neighbors (k=1, k=20)

Application: Digit Recognition

Nonetheless, we saw that kNN does a surprisingly reasonable job of classifying the handwritten digits, achieving over 96% accuracy out of the box. (Each 16×16 pixel digit is represented as a length-256 vector, where pixel value is between -1 and 1.) See the confusion matrix below comparing the predicted and actual digit values for ~7000 digits with a 2/3 training/test split.

0    1    2    3    4    5    6    7    8    9
402    0    1    1    0    7    3    0    1    1
0    313    0    0    3    0    0    0    0    0
1    0    207    2    3    1    0    0    1    0
0    0    4    201    1    7    0    0    6    1
0    0    1    1    225    0    0    3    0    0
0    0    1    1    0    175    1    1    3    0
1    0    1    0    0    3    216    0    3    0
0    1    4    0    0    0    0    217    2    5
0    0    0    4    1    0    0    0    173    0
0    0    0    0    7    2    0    0    2    209

There are a number of additional interesting notes on kNN, such as weighting functions and dimension scaling in Chapter 8 of Segaran. Chapter 13 of Hastie also discusses tangent distances, a metric which explicitly accounts for invariant transformations of image data (rotations, translations, etc.) when calculating neighbors.

Homework 01

14-Sep-09

The first homework, Homework 01, is posted.

The first exercise is to reproduce the demo shown in class, implementing cross-validation to find a best-fit polynomial. The second is a practical exercise to ensure that everyone has a working Bash shell and installation of Python. See the included pdf file for a detailed description.

Lecture 01

12-Sep-09

Our first lecture was this past Friday.

We began with a high-level overview and introduction to the course (slides). A few main points here. First, for many tasks — e.g. spam classification, image recognition, etc. — we (people) are able to learn relatively quickly and easily, such that we can generalize well after seeing few, loosely-structured examples. Second, it’s often the case that, while we learn from example well, we have difficulty in explicitly formalizing this procedure and the results, making it difficult to write programs that mimic our behaviors. The hope, however, is that we can devise methods to enable machines to systematically learn and generalize from observed data. The good news is that empirical results in many areas — from spam detection to face recognition to recommendation systems — show surprisingly encouraging results. The course will focus on some of these techniques, with an emphasis on practical aspects of obtaining, dealing with, and learning from real-world data.

Next, as a motivating example, we discussed simple regression with a linear model (as in, e.g., Chapter 2 of Hastie, et. al.). After deriving the solution, we saw a simple demo of fitting a polynomial and using cross-validation to select the model complexity (polynomial degree). We briefly discussed multivariate regression and arbitrary transformations of the input features.

We also looked at a quick example of basic access to a web API. After exploring the New York Times API Tool a bit, we looked at some simple Python code to programmatically access the API. The code is available on github. (Note: register for an NYT API key to use this code.)