This week we had our second guest lecture. Max Shron presented a live demo of using Google Transit data to analyze the effects of budget cuts on passenger wait times, adapting his original analysis for Chicago to more current New York City MTA data. Max highlighted several useful python modules, including csv.DictReader for easily parsing CSV files, doctest for simple unit testing, and itertools for efficient group-by operations. See the course github repository for the source code.

In the second half of class we discussed unsupervised learning. Specifically, we introduced k-means as a simple but effective clustering method, with applications in clustering image data. We then discussed Gaussian mixture models and expectation-maximization as a more flexible clustering framework. See the slides for more details on k-means (notes on GMMs coming soon).

In this lecture we discussed recommendation systems in general, and collaborative filtering in particular. With the Netflix Prize as a motivating example, we saw that simple memory-based methods (e.g., nearest-neighbors) are surprisingly effective in practice, although model-based methods (e.g., matrix factorization) often have a number of advantages in scalability and performance. See the slides for more details.

Here are details for the final project.

The main objective is to apply techniques we’ve discussed in class to real-world data. You may either use a pre-compiled data set or build your own data set (e.g., from APIs or public web sites); in either case please include full details of data acquisition and cleaning. Projects should follow the general paradigm presented in class, clearly stating the goal of the modeling problem along with details of model fitting and evaluation.

As usual, please submit executable code and an accompanying comprehensive report explaining the results.

We had our first guest lecture this week. John Myles White presented recent work on modeling data from functional MRI experiments to understand the relationship between various mental states and physical activity in the brain. For more details, see John’s slides and this review paper on “mind reading”.

We also looked at Amazon’s Web Services for cloud computing. In particular, we discussed S3 for distributed web storage and EC2 for rentable computing. We saw several ways to use these resources for processing large amounts of data, including Amazon’s Elastic MapReduce framework, which acts as a pay-as-you-go Hadoop solution. We concluded with a brief overview of Apache Pig, a high-level language that facilitates data processing by translating sequences of common operations (e.g., GROUP BY, FILTER, JOIN, etc.) to a series of MapReduce jobs.

After signing up for AWS, see the web services console to interact with Amazon’s various services. In addition to Amazon’s EC2 tutorial, Paul Stamatiou’s getting started guide for EC2 runs through using the EC2 API tools to start EC2 instances. Amazon provides a similar guide to using S3 for remote storage, as well as command line tools to transfer data to and from S3. Finally, see the Elastic MapReduce getting started guide for a quick and easy solution to running Hadoop jobs with EC2 and S3.

In this lecture we extended our toolbox of linear classification methods to include support vector machines (SVMs). We began with a unifying view of loss functions for classification, including squared, logistic, and misclassification loss. We then introduced the perceptron loss as a relaxation of misclassification error and discussed Rosenblatt’s perceptron algorithm. We concluded with an introduction to maximum margin classifiers and showed that the SVM optimization problem is equivalent to minimizing hinge loss with L2 regularization. See Chapters 4 and 7 of Bishop, as well as Section 4.5 and Chapter 12 of Hastie, for more detail.

Some notes:

- Coding our outcomes for binary classification as \( y_i \in

\{-1,1\} \), we can rewrite squared and log loss as

$$

\mathcal{L}_{\mathrm{squared}} = \sum_i (1 – y_i \hat{y_i})^2

$$

and

$$

\mathcal{L}_{\mathrm{log}} = – \sum_i \log (1 + e^{-y_i \hat{y_i}}),

$$

where \( \hat{y_i} \) is the predicted label.
- Misclassification loss, on the other hand, simply incurs a cost of 1 when the signs of predicted and actual labels differ, and no cost when they agree:

$$

\mathcal{L}_{\mathrm{misclass}} = \sum_i \Theta( -y_i \hat{y_i} ).

$$

While simple to calculate, misclassification error is, unfortunately, difficult to optimize due to the discontinuity at the origin. Comparing these, we see that squared and log loss penalize incorrect predictions more heavily than misclassification error, and squared loss heavily penalizes “obviously correct” answers where the predicted and actual labels are of the same sign, but the prediction is greater that 1.
- The perceptron loss can be viewed as a relaxation of misclassification error, where correctly classified examples aren’t penalized at all and the cost of misclassification is linear in the predicted label:

$$

\mathcal{L}_{\mathrm{percep}} =~ – \sum_{\{i | y_i \hat{y_i} < 0\}} y_i \hat{y_i}.

$$Following Rosenblatt’s original treatment of the perceptron leads to a simple stochastic gradient descent algorithm for minimizing this loss wherein we iteratively update our weight vector in the direction of misclassified examples. Unfortunately, even for linearly separable training data, the perceptron algorithm may converge slowly and the final solution depends on the order in which training examples are visited. Furthermore, the resulting boundary may lie close to training examples, increasing our risk of misclassifying future examples.
- We can avoid this last issue by instead searching for a maximum margin classifier—that is, we look for the boundary which is as far as possible from all training points. A bit of linear algebra reveals that we can phrase this minimization problem as:

$$

\mathrm{min}_{w, w_0} {1 \over 2} ||w||^2 \\

\textrm{subject to}~ y_i(w \cdot x_i + w_0) \ge 1

$$

for all points \(i\), where the constraint simply states that all points are correctly classified. Intuitively we can interpret this as searching for the orientation of the thickest wedge that fits between the positive and negative classes. For small-scale problems we can use standard quadratic programming solvers to find the optimal weight vector \( w \) and bias \( w_0 \) that satisfy the above.
- The dual formulation of the problem both provides further insight into the underlying loss function and structure of SVM solutions. Specifically, one can show that SVMs minimize the hinge loss with L2 regularization:

$$

\mathcal{L}_{\mathrm{svm}} =~ \sum_i [1 – y_i \hat{y_i}]_+ + \lambda ||w||^2,

$$

where \( [ \cdot ]_+ \) indicates the positive part of the argument. This corresponds to a shifted version of the perceptron loss, where instead of paying no cost when the predicted and actual values have the same sign, correct predictions are penalized slightly if they’re too close to the boundary. Appendix E of Bishop nicely details Lagrange multipliers for inequality constraints, along with the KKT conditions, which lead to the above interpretation.
- Finally, by invoking the representer theorem and decomposing the weight vector as a linear combination of training examples \( w = \Phi^T a \), we see that SVM solutions are typically sparse in this space due to the KKT conditions. That is, most of the \( a_i \) are zero, with the non-zero multipliers corresponding exactly to the “support vectors” that serve to define the boundary.

The figure below, plagiarized from Chapter 7 of Bishop, nicely summarizes the various loss functions discussed above. In short, there are many ways of quantifying what we mean by a “good” linear predictor, and different choices lead to different types of solutions. SVMs are particularly interesting as they have a number of practical and theoretical properties of interest, most of which we haven’t discussed here. For a more detail comparison of various loss functions and their relative performance, see also Ryan Rifkin’s thesis, “Everything Old is New Again: A Fresh Look at Historical Approaches in Machine Learning.

In the second half of class we covered an introduction to MapReduce and specifically Hadoop, an open source framework for efficient parallel data processing. See the slides, borrowed from a previous tutorial, below.

In this lecture we looked at non-linear feature transformations to accomodate more complex decision boundaries, introduced regularization to avoid overfitting, and covered the kernel trick for learning non-linear predictors. See sections 1.2.5 and 3.1.4 of Bishop along with section 3.4 of Hastie for discussions of regularized least squares. Chapter 6 of Bishop covers kernel methods.

Notes on these topics:

- Whereas previously we considered predictors \( \hat{y} = w \cdot x \) which were linear in both the unknown weights \( w \) and the features \( x \), the machinery of linear regression is unchanged under arbitrary feature transformations \(x \rightarrow \phi(x) \), as the optimization problem remains linear in the weights. In principle the solution in the new feature space is easily obtained by replacing our matrix of examples \( X \) with the matrix of transformed examples \( \Phi \).
- In practice, however, such mappings often result in more features than examples, introducing the possibility of overfitting to noise in the training data. In the case of least-squares regression, for instance, this leaves us with uninvertible normal equations, as \( \Phi^T \Phi \) is rank deficient in this setting. Regularization methods, which can often be interpreted as placing a prior over model parameters, mitigate this issue by adding a term to the loss function which balances model fidelity and complexity.
- L2 or ridge regression explicitly penalizes the squared norm of the weight vector to encourage “smooth” solutions:

$$

\mathcal{L}(w) = {1 \over 2}\sum_i \left( y_i – w \cdot \phi(x_i) \right)^2 + {\lambda \over 2} ||w||^2,

$$

where \( \lambda \) is a tunable parameter usually selected by cross-validation in practice. Larger values for \( \lambda \) place more emphasis on smoother solutions, while smaller values focus on fit to the training data.
- With this modified loss function, the solution to the normal equations becomes

$$

\hat{w} = (\Phi^T \Phi + \lambda \mathbb{I}_K)^{-1} \Phi^T y,

$$

where the addition of the K-by-K identity matrix \( \mathbb{I}_K \) stabilizes the matrix inversion and shrinks weight estimates. Likewise, for gradient descent the regularization term has the effect of shrinking of the weight vector on each update before stepping in the direction of the gradient:

$$

\hat{w} \leftarrow (1-\eta\lambda)\hat{w} + \eta \Phi^T(y-\Phi \hat{w}).

$$
- Even with regularization to prevent overfitting, it’s often quite expensive to explicitly construct and work with the matrix \( \Phi \) for very high-dimensional feature spaces. Fortunately, for loss functions which depend only on dot products between examples, we can use the kernel trick as a shortcut to avoid this explicit mapping and directly compute dot products between mapped examples from their original feature vectors. That is, we can compute \( \phi(x_i) \cdot \phi(x_j) \) directly from the \(x\)’s as \(K(x_i,x_j)\) without ever explicitly mapping \( x \rightarrow \phi(x) \).
- Many linear methods can be kernelized, with kernelized ridge regression being an instructive example to work through. Differentiating the regularized loss with respect to \(w\) allows us to represent the weight vector as a linear combination of training examples: \(

w = \sum_i a_i \phi(x_i), \) where the dual variables are \( a_i = {1 \over \lambda} \left( y_i – w \cdot \phi(x_i) \right). \) Substituting the first equation in the second and solving for \(a\) yields

$$

a = (K + \lambda \mathbb{I}_N)^{-1} y,

$$

where \(K = \Phi \Phi^T\) is the N-by-N Gram matrix of dot products between all pairs of examples. Using a kernel to directly compute the gram matrix allows us to work in very high-dimensional implicit feature spaces, but we pay the price of needing to store and invert an N-by-N matrix, which can be expensive when we have many examples.
- The kernel trick works for a relatively small subset of implicit feature spaces, and constructing valid kernels is, in general, a difficult task. Commonly used kernels include the polynomial, Gaussian, and exponential kernels, each with their own free parameters that are often tuned by cross-validation. For example, we can efficiently model all pairwise interactions between features using a second degree polynomial kernel \(K(x,z) = (x \cdot z + c)^2 \).

Interestingly, the dual formulation of ridge regression allows us to reinterpret this parametric model as a memory-based one, similar to k-nearest neighbors. As above, to determine \(w\), we take a weighted sum over *all* examples \( \phi(x_i) \) where we’ve *learned* the weights \(a_i\) from the training data. Next week we’ll look at support vector machines as an example of *sparse* kernel methods, where many of the \(a_i\) are zero, corresponding to a weighted sum over a *small subset* of training examples (i.e., the support vectors).

In the second half of class we covered screen scraping for acquiring data not available through an API.^{1} Manually inspecting web pages with tools like Firebug or Chrome’s developer tools is a good way to get a general understanding of a page’s html structure. With sufficiently regular page structure, a combination of curl, grep, etc. often gets the job done. Even so, tools such as lxml, BeautifulSoup, and mechanize can greatly facilitate screen scraping, and allow you to do everything from parse simple xml to fill out and submit browser forms while dealing with cookies.

[1] Disclaimer: always consult a site’s terms of service before screen scraping, of course.

The second homework is posted.

The first problem is an exercise in using cross-validation to select the best-fit polynomial for some synthetic data where both the degree and coefficients are unknown. The second problem is an application of naive Bayes for multiclass text classification in which you’re asked to train a classifier that, given the text of an article from the New York Times, predicts the section to which the article belongs.

You’ll need the polyfit.tsv data file for the first problem and the stopwords.txt file for the second.

In this lecture we studied maximum likelihood inference for linear classifiers. We saw that ordinary least squares regression can be phrased as maximum likelihood inference under the assumption of additive Gaussian noise. We then derived the closed-form solution to the normal equations for small-scale problems and discussed alternative optimization methods, namely gradient descent, for larger-scale settings. We noted several potential issues in directly applying linear regression to classification problems and explored logistic regression as an alternative. Maximum likelihood inference for logistic regression led us to first- and second-order gradient descent methods for parameter inference. See Chapter 3 of Bishop and Chapter 3 of Hastie for reference.

A few notes on linear and logistic regression:

- Ordinary least squares regression is, in principle, easily solved by inverting the normal equations:

$$

\hat{w} = (X^T X)^{-1} X^T y .

$$

In practice, however, it often computationally expensive to invert \( X^T X \) for models with many features, even with specialized numerical methods for doing so.
- Gradient descent offers an alternative solution to the normal equations, replacing potentially expensive matrix inversion with an iterative method where we update parameters by moving in the direction of steepest increase of the likelihood landscape:

$$

\hat{w} \leftarrow \hat{w} + \eta X^T (y – X\hat{w}),

$$

where \( \eta \) is a tunable step size. Choosing \( \eta \) too small leads to slow convergence, whereas too large a step size may result in undesirable oscillations about local optima. Intuitively, gradient descent updates each component of \( \hat{w} \) by a sum of the corresponding feature values over all examples, where examples are weighted by the error between actual and predicted labels.
- Two high-level issues arise when directly applying ordinary least squares regression to classification problems. First, our model predicts continuous outputs while our training labels are discrete. Second, squared loss is sensitive to outliers and penalizes “obviously correct” predictions for which the predicted value \( \hat{y} \) is much larger than the observed value \( y \) but of the correct sign.
- Logistic regression addresses these issues by modeling the class-conditional probabilities directly, using a logistic function to transform predictions to lie in the unit interval:

$$

p(y=1|x, w) = {1 \over 1 + e^{-w \cdot x}}

$$

While maximum likelihood inference for logistic regression does not permit a closed-form solution, gradient descent results in the following update equations:

$$

\hat{w} \leftarrow \hat{w} + \eta X^T (y – p).

$$

In smaller-scale settings one can improve on these updates by using second-order methods such as Newton-Raphson that leverage the local curvature of the likelihood landscape to determine the step size at each iteration.

Naive Bayes, linear regression, and logistic regression are all generalized linear models, meaning that predictions \( \hat{y} \) are simply transformations of a weighted sum of the features \( x \) for some weights \( w \), i.e. \( \hat{y}(x; w) = f(w \cdot x) \). Linear regression models the response directly, whereas Naive Bayes and logistic regression model the probability of categorical outcomes via the logistic link function. Model fitting amounts to inferring the “best” values for the weights \( w \) from training data; each of these methods quantify the notion of “best” differently and thus results in different estimates for \( w \). In particular, Naive Bayes estimates each component of the weight vector independently, while linear and logistic regression account for correlations amongst features.

In the second half of class we discussed APIs for accessing data from web services. As an example, we used Python’s urllib2 and json modules to interact with the New York Times Developer API. See the github repository for more details.

Our previous discussion of naive Bayes led us to the problem of overfitting, specifically in dealing with rare words for text classification. We investigated this problem a bit more formally in the context of probabilistic modeling and discussed maximum likelihood, maximum a posteriori, and Bayesian methods for parameter estimation. With a Bernoulli model for word presence in mind, we looked at the toy problem of estimating the bias of a coin from observed flips. We saw that general principles from probabilistic modeling reproduced our intuitive parameter estimates from last week, and, furthermore, justified the hack of smoothing counts to avoid overfitting. See Chapter 2 of Bishop for more details.

Some notes on probabilistic inference:

- Under maximum likelihood inference we define the “best” parameter values as those for which the observed data are most probable:

$$

\hat{\theta}_{ML} = \mathrm{argmax}_{\theta} ~ p(D | \theta).

$$

Employing this framework for the coin flipping example reproduces the intuitive relative frequency estimate from last week: \(\hat{\theta}_{ML} = {n \over N} \). Unfortunately, as we saw for spam classification, maximum-likelihood estimates can often lead to overfitting (e.g., when n=0 or n=N).
- In a subtle but important distinction, maximum a posteriori inference formulates the problem as one of identifying the most probable parameters given the observed data:

$$

\hat{\theta}_{MAP} = \mathrm{argmax}_{\theta} ~ p(\theta | D) = \mathrm{argmax}_{\theta} ~ p(D | \theta) p(\theta),

$$

where the right-hand side follows from an application of Bayes’ rule. Using a conjugate prior beta distribution \( \mathrm{Beta}(\theta; \alpha, \beta) \) for the coin flipping example reproduces the “smoothed” estimate: \(\hat{\theta}_{MAP} = {n + \alpha – 1 \over N + \alpha + \beta – 2} \), where \( \alpha \) and \( \beta \) act as pseudocounts for the number of heads and tails seen before any actual data are observed. This allows us to address the overfitting problem by specifying the shape of the prior distribution via \( \alpha \) and \( \beta \). Setting \( \alpha \) and \( \beta \) to 1 corresponds to a uniform prior distribution and yields the maximum likelihood estimate above, while larger values of \( \alpha \) and \( \beta \) bias our estimates more heavily towards our prior distribution.
- Bayesian inference dispenses with the idea of point estimates for the parameters all together in favor of keeping full distributions over all unknown quantities:

$$

p(\theta | D) = {p(D | \theta) p(\theta) \over p(D)}

$$

In contrast to MAP estimation, this requires calculation of the normalizing constant \( p(D) \), often referred to as the marginal likelihood or evidence. Fortunately the choice of a conjugate prior distribution reduces the potentially difficult task of calculating the evidence to simple algebra. For example, the posterior distribution for the coin flipping example with a beta prior is simply a beta distribution with updated hyperparameters, \( \mathrm{Beta}(\theta; n + \alpha, N – n + \alpha + \beta) \). Likewise, the predictive distribution \( p(x|D) \), which provides the probability of future outcomes given the observed data, amounts to a simple ratio of beta distributions.

Returning to naive Bayes for text classification, any of the above methods may be used to estimate the many (independent) parameter values in our model. Maximum likelihood estimation has no tunable hyperparameters, whereas MAP estimation and Bayesian inference require the specification of a prior distribution that, in practice, is often tuned for optimal generalization error. Regardless of which method we choose, however, naive Bayes still models features independently. Next we’ll extend our discussion of linear methods for classification to linear and logistic regression, which account for correlations amongst model features.

During the second part of class we looked at a python script to fetch email data from an imap server as well as some shell scripting to parse raw email data, and concluded with an introduction to Python.

With document classification (e.g., spam filtering) as a motivating example, we reviewed issues with kNN for high-dimensional classification problems—namely the curse of dimensionality—and explored linear methods as an alternative. We first reviewed Bayes’ rule for inverting conditional probabilities via a simple, but perhaps counterintuitive, medical diagnosis example and then adapted this to an (extremely naive) one-word spam classifier. We improved upon this by considering all words present in a document and arrived at naive Bayes—a simple linear method for classification in which we model each word occurence *independently* and use Bayes’ rule to calculate the probability the document belongs to each class given the words it contains. Chapter 6 of Segaran covers related material and additional references are provided below. Historical references include Horvitz, et. al., 1998 and Graham, 2002.

Some notes on naive Bayes:

- Naive Bayes is in some ways the extreme opposite of kNN—whereas kNN estimates a piecewise constant density from the training examples, naive Bayes provides a globally linear estimate where all weights are estimated independently. Fitting to this constrained parametric form allows us learn efficiently in high-dimensional feature spaces (at the risk of assuming an overly simple model of the data).
- For simplicity we looked at a Bernoulli model of word occurrence in documents. That is, we represented each document by a binary vector over all words and modeled the probability of each word occurring in a document as an independent coin flip. There are, of course, many other document representations one might consider, along with their corresponding likelihood models. Choosing Which Naive Bayes to use is largely an empirical matter.
- Training a naive Bayes classifier is computationally cheap and scalable. Using a frequentist estimate for the Bernoulli model, this amounts to simply counting the number of documents of each class containing each word in our dictionary. It’s also easy to update the model upon receiving new training data by simply incrementing these counts.
- Likewise, making predictions with a trained naive Bayes model is also computationally cheap, although this perhaps isn’t immediately obvious. With a simple manipulation of the posterior class probabilities, we saw that the log-odds are linear and sparse in the document features. Details are in the slides as well as David Lewis’ review of Naive Bayes at Forty.
- The name “naive Bayes” is a bit of misnomer. While the independence assumption is certainly untrue, in practice naive Bayes works reasonably well and turns out be Not So Stupid After All. Likewise, there’s nothing necessarily Bayesian about simply using Bayes’ rule—that is, there’s no point at which one must take a subjective view of probabilities to employ naive Bayes.
- This being said, simple frequentist estimates of model parameters can often result in overfitting. For example, in the Enron email data we looked at, there aren’t any spam messages containing the company’s name. Thus, a simple frequentist estimate might conclude that it’s impossible for the word “Enron” to occur in future spam, resulting in an easily gamed classifier. Smoothing estimates by adding pseudocounts to observed word occurrences (e.g., pretending each word occurred at least once in each document class) is one potential solution to this problem.

Next week we’ll see that one way to justify this last point is by considering maximum a posteriori inference for model parameters as an extension of usual maximum likelihood estimates.

In the latter part of class we looked at a Bash shell script for classifying spam in the Enron email data set using a single word. We used grep (grep tutorial; regular expression cheatsheet) and wc, to count the number of ham and spam messages in which a given word appears, and then implemented Bayes’ rule with bc to calculate the probability that a new message containing that word is spam. The full script is available on github. We concluded with an introduction to awk for processing delimited text files. See some famous awk one-liners and their explanations for reference. And if you’re feeling adventurous, see this implementation of naive Bayes in awk.