## Lecture 06

In this lecture we looked at non-linear feature transformations to accommodate more complex decision boundaries, introduced regularization to avoid overfitting, and covered the kernel trick for learning non-linear predictors. See sections 1.2.5 and 3.1.4 of Bishop along with section 3.4 of Hastie for discussions of regularized least squares. Chapter 6 of Bishop covers kernel methods.

Notes on these topics:

• Whereas previously we considered predictors $$\hat{y} = w \cdot x$$ which were linear in both the unknown weights $$w$$ and the features $$x$$, the machinery of linear regression is unchanged under arbitrary feature transformations $$x \rightarrow \phi(x)$$, as the optimization problem remains linear in the weights. In principle the solution in the new feature space is easily obtained by replacing our matrix of examples $$X$$ with the matrix of transformed examples $$\Phi$$.
• In practice, however, such mappings often result in more features than examples, introducing the possibility of overfitting to noise in the training data. In the case of least-squares regression, for instance, this leaves us with uninvertible normal equations, as $$\Phi^T \Phi$$ is rank deficient in this setting. Regularization methods, which can often be interpreted as placing a prior over model parameters, mitigate this issue by adding a term to the loss function which balances model fidelity and complexity.
• L2 or ridge regression explicitly penalizes the squared norm of the weight vector to encourage “smooth” solutions:
$$\mathcal{L}(w) = {1 \over 2}\sum_i \left( y_i - w \cdot \phi(x_i) \right)^2 + {\lambda \over 2} ||w||^2,$$
where $$\lambda$$ is a tunable parameter usually selected by cross-validation in practice. Larger values for $$\lambda$$ place more emphasis on smoother solutions, while smaller values focus on fit to the training data.

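The feature-transformation idea above can be sketched in a few lines of NumPy; the quadratic data and coefficients here are invented for illustration, but the point is that ordinary least squares applies unchanged once we swap $$x$$ for $$\phi(x)$$:

```python
import numpy as np

# Hypothetical 1-D example: the map phi(x) = (1, x, x^2) lets a
# linear-in-the-weights model recover a quadratic relationship.
rng = np.random.default_rng(0)
x = np.linspace(-1, 1, 50)
y = 1.0 + 2.0 * x - 3.0 * x**2 + 0.01 * rng.standard_normal(50)

# Build the transformed design matrix Phi; the problem is still
# linear in the weights w, so ordinary least squares applies as-is.
Phi = np.column_stack([np.ones_like(x), x, x**2])
w_hat, *_ = np.linalg.lstsq(Phi, y, rcond=None)

print(np.round(w_hat, 2))  # close to the true coefficients (1, 2, -3)
```
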
• With this modified loss function, the solution to the normal equations becomes
$$\hat{w} = (\Phi^T \Phi + \lambda \mathbb{I}_K)^{-1} \Phi^T y,$$
where the addition of the K-by-K identity matrix $$\mathbb{I}_K$$ stabilizes the matrix inversion and shrinks weight estimates. Likewise, for gradient descent the regularization term has the effect of shrinking the weight vector on each update before stepping in the direction of the gradient:
$$\hat{w} \leftarrow (1-\eta\lambda)\hat{w} + \eta \Phi^T(y-\Phi \hat{w}).$$

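As a sanity check, the closed-form ridge solution and the shrink-then-step gradient update above should agree; a minimal sketch on synthetic data (dimensions, step size, and iteration count chosen arbitrarily for illustration):

```python
import numpy as np

rng = np.random.default_rng(1)
N, K = 30, 5
Phi = rng.standard_normal((N, K))
y = rng.standard_normal(N)
lam = 0.5

# Closed form: w = (Phi^T Phi + lam * I_K)^{-1} Phi^T y
w_closed = np.linalg.solve(Phi.T @ Phi + lam * np.eye(K), Phi.T @ y)

# Gradient descent with shrinkage on each update:
# w <- (1 - eta*lam) * w + eta * Phi^T (y - Phi w)
eta = 0.01
w = np.zeros(K)
for _ in range(5000):
    w = (1 - eta * lam) * w + eta * Phi.T @ (y - Phi @ w)

print(np.allclose(w, w_closed))  # the two solutions agree
```
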
• Even with regularization to prevent overfitting, it’s often quite expensive to explicitly construct and work with the matrix $$\Phi$$ for very high-dimensional feature spaces. Fortunately, for loss functions which depend only on dot products between examples, we can use the kernel trick as a shortcut to avoid this explicit mapping and directly compute dot products between mapped examples from their original feature vectors. That is, we can compute $$\phi(x_i) \cdot \phi(x_j)$$ directly from the $$x$$’s as $$K(x_i,x_j)$$ without ever explicitly mapping $$x \rightarrow \phi(x)$$.
• Many linear methods can be kernelized, and kernelized ridge regression is an instructive example to work through. Setting the derivative of the regularized loss with respect to $$w$$ to zero allows us to represent the weight vector as a linear combination of training examples: $$w = \sum_i a_i \phi(x_i),$$ where the dual variables are $$a_i = {1 \over \lambda} \left( y_i - w \cdot \phi(x_i) \right).$$ Substituting the first equation into the second and solving for $$a$$ yields
$$a = (K + \lambda \mathbb{I}_N)^{-1} y,$$
where $$K = \Phi \Phi^T$$ is the N-by-N Gram matrix of dot products between all pairs of examples. Using a kernel to compute the Gram matrix directly allows us to work in very high-dimensional implicit feature spaces, but we pay the price of needing to store and invert an N-by-N matrix, which can be expensive when we have many examples.

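For the linear kernel $$K = \Phi \Phi^T$$ (i.e., the identity feature map), the dual solution should reproduce the primal ridge solution exactly; a small sketch on invented data confirms this:

```python
import numpy as np

rng = np.random.default_rng(2)
N, D = 20, 3
X = rng.standard_normal((N, D))
y = rng.standard_normal(N)
lam = 0.1

# Dual: a = (K + lam * I_N)^{-1} y, then w = sum_i a_i x_i
K = X @ X.T  # linear-kernel Gram matrix
a = np.linalg.solve(K + lam * np.eye(N), y)
w_dual = X.T @ a

# Primal ridge: w = (X^T X + lam * I_D)^{-1} X^T y
w_primal = np.linalg.solve(X.T @ X + lam * np.eye(D), X.T @ y)

print(np.allclose(w_dual, w_primal))  # True
```

Note the trade-off made concrete here: the dual solve inverts an N-by-N matrix while the primal inverts a D-by-D one, so the dual only pays off when the (implicit) feature dimension exceeds the number of examples.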
• The kernel trick works for a relatively small subset of implicit feature spaces, and constructing valid kernels is, in general, a difficult task. Commonly used kernels include the polynomial, Gaussian, and exponential kernels, each with their own free parameters that are often tuned by cross-validation. For example, we can efficiently model all pairwise interactions between features using a second-degree polynomial kernel $$K(x,z) = (x \cdot z + c)^2$$.

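To see the trick in action, we can check that the second-degree polynomial kernel matches the dot product under an explicit feature map; the map below is written out for 2-D inputs purely for illustration:

```python
import numpy as np

def poly2_kernel(x, z, c=1.0):
    # K(x, z) = (x . z + c)^2, computed directly from the original vectors
    return (x @ z + c) ** 2

def phi(x, c=1.0):
    # Explicit second-degree map for 2-D input, for comparison only;
    # the sqrt factors make phi(x) . phi(z) expand to (x . z + c)^2.
    x1, x2 = x
    return np.array([x1**2, x2**2,
                     np.sqrt(2) * x1 * x2,
                     np.sqrt(2 * c) * x1,
                     np.sqrt(2 * c) * x2,
                     c])

x = np.array([1.0, 2.0])
z = np.array([3.0, -1.0])
print(poly2_kernel(x, z), phi(x) @ phi(z))  # both equal (1*3 + 2*(-1) + 1)^2 = 4.0
```

Note that the kernel costs one dot product regardless of input dimension, while the explicit map grows quadratically in the number of features.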
Interestingly, the dual formulation of ridge regression allows us to reinterpret this parametric model as a memory-based one, similar to k-nearest neighbors: as above, the weight vector $$w$$ is a weighted sum over all training examples $$\phi(x_i)$$, with the weights $$a_i$$ learned from the training data. Next week we’ll look at support vector machines as an example of sparse kernel methods, where many of the $$a_i$$ are zero, corresponding to a weighted sum over only a small subset of training examples (i.e., the support vectors).

In the second half of class we covered screen scraping for acquiring data not available through an API. Manually inspecting web pages with tools like Firebug or Chrome’s developer tools is a good way to get a general understanding of a page’s HTML structure. With sufficiently regular page structure, a combination of curl, grep, etc. often gets the job done. Even so, tools such as lxml, BeautifulSoup, and mechanize can greatly facilitate screen scraping, allowing you to do everything from parsing simple XML to filling out and submitting browser forms while dealing with cookies.
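
As a minimal sketch of the parsing step, here is the standard library’s ElementTree applied to an invented, well-formed snippet; real-world HTML is rarely this clean, which is exactly where lxml and BeautifulSoup earn their keep:

```python
import xml.etree.ElementTree as ET

# A hypothetical table fragment, as it might appear in a scraped page.
snippet = """
<table id="results">
  <tr><td class="name">Alice</td><td class="score">91</td></tr>
  <tr><td class="name">Bob</td><td class="score">84</td></tr>
</table>
"""

root = ET.fromstring(snippet)
# ElementTree supports a small XPath subset, enough to match on attributes.
rows = [(tr.find("td[@class='name']").text, int(tr.find("td[@class='score']").text))
        for tr in root.findall("tr")]
print(rows)  # [('Alice', 91), ('Bob', 84)]
```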