UML final

https://www.lesswrong.com/posts/GYRBWJKTm42haE5FL/uml-final

Multiclass Prediction

For this chapter, we’re back in the realm of supervised learning. Previously in this sequence, we’ve looked at binary classification (where |\mathcal{Y}| = 2) and regression (where \mathcal{Y} = \mathbb{R}). In multiclass prediction or multiclass categorization, we have |\mathcal{Y}| \in \mathbb{N} but |\mathcal{Y}| > 2. For simplicity, we take the next simplest case, i.e., |\mathcal{Y}| = 3. (Going up from there is straightforward.) More specifically, consider the problem of recognizing dogs and cats in pictures, which we can model by setting \mathcal{Y} = \{\text{dog}, \text{cat}, \text{no_animal}\}. For the purposes of this chapter, we ignore the difficult question of how to represent the instance space \mathcal{X}. We consider the following approaches:

Approach 1: Reduction to Binary Classification

Suppose (unrealistically) that \mathcal{X} = \mathbb{R}^2 and we are given a training sequence containing points of all three classes. We could now train a separate scoring function for each label. Starting with cats, we would like a function f_\text{cat} : \mathcal{X} \rightarrow \mathbb{R} that assigns each point a score based on how much it looks like a cat. For f_\text{cat}, the label \text{cat} corresponds to the label 1, whereas \text{dog} and \text{no_animal} both correspond to the label 0. In this setting, learning f_\text{cat} corresponds to finding a passable hyperplane (e.g., via Soft Support Vector Machines). Note that, while hyperplanes are used to classify points, they do, in fact, produce a numerical score based on how far on the respective side each point is.

Next, we train a function f_\text{dog} and, finally, a function f_\text{no_animal} in the same fashion. Then, we define

\hspace{90pt} f(x,\text{cat}) := f_\text{cat}(x) \hspace{5pt} \forall x \in \mathcal{X}

and likewise with \text{dog} and \text{no_animal}. As before, we let h_f(x) \in \text{argmax}_{y \in \mathcal{Y}}f(x,y).

Note that a point might be classified correctly even if it is on the wrong side of the respective hyperplane. For example, suppose f_\text{cat} thought a point was not a cat, but it thought so less intensely than f_\text{dog} thought that the point was not a dog. Then, depending on f_\text{no_animal}’s verdict, the \text{argmax} that h computes might still put out a cat label. Of course, the reverse is also true – a point might be classified as a cat by f_\text{cat} but also as a dog by f_\text{dog}, and if the dog classification is more confident, the \text{argmax} will pick that as the label.

The reduction approach is also possible with predictors that merely put out a binary yes/no. However, numerical scores are preferable, precisely because of cases like the above. If we had used classifiers only, we could have h_\text{cat}(x) = h_\text{dog}(x) = h_\text{no_animal}(x) = 0 for some point x, and then we could do no better than to guess. While the reduction approach can work, it won’t have the best performance in practice, so we look for alternatives.
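To make the scoring-and-argmax logic concrete, here is a minimal sketch in Python. The weight vectors are made up for illustration; in practice, each row would come from training a binary classifier for its label against the other two.

```python
import numpy as np

# One-vs-all prediction: score each label with its own linear function
# and output the argmax. The weight vectors below are made up; in
# practice each row would come from training f_cat, f_dog, f_no_animal.
labels = ["cat", "dog", "no_animal"]
W = np.array([
    [1.0, 0.0],    # hyperplane normal for f_cat
    [0.0, 1.0],    # hyperplane normal for f_dog
    [0.5, 0.5],    # hyperplane normal for f_no_animal
])

def predict(x):
    scores = W @ x  # f_y(x) = <w_y, x> for each label y
    return labels[int(np.argmax(scores))], scores

# All three scores are negative here (the point is on the "wrong" side
# of every hyperplane), yet the argmax still picks the least-bad label.
print(predict(np.array([-0.1, -0.4])))  # -> ('cat', [-0.1, -0.4, -0.25])
```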

Approach 2: Linear Programming

Using the fact that we are in the linear setting, we can rephrase our problem as follows: let \text{cat} = 0, \text{dog} = 1, \text{no_animal} = 2; then we wish to find three vectors \mathbf{w}_0, \mathbf{w}_1, \mathbf{w}_2 such that, for all training points (x,i), we have \langle \mathbf{w}_i, x\rangle > \langle \mathbf{w}_j, x \rangle for all j \neq i. In words: for each training point, the inner product with the vector measuring similarity to the correct label needs to be larger than the inner product with the other vectors (in this case, the two others). This can be phrased as a linear program in much the same way as for Support Vector Machines (post VIII). However, it only works if such vectors exist, and in practice, that is only going to be the case if the dimension is extremely large.
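As a sketch of how this feasibility problem can be handed to an off-the-shelf LP solver, here is a version with scipy and a made-up toy dataset. Each strict inequality is encoded as \langle \mathbf{w}_j - \mathbf{w}_i, x \rangle \le -1, which is equivalent up to rescaling the solution.

```python
import numpy as np
from scipy.optimize import linprog

# Toy training set in R^2 with labels 0 = cat, 1 = dog, 2 = no_animal.
X = np.array([[1.0, 0.1], [0.9, -0.1],      # cats
              [0.1, 1.0], [-0.1, 0.9],      # dogs
              [-1.0, -1.0], [-0.9, -1.1]])  # no_animal
y = np.array([0, 0, 1, 1, 2, 2])
d, k = X.shape[1], 3

# Variables: w = (w_0, w_1, w_2) stacked into one vector of length k*d.
# "<w_i, x> > <w_j, x>" becomes "<w_j - w_i, x> <= -1" (any positive
# margin works, since a feasible w can be rescaled).
rows, b = [], []
for x, i in zip(X, y):
    for j in range(k):
        if j == i:
            continue
        row = np.zeros(k * d)
        row[j*d:(j+1)*d] = x    # + <w_j, x>
        row[i*d:(i+1)*d] = -x   # - <w_i, x>
        rows.append(row)
        b.append(-1.0)

# Pure feasibility problem: zero objective, free (unbounded) variables.
res = linprog(c=np.zeros(k * d), A_ub=np.array(rows), b_ub=np.array(b),
              bounds=[(None, None)] * (k * d))
W = res.x.reshape(k, d)
print(res.status)                  # status 0 = a feasible solution found
print(np.argmax(X @ W.T, axis=1))  # should reproduce y
```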

Approach 3: Surrogate Loss and Stochastic Gradient Descent

A general trick I first mentioned in post V is to apply Stochastic Gradient Descent (post VI) with respect to a surrogate loss function (i.e., a loss function that is convex and upper-bounds the original loss). In more detail (but still only sketched), this means we pick a convex loss that upper-bounds the 0-1 loss – for multiclass problems, the multiclass hinge loss – and run Stochastic Gradient Descent on it.
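Here is a minimal sketch of this recipe, assuming the multiclass hinge loss as the surrogate; the Gaussian clusters and the step size are made up for illustration.

```python
import numpy as np

# SGD on the multiclass hinge loss, a convex surrogate that
# upper-bounds the 0-1 loss. Data and step size are made up.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(c, 0.3, size=(30, 2))
               for c in [(1, 0), (0, 1), (-1, -1)]])
y = np.repeat([0, 1, 2], 30)
k, d = 3, 2

W = np.zeros((k, d))
eta = 0.1
for t in range(1000):
    i = rng.integers(len(X))
    x, label = X[i], y[i]
    # Multiclass hinge: max over y' of  1[y' != y] + <w_{y'}, x> - <w_y, x>.
    scores = W @ x + (np.arange(k) != label)  # scores with the +1 penalty
    worst = int(np.argmax(scores))
    if worst != label:  # subgradient step (zero if the true label wins by 1)
        W[worst] -= eta * x
        W[label] += eta * x

print(np.mean(np.argmax(X @ W.T, axis=1) == y))  # training accuracy
```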

Remaining Material

Now we get to the stuff that I read at least partially but decided to cover in less detail.

Naive Bayes Classifier

Naive Bayes is a simple type of classifier that can be used for specific categorization tasks (supervised learning), both binary and multiclass classification, but not regression. The "naive" does not imply that Bayes was naive; rather, the classifier is naive (in a sense explained below) and uses Bayes’ rule.

We usually frame our goal in terms of wanting to learn a function h : \mathcal{X} \rightarrow \mathcal{Y}. However, we can alternatively ask to learn the conditional probability distribution \mathcal{D}((x,y) \hspace{2pt}|\hspace{2pt} x). If we know this probability – i.e., the probability of observing a (point, label) pair once we see the point – it is easy to predict new points. For the rest of this section, we’ll write "probability" as \Pr rather than \mathcal{D}, and we’ll write \Pr(y\hspace{2pt}|\hspace{2pt}x) rather than \Pr((x,y)\hspace{2pt}|\hspace{2pt}x).

Now. For any pair (x,y) \in \mathcal{X} \times \mathcal{Y}, we can express \Pr(y\hspace{2pt}|\hspace{2pt}x) in terms of \Pr(x \hspace{2pt}|\hspace{2pt} y), \Pr(x), and \Pr(y) using Bayes’ rule. This implies that knowing all the \Pr(x \hspace{2pt} | \hspace{2pt} y) is sufficient to estimate any \Pr(y\hspace{2pt}|\hspace{2pt}x) (the priors on x and y are generally not an issue). The difficulty is that \mathcal{X} is usually infinite and we only have access to some finite training sequence S. For this reason, Naive Bayes will only be applicable to certain types of problems.

One of them, which will be our running example, is *document classification* in the bag of words model. Here, \mathcal{X} = \{0,1\}^d, where d is the number of words in our dictionary, and each document is represented as a vector \mathbf{x} = (x_1, ..., x_d) \in \mathcal{X}, where bit x_i indicates whether or not the i-th word of our dictionary appears in [the document represented by \mathbf{x}]. Thus, we throw away order and repetitions but keep everything else. Furthermore, \mathcal{Y} is a set of possible topics, and the goal is to learn a predictor that sees a new document and assigns it a topic.

The crucial property of this example is that each coordinate of our vectors is very simple. On the other hand, the vectors themselves are not simple: we have d \approx 170,000 (number of words in the dictionary), which means we have about 2^{170000} parameters to learn (all the \Pr(\mathbf{x} \hspace{2pt} | \hspace{2pt} y)). The solution is the "naive" part: we assume that all feature coordinates are independent, so that \Pr((x_1, ..., x_d) \hspace{2pt} | \hspace{2pt} y) = \prod_{i = 1}^d \Pr(x_i \hspace{2pt} | \hspace{2pt} y). This assumption is, of course, incorrect: the probabilities for x_i = \text{"football"} and x_j = \text{"goal"} are certainly not independent. But an imperfect model can still make reasonable predictions.

Thus, Naive Bayes works by estimating the parameters \Pr(x_i \hspace{2pt} | \hspace{2pt} y) for all i \in [d] and y \in \mathcal{Y}. If \mathcal{Y} consists of 100 possible topics, then we wish to estimate 100 \cdot d \approx 17,000,000 parameters, which may be feasible. Given a labeled document collection (the training sequence S), estimating these parameters is now both simple and fast, which is the major strength of Naive Bayes. In fact, it is as fast as a learner can possibly be, in the sense that each training point only has to be looked at once. Note that \Pr(x_i \hspace{2pt} | \hspace{2pt} y) is the probability of encountering word x_i in a document about topic y.
For example, it could be the probability of encountering the word "Deipnophobia" (fear of dinner parties) in a document about sports. To estimate this probability, we read through all documents about sports in our collection and keep track of how many of them contain this term (spoilers: it’s zero). On that basis, we estimate that

\hspace{15pt}\Pr(\text{"Deipnophobia"} \hspace{2pt} | \hspace{2pt} \verb|sport|) = \frac{|\{\text{documents about sport with this word}\}| + 1}{|\{\text{documents about sport}\}| + 1} = \frac{1}{1532}

Why the +1? Well, as this example illustrates, the natural estimate will often be 0. However, a 0 as our estimate would mean that any document which contains this word has no chance of being classified as a sports document (see the equations below), which seems like overkill. Hence we "smooth" the estimate by the +1.

Small tangent: I think it’s worth thinking about why this problem occurs. Why isn’t the computed value our best guess? It’s not because there’s anything wrong with Bayesian updating – actually the opposite. While the output rule is Bayesian, the parameter estimation, as described above, is not. A Bayesian parameter estimation would have a prior on the value of \Pr(\text{"Deipnophobia"} \hspace{2pt} | \hspace{2pt} \verb|sport|), and this prior would be updated upon computing the above fraction. In particular, it would never go to 0. But, of course, the simple frequentist guess makes the model vastly more practical.

Now suppose we have all these parameters. Given an unknown document \mathbf{x}, we would like to output a topic y in the set \text{argmax}_{y \in \mathcal{Y}} \Pr(y \hspace{2pt} | \hspace{2pt} \mathbf{x}). For each y, we can write

\hspace{70pt} \Pr(y \hspace{2pt} | \hspace{2pt} \mathbf{x}) = \Pr(\mathbf{x} \hspace{2pt} | \hspace{2pt} y) \cdot \frac{\Pr(y)}{\Pr(\mathbf{x})} \propto \Pr(y) \Pr(\mathbf{x} \hspace{2pt} | \hspace{2pt} y)

where we got rid of the term \Pr(\mathbf{x}) because it won’t change the \text{argmax}: it doesn’t matter how likely it was to encounter our new document, and it will be the same no matter which topic we consider. Now the naivety kicks in: we will choose a topic in the set

\displaystyle \hspace{100pt} \text{argmax}_{y \in \mathcal{Y}} \Pr(y) \prod_{i \in [d] \,:\, x_i = 1} \Pr(x_i \hspace{2pt} | \hspace{2pt} y)

This is all stuff we can compute – we can estimate the prior \Pr(y) based on our document collection, and the remaining probabilities are precisely our parameters. In practice, one uses a slightly more sophisticated way to estimate the parameters (taking into account repetitions), one smoothes slightly differently (smoothing is the +1 step), and one maps everything into log space so that the values don’t get absurdly small and we can add log-probabilities rather than multiplying probabilities. Nonetheless, the important ideas are intact.
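A minimal sketch of the whole pipeline, with a made-up four-word dictionary and four training documents; the +1 smoothing and the log-space argmax mirror the text above.

```python
import numpy as np

# Naive Bayes for bag-of-words document classification. Documents are
# 0/1 vectors over a tiny made-up dictionary.
dictionary = ["goal", "football", "election", "deipnophobia"]
X = np.array([[1, 1, 0, 0],   # sports
              [1, 0, 0, 0],   # sports
              [0, 0, 1, 0],   # politics
              [0, 1, 1, 0]])  # politics
y = np.array([0, 0, 1, 1])    # 0 = sports, 1 = politics
k = 2

# Parameter estimation: one pass over the data, with +1 smoothing.
# theta[c, i] estimates Pr(word i appears | topic c).
prior = np.array([(y == c).mean() for c in range(k)])
theta = np.array([(X[y == c].sum(axis=0) + 1) / ((y == c).sum() + 1)
                  for c in range(k)])

def predict(x):
    # argmax_y log Pr(y) + sum over present words of log Pr(x_i = 1 | y);
    # sums of logs instead of products keep the values from underflowing.
    log_post = np.log(prior) + (x * np.log(theta)).sum(axis=1)
    return int(np.argmax(log_post))

print(predict(np.array([1, 1, 0, 0])))  # a goal+football document -> 0
```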

Feature Selection

Every learning problem can be divided into two steps, namely (1) transforming the raw data into a feature representation and (2) learning a predictor based on that representation. Feature selection is about the first step: choosing, out of a large set of candidate features, a small subset to actually use.

Regularization and Stability

This is a difficult chapter that I’ve read partially but skipped entirely in this sequence. The idea of regularization is as follows: suppose we can represent our hypotheses as vectors \mathbf{w}, i.e., \mathcal{H} = \mathbb{R}^d. Instead of minimizing the empirical loss \ell_S, we minimize the regularized empirical loss \ell^R_S defined as

\hspace{100pt} \ell_S^R(\mathbf{w}) := \ell_S(\mathbf{w}) + R(\mathbf{w})

where R : \mathcal{H} \rightarrow \mathbb{R} is a regularization function. A common choice is R(\mathbf{w}) = \lambda ||\mathbf{w}||^2 for some parameter \lambda \in \mathbb{R}_+. In that case, we wish to minimize a term of the form \ell_S(\mathbf{w}) + \lambda ||\mathbf{w}||^2. This approach is called Regularized Loss Minimization.

There are at least two reasons for doing this. One is that the presence of the regularization function makes the problem easier. The other is that, if the norm of a hypothesis is a measure of its complexity, choosing a predictor with a small norm might be desirable. In fact, you might recall the Structural Risk Minimization paradigm, where we defined a sequence \mathcal{H}_1, \mathcal{H}_2, ... of hypothesis classes such that \mathcal{H} = \bigcup_{i = 1}^\infty \mathcal{H}_i. If we let \mathcal{H}_i := \{\mathbf{w} \in \mathcal{H} \hspace{5pt} | \hspace{5pt} ||\mathbf{w}||^2 \le i\}, then both Structural Risk Minimization and Regularized Loss Minimization privilege predictors with smaller norm. The difference is merely that the tradeoff is continuous (rather than step-wise) in the case of Regularized Loss Minimization.

The problem

\hspace{120pt} \min_{\mathbf{w} \in \mathcal{H}} \Big[\ell^2_S(\mathbf{w}) + \lambda ||\mathbf{w}||^2\Big]

where \ell^2 is the squared loss can be solved with linear algebra – it comes down to inverting a matrix, where the \lambda ||\mathbf{w}||^2 term ensures that it is always invertible.
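As a sketch (with made-up data, and up to constant conventions in how \lambda is scaled), the closed-form solution for the squared loss looks like this:

```python
import numpy as np

# Regularized Loss Minimization with the squared loss (ridge regression).
# The lambda * I term makes the matrix strictly positive definite,
# hence always invertible.
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 3))
y = X @ np.array([1.0, -2.0, 0.5]) + 0.1 * rng.normal(size=50)

lam = 0.1
d = X.shape[1]
w = np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ y)
print(w)  # close to (1, -2, 0.5); a larger lam shrinks the norm of w
```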

Generative Models

Recall the general model for supervised learning: we have the domain set \mathcal{X}, the target set \mathcal{Y}, and the probability distribution \mathcal{D} over \mathcal{X} \times \mathcal{Y} that generates the labeled points. Our goal is to learn a predictor h : \mathcal{X} \rightarrow \mathcal{Y}. It took me reaching this chapter to realize that this goal is not entirely obvious. While the function h : \mathcal{X} \rightarrow \mathcal{Y} is generally going to be the most interesting thing, it is also conceivable to be interested in the full distribution \mathcal{D}. In particular, even if the best predictor h is known (that’s the predictor h^* given by h^*(x) \in \text{argmax}_{y \in \mathcal{Y}} \mathcal{D}((x,y) \hspace{5pt} | \hspace{5pt} x)), one still doesn’t know how likely each domain point is to appear. Differently put, the probability distribution over \mathcal{X} is information separate from the predictor; it’s neither necessary nor sufficient for finding one.

The idea of generative models is to attempt to estimate the distribution directly. A way to motivate this is by Tegmark et al.’s observation that the process that generates labeled points often has a vastly simpler description than the points themselves (chapter 2.3 of post X). This view suggests that estimating \mathcal{D} could actually be *easier* than learning the predictor h : \mathcal{X} \rightarrow \mathcal{Y}. (Although not literally easier, since learning \mathcal{D} implies learning the best predictor h.) The book has Naive Bayes as part of this chapter, even though it only involves estimating the \Pr(y \hspace{2pt} | \hspace{2pt} x), not the probability distribution over \mathcal{X}.
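A toy illustration of this point, with made-up numbers: given \mathcal{D} as a finite table, both the best predictor and the marginal over \mathcal{X} can be read off, but the former does not determine the latter.

```python
import numpy as np

# Made-up joint distribution D over a finite X (rows) and Y (columns);
# entries sum to 1.
D = np.array([[0.30, 0.05],
              [0.10, 0.25],
              [0.05, 0.25]])

# h*(x) = argmax_y D((x,y) | x). Conditioning on x only rescales a row,
# so the argmax of the joint row gives the same answer.
h_star = np.argmax(D, axis=1)

# The marginal over X is extra information that h* alone doesn't carry.
marginal_x = D.sum(axis=1)
print(h_star, marginal_x)  # [0 1 1] [0.35 0.35 0.3]
```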

Advanced Theory

The book is structured in four parts. The first is about the fundamentals, which I covered in posts I-III. The second and third are about learning models, which I largely covered in posts IV-XIV. If there is a meaningful distinction between parts two and three, it’s lost on me – they don’t seem to be increasing in difficulty, and I found both easier than part one. However, the fourth part, which is titled Advanced Theory, is noticeably harder – too difficult for me to cover at one post per week. Here’s a list of its chapters:

- Rademacher Complexities
- Covering Numbers
- Proof of the Fundamental Theorem of Learning Theory
- Multiclass Learnability
- Compression Bounds
- PAC-Bayes

Closing Thoughts on the Book

I’ve mentioned before that, while Understanding Machine Learning is the best resource on the subject I’ve seen so far, it feels weaker than various other textbooks I’ve tried from MIRI’s list. Part one is pretty good – it actually starts from the beginning, it’s actually rigorous, and it’s reasonably well explained. Parts two and three are mixed. Some chapters are good. Others are quite bad. Occasionally, I felt the familiar sense of annoyance-at-how-anyone-could-possibly-think-this-was-the-best-way-to-explain-this-concept that I get all the time when reading academic papers but very rarely with MIRI-recommended textbooks. It feels a bit like the authors identified a problem (no good textbooks that offer a comprehensive theoretical treatment of the field), worked on a solution, but lost inspiration somewhere along the way and ended up producing several chapters that feel rather uninspired. The one on neural networks is a good example. (Before bashing the book some more, I should note that the critically acclaimed Elements of Statistical Learning made me angry immediately when I tried to read it – point being, I appear to be unusually hard to please.)

The exercises are lacking throughout. I think the goal of exercises should be to help me internalize the concepts of the respective chapter, and they didn’t really do that. It seems that the motivation for them was often "let’s offload the proof of this lemma into the exercises to save space" or "here’s another interesting thing we didn’t get to cover, let’s make it an exercise." (There’s nothing wrong with offloading the proof of a lemma to the exercises per se – such exercises can be perfectly fine, but they’re not automatically fine.) Conversely, in Topology, the first exercises of each sub-chapter usually felt like "let’s make sure you understood the definitions" and the later ones like "let’s make sure you get some real practice with these concepts," which seems much better. Admittedly, it’s probably more challenging to design exercises for Machine Learning than for a purely mathematical topic. Nonetheless, I think they could be improved dramatically.

Next up (but not anytime soon) are \lambda-calculus and Category theory.

Comment

https://www.lesswrong.com/posts/GYRBWJKTm42haE5FL/uml-final?commentId=NFkfNQ7RqQR3n8PKn

I’ve been working through the textbook as well. I know I’ve commented a few times before, but, once again, thanks for writing up your thoughts for each chapter. They’ve been useful for summarizing / checking my own understanding.