diff --git a/book/bias.tex b/book/bias.tex index ab410c0..112a09f 100644 --- a/book/bias.tex +++ b/book/bias.tex @@ -3,7 +3,7 @@ \chapter{Bias and Fairness} \label{sec:bias} \chapterquote{Science and everyday life cannot\\and should not be separated.}{Rosalind~Franklin} \begin{learningobjectives} -\item +\item \end{learningobjectives} \dependencies{\chref{sec:dt},\chref{sec:knn},\chref{sec:perc},\chref{sec:prac}} @@ -83,7 +83,7 @@ \section{Unsupervised Adaptation} All examples are drawn according to some fixed base distribution $\Dbase$. Some of these are selected to go into the new distribution, and some of them are selected to go into the old distribution. The mechanism for deciding which ones are kept and which are thrown out is governed by a \emph{selection variable}, which we call $s$. -The choice of selection-or-not, $s$, is based \emph{only} on the input example $\vx$ and not on it's label.\thinkaboutit{What could go wrong if $s$ got to look at the label, too?} +The choice of selection-or-not, $s$, is based \emph{only} on the input example $\vx$ and not on its label.\thinkaboutit{What could go wrong if $s$ got to look at the label, too?} In particular, we define: ~ \begin{align} @@ -171,7 +171,7 @@ \section{Supervised Adaptation} {bias:easyadapt}% {\FUN{EasyAdapt}(\VAR{$\langle (\vxold_n,\yold_n) \rangle_{n=1}^N$}, \VAR{$\langle (\vxnew_m, \ynew_m) \rangle_{m=1}^M$}, \VAR{$\cA$})} { - \SETST{$D$}{$\left\langle ( \langle \VARm{\vxold_n}, \VARm{\vxold_n}, \vec 0 \rangle, \VARm{\yold_n} ) \right\rangle_{\VARm{n}=1}^{\VARm{N}} + \SETST{$D$}{$\left\langle ( \langle \VARm{\vxold_n}, \VARm{\vxold_n}, \vec 0 \rangle, \VARm{\yold_n} ) \right\rangle_{\VARm{n}=1}^{\VARm{N}} \bigcup \left\langle ( \langle \VARm{\vxnew_m}, \vec 0, \VARm{\vxnew_m} \rangle, \VARm{\ynew_m} ) \right\rangle_{\VARm{m}=1}^{\VARm{M}} $} \COMMENT{union} \\ \COMMENT{of transformed data} @@ -183,7 +183,7 @@ \section{Supervised Adaptation} Although this approach is general, it is most effective when 
the two distributions are ``not too close but not too far'': \begin{itemize} \item If the distributions are too far, and there's little information to share, you're probably better off throwing out the old distribution data and training just on the (untransformed) new distribution data. -\item If the distributions are too close, then you might as well just take the union of the (untransformed) old and new distribution data, and training on that. +\item If the distributions are too close, then you might as well just take the union of the (untransformed) old and new distribution data, and train on that. \end{itemize} In general, the interplay between how far the distributions are and how much new distribution data you have is complex, and you should always try ``old only'' and ``new only'' and ``simple union'' as baselines. @@ -212,8 +212,8 @@ \section{Fairness and Data Bias} Informally, the 80\% rule says that your rate of hiring women (for instance) must be at least 80\% of your rate of hiring men. Formally, the rule states: \begin{align} - \Pr(y = +1 \| \text{G} \neq \text{male}) -& \geq 0.8 ~\times~ \Pr(y = +1 \| \text{G} = \text{male}) + \Pr(y = +1 \| \text{G} \neq \text{male}) +& \geq 0.8 ~\times~ \Pr(y = +1 \| \text{G} = \text{male}) \end{align} Of course, gender/male can be replaced with any other protected attribute. @@ -243,7 +243,7 @@ \section{How Badly can it Go?} % The question is: how badly can $f$ do on the new distribution? -We can calculate this directly. +We can calculate this directly. % \begin{align} & \ep\xth{new} \nonumber \\ @@ -283,7 +283,7 @@ \section{How Badly can it Go?} The core idea is that if we're learning a function $f$ from some hypothesis class $\cF$, and this hypothesis class isn't rich enough to peek at the 29th decimal digit of feature 1, then perhaps things are not as bad as they could be. This motivates the idea of looking at a measure of distance between probability distributions that \emph{depends on the hypothesis class}. 
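Stepping back to the supervised-adaptation algorithm for a moment: the EasyAdapt transformation is only a few lines of code. The sketch below (function and variable names are my own, not from the text) builds the union of transformed data, mapping each old-distribution example $\vx$ to $\langle \vx, \vx, \vec 0\rangle$ and each new-distribution example to $\langle \vx, \vec 0, \vx\rangle$:

```python
import numpy as np

def easy_adapt(X_old, X_new):
    """EasyAdapt augmentation: old examples become <x, x, 0>,
    new examples become <x, 0, x> (shared / old-only / new-only blocks)."""
    d = X_old.shape[1]
    zeros_old = np.zeros((X_old.shape[0], d))
    zeros_new = np.zeros((X_new.shape[0], d))
    aug_old = np.hstack([X_old, X_old, zeros_old])
    aug_new = np.hstack([X_new, zeros_new, X_new])
    return np.vstack([aug_old, aug_new])  # union of transformed data
```

Any off-the-shelf binary classifier $\cA$ can then be trained on this augmented matrix together with the concatenated old and new labels.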
A popular measure is the \concept{$d_\cA$-distance} or the \concept{discrepancy}. -The discrepancy measure distances between probability distributions based on how much two function $f$ and $f'$ in the hypothesis class can disagree on their labels. +The discrepancy measures distances between probability distributions based on how much two functions $f$ and $f'$ in the hypothesis class can disagree on their labels. Let: % \begin{align} @@ -291,7 +291,7 @@ \section{How Badly can it Go?} &= \Ep_{\vx \sim P} \Big[ \Ind[ f(\vx) \neq f'(\vx) ] \Big] \end{align} % -You can think of $\ep_P(f,f')$ as the \emph{error} of $f'$ when the ground truth is given by $f$, where the error is taken with repsect to examples drawn from $P$. +You can think of $\ep_P(f,f')$ as the \emph{error} of $f'$ when the ground truth is given by $f$, where the error is taken with respect to examples drawn from $P$. Given a hypothesis class $\cF$, the discrepancy between $P$ and $Q$ is defined as: % \begin{align} @@ -304,7 +304,7 @@ \section{How Badly can it Go?} One very attractive property of the discrepancy is that you can estimate it from finite \emph{unlabeled} samples from $\Dold$ and $\Dnew$. Although not obvious at first, the discrepancy is very closely related to a quantity we saw earlier in unsupervised adaptation: a classifier that distinguishes between $\Dold$ and $\Dnew$. -In fact, the discrepancy is precisely twice the \emph{accuracy} of the best classifier from $\cH$ at separating $\Dold$ from $\Dnew$. +In fact, the discrepancy is precisely twice the \emph{accuracy} of the best classifier from $\cF$ at separating $\Dold$ from $\Dnew$. How does this work in practice? Exactly as in the section on unsupervised adaptation, we train a classifier to distinguish between $\Dold$ and $\Dnew$. 
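This proxy-classifier procedure can be sketched directly (a minimal sketch; the names are assumed, and a nearest-class-mean rule stands in illustratively for "the best classifier from the hypothesis class"). The chapter then converts this separation accuracy into the discrepancy estimate:

```python
import numpy as np

def domain_separation_accuracy(X_old, X_new):
    """Train a distinguisher between samples from the old and new
    distributions and report its accuracy; the text relates this
    accuracy to the discrepancy d_A. Here the distinguisher is a
    simple nearest-class-mean rule (an illustrative choice)."""
    mu_old = X_old.mean(axis=0)
    mu_new = X_new.mean(axis=0)

    def looks_new(X):  # True = "closer to the new-distribution mean"
        return (np.linalg.norm(X - mu_new, axis=1)
                < np.linalg.norm(X - mu_old, axis=1))

    correct = (~looks_new(X_old)).sum() + looks_new(X_new).sum()
    return correct / (len(X_old) + len(X_new))
```

When the two samples are indistinguishable, the distinguisher can do no better than chance (accuracy one half); when they are easily separable, its accuracy approaches one, signalling a large discrepancy.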
@@ -324,7 +324,7 @@ \section{How Badly can it Go?} % \begin{align} \underbrace{\ep\xth{new}(f)}_{\textrm{error on } \Dnew} - &\leq + &\leq \underbrace{\ep\xth{old}(f)}_{\textrm{error on } \Dold} + \underbrace{\ep\xth{best}}_{\textrm{minimal avg error}} + \underbrace{d_\cA(\Dold,\Dnew)}_{\textrm{distance}} @@ -342,7 +342,7 @@ \section{Further Reading} TODO further reading -%%% Local Variables: +%%% Local Variables: %%% mode: latex %%% TeX-master: "courseml" -%%% End: +%%% End: diff --git a/book/complex.tex b/book/complex.tex index f8507d9..b5d1ec5 100644 --- a/book/complex.tex +++ b/book/complex.tex @@ -182,8 +182,8 @@ \section{Learning with Imbalanced Data} \label{sec:imbalanced} that distribution. We will compute the expected error $\ep^w$ of $f$ on the weighted problem: \begin{align} - \ep^w - &= \Ep_{(\vx,y) \sim \cD^w} + \ep^w + &= \Ep_{(\vx,y) \sim \cD^w} \Big[ \al^{y=1} \big[f(\vx) \neq y\big] \Big] \\ &= \sum_{\vx \in \cX} \sum_{y \in \pm 1} \cD^w(\vx,y) \al^{y=1} \big[f(\vx) \neq y\big] \\ @@ -312,7 +312,7 @@ \section{Multiclass Classification} Algorithms~\ref{alg:complex:ovatrain} and \ref{alg:complex:ovatest}. In the testing procedure, the prediction of the $i$th classifier is added to the overall score for class $i$. Thus, if the prediction is -positive, class $i$ gets a vote; if the prdiction is negative, +positive, class $i$ gets a vote; if the prediction is negative, everyone else (implicitly) gets a vote. (In fact, if your learning algorithm can output a confidence, as discussed in Section~\ref{}, you can often do better by using the confidence as $y$, rather than a @@ -533,7 +533,7 @@ \section{Ranking} a large number of documents, somehow assimilating the preference function into an overall permutation. -For notationally simplicity, let $\vx_{nij}$ denote the features +For notational simplicity, let $\vx_{nij}$ denote the features associated with comparing document $i$ to document $j$ on query $n$. Training is fairly straightforward. 
For every $n$ and every pair $i \neq j$, we will create a binary classification example based on @@ -603,7 +603,7 @@ \section{Ranking} Second, rather than producing a list of scores and then calling an arbitrary sorting algorithm, you can actually use the preference function as the sorting function inside your own implementation of -quicksort. +quicksort. We can now formalize the problem. Define a ranking as a function $\si$ that maps the objects we are ranking (documents) to the desired @@ -825,7 +825,7 @@ \section{Further Reading} % \learningproblem{Collective Classification}{ % \item An input space $\cX$ and number of classes $K$ % \item An unknown distribution $\cD$ over $\cG(\cX\times[K])$ -% }{A function $f : \cG(\cX) \fto \cG([K])$ minimizing: +% }{A function $f : \cG(\cX) \fto \cG([K])$ minimizing: % $\Ep_{(V,E) \sim \cD} \left[ % \sum_{v \in V} \big[ \hat y_v \neq y_v \big] % \right]$, where $y_v$ is the label associated with vertex $v$ in $G$ @@ -950,7 +950,7 @@ \section{Further Reading} % ensure that your predictions at the $k$th layer are indicative of how % well the algorithm will actually do at test time. 
-%%% Local Variables: +%%% Local Variables: %%% mode: latex %%% TeX-master: "courseml" -%%% End: +%%% End: diff --git a/book/courseml.lot b/book/courseml.lot deleted file mode 100644 index 2c628b5..0000000 --- a/book/courseml.lot +++ /dev/null @@ -1,29 +0,0 @@ -\addvspace {10\p@ } -\addvspace {10\p@ } -\addvspace {10\p@ } -\addvspace {10\p@ } -\addvspace {10\p@ } -\addvspace {10\p@ } -\contentsline {table}{\numberline {5.1}{\ignorespaces Table of f-measures when varying precision and recall values.}}{69}{table.5.1} -\contentsline {table}{\numberline {5.2}{\ignorespaces Table of significance values for the t-test.}}{74}{table.5.2} -\addvspace {10\p@ } -\addvspace {10\p@ } -\addvspace {10\p@ } -\addvspace {10\p@ } -\addvspace {10\p@ } -\contentsline {table}{\numberline {10.1}{\ignorespaces Small XOR data set.}}{134}{table.10.1} -\addvspace {10\p@ } -\addvspace {10\p@ } -\contentsline {table}{\numberline {12.1}{\ignorespaces Data set for learning conjunctions.}}{161}{table.12.1} -\addvspace {10\p@ } -\addvspace {10\p@ } -\addvspace {10\p@ } -\addvspace {10\p@ } -\addvspace {10\p@ } -\addvspace {10\p@ } -\addvspace {10\p@ } -\addvspace {10\p@ } -\addvspace {10\p@ } -\addvspace {10\p@ } -\addvspace {10\p@ } -\contentsfinish diff --git a/book/dt.tex b/book/dt.tex index 9ba95c1..07c89ff 100644 --- a/book/dt.tex +++ b/book/dt.tex @@ -226,7 +226,7 @@ \section{The Decision Tree Model of Learning} You want to find a feature that is \emph{most useful} in helping you guess whether this student will enjoy this course. A useful way to think about this is to look at the \concept{histogram} -of labels for each feature. +of labels for each feature. \sidenote{A colleague related the story of getting his 8-year old nephew to guess a number between 1 and 100. His nephew's first four questions @@ -247,12 +247,12 @@ \section{The Decision Tree Model of Learning} like this course. More formally, you will consider each feature in turn. 
You might -consider the feature ``Is this a System's course?'' This feature has -two possible value: no and yes. Some of the training examples have an +consider the feature ``Is this a Systems course?'' This feature has +two possible values: no and yes. Some of the training examples have an answer of ``no'' -- let's call that the ``NO'' set. Some of the training examples have an answer of ``yes'' -- let's call that the ``YES'' set. For each set (NO and YES) we will build a histogram over -the labels. This is the second histogram in +the labels. This is the fourth histogram (from the top) in Figure~\ref{fig:dt_histogram}. Now, suppose you were to ask this question on a random example and observe a value of ``no.'' Further suppose that you must \emph{immediately} guess the label for this @@ -414,7 +414,7 @@ \section{Formalizing the Learning Problem} Note that the loss function is something that \emph{you} must decide on based on the goals of learning. -\begin{mathreview}{Expectated Values} +\begin{mathreview}{Expected Values} We write $\Ep_{(\vx,y) \sim \cD} [ \ell(y, f(\vx)) ]$ for the expected loss. Expectation means ``average.'' This is saying ``if you drew a bunch of $(x,y)$ pairs independently at random from $\cD$, what would your \emph{average} loss be?% (More formally, what would be the average of $\ell(y,f(\vx))$ be over these random draws?) 
More formally, if $\cD$ is a discrete probability distribution, then this expectation can be expanded as: % @@ -426,12 +426,12 @@ \section{Formalizing the Learning Problem} If $D$ is a \emph{finite discrete distribution}, for instance defined by a finite data set $\{ (\vx_1,y_1), \dots, (\vx_N,y_N) \}$ that puts equal weight on each example (probability $1/N$), then we get: % \begin{align} -\Ep_{(\vx,y) \sim D} [ \ell(y, f(\vx)) ] +\Ep_{(\vx,y) \sim D} [ \ell(y, f(\vx)) ] &= \sum_{(\vx,y) \in D} [ D(\vx,y) \ell(y, f(\vx)) ] \becauseof{definition of expectation}\\ &= \sum_{n=1}^N [ D(\vx_n,y_n) \ell(y_n, f(\vx_n)) ] \becauseof{$D$ is discrete and finite}\\ -&= \sum_{n=1}^N [ \frac 1 N \ell(y_n, f(\vx_n)) ] +&= \sum_{n=1}^N [ \frac 1 N \ell(y_n, f(\vx_n)) ] \becauseof{definition of $D$}\\ &= \frac 1 N \sum_{n=1}^N [ \ell(y_n, f(\vx_n)) ] \becauseof{rearranging terms} @@ -501,7 +501,7 @@ \section{Formalizing the Learning Problem} $\hat \vx$ to corresponding prediction $\hat y$. The key property that $f$ should obey is that it should do well (as measured by $\ell$) on future examples that are \emph{also} drawn from $\cD$. Formally, -it's \concept{expected loss} $\ep$ over $\cD$ with repsect to $\ell$ +its \concept{expected loss} $\ep$ over $\cD$ with respect to $\ell$ should be as small as possible: \begin{align} \label{eq:expectederror} \ep @@ -542,7 +542,7 @@ \section{Formalizing the Learning Problem} \concept{generalize} beyond the training data to some future data that it might not have seen yet! 
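The finite-sample expansion above, where every example carries weight $1/N$, is exactly what one computes in practice. A minimal sketch (names are assumed, not from the text):

```python
def average_loss(data, f, loss):
    """Empirical expected loss: the average of loss(y, f(x)) over the
    finite dataset, i.e. uniform weight 1/N on each example, matching
    the final line of the expansion in the text."""
    return sum(loss(y, f(x)) for x, y in data) / len(data)

def zero_one(y, yhat):
    """Zero/one loss: 0 if the prediction matches the label, else 1."""
    return 0.0 if y == yhat else 1.0
```

For example, a classifier that always predicts $+1$ on a three-example set with labels $\langle +1, -1, +1 \rangle$ has empirical zero/one loss $1/3$.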
-So, putting it all together, we get a formal definition of induction +So, putting it all together, we get a formal definition of induction in machine learning: \bigemph{Given (i) a loss function $\ell$ and (ii) a sample $D$ from some unknown distribution $\cD$, you must compute a function $f$ that has low expected error $\ep$ over $\cD$ with @@ -624,7 +624,7 @@ \section{Further Reading} \end{comment} -%%% Local Variables: +%%% Local Variables: %%% mode: latex %%% TeX-master: "courseml" -%%% End: +%%% End: diff --git a/book/em.tex b/book/em.tex index 67ad862..3016966 100644 --- a/book/em.tex +++ b/book/em.tex @@ -1,8 +1,3 @@ -%%% Local Variables: -%%% mode: latex -%%% TeX-master: "courseml" -%%% End: - \chapter{Expectation Maximization} \label{sec:em} \chapterquote{A hen is only an egg's way of making another egg.}{Samuel~Butler} @@ -16,7 +11,7 @@ \chapter{Expectation Maximization} \label{sec:em} lower bounds. \item Implement EM for clustering with mixtures of Gaussians, and contrasting it with $k$-means. -\item Evaluate the differences betweem EM and gradient descent for +\item Evaluate the differences between EM and gradient descent for hidden variable models. \end{learningobjectives} @@ -100,7 +95,7 @@ \section{Grading an Exam without an Answer Key} &p(\vec a, \vec t, \vec s) \nonumber\\ &= \textcolor{darkpurple}{\left[ \prod_m 0.5^{t_m} 0.5^{1-t_m} \right]} \times \textcolor{darkblue}{\left[ \prod_n 1 \right]} \nonumber\\ - &\qquad \times \textcolor{darkred}{\left[ \prod_n\prod_m + &\qquad \times \textcolor{darkred}{\left[ \prod_n\prod_m s_n^{a_{n,m}t_m} (1-s_n)^{(1-a_{n,m})t_m} \right.} \nonumber\\ &\qquad\qquad\textcolor{darkred}{\left. 
s_n^{(1-a_{n,m})(1-t_m)} @@ -108,7 +103,7 @@ \section{Grading an Exam without an Answer Key} \right]} \\ &= \textcolor{darkpurple}{0.5^M} \textcolor{darkred}{ -\prod_n\prod_m +\prod_n\prod_m s_n^{a_{n,m}t_m} (1-s_n)^{(1-a_{n,m})t_m} s_n^{(1-a_{n,m})(1-t_m)} @@ -116,7 +111,7 @@ \section{Grading an Exam without an Answer Key} } \end{align} -Suppose we knew the true lables $\vec t$. We can take the log of this likelihood and differentiate it with respect to the score $s_n$ of some student (note: we can drop the $0.5^M$ term because it is just a constant): +Suppose we knew the true labels $\vec t$. We can take the log of this likelihood and differentiate it with respect to the score $s_n$ of some student (note: we can drop the $0.5^M$ term because it is just a constant): ~ \begin{align} \log p(\vec a, \vec t, \vec s) @@ -143,7 +138,7 @@ \section{Grading an Exam without an Answer Key} Putting this together, we get: % \begin{align} - s_n &= \frac 1 M \sum_m \big[ a_{n,m} t_m + (1-a_{n,m}) (1-t_m) \big] + s_n &= \frac 1 M \sum_m \big[ a_{n,m} t_m + (1-a_{n,m}) (1-t_m) \big] \end{align} % In the case of known $t$s, this matches exactly what we had in the heuristic. @@ -154,7 +149,7 @@ \section{Grading an Exam without an Answer Key} If we are going to compute expectations of $t$, we have to say: expectations according to which probability distribution? We will use the distribution $p(t_m \| \vec a, \vec s)$. Let $\tilde t_m$ denote $\Ep_{t_m \sim p(t_m \| \vec a, \vec s)}[t_m]$. -Because $t_m$ is a binary variable, its expectation is equal to it's probability; +Because $t_m$ is a binary variable, its expectation is equal to its probability; namely: $\tilde t_m = p(t_m \| \vec a, \vec s)$. How can we compute this? 
@@ -162,7 +157,7 @@ \section{Grading an Exam without an Answer Key} The computation is straightforward: % \begin{align} - C &= 0.5 \prod_n s_n^{a_{n,m}} (1-s_n)^{1-a_{n,m}} + C &= 0.5 \prod_n s_n^{a_{n,m}} (1-s_n)^{1-a_{n,m}} &= 0.5 \prod_{\substack{n : \\ a_{n,m} = 1}} s_n \prod_{\substack{n : \\ a_{n,m} = 0}} (1-s_n) \\ D &= 0.5 \prod_n s_n^{1-a_{n,m}} (1-s_n)^{a_{n,m}} &= 0.5 \prod_{\substack{n : \\ a_{n,m} = 1}} (1-s_n) \prod_{\substack{n : \\ a_{n,m} = 0}} s_n @@ -197,7 +192,7 @@ \section{Grading an Exam without an Answer Key} \section{Clustering with a Mixture of Gaussians} -In Chapter~\ref{sec:prob}, you learned about probabilitic models for +In Chapter~\ref{sec:prob}, you learned about probabilistic models for classification based on density estimation. Let's start with a fairly simple classification model that \emph{assumes} we have labeled data. We will shortly remove this assumption. Our model will state that we @@ -237,11 +232,11 @@ \section{Clustering with a Mixture of Gaussians} of the log likelihood: % \begin{align} -\th_k &= \text{fraction of training examples in class $k$} \\ +\th_k &= \text{fraction of training examples in class $k$} \label{eq:em:mlfrac}\\ &= \frac 1 N \sum_n [y_n = k] \nonumber\\ \vec\mu_k &= \text{mean of training examples in class $k$} \\ &= \frac {\sum_n [y_n = k] \vx_n} {\sum_n [y_n = k]} \nonumber\\ -\si^2_k &= \text{variance of training examples in class $k$} \\ +\si^2_k &= \text{variance of training examples in class $k$} \label{eq:em:mlvar}\\ &= \frac {\sum_n [y_n = k] \norm{\vx_n-\mu_k}} {\sum_n [y_n = k]} \nonumber \end{align} % @@ -251,19 +246,19 @@ \section{Clustering with a Mixture of Gaussians} $K$-means algorithm, one potential solution is to iterate. You can start off with guesses for the values of the unknown variables, and then iteratively improve them over time. In $K$-means, the approach -was the \emph{assign} examples to labels (or clusters). This time, +was to \emph{assign} examples to labels (or clusters). 
This time, instead of making hard assignments (``example $10$ belongs to cluster $4$''), we'll make \concept{soft assignments} (``example $10$ belongs half to cluster $4$, a quarter to cluster $2$ and a quarter to cluster $5$''). So as not to confuse ourselves too much, we'll introduce a -new variable, $\vec z_n = \langle z_{n,1}, \dots, z_{n,K}$ (that sums +new variable, $\vec z_n = \langle z_{n,1}, \dots, z_{n,K} \rangle$ (that sums to one), to denote a fractional assignment of examples to clusters. \TODOFigure{em:piecharts}{A figure showing pie charts} This notion of soft-assignments is visualized in Figure~\ref{fig:em:piecharts}. Here, we've depicted each example as a -pie chart, and it's coloring denotes the degree to which it's been +pie chart, and its coloring denotes the degree to which it's been assigned to each (of three) clusters. The size of the pie pieces correspond to the $\vec z_n$ values. @@ -282,9 +277,9 @@ \section{Clustering with a Mixture of Gaussians} the \concept{fractional assignments} $z_{n,k}$ are easy to compute. Now, akin to $K$-means, given fractional assignments, you need to recompute estimates of the model parameters. In analogy to the -maximum likelihood solution (Eqs~\eqref{}-\eqref{}), you can do this -by counting fractional points rather than full points. This gives the -following re-estimation updates: +maximum likelihood solution (Eqs~\eqref{eq:em:mlfrac}-\eqref{eq:em:mlvar}), you +can do this by counting fractional points rather than full points. This gives +the following re-estimation updates: % \begin{align} \th_k &= \text{fraction of training examples in class $k$} \\ @@ -433,7 +428,7 @@ \section{The Expectation Maximization Framework} % Note that this inequality holds for \emph{any} choice of function $q$, so long as its non-negative and sums to one. In particular, it -needn't even by the same function $q$ for each $n$. We will need to +needn't even be the same function $q$ for each $n$. 
We will need to take advantage of both of these properties. We have succeeded in our first goal: constructing a lower bound on @@ -457,7 +452,7 @@ \section{The Expectation Maximization Framework} order to ensure that an increase in the lower bound implies an increase in $\cL$, we need to ensure that $\cL(\mat X \| \vth) = \tilde\cL(\mat X \| \vth)$. In words: $\tilde\cL$ should be a lower -bound on $\cL$ that makes contact at the current point, $\vth$. +bound on $\cL$ that makes contact at the current point, $\vth$. %This %is shown in Figure~\ref{fig:em:lb}, including a case where the lower %bound does \emph{not} make contact, and thereby does not guarantee an @@ -491,7 +486,7 @@ \section{Further Reading} -%%% Local Variables: +%%% Local Variables: %%% mode: latex %%% TeX-master: "courseml" -%%% End: +%%% End: diff --git a/book/ens.tex b/book/ens.tex index 82aaaa7..2f66f33 100644 --- a/book/ens.tex +++ b/book/ens.tex @@ -54,7 +54,7 @@ \section{Voting Multiple Classifiers} time, you can make a prediction by \emph{voting}. On a test example $\hat x$, you compute $\hat y_1 = f_1(\hat x)$, $\dots$, $\hat y_M = f_M(\hat x)$. If there are more $+1$s in the list $\langle y_1, -\dots, y_M$ then you predict $+1$; otherwise you predict $-1$. +\dots, y_M \rangle$ then you predict $+1$; otherwise you predict $-1$. The main advantage of ensembles of different classifiers is that it is unlikely that all classifiers will make the same mistake. In fact, as @@ -269,9 +269,9 @@ \section{Boosting Weak Learners} x-axis). As you can see, if you are willing to boost for many iterations, very shallow trees are quite effective. -In fact, a very popular weak learner is a decision \concept{decision +In fact, a very popular weak learner is a \concept{decision stump}: a decision tree that can only ask \emph{one} question. 
This -may seem like a silly model (and, in fact, it is on it's own), but +may seem like a silly model (and, in fact, it is on its own), but when combined with boosting, it becomes very effective. To understand why, suppose for a moment that our data consists only of binary features, so that any question that a decision tree might ask is of @@ -282,12 +282,12 @@ \section{Boosting Weak Learners} \thinkaboutit{Why do the functions have this form?} Now, consider the \emph{final} form of a function learned by -AdaBoost. We can expand it as follow, where we let $f_k$ denote the +AdaBoost. We can expand it as follows, where we let $f_k$ denote the single feature selected by the $k$th decision stump and let $s_k$ denote its sign: % \begin{align} -f(\vx) +f(\vx) &= \sgn\left[ \sum_k \al_k f\kth(\vx) \right] \\ &= \sgn\left[ \sum_k \al_k s_k (2 x_{f_k} - 1) \right] \\ &= \sgn\left[ \sum_k 2 \al_k s_k x_{f_k} - \sum_k \al_k s_k \right] \\ @@ -387,7 +387,7 @@ \section{Further Reading} \end{comment} -%%% Local Variables: +%%% Local Variables: %%% mode: latex %%% TeX-master: "courseml" -%%% End: +%%% End: diff --git a/book/figs/srl_trellis.pdf b/book/figs/srl:trellis.pdf similarity index 100% rename from book/figs/srl_trellis.pdf rename to book/figs/srl:trellis.pdf diff --git a/book/figs/srl_trellis.svg b/book/figs/srl:trellis.svg similarity index 100% rename from book/figs/srl_trellis.svg rename to book/figs/srl:trellis.svg diff --git a/book/formal.tex b/book/formal.tex index 027bdb2..ecb5fcb 100644 --- a/book/formal.tex +++ b/book/formal.tex @@ -73,7 +73,7 @@ \section{Data Generating Distributions} % The take-home message is that if someone gave you access to the data distribution, forming an \emph{optimal} classifier would be trivial. 
-Unfortunately, no one gave you this distribution, so we need to +Unfortunately, no one gave you this distribution, so we need to figure out ways of learning the mapping from $x$ to $y$ given only access to a training set \emph{sampled from} $\cD$, rather than $\cD$ itself. % Unfortunately, no one gave you this distribution, but this analysis @@ -251,7 +251,7 @@ \section{Underfitting and Overfitting} full decision tree will have either $0$ or $1$ examples assigned to it ($20$ of the leaves will have one example; the rest will have none). For the leaves corresponding to training points, the full decision -tree will always make the correct prediction. +tree will always make the correct prediction. Given this, the training error, $\hat \ep$, is $0/20 = 0\%$. Of course our goal is \emph{not} to build a model that gets $0\%$ @@ -540,11 +540,11 @@ \section{Real World Applications of Machine Learning} In order to make these logs consumable by a machine learning algorithm, (6) we convert the data into input/output pairs: in this case, pairs of words from a bag-of-words representing the query and a bag-of-words representing the ad as input, and the click as a $\pm$ label. We then (7) select a model family (e.g., depth 20 decision trees), and thereby an inductive bias, for instance depth $\leq 20$ decision trees. -We're now ready to (8) select a specific subset of data to use as training data: in this case, data from April 2016. We split this into training and development and (9) learn a final decision tree, tuning the maximum depth on the development data. We can then use this decision tree to (10) make predictions on some held-out test data, in this case from the following month. We can (11) measure the overall quality of our predictor as zero/one loss (clasification error) on this test data and finally (12) deploy our system. +We're now ready to (8) select a specific subset of data to use as training data: in this case, data from April 2016. 
We split this into training and development and (9) learn a final decision tree, tuning the maximum depth on the development data. We can then use this decision tree to (10) make predictions on some held-out test data, in this case from the following month. We can (11) measure the overall quality of our predictor as zero/one loss (classification error) on this test data and finally (12) deploy our system. The important thing about this sequence of steps is that \emph{in any one, things can go wrong.} That is, between any two rows of this table, we are \emph{necessarily} accumulating some additional error against our original real world goal of increasing revenue. For example, in step 5, we decided on a representation that left out many possible variables we could have logged, like time of day or season of year. By leaving out those variables, we set an explicit upper bound on how well our learned system can do. -It is often an effective strategy to run an \concept{oracle experiment}. In an oracle experiment, we assume that everything below some line can be solved perfectly, and measure how much impact that will have on a higher line. As an extreme example, before embarking on a machine learning approach to the ad display problem, we should measure something like: if our classifier were \emph{perfect}, how much more money would we make? If the number is not very high, perhaps there is some better for our time. +It is often an effective strategy to run an \concept{oracle experiment}. In an oracle experiment, we assume that everything below some line can be solved perfectly, and measure how much impact that will have on a higher line. As an extreme example, before embarking on a machine learning approach to the ad display problem, we should measure something like: if our classifier were \emph{perfect}, how much more money would we make? If the number is not very high, perhaps there is some better use of our time. 
Finally, although this sequence is denoted linearly, the entire process is highly interactive in practice. A large part of ``debugging'' machine learning (covered more extensively in Chapter~\ref{sec:prac}) involves trying to figure out where in this sequence the biggest losses are and fixing that step. In general, it is often useful to \emph{build the stupidest thing that could possibly work}, then look at how well it's doing, and decide if and where to fix it. @@ -554,7 +554,7 @@ \section{Further Reading} TODO further reading -%%% Local Variables: +%%% Local Variables: %%% mode: latex %%% TeX-master: "courseml" -%%% End: +%%% End: diff --git a/book/halbook.tex b/book/halbook.tex index 1b8af23..1c93fe1 100644 --- a/book/halbook.tex +++ b/book/halbook.tex @@ -182,7 +182,7 @@ \node [mybox] (box) {\usebox{\mathreviewbox}}; \node [fancytitle, right=10pt] at (box.north west) {\normalfont\sffamily\Large\bfseries\scshape Math Review | \mathreviewargument}; \end{tikzpicture} - \caption[Math Review: \mathreviewargument]{~} + %\caption[Math Review: \mathreviewargument]{~} \end{figure*}% } @@ -347,7 +347,7 @@ \newcommand{\thinkaboutit}[1]{% \marginnote{ \noindent - \begin{tikzpicture}[transform shape, rotate=0, baseline=-3.5cm] + \begin{tikzpicture}[transform shape, rotate=0, baseline=-1cm] \node [draw=Blue, fill=Blue!10, very thick, rectangle, rounded corners, inner sep=8pt, inner ysep=5pt] (box) {% \begin{minipage}[t!]{1.8in} #1 diff --git a/book/imit.tex b/book/imit.tex index 8981939..20d837c 100644 --- a/book/imit.tex +++ b/book/imit.tex @@ -26,7 +26,7 @@ \chapter{Imitation Learning} \label{sec:imit} We want to watch the expert driving, and learn to imitate their behavior. Hence: \concept{imitation learning} (sometimes called \concept{learning by demonstration} or \concept{programming by example}, in the sense that programs are learned, and not implemented). 
-At each point in time $t = 1 \dots T$, the car recieves sensor information $\vx_t$ (for instance, a camera photo ahead of the car, or radar readings). +At each point in time $t = 1 \dots T$, the car receives sensor information $\vx_t$ (for instance, a camera photo ahead of the car, or radar readings). It then has to take an action, $a_t$; in the case of the car, this is one of the three available steering actions. The car then suffers some loss $\ell_t$; this might be zero in the case that it's driving well, or large in the case that it crashes. The world then changes, moves to time step $t+1$, sensor readings $\vx_{t+1}$ are observed, action $a_{t+1}$ is taken, loss $\ell_{t+1}$ is suffered, and the process continues. @@ -108,11 +108,11 @@ \section{Imitation Learning by Classification} So part of the question ``how well does this work'' is the more basic question of: what are we even trying to measure? There is a nice theorem\mycite{ross} that gives an upper bound on the loss suffered by the SupervisedIL algorithm (Algorithm~\ref{alg:imit:supertrain}) as a function of (a) the quality of the expert, and (b) the error rate of the learned classifier. -To be clear, we need to distinguish between the loss of the policy when run for $T$ steps to form a full trajectory, and the error rate of the learned classifier, which is just it's average multiclass classification error. +To be clear, we need to distinguish between the loss of the policy when run for $T$ steps to form a full trajectory, and the error rate of the learned classifier, which is just its average multiclass classification error. The theorem states, roughly, that the loss of the learned \emph{policy} is at most the loss of the expert plus $T^2$ times the error rate of the classifier. \begin{theorem}[Loss of SupervisedIL] - Suppose that one runs Algorithm~\ref{alg:imit:supertrain} using a multiclass classifier that optimizes the 0-1 loss (or an upperbound thereof). 
+ Suppose that one runs Algorithm~\ref{alg:imit:supertrain} using a multiclass classifier that optimizes the 0-1 loss (or an upper bound thereof). Let $\ep$ be the error rate of the underlying classifier (in expectation) and assume that all instantaneous losses are in the range $[0, \ell\xth{max}]$. Let $f$ be the learned policy; then: \begin{align} @@ -124,10 +124,10 @@ \section{Imitation Learning by Classification} \end{theorem} Intuitively, this bound on the loss is about a factor of $T$ away from what we might hope for. -In particular, the multiclass classifier makes errors on an $\ep$ fraction of it's actions, measured by zero/one loss. +In particular, the multiclass classifier makes errors on an $\ep$ fraction of its actions, measured by zero/one loss. In the worst case, this will lead to a loss of $\ell\xth{max}\ep$ for a single step. Summing all these errors over the entire trajectory would lead to a loss on the order of $\ell\xth{max}T\ep$, which is a factor $T$ better than this theorem provides. -A natural question (addressed in the next section) is whether this is analysis is tight. +A natural question (addressed in the next section) is whether this analysis is tight. A related question (addressed in the section after that) is whether we can do better. Before getting there, though, it's worth highlighting that an extra factor of $T$ is \emph{really bad.} It can cause even very small multiclass error rates to blow up; in particular, if $\ep \geq 1/T$, we lose, and $T$ can be in the hundreds or more. @@ -140,7 +140,7 @@ \section{Failure Analysis} As a concrete example, perhaps the expert driver never ever gets themselves into a state where they are directly facing a wall. Moreover, the expert driver probably tends to drive forward more than backward. If the imperfect learner manages to make a few errors and get stuck next to a wall, it's likely to resort to the general ``drive forward'' rule and stay there forever. 
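To get a feel for why the quadratic term is not just a proof artifact, here is a small back-of-the-envelope computation (ours, not the text's): it pessimistically assumes that once the learner makes its first error it stays lost, as in the stuck-facing-a-wall story, and suffers unit loss at every remaining step.

```python
# Pessimistic model of error compounding: the learner errs independently with
# probability eps at each step; after its first error it is stuck in states it
# never trained on and suffers unit loss for the rest of the T-step trajectory.
def expected_compounding_loss(eps, T):
    # by step t, the learner has gone wrong with probability 1 - (1 - eps)**t
    return sum(1 - (1 - eps) ** t for t in range(1, T + 1))

def hoped_for_loss(eps, T):
    # what we might hope for if errors did not compound: one unit per error
    return eps * T

for T in (10, 100, 1000):
    eps = 1.0 / T  # exactly the eps >= 1/T threshold at which "we lose"
    print(T, hoped_for_loss(eps, T), round(expected_compounding_loss(eps, T), 1))
```

At $\ep = 1/T$ the hoped-for loss is a constant one unit, while the compounded loss grows linearly with $T$: precisely the blow-up the theorem warns about.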
-This is the problem of \concept{compounding error}; +This is the problem of \concept{compounding error}; and yes, it does happen in practice. It turns out that it's possible to construct an imitation learning problem on which the $T^2$ compounding error is unavoidable. @@ -150,7 +150,7 @@ \section{Failure Analysis} The ``correct'' thing to do at $t=1$ is to press the button that corresponds to the image you've been shown. Pressing the correct button leads to $\ell_1=0$; the incorrect leads to $\ell_1=1$. Now, at time $t=2$ you are shown another image, again of a zero or one. -The correct thing to do in this time step is the xor of (a) the number written on the picture you see right now, and (b) the correct answer from the previous time step. +The ``correct'' thing to do in this time step is the xor of (a) the number written on the picture you see right now, and (b) the correct answer from the previous time step. This holds in general for $t>1$. There are two important things about this construction. @@ -253,7 +253,7 @@ \section{Dataset Aggregation} This is formalized in the following theorem: \begin{theorem}[Loss of Dagger] - Suppose that one runs Algorithm~\ref{alg:imit:dagger} using a multiclass classifier that optimizes the 0-1 loss (or an upperbound thereof). + Suppose that one runs Algorithm~\ref{alg:imit:dagger} using a multiclass classifier that optimizes the 0-1 loss (or an upper bound thereof). Let $\ep$ be the error rate of the underlying classifier (in expectation) and assume that all instantaneous losses are in the range $[0, \ell\xth{max}]$. Let $f$ be the learned policy; then: \begin{align} @@ -287,10 +287,10 @@ \section{Expensive Algorithms as Experts} \end{enumerate} Consider the game playing example, and for concreteness, suppose you are trying to learn to play solitaire (this is an easier example because it's a single player game). 
-When running DaggerTrain (Algorithm~\ref{alg:imit:dagger} to learn a chess-playing policy, the algorithm will repeatedly ask for $\VARm{\text{expert}}(\VARm{\vx})$, where $\vx$ is the current state of the game. +When running DaggerTrain (Algorithm~\ref{alg:imit:dagger}) to learn a solitaire-playing policy, the algorithm will repeatedly ask for $\VARm{\text{expert}}(\VARm{\vx})$, where $\vx$ is the current state of the game. What should this function return? Ideally, it should return the/an action $a$ such that, if $a$ is taken, and then the rest of the game is played optimally, the player wins. -Computing this exactly is going to be very difficult for anything except the simplest games, so we need to restort to an approxiamtion. +Computing this exactly is going to be very difficult for anything except the simplest games, so we need to resort to an approximation. \newalgorithm% {imit:dldfs}% @@ -319,7 +319,7 @@ \section{Expensive Algorithms as Experts} \TODOFigure{imit:dldfs}{Depth limited depth-first search} -A common strategy is to run a depth-limited depth first search, starting at state $\vx$, and terminating after at most three of four moves (see Figure~\ref{fig:imit:dldfs}). +A common strategy is to run a depth-limited depth-first search, starting at state $\vx$, and terminating after at most three or four moves (see Figure~\ref{fig:imit:dldfs}). This will generate a search tree. Unless you are very near the end of the game, none of the leaves of this tree will correspond to the end of the game. So you'll need some heuristic, $h$, for evaluating states that are non-terminals. @@ -375,7 +375,7 @@ \section{Structured Prediction via Imitation Learning} &= \argmin_a \text{best}(\ell, \vy, \hat\vy \circ a) \end{align} ~ -Namely, it is the action that leads to the best possible completion \emph{after} taking that action. +Namely, it is the action that leads to the best possible completion \emph{after} taking that action.
So in the example above, the expert action is ``adj''. For some problems and some loss functions, computing the expert is easy. In particular, for sequence labeling under Hamming loss, it's trivial. @@ -392,7 +392,7 @@ \section{Structured Prediction via Imitation Learning} The expert label for the $t$th word is just the corresponding label in the ground truth $\vy$. Given all this, one can run Dagger (Algorithm~\ref{alg:imit:dagger}) exactly as specified. -Moving to structured prediction problems other than sequence labeling problems is beyond the scope of this book. +Moving to structured prediction problems other than sequence labeling problems is beyond the scope of this book. The general framework is to cast your structured prediction problem as a sequential decision making problem. Once you've done that, you need to decide on features (this is the easy part) and an expert (this is often the harder part). However, once you've done so, there are generic libraries for ``compiling'' your specification down to code. @@ -402,7 +402,7 @@ \section{Further Reading} TODO further reading -%%% Local Variables: +%%% Local Variables: %%% mode: latex %%% TeX-master: "courseml" -%%% End: +%%% End: diff --git a/book/kernel.tex b/book/kernel.tex index 73da482..d64b3de 100644 --- a/book/kernel.tex +++ b/book/kernel.tex @@ -6,10 +6,9 @@ \chapter{Kernel Methods} \label{sec:kernel} \item Explain how kernels generalize both feature combinations and basis functions. \item Contrast dot products with kernel products. -\item Implement kernelized perceptron. +\item Implement a kernelized version of the perceptron. \item Derive a kernelized version of regularized least squares regression. -\item Implement a kernelized version of the perceptron. \item Derive the dual formulation of the support vector machine.
\end{learningobjectives} @@ -181,7 +180,7 @@ \section{Kernelized Perceptron} \end{myproof} Now that you know that you can always write $\vw = \sum_n \al_n -\phi(\vx_n)$ for some $\al_i$s, you can additionall compute the +\phi(\vx_n)$ for some $\al_i$s, you can additionally compute the activations (line 4) as: % \begin{align} @@ -223,7 +222,7 @@ \section{Kernelized Perceptron} feature expansions like the quadratic feature expansion from the introduction for ``free.'' For example, for exactly the same cost as the quadratic features, you can use a \concept{cubic feature map}, -computed as $\ddot{\phi(\vx)}{\phi(\vz)} = (1 + \dotp{\vx}{\vz})^3$, +computed as $\dotp{\phi(\vx)}{\phi(\vz)} = (1 + \dotp{\vx}{\vz})^3$, which corresponds to three-way interactions between variables. (And, in general, you can do so for any polynomial degree $p$ at the same computational complexity.) @@ -262,7 +261,7 @@ \section{Kernelized K-means} compute norms. This can be done as follows: % \begin{align} -z_n +z_n &= \arg\min_k \norm{\textcolor{darkblue}{\phi(\vx_n)} - \textcolor{darkred}{\vec\mu\kth}}^2 \becauseof{definition of $z_n$} \\ &= \arg\min_k \norm{\textcolor{darkblue}{\phi(\vx_n)} - \textcolor{darkred}{\sum_m \al\kth_m \phi(\vx_m)}}^2 @@ -365,11 +364,11 @@ \section{What Makes a Kernel} can do this as follows: % \begin{align} -\int\!\!\!\int f(\vx) K(\vx,\vz) f(\vz) \ud \vx \ud \vz -&= \int\!\!\!\int f(\vx) \left[ K_1(\vx,\vz) + K_2(\vx,\vz) \right] f(\vz) \ud \vx \ud \vz +\int\!\!\!\int f(\vx) K(\vx,\vz) f(\vz) \ud \vx \ud \vz +&= \int\!\!\!\int f(\vx) \left[ K_1(\vx,\vz) + K_2(\vx,\vz) \right] f(\vz) \ud \vx \ud \vz \becauseof{definition of $K$}\\ &= \int\!\!\!\int f(\vx) K_1(\vx,\vz) f(\vz) \ud \vx \ud \vz \nonumber\\ -&\quad + \int\!\!\!\int f(\vx) K_2(\vx,\vz) f(\vz) \ud \vx \ud \vz +&\quad + \int\!\!\!\int f(\vx) K_2(\vx,\vz) f(\vz) \ud \vx \ud \vz \becauseof{distributive rule}\\ &> 0 + 0 \becauseof{$K_1$ and $K_2$ are psd} @@ -411,7 +410,7 @@ \section{What Makes a Kernel} in 
analysis (particularly, integration by parts), but otherwise not difficult. Again, the proof is provided in the appendix. -So far, you have seen two bsaic classes of kernels: polynomial kernels +So far, you have seen two basic classes of kernels: polynomial kernels ($K(\vx,\vz) = (1 + \dotp{\vx}{\vz})^d$), which includes the linear kernel ($K(\vx,\vz) = \dotp{\vx}{\vz}$) and RBF kernels ($K(\vx,\vz) = \exp[-\ga \norm{\vx-\vz}^2]$). The former have a direct connection to @@ -492,7 +491,7 @@ \section{Support Vector Machines} % \begin{align} \cL(\vw,b,\vec\xi,\vec\al,\vec\be) -&= +&= \textcolor{darkblue}{\frac 1 2 \norm{\vw}^2} %\\&\qquad + \textcolor{darkergreen}{C \sum_n \xi_n} @@ -501,7 +500,7 @@ \section{Support Vector Machines} \\&\qquad - \sum_n \al_n \left[ \textcolor{darkred}{y_n \left( \dotp{\vw}{\vx_n} + b \right) - 1 + \xi_n} - \right] + \right] \end{align} % The \emph{new} optimization problem is: @@ -518,17 +517,17 @@ \section{Support Vector Machines} $+\infty$, breaking the solution. You can solve this problem by taking gradients. This is a bit -tedious, but and important step to realize how everything fits +tedious, but an important step to realize how everything fits together. Since your goal is to remove the dependence on $\vw$, the first step is to take a gradient with respect to $\vw$, set it equal to zero, and solve for $\vw$ in terms of the other variables. 
% \begin{align} \grad_{\vw} \cL -= \textcolor{darkblue}{\vw} += \textcolor{darkblue}{\vw} - \sum_n \al_n \textcolor{darkred}{y_n \vx_n} = 0 \quad\Longleftrightarrow\quad -\textcolor{darkblue}{\vw} +\textcolor{darkblue}{\vw} = \sum_n \al_n \textcolor{darkred}{y_n \vx_n} \end{align} % @@ -543,7 +542,7 @@ \section{Support Vector Machines} % \begin{align} \cL(b,\vec\xi,\vec\al,\vec\be) -&= +&= \textcolor{darkblue}{\frac 1 2 \norm{\sum_m \al_m y_m \vx_m}^2} %\\&\qquad + \textcolor{darkergreen}{C \sum_n \xi_n} @@ -560,7 +559,7 @@ \section{Support Vector Machines} % \begin{align} \cL(b,\vec\xi,\vec\al,\vec\be) -&= +&= \textcolor{darkblue}{\frac 1 2 \sum_n \sum_m \al_n \al_m y_n y_m \dotp{\vx_n}{\vx_m} } @@ -568,7 +567,7 @@ \section{Support Vector Machines} \\&\qquad - \textcolor{darkred}{ \sum_n - \sum_m + \sum_m \al_n \al_m y_n y_m \dotp{\vx_n}{\vx_m} } - \textcolor{darkred}{ @@ -580,7 +579,7 @@ \section{Support Vector Machines} + \sum_n (\textcolor{darkergreen}{C} - \textcolor{darkpurple}{\be_n}) \xi_n \\&\qquad \textcolor{darkred}{ -- b \sum_n \al_n y_n +- b \sum_n \al_n y_n - \sum_n \al_n (\xi_n - 1)} \end{align} % @@ -598,7 +597,7 @@ \section{Support Vector Machines} \end{align} % This doesn't allow you to \emph{substitute} $b$ with something (as you -did with $\vw$), but it does mean that the fourth term ($b \sum_n +did with $\vw$), but it does mean that the third term ($b \sum_n \al_n y_n$) goes to zero at the optimum. The last of the original variables is $\xi_n$; the derivatives in this @@ -631,7 +630,7 @@ \section{Support Vector Machines} \end{align} % If you are comfortable with matrix notation, this has a very compact -form.
Let $\vec 1$ denote the $N$-dimensional vector of all $1$s, let $\vec y$ denote the vector of labels and let $\mat G$ be the $N \times N$ matrix, where $\mat G_{n,m} = y_n y_m K(\vx_n, \vx_m)$, then this has the following form: @@ -650,7 +649,7 @@ \section{Support Vector Machines} problem is: % \optimize{kernel:svmdual}{\vec\al}{% -- \cL(\vec \al) = +- \cL(\vec \al) = \frac 1 2 \sum_n \sum_m \al_n \al_m y_n y_m K(\vx_n,\vx_m) - \sum_n \al_n @@ -786,7 +785,7 @@ \section{Kernelized Regression} labels. This algorithm is, in some ways, even easier to kernelize than the -perceptron. The optimal solution has a closed form, and +perceptron. The optimal solution has a closed form, and \end{comment} \section{Further Reading} @@ -800,7 +799,7 @@ \section{Further Reading} % http://nlp.stanford.edu/IR-book/html/htmledition/nonlinear-svms-1.html -%%% Local Variables: +%%% Local Variables: %%% mode: latex %%% TeX-master: "courseml" %%% End: diff --git a/book/knn.tex b/book/knn.tex index 3833e6d..b88b1b9 100644 --- a/book/knn.tex +++ b/book/knn.tex @@ -47,7 +47,7 @@ \section{From Data to Feature Vectors} the \concept{feature values}, and how they vary across examples, mean something to the machine. From this perspective, you can think about an example as being represented by a \concept{feature vector} -consisting of one ``dimension'' for each feature, where each dimenion +consisting of one ``dimension'' for each feature, where each dimension is simply some real value. Consider a review that said ``excellent'' three times, had one @@ -170,9 +170,6 @@ \section{From Data to Feature Vectors} \thinkaboutit{Verify that $d$ from Eq~\eqref{eq:euclidean} gives the same result ($4.24$) for the previous computation.} -\Figure{knn_classifyit}{A figure showing an easy NN classification - problem where the test point is a ? and should be negative.} - Now that you have access to distances between examples, you can start thinking about what it means to learn again. 
Consider Figure~\ref{fig:knn_classifyit}. We have a collection of training @@ -186,7 +183,7 @@ \section{From Data to Feature Vectors} of nearby points. This is an example of a new form of \concept{inductive bias}. -The \concept{nearest neighbor} classifier is build upon this insight. +The \concept{nearest neighbor} classifier is built upon this insight. In comparison to decision trees, the algorithm is ridiculously simple. At training time, we simply store the entire training set. At test time, we get a test example $\hat\vx$. To predict its label, @@ -196,11 +193,14 @@ \section{From Data to Feature Vectors} it has a corresponding label, $y$. We predict that the label of $\hat\vx$ is also $y$. +\Figure{knn_classifyit}{A figure showing an easy NN classification + problem where the test point is a ? and should be negative.} + Despite its simplicity, this nearest neighbor classifier is incredibly effective. (Some might say \emph{frustratingly} effective.) However, it is particularly prone to overfitting label noise. Consider the data in Figure~\ref{fig:knn_classifyitbad}. You would probably want -to label the test point positive. Unfortunately, it's nearest +to label the test point positive. Unfortunately, its nearest neighbor happens to be negative. Since the nearest neighbor algorithm only looks at the \emph{single} nearest neighbor, it cannot consider the ``preponderance of evidence'' that this point should probably @@ -220,6 +220,11 @@ \section{From Data to Feature Vectors} positive and one is negative. Through voting, positive would win. \thinkaboutit{Why is it a good idea to use an odd number for $K$?} +\thinkaboutit{Why is the sign of the sum computed in lines 2-4 the + same as the majority vote of the associated training examples?} +\thinkaboutit{Why can't you simply pick the value of $K$ that does + best on the training data? 
In other words, why do we have to treat + it like a hyperparameter rather than just a parameter?} \newalgorithm% {knn:knn}% @@ -262,9 +267,6 @@ \section{From Data to Feature Vectors} \emph{summing} the class labels for each of the $K$ nearest neighbors (lines 6-10) and using the \FUN{sign} of this sum as our prediction. -\thinkaboutit{Why is the sign of the sum computed in lines 2-4 the - same as the majority vote of the associated training examples?} - The big question, of course, is how to choose $K$. As we've seen, with $K=1$, we run the risk of overfitting. On the other hand, if $K$ is large (for instance, $K=N$), then \FUN{KNN-Predict} will always @@ -273,9 +275,14 @@ \section{From Data to Feature Vectors} between overfitting (small value of $K$) and underfitting (large value of $K$). -\thinkaboutit{Why can't you simply pick the value of $K$ that does - best on the training data? In other words, why do we have to treat - it like a hyperparameter rather than just a parameter.} +\MoveNextFigure{-7cm} +\Figure{knn:ski}{A figure of a ski and a snowboard.} +%\MoveNextFigure{-3cm} +\Figure{knn:skidata}{Classification data for ski vs snowboard in + 2d} +%\MoveNextFigure{-5cm} +\Figure{knn:skidatabad}{Classification data for ski vs snowboard in + 2d, with width rescaled to cm.} One aspect of \concept{inductive bias} that we've seen for KNN is that it assumes that nearby points should have the same label. Another @@ -289,13 +296,6 @@ \section{From Data to Feature Vectors} relevant features and lots of irrelevant features, KNN is likely to do poorly. -\MoveNextFigure{-18cm} -\Figure{knn:ski}{A figure of a ski and a snowboard.} - -\MoveNextFigure{-10cm} -\Figure{knn:skidata}{Classification data for ski vs snowboard in - 2d} - A related issue with KNN is \concept{feature scale}. Suppose that we are trying to classify whether some object is a ski or a snowboard (see Figure~\ref{fig:knn:ski}).
We are given two features about this @@ -307,18 +307,13 @@ \section{From Data to Feature Vectors} well. Suppose, however, that our measurement of the width was computed in -millimeters (instead of centimeters). This yields the data shown in +centimeters (instead of millimeters). This yields the data shown in Figure~\ref{fig:knn:skidatabad}. Since the width values are now tiny, in comparison to the height values, a KNN classifier will effectively \emph{ignore} the width values and classify almost purely based on height. The predicted class for the displayed test point has changed because of this feature scaling. -\MoveNextFigure{-5cm} -\Figure{knn:skidatabad}{Classification data for ski vs snowboard in - 2d, with width rescaled to mm.} - - We will discuss feature scaling more in Chapter~\ref{sec:prac}. For now, it is just important to keep in mind that KNN does not have the power to decide which features are important. @@ -329,7 +324,7 @@ \section{From Data to Feature Vectors} vector. In general, if $\vx = \langle x_1, x_2, \dots, x_D \rangle$, then $x_d$ is its $d$th component. So $x_3 = -6$ in the previous example. - + ~ \textbf{Vector sums} are computed pointwise, and are only defined when @@ -358,13 +353,13 @@ \section{Decision Boundaries} The standard way that we've been thinking about learning algorithms up to now is in the \emph{query model}. Based on training data, you learn something. I then give you a query example and you have to -guess it's label. +guess its label. \Figure{knn:db}{decision boundary for 1nn.} An alternative, less passive, way to think about a learned model is to ask: what sort of test examples will it classify as positive, and what -sort will it classify as negative. In Figure~\ref{fig:knn:db}, we have a +sort will it classify as negative? In Figure~\ref{fig:knn:db}, we have a set of training data.
The background of the image is colored blue in regions that \emph{would} be classified as positive (if a query were issued there) and colored red in regions that \emph{would} be @@ -383,7 +378,7 @@ \section{Decision Boundaries} with a decision boundary that is really jagged (like the coastline of Norway) is really complex and prone to overfitting. A learned model with a decision boundary that is really simple (like the boundary -between Arizona and Utah) is potentially underfit. +between Arizona and Utah) is potentially underfit. %In %Figure~\ref{fig:knn:dbmany}, you can see the decision boundaries for KNN %models with $K \in \{1, 3, 5, 7\}$. As you can see, the boundaries @@ -420,7 +415,7 @@ \section{Decision Boundaries} axis-aligned cuts. The cuts must be axis-aligned because nodes can only query on a single feature at a time. In this case, since the decision tree was so shallow, the decision boundary was relatively -simple. +simple. \thinkaboutit{What sort of data might yield a very simple decision boundary with a decision tree and very complex decision boundary @@ -512,7 +507,7 @@ \section{Decision Boundaries} repeats until the centers converge. An obvious question about this algorithm is: does it converge? A -second question is: how long does it take to converge. The first +second question is: how long does it take to converge? The first question is actually easy to answer. Yes, it does. And in practice, it usually converges quite quickly (usually fewer than $20$ iterations). In Chapter~\ref{sec:unsup}, we will actually @@ -537,7 +532,7 @@ \section{Decision Boundaries} guaranteed to converge to the ``right answer.'' The key problem with unsupervised learning is that we have no way of knowing what the ``right answer'' is. Convergence to a bad solution is usually due to -poor initialization. +poor initialization. %For example, poor initialization in the data set %from before yields convergence like that seen in %Figure~\ref{fig:knn:kmeansbad}.
As you can see, the algorithm @@ -561,7 +556,7 @@ \section{Warning: High Dimensions are Scary} at \url{gutenberg.org/ebooks/201}.} In addition to being hard to visualize, there are at least two -additional problems in high dimensions, both refered to as +additional problems in high dimensions, both referred to as \concept{the curse of dimensionality}. One is computational, the other is mathematical. @@ -592,7 +587,7 @@ \section{Warning: High Dimensions are Scary} dimensions, this gridding technique will only be useful if you have at least $95$ trillion training examples. -For ``medium dimensional'' data (approximately $1000$) dimesions, the +For ``medium dimensional'' data (approximately $1000$ dimensions), the number of grid cells is a $9$ followed by $698$ numbers before the decimal point. For comparison, the number of atoms in the universe is approximately $1$ followed by $80$ zeros. So even if each atom @@ -629,8 +624,8 @@ \section{Warning: High Dimensions are Scary} sphere in the middle so that it touches all four green spheres. We can easily compute the radius of this small sphere. The Pythagorean theorem says that $1^2 + 1^2 = (1+r)^2$, so solving for $r$ we get $r -= \sqrt 2 - 1 \approx 0.41$. Thus, by calculation, the blue sphere -lies entirely within the cube (cube = square) that contains the grey += \sqrt 2 - 1 \approx 0.41$. Thus, by calculation, the red sphere +lies entirely within the cube (cube = square) that contains the green spheres. (Yes, this is also obvious from the picture, but perhaps you can see where this is going.)
This is still -entirely enclosed in the cube of width four that holds all eight grey +entirely enclosed in the cube of width four that holds all eight green spheres. At this point it becomes difficult to produce figures, so you'll have to apply your imagination. In four dimensions, we would have $16$ green spheres (called \concept{hyperspheres}), each of radius one. They would still be inside a cube (called a \concept{hypercube}) of -width four. The blue hypersphere would have radius $r = \sqrt4 - 1 = -1$. Continuing to five dimensions, the blue hypersphere embedded in +width four. The red hypersphere would have radius $r = \sqrt4 - 1 = +1$. Continuing to five dimensions, the red hypersphere embedded in $256$ green hyperspheres would have radius $r = \sqrt5-1 \approx 1.23$ and so on. In general, in $D$-dimensional space, there will be $2^D$ green hyperspheres of radius one. Each green hypersphere will touch exactly -$n$-many other hyperspheres. The blue hyperspheres in the middle will +$n$-many other hyperspheres. The red hyperspheres in the middle will touch them all and will have radius $r = \sqrt D - 1$. Think about this for a moment. As the number of dimensions grows, the -radius of the blue hypersphere \emph{grows without bound!}. For -example, in $9$-dimensions the radius of the blue hypersphere is now -$\sqrt9-1 = 2$. But with a radius of two, the blue hypersphere is now +radius of the red hypersphere \emph{grows without bound!}. For +example, in $9$-dimensions the radius of the red hypersphere is now +$\sqrt9-1 = 2$. But with a radius of two, the red hypersphere is now ``squeezing'' between the green hypersphere and \emph{touching} the edges of the hypercube. In $10$ dimensional space, the radius is approximately $2.16$ and it pokes outside the cube. @@ -674,8 +669,8 @@ \section{Warning: High Dimensions are Scary} %example, what you think looks like a ``round'' cluster in two or three %dimensions, might not look so ``round'' in high dimensions. 
-%\Figure{knn:uniform}{100 uniform random points in 1, 2 and 3 -% dimensions} +\Figure{knn:uniform}{100 uniform random points in 1, 2 and 3 + dimensions} The second strange fact we will consider has to do with the distances between points in high dimensions. We start by considering random @@ -702,7 +697,7 @@ \section{Warning: High Dimensions are Scary} \Big] \Big] \end{equation} We can actually compute this in closed form and arrive -at $\textit{avgDist}(D) = \sqrt D / 3$. Because we know that the maximum distance between two points grows like $\sqrt D$, this says that the ratio between average distance and maximum distance converges to $1/3$. +at $\textit{avgDist}(D) = \sqrt D / 3$. Because we know that the maximum distance between two points grows like $\sqrt D$, this says that the ratio between average distance and maximum distance converges to $1/3$. What is more interesting, however, is the \emph{variance} of the distribution of distances. You can show that in $D$ dimensions, the variance is \emph{constant} $1/\sqrt{18}$, \emph{independent of $D$}. This means that when you look at (variance) divided-by (max distance), the variance behaves like $1/\sqrt{18 D}$, which means that the effective variance continues to shrink as $D$ grows\mycite{brin95nn}. @@ -712,7 +707,7 @@ \section{Warning: High Dimensions are Scary} imagine you are. So I implemented it. In Figure~\ref{fig:knn:uniformhist} you can see the results. This presents a \emph{histogram} of distances between random points in $D$ -dimensions for $D \in \{1,2,3,10,20,100\}$. As you can see, all of +dimensions for $D \in \{2,8,32,128,512\}$. As you can see, all of these distances begin to concentrate around $0.4\sqrt{D}$, even for ``medium dimension'' problems. 
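If you would rather not take the figure's word for it, the experiment is only a few lines to rerun. This is our own quick version (the pair count, dimensions, and seed are arbitrary choices, not the ones used for the book's figure):

```python
import math
import random

# Distances between pairs of uniform random points in [0,1]^D: the mean grows
# like sqrt(D), while the spread relative to the mean shrinks, so the
# distances pile up ("concentrate") around roughly 0.4 * sqrt(D).
def distance_stats(D, n_pairs=2000, seed=0):
    rng = random.Random(seed)
    dists = []
    for _ in range(n_pairs):
        x = [rng.random() for _ in range(D)]
        z = [rng.random() for _ in range(D)]
        dists.append(math.sqrt(sum((a - b) ** 2 for a, b in zip(x, z))))
    mean = sum(dists) / n_pairs
    std = math.sqrt(sum((d - mean) ** 2 for d in dists) / n_pairs)
    return mean / math.sqrt(D), std / mean  # normalized mean, relative spread

for D in (2, 8, 32, 128):
    norm_mean, rel_spread = distance_stats(D)
    print(D, round(norm_mean, 3), round(rel_spread, 3))
```

The normalized means settle near $0.41$ and the relative spread shrinks roughly like $1/\sqrt{D}$, which is exactly the concentration around $0.4\sqrt{D}$ visible in the histograms.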
@@ -856,7 +851,7 @@ \section{Further Reading} K-means clustering \end{comment} -%%% Local Variables: +%%% Local Variables: %%% mode: latex %%% TeX-master: "courseml" -%%% End: +%%% End: diff --git a/book/loss.tex b/book/loss.tex index 640626c..ff1beee 100644 --- a/book/loss.tex +++ b/book/loss.tex @@ -90,7 +90,7 @@ \section{The Optimization Framework for Linear Models} You might then come back and say: okay, well I don't really need an \emph{exact} solution. I'm willing to have a solution that makes one or two more errors than it has to. Unfortunately, the situation is -really bad. Zero/one loss is NP-hard to even \emph{appproximately +really bad. Zero/one loss is NP-hard to even \emph{approximately minimize}. In other words, there is no efficient algorithm for even finding a solution that's a small constant worse than optimal. (The best known constant at this time is $418/415 \approx 1.007$.) @@ -145,9 +145,12 @@ \section{Convex Surrogate Loss Functions} $1$. But adjusting it upwards by $0.00000009$ will have no effect. This makes it really difficult to figure out good ways to adjust the parameters. - +\MoveNextFigure{-5cm} \Figure{loss:zeroone}{plot of zero/one versus margin} +\Figure{loss:sigmoidzeroone}{plot of zero/one versus margin and an + S version of it} + To see this more clearly, it is useful to look at plots that relate \emph{margin} to \emph{loss}. Such a plot for zero/one loss is shown in Figure~\ref{fig:loss:zeroone}. In this plot, the horizontal axis @@ -159,9 +162,6 @@ \section{Convex Surrogate Loss Functions} that change the margin \emph{just a little bit} can have an enormous effect on the overall loss. -\Figure{loss:sigmoidzeroone}{plot of zero/one versus margin and an - S version of it} - You might decide that a reasonable way to address this problem is to replace the non-smooth zero/one loss with a smooth approximation. 
With a bit of effort, you could probably concoct an ``S''-shaped @@ -176,7 +176,7 @@ \section{Convex Surrogate Loss Functions} mnemonic is that you can hide under a con{\bf cave} function.) There are two equivalent definitions of a convex function. The first is that its second derivative is always non-negative. The second, more -geometric, defition is that any \concept{chord} of the function lies +geometric, definition is that any \concept{chord} of the function lies above it. This is shown in Figure~\ref{fig:loss:convex}. There you can see a convex function and a non-convex function, both with two chords drawn in. In the case of the convex function, the chords lie @@ -299,7 +299,7 @@ \section{Weight Regularization} this means that feature $d$ is not used at all in the classification decision. If there are a large number of irrelevant features, you might want as many weights to go to zero as possible. This suggests -an alternative regularizer: $R\xth{cnt}(\vw,b) = \sum_d \Ind[x_d \neq +an alternative regularizer: $R\xth{cnt}(\vw,b) = \sum_d \Ind[w_d \neq 0]$. \thinkaboutit{Why might you not want to use $R\xth{cnt}$ as a regularizer?} @@ -314,22 +314,22 @@ \section{Weight Regularization} \norm{\vw}_p = \left( \sum_d \ab{w_d}^p \right)^{\frac 1 p} \end{equation} % -You can check that the $2$-norm exactly corresponds to the usual -Euclidean norm, and that the $1$-norm corresponds to the ``absolute'' -regularizer described above. - \thinkaboutit{You can actually identify the $R\xth{cnt}$ regularizer - with a $p$-norm as well. Which value of $p$ gives it to you? - (Hint: you may have to take a limit.)} + with a $p$-norm as well. Which value of $p$ gives it to you? + (Hint: you may have to take a limit.)} \TODOFigure{loss:norms2d}{level sets of the same $p$-norms} +You can check that the $2$-norm exactly corresponds to the usual +Euclidean norm, and that the $1$-norm corresponds to the ``absolute'' +regularizer described above.
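Since the $p$-norm is a one-liner, the special cases above are easy to check directly (a quick sketch of ours, not from the text):

```python
# General p-norm of a weight vector, matching the displayed equation.
def p_norm(w, p):
    return sum(abs(wd) ** p for wd in w) ** (1.0 / p)

w = [3.0, -4.0]
print(p_norm(w, 2))   # the usual Euclidean norm: 5.0
print(p_norm(w, 1))   # the "absolute" regularizer: 7.0
print(p_norm(w, 50))  # large p approaches the max norm (here, close to 4)
```

Cranking $p$ upward drives the norm toward the largest absolute coordinate, which is why the contours flatten into a square as $p$ grows.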
+ When $p$-norms are used to regularize weight vectors, the interesting aspect is how they trade-off multiple features. To see the behavior of $p$-norms in two dimensions, we can plot their \concept{contour} (or \concept{level-set}). Figure~\ref{fig:loss:norms2d} shows the contours for the same $p$ norms in two dimensions. Each line denotes -the two-dimensional vectors to which this norm assignes a total value +the two-dimensional vectors to which this norm assigns a total value of $1$. By changing the value of $p$, you can interpolate between a square (the so-called ``max norm''), down to a circle ($2$-norm), diamond ($1$-norm) and pointy-star-shaped-thing ($p<1$ norm). @@ -350,13 +350,13 @@ \section{Optimization with Gradient Descent} \begin{mathreview}{Gradients} A gradient is a multidimensional generalization of a derivative. Suppose you have a function $f : \R^D \fto \R$ that takes a vector $\vx = \langle x_1, x_2, \dots, x_D \rangle$ as input and produces a scalar value as output. - You can differentite this function according to any one of the inputs; for instance, you can compute $\frac {\partial f} {\partial x_5}$ to get the derivative with respect to the fifth input. + You can differentiate this function according to any one of the inputs; for instance, you can compute $\frac {\partial f} {\partial x_5}$ to get the derivative with respect to the fifth input. The \concept{gradient} of $f$ is just the vector consisting of the derivative of $f$ with respect to each of its input coordinates independently, and is denoted $\grad f$, or, when the input to $f$ is ambiguous, $\grad_{\vec x} f$.
This is defined as: ~ \begin{align} - \grad_{\vec x} f &= \left\langle \frac {\partial f} {\partial x_1}~~,~~ - \frac {\partial f} {\partial x_2}~~,~~ + \grad_{\vec x} f &= \left\langle \frac {\partial f} {\partial x_1}~~,~~ + \frac {\partial f} {\partial x_2}~~,~~ \dots~~,~~ \frac {\partial f} {\partial x_D} \right\rangle \end{align} @@ -436,7 +436,7 @@ \section{Optimization with Gradient Descent} % \begin{equation} \cL(\vw,b) = -\sum_n +\sum_n \exp\big[-y_n (\dotp{\vec w}{\vx_n}+b)\big] + \frac \la 2 \norm{\vw}^2 \end{equation} @@ -459,7 +459,7 @@ \section{Optimization with Gradient Descent} \leftarrow b - \eta \partialof{\cL}{b}$. Consider positive examples: examples with $y_n=+1$. We would hope for these examples that the current prediction, $\dotp{\vw}{\vx_n}+b$, is as large as possible. -As this value tends toward $\infty$, the term in the $\exp[]$ goes to +As this value tends toward $\infty$, the $\exp[]$ term goes to zero. Thus, such points will not contribute to the step. However, if the current prediction is small, then the $\exp[]$ term will be positive and non-zero. This means that the bias term $b$ will be @@ -474,11 +474,11 @@ \section{Optimization with Gradient Descent} % \begin{align} \grad_\vw \cL -&= \grad_\vw \sum_n \exp\big[-y_n (\dotp{\vec w}{\vx_n}+b)\big] +&= \grad_\vw \sum_n \exp\big[-y_n (\dotp{\vec w}{\vx_n}+b)\big] + \grad_\vw\frac \la 2 \norm{\vw}^2 \\ -&= \sum_n \left( \grad_\vw -y_n (\dotp{\vec w}{\vx_n}+b) \right) \exp\big[-y_n (\dotp{\vec w}{\vx_n}+b)\big] +&= \sum_n \left( \grad_\vw -y_n (\dotp{\vec w}{\vx_n}+b) \right) \exp\big[-y_n (\dotp{\vec w}{\vx_n}+b)\big] + \la \vw \\ -&= -\sum_n y_n \vx_n \exp\big[-y_n (\dotp{\vec w}{\vx_n}+b)\big] +&= -\sum_n y_n \vx_n \exp\big[-y_n (\dotp{\vec w}{\vx_n}+b)\big] + \la \vw \end{align} % @@ -561,7 +561,7 @@ \section{Optimization with Gradient Descent} + \frac 1 2 f''(a) (z-a)^2 \end{align} The ``\dots'' is guaranteed to be non-negative because $f$ is - convex. + convex. 
\end{myproof}
\end{comment}

@@ -707,7 +707,7 @@ \section{Closed-form Optimization for Squared Loss}
%
\begin{equation}
a_n
-= \left[ \mat X \vec w \right]_n 
+= \left[ \mat X \vec w \right]_n
= \sum_d \mat X_{n,d} w_d
\end{equation}
%
@@ -739,7 +739,7 @@ \section{Closed-form Optimization for Squared Loss}
\left[
\begin{array}{c}
w_{1} \\
-w_{2} \\ 
+w_{2} \\
\vdots \\
w_{D}
\end{array}
@@ -762,7 +762,7 @@ \section{Closed-form Optimization for Squared Loss}
\left[
\begin{array}{c}
y_{1} \\
-y_{2} \\ 
+y_{2} \\
\vdots \\
y_{N}
\end{array}
@@ -841,7 +841,7 @@ \section{Support Vector Machines} \label{sec:loss:svm}
At the beginning of this chapter, you may have looked at the convex
surrogate loss functions and asked yourself: where did these come
from?! They are all derived from different underlying principles,
-which essentially correspond to different inductive biases. 
+which essentially correspond to different inductive biases.

\Figure{loss:geom}{picture of data points with three hyperplanes,
  RGB with G the best}

@@ -932,7 +932,7 @@ \section{Support Vector Machines} \label{sec:loss:svm}
You will have to pay a little bit to do so, but as long as you aren't
moving a \emph{lot} of points around, it should be a good idea to do
this. In this picture, the amount that you move the point is denoted
-$\xi$ (xi).
+$\xi$ (the Greek letter ``xi'').
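The optimal slack $\xi_n$ for an example is exactly the hinge loss of its current prediction, $\max(0,\, 1 - y_n(\dotp{\vw}{\vx_n}+b))$: zero for points comfortably beyond the margin, positive for points inside the margin or misclassified. A small pure-Python sketch (the hyperplane and points are invented):

```python
# Slack as hinge loss: max(0, 1 - y * score) for score = w.x + b.
def hinge(y, score):
    return max(0.0, 1.0 - y * score)

w, b = [1.0, -1.0], 0.0                  # invented hyperplane
examples = [([2.0, 0.0], +1),            # score +2.0: beyond the margin, no slack
            ([0.5, 0.0], +1),            # score +0.5: inside the margin, some slack
            ([0.0, 1.0], +1)]            # score -1.0: misclassified, lots of slack

for x, y in examples:
    score = sum(wi * xi for wi, xi in zip(w, x)) + b
    print(y, score, hinge(y, score))
```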
By introducing one slack parameter for each training example, and
penalizing yourself for having to use slack, you can create an
@@ -1007,7 +1007,7 @@ \section{Support Vector Machines} \label{sec:loss:svm}
%
\begin{align}
\ga
-&= \frac 1 2 \left[ \textcolor{darkblue}{d^+} 
+&= \frac 1 2 \left[ \textcolor{darkblue}{d^+}
  - \textcolor{darkred}{d^-}
\right] \\
&= \frac 1 2 \left[
@@ -1085,7 +1085,8 @@ \section{Support Vector Machines} \label{sec:loss:svm}
+ \underbrace{C \sum_n \ell\xth{hin}(y_n, \dotp{\vw}{\vx_n}+b)}_{\text{small slack}}}
%
-Multiplying this objective through by $\la/C$, we obtain exactly the
+Substituting $C = 1/(2\lambda)$ and multiplying this objective through by
+$2\lambda$, we obtain exactly the
regularized objective from Eq~\eqref{opt:loss:reg} with hinge loss as
the loss function and the $2$-norm as the regularizer!

@@ -1107,7 +1108,7 @@ \section{Further Reading}
\end{comment}

-%%% Local Variables: 
+%%% Local Variables:
%%% mode: latex
%%% TeX-master: "courseml"
-%%% End: 
+%%% End:
diff --git a/book/nnet.tex b/book/nnet.tex
index be31996..0000097 100644
--- a/book/nnet.tex
+++ b/book/nnet.tex
@@ -144,7 +144,7 @@ \section{Bio-inspired Multi-Layer Networks}
construct a very small two-layer network for solving the XOR problem.
For simplicity, suppose that the data set consists of four data
points, given in Table~\ref{tab:nnet:xor}. The classification rule is
-that $y=+1$ if an only if $x_1=x_2$, where the features are just $\pm
+that $y=+1$ if and only if $x_1=x_2$, where the features are just $\pm
1$.

You can solve this problem using a two layer network with two hidden
@@ -215,7 +215,8 @@ \section{Bio-inspired Multi-Layer Networks}
complex your function will be. Lots of hidden units $\Rightarrow$ very
complicated function.
%Figure~\ref{fig:nnet:units} shows training
%and test error for neural networks trained with different numbers of
-%hidden units. 
+%hidden units.
+As the number increases, training performance continues to get better.
But at some point, test performance gets worse because the network has overfit the data. @@ -252,7 +252,7 @@ \section{The Back-propagation Algorithm} objective is: % \optimizeuc{nnet:twolayer}{\mat W,\vec v}{% - \sum_n \frac 1 2 \left( \textcolor{darkblue}{y_n - + \sum_n \frac 1 2 \left( \textcolor{darkblue}{y_n - \sum_i v_i f(\dotp{\vec w_i}{\vx_n})} \right)^2} % @@ -273,7 +273,7 @@ \section{The Back-propagation Algorithm} example. Then: % \begin{equation} -\grad_{\vec v} = -\sum_n e_n \vec h_n +\grad_{\vec v} = -\sum_n e_n \vec h_n \end{equation} % This is exactly like the linear case. One way of interpreting this @@ -293,8 +293,8 @@ \section{The Back-propagation Algorithm} over data points, we can compute: % \begin{align} -\cL(\mat W) &= -\frac 1 2 \left( \textcolor{darkblue}{y - +\cL(\mat W) &= +\frac 1 2 \left( \textcolor{darkblue}{y - \sum_i v_i f(\dotp{\vec w_i}{\vx})} \right)^2 \\ @@ -469,7 +469,7 @@ \section{Initialization and Convergence of Neural Networks} can often interpret high weights as indicative of positive examples and low weights as indicative of negative examples. In multilayer networks, it becomes very difficult to try to understand what the -different hidden units are doing. +different hidden units are doing. %TODO: maybe have something about doing images? @@ -734,7 +734,7 @@ \section{Further Reading} \end{comment} -%%% Local Variables: +%%% Local Variables: %%% mode: latex %%% TeX-master: "courseml" -%%% End: +%%% End: diff --git a/book/opt.tex b/book/opt.tex index 420838c..bbc986d 100644 --- a/book/opt.tex +++ b/book/opt.tex @@ -52,7 +52,7 @@ \section{What Does it Mean to be Fast?} This is largely because it is currently not a well-understood area in machine learning. There are many aspects of parallelism that come into play, such as the speed of communication across the network, -whether you have shared memory, etc. Right now, this the general, +whether you have shared memory, etc. 
Right now, the general,
poor-man's approach to parallelization, is to employ ensembles.

\section{Stochastic Optimization}

@@ -89,13 +89,13 @@ \section{Stochastic Optimization}
%
\begin{align}
\vw^*
-&= \arg\max_{\vw} \sum_n \ell(y_n, \dotp{\vw}{\vx_n}) + R(\vw) 
+&= \arg\min_{\vw} \sum_n \ell(y_n, \dotp{\vw}{\vx_n}) + R(\vw)
\becauseof{definition}\\
-&= \arg\max_{\vw} \sum_n \left[ \ell(y_n, \dotp{\vw}{\vx_n}) + \frac 1 N R(\vw) \right] 
+&= \arg\min_{\vw} \sum_n \left[ \ell(y_n, \dotp{\vw}{\vx_n}) + \frac 1 N R(\vw) \right]
\becauseof{move $R$ inside sum}\\
-&= \arg\max_{\vw} \sum_n \left[ \frac 1 N \ell(y_n, \dotp{\vw}{\vx_n}) + \frac 1 {N^2} R(\vw) \right] 
+&= \arg\min_{\vw} \sum_n \left[ \frac 1 N \ell(y_n, \dotp{\vw}{\vx_n}) + \frac 1 {N^2} R(\vw) \right]
\becauseof{divide through by $N$}\\
-&= \arg\max_{\vw} \Ep_{(y,\vx) \sim D} \left[ \ell(y, \dotp{\vw}{\vx}) + \frac 1 N R(\vw) \right] 
+&= \arg\min_{\vw} \Ep_{(y,\vx) \sim D} \left[ \ell(y, \dotp{\vw}{\vx}) + \frac 1 N R(\vw) \right]
\becauseof{write as expectation}\\
& \text{where } D \text{ is the training data distribution}
\end{align}
@@ -122,7 +122,7 @@ \section{Stochastic Optimization}
\optimizeuc{opt:stochastic}{\vz}{ \Ep_{\ze}[ \cF(\vz, \ze) ]}
%
In the example, $\ze$ denotes the random choice of examples over the
-dataset, $\vz$ denotes the weight vector and $\cF(\vw,\ze)$ denotes
+dataset, $\vz$ denotes the weight vector and $\cF(\vz,\ze)$ denotes
the loss on that example \emph{plus} a fraction of the regularizer.

Stochastic optimization problems are formally \emph{harder} than
@@ -162,7 +162,7 @@ \section{Stochastic Optimization}
guaranteed for learning rates of the form: $\eta\kth = \frac {\eta_0}
{\sqrt{k}}$, where $\eta_0$ is a fixed, initial step size, typically
$0.01$, $0.1$ or $1$ depending on how quickly you expect the algorithm
-to converge. Unfortunately, in comparisong to gradient descent,
+to converge. Unfortunately, in comparison to gradient descent,
stochastic gradient is quite sensitive to the selection of a good
learning rate.

@@ -179,7 +179,7 @@ \section{Stochastic Optimization}
misses. However, if your data is too large for memory and resides on
a magnetic disk that has a slow seek time, randomly seeking to new
-data points for each example is prohibitivly slow, and you will likely
-need to forgo permuting the data. The speed hit in convergence speed
+data points for each example is prohibitively slow, and you will likely
+need to forgo permuting the data. The hit in convergence speed
will almost certainly be recovered by the speed gain in not having to
seek on disk routinely. (Note that the story is very different for
solid state disks, on which random accesses really are quite
@@ -274,7 +274,7 @@ \section{Sparse Regularization}
% alternative regularizer is the KL-regularizer. The motivation for the
% KL-regularizer differs slightly from the previous notion of sparsity.
% Instead of ``hoping'' that most of the weights go to zero, you instead
-% hope that most of the weights tend toward some positive constant $p$. 
+% hope that most of the weights tend toward some positive constant $p$.
% The entropy
% regularizer only works when the weights of your model are
% non-negative, though this is not a big problem in practice: you can
@@ -323,7 +323,7 @@ \section{Feature Hashing}
%
\begin{align}
\phi(\vx)_p
-  &= \sum_d [h(d) = p] x_d 
+  &= \sum_d [h(d) = p] x_d
\quad= \sum_{d \in h\inv(p)} x_d
\end{align}
%
@@ -343,7 +343,7 @@ \section{Feature Hashing}
Consider the kernel defined by this hash mapping.
Namely: % \begin{align} - K\xth{hash}(\vx, \vz) + K\xth{hash}(\vx, \vz) &= \dotp{\phi(\vx)}{\phi(\vz)} \\ &= @@ -387,7 +387,7 @@ \section{Further Reading} \end{comment} -%%% Local Variables: +%%% Local Variables: %%% mode: latex %%% TeX-master: "courseml" -%%% End: +%%% End: diff --git a/book/perc.tex b/book/perc.tex index 0f85682..e92155f 100644 --- a/book/perc.tex +++ b/book/perc.tex @@ -47,7 +47,7 @@ \section{Bio-inspired Learning} \concept{activations}). Based on how much these incoming neurons are firing, and how ``strong'' the neural connections are, our main neuron will ``decide'' how strongly it wants to fire. And so on through the -whole brain. Learning in the brain happens by neurons becomming +whole brain. Learning in the brain happens by neurons becoming connected to other neurons, and the strengths of connections adapting over time. @@ -60,7 +60,7 @@ \section{Bio-inspired Learning} learning algorithm as a \emph{single} neuron. It receives input from $D$-many other neurons, one for each input feature. The strength of these inputs are the feature values. This is shown schematically in -Figure~\ref{fig:perc:neuron}. Each incoming connection has a weight +Figure~\ref{fig:perc:example}. Each incoming connection has a weight and the neuron simply sums up all the weighted inputs. Based on this sum, it decides whether to ``fire'' or not. Firing is interpreted as being a positive example and not firing is interpreted as being a @@ -74,17 +74,17 @@ \section{Bio-inspired Learning} \begin{equation} \label{eq:perc:sum} a = \sum_{d=1}^D w_d x_d \end{equation} -to determine it's amount of ``activation.'' If this activiation is +to determine its amount of ``activation.'' If this activation is positive (i.e., $a > 0$) it predicts that this example is a positive example. Otherwise it predicts a negative example. The weights of this neuron are fairly easy to interpret. 
Suppose that -a feature, for instance ``is this a System's class?'' gets a zero +a feature, for instance ``is this a Systems class?'' gets a zero weight. Then the activation is the same regardless of the value of this feature. So features with zero weight are ignored. Features with positive weights are indicative of positive examples because they cause the activation to increase. Features with negative weights are -indicative of negative examples because they cause the activiation to +indicative of negative examples because they cause the activation to decrease. \thinkaboutit{What would happen if we encoded binary features like @@ -143,7 +143,7 @@ \section{Error-Driven Updating: The Perceptron Algorithm} \COMMENT{initialize bias} \FOR{\VAR{iter} = \CON{1} \dots \VAR{MaxIter}} \FORALL{(\VAR{$\vx$},\VAR{$y$}) $\in$ \VAR{$\mat D$}} -\SETST{$a$}{$\sum_{\VAR{d}=\CON{1}}^{\VAR{D}}$ \VAR{$w_d$} \VAR{$x_d$} + \VAR{$b$}} +\SETST{$a$}{$\sum_{\VAR{d}=\CON{1}}^{\VAR{D}}$ \VAR{$w_d$} \VAR{$x_d$} + \VAR{$b$}} \COMMENT{compute activation for this example} \IF{\VAR{$y$}\VAR{$a$} $\leq \CON{0}$} \SETST{$w_d$}{\VAR{$w_d$} + \VAR{$y$}\VAR{$x_d$}, \text{for all } \VAR{$d$} = $\CON{1} \dots \VAR{D}$} @@ -274,7 +274,7 @@ \section{Geometric Intrepretation} % \vspace{-3em} % \includegraphics[width=2in]{figs/perc_dotprojection}. %\end{wrapfigure} - large and negative when $\vec u$ and $\vec v$ point in opposite directions, and is zero when their are perpendicular. + large and negative when $\vec u$ and $\vec v$ point in opposite directions, and is zero when they are perpendicular. A useful geometric interpretation of dot products is \concept{projection}. Suppose $\norm{\vec u} = 1$, so that $\vec u$ is a \concept{unit vector}. We can think of any other vector $\vec v$ as consisting of two components: (a) a component in the direction of $\vec u$ and (b) a component that's perpendicular to $\vec u$. 
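The perceptron pseudocode above fits in a few lines of code; this is a minimal Python sketch (not the book's implementation), run on an invented, linearly separable data set. It computes the activation $a = \sum_d w_d x_d + b$ and updates $w \leftarrow w + y\vx$, $b \leftarrow b + y$ only on mistakes.

```python
# Minimal perceptron: update only when y*a <= 0 (a mistake, or a point
# sitting exactly on the decision boundary).
def perceptron_train(data, max_iter=100):
    D = len(data[0][0])
    w, b = [0.0] * D, 0.0
    for _ in range(max_iter):
        mistakes = 0
        for x, y in data:
            a = sum(wd * xd for wd, xd in zip(w, x)) + b
            if y * a <= 0:
                w = [wd + y * xd for wd, xd in zip(w, x)]
                b += y
                mistakes += 1
        if mistakes == 0:            # a full error-free pass: converged
            break
    return w, b

# Invented, linearly separable data (positive iff x1 + x2 > 0, roughly).
data = [([-1.0, -1.0], -1), ([1.0, 1.0], +1),
        ([-2.0, 0.0], -1), ([2.0, 0.5], +1)]
w, b = perceptron_train(data)
print(w, b)   # a separating hyperplane for this data
```

Because the data is separable with a large margin, the convergence theorem in this chapter guarantees the loop halts after a small number of mistakes.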
@@ -312,7 +312,7 @@ \section{Geometric Intrepretation} \Figure{perc:geom}{picture of data points with hyperplane and weight vector} This is shown pictorially in Figure~\ref{fig:perc:geom}. Here, the -weight vector is shown, together with it's perpendicular plane. This +weight vector is shown, together with its perpendicular plane. This plane forms the decision boundary between positive points and negative points. The vector points in the direction of the positive examples and away from the negative examples. @@ -467,7 +467,7 @@ \section{Perceptron Convergence and Linear Separability} ``easy'' and ``hard'' in a meaningful way. One way to make this definition is through the notion of \concept{margin}. If I give you a data set and hyperplane that separates it% (like that shown in -%Figure~\ref{fig:perc:margin}) +%Figure~\ref{fig:perc:margin}) then the \emph{margin} is the distance between the hyperplane and the nearest point. Intuitively, problems with large margins should be easy (there's lots of ``wiggle room'' to @@ -499,7 +499,7 @@ \section{Perceptron Convergence and Linear Separability} set is the largest attainable margin on this data. Formally: \begin{equation} \label{eq:margin2} \textit{margin}(\mat D) -= += \sup_{\vw,b} \textit{margin}(\mat D, \vw, b) \end{equation} In words, to compute the margin of a data set, you ``try'' every @@ -555,7 +555,7 @@ \section{Perceptron Convergence and Linear Separability} $\vw\kth = \vw\kpth + y \vx$. We do a little computation: \begin{align} \dotp{\vw^*}{\vw\kth} - &= \dotp{\vw^*}{\left(\vw\kpth + y \vx\right)} + &= \dotp{\vw^*}{\left(\vw\kpth + y \vx\right)} \becauseof{definition of $\vw\kth$} \\ &= \dotp{\vw^*}{\vw\kpth} + y \dotp{\vw^*}{\vx} @@ -586,11 +586,11 @@ \section{Perceptron Convergence and Linear Separability} Now we put together the two things we have learned before. By our first conclusion, we know $\dotp{\vw^*}{\vw\kth} \geq k \ga$. But - our second conclusion, $\sqrt{k} \geq \norm{\vw\kth}^2$. 
Finally, + our second conclusion, $\sqrt{k} \geq \norm{\vw\kth}$. Finally, because $\vw^*$ is a unit vector, we know that $\norm{\vw\kth} \geq \dotp{\vw^*}{\vw\kth}$. Putting this together, we have: \begin{equation} - \sqrt k + \sqrt k \quad\geq\quad \norm{\vw\kth} \quad\geq\quad @@ -662,7 +662,7 @@ \section{Improved Generalization: Voting and Averaging} the prediction on a test point is: \begin{equation} \label{eq:perc:vote} \hat y = \sign \left( - \sum_{k=1}^K c\kth + \sum_{k=1}^K c\kth \sign \left( \dotp{\vw\kth}{\hat\vx} + b\kth \right) @@ -690,7 +690,7 @@ \section{Improved Generalization: Voting and Averaging} voting. In particular, the prediction is: \begin{equation} \label{eq:perc:avg} \hat y = \sign \left( - \sum_{k=1}^K c\kth + \sum_{k=1}^K c\kth \left( \dotp{\vw\kth}{\hat\vx} + b\kth \right) @@ -704,7 +704,7 @@ \section{Improved Generalization: Voting and Averaging} \begin{equation} \hat y = \sign \left( \dotp{\textcolor{darkblue}{\left( - \sum_{k=1}^K c\kth \vw\kth + \sum_{k=1}^K c\kth \vw\kth \right)}}{\hat\vx} + \textcolor{darkred}{\sum_{k=1}^K c\kth b\kth} \right) @@ -744,7 +744,7 @@ \section{Improved Generalization: Voting and Averaging} \COMMENT{increment counter regardless of update} \ENDFOR \ENDFOR -\RETURN \VAR{$\vw$} - $\frac 1 {\VAR{c}}$ \VAR{$\vec u$}, +\RETURN \VAR{$\vw$} - $\frac 1 {\VAR{c}}$ \VAR{$\vec u$}, \VAR{$b$} - $\frac 1 {\VAR{c}}$ \VAR{$\beta$} \COMMENT{return averaged weights and bias} } @@ -780,7 +780,7 @@ \section{Improved Generalization: Voting and Averaging} The averaged perceptron is almost always better than the perceptron, in the sense that it generalizes better to test data. However, that does not free you from having to do \concept{early stopping}. It -will, eventually, overfit. +will, eventually, overfit. %Figure~\ref{fig:perc:avgperc} shows the %performance of the vanilla perceptron and the averaged perceptron on %the same data set, with both training and test performance. 
As you
@@ -820,7 +820,7 @@ \section{Limitations of the Perceptron}
-w_{\feat{execellent}} &= +1
+w_{\feat{excellent}} &= +1
& w_{\feat{terrible}} &= -1
& w_{\feat{not}} &= 0 \\
-w_{\feat{execllent-and-not}} &= -2 
+w_{\feat{excellent-and-not}} &= -2
& w_{\feat{terrible-and-not}} &= +2
\end{align*}
In this particular case, we have addressed the problem. However, if
@@ -868,7 +868,7 @@ \section{Limitations of the Perceptron}
% this data.}

% In practice, there is little reason to \emph{actually} perform this
-% mapping. It is more interesting from a theoretical perspective. 
+% mapping. It is more interesting from a theoretical perspective.

\section{Further Reading}

@@ -886,8 +886,7 @@ \section{Further Reading}
\end{comment}

-%%% Local Variables: 
+%%% Local Variables:
%%% mode: latex
%%% TeX-master: "courseml"
-%%% End: 
-
+%%% End:
diff --git a/book/prac.tex b/book/prac.tex
index b83d4ba..9076e49 100644
--- a/book/prac.tex
+++ b/book/prac.tex
@@ -53,6 +53,8 @@ \section{The Importance of Good Features}
you give it is trash, the learning algorithm is unlikely to be able
to overcome it.

+\Figure{prac_imagepix}{object recognition in pixels}
+
Consider a problem of object recognition from images. If you start
with a $100 \times 100$ pixel image, a very easy feature
representation of this image is as a $30,000$ dimensional vector,
@@ -61,8 +63,6 @@ \section{The Importance of Good Features}
in pixel $(1,1)$; feature 2 is the amount of green in that pixel; and
so on. This is the \concept{pixel representation} of images.

-\Figure{prac_imagepix}{object recognition in pixels}
-
One thing to keep in mind is that the pixel representation
\emph{throws away} all locality information in the image.
Learning algorithms don't care about features: they only care about feature @@ -98,7 +98,7 @@ \section{The Importance of Good Features} } - + \TODOFigure{prac:bow}{BOW repr of one positive and one negative review} In the context of \concept{text categorization} (for instance, the @@ -120,7 +120,7 @@ \section{Irrelevant and Redundant Features} prediction task. A feature $f$ whose expectation does not depend on the label $\Ep[f \| Y] = \Ep[f]$ might be irrelevant. For instance, the presence of the word ``the'' might be largely irrelevant for -predicting whether a course review is positive or negative. +predicting whether a course review is positive or negative. A secondary issue is how well these algorithms deal with \concept{redundant features}. Two features are redundant if they are @@ -141,7 +141,7 @@ \section{Irrelevant and Redundant Features} great features and $1$ bad feature. The interesting case is when the bad features outnumber the good features, and often outnumber by a large degree. %For instance, perhaps the number of good features is -%something like $\log D$ out of a set of $D$ total features. +%something like $\log D$ out of a set of $D$ total features. The question is how robust are algorithms in this case.\sidenote{You might think it's absurd to have so many irrelevant features, but the cases @@ -197,7 +197,7 @@ \section{Irrelevant and Redundant Features} Unfortunately, the situation is actually worse than this. In the above analysis we only considered the case of \emph{perfect} correlation. We could also consider the case of \emph{partial} -correlation, which would yield even higher probabilities. +correlation, which would yield even higher probabilities. Suffice it to say that even decision trees can become confused. @@ -232,7 +232,7 @@ \section{Irrelevant and Redundant Features} \section{Feature Pruning and Normalization} In text categorization problems, some words simply do not appear very -often. 
Perhaps the word ``groovy''\sidenote{This is typically +often. Perhaps the word ``groovy''\sidenote{This is typically a positive indicator, or at least it was back in the US in the 1970s.} appears in exactly one training document, which is positive. Is it really worth keeping this word around as a feature? It's a dangerous @@ -273,13 +273,13 @@ \section{Feature Pruning and Normalization} Suppose we have $N$ real valued numbers $z_1, z_2, \dots, z_N$. The \concept{sample mean} (or just \concept{mean}) of these numbers is just their average value, or expected value: $\mu = \frac 1 N \sum_n z_n$. The \concept{sample variance} (or just \concept{variance}) measures how much they vary around their mean: $\si^2 = \frac 1 {N-1} \sum_n (z_n - \mu)^2$, where $\mu$ is the sample mean. - + ~\\ The mean and variance have convenient interpretations in terms of prediction. Suppose we wanted to choose a single constant value to ``predict'' the next $z$, and were minimizing squared error. Call this constant value $a$. - Then $a = \argmin_{a \in \R} \frac 1 2 \sum_n (a - z_n)^2$. + Then $a = \argmin_{a \in \R} \frac 1 2 \sum_n (a - z_n)^2$. (Here, the $\frac 1 2$ is for convenience and does not change the answer.) To solve for $a$, we can take derivatives and set to zero: $\frac \partial {\partial a} \frac 1 2 \sum_n (a - z_n)^2 @@ -502,13 +502,14 @@ \section{Evaluating Model Performance} than a true distinction between X and Y. (Can you spot the spam? can you spot the relevant documents?) + For spotting problems (X versus not-X), there are often more appropriate success metrics than accuracy. A very popular one from information retrieval is the \concept{precision}/\concept{recall} metric. Precision asks the question: of all the X's that you found, how many of them were actually X's? 
Recall asks: of all the X's that were out there, how many of them
did you find?\sidenote{A colleague
-  make the analogy to the US court system's saying ``Do you promise to
+  made the analogy to the US court system's saying ``Do you promise to
  tell the whole truth and nothing but the truth?'' In this case, the
  ``whole truth'' means high recall and ``nothing but the truth''
-  means high precision.''} Formally, precision and recall are defined
+  means high precision.} Formally, precision and recall are defined
@@ -526,8 +527,6 @@ \section{Evaluating Model Performance}
nothing, your precision is always perfect; and if there is nothing to
find, your recall is always perfect.

-\TODOFigure{prac:spam}{show a bunch of emails spam/nospam sorted by model predicion, not perfect}
-
Once you can compute precision and recall, you are often able to
produce \concept{precision/recall curves}. Suppose that you are
attempting to identify spam. You run a learning algorithm to make
@@ -541,6 +540,8 @@ \section{Evaluating Model Performance}
\thinkaboutit{How would you get a confidence out of a decision tree or
  KNN?}

+\TODOFigure{prac:spam}{show a bunch of emails spam/nospam sorted by model prediction, not perfect}
+
\TODOFigure{prac:prcurve}{precision recall curve}

Once you have this sorted list, you can choose how aggressively you
@@ -551,7 +552,7 @@ \section{Evaluating Model Performance}
By considering \emph{every possible} place you could put this
threshold, you can trace out a curve of precision/recall values, like
the one in Figure~\ref{fig:prac:prcurve}. This allows us to ask the
-question: for some fixed precision, what sort of recall can I get.
+question: for some fixed precision, what sort of recall can I get?
Obviously, the closer your curve is to the upper-right corner, the
better.
And when comparing learning algorithms A and B you can say that A \concept{dominates} B if A's precision/recall curve is always @@ -628,7 +629,7 @@ \section{Evaluating Model Performance} curve, you can compute the \concept{area under the curve} (or \concept{AUC}) metric, which also provides a meaningful single number for a system's performance. Unlike f-measures, which tend to be low -because the require agreement, AUC scores tend to be very high, even +because they require agreement, AUC scores tend to be very high, even for not great systems. This is because random chance will give you an AUC of $0.5$ and the best possible AUC is $1.0$. @@ -698,7 +699,7 @@ \section{Cross Validation} as your final model to make predictions with, or you can train a \emph{new} model on all of the data, using the hyperparameters selected by cross-validation. If you have the time, the latter is -probably a better options. +probably a better option. \newalgorithm{prac:loo}% {\FUN{KNN-Train-LOO}(\VAR{$\mat D$})} @@ -827,12 +828,12 @@ \section{Hypothesis Testing and Statistical Significance} $\geq 2.58$ & $99.5\%$ } -Suppose that you evaluate two algorithm on $N$-many examples. On each +Suppose that you evaluate two algorithms on $N$-many examples. On each example, you can compute whether the algorithm made the correct prediction. Let $a_1, \dots, a_N$ denote the error of the first algorithm on each example. Let $b_1, \dots, b_N$ denote the error of the second algorithm. You can compute $\mu_a$ and $\mu_b$ as the -means of $\vec a$ and $\vec b$, respecitively. Finally, center the +means of $\vec a$ and $\vec b$, respectively. Finally, center the data as $\hat a = \vec a - \mu_a$ and $\hat b = \vec b - \mu_b$. The t-statistic is defined as: \begin{equation} \label{eq:prac:t} @@ -892,7 +893,7 @@ \section{Hypothesis Testing and Statistical Significance} expensive. More folds typically leads to better estimates, but every new fold requires training a new classifier. 
This can get very time consuming. The technique of \concept{bootstrapping} (and closely -related idea of \concept{jack-knifing} can address this problem. +related idea of \concept{jack-knifing}) can address this problem. Suppose that you didn't want to run cross validation. All you have is a single held-out test set with $1000$ data points in it. You can run @@ -967,7 +968,7 @@ \section{Debugging Learning Algorithms} \textbf{Do you have train/test mismatch?} If you can fit the training data, but it doesn't generalize, it could be because there's something different about your test data. Try shuffling your training data and test data together and then randomly selecting a new test set. If you do well in that condition, then probably the test distribution is strange in some way. If reselecting the test data doesn't help, you have other generalization problems. -\textbf{Is your learning algorithm implemented correctly?} This often means: is it optimizing what you think it's optimizing. Instead of measuring accuracy, try measuring whatever-quantity-your-algorithm-is-supposedly-optimizing (like log loss or hinge loss) and make sure that the optimizer is successfully minimizing this quantity. It is usually useful to hand-craft some datasets on which you know the desired behavior. +\textbf{Is your learning algorithm implemented correctly?} This often means: is it optimizing what you think it's optimizing. Instead of measuring accuracy, try measuring whatever-quantity-your-algorithm-is-supposedly-optimizing (like log loss or hinge loss) and make sure that the optimizer is successfully minimizing this quantity. It is usually useful to hand-craft some datasets on which you know the desired behavior. For instance, you could run KNN on the XOR data. Or you could run perceptron on some easily linearly separable data (for instance positive points along the line $x_2 = x_1 + 1$ and negative @@ -975,7 +976,7 @@ \section{Debugging Learning Algorithms} axis-aligned data. 
Finally, can you compare against a reference implementation?

-\textbf{Do you have an adequate representation?} 
+\textbf{Do you have an adequate representation?}
If you cannot even fit the training data, you might not have a rich enough
feature set. The easiest way to try to get a learning algorithm to
overfit is to add a new feature to it. You can call this feature the
@@ -1003,10 +1004,10 @@ \section{Bias/Variance Trade-off}
~
\begin{align}
\text{error}(f)
-&= 
+&=
\underbrace{\left[ \text{error}(f) - \min_{f^* \in \cF} \text{error}(f^*) \right]}_{\text{\concept{estimation error}}}
+
-\underbrace{\left[ \min_{f^* \in \cF} \text{error}(f) \right]}_{\text{\concept{approximation error}}}
+\underbrace{\left[ \min_{f^* \in \cF} \text{error}(f^*) \right]}_{\text{\concept{approximation error}}}
\end{align}
~
-Here, the second term, the \concept{approximation error}, measures the quality of the model family\footnote{The ``model family'' (such as depth 20 decision trees, or linear classifiers) is often refered to as the \concept{hypothesis class}. The hypothesis class $\cF$ denotes the set of all possible classifiers we consider, such as all linear classifiers. An classifier $f \in \cF$ is sometimes called a \concept{hypothesis}, though we generally avoid this latter terminology here.}. One way of thinking of approximation error is: suppose someone gave me infinite data to train on---how well could I do with this representation?
+Here, the second term, the \concept{approximation error}, measures the quality of the model family\footnote{The ``model family'' (such as depth 20 decision trees, or linear classifiers) is often referred to as the \concept{hypothesis class}. The hypothesis class $\cF$ denotes the set of all possible classifiers we consider, such as all linear classifiers. A classifier $f \in \cF$ is sometimes called a \concept{hypothesis}, though we generally avoid this latter terminology here.}. One way of thinking of approximation error is: suppose someone gave me infinite data to train on---how well could I do with this representation?

@@ -1061,7 +1062,7 @@ \section{Further Reading}
% and performance drops.

% One can take this idea further. Instead of sequences of characters,
-% we can consider sequences of \emph{bits}. 
% \thinkaboutit{Can you think of a text categorization problem where bag % of characters might actually be a reasonable representation?} @@ -1094,7 +1095,7 @@ \section{Further Reading} %TODO: answers to image questions -%%% Local Variables: +%%% Local Variables: %%% mode: latex %%% TeX-master: "courseml" -%%% End: +%%% End: diff --git a/book/prob.tex b/book/prob.tex index 6f4fce1..e7d0ca7 100644 --- a/book/prob.tex +++ b/book/prob.tex @@ -182,7 +182,7 @@ \section{Statistical Estimation} % \begin{align} \label{eq:prob:binomial} p_\be(D) - &= p_\be(\texttt{HHTH}) + &= p_\be(\texttt{HHTH}) \becauseof{definition of $D$} \\ &= p_\be(\texttt{H}) p_\be(\texttt{H}) @@ -304,16 +304,16 @@ \section{Naive Bayes Models} of variables. You can try to simplify it by applying the \concept{chain rule} of probabilities: % -\begin{align} +\begin{align} \pth( x_1, x_2, \dots, x_D, y ) - &= \pth(y) \pth(x_1 \| y) - \pth(x_2 \| y, x_1) + &= \pth(y) \pth(x_1 \| y) + \pth(x_2 \| y, x_1) \pth(x_3 \| y, x_1, x_2) \nonumber\\ - &\quad\quad \cdots + &\quad\quad \cdots \pth(x_D \| y, x_1, x_2, \dots, x_{D-1}) \\ -&= \pth(y) \prod_d \pth(x_d \| y, x_1, \dots, x_{d-1}) -\label{eq:prob:nbchain} +&= \pth(y) \prod_d \pth(x_d \| y, x_1, \dots, x_{d-1}) +\label{eq:prob:nbchain} \end{align} % At this point, this equality is \emph{exact} for any probability @@ -339,7 +339,7 @@ \section{Naive Bayes Models} % \begin{align} \label{eq:prob:nb} \pth( (y, \vec x) ) -&= \pth(y) \prod_d \pth(x_d \| y) +&= \pth(y) \prod_d \pth(x_d \| y) \becauseof{naive Bayes assumption} \end{align} % @@ -376,7 +376,7 @@ \section{Naive Bayes Models} % \log \pth( (y, \vec x) ) % &= \textcolor{darkergreen}{[y=+1] \log \th_0 -% + [y=-1] \log (1-\th_0)} +% + [y=-1] \log (1-\th_0)} % \becauseof{take logs} % \nonumber\\ % &\quad\quad + @@ -387,7 +387,7 @@ \section{Naive Bayes Models} % \log\pth(D) % &= \sum_n \Big[ % \textcolor{darkergreen}{[y_n=+1] \log \th_0 -% + [y_n=-1] \log (1-\th_0)} +% + [y_n=-1] \log (1-\th_0)} % 
\becauseof{i.i.d. assumption} % \nonumber\\ % &\quad\quad + @@ -424,8 +424,8 @@ \section{Naive Bayes Models} There are a few common probability distributions that we use in this book. The first is the Bernouilli distribution, which models binary outcomes (like coin flips). A Bernouilli distribution, $\Ber(\th)$ is parameterized by a single scalar value $\th \in [0,1]$ that represents the probability of heads. The likelihood function is $\Ber(x \| \th) = \th^x (1-\th)^{1-x}$. - The generalization of the Bernouilli to more than two possible outcomes (like rolls of a die) is the Discrete distribution, $\Disc(\vec th)$. If the die has $K$ sides, then $\vec \th \in \R^K$ with all entries non-negative and $\sum_k \th_k = 1$. $\th_k$ is the probabability that the die comes up on side $k$. The likelihood function is $\Disc(x \| \vec\th) = \prod_k \th_k^{\Ind[x=k]}$. - The Binomial distribution is just like the Bernouilli distribution but for multiple flips of the rather than a single flip; it's likelihood is $\Bin(k \| n, \th) = n \choose k \th^k (1-\th)^{n-k}$, where $n$ is the number of flips and $k$ is the number of heads. + The generalization of the Bernoulli to more than two possible outcomes (like rolls of a die) is the Discrete distribution, $\Disc(\vec \th)$. If the die has $K$ sides, then $\vec \th \in \R^K$ with all entries non-negative and $\sum_k \th_k = 1$. $\th_k$ is the probability that the die comes up on side $k$. The likelihood function is $\Disc(x \| \vec\th) = \prod_k \th_k^{\Ind[x=k]}$. + The Binomial distribution is just like the Bernoulli distribution but for multiple flips of the coin rather than a single flip; its likelihood is $\Bin(k \| n, \th) = {n \choose k} \th^k (1-\th)^{n-k}$, where $n$ is the number of flips and $k$ is the number of heads.
The Multinomial distribution extends the Discrete distribution also to multiple rolls; its likelihood is $\Mult(\vec x \| n, \vec\th) = \frac {n!} {\prod_k x_k!} \prod_k \th_k^{x_k}$, where $n$ is the total number of rolls and $x_k$ is the number of times the die came up on side $k$ (so $\sum_k x_k = n$). The preceding distributions are all discrete. ~\\~\\ @@ -469,7 +469,7 @@ \section{Prediction} w_d &= \log \frac {\th_{(+1),d} (1-\th_{(-1),d})} {\th_{(-1),d} (1-\th_{(+1),d})} \quad,\quad b = \sum_d \log \frac {1-\th_{(+1),d}} {1-\th_{(-1),d}} - + \log \frac {\th_0} {1-\th_0} + + \log \frac {\th_0} {1-\th_0} \end{align} % The result of the algebra is that the naive Bayes model has precisely @@ -493,11 +493,11 @@ \section{Prediction} % if $p(y=+1\|\vx) > \be$ or equivalent $p(y=-1\|\vx) < 1-\be$. % % % \begin{align} -% \Ep_{(\vx,y) \sim p} \left[ \al^{[y=+1]} [ p(y\|\vx) +% \Ep_{(\vx,y) \sim p} \left[ \al^{[y=+1]} [ p(y\|\vx) % f(\vx) \neq y \right] % &= \Ep_{(\vx,y) \sim p} \left[ \al^{[y=+1]} \frac {p(y=+1,\vx)} {p(y=-1,\vx)} - + \section{Generative Stories} @@ -543,10 +543,10 @@ \section{Generative Stories} % \begin{align} \log p(D) -&= +&= \sum_n \left[ - \log \th_{y_n} + + \log \th_{y_n} + \sum_d -\frac 1 2 \log(\si_{y_n,d}^2) -\frac 1 {2 \si_{y_n,d}^2} (x_{n,d} - \mu_{y_n,d})^2 @@ -566,9 +566,9 @@ \section{Generative Stories} - \sum_n \sum_d \frac 1 {2 \si_{y_n,d}^2} (x_{n,d} - \mu_{y_n,d})^2 \becauseof{ignore irrelevant terms} \\ &= \frac {\partial} {\partial \mu_{k,i}} - - \sum_{n:y_n=k} \frac 1 {2 \si_{k,d}^2} (x_{n,i} - \mu_{k,i})^2 + - \sum_{n:y_n=k} \frac 1 {2 \si_{k,i}^2} (x_{n,i} - \mu_{k,i})^2 \becauseof{ignore irrelevant terms} \\ -&= \sum_{n:y_n=k} \frac 1 {\si_{k,d}^2} (x_{n,i} - \mu_{k,i}) +&= \sum_{n:y_n=k} \frac 1 {\si_{k,i}^2} (x_{n,i} - \mu_{k,i}) \becauseof{take derivative} \end{align} % @@ -584,7 +584,7 @@ \section{Generative Stories} \begin{align} \frac {\partial \log p(D)} {\partial \si^2_{k,i}} &= \frac {\partial} {\partial \si^2_{k,i}}
- - \sum_{y:y_n=k} \left[ \frac 1 2 \log(\si_{k,i}^2) + + - \sum_{n:y_n=k} \left[ \frac 1 2 \log(\si_{k,i}^2) + \frac 1 {2 \si_{k,i}^2} (x_{n,i} - \mu_{k,i})^2\right] \becauseof{ignore irrelevant terms} \\ &= - \sum_{n:y_n=k} \left[ @@ -773,7 +773,7 @@ \section{Regularization via Priors} Bayes' rule: % \begin{align} - \underbrace{p(\th \| D)}_{\text{posterior}} + \underbrace{p(\th \| D)}_{\text{posterior}} &= \frac { \overbrace{p(\th)}^{\text{prior}} \overbrace{p(D \| \th)}^{\text{likelihood}} @@ -829,7 +829,7 @@ \section{Further Reading} - Naive Bayes as a linear model \end{comment} -%%% Local Variables: +%%% Local Variables: %%% mode: latex %%% TeX-master: "courseml" -%%% End: +%%% End: diff --git a/book/srl.tex b/book/srl.tex index 50124db..e3e9064 100644 --- a/book/srl.tex +++ b/book/srl.tex @@ -36,7 +36,7 @@ \section{Multiclass Perceptron} % maintain a weight vector $\vw\kth$ for \emph{every} possible class $k$. % To make a prediction on example $\vx$, choose the class $k$ that maximizes $\dotp{\vx}{\vw\kth}$. % Intuitively, $\vw\kth$ should point in the same direction as examples that should have label $k$. -% At training time, a prediction $\hat y$ is made. +% At training time, a prediction $\hat y$ is made. % If it's correct, continue. % If it's incorrect, adjust the weights for the \emph{true} class, $\vw\xth{y}$ toward $\vx$ and adjust the weights for the \emph{incorrect but predicted} class, $\vw\xthm{\hat y}$ away from $\vx$.\thinkaboutit{After a single update, if the same example is presented again, will the multiclass perceptron make a correct prediction?} @@ -65,7 +65,7 @@ \section{Multiclass Perceptron} % The multiclass perceptron training algorithm is summarized in Algorithm~\ref{alg:srl:multiclassperc}. % Like the standard perceptron algorithm, the multiclass perceptron works better under an averaging strategy.
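The commented-out update rule above (predict the highest-scoring class; on a mistake, move the true class's weight vector $\vw\xth{y}$ toward $\vx$ and the incorrectly predicted class's $\vw\xthm{\hat y}$ away) can be sketched as a short program. This is an illustrative sketch only; the function name and array layout are mine, not the book's:

```python
import numpy as np

def multiclass_perceptron(X, y, K, epochs=10):
    """Multiclass perceptron: one weight vector per class.

    X: (N, D) array of examples; y: length-N labels in {0, ..., K-1}.
    """
    N, D = X.shape
    W = np.zeros((K, D))                       # w^(k) for every possible class k
    for _ in range(epochs):
        for n in range(N):
            y_hat = int(np.argmax(W @ X[n]))   # predict: argmax_k <x, w^(k)>
            if y_hat != y[n]:                  # on a mistake:
                W[y[n]] += X[n]    # move the true class's weights toward x
                W[y_hat] -= X[n]   # move the predicted class's weights away
    return W
```

As the comments above note, averaging the weights over updates tends to work better in practice; that refinement is omitted here.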
-In order to build up to structured problems, let's begin with a simplified by pedagogically useful stepping stone: multiclass classification with a perceptron. +In order to build up to structured problems, let's begin with a simplified but pedagogically useful stepping stone: multiclass classification with a perceptron. As discussed earlier, in multiclass classification we have inputs $\vx \in \R^D$ and output labels $y \in \{ 1, 2, \dots, K \}$. Our goal is to learn a scoring function $s$ so that $s(x,y) > s(x,\hat y)$ for all $\hat y \neq y$, where $y$ is the true label and $\hat y$ is an imposter label. The general form of scoring function we consider is a linear function of a joint feature vector $\phi(\vx,y)$: @@ -263,7 +263,7 @@ \section{Argmax for Sequences} However, we don't actually \emph{need} these counts. All we need for computing the $\argmax$ sequence is the dot product between the weights $\vw$ and these counts. In particular, we can compute $\dotp{\vw}{\phi(\vx,\vy)}$ as the dot product on all-but-the-last word plus the dot product on the last word: $\dotp{\vw}{\phi_{1:3}(\vx, \vy)} + \dotp{\vw}{\phi_4(\vx, \vy)}$. 
-Here, $\phi_{1:3}$ means ``features for everything up to and including position $3$'' and $\phi_{4}$ means ``features for position $4$.'' +Here, $\phi_{1:3}$ means ``features for everything up to and including position $3$'' and $\phi_{4}$ means ``features for position $4$.'' More generally, we can write $\phi(\vx,\vy) = \sum_{l=1}^L \phi_l(\vx,\vy)$, where $\phi_l(\vx,\vy)$ only includes features about position $l$.\footnote{In the case of Markov features, we think of them as pairs that \emph{end} at position $l$, so ``verb adj'' would be the active feature for $\phi_3$.} In particular, we're taking advantage of the associative law for addition: @@ -274,10 +274,10 @@ \section{Argmax for Sequences} &= \sum_{l=1}^L \dotp{\vw}{\phi_l(\vx, \vy)} \becauseof{associative law} \end{align} % -What this means is that we can build a graph like that in Figure~\ref{fig:srl:trellis}, with one verticle slice per time step ($l \ 1 \dots L$).\footnote{A graph of this sort is called a \concept{trellis}, and sometimes a \concept{lattice} in the literature.} +What this means is that we can build a graph like that in Figure~\ref{fig:srl:trellis}, with one vertical slice per time step ($l = 1 \dots L$).\footnote{A graph of this sort is called a \concept{trellis}, and sometimes a \concept{lattice} in the literature.} Each \emph{edge} in this graph will receive a weight, constructed in such a way that if you take a complete path through the lattice, and add up all the weights, this will correspond exactly to $\dotp{\vw}{\phi(\vx,\vy)}$. -\FigureFull{srl_trellis}{A picture of a trellis sequence labeling. At each time step $l$ the corresponding word can have any of the three possible labels. Any path through this trellis corresponds to a unique labeling of this sentence. The gold standard path is drawn with bold red arrows. The highlighted edge corresponds to the edge between $l=2$ and $l=3$ for verb/adj as described in the text.
That edge has weight $\dotp{\vw}{\phi_3(\vx, \dots \circ \text{verb} \circ \text{adj})}$.} +\FigureFull{srl:trellis}{A picture of a trellis sequence labeling. At each time step $l$ the corresponding word can have any of the three possible labels. Any path through this trellis corresponds to a unique labeling of this sentence. The gold standard path is drawn with bold red arrows. The highlighted edge corresponds to the edge between $l=2$ and $l=3$ for verb/adj as described in the text. That edge has weight $\dotp{\vw}{\phi_3(\vx, \dots \circ \text{verb} \circ \text{adj})}$.} To complete the construction, let $\phi_l(\vx, \dots \circ y \circ y')$ denote the \emph{unary} features at position $l$ together with the \emph{Markov} features that end at position $l$. These features depend \emph{only} on $\vx$, $y$ and $y'$, and \emph{not} any of the previous parts of the output. @@ -293,10 +293,10 @@ \section{Argmax for Sequences} A complete derivation of the dynamic program in this case is given in Section~\ref{sec:srl:dp} for those who want to implement it directly. The main benefit of this construction is that it is guaranteed to exactly compute the argmax output for sequences required in the structured perceptron algorithm, \emph{efficiently}. -In particular, it's runtime is $O(LK^2)$, which is an exponential improvement on the naive $O(K^L)$ runtime if one were to enumerate every possible output sequence. +In particular, its runtime is $O(LK^2)$, which is an exponential improvement on the naive $O(K^L)$ runtime if one were to enumerate every possible output sequence. The algorithm can be naturally extended to handle ``higher order'' Markov assumptions, where features depend on triples or quadruples of the output. The trellis becomes larger, but the algorithm remains essentially the same. -In order to handle a length $M$ Markov features, the resulting algorithm will take $O(LK^M)$ time. 
+In order to handle length $M$ Markov features, the resulting algorithm will take $O(LK^M)$ time. In practice, it's rare that $M>3$ is necessary or useful. @@ -305,7 +305,7 @@ \section{Argmax for Sequences} \section{Structured Support Vector Machines} -In Section~\ref{sec:loss:svm} we saw the support vector machine as a very useful general framework for binary classification. +In Section~\ref{sec:loss:svm} we saw the support vector machine as a very useful general framework for binary classification. In this section, we will develop a related framework for structured support vector machines. The two main advantages of structured SVMs over the structured perceptron are (1) it is regularized (though averaging in structured perceptron achieves a similar effect) and (2) we can incorporate more complex loss functions. @@ -346,7 +346,7 @@ \section{Structured Support Vector Machines} This suggests a set of constraints of the form: \begin{align} & s_\vw(\vx,\vy) - s_\vw(\vx,\hat\vy) - \geq + \geq \ell\xth{Ham}(\vy, \hat\vy) - \xi_{\hat\vy} & (\forall n, \forall \hat\vy \in \cY(\vx)) @@ -358,9 +358,9 @@ \section{Structured Support Vector Machines} + C \sum_n \sum_{\hat\vy \in \cY{\vx_n}} \xi_{n,\hat\vy} }{% s_\vw(\vx,\vy) - s_\vw(\vx,\hat\vy) \\ & -% \dotp{\vw}{\phi(\vx_n,\vy_n)} - +% \dotp{\vw}{\phi(\vx_n,\vy_n)} - % \dotp{\vw}{\phi(\vx_n,\hat\vy)} \\ & - \qquad\qquad\geq + \qquad\qquad\geq \ell\xth{Ham}(\vy_n, \hat\vy) - \xi_{n,\hat\vy}\nonumber & (\forall n, \forall \hat\vy \in \cY(\vx_n)) @@ -369,11 +369,11 @@ \section{Structured Support Vector Machines} \xi_{n,\hat\vy} \geq 0 & (\forall n, \forall \hat\vy \in \cY(\vx_n)) } % -This optimization problem asks for a large margin and small slack, where there is a slack very for every training example and every possible incorrect output associated with that training example. 
+This optimization problem asks for a large margin and small slack, where there is a slack variable for every training example and every possible incorrect output associated with that training example. In general, this is \emph{way too many} slack variables and \emph{way too many} constraints! There is a very useful, general trick we can apply. -If you focus on the first constraint, it roughly says (letting $s()$ denote score): +If you focus on the first constraint, it roughly says (letting $s()$ denote score): $s(\vy) \geq \big[ s(\hat\vy) + \ell(\vy,\hat\vy) \big]$ for all $\hat\vy$, modulo slack. We'll refer to the thing in brackets as the ``loss-augmented score.'' But if we want to guarantee that the score of the true $\vy$ beats the loss-augmented score of \emph{all} $\hat\vy$, it's enough to ensure that it beats the loss-augmented score of the most confusing imposter. @@ -383,7 +383,7 @@ \section{Structured Support Vector Machines} %\dotp{\vw}{\phi(\vx_n,\vy_n)} \geq s_\vw(\vx_n,\vy_n) \geq \max_{\hat\vy \in \cY(\vx_n)} \Big[ - %\dotp{\vw}{\phi(\vx_n,\hat\vy)} + %\dotp{\vw}{\phi(\vx_n,\hat\vy)} s_\vw(\vx_n, \hat\vy) + \ell\xth{Ham}(\vy_n, \hat\vy) \Big] - \xi_{n}\nonumber @@ -391,12 +391,12 @@ \section{Structured Support Vector Machines} \end{align} We can now apply the same trick as before to remove $\xi_n$ from the analysis. 
In particular, because $\xi_n$ is constrained to be $\geq 0$ and because we are -trying to minimize it's sum, we can figure out that out the optimum, it will be the case that: +trying to minimize its sum, we can figure out that at the optimum, it will be the case that: \begin{align} \xi_n &= - \max \left\{ 0, + \max \left\{ 0, \max_{\hat\vy \in \cY(\vx_n)} \Big[ - %\dotp{\vw}{\phi(\vx_n,\hat\vy)} + %\dotp{\vw}{\phi(\vx_n,\hat\vy)} s_\vw(\vx_n,\hat\vy) + \ell\xth{Ham}(\vy_n, \hat\vy) \Big] - %\dotp{\vw}{\phi(\vx_n,\vy_n)} @@ -406,8 +406,8 @@ \section{Structured Support Vector Machines} \end{align} This value is referred to as the \concept{structured hinge loss}, which we have denoted as $\ell\xth{s-h}(\vy_n, \vx_n, \vw)$. -This is because, although it is more complex, it bears a striking resemlance to the \concept{hinge loss} from Chapter~\ref{sec:loss}. -In particular, if the score of the true output beats the score of every the best imposter by at least its loss, then $\xi_n$ will be zero. +This is because, although it is more complex, it bears a striking resemblance to the \concept{hinge loss} from Chapter~\ref{sec:loss}. +In particular, if the score of the true output beats the score of the best imposter by at least its loss, then $\xi_n$ will be zero. On the other hand, if some imposter (plus its loss) beats the true output, the loss scales linearly as a function of the difference. At this point, there is nothing special about Hamming loss, so we will replace it with some arbitrary structured loss $\ell$.
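For a toy output space small enough to enumerate by brute force, the structured hinge loss just derived, $\max\{0, \max_{\hat\vy}[s(\hat\vy) + \ell(\vy,\hat\vy)] - s(\vy)\}$, can be computed directly. The names and setup below are invented for illustration; real structured problems need the argmax machinery discussed next:

```python
def hamming(y_true, y_hat):
    """Hamming loss: number of positions where two sequences disagree."""
    return sum(a != b for a, b in zip(y_true, y_hat))

def structured_hinge(score, y_true, outputs):
    """max(0, max_{y'} [s(y') + Ham(y, y')] - s(y)) over imposters y'.

    score: function mapping an output to its model score s_w(x, y).
    outputs: the full (tiny) output space Y(x), enumerated explicitly.
    """
    loss_augmented = max(score(y) + hamming(y_true, y)
                         for y in outputs if y != y_true)
    return max(0.0, loss_augmented - score(y_true))
```

If the true output's score beats every imposter's loss-augmented score, the value is zero, exactly as in the text; otherwise it grows linearly in the gap.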
@@ -429,9 +429,9 @@ \section{Structured Support Vector Machines} &= \grad_{\vw} \left\{ \max_{\hat\vy \in \cY(\vx_n)} \Big[ \dotp{\vw}{\phi(\vx_n,\hat\vy)} + \ell(\vy_n, \hat\vy) \Big] - \dotp{\vw}{\phi(\vx_n,\vy_n)} \right\} \\ \becauseoffull{define $\hat\vy_n$ to be the output that attains the maximum above, rearrange} \\ - &= \grad_{\vw} \Big\{ \dotp{\vw}{\phi(\vx_n,\hat\vy)} - \dotp{\vw}{\phi(\vx_n,\vy_n)} + \ell(\vy_n, \hat\vy) \Big\} \\ + &= \grad_{\vw} \Big\{ \dotp{\vw}{\phi(\vx_n,\hat\vy_n)} - \dotp{\vw}{\phi(\vx_n,\vy_n)} + \ell(\vy_n, \hat\vy_n) \Big\} \\ \becauseoffull{take gradient} \\ - &= \phi(\vx_n,\hat\vy) - \phi(\vx_n,\vy_n) + &= \phi(\vx_n,\hat\vy_n) - \phi(\vx_n,\vy_n) \end{align} Putting this together, we get the full gradient as: \begin{align} @@ -439,7 +439,7 @@ \section{Structured Support Vector Machines} &= \brack{ \vec 0 & \text{if } \ell\xth{s-h}(\vy_n, \vx_n, \vw) = 0 \\ \phi(\vx_n,\hat\vy_n) - \phi(\vx_n,\vy_n) & \text{otherwise} } \nonumber\\ \text{where } & - \hat\vy_n = + \hat\vy_n = \argmax_{\hat\vy_n \in \cY(\vx_n)} \Big[ \dotp{\vw}{\phi(\vx_n,\hat\vy_n)} + \ell(\vy_n, \hat\vy_n) \Big] \label{eq:srl:lossaug} \end{align} The form of this gradient is very simple: it is equal to the features of the worst imposter minus the features of the truth, unless the truth beats all imposters, in which case it's zero. @@ -468,7 +468,7 @@ \section{Structured Support Vector Machines} } -We will consider how to compute the loss-augmented argmax in the next section, +We will consider how to compute the loss-augmented argmax in the next section, but before that we summarize an algorithm for optimizing structured SVMs using stochastic subgradient descent: Algorithm~\ref{alg:srl:ssgdssvm}. Of course there are other possible optimization strategies; we are highlighting this one because it is nearly identical to the structured perceptron. 
The only differences are: (1) on line~\ref{eq:srl:algolossaug} you use loss-augmented argmax instead of argmax; and (2) on line~\ref{eq:srl:algoreg} the weights are shrunk slightly corresponding to the $\ell_2$ regularizer on $\vw$. (Note: we have used $\lambda = 1/(2C)$ to make the connection to linear models clearer.) @@ -477,8 +477,8 @@ \section{Loss-Augmented Argmax} The challenge that arises is that we now have a more complicated argmax problem than before. In structured perceptron, we only needed to compute $\hat\vy_n$ as the output that maximized its score (see Eq~\ref{eq:srl:argmax}). -Here, we need to find the output that maximizes it score \emph{plus} it's loss (Eq~\eqref{eq:srl:lossaug}). -This optimization problem is refered to as \concept{loss-augmented search} or \concept{loss-augmented inference}. +Here, we need to find the output that maximizes its score \emph{plus} its loss (Eq~\eqref{eq:srl:lossaug}). +This optimization problem is referred to as \concept{loss-augmented search} or \concept{loss-augmented inference}. Before solving the loss-augmented inference problem, it's worth thinking about why it makes sense. What is $\hat\vy_n$? @@ -593,8 +593,8 @@ \section{Dynamic Programming for Sequences} \label{sec:srl:dp} Before working through the details, let's consider an example. Suppose that we've computed the $\alpha$s up to $l=2$, and have: -$\alpha_{2,\text{noun}} = 2$, -$\alpha_{2,\text{verb}} = 9$, +$\alpha_{2,\text{noun}} = 2$, +$\alpha_{2,\text{verb}} = 9$, $\alpha_{2,\text{adj}} = -1$ (recall: position $l=2$ is ``eat''). We want to extend this to position $3$; for example, we want to compute $\alpha_{3,\text{adj}}$. Let's assume there's a single unary feature here, ``tasty/adj'' and three possible Markov features of the form ``?:adj''.
@@ -630,10 +630,10 @@ \section{Dynamic Programming for Sequences} \label{sec:srl:dp} \becauseoffull{separate score of prefix from score of position l+1} \\ &= \max_{\hat\vy_{1:l}} \dotp{\vw}{\Big( \phi_{1:l}(\vx, \hat\vy) + \phi_{l+1}(\vx, \hat\vy \circ k)\Big)} \\ \becauseoffull{distributive law over dot products} \\ - &= \max_{\hat\vy_{1:l}} \Big[ \dotp{\vw}{\phi_{1:l}(\vx, \hat\vy)} + &= \max_{\hat\vy_{1:l}} \Big[ \dotp{\vw}{\phi_{1:l}(\vx, \hat\vy)} + \dotp{\vw}{\phi_{l+1}(\vx, \hat\vy \circ k)} \Big] \\ \becauseoffull{separate out final label from prefix, call it k'} \\ - &= \max_{\hat\vy_{1:l-1}} \max_{k'} \Big[ \dotp{\vw}{\phi_{1:l}(\vx, \hat\vy \circ k')} + &= \max_{\hat\vy_{1:l-1}} \max_{k'} \Big[ \dotp{\vw}{\phi_{1:l}(\vx, \hat\vy \circ k')} + \dotp{\vw}{\phi_{l+1}(\vx, \hat\vy \circ k' \circ k)} \Big] \\ \becauseoffull{swap order of maxes, and last term doesn't depend on prefix} \\ &= \max_{k'} \left[ \Big[ \max_{\hat\vy_{1:l-1}} \dotp{\vw}{\phi_{1:l}(\vx, \hat\vy \circ k')} \Big] \right. \nonumber\\ @@ -641,7 +641,7 @@ \section{Dynamic Programming for Sequences} \label{sec:srl:dp} \becauseoffull{apply recursive definition} \\ &= \max_{k'} \Big[ \alpha_{l,k'} + \dotp{\vw}{\phi_{l+1}(\vx, \langle \dots, k', k \rangle)} \Big] \label{eq:srl:recursion} \\ \becauseoffull{and record a backpointer to the k' that achieves the max} \\ -\zeta_{l+1,k} &= \argmax_{k'} \Big[ \alpha_{l,k'} + \dotp{\vw}{\phi_{l+1}(\vx, \langle \dots, k', k \rangle)} \Big] +\zeta_{l+1,k} &= \argmax_{k'} \Big[ \alpha_{l,k'} + \dotp{\vw}{\phi_{l+1}(\vx, \langle \dots, k', k \rangle)} \Big] \end{align} At the end, we can take $\max_k \alpha_{L,k}$ as the score of the best output sequence. To extract the final sequence, we know that the best label for the last word is $\argmax_k \alpha_{L,k}$.
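The $\alpha$/$\zeta$ recursion and the backpointer extraction just described translate nearly line-for-line into code. Here is a minimal sketch that assumes the per-position edge scores $\dotp{\vw}{\phi_{l+1}(\vx, \langle \dots, k', k \rangle)}$ have already been precomputed into an array; that layout and the names are my assumption, not the book's:

```python
import numpy as np

def viterbi(init, edge):
    """Max-scoring label sequence via the alpha/zeta recursion.

    init: (K,) scores for the first position; edge: (L-1, K, K), where
    edge[l, kp, k] scores label kp at position l followed by k at l+1.
    Runs in O(L K^2) rather than the O(K^L) of brute-force enumeration.
    """
    L, K = edge.shape[0] + 1, init.shape[0]
    alpha = np.empty((L, K))            # alpha[l, k]: best score of a prefix ending in k
    zeta = np.zeros((L, K), dtype=int)  # zeta[l, k]: backpointer to the best k'
    alpha[0] = init
    for l in range(L - 1):
        cand = alpha[l][:, None] + edge[l]   # cand[k', k]: extend prefix ending in k' by k
        zeta[l + 1] = cand.argmax(axis=0)
        alpha[l + 1] = cand.max(axis=0)
    y = [int(alpha[L - 1].argmax())]    # best label for the last position...
    for l in range(L - 1, 0, -1):       # ...then follow the backpointers home
        y.append(int(zeta[l, y[-1]]))
    return list(reversed(y)), float(alpha[L - 1].max())
```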
@@ -695,7 +695,7 @@ \section{Dynamic Programming for Sequences} \label{sec:srl:dp} \end{align} If we define $\tilde\alpha$ to be the loss-augmented score, the corresponding recursion is (differences highlighted in blue): \begin{align} -\tilde\alpha_{l+1,k} +\tilde\alpha_{l+1,k} &= \max_{\hat\vy_{1:l}} \dotp{\vw}{\phi_{1:l+1}(\vx, \hat\vy \circ k)} \textcolor{darkblue}{+ \ell^{\textsf{(Ham)}}_{1:l+1}(\vy, \hat\vy\circ k)}\\ &= \max_{k'} \Big[ \tilde\alpha_{l,k'} + \dotp{\vw}{\phi_{l+1}(\vx, \langle \dots, k', k \rangle)} \Big] \textcolor{darkblue}{+ \Ind[k \neq \vy_{l+1}]} \label{eq:srl:laargmaxrec} @@ -719,7 +719,7 @@ \section{Further Reading} -%%% Local Variables: +%%% Local Variables: %%% mode: latex %%% TeX-master: "courseml" -%%% End: +%%% End: diff --git a/book/thy.tex b/book/thy.tex index 265d639..6a894a5 100644 --- a/book/thy.tex +++ b/book/thy.tex @@ -104,7 +104,7 @@ \section{Induction is Impossible} \cD( \langle +1 \rangle , -1 ) &= 0.1 & \cD( \langle -1 \rangle , +1 ) &= 0.1 \end{align} -In other words, $80\%$ of data points in this distrubtion have $x = y$ +In other words, $80\%$ of data points in this distribution have $x = y$ and $20\%$ don't. No matter what function your learning algorithm produces, there's no way that it can do better than $20\%$ error on this data. @@ -158,7 +158,7 @@ \section{Probably Approximately Correct Learning} back with functions $f_1$, $f_2$, $\dots$, $f_{10}$. For some reason, whenever you run $f_4$ on a test point, it crashes your computer. For the other learned functions, their performance on test data is always -at most $5\%$ error. If this situtation is guaranteed to happen, then +at most $5\%$ error. If this situation is guaranteed to happen, then this hypothetical learning algorithm is a PAC learning algorithm. 
It satisfies ``probably'' because it only failed in one out of ten cases, and it's ``approximate'' because it achieved low, but non-zero, error @@ -167,7 +167,7 @@ \section{Probably Approximately Correct Learning} This leads to the formal definition of an $(\ep,\de)$ PAC-learning algorithm. In this definition, $\ep$ plays the role of measuring accuracy (in the previous example, $\ep = 0.05$) and $\de$ plays the -role of measuring failure (in the previous, $\de = 0.1$). +role of measuring failure (in the previous, $\de = 0.1$). \begin{definition} An algorithm $\cA$ is an \textcolor{darkblue}{$(\ep,\de)$-PAC learning algorithm} if @@ -183,7 +183,7 @@ \section{Probably Approximately Correct Learning} that takes forever. The second is the notion of \concept{sample complexity}: the number of examples required for your algorithm to achieve its goals. Note that the goal of both of these measures of -complexity is to bound how much of a scarse resource +complexity is to bound how much of a scarce resource your algorithm uses. In the computational case, the resource is CPU cycles. In the sample case, the resource is labeled examples. \\ @@ -195,7 +195,7 @@ \section{Probably Approximately Correct Learning} polynomial in $\frac 1 \ep$ and $\frac 1 \de$. In other words, suppose that you want your algorithm to achieve $4\%$ -error rate rather than $5\%$. The runtime required to do so should no +error rate rather than $5\%$. The runtime required to do so should not go up by an exponential factor. \section{PAC Learning of Conjunctions} @@ -238,7 +238,7 @@ \section{PAC Learning of Conjunctions} } What is a reasonable algorithm in this case? Suppose that you observe -the example in Table~\ref{tab:thy:booleandata}. From the first +the data set in Table~\ref{tab:thy:booleandata}. From the first example, we know that the true formula cannot include the term $x_1$. If it did, this example would have to be negative, which it is not. By the same reasoning, it cannot include $x_2$.
By analogous @@ -284,7 +284,7 @@ \section{PAC Learning of Conjunctions} f^0(\vx) &= x_1 \land \lnot x_1 \land x_2 \land \lnot x_2 \land x_3 \land \lnot x_3 \land x_4 \land \lnot x_4 \\ f^1(\vx) &= \lnot x_1 \land \lnot x_2 \land x_3 \land x_4 \\ f^2(\vx) &= \lnot x_1 \land x_3 \land x_4 \\ -f^3(\vx) &= \lnot x_1 \land x_3 \land x_4 +f^3(\vx) &= \lnot x_1 \land x_3 \land x_4 \end{align} % The first thing to notice about this algorithm is that after @@ -335,7 +335,7 @@ \section{PAC Learning of Conjunctions} term $t$ that is not in $c$. There are initially $2D$ many terms in $f$, and any (or all!) of them might not be in $c$. We want to ensure that the probability that $f$ makes an error is at most - $\ep$. It is sufficient to ensure that + $\ep$. It is sufficient to ensure that For a term $t$ (e.g., $\lnot x_5$), we say that $t$ ``negates'' an example $\vx$ if @@ -344,11 +344,11 @@ respect to the unknown distribution $\cD$ over data points). First, we show that if we have no bad terms left in $f$, then $f$ - has an error rate at most $\ep$. + has an error rate at most $\ep$. We know that $f$ contains \emph{at most} $2D$ terms, since it begins with $2D$ terms and throws them - out. + out. The algorithm begins with $2D$ terms (one for each variable and one for each negated variable). Note that $f$ will only make one type @@ -510,7 +510,7 @@ \section{Complexity of Infinite Hypothesis Spaces} dimension is the \emph{maximum} number of points for which you can always find such a classifier. -\thinkaboutit{What is that labeling? What is it's name?} +\thinkaboutit{What is that labeling? What is its name?} You can think of VC dimension as a game between you and an adversary.
To play this game, \emph{you} choose $K$ unlabeled points however you @@ -546,7 +546,7 @@ \section{Complexity of Infinite Hypothesis Spaces} the proof that the VC dimension is at least three, you simply need to provide an example of three points, and then work through the small number of possible labelings of that data. To show that it is at most -three, you need to argue that no matter what set of four point you +three, you need to argue that no matter what set of four points you pick, you cannot win the game. @@ -603,7 +603,7 @@ \section{Further Reading} -%%% Local Variables: +%%% Local Variables: %%% mode: latex %%% TeX-master: "courseml" -%%% End: +%%% End: diff --git a/book/unsup.tex b/book/unsup.tex index 779c77f..94c17c6 100644 --- a/book/unsup.tex +++ b/book/unsup.tex @@ -97,7 +97,7 @@ \chapter{Unsupervised Learning} \label{sec:unsup} \begin{theorem}[$K$-Means Convergence Theorem] \label{thm:unsup:kmeans} For any dataset $\mat D$ and any number of clusters $K$, the $K$-means algorithm converges in a finite number of iterations, - where convergence is measured by $\cL$ ceasing the change. + where convergence is measured by $\cL$ ceasing to change. \end{theorem} \begin{myproof}{\ref{thm:unsup:kmeans}} The proof works as follows. There are only two points in which @@ -116,11 +116,11 @@ \chapter{Unsupervised Learning} \label{sec:unsup} It remains to show that lines 6 and 9 decrease $\cL$. For line 6, when looking at example $n$, suppose that the previous value of $z_n$ is $a$ and the new value is $b$. It must be the case that - $\norm{\vx_n-\vec\mu_b} \leq \norm{\vx_n-\vec\mu_b}$. Thus, + $\norm{\vx_n-\vec\mu_b} \leq \norm{\vx_n-\vec\mu_a}$. Thus, changing from $a$ to $b$ can only decrease $\cL$. For line 9, consider the second form of $\cL$. Line 9 computes $\vec\mu_k$ as the mean of the data points for which $z_n=k$, which is precisely - the point that minimizes squared sitances. Thus, this update to + the point that minimizes squared distances. 
Thus, this update to $\vec\mu_k$ can only decrease $\cL$. \end{myproof} @@ -148,7 +148,7 @@ \chapter{Unsupervised Learning} \label{sec:unsup} \item For $k = 2 \dots K$: \begin{enumerate} \item Find the example $m$ that is as far as possible from - \emph{all} previously selected means; namely: + \emph{all} previously selected means; namely: $m = \arg\max_m \min_{k' < k} \norm{ \vx_m - \vec\mu_{k'} }^2$ and set $\vec\mu_k = \vx_m$ \end{enumerate} @@ -186,7 +186,7 @@ \chapter{Unsupervised Learning} \label{sec:unsup} \FOR{$\VAR{k} = \CON{2}$ \TO $\VAR{K}$} \SETST{$d_n$}{$\min_{\VARm{k'}<\VARm{k}} \norm{ \VARm{\vx_n} - \VARm{\vec\mu_{k'}} }^2$, $\forall \VARm{n}$} \COMMENT{compute distances} -\SETST{$\vec p$}{$\frac 1 {\sum_{\VARm{n}} \VARm{n_d}} \VARm{\vec d}$} +\SETST{$\vec p$}{$\frac 1 {\sum_{\VARm{n}} \VARm{d_n}} \VARm{\vec d}$} \COMMENT{normalize to probability distribution} \SETST{$m$}{random sample from \VAR{$\vec p$}} \COMMENT{pick an example at random} @@ -301,8 +301,8 @@ \section{Linear Dimensionality Reduction} % \begin{align} \sum_n p_n - = \sum_n \dotp{\vx_n}{\vec u} - = \dotp{\left( \sum_n \vx_n \right)}{\vec u} + = \sum_n \dotp{\vx_n}{\vec u} + = \dotp{\left( \sum_n \vx_n \right)}{\vec u} = \dotp{\vec 0}{\vec u} = \vec 0 \end{align} % @@ -368,7 +368,7 @@ \section{Linear Dimensionality Reduction} \norm{ \mat X \vec v }^2 - \la_1 \left( \norm{\vec v}^2 - 1 \right) - \la_2 \dotp{\vec u}{\vec v}\\ -\grad_{\vec u} \cL &= +\grad_{\vec v} \cL &= 2 \mat X\T\mat X \vec v - 2 \la_1 \vec v - \la_2 \vec u\\ \Longrightarrow & \quad\la_1 \vec v = \left(\mat X\T\mat X\right)\vec v - \frac {\la_2} 2 \vec u @@ -401,8 +401,8 @@ \section{Linear Dimensionality Reduction} \COMMENT{project data using $\mat U$} } -This leads to the technique of \concept{principle components - analysis}, or \concept{PCA}. For completeness, the is depicted in +This leads to the technique of \concept{principal components + analysis}, or \concept{PCA}. 
For completeness, this is depicted in Algorithm~\ref{alg:unsup:pca}. The important thing to note is that the eigenanalysis only gives you the projection directions. It does not give you the embedded data. To embed a data point $\vx$ you need @@ -423,14 +423,14 @@ \section{Linear Dimensionality Reduction} minimize the \concept{reconstruction error}, defined by: % \begin{align} -\norm{ \mat X - \mat Z\vec u\T}^2 -&= \norm{ \mat X - \mat X \vec u \vec u\T }^2 +\norm{ \mat X - \mat Z\vec u\T}^2 +&= \norm{ \mat X - \mat X \vec u \vec u\T }^2 \becauseof{definition of $\mat Z$}\\ -&= \norm{\mat X}^2 +&= \norm{\mat X}^2 + \norm{\mat X \vec u \vec u\T}^2 - 2 \mat X\T \mat X \vec u \vec u\T \becauseof{quadratic rule}\\ -&= \norm{\mat X}^2 +&= \norm{\mat X}^2 + \norm{\mat X \vec u \vec u\T}^2 - 2 \vec u\T\mat X\T \mat X \vec u \becauseof{quadratic rule}\\ @@ -506,8 +506,7 @@ \section{Further Reading} \end{comment} -%%% Local Variables: +%%% Local Variables: %%% mode: latex %%% TeX-master: "courseml" -%%% End: - +%%% End:
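To tie the unsup.tex hunks above together, the two updates whose monotonicity the $K$-means convergence proof tracks (reassigning each point to its nearest mean on line 6, recomputing each mean as the centroid of its cluster on line 9) can be sketched as follows; argument names are illustrative:

```python
import numpy as np

def kmeans(X, mu, max_iter=100):
    """Lloyd's algorithm; each step can only decrease the objective L.

    X: (N, D) data; mu: (K, D) initial means (e.g. chosen by k-means++).
    """
    z = np.zeros(len(X), dtype=int)
    for _ in range(max_iter):
        # "line 6" step: assign each point to its nearest mean
        dists = ((X[:, None, :] - mu[None, :, :]) ** 2).sum(axis=2)
        z = dists.argmin(axis=1)
        # "line 9" step: recompute each mean as the centroid of its points
        new_mu = np.array([X[z == k].mean(axis=0) if np.any(z == k) else mu[k]
                           for k in range(len(mu))])
        if np.allclose(new_mu, mu):     # converged: L has ceased to change
            break
        mu = new_mu
    return mu, z
```

Because both steps are monotone in $\cL$ and there are finitely many assignments, the loop terminates, which is exactly the convergence theorem's argument.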