diff --git a/book/srl.tex b/book/srl.tex
index 50124db..a4443ba 100644
--- a/book/srl.tex
+++ b/book/srl.tex
@@ -391,7 +391,7 @@ \section{Structured Support Vector Machines}
 \end{align}
 We can now apply the same trick as before to remove $\xi_n$ from the analysis.
 In particular, because $\xi_n$ is constrained to be $\geq 0$ and because we are
-trying to minimize it's sum, we can figure out that out the optimum, it will be the case that:
+trying to minimize its sum, we can figure out that at the optimum, it will be the case that:
 \begin{align}
   \xi_n &= \max \left\{ 0,
@@ -477,7 +477,7 @@ \section{Loss-Augmented Argmax}
 The challenge that arises is that we now have a more complicated argmax problem that before.
 In structured perceptron, we only needed to compute $\hat\vy_n$ as the output that maximized its score (see Eq~\ref{eq:srl:argmax}).
-Here, we need to find the output that maximizes it score \emph{plus} it's loss (Eq~\eqref{eq:srl:lossaug}).
+Here, we need to find the output that maximizes its score \emph{plus} its loss (Eq~\eqref{eq:srl:lossaug}).
 This optimization problem is refered to as \concept{loss-augmented search} or \concept{loss-augmented inference}.
 Before solving the loss-augmented inference problem, it's worth thinking about why it makes sense.
@@ -680,7 +680,7 @@ \section{Dynamic Programming for Sequences} \label{sec:srl:dp}
 }
 The main benefit of Algorithm~\ref{alg:srl:argmax} is that it is guaranteed to exactly compute the argmax output for sequences required in the structured perceptron algorithm, \emph{efficiently}.
-In particular, it's runtime is $O(LK^2)$, which is an exponential improvement on the naive $O(K^L)$ runtime if one were to enumerate every possible output sequence.
+In particular, its runtime is $O(LK^2)$, which is an exponential improvement on the naive $O(K^L)$ runtime if one were to enumerate every possible output sequence.
 The algorithm can be naturally extended to handle ``higher order'' Markov assumptions, where features depend on triples or quadruples of the output.
 The memoization becomes notationally cumbersome, but the algorithm remains essentially the same.
 In order to handle Markov features of length $M$, the resulting algorithm will take $O(LK^M)$ time.
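
To make the $O(LK^2)$ claim concrete, here is a minimal Python sketch of the Viterbi-style dynamic program for the first-order Markov case. The score tables `unary` (score of label $k$ at position $l$) and `trans` (score of adjacent label pair $(j,k)$) are hypothetical stand-ins for the feature-based scores $\vw \cdot \phi$ in the text; this is an illustration of the technique, not the book's Algorithm~\ref{alg:srl:argmax} verbatim.

```python
def viterbi_argmax(unary, trans):
    """Exact argmax over label sequences in O(L K^2) time.

    unary[l][k]: score for label k at position l (hypothetical stand-in
                 for the unary feature scores w . phi(x, l, k)).
    trans[j][k]: score for the transition from label j to label k.
    """
    L = len(unary)        # sequence length
    K = len(unary[0])     # number of labels
    # alpha[l][k] = best score of any prefix ending in label k at position l
    alpha = [[0.0] * K for _ in range(L)]
    back = [[0] * K for _ in range(L)]  # backpointers to recover the argmax
    alpha[0] = list(unary[0])
    for l in range(1, L):
        for k in range(K):
            # O(K) inner maximization over the previous label,
            # giving O(L K^2) total work.
            best_j = max(range(K), key=lambda j: alpha[l - 1][j] + trans[j][k])
            back[l][k] = best_j
            alpha[l][k] = alpha[l - 1][best_j] + trans[best_j][k] + unary[l][k]
    # Follow backpointers from the best final label to recover the sequence.
    y = [max(range(K), key=lambda k: alpha[-1][k])]
    for l in range(L - 1, 0, -1):
        y.append(back[l][y[-1]])
    return list(reversed(y))
```

For an order-$M$ Markov model, `alpha` would be indexed by the last $M-1$ labels rather than one, which is exactly the notationally cumbersome $O(LK^M)$ generalization mentioned above.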