edit #282

6 changes: 3 additions & 3 deletions in book/srl.tex
@@ -391,7 +391,7 @@ \section{Structured Support Vector Machines}
\end{align}
We can now apply the same trick as before to remove $\xi_n$ from the analysis.
In particular, because $\xi_n$ is constrained to be $\geq 0$ and because we are
-trying to minimize it's sum, we can figure out that out the optimum, it will be the case that:
+trying to minimize its sum, we can figure out that at the optimum, it will be the case that:
\begin{align}
\xi_n &=
\max \left\{ 0,
@@ -477,7 +477,7 @@ \section{Loss-Augmented Argmax}

The challenge that arises is that we now have a more complicated argmax problem than before.
In structured perceptron, we only needed to compute $\hat\vy_n$ as the output that maximized its score (see Eq~\ref{eq:srl:argmax}).
-Here, we need to find the output that maximizes it score \emph{plus} it's loss (Eq~\eqref{eq:srl:lossaug}).
+Here, we need to find the output that maximizes its score \emph{plus} its loss (Eq~\eqref{eq:srl:lossaug}).
This optimization problem is referred to as \concept{loss-augmented search} or \concept{loss-augmented inference}.

Before solving the loss-augmented inference problem, it's worth thinking about why it makes sense.
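One reason loss-augmented inference can remain tractable is that a decomposable loss, such as Hamming loss, can be folded directly into the per-position scores, after which an ordinary argmax solver applies unchanged. The following is a minimal sketch of that folding step; the function and array names (`loss_augmented_scores`, `unary`, `y_true`) are illustrative assumptions, not the book's notation.

```python
import numpy as np

def loss_augmented_scores(unary, y_true, cost=1.0):
    """Fold Hamming loss into per-position label scores.

    Hamming loss decomposes over positions, so the loss-augmented
    argmax reduces to an ordinary argmax on shifted unary scores:
    every label that disagrees with the gold label y_true[l]
    receives a bonus of `cost`.

    unary:  (L, K) array of per-position label scores (illustrative)
    y_true: length-L list of gold label indices
    """
    aug = unary.copy()
    L, K = aug.shape
    for l in range(L):
        aug[l] += cost            # bonus for every label...
        aug[l, y_true[l]] -= cost # ...except the gold one
    return aug
```

Running the usual argmax on `aug` then yields the output that maximizes score plus Hamming loss, which is exactly the loss-augmented inference problem.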
@@ -680,7 +680,7 @@ \section{Dynamic Programming for Sequences} \label{sec:srl:dp}
}

The main benefit of Algorithm~\ref{alg:srl:argmax} is that it is guaranteed to exactly compute the argmax output for sequences required in the structured perceptron algorithm, \emph{efficiently}.
-In particular, it's runtime is $O(LK^2)$, which is an exponential improvement on the naive $O(K^L)$ runtime if one were to enumerate every possible output sequence.
+In particular, its runtime is $O(LK^2)$, which is an exponential improvement on the naive $O(K^L)$ runtime if one were to enumerate every possible output sequence.
The algorithm can be naturally extended to handle ``higher order'' Markov assumptions, where features depend on triples or quadruples of the output.
The memoization becomes notationally cumbersome, but the algorithm remains essentially the same.
In order to handle length $M$ Markov features, the resulting algorithm will take $O(LK^M)$ time.
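For concreteness, the $O(LK^2)$ dynamic program described above can be sketched as a standard Viterbi-style argmax over label sequences. This is a hedged sketch under assumed inputs (per-position `unary` scores and pairwise `trans` scores), not the book's Algorithm itself.

```python
import numpy as np

def viterbi_argmax(unary, trans):
    """Exact argmax over label sequences in O(L K^2) time.

    unary: (L, K) array; unary[l, k] scores label k at position l
    trans: (K, K) array; trans[j, k] scores label j at position l-1
           followed by label k at position l
    Returns the best label sequence and its total score.
    """
    L, K = unary.shape
    alpha = np.zeros((L, K))              # best score ending in k at l
    back = np.zeros((L, K), dtype=int)    # backpointers for recovery
    alpha[0] = unary[0]
    for l in range(1, L):
        # scores[j, k] = alpha[l-1, j] + trans[j, k] + unary[l, k]
        scores = alpha[l - 1][:, None] + trans + unary[l][None, :]
        back[l] = scores.argmax(axis=0)   # best predecessor for each k
        alpha[l] = scores.max(axis=0)
    # Backtrace from the best final label.
    y = [int(alpha[-1].argmax())]
    for l in range(L - 1, 0, -1):
        y.append(int(back[l, y[-1]]))
    return y[::-1], float(alpha[-1].max())
```

Each position does a $K \times K$ max over predecessor/successor label pairs, giving the $O(LK^2)$ total; enumerating all $K^L$ sequences is never needed.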