As you may have noticed from the posts, I (Bob) have been swotting up on discriminative methods, such as logistic regression and conditional random fields (CRFs).

A couple posts ago, I mentioned Martin Jansche’s training of a logistic regression-like classifier based on (an approximation of) F-measure error. The goal was to build a classifier with a better F-measure than one trained on traditional log loss.

Along these same lines, I came across a paper minimizing yet another error function, this time for sequence taggers:

- Kakade, Sham, Teh, Yee-Whye, and Sam T. Roweis. 2002.

An Alternate Objective Function for Markovian Fields.

Procs of*ICML*

Kakade et al. look at discriminative sequence tagging models including max-entropy Markov models (MEMM) and conditional random fields (CRF). MEMMs and CRFs are both trained like logistic regression using log loss. The key point is that this is defined on a per-input basis, not a per-tag basis. Kakade et al. use a per-tag error function instead, with some surprising results.

Here’s the general setup. The training data consists of `n`

pairs of tag/token sequences `tags[i], tokens[i]`

for `i < n`

. The models (either CRFs or MEMMs) supply probability estimates `p(tags[i] | tokens[i], β )`

, and the error for parameters `β`

is the sum of the log loss values per input in the data:

Err(β ) = Σ_{i}-log p(tags[i] | tokens[i], β )

The chain rule reduces this for bigram context models such as MEMMs (but not CRFs) to:

Err(β ) = Σ_{i}Σ_{j}- log p(tags[i][j] | tags[i][j-1], tokens[i], β )

and as they point out, this decomposition is the source of the so-called label-bias problem in MEMMs as originally discussed in Léon Bottou’s Thesis.

What Kakade et al. do is replace this whole-sequence error with the following error function, which is decomposed per tag:

Err2(β ) = Σ_{i}Σ_{j}- log p(tags[i][j] | tokens[i], β )

As with Jansche’s paper, the goal is to devise an error function that matches the evaluation function more closely (e.g. per-tag part-of-speech tagging error). And also like Jansche’s paper, the work is all in the gradient calculations of the new error function. That, and the actual training algorithm, which as you can guess, makes use of the forward-backward algorithm to compute `p(tag[i][j] | tokens[i])`

(note that this use of forward-backward is implemented in LingPipe’s HMM tag-word lattice class).

What’s really neat about this technique is how well it works. They use a synthetic example, and one example with very high accuracy (McCallum’s FAQ dataset). It would sure be nice to see if this worked on a data set we care more about, like part-of-speech tagging.

The big problem is that per-tag optimized output is not very suitable for tagging in support of named entity extraction or other chunking tasks, because the first-best output isn’t guaranteed to be consistent.

July 11, 2008 at 4:30 pm |

[…] that’s not all. Like HMMs, there’s a degree of label bias in a left-to-right language model. So I reversed all the data and built a right-to-left (or […]