Given a sequence of DNA nucleotide bases, the task of gene prediction is to find subsequences of bases that encode proteins. Reasonable performance on this task has been achieved using generatively trained sequence models, such as hidden Markov models. We propose instead the use of a discriminatively trained sequence model, the conditional random field. Discriminatively trained models can naturally incorporate arbitrary, non-independent features of the input, which can provide the modeling power needed for complex domains.
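
To make "arbitrary, non-independent features of the input" concrete, the following is a minimal sketch of per-position feature extraction of the kind a linear-chain conditional random field could consume. The specific features (current base, the codon starting at the position, start and stop codon indicators, local GC fraction) and the names `position_features` and `sequence_features` are illustrative assumptions, not the feature set used in this work.

```python
from typing import Dict, List

# Standard stop codons on the coding strand.
STOP_CODONS = {"TAA", "TAG", "TGA"}


def position_features(seq: str, i: int, window: int = 3) -> Dict[str, float]:
    """Return overlapping features describing position i of a DNA sequence.

    The features deliberately overlap (e.g. the base is part of the codon,
    and the codon is part of the GC window), which a generative model would
    have to treat as independent but a discriminative model need not.
    """
    seq = seq.upper()
    codon = seq[i:i + 3]
    lo, hi = max(0, i - window), min(len(seq), i + window + 1)
    context = seq[lo:hi]

    feats: Dict[str, float] = {
        "bias": 1.0,
        "base=" + seq[i]: 1.0,
        # Fraction of G/C bases in a small window around position i.
        "gc_fraction": sum(b in "GC" for b in context) / len(context),
    }
    # Codon-level features only apply when a full triplet is available.
    if len(codon) == 3:
        feats["codon=" + codon] = 1.0
        if codon == "ATG":
            feats["is_start_codon"] = 1.0
        if codon in STOP_CODONS:
            feats["is_stop_codon"] = 1.0
    return feats


def sequence_features(seq: str) -> List[Dict[str, float]]:
    """Feature dictionaries for every position, in order."""
    return [position_features(seq, i) for i in range(len(seq))]


if __name__ == "__main__":
    for i, f in enumerate(sequence_features("ATGCCCTAA")):
        print(i, sorted(f))
```

Each position yields a dictionary of named feature values. In a conditional random field these would be paired with label (and label-transition) indicators and given learned weights; that training step is not shown here.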

The goal of this research is to show how incorporating disparate sources of evidence in a coherent probabilistic model can improve gene prediction performance.

We published a technical paper on this.


Aron Culotta