Efficient representation and p-value computation for high order Markov motifs

Paulo G.S. da Fonseca 1,2, Christian Gautier 2, Katia S. Guimarães 1, Marie-France Sagot 2

1 Centro de Informática, Universidade Federal de Pernambuco, Recife, Brasil

2 INRIA Rhône-Alpes; Laboratoire de Biométrie et Biologie Évolutive, UMR CNRS 5558, Université Claude-Bernard, Lyon 1, France


Abstract

Motivation: Position weight matrices (PWMs) have become a standard for representing biological sequence motifs. Their relative simplicity has favoured the development of efficient algorithms for diverse tasks such as motif identification, sequence scanning, and statistical significance evaluation. Markov chain-based models generalise the PWM model by allowing for inter-position dependencies to be considered, at the cost of substantial computational overhead, which has limited their application.

Results: In this paper, we consider two aspects regarding the use of higher order Markov models for biological sequence motifs, namely, the representation and the computation of p-values for motifs described by a set of example occurrences. We propose an efficient representation based on the use of tries, from which empirical position-specific conditional base probabilities can be computed, and extend state-of-the-art PWM-based algorithms to allow for the computation of exact p-values for high order Markov motif models.

Availability: The software is available in the form of a Java object-oriented library from http://www.cin.ufpe.br/~paguso/kmarkov

Contact: paguso at cin.ufpe.br


Software

The software is available HERE.


Last update: Feb 12, 2008 by Paulo Fonseca