Applying GAs in Searching Motif Patterns in Gene Expression Data
Differential expression of gene groups implies the existence of a common control pattern. Such a control pattern is executed by the binding of transcription factors to short conserved sequences, called motifs, in the gene’s upstream region. Therefore, the presence or absence of these motifs in specific combinations defines a control pattern.
Classified gene expression data consists of samples coming from several biologically meaningful classes (e.g. samples taken from cancer patients vs. those taken from healthy individuals). One central step in the analysis of such data is the ranking and scoring of genes according to their measured or inferred inter-class differential expression.
In this work we address the following general question: given a set of genes and their classification ranking, find a control pattern by identifying a combination of motifs that is unique to the control sequence of genes from one class.
The biological model for this work defines a control pattern as a logical expression over the motifs in the gene’s control region. Each literal in the logical expression is a motif. The logical expression may consist of any number of motif literals and the AND, OR and NOT operators.
The logical expression is represented as a binary tree, where a leaf represents a motif literal and an inner node represents a logical operator.
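The tree representation can be illustrated with a minimal sketch. The `Node` class, the `evaluate` function and the motif names M1, M2 and M3 are hypothetical and chosen only for illustration; they are not taken from the original work. A gene is modelled here simply as the set of motifs found in its control region, and the unary NOT operator is assumed to use only the left child.

```python
from dataclasses import dataclass
from typing import Optional, Set


@dataclass(frozen=True)
class Node:
    """One node of a control-pattern tree (hypothetical structure)."""
    op: str                          # "MOTIF", "AND", "OR" or "NOT"
    motif: Optional[str] = None      # set only when op == "MOTIF"
    left: Optional["Node"] = None
    right: Optional["Node"] = None   # unused for the unary NOT


def evaluate(node: Node, motifs_present: Set[str]) -> bool:
    """Return True if the gene's motif set satisfies the expression."""
    if node.op == "MOTIF":
        return node.motif in motifs_present
    if node.op == "NOT":
        return not evaluate(node.left, motifs_present)
    if node.op == "AND":
        return evaluate(node.left, motifs_present) and evaluate(node.right, motifs_present)
    if node.op == "OR":
        return evaluate(node.left, motifs_present) or evaluate(node.right, motifs_present)
    raise ValueError(f"unknown operator: {node.op}")


# Illustrative pattern: (M1 AND NOT M2) OR M3
pattern = Node(
    "OR",
    left=Node("AND", left=Node("MOTIF", "M1"),
              right=Node("NOT", left=Node("MOTIF", "M2"))),
    right=Node("MOTIF", "M3"),
)

print(evaluate(pattern, {"M1"}))
print(evaluate(pattern, {"M1", "M2"}))
```

Under this sketch, a gene whose control region contains M1 but not M2 satisfies the pattern, as does any gene containing M3.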
The search space defined by the biological model is too large for exhaustive search. Therefore, a genetic algorithm (GA) is used as a search heuristic.
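A genetic search over such expressions might look roughly like the sketch below. It is not the method of the original work: the fitness function (fraction of matching genes in one class minus the fraction in the other), truncation selection, subtree crossover and mutation, the node limit, and all parameter values are assumptions made for illustration. Trees are written as nested tuples, and each gene is the set of motifs in its control region.

```python
import random

MAX_NODES = 25  # assumed cap on tree size to limit bloat


def random_tree(motifs, depth, rng):
    """Grow a random expression over the given motif names."""
    if depth == 0 or rng.random() < 0.3:
        return ("MOTIF", rng.choice(motifs))
    if rng.random() < 0.2:
        return ("NOT", random_tree(motifs, depth - 1, rng))
    return (rng.choice(("AND", "OR")),
            random_tree(motifs, depth - 1, rng),
            random_tree(motifs, depth - 1, rng))


def evaluate(tree, present):
    """True if the gene's motif set satisfies the expression."""
    op = tree[0]
    if op == "MOTIF":
        return tree[1] in present
    if op == "NOT":
        return not evaluate(tree[1], present)
    left, right = evaluate(tree[1], present), evaluate(tree[2], present)
    return (left and right) if op == "AND" else (left or right)


def size(tree):
    """Number of nodes in the tree."""
    return 1 if tree[0] == "MOTIF" else 1 + sum(size(c) for c in tree[1:])


def fitness(tree, positives, negatives):
    """Assumed score: match rate in one class minus match rate in the other."""
    hit_pos = sum(evaluate(tree, g) for g in positives) / len(positives)
    hit_neg = sum(evaluate(tree, g) for g in negatives) / len(negatives)
    return hit_pos - hit_neg


def random_subtree(tree, rng):
    """Walk down from the root and return some subtree."""
    while tree[0] != "MOTIF" and rng.random() < 0.5:
        tree = rng.choice(tree[1:])
    return tree


def replace_random(tree, donor, rng):
    """Return a copy of tree with one randomly chosen subtree replaced by donor."""
    if tree[0] == "MOTIF" or rng.random() < 0.3:
        return donor
    i = rng.randrange(1, len(tree))
    return tree[:i] + (replace_random(tree[i], donor, rng),) + tree[i + 1:]


def run_ga(positives, negatives, motifs, pop_size=40, generations=30, seed=0):
    rng = random.Random(seed)
    pop = [random_tree(motifs, 3, rng) for _ in range(pop_size)]
    score = lambda t: fitness(t, positives, negatives)
    for _ in range(generations):
        pop.sort(key=score, reverse=True)
        parents = pop[: pop_size // 2]  # truncation selection; the best survive
        children = []
        while len(parents) + len(children) < pop_size:
            a, b = rng.sample(parents, 2)
            child = replace_random(a, random_subtree(b, rng), rng)  # crossover
            if rng.random() < 0.3:  # mutation: graft in a fresh random subtree
                child = replace_random(child, random_tree(motifs, 2, rng), rng)
            children.append(child if size(child) <= MAX_NODES else a)
        pop = parents + children
    return max(pop, key=score)


# Toy data (invented): each gene is the set of motifs in its control region.
positives = [{"M1", "M2"}, {"M1", "M2", "M3"}, {"M1", "M2", "M4"}]
negatives = [{"M1"}, {"M2", "M3"}, {"M4"}]
best = run_ga(positives, negatives, ["M1", "M2", "M3", "M4"])
print(best, fitness(best, positives, negatives))
```

Because the top half of each generation is carried over unchanged, the best score in the population never decreases. A real application would derive the score from the gene ranking rather than from two fixed gene sets.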
We analyzed data from a heat shock experiment in the yeast Saccharomyces cerevisiae. Some motifs, including the Heat Shock Factor HSF-1 motif and an additional unknown motif, recurred in the result set. Furthermore, there was a clear statistical difference between the fitness of the solutions obtained from the yeast data and that obtained from random data generated with the same statistical characteristics.
MotifDensity.pdf (381Kb PDF)