**Implementation of Genehunter's algorithm.**

*Genetic linkage analysis*
is a statistical method that is used to associate functionality of genes
to their location on chromosomes. The main idea is that genes that
are found in vicinity on the chromosome have a tendency to stick together
when passed on to offsprings. Thus, if some disease is often passed
to offsprings along with a specific marker, then it can be concluded that
the gene that is responsible for the disease is located close on the chromosome
to that marker. By studying the proportion of offsprings to which
the marker has passed along with the disease, the distance between the
disease gene and the marker can be estimated.

The
alleles (at different genes) that are received by an individual from one
parent are called a *haplotype*. The measure that is used for
estimating whether the haplotype of an individual contains 2 alleles (one
at each locus) that resided in the same haplotype in the individual's parent
is called the recombination fraction and it is denoted by theta.
The basic computational goal in genetic linkage analysis is to compute
the recombination fraction between the two genes, i.e., to compute the
probability that a recombination occurs between two genes.

By calculating the
probability of recombination between two genes we can estimate the distance
between the genes, and therefore if we know the location of one of the
genes we can locate the position of the second one.

One of the
frequently used methods for estimating theta is the Maximum Likelihood
Evaluation (MLE) approach. The aim of this approach is to find a
value of theta that maximizes the probability of the evidence about a family
pedigree, given theta.

Genehunter
is a program used for performing genetic linkage analysis.The
Genehunter algorithm is based on hidden Markov models(HMMs).By
using this approach, the computation time scales linearly with the number
of loci – although it scales exponentially with the number ofmeioses
in the pedigree.The goal of this
project is to implement Genehunter’s algorithm in the **best possible
way, **using the references mentioned below.

SUPERLINK external & internal documentation.

Leonid
Kruglyak, Mark J. Daly, Mary P. Reeve-Daly, and Eric S. Lander.Rapid
Multipoint Linkage Analysis of Recessive Traits in Nuclear Families, Including
Homozygosity Mapping. *American Journal of Human Genetics* 56 (1995):
519-527..

Leonid
Kruglyak, Mark J. Daly, Mary P. Reeve-Daly, and Eric S. Lander. Parametric
and Nonparametric Linkage Analysis: A Unified Multipoint Approach. *American
Journal of Human Genetics* 58 (1996): 1347-1363.

Eric S. Lander and Philip Green.Construction of multilocus genetic linkage maps in humans.In Proceedings of National Academy of Science 84 (1987):2363-2367.

GENEHUNTER
documentation: http://linkage.rockefeller.edu/soft/gh/** **

"Statistics in Genetics " by Terry Speed http://stat-www.berkeley.edu/users/terry/Classes/s260.1998/Week8a/week8a/

For more explanation follow the link - DATA STRUCTURE

Linkage analysis can be devided into two steps:

1. Inferring information about the inheritance pattern of a pedigree

2.Deciding
whether the the inheritance information indicates the presence of a trait-causing
gene

__Computing
the single-locus Inheritance Destribution at Codominant Loci - O(2^ ^{2n-f})__

The
inheritance pattern at each point *x* is completely describes by a
binary inheritance vector *v(x)=(p1,m1
, p2,m2 ....pn,mn),
*whose coordinates describe the outcome of paternal and maternal meioses
giving rise to the *n* nonfounder in pedigree.

For
each locus we compute the possibility of each inheretence vector (all nonfounders
except first generation).This done by computing P(Marker | v) and then
normalizing the sum of all probabilities in same loci to 1.

__Computing
the single-locus Inheritance Destribution at Desease Loci (without
speed-up)- O(4 ^{f})__

Once again we compute P(Desease| v) when P(PhenotypeDesease| v) =Sum(P(gi | v)*Mult (P(PhenotypeDesease | gi)) and normalize then.

P(Desease | gi) is the probability that the i-th pedigree member has phenotype gi

P(gi | v) means that gi is completly determined by the founder genotypes.

The
sum is computed by direct summation over founder genotypes.

__Computing
Likehood- O(n*n*2__^{n}__)__

Is
done exactly as Describes in Lander and Green algorithm(1987).

- Running Environment : Sun OS version 5.8 (sun4u)
- In the experiments we used Superlink v1.0 and Genehunter v1.3

Input Files | No.
of People |
Num. of Founders | No.of
Loci |
Ln-
Likelihood on first place |
ln likehood superlink | Run
Time |
Run
Time
genehunter |
Output | |
---|---|---|---|---|---|---|---|---|---|

datafileEC11.dat | pedfileEC11.dat | 5 |
3 |
110 |
-1094.408491 |
-1103.826326 |
0.02 |
0.4 |
outEC11.txt |

datafileEC10.dat | pedfileEC10.dat | 5 |
3 |
100 |
-999.084665 |
-1007.835833 |
0.01 |
0.3 |
outEC10.txt |

datafileED0.dat | pedfileED0.dat | 5 |
2 |
100 |
-884.612059 |
-872.452335 |
0.02 |
0.4 |
outED0.txt |

datafileED5dat | pedfileED5.dat | 5 |
2 |
150 |
-1320.336439 |
-1305.520497 |
0.02 |
0.6 |
outED5.txt |

datafileED11.dat | pedfileED11.dat | 5 |
2 |
210 |
-1816.519908 |
-1799.752965 |
0.02 |
- |
outED11.txt |

datafileA0.dat | pedfileA0.dat | 7 |
3 |
30 |
-332.227408 |
-333.789857 |
0.02 |
0.2 |
outA0.txt |

datafileA1.dat | pedfileA1.dat | 7 |
3 |
60 |
-706.169015 |
-701.623972 |
0.01 |
0.3 |
outA1.txt |

datafileA2.dat | pedfileA2.dat | 7 |
3 |
90 |
-1082.211699 |
-1070.737231 |
0.01 |
0.5 |
outA2.txt |

datafileB0.dat | pedfileB0.dat | 10 |
3 |
100 |
-1723.443557 |
-1716.274631 |
76.43 |
26.4* |
outB0.txt |

datafileB1.dat | pedfileB1.dat | 10 |
3 |
120 |
-2077.838700 |
-2068.643312 |
92.15 |
* |
outB1.txt |

datafileB2.dat | pedfileB2.dat | 10 |
3 |
140 |
-2416.808981 |
-2407.466499 |
108.65 |
* |
outB2.txt |

datafileB3.dat | pedfileB3.dat | 10 |
3 |
160 |
-2766.293204 |
-2745.365789 |
126.58 |
* |
outB3.txt |

datafileB4.dat | pedfileB4.dat | 10 |
3 |
180 |
-3119.061002 |
-3103.428414 |
138.96 |
* |
outB4.txt |

* LOD score wasn't calculated due to too small
numbers or/and input file size

** The error is bound by 2%

*** Genehunter does not calculate ln-likelihood - only LOD SCORE

**** Output includes ln-likelihood for all possible places for disease locus

***** Run time - was measured by time system function (for more information man time) - user time was taken

Input Files | No.
of People |
Num. of Founders | No.of
Loci |
Ln-
Likelihood on first place |
ln likehood superlink | Run
Time |
Run
Time
genehunter |
Output | |
---|---|---|---|---|---|---|---|---|---|

datafile3.dat | pedfile3.dat | 3 |
2 |
20 |
-123.982425 |
-126.488538 |
0.02 |
0.04 |
out3.txt |

datafile4.dat | pedfile4.dat | 4 |
2 |
20 |
-166.879821 |
-165.384065 |
0.01 |
0.06 |
out4.txt |

datafile5.dat | pedfile5.dat | 5 |
2 |
20 |
-182.224323 |
-177.032069 |
0.02 |
0.18 |
out5.txt |

datafile6dat | pedfile6.dat | 6 |
2 |
20 |
-176.306845 |
-179.544550 |
0.05 |
0.2 |
out6.txt |

datafile7.dat | pedfile7.dat | 7 |
2 |
20 |
-197.065780 |
-195.338692 |
0.36 |
0.52 |
out7.txt |

datafile8.dat | pedfile8.dat | 8 |
3 |
20 |
-245.468750 |
-248.886213 |
0.18 |
0.3 |
out8.txt |

datafile9.dat | pedfile9.dat | 9 |
3 |
20 |
-245.485508 |
-246.856814 |
1.23 |
1.09 |
out9.txt |

datafile10.dat | pedfile10.dat | 10 |
3 |
20 |
-361.802210 |
-364.037750 |
0.72 |
0.59 |
out10.txt |

datafile11.dat | pedfile11.dat | 11 |
4 |
20 |
-338.797211 |
-340.462190 |
5.11 |
2.72 |
out11.txt |

On these files our program is in average faster by 1.38

Average mistake 1.37%