The goal of the present project is to implement NPL (non-parametric linkage) analysis in the fastest possible way. This is the part in the Technion project of efficient implementing multipoint approach for gene linkage. Here is a general explanation about the whole project and the part of the present NPL project in it.
First step. Given a pedigree, information about the affected status of each individual in the pedigree, marker alleles information at specific loci for each individual, recombination values between adjacent loci and gene frequencies – the single point probabilities for each possible inheritance vector at each locus (inheritance distribution) are found. These probabilities depend only on the marker data for each locus itself. Each inheritance vector determines the inheritance pattern of specific locus in the pedigree. When inheritance distribution is found, the multipoint probabilities (for each inheritance vector at each locus) are computed using standard forward-backward HMM algorithm : assuming nodes in HMM chain are loci on the chromosome, hidden states are inheritance vectors, visible states are marker alleles at each specific locus (input data), emission probabilities are single point probabilities (of each inheritance vector at each locus) and transition probabilities are transitions between each pair of inheritance vectors at each pair of adjacent loci with specific recombination value between them. Such transition is computed by multiplying (number of correlated bits which are the same in 2 vectors)^(1 – recombination value) with (number of correlated bits which are different in 2 vectors)^(recombination value). So, using this method, we get 2 tables : left-to-right probabilities table (forward algorithm) and right-to-left probabilities table( backward algorithm).
Second step. Some scoring function should be defined to give the score for each inheritance vector depending on the observed phenotypes in the pedigree, either parametric or non-parametric (NPL score). The main difference between them is that parametric function (LOD-score) assumes linkage model in the computation while NPL doesn’t. So when the linkage model is right, LOD-score should give more precise result, and in the case of misspecification of linkage model, NPL is supposed to give better result (it may be useful in the case of complex diseases).
As it was said above, the goal of this project is to implement the computation of NPL score. 2 definitions of scores were used: Spairs(v) and Sall(v) assuming v – some inheritance vector:
Spairs(v) – the number of pairs of alleles from distinct affected pedigree members which are IBD (identical by descent, that means got physically the same allele from the common ancestor);
Sall(v) = 2^(-a)*∑ ∏bi(h)!
where: a – the number of affected individuals in the pedigree
h – collection of alleles from affected individuals, when only 1 allele from each affected is picked (there are 2^a possible collections)
1..2f – (assuming f is the number of founders in the pedigree) 2f distinct founder alleles
bi(h) – the number of times founder allele i appears in h.
This score sharply increases if the number of affected individuals sharing some specific allele increases.
It is easily seen that the first score is computationally trivial while the second is complex.
Third step. Now we generalize the scoring function by taking its expected value over multipoint probability distribution of inheritance vectors at different loci:
S(x) = ∑ S(w)*P[v(x) = w]
where: x – some locus (testpoint);
w – some inheritance vector (the sum is taken over all possible inheritance vectors);
S(w) – either Spairs(w) or Sall(w);
P[v(x) = w]=P – multipoint probability of vector w at x; if testpoint x is situated on some locus then P is computed by multiplication of left-to-right table[w,x] by right-to-left table[w,x]; else if x is situated somewhere between locus i and locus (i+1), one more step of forward algorithm is done using recombination value between locus i and x (for each vector w we store fromLeft(w)), one more step of backward algorithm is done using recombination value between locus (i+1) and x (storing in fromRight(w)), P=fromLeft(w)*fromRight(w).
All these calculations are performed using reduced inheritance vectors of length 2n-f (and so the number of vectors is 2^(2n-f)), when n is the number of nonfounders and f is the number of founders. We don’t need to store all 2n bits of nonfounders in vector, because there is no information about the founders phase, so for each founder his first child’s bit is set to 0 and doesn’t appear in inheritance vector.