**General Description**

The goal of the present
project is to implement NPL (non-parametric linkage) analysis in the fastest
possible way. This is the part in the Technion
project of efficient implementing multipoint approach for gene linkage. Here is
a general explanation about the whole project and the part of the present NPL
project in it.

First step. Given a pedigree,
information about the affected status of each individual in the pedigree,
marker alleles information at specific loci for each
individual, recombination values between adjacent loci and gene frequencies –
the single point probabilities for each possible inheritance vector at each
locus (inheritance distribution) are found. These probabilities depend only on
the marker data for each locus itself. Each inheritance vector determines the
inheritance pattern of specific locus in the pedigree. When inheritance
distribution is found, the multipoint probabilities (for each inheritance
vector at each locus) are computed using standard forward-backward HMM
algorithm : assuming nodes in HMM chain are loci on the chromosome, hidden
states are inheritance vectors, visible states are marker alleles at each
specific locus (input data), emission probabilities are single point
probabilities (of each inheritance vector at each locus) and transition
probabilities are transitions between each pair of inheritance vectors at each
pair of adjacent loci with specific recombination value between them. Such
transition is computed by multiplying (number of correlated bits which are the
same in 2 vectors)^(1 – recombination value) with
(number of correlated bits which are different in 2 vectors)^(recombination
value). So, using this method, we get 2 tables :
left-to-right probabilities table (forward algorithm) and right-to-left
probabilities table( backward algorithm).

Second step. Some scoring
function should be defined to give the score for each inheritance vector
depending on the observed phenotypes in the pedigree, either parametric or
non-parametric (NPL score). The main difference between them is that parametric
function (LOD-score) assumes linkage model in the computation while NPL
doesn’t. So when the linkage model is right, LOD-score should give more precise
result, and in the case of misspecification of linkage model, NPL is supposed
to give better result (it may be useful in the case of complex diseases).

As it was said above, the goal
of this project is to implement the computation of NPL score. 2 definitions of
scores were used: Spairs(v) and Sall(v) assuming v – some
inheritance vector:

Spairs(v) – the number of
pairs of alleles from distinct affected
pedigree members which are IBD (identical by descent, that means got physically
the same allele from the common ancestor);

Sall(v):

2f

Sall(v) = 2^(-a)*∑ ∏bi(h)!

h i=1

where: a
– the number of affected individuals in the pedigree

h – collection
of alleles from affected individuals, when only 1 allele from each affected is
picked (there are 2^a possible collections)

1..2f – (assuming f is
the number of founders in the pedigree) 2f distinct founder alleles

bi(h) – the number of
times founder allele i appears in h.

This score sharply increases if the number of affected individuals
sharing some specific allele increases.

It is easily seen that the first score is computationally trivial while
the second is complex.

Third step. Now we generalize
the scoring function by taking its expected value over multipoint probability
distribution of inheritance vectors at different loci:

S(x) = ∑ S(w)*P[v(x) = w]

w

where: x
– some locus (testpoint);

w – some
inheritance vector (the sum is taken over all possible inheritance vectors);

S(w) – either Spairs(w) or Sall(w);

P[v(x) = w]=P – multipoint probability
of vector w at x; if testpoint x is situated on some
locus then P is computed by multiplication of left-to-right table[w,x] by right-to-left table[w,x];
else if x is situated somewhere between locus i and
locus (i+1), one more step of forward algorithm is done using recombination
value between locus i and x (for each vector w we
store fromLeft(w)), one more step of backward
algorithm is done using recombination value between locus (i+1) and x (storing
in fromRight(w)), P=fromLeft(w)*fromRight(w).

All these calculations are performed using reduced inheritance vectors
of length 2n-f (and so the number of vectors is 2^(2n-f)),
when n is the number of nonfounders and f is the
number of founders. We don’t need to store all 2n bits of nonfounders
in vector, because there is no information about the founders phase, so for
each founder his first child’s bit is set to 0 and doesn’t appear in
inheritance vector.