Linkage analysis is a statistical method, which is used to extract information from inheritance data (e.g. pedigrees), and to find correlation between certain alleles and phenotypes, thus aiding in the identification of disease causing genes.
Two approaches can be used for the analysis:
GENEHUNTER is a program which performs linkage analysis for pedigrees given theta values.
Multipoint analysis of a given pedigree is separated into two steps:
- computation of probability distribution of inheritance patterns (which depend only on the marker genotypes)
- computation of a statistic that quantifies linkage (which, conditional on an inheritance pattern, depends only on the phenotypes)
The algorithm used by GENEHUNTER is linear by number of loci, but exponential by the number of people.
The goal of the project was to implement NPL (Non Parametric Linkage) analysis in the fastest possible way, based on the methodology of GENEHUNTER. A program was provided which calculates the probability vectors and likelihood scores, into which an implementation of NPL score computation was to be integrated.
A general algorithm to calculate score:
compute_score(inheritance_vector_space){
score := 0
for each inheritance_vector in inheritance_vector_space do
s := vector_score( vector )
p := vector_probability( vector )
score := score + (s * p)
}
In this project, I seek for the best implementation of vector_score
to return
NPL scores of two kinds:
Calculating NPL score has two main algorithmic bottle-necks:
Therefore, improving performance of the calculation, requires:
The first issue is dealt with by not traversing through members who have no influence on the genotypic data of affected member (Kyriacos et al 2001), in addition to the already implemented reduction of inheritance space by removing the first child of each founder (Kruglyak et al 1996, Second Speedup).
The second and third issues are dealt with by integrating them into a unified process, in which the traversal and score calculation are weaved together, and by using two special traits of gray code. Gray code is a binary system in which two consecutive numbers differ by the value of a single bit. By iterating over the inheritance vectors space using Gray code, only a single allele is changed (for the changed member and its descendants). By iterating through the possible choices of alleles of affected members using Gray code, a faster calculation of NPLall is also gained (Kyriacos et al 2001). Since Gray code is cyclic, an extra iteration returns to the initial state, thus saving the need to analyze the entire pedigree when calculating the first choice of alleles in NPLall.
In NPLpairs the score of each vector is the number of pairs of alleles that are IBD,
in distinct affected pedigree members.
When traversing the vector space, whenever changing an allele for a member, the score
changes as follows: let a1
be the allele being replaced by a2
,
and |x|
represents the number of times allele x
appears in
affected members, then after the replacement, |a1|* := |a1| - 1, |a2|* := |a2| + 1,
score* := score - |a1|* + |a2|
. If the new allele is identical to the other allele
present in the same member, decrease the score by one, and if this is the case for the
old allele, increase it by one.
In NPLall the Whittemore-Halpern statistic, is calculated using Gray code as described
by Kyriacos et all (2001). To improve performance, the starting value of the statistic
PI(b_{i}(h=0), i=1, 2f)
is modified during traversal in the following
way: let a1
be the allele being replaced by a2
(in this case
a1
is always the paternal allele), then the value of the statistic
is divided by the multiplicity of a1
, the multiplicity of a1
is
decreased by one, the multiplicity of a2
is increased by one, and the
statistic is multiplied by the multiplicity of a2
.
let SF be the scoring function, which has methods:
onNewVector(v), onChangeAlleles(member, parent), onVectorEnd(vector)
let R be the reduced vectors space
let F be the relevant founders
let N be the relevant non founders
let S be the set of score per vector
Initialization:
For each s in S, s := 0
vector := 0
SF.onNewVector(vector)
For each f in F,
f.paternalAllele := unique value
f.maternalAllele := unique value
SF.onChangeAlleles(f, father)
SF.onChangeAlleles(f, mather)
For each n in N
n.paternalAllele := n.father.paternalAllele
n.maternalAllele := n.mather.paternalAllele
SF.onChangeAlleles(n, father)
SF.onChangeAlleles(n, mather)
SF.onVectorEnd(vector)
Iteration
For each vector in R-{0}, in Gray code order, starting from 1
SF.onNewVector(vector)
let m be the changed member
let p be the parent in whom the allele changed
SF.onChangeAlleles(m,p)
For each descendant d of m, in order of generations
If allele change affected d,
change allele in d, for appropriate parent q
SF.onChangeAlleles(d,q)
SF.onVectorEnd(vector)
Normalization:
For each s in S
s := ( s - Expectation(S) ) / StandardDeviation(S)
Examples of SF:
NPLpairs:
onNewVector:
S(vector) := S(previous_vector)
onChangeAlleles:
if member is not affected then exit.
let a be the allele being changed
let b be the new allele
let c be the other allele in the same member
|a| := |a| - 1
S(vector) := S(vector) - |a| + |b|
|b| := |b| + 1
if a = c S(vector) := S(vector) + 1
if b = c S(vector) := S(vector) - 1
onVectorEnd:
do nothing
NPLall:
let h0 be the stored value of h0 initialized as 1
onNewVector:
do nothing
onChangeAlleles:
if member is not affected, or changed allele is not paternal then exit.
let a be the allele being changed
let b be the new allele
h0 := h0 / |a|
|a| := |a| - 1
|b| := |b| + 1
h0 := h0 * |b|
onVectorEnd:
Let H be the affected vector space
S(vector) := h0
h := h0
For each vector h in H-{0}, in Gray code order
let a be the allele being changed
let b be the new allele
h := h / |a|
|a| := |a| - 1
|b| := |b| + 1
h := h * |b|
S(vector) := S(vector) + h
(The next lines are used to return the multiplicities to the state they should be according to h0)
Let x be the maternal allele of the MSB
Let y be the paternal allele of the MSB
|x| := |x| - 1
|y| := |y| + 1
Let n be the number of non founders, f the number of founders, and a the number of affected members.
Marking relevance of pedigree members | O(n+f) | |
Creating masks | O(n+f) | |
Traversing the inheritence vector space | O(2^{2n-f}) | |
Changing allele assignment for each vector | O(n) | |
Calculating NPLpairs for each vector | O(n) | |
Calculating NPLall for each vector | O(2^{a}) | |
Mutiplying scores and probability vectors | O(2^{2n-f}) | |
Total | O(2^{2n-f+a}) |
In order to integrate the NPL score computation into the existing code, a few changes are required in the original code:
npl_score_all
and npl_score_pairs
to the TestPoint
struct, in programSpec.h
.
Accordingly, the initialization and printing of these fields is implemented in programSpec.c
.
computeNplScore
function provided in the npl.h
library
(implemented in npl.c
) and providing it with functions to compute the required NPL score (for example,
those provided by the library nplScorePairs.h
and nplScoreAll.h
, implemented in the corresponding
c files).
The modification in likehood.h
(adding the vectors' definitions) and in gh_main.c
(calling computeNplScore
)
generates the score vectors to be used in the program.
likehood()
function,
implemented in likehood.c
.
/* modification BEGIN */
and /* modification END */
, where
modification
is INSERT
, REPLACE
, or DELETE
and replaced or deleted code is remarked.
The test data included 56 tests, which were run on t2.technion.ac.il Each test was run 5 times, and the results presented are the average runtime of all the runs.
Resource | Description |
---|---|
SuperLink | Superlink homepage, for file structure and other data |
ghStudents <datafile> <pedfile>
ghStudents <datafile> <pedfile> -pairs