Summer 2004, guidance of Prof. Dan Geiger and Anna Tzemach.
by Osnat Tal and
Noga Engel
The Problem
Consistency checking of a pedigree is a computational problem, aiming to determine whether the data is consistent with the classic Mendelian laws of inheritance.
This
is a basic tool needed by geneticists to have a higher degree of confidence
about the data.
The problem of a full consistency checking is
NPcomplete.
The Solution
Create a program that handles basic checks, providing a higher degree of confidence of the consistency of the genetic input. This set of checks consists of tests that will aim to find the most of the consistency errors and indicate them. The output of the program also includes recommendations regarding the people who should set to “unknown”, in order to make the pedigree consistent.
Phase one: Handling the pedigree
file.
Implementation
of functions:
* Checking the case of multiple pedigrees.
* Parsing
the pedigree, creating a struct for each person,
linked in a list to the person who appears before him in the file.
* Syntax
checks:
All
errors are “collected” and “kept” in mErrorLL1, will be printed at the end of level 1.
Parsing
the file takes time complexity of O(n),
when n is the number of people in the pedigree. Generating the pedigree
data structure, from the parsed persons, takes O(n^{2}), since,
for each person, we go over all people, in worst case, in order to find his
parents (and then connect them together).
Phase two: Locate Mendelian errors:
The
checking that are done once for each nuclear family (checking sibship and consistency within the nuclear family) are done
in time complexity of O(n), when n
is the number of people in the pedigree: Each nuclear family is checked once,
and all its children are being checked once. In each nuclear family checking,
all children are being examined, but such one examination is in a constant
amount of time, and the sum of all children in all nuclear families is O(n). (Because a child,
is a child of only one couple of parents). (For each nuclear family checking,
there is a constant time added to the pass on all children, but it is constant,
and the number of nuclear families is O(n)
too).
Lange Goradia algorithm is done in time complexity of O(n^{2}m^{8}), when n
is the number of people in the pedigree, and m is the number of
different alleles for the checked marker, that appear in the pedigree.
The
algorithm, as appears in the course tutorial, and is coded in our project is:
*All the following calculations are for one marker.
A.
For each pedigree member, save only ordered genotypes compatible
with his/her phenotype.
B.
For each nuclear family:
1. Consider each motherfather genotype pair:
a. Determine which zygotes can arise from this pair.
b. If each child in the nuclear family has one or more of
these zygote genotypes among his or her current genotype list, then save the
parental genotypes, and any child genotype matching
one of the created zygote genotyps.
c. If any child has none of these zygote genotyps
among his/her genotyps list, then don’t save any genotyps.
2. For each person in the nuclear family, exclude any
genotypes not saved during step (1).
C.
Repeat part (B) until no more genotypes can be excluded.
Since
there are m different alleles of the checked marker, then any person has O(m^{2})
pairs of alleles of this marker.
Step A,
thus, is done in O(m^{2}n) time
complexity.
Step
B.1: If each person has O(m^{2})
pairs of alleles of the checked marker, then there are O(m^{4})
motherfather genotype pairs.
Step
B.1.a is done in constant time.
Step
B.1.b is done in k_{i}*O(m^{2}) time, since there is a need to go over
all O(m^{2}) pairs of alleles for each child (of the checked
marker), when there are k_{i} children
in the checked family.
Step
B.1.c is k_{i}*O(m^{2}),
for the same reasons as in B.1.b.
Step
B.2 is done in time complexity of (k_{i}+2)O(m^{2}),
since we go over all pairs of alleles for the checked marker (there are O(m^{2})
such), for each person in the family (k_{i}+2 members).
Thus,
for a single nuclear family, step B requires time complexity of:
m^{4}[C + 2k_{i}*O(m^{2})] + (k_{i}+2)O(m^{2}) = O(k_{i}m^{6})
Since
this is done once (in step B) for each nuclear family, and a person is a child
in an up to one nuclear family, then the sum of all k_{i}’s
in all the pedigree is O(n). Thus, step B takes O(nm^{6})
time.
C.
Step B is repeated until no more genotypes can be excluded.
There are at the beginning a total of O(nm^{2})
genotypes in the pedigree, since there are n people, and each one has O(m^{2})
genotypes of the checked marker. If, in worst case, only a single genotype is
being excluded in step B, then we’ll repeat step B O(nm^{2})
times. And thus, step B and C will take time amount of
O(nm^{2}) * O(nm^{6}) = O(n^{2}m^{8})
The
whole algorithm is in time complexity of O(m^{2}n)
for step A, and O(n^{2}m^{8}) for steps B and C.
Thus the whole algorithm takes time of
O(m^{2}n) + O(n^{2}m^{8})
= O(n^{2}m^{8}).
Phase three: Pedigree recommendations
All level 3 functions deal with finding out which are the minimum set of people, that when their markers are turned to "unknown" (instead of it's value in the input file), then the pedigree becomes consistent.
The user gives D – a critical genotypes maximum level.
The time complexity for finding critical genotypes in level d is:
C(d, n)*(time complexity of Lange Goradia algorithm) = C(d, n) * O(n^{2}m^{8})
Now we will sum these expressions for each feasible value of d (1..D). The maximal possible value of C(d,n) is C( integer(n/2),n), so the upper bound of the sum is
D*C( integer(n/2),n). Thus, the total estimation of algorithm’s complexity is
O(D*C( integer(n/2),n)*(n^{2}m^{8})).
Implementation:
The code is written in C, input file are of two kinds:
1)
data file
2) Pedigree
file, in the following format:
person mother father sex
affectedstatus allele_1.1 allele_1.2 ..allele_K.1 allele_K.2 << comments
In case of multiple pedigrees, each new pedigree starts with a line:
pedigree id, where id is a string or a
number.
Output file, includes errors found by the program, divided to three levels:
1  Consists of syntactic errors.
2 & 3  Provides the user with specific errors
concerning the classic Mendelian laws of inheritance.
(For
more details, look at the Algorithms section).
One of the “basic units” of the program is a structure
called Person, which obviously represents a person
in the pedigree.
Each
person is a member of two or more families, called NuclearFam, depending on
the point of view:
1) mParentsFam – the person is a
child in this family, so it includes his mother, father, siblings.
2) mFoundedFam  the person is a
founder of a family, so it includes his/her children and mate. Such a mFoundedFam will exist for each
different mate the person has brought children with. Those mFoundedFam
of a person will be connected by pointers.
The struct Pedigree contains a list of all people in
pedigree and more valuable information.
To keep
the information of alleles we used a structure Marker.
PedAlleles
is used to save all
the alleles (linked
list) in the pedigree. Its usage is for the consistency checks.
ErrorLL,
used to “collect” error message, implemented as a link list of, Error structure.
PedigreeList is a struct
which holds a linked list of Pedigree. It is needed in cases
the input file consists of several pedigrees.
Pedigree
Example 1: a simple pedigree, the Smiths.
Smith
Pedigree File:
Helen suzan david female healthy 2 2
Bob suzan ben male healthy 3 1
Chris suzan david male affected 1 1
David 0 0 male affected 4 1
Suzan 0 0 female affected 4 1
Ben 0 0 male affected 4 1
Bob

Helen

Chris

David

Suzan

Ben

Enlargement of Helen’s cell in the
pedigree array, a closer look at the structure of Person.
Example 2:
Dan 0 0 male healthy 2 2
Sara 0 0 female healthy 3 1
Sheri Sara Dan female healthy 3 1
George 0 0 male affected 1 1
Richard Sara Dan male affected 2 1
Donald Sara Dan male affected 2 1
Yehezkel Sara George male affected 3 1