Summer 2004, guidance of Prof. Dan Geiger and Anna Tzemach.
by Osnat Tal and Noga Engel
Consistency checking of a pedigree is a computational problem, aiming to determine whether the data is consistent with the classic Mendelian laws of inheritance.
This is a basic tool needed by geneticists to have a higher degree of confidence about the data.
The problem of a full consistency checking is NP-complete.
Create a program that handles basic checks, providing a higher degree of confidence of the consistency of the genetic input. This set of checks consists of tests that will aim to find the most of the consistency errors and indicate them. The output of the program also includes recommendations regarding the people who should set to “unknown”, in order to make the pedigree consistent.
Phase one: Handling the pedigree file.
Implementation of functions:
* Checking the case of multiple pedigrees.
* Parsing the pedigree, creating a struct for each person, linked in a list to the person who appears before him in the file.
* Syntax checks:
All errors are “collected” and “kept” in mErrorLL1, will be printed at the end of level 1.
Parsing the file takes time complexity of O(n), when n is the number of people in the pedigree. Generating the pedigree data structure, from the parsed persons, takes O(n2), since, for each person, we go over all people, in worst case, in order to find his parents (and then connect them together).
Phase two: Locate Mendelian errors:
The checking that are done once for each nuclear family (checking sibship and consistency within the nuclear family) are done in time complexity of O(n), when n is the number of people in the pedigree: Each nuclear family is checked once, and all its children are being checked once. In each nuclear family checking, all children are being examined, but such one examination is in a constant amount of time, and the sum of all children in all nuclear families is O(n). (Because a child, is a child of only one couple of parents). (For each nuclear family checking, there is a constant time added to the pass on all children, but it is constant, and the number of nuclear families is O(n) too).
Lange Goradia algorithm is done in time complexity of O(n2m8), when n is the number of people in the pedigree, and m is the number of different alleles for the checked marker, that appear in the pedigree.
The algorithm, as appears in the course tutorial, and is coded in our project is:
*All the following calculations are for one marker.
A. For each pedigree member, save only ordered genotypes compatible with his/her phenotype.
B. For each nuclear family:
1. Consider each mother-father genotype pair:
a. Determine which zygotes can arise from this pair.
b. If each child in the nuclear family has one or more of these zygote genotypes among his or her current genotype list, then save the parental genotypes, and any child genotype matching one of the created zygote genotyps.
c. If any child has none of these zygote genotyps among his/her genotyps list, then don’t save any genotyps.
2. For each person in the nuclear family, exclude any genotypes not saved during step (1).
C. Repeat part (B) until no more genotypes can be excluded.
Since there are m different alleles of the checked marker, then any person has O(m2)
pairs of alleles of this marker.
Step A, thus, is done in O(m2n) time complexity.
Step B.1: If each person has O(m2) pairs of alleles of the checked marker, then there are O(m4) mother-father genotype pairs.
Step B.1.a is done in constant time.
Step B.1.b is done in ki*O(m2) time, since there is a need to go over all O(m2) pairs of alleles for each child (of the checked marker), when there are ki children in the checked family.
Step B.1.c is ki*O(m2), for the same reasons as in B.1.b.
Step B.2 is done in time complexity of (ki+2)O(m2), since we go over all pairs of alleles for the checked marker (there are O(m2) such), for each person in the family (ki+2 members).
Thus, for a single nuclear family, step B requires time complexity of:
m4[C + 2ki*O(m2)] + (ki+2)O(m2) = O(kim6)
Since this is done once (in step B) for each nuclear family, and a person is a child in an up to one nuclear family, then the sum of all ki’s in all the pedigree is O(n). Thus, step B takes O(nm6) time.
C. Step B is repeated until no more genotypes can be excluded. There are at the beginning a total of O(nm2) genotypes in the pedigree, since there are n people, and each one has O(m2) genotypes of the checked marker. If, in worst case, only a single genotype is being excluded in step B, then we’ll repeat step B O(nm2) times. And thus, step B and C will take time amount of
O(nm2) * O(nm6) = O(n2m8)
The whole algorithm is in time complexity of O(m2n) for step A, and O(n2m8) for steps B and C. Thus the whole algorithm takes time of
O(m2n) + O(n2m8) = O(n2m8).
Phase three: Pedigree recommendations
All level 3 functions deal with finding out which are the minimum set of people, that when their markers are turned to "unknown" (instead of it's value in the input file), then the pedigree becomes consistent.
The user gives D – a critical genotypes maximum level.
The time complexity for finding critical genotypes in level d is:
C(d, n)*(time complexity of Lange Goradia algorithm) = C(d, n) * O(n2m8)
Now we will sum these expressions for each feasible value of d (1..D). The maximal possible value of C(d,n) is C( integer(n/2),n), so the upper bound of the sum is
D*C( integer(n/2),n). Thus, the total estimation of algorithm’s complexity is
The code is written in C, input file are of two kinds:
1) data file
2) Pedigree file, in the following format:
person mother father sex affected-status allele_1.1 allele_1.2 ..allele_K.1 allele_K.2 << comments
In case of multiple pedigrees, each new pedigree starts with a line:
pedigree id, where id is a string or a number.
Output file, includes errors found by the program, divided to three levels:
1 - Consists of syntactic errors.
2 & 3 - Provides the user with specific errors concerning the classic Mendelian laws of inheritance.
(For more details, look at the Algorithms section).
One of the “basic units” of the program is a structure called Person, which obviously represents a person in the pedigree.
Each person is a member of two or more families, called NuclearFam, depending on the point of view:
1) mParentsFam – the person is a child in this family, so it includes his mother, father, siblings.
2) mFoundedFam - the person is a founder of a family, so it includes his/her children and mate. Such a mFoundedFam will exist for each different mate the person has brought children with. Those mFoundedFam of a person will be connected by pointers.
The struct Pedigree contains a list of all people in pedigree and more valuable information.
To keep the information of alleles we used a structure Marker.
PedAlleles is used to save all the alleles (linked list) in the pedigree. Its usage is for the consistency checks.
ErrorLL, used to “collect” error message, implemented as a link list of, Error structure.
PedigreeList is a struct which holds a linked list of Pedigree. It is needed in cases the input file consists of several pedigrees.
Helen suzan david female healthy 2 2
Bob suzan ben male healthy 3 1
Chris suzan david male affected 1 1
David 0 0 male affected 4 1
Suzan 0 0 female affected 4 1
Ben 0 0 male affected 4 1
Enlargement of Helen’s cell in the pedigree array, a closer look at the structure of Person.
Dan 0 0 male healthy 2 2
Sara 0 0 female healthy 3 1
Sheri Sara Dan female healthy 3 1
George 0 0 male affected 1 1
Richard Sara Dan male affected 2 1
Donald Sara Dan male affected 2 1
Yehezkel Sara George male affected 3 1