__Project in bioinformatics__

**Winter 2003/4, under the guidance of Prof. Dan Geiger and Ma'ayan Fishelzon**

**Submitted by:**

**Gzayil Firass, email: sfgzayil@t2.technion.ac.il**

**Zeidan Mohammed, email: smaz@t2.technion.ac.il**

__Goal of the project:__

Performing approximate inference via a heuristic that ignores extreme markers in likelihood computations when the exact computation is too strenuous to perform, i.e., when it would take too long to finish or would cause a memory overflow.

__Requirements:__

To code a function that would satisfy the project goal.

This function is coded within the super link code and uses its internal data structures.

The function keeps removing markers from the data structures until the complexity coefficient drops below a predefined threshold (lower than the original one). This enables Superlink to run on the reduced data and therefore to finish the calculation faster. As may be expected, removing markers loses information that is needed for fully accurate results. Our task was to develop means of removing the less valuable markers, so that the results stay as close as possible to the exact ones, and to implement those means in this function.

__Implementation:__

1. Start with the total number of loci as specified in the input files.

2. Determine an elimination order of the problem.

3. Derive the complexity coefficient of the elimination order.

4. If it is less than the predefined complexity, run the Superlink program to calculate the likelihood without any further operations.

5. Otherwise, choose one marker and remove it from the data structures.

6. Go to 3.

This is an abstract algorithm of the solution.
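The abstract algorithm above can be sketched as a small C function. Note that `complexityOf` and `clipOneMarker` are illustrative stand-ins for the real Superlink routines, and the toy complexity model here simply counts loci:

```c
#include <assert.h>

#define PREDEFINED_COMPLEXITY 8.0   /* the predefined threshold (illustrative value) */

/* Toy stand-in for deriving the complexity coefficient of an elimination
 * order: here it simply grows with the number of loci. */
static double complexityOf(int numLoci) {
    return (double)numLoci;
}

/* Step 5 stand-in: remove one marker from the data structures. */
static int clipOneMarker(int numLoci) {
    return numLoci - 1;
}

/* Steps 2-6: re-derive the complexity and clip markers until the
 * coefficient drops below the predefined threshold. */
int reduceToThreshold(int numLoci) {
    while (complexityOf(numLoci) > PREDEFINED_COMPLEXITY && numLoci > 1)
        numLoci = clipOneMarker(numLoci);
    return numLoci;
}
```

In the real program the complexity of the elimination order must be recomputed after every removal, which is why the check sits at the top of the loop.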

The main problem was to determine which marker to clip while harming the accuracy of the likelihood result as little as possible.

__How to choose a marker:__

We considered a few characteristics of the loci to help choose the extreme ones:

1. The distance of a locus from the affection.

This characteristic indicates how close a locus is to the affection; hence, we avoid clipping loci located too close to the affection.

2. The number of persons for whom the phenotype at a locus is unknown.

The bigger this number, the less information the locus provides, so it is recommended to get rid of it.

3. The number of possible alleles a locus has.

A locus with a large number of possible alleles causes higher complexity.

4. The closeness of a locus to its flanking neighbors.

If two loci are very close to each other, it is possible to give up one of them, since the other carries nearly the same information.

__Data structures:__

We had only one data structure: an array that holds the former four characteristics for each locus.

The array size is the number of loci, and its elements are of type prio.

Prio is a struct defined as follows:

```c
typedef struct Prio {
    double leftDist;
    double rightDist;
    int    numUnknown;
    double totalDist;
    double totalPrio;
} prio;
```

**leftDist** - the distance between a locus and its left neighbor.

**rightDist** - the distance between a locus and its right neighbor.

**numUnknown** - the number of persons whose phenotype at this locus is unknown.

**totalDist** - the distance of this locus from the affection.

**totalPrio** - the total priority of this locus.

*The lower a locus' priority, the higher the chance it will be clipped off!

**Hence, the total memory complexity is O(number of loci).**
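A minimal sketch of allocating this array (the helper name `makePrioArray` is our own; the struct follows the definition above):

```c
#include <assert.h>
#include <stdlib.h>

typedef struct Prio {
    double leftDist;
    double rightDist;
    int    numUnknown;
    double totalDist;
    double totalPrio;
} prio;

/* One prio cell per locus, zero-initialized: this is the only data
 * structure, so memory usage is O(number of loci). */
prio *makePrioArray(int numLoci) {
    return calloc((size_t)numLoci, sizeof(prio));
}
```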

__Algorithm of setting the priority of the loci:__

1. Calculate and set the value of every field in each cell of the priority array for the corresponding locus. A locus located at index i in the loci array (loci) corresponds to index i in the priority array (prioLoci).

· leftDist / rightDist / totalDist - calculated once, in a single pass over the array that holds the recombination values between the loci → time complexity is **O(number of loci)**.

· numUnknown - calculated by going over all the persons in all pedigrees for each locus → time complexity is **O((number of loci)*(number of people in all pedigrees))**.

2. Calculate the total priority:

a. As mentioned above, we have no interest in removing loci located too close to the affection; therefore, we "protect" the area surrounding the affection. The size of this area (dangerousInterval) is determined by a percentage that the user may change (percentageForInterval). For each locus in the protected area we set a high positive value (100) as its total priority (totalPrio), and zero for the others. If, however, during the run of the program, no loci are left outside the protected area, it shrinks by one locus from each side.

b. Calculate the average number of unknown phenotypes per locus and the average number of alleles per locus, over the loci outside the protected interval → time complexity is **O(number of loci)**.

c. For each locus outside the protected area whose number of unknowns is greater than the average, subtract 10 from its total priority (the new total priority is -10). The same is done for each locus outside the protected area whose number of possible alleles is greater than the average.

d. We defined a critical distance (criticalDist): any two adjacent loci with a recombination frequency equal to or less than it have a defined positive value (minVal = 30, which the user can change) subtracted from their total priority (totalPrio).

(c + d) are done in one loop over the priority array → time complexity is **O(number of loci)**.

e. When choosing the marker, the search selects the locus with the lowest priority. If there is more than one locus with the lowest priority, take the one with the higher number of alleles; if the allele counts are equal, take the one farthest from the affection locus.

This is done by moving along the priority array only once → time complexity is **O(number of loci)**.

3. Remove the chosen locus and adjust the data structures so that they stay consistent: delete the chosen marker from the loci array and shift the others → time complexity is O(number of loci) in the worst-case shift; update the two-dimensional arrays (maleThetaVals and femaleThetaVals) → time complexity is O(number of loci) in the worst-case shift; and update the pedigrees, shifting the loci for each person → time complexity is O((number of loci)*(number of people in all pedigrees)) in the worst-case shift.
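The priority-setting steps (1, 2a-2d) can be sketched as follows. The constants (100, 10, minVal = 30) and the field names come from the text; the array sizes, the sample data and the helper name are our own illustration, and the shrinking of the protected interval is not shown:

```c
#include <assert.h>
#include <math.h>
#include <stdlib.h>

#define NLOCI 7

typedef struct Prio {
    double leftDist, rightDist;
    int    numUnknown;
    double totalDist, totalPrio;
} prio;

/* Illustrative per-locus input data (made up). */
static double theta[NLOCI - 1]   = {0.05, 0.10, 0.005, 0.20, 0.10, 0.15};
static int    numUnknown[NLOCI]  = {3, 0, 1, 0, 5, 2, 1};
static int    numAlleles[NLOCI]  = {4, 6, 4, 2, 8, 4, 4};

void setPriorities(prio p[], int affection, int dangerousInterval,
                   double criticalDist) {
    /* Step 1: distances, in one pass using a running sum of the thetas. */
    double cum = 0.0, cumAff = 0.0;
    for (int i = 0; i < NLOCI; i++) {
        p[i].leftDist   = (i > 0) ? theta[i - 1] : -1.0;        /* -1: no neighbor */
        p[i].rightDist  = (i < NLOCI - 1) ? theta[i] : -1.0;
        p[i].numUnknown = numUnknown[i];
        p[i].totalDist  = cum;          /* distance from locus 0, for now */
        if (i == affection) cumAff = cum;
        if (i < NLOCI - 1) cum += theta[i];
    }
    for (int i = 0; i < NLOCI; i++)     /* re-base to distance from the affection */
        p[i].totalDist = fabs(p[i].totalDist - cumAff);

    /* Step 2a: protect the interval around the affection locus. */
    int inProtected[NLOCI];
    for (int i = 0; i < NLOCI; i++) {
        inProtected[i] = abs(i - affection) <= dangerousInterval;
        p[i].totalPrio = inProtected[i] ? 100.0 : 0.0;
    }

    /* Step 2b: averages over the unprotected loci. */
    double avgUnknown = 0.0, avgAlleles = 0.0;
    int nOut = 0;
    for (int i = 0; i < NLOCI; i++)
        if (!inProtected[i]) {
            avgUnknown += numUnknown[i];
            avgAlleles += numAlleles[i];
            nOut++;
        }
    avgUnknown /= nOut;
    avgAlleles /= nOut;

    /* Steps 2c + 2d in one loop over the priority array. */
    const double minVal = 30.0;
    for (int i = 0; i < NLOCI; i++) {
        if (inProtected[i]) continue;
        if (numUnknown[i] > avgUnknown) p[i].totalPrio -= 10.0;
        if (numAlleles[i] > avgAlleles) p[i].totalPrio -= 10.0;
        if ((p[i].leftDist  >= 0 && p[i].leftDist  <= criticalDist) ||
            (p[i].rightDist >= 0 && p[i].rightDist <= criticalDist))
            p[i].totalPrio -= minVal;
    }
}
```

Each of the three loops is a single pass over the loci, matching the O(number of loci) bounds stated above.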

prioLoci array:

This is how the priority array looks at run time. The affection locus is located in the middle of the diagram, and the rectangle marks the protected (dangerous) area: the loci surrounding the affection within a fixed length, whose priority is set to 100. The rest of the loci receive priorities according to the data each one holds. The flanking arrows indicate that the width of the protected area may shrink if needed, i.e., if no loci are left outside of it (a very unlikely case).

As already mentioned above, we choose the locus that has the lowest priority.
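The selection rule (lowest priority, then more alleles, then farther from the affection) can be sketched as one pass over the array; the per-locus data below is made up for illustration:

```c
#include <assert.h>

#define NLOCI 5

/* Illustrative per-locus data: total priority, allele count,
 * and distance from the affection locus. */
static double totalPrio[NLOCI]  = {0.0, -10.0, 100.0, -10.0, -10.0};
static int    numAlleles[NLOCI] = {4,    6,     4,     8,     8};
static double totalDist[NLOCI]  = {0.30, 0.20,  0.0,   0.10,  0.25};

/* Single pass: lowest priority wins; ties go to the locus with more
 * alleles, and remaining ties to the locus farther from the affection. */
int chooseMarker(void) {
    int best = 0;
    for (int i = 1; i < NLOCI; i++)
        if (totalPrio[i] < totalPrio[best] ||
            (totalPrio[i] == totalPrio[best] &&
             (numAlleles[i] > numAlleles[best] ||
              (numAlleles[i] == numAlleles[best] &&
               totalDist[i] > totalDist[best]))))
            best = i;
    return best;
}
```

With this sample data, loci 1, 3 and 4 share the lowest priority, loci 3 and 4 tie on allele count, and locus 4 wins because it lies farther from the affection.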

__Results and Conclusions:__

*In all the files below, the affection locus is set to be in the middle of the map.

__Results (same number of alleles for each locus)__

| #people | #loci | #Nodes | Lod score/likelihood (exact) | Lod score/likelihood (approx.) | Absolute error | Error rate % | Run time exact (sec) | Run time approx. (sec) | #loci removed |
|---|---|---|---|---|---|---|---|---|---|
| 57 | 21 | 659 | -0.078388 | -0.078245 | 0.00014 | 0.182425882 | 3.140000 | 3.890 | 1 |
| 57 | 21 | 1838 | -1.731987 | -1.102788 | 0.6292 | 36.32815951 | 22.516000 | 8.000 | 8 |
| 57 | 23 | 2007 | -1.232434 | -2.253489 | 1.02106 | 82.84865559 | 21.110000 | 20.406 | 7 |
| 57 | 25 | 2190 | -2.577146 | -2.251073 | 0.32607 | 12.65248457 | 24.671000 | 35.547 | 6 |
| 57 | 27 | 2380 | -1.356160 | -0.723942 | 0.63222 | 46.61824563 | 1132.828000 | 293.813 | 7 |
| 57 | 29 | 2531 | -1.282169 | -0.706973 | 0.5752 | 44.86116885 | 2371.485000 | 622.578 | 5 |
| 25 | 23 | 851 | -1.145314 | 1.614273 | 2.75959 | 240.9458891 | 76.266000 | 45.515 | 6 |
| 25 | 25 | 922 | -0.354794 | -1.584810 | 1.23002 | 346.684555 | 95.843000 | 56.297 | 8 |
| 25 | 27 | 996 | -1.623843 | -0.578665 | 1.04518 | 64.36447366 | 120.156000 | 49.500 | 12 |
| 25 | 29 | 1062 | -0.260679 | -0.630194 | 0.36952 | 141.7509657 | 89.235000 | 71.937 | 13 |

This kind of file, in which we assigned the same number of alleles to each locus, reflects the effect of the number of unknowns and of the recombination values between the loci on the final results of our implementation. As the results show, our implementation stands up nicely, since only one file yielded a relatively high error.

We obtained the following graphs from the table above. The first graph below describes the absolute error (error size) as a function of the number of loci removed.

The absolute error is generally low (except for one file), as can be seen from both the table and the graph.


The following graph describes the absolute error as a function of the number of loci on each side of the affection locus.

In this group, from a certain point on, the more loci on each side of the affection, the lower the absolute error.

__Results (same number of known phenotypes)__

| #people | #loci | #Nodes | Lod score/likelihood (exact) | Lod score/likelihood (approx.) | Absolute error | Error rate % | Run time exact (sec) | Run time approx. (sec) | #loci removed |
|---|---|---|---|---|---|---|---|---|---|
| 50 | 21 | 1716 | -2.105777 | -1.588505 | 0.51727 | 24.5644244 | 3.2500 | 2.547000 | 5 |
| 50 | 23 | 1855 | -0.798646 | -0.677597 | 0.12105 | 15.1567778 | 2.7960 | 2.828000 | 2 |
| 50 | 25 | 2033 | -2.237754 | 10.164054 | 12.4018 | 554.207835 | 5.6560 | 6.343000 | 14 |
| 50 | 27 | 2197 | -1.301117 | -1.727963 | 0.42685 | 32.8061197 | 12.6250 | 10.89000 | 8 |
| 50 | 29 | 2330 | -2.227370 | 0.892379 | 3.11975 | 140.064246 | 12.1710 | 13.81200 | 15 |
| 50 | 31 | 2472 | -2.339975 | -2.540232 | 0.20026 | 8.55808289 | 8.8590 | 8.171000 | 9 |
| 50 | 33 | 2632 | -2.024760 | -0.525317 | 1.49944 | 74.0553448 | 10.6250 | 7.421000 | 9 |
| 50 | 35 | 2779 | -0.705142 | 1.642086 | 2.34723 | 332.873095 | 7.8120 | 6.671000 | 10 |
| 50 | 37 | 2919 | -3.120746 | 1.638174 | 4.75892 | 152.493026 | 7.3120 | 8.156000 | 9 |
| 50 | 39 | 3078 | -0.294515 | 3.999787 | 4.2943 | 1458.0928 | 18.2960 | 12.40600 | 11 |

In these files, all the loci have the same number of known phenotypes among the people given in the input (in fact we assigned all loci to be known for every person; since we take the average into account, as described above, this would not have any effect). According to our implementation, in this case we mainly remove loci with the highest number of alleles, in an attempt to reduce the complexity coefficient. As is obvious from the table and the first graph below, the more loci removed, the bigger the absolute error.

Also in this case, from a certain point on, the more loci on each side of the affection, the lower the absolute error.

__Results (same distance)__

| #people | #loci | #Nodes | Lod score/likelihood (exact) | Lod score/likelihood (approx.) | Absolute error | Error rate % | Run time exact (sec) | Run time approx. (sec) | #loci removed |
|---|---|---|---|---|---|---|---|---|---|
| 30 | 15 | 686 | -0.386085 | 2.580186 | 2.96627 | 768.294806 | 0.4840 | 0.9220 | 8 |
| 30 | 17 | 750 | -0.231126 | 0.934130 | 1.16526 | 504.164828 | 0.6870 | 0.9840 | 4 |
| 30 | 19 | 837 | -0.456340 | -0.586443 | 0.1301 | 28.5101021 | 0.7340 | 1.3430 | 6 |
| 30 | 21 | 927 | -0.293532 | 0.739871 | 1.0334 | 352.058038 | 0.890 | 1.140 | 2 |
| 30 | 23 | 1012 | -0.645523 | -0.571646 | 0.07388 | 11.4445186 | 1.0150 | 1.6710 | 4 |
| 30 | 25 | 1095 | -0.586887 | -0.288023 | 0.29886 | 50.923602 | 1.0460 | 2.5150 | 8 |
| 30 | 27 | 1174 | -0.345031 | -1.090526 | 0.7455 | 216.066093 | 1.000 | 3.250 | 11 |
| 30 | 29 | 1267 | -0.641469 | -0.914389 | 0.27292 | 42.5460934 | 3.2180 | 4.2030 | 13 |
| 30 | 31 | 1341 | 0.275094 | -1.297687 | 1.57278 | 571.724938 | 2.610 | 4.9060 | 14 |
| 30 | 33 | 1435 | -0.321124 | -1.310705 | 0.98958 | 308.161645 | 2.860 | 5.8280 | 15 |

All loci in these files have the same recombination value between them, 0.1.

In general the absolute error is low, except for the first file; however, that file has a relatively low number of loci on each side of the affection, and half of its loci were removed. So this particular file is not a representative one.

__Results (random values)__

| #people | #loci | #Nodes | Lod score/likelihood (exact) | Lod score/likelihood (approx.) | Absolute error | Error rate % | Run time exact (sec) | Run time approx. (sec) | #loci removed |
|---|---|---|---|---|---|---|---|---|---|
|  |  |  |  |  |  |  | 1111.5160 | 41.7500 | 5 |
| 20 | 25 | 1003 | -0.792989 | -0.647185 | 0.1458 | 18.386636 | 2326.140 | 50.3590 | 7 |
| 20 | 27 | 1079 | -0.128628 | -1.262478 | 1.13385 | 881.49548 | 2688.6250 | 175.0620 | 8 |
| 20 | 29 | 1149 | -0.734409 | -0.419229 | 0.31518 | 42.916141 | 199.2190 | 426.9530 | 2 |
| 20 | 31 | 1238 | -0.239630 | -0.102404 | 0.13723 | 57.265785 | 748.1250 | 1628.031 | 11 |
| 20 | 33 | 1319 | -0.417219 | -0.907606 | 0.49039 | 117.53707 | 337.5470 | 974.7180 | 14 |
| 20 | 35 | 1406 | -0.724944 | 0.475627 | 1.20057 | 165.60879 | 5233.9380 | 233.6090 | 16 |
| 20 | 23 | 901 | -0.475182 | -0.432396 | 0.04279 | 9.0041289 | 2897.203 | 1279.765 | 2 |
| 57 | 29 | 2531 | -1.282169 | -1.281969 | 0.0002 | 0.0155986 | 6093.265 | 1217.25 | 2 |

These are general files with random values of the variables explained above.

In general the results are fine, especially when working on long, massive files with a complexity coefficient higher than 9.

Files number 8/9 are examples of such files (true for all categories).

*A very significant improvement in running time is obtained (regardless of the category).

File no. 9 took about 101 minutes in the exact calculation, while the approximation took about 21 minutes, with a very low absolute error and error rate!

This graph supports our assumption that when a large number of loci is removed, the error may get higher.

Here too we can notice a tendency of the absolute error to decrease as the number of loci on each side of the affection locus grows.

The absolute error is relatively very low, so the graph still supports the assumption.

__Total results:__

These are graphs combining all of the results above.

**Download Executable for Windows**

Note: the complexityThreshold was initialized to 8.