**Winter 2004, under guidance of Prof. Dan
Geiger and Ma'ayan Fishelzon**

**by****
Mhameed Aez aldeen and
Qassem Ibrahem**

** **

__Goal __

Provide the means
for performing approximate inference via a heuristic which ignores extreme
markers when computations become too strenuous to be performed exactly.

Currently SUPERLINK
performs exact inference. Regardless of how well algorithms are being
perfected, linkage analysis provides an ever-growing challenge to computing
because some disease models depend on multiple loci, markers are highly
polymorphic, and there are many markers are available. It is therefore crucial
to provide the means for approximate computations with error bounds.

**Algorithm’s Prototype:**

**1)
****Start with the total number of markers as specified in the user input.**

**2)
****Determine an elimination order for the problem as is.**

**3)
****If the complexity coefficient of the best elimination order found
greater than some determined threshold ,clip off one of the extreme markers and
goto 2**

**4)
****When we have found an elimination order with complexity coefficient
below the desired threshold call the function to calculate likelihood
.**

__Strategy (to determined
the clipped markers)__

Concisely
our strategy is to keep the closest markers for the affection locus, as long as
there is **relatively** “reasonable” number of typed persons at each one of
these markers.

In
order to obtain a good heuristic we have first to determine what are the parameters
that influence result accuracy, then for each one of these parameters we have
to evaluate how does the heuristic value depend on this parameter (monotonous, decreasing,
increasing, linear etc…), for this phase we will reference later as **local
factor. **And then we have to weigh up between all the heuristic values from
the different parameters such that if parameter A is more important than
parameter B then the value contributed by A should be greater and greater correspondingly,
for this phase we will reference later as **global factor.**

__Data structure __

Info:

This data structure is used in order to hold
the marker information in order to calculate the heuristic value.

Following is a definition of Info structure :

typedef struct
Info {

double heurVal;

double distFromDisease;

int possibleAlleles;

} Info;

heurVal indicates the heuristic value for this entry, in our
implementation high heuristic values were assigned for those markers who are
most likely to be clipped off. I.e. each iteration we
clip the marker with the highest heuristic value.

distFromDisease is the distance of the marker in this entry from the
affection status locus, this value calculated by Haldane function.

possibleAlleles number of possible alleles in this markers.

|LociInformation is an array of Info, of size
equal to the numLoci. Entry i holds Info for the locus (Marker) in position i
on the chromosome ,Each time we clip a marker the
correspond entry is deleted and we shift to the left
all the entries in right side. I.e. we maintain LociInformation updated and matching the remaining markers situation.

1.
__Distance
from affection status locus____.__

a.
**Why does it matter**?

Genes
that lie close together on the same chromosome will tend to be inherited
together, because it will be rare for a crossover to occur between the loci at
meiosis.

b.
**Local factor** :

The closer
together the loci are, the less likely crossovers will be and the fewer
recombinants will be observed. If loci are far apart or on different
chromosomes then recombination will occur by chance in 50% of meioses. The
recombination fraction ranges from 0 (tight linkage) to 0.5 (no linkage).

So the first and simple conclusion is
that the closer the marker is for the affected status locus the more high value
it gets. But besides the distance should be relative rather than just an
absolute distance, which means the heuristic determined by how much the marker
is closer to the disease locus relatively to other markers.

This is better because it is more dynamic
and attached for the markers topology on the chromosome. For example in case
that all the markers are approximately at the same distance from the disease
locus there is a high opportunity to clip of not the farthest marker of them
but to “ignore” the influence of this factor and let other factors to decide.

This is made by updating each entry on LociInformation by the difference between the current average and the average in case that the corresponding marker is clipped off. In addition to emphasize that small differences in distances can lead to big differences in the results accuracy, we shall arise the previous expression to 3.

Algorithm **prototype**:

Let dis (i) = LociInformation[i].distFromDisease//distance from disease locus

D=AVG (dis (1), dis (2) **…** dis (numLoci))

For i =1 to numLoci

_{}

End

Notice
that if the distance is less than the average distance the contribution will be
negative

c.
**global factor**:

As
mentioned before this parameter is the most important one that influence the
result accuracy so it crucial that it takes great factor relatively to other parameters,so in our implementation
:

_{}

And
in this way we ensure that the distance influence is dominant and decisive .

**2.
**__Informativeness____ of the marker __

(Percentage
of typed individuals at this marker

a.
Why
does it matter?

In
order to determine if a specific marker and a disease locus are linked together,
we should examine as many as possible crossbreeds where the parents and
children are typed at this marker. Since if there are a little number of
crossbreeds who satisfy this condition then we can’t determined definitely if
the marker and disease locus are linked or not linked, because it could be that
the result is just “coincidental“,

b.
Local
factor :

For each marker it’s very vital that as many as possible number of individuals are typed at this marker from all available individuals on the pedigree, so what determines is the percentage of typed person at this locus rather than their number. In addition it’s reasonable that the heuristic value added to the each entry not linear to the percentage of the number of untyped individuals at the corresponding marker, instead the more is the percentage close to 1 the more the heuristic value is added, that to means for small percentages our heuristic is “forgiving” and the punishment for that is light and when this percentage grows up the heuristic value increases by a little, but for big percentages then every individual becomes crucial so the punishment is heavy and when this percentage grows more the heuristic value increases “crazy”.

Thus the heuristic function is from this from:

* this is a prototype function, for the precise one look at
source code

Where is the number of
untyped individuals at marker i.

We
can obtain more information than just count the number of untyped persons, we
can investigate the “environment” of each individual (number of his children
and if he has parents and the status of these individuals in the corresponding
marker).

For
example let’s look at specific marker, it is reasonable that the information we
loose if there is untyped individual at this marker which has one typed
children is less than the information we loose from untyped individual which
has 4 typed children, however If one parent has not been genotyped, one may
still use information from the other parent under certain conditions. The typed
parent must be heterozygous, and obviously if the affected child has the same
genotype then one cannot tell which allele has been transmitted. If the child
is homozygous, then one can deduce which allele has been transmitted, but the
pair must still be discarded from the analysis because otherwise one will
introduce a bias in favour of commoner alleles. If a
parent is missing, one should only incorporate the remaining parent-child pair
if both are heterozygous and have different genotypes.

On
the other hand an untyped individual which has no typed siblings and has two
typed parents (at the same marker) cause greater information loose than untyped
individual which has typed siblings (that we can inference some information
from this family) or on or more of his parents
are untyped so in our implementation in the first phase we counter the number
of untyped individual with a factor that
depends on the individual “environment” for each marker , and then we calculate
heuristic value according to the former equation.

c.
Global
factor:

in our
implementation this parameter consider primary after the Distance parameter,
that we keep the closer markers as long as they have “sufficient” information.

**3.
**__number of possible Alleles__

a.
Why
does it matter?

We
mean by that the sum of possible alleles at all the individuals as concluded
from the pedigree.

The
main purpose of our algorithm is to reduce the complexity coefficient, and in
the same time to maintain as possible the result accuracy .so it’s desired to
clip alleles as less as possible, that means we shall clip markers that reduce
the complexity coefficient as more as possible, and it’s known that the more
the marker we clip has possible alleles the more the complexity coefficient
reduced.

b.
Local
factor

So
the markers that relatively have many possible alleles have the priority to be
clipped first.

c.
Global
factor

It’s
not clear how does the complexity coefficient acts as function of the number of
possible alleles,besides by the Experimental tests we
notice that it is not unusual that clipping marker with less number of alleles
than another marker lead to less complexity coefficient.

So we let this
parameter minimal influence at the heuristic value.

**4.
**__distance from neighbor markers__

a.
**Why
does it matter**?

As
we explained before two genes that lie close together on the same chromosome
will tend to be inherited together, so if a marker is too close to one of his
neighbors the will be tied together and there is minor chance for recombination
between them, so in proximity can ascribe to them as one marker, and the other
marker doesn’t add any information.

b.
**Local
factor**:

In
order that the approximation that two close marker behave together in nearly
all the time they should be very very close together.

c.
**Global
factor**:

In
case that the distance between markers not as sufficient that we can consider
two markers behave as one marker and clip one of them so we give this parameter
small affection on the heuristic value and make sense just when the markers are
very close and the previous parameters are close.

__General notices:__

In
case of sex difference in order to compute the Distance from the affected
locus, or the distances between the markers we use both maleThetaVals and
femaleThetaVals and weigh between them by the ratio of male number and female
number in the pedigrees.

The
constants and factors on the equation were determined by Experimental results.