Efficient calculation of interval scores
for DNA copy number data analysis
by Oren
Levin and Amit Meshulam
April 2007, under guidance of Zohar Yakhini
Introduction
Alterations in DNA copy number are
characteristic of many cancer types and are thought to drive some cancer
pathogenesis processes. These alterations include large chromosomal gains and
losses as well as smaller scale amplifications and deletions. Because of their
role in cancer development, regions of chromosomal instability are useful for
elucidating other components of the process. For example, since genomic
instability can trigger the over expression or activation of oncogenes and the
silencing of tumor suppressors, mapping regions of common genomic aberrations
had been used to discover cancer related genes.
In Comparative Genomic Hybridization (CGH)
differentially labeled tumor and normal DNA are cohybridized to normal metaphases.
Ratios between the two labels allow the detection of chromosomal amplifications
and deletions of regions that may harbor oncogenes and tumor suppressor genes.
In a more advanced method termed array CGH (aCGH), tumor and normal DNA are
cohybridized to a microarray of thousands of genomic clones of BAC, cDNA or
oligonucleotide probes, allowing the determination of changes in DNA copy
number of relatively small chromosomal regions.
The Stepgram application is used to detect aberrations in DNA copy number data from aCGH experiments. The application is based on the interval score described in the paper "Efficient Calculation of Interval Scores for DNA Copy Number Data Analysis"[1]. In short, the score of an interval is defined as the sum of its probes, divided by the square root of its length (_{}). Although the paper mentioned above addresses many problems, we’ve focused on one in particular since it is the key to all others: how can we find the interval with the maximum score correctly and efficiently. The paper offers 2 different algorithms for that problem in details, we will describe them in short.
Known Algorithms
The paper suggests 2 algorithms for finding the interval with maximal score:
LookAhead algorithm
The algorithm operates by considering all possible interval starting points, in sequence. For each starting point_{}, we check the intervals with ending point in increasing distance from _{}. The basic idea is to try and not consider all possible ending point, but rather to skip some that will clearly not provide the optimum. Assume that we are given two parameters _{} and _{}, where _{} is a lower bound on the optimal score _{} and _{} is an upper bound on the value of any single element _{}.
Assume that we have just considered an interval of length _{} with _{}, and _{}.
After checking the interval _{}, an exhaustive algorithm would continue to the next ending point, and check the interval _{}. However, this might not always be necessary. If _{} is sufficiently small and _{} is sufficiently large then _{} may stand no chance of obtaining_{}. For any x, setting _{} (skipping the next _{} intervals) an upper bound on the score of _{} is given by _{}. Thus, _{} has a chance of surpassing _{} only if _{} . Solving for_{}, we obtain_{}. Thus, the next ending point to consider is_{}, where_{}.
The improvement in efficiency depends on the number of ending points that skipped. This number, in turn, depends on the tightness of the two bounds _{} and_{}. Initially, _{} may be set to _{}. As we proceed _{} is replaced by maximal score encountered so far, gradually improving performance.
_{}provides an upper bound on the values of single elements in _{}. In the description above, we used the global maximum_{}. However, using this bound may limit the usefulness of the LookAhead approach, since even a single high value in the data will severely limit the skip size. Thus, we use a local approach, where we bound the value of the single elements within a window of size_{}. The most efficient definition of local bounds is the slope maximum: For a window size_{}, for all_{}, precalculate, _{}.
Although _{} is not an upper bound on the value of elements within the _{} window following_{}, the value of _{} is indeed an upper bound on _{} within the _{} window, maintaining the correctness of the approach. Note that the skip size must now be limited by the window size:_{}. Thus, larger values of _{} give better performance although preprocessing limits the practical window size to _{}.
Geometric Family Algorithm (GFA)
The paper defines a geometric family of intervals I
as: I =_{}
where I(j) is defined as follows:
For j = 0, 1,…, log_{(1+ε)}n, set k_{j}=(1+ε)^{j}
and Δ_{j}= εk_{j}. then:
I(j) = _{}.
In simple words, in each step we increase both the length of the intervals (k_{j})
and the space between the starting points of 2 consecutive intervals (Δ_{j}).
Since both increments are exponential, in length and in space, the result is a
set of intervals that is significantly smaller than all _{} possible subintervals
in the range [0, n1].
Lemma (proof in the paper): Let I* be the optimal scoring interval. Let J be the leftmost longest interval of I fully contained in I*. Then _{}(J) ≥ _{}(I*)/a, where _{}.
Using the lemma, we can do as follows:
Find the maximal score of intervals in I, call it M,
and let M* denote the score of the maximal interval in
Project goals
Out project addresses 2 issues:
Improving LookAhead algorithm:
The general concept:
In the original LA algorithm, the upper bound's computation is done for all the points. For each point_{}, this computation is set by the values from _{} to _{} only.
In the Improved LA Algorithm, the computation of the upper
limit is from each point to the last point. By this method, in the max interval
calculation, the skip _{} is not limited by any
value.
Preprocessing:
From each point_{}all the points from this point to the end are grouped into intervals, each interval size is _{} (except for the last interval).
In an empiric way, we selected _{}.
For each interval, the maximum point is computed using the_{} Algorithm.
All the points in a specific interval are considered to be the maximum value found inside. If the maximum is a negative value, it is set to be zero. Taking these new values set for each point, the slopes are computed.
The slope of a point _{}is equal or larger than the average of the values in any interval beginning with i. This is guaranteed because the use of an upper bounds for the point’s value.
The complexity of the _{} algorithm’s preprocessing is _{} and the maximum value’s finding in an interval, is_{}.
In the Improved LA algorithm’s preprocessing, for each point, no more than _{}calculations of max values are done. Therefore the complexity of the preprocessing is_{}.
Max
intervals calculation:
Same like the original algorithm, but the slope’s values are different.
The step size_{}, as computed in the original algorithm, is computed with the new slopes (_{}). Note that, in contrast to the LA Algorithm, _{} is not limited to any size.
An alternate option is to calculate the original slopes as well, and then the step size_{}, is computed twice. First _{} is set to the original slope and _{} is limited to_{}, then _{} is set to the new slope that was calculated in the Improved LA Preprocessing. The maximum_{} is selected, and the steps are set accordingly.
In the alternate option the performance in the preprocessing is not as good as the proposed option.
Range Maximum Query (_{}) Algorithm^{1}
For any array A of numbers, for indices _{} and _{} between _{} and_{}, query _{} returns the index and the value of the largest element in the sub array_{}.
The idea is to precompute each query whose length is a
power of two. That is, for every _{}
_
Between _{} and _{} and every _{} between _{} and _{}, maximum element is found in the block starting at _{} and having length _{}, that is, we compute _{}. Table _{}therefore has size_{}, and we fill it in time _{}by using dynamic programming. Specificaly, the maximum in a
block of size _{} is found by computing
the two maxima of its two constituent blocks of size _{}. More formally, _{} if _{} and _{} otherwise.
To avoid unnecessary iterations, only the first _{} iteration are done.
Computation of the _{}:
We select two overlapping blocks that entirely cover
the subrange: let _{} be the size of the
largest block that fits into the range from _{} to _{}, that is let _{}. Then _{}can be computed by comparing the maxima of the following two
blocks:_{} to _{}and _{} to _{}. These values have already been computed, so we can find the
_{}in constant time. The complexity_{}[2].
OneSide GFA:
OSGFA use is similar in concept to the original GFA, but while in each iteration in GFA we enlarge both interval length and spacing between intervals, in OSGFA we only enlarge the length. Formally:
We define I, a oneside geometric family of intervals:
Set e > 0.
For j = 0,1,…,n1: I(j) = { [ j , j + ( 1 + e)_{}^{i} – 1 ] : i = 0,1,…,log_{(1+}_{e)}(nj)
}
_{}
Lemma 1: Let I be an interval, and J – the leftmost longest interval of I contained in I. Then: _{}
Proof:
Suppose that the length of J is (1+e)^{j} (we consider the interval [i,i] as an interval with the length of 1). J is the leftmost longest interval contained in I, therefore _{}. Then:
_{}
è_{}
Lemma 2: Let I^{*} be the interval with the optimal score, and J be the leftmost longest interval of I contained in I^{*}. Then:
_{}, with _{}.
Proof:
Set _{}. Assume by contradiction that _{}. Denote I^{*} = [u,v] and J = [u,w]. define A = [w+1,v] (the interval that extends J to I^{*}).
_{}
_{} and _{} therefore:
_{}
According to our assumption, we now get:
_{}
è_{}
So our assumption was wrong, and indeed _{}.
From lemma 2 we get that we can use OSGFA in the same manner as GFA, but we can get the same approximation (same Alpha) in OSGFA by using a bigger e: _{}.
In
addition, the cover zone of interval [x,y] in OSGFA is smaller than it is in
GFA
( LCOV(x) = x and RCOV(y) = y + (yx+1)^{1+}^{e }
2 ).
Results
To test our suggestions we implemented the following:
 Naïve algorithm (exhaustive search).
 LookAhead algorithm.
 Improved LookAhead.
 GFA.
 OSGFA.
 Random input generator.
 Main program that produces a random input (or receives it as a file) and runs the selected algorithms.
Using the input generator we produced the following tests:
Size of input 
Number of amplifications 
Number of tests 
3000 
1 
100 
6000* 
2 
100 
12000 
3 
50 
24000 
4 
50 
48000 
5 
30 
96000 
6 
20 
192000 
7 
10 
* 4 tests of real data of about 6000 probes (Pollack)
were also tested and the times were similar
The random input was produced as follows:

Initial values of all points were drawn
from
 Input vector divided into 17 zones, depending on number of amplifications implanted, for each zone an amplification was added by:
o Drawing the amplification length from Geom(0.01), with a limitation that it must be at least 10 and no more than 20% of the zone length.
o Start point of the amplification in the zone is drawn uniformly.
o For each point in the amplification, we draw the amplification amplitude from Uniform(0.5,10).
 All of the parameters for the generation mentioned above are the default values in our program, but they can be changed by the user.
All algorithms were tested with the tests of 300012000 points, on larger tests we tried all except for the Naïve algorithm (takes too much time…).
* All running times refer to a PC with AMD Athlon XP 2500+ processor and 1GB RAM.
Graph 1: Average runtimes of LookAhead, improved LA and some key functions as reference. The start point values that were chosen for the functions:
X^{1} => (3000,0.5) – same as improved LookAhead.
X^{4/3} => (3000,0.5) – same as improved LookAhead.
X^{3/2} => (3000,0.3) – same as LookAhead.
Graph 2: Average runtimes of GFA, OSGFA and some key functions as
reference. The start point values that were chosen for the
functions:
X^{1 }(0.38)=> (3000,0.38) – same as GFA.
X^{1 }(0.47)=> (3000,0.47) – same as OSGFA.
X^{4/3} => (3000,0.47) – same as improved OSGFA.
Graph 3: Average runtimes for all tested algorithms and some key functions as reference, all start points for the functions were (3000,0.3) – same as LA:
Graph 4: runtime scatter for LA, improved LA, GFA and OSGFA:
24000 (50) 48000 (30) 96000 (20) 192000 (10)
The code:
The main program can:
 Create a random input file or use an existing one.
o The input file should consist of the probes values separated by any nondigit char(s).
 Run any algorithm (or some of them) from: naïve, LookAhead, improved LA, GFA & OSGFA.
Command line arguments for the main program:
Usage: stepgram.exe
input <file name> # name of file to read input from (or to
create if using produce_input)
#The input file
should consist of the probes values separated by any nondigit char(s)
[modes <commaseparated list of
modes>] # optional modes:
all,naive,la,imp_la,gfa,os_gfa
[produce_input <size of
input>] # will produce random input,
can be used with modes flag to run with these modes after generation of input
[gfa_ep <epsilon>] # epsilon for GFA, default is 0.1
[os_gfa_ep <epsilon>] # epsilon for oneside GFA, default is 0.25
[la_window <window size>] # window size for LookAhead, given as the
exponent of input size, default is 1/2
[imp_la_window <window
size>] # window size for improved
LookAhead, default is 1/3
The following flags are for random input
generation:
[ri_vm <mean>] # original values are generated using
[ri_vsd <sd>] #
defaults are: ri_vm = 0, ri_vsd = 0
[ri_agp <p>] #
the amplification length is determined by Geom(ri_agp), but must be:
[ri_aml <length>] #
larger than ri_aml and smaller than (input_size/ri_noa)*ri_matir
[ri_matir <ratio>] # defaults are: ri_agp = 0.01, ri_aml = 10,
ri_amatir = 0.2
[ri_amina <min>] # for each probe in the amplification
interval, amplitude is determined
[ri_amaxa <max>] # using UNIFORM(ri_amina, ri_amaxa), default
is (0.5,10)
[ri_noa <num>] # number of amplification to add.
[h] # display this help message.
Example for an output (messages begin with “I“ for info, “T“ for time measurements, “R“ for results, “W“ for warnings and “E“ for errors):
F:\TECHNION\project in
bioinformatics\submit>stepgram.exe produce_input 100000 input
"F:\TECHNION\project in bioinformatics\submit\random_input_for_example.txt"
modes la,imp_la,gfa,os_gfa ri_noa 5
Command line: stepgram.exe
produce_input 100000 input F:\TECHNION\project in
bioinformatics\submit\random_input_for_example.txt modes la,imp_la,gfa,os_gfa
ri_noa 5
##################################################################
##################################################################
I stepgram: starting
standart LookAhead algorithm
I LAPreProcess::LAPreProcess
 Total number of probes: 100000
T LAPreProcess::LAPreProcess
 LAPreProcess constructor: 20.375 seconds
R LookAhead::activate  max
interval: (17819,18231) score: 102.348
T LookAhead::activate:
32.093 seconds
R stepgram: max interval by
la algorithm: (17819,18231) score: 102.348
T
stepgram: la calculation: 52.468 seconds
##################################################################
##################################################################
##################################################################
I stepgram:  starting
improved LookAhead algorithm
I LAPreProcess::LAPreProcess
 Total number of probes: 100000
T LAPreProcess::LAPreProcess
 LAPreProcess constructor: 2.969 seconds
I
ImpLAPreProcess::ImpLAPreProcess  Total number of probes: 100000
T ImpLAPreProcess::ImpLAPreProcess
 ImpLAPreProcess constructor: 13.485 seconds
R ImpLookAhead::activate 
max interval: (17819,18231) score: 102.348
T ImpLookAhead::activate:
15.625 seconds
R stepgram: max interval by
improved la algorithm: (17819,18231) score: 102.348
T stepgram: improved la
calculation: 32.079 seconds
##################################################################
##################################################################
##################################################################
I stepgram: starting GFA
I LAPreProcess::LAPreProcess
 Total number of probes: 100000
T LAPreProcess::LAPreProcess
 LAPreProcess constructor: 2.953 seconds
I GfaBase::activate  Found
1015 candidate intervals... reduced to 1 after merge
T GfaBase::activate: 7.641
seconds
R LookAhead::activate  max
interval: (17819,18231) score: 102.348
T LookAhead::activate: 0.281
seconds
R stepgram: max interval by
GFA: (17819,18231) score: 102.348
T stepgram: GFA calculation:
12.453 seconds
##################################################################
##################################################################
##################################################################
I stepgram:  starting
OneSideGFA
I LAPreProcess::LAPreProcess
 Total number of probes: 100000
T LAPreProcess::LAPreProcess
 LAPreProcess constructor: 2.875 seconds
I GfaBase::activate  Found
11767 candidate intervals... reduced to 2 after merge
T GfaBase::activate: 11
seconds
R LookAhead::activate  max
interval: (17819,18231) score: 102.348
T LookAhead::activate: 0.188
seconds
R stepgram:  max interval
by OneSideGFA: (17819,18231) score: 102.348
T stepgram:  OneSideGFA
calculation: 15.391 seconds
##################################################################
##################################################################
Run the Program
References:
1. Doron Lipson, Yonatan Aumann, Amir BenDor, Nathan Linial and Zohar Yakhini, "Efficient Calculation of Interval Scores for DNA Copy Number Data Analysis". Proceedings of RECOMB '05, LNCS 3500, p. 83, SpringerVerlag, 2005.