Supplementary material

 

 

 

 

S1 Flip Simulations

 

Tree edit distance algorithms for computing RSS similarities use three operations on the tree nodes: insertion, deletion and substitution [13, 15]. An analogy can be made between these operations and events that occur on the primary sequence level, i.e. mutations that cause deletion of nucleotides, insertion of nucleotides or substitution of nucleotides. These mutations induce secondary structure changes that can be picked up by the OLT comparison model, e.g. a deletion of consecutive nucleotides might ‘delete’ a loop and/or stem from the RSS. Another event that occurs on the primary structure level is inversion. We hypothesize that the flip enabled model can capture this type of event.

We were unable to find experimentally verified secondary structures in the literature that contain such inversions for testing the model.

Therefore, in order to test whether the flip enabled model can in principle detect changes in secondary structure induced by inversions on the sequence level we performed the following of simulation: the first 747 nucleotides from the human coxsackie virus IRES 5’ (M16572) were extracted [30]. An artificial inversion of the nucleotides between positions 243 to 439 was made. The two sequences, the original and the one containing the inversion were folded into their predicted RSS using minimum free energy folding algorithm [23] (see Figure S2).  The two structures were given as input to the SimTree method and the similarity score between them was computed using one flip. Notice that the model does not have any pre-knowledge regarding the two structures; it does not know if an inversion occurred or where it occurred. It flips a substructure only if the flip at that location yields the maximal similarity score. The indices of the nucleotides that reside in the flipped substructure are then returned. In this instance, nucleotides 273 to 411 were returned. This means that the flipped secondary substructure overlapped and was entirely contained in the inverted primary sequence region, indicating that inversions on the sequence level induce secondary structure changes that can be picked up by the flip enabled model.

We repeated the experiment 10 times, each time creating an inversion at different locations on different viral IRESes. In all instances the inverted subsequence and the predicted flip in the secondary substructure overlapped. Furthermore, their boundaries were closer than then 40 nucleotides in all cases.

Overall, by computing the flipping location that yields maximal RSS similarity, the method was able to accurately trace back sequence level inversions based solely on secondary structure information.

 

 

Figure S1.

 

Two secondary structures corresponding to two inverted sequences. The structures were generated using the minimum fold energy model [22]. It can be seen that the secondary structures are highly similar mirror reflections of one another.

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

Figure S2.

 

An artificial inversion of the subsequence between nucleotide positions 243 to 439 in the human coxsackie virus IRES 5’ was introduced. The predicted secondary structure of the original sequence and the one containing the inverted sub-sequence are given in (a) and (b) respectively. The inversion on the primary sequence caused a secondary substructure to change into its approximate mirror reflection. By enabling a flip SimTree was able to detect this occurrence and map corresponding regions a1 to b1, a2 to b2 and a3 to b3.

 

Table S1.

 

A  list of the 164 organisms in the RNase P RNA secondary structure dataset.

 

 

Archaea

 

 

 

 

 

Crenarchaea

 

 

1

 

 

A.ambivalens

2

 

 

A.brierleyi

 

3

 

 

A.pernix

 

4

 

 

M.sedula

 

5

 

 

S.acidocaldarius

6

 

 

S.shibatae

7

 

 

S.solfataricus

 

 

Euryarchaea

 

 

8

 

 

A.fulgidus

 

9

 

 

H.cutirubrum

10

 

 

H.morrhuae

11

 

 

H.trapanicum

12

 

 

H.volcanii

 

13

 

 

M.barkeri

 

14

 

 

M.fervidus

 

15

 

 

M.formicicum

16

 

 

M.jannaschii

17

 

 

M.maripaludus

18

 

 

M.thermoautotrophicum-DH

19

 

 

M.thermolithotrophicus

20

 

 

M.vannielli

21

 

 

N.gregoryi

 

22

 

 

P.abyssi

 

23

 

 

P.horikoshi

24

 

 

SL2D

 

25

 

 

T.celerAL1

 

Bacteria

 

 

 

 

 

Gram-positive Bacteria

 

 

 

 

 

High GC

 

26

 

 

 

A.erythreum

27

 

 

 

C.diphtheriae

28

 

 

 

F.antarctica

29

 

 

 

I.calvum

30

 

 

 

L.japonicus-IFO15385

31

 

 

 

M.avium

32

 

 

 

M.bovis

33

 

 

 

M.leprae

34

 

 

 

M.luteus

35

 

 

 

M.phosphovorus-JCM9380

36

 

 

 

M.tuberculosis-g

37

 

 

 

M.tuberculosis

38

 

 

 

N.flavus

39

 

 

 

N.jensenii

40

 

 

 

N.luteus

41

 

 

 

Nocardiodes-ATCC39419

42

 

 

 

N.plantarum

43

 

 

 

N.pyridinolyticus

44

 

 

 

N.simplex

45

 

 

 

P.innocua

46

 

 

 

S.bikiniensis

47

 

 

 

S.caesia-K76T

48

 

 

 

S.cyanea-K168T

49

 

 

 

S.glauca-L169T

50

 

 

 

S.lividans

51

 

 

 

S.polymorpha

52

 

 

 

S.viridis-K73T

 

 

 

Low GC

 

53

 

 

 

A.laidlawii

54

 

 

 

B.anthracis

55

 

 

 

B.brevis

56

 

 

 

B.halodurans

57

 

 

 

B.megaterium

58

 

 

 

B.stearothermophilus

59

 

 

 

B.subtilis

60

 

 

 

C.acetobutylicum

61

 

 

 

C.difficile

62

 

 

 

C.hydrogenoformans

63

 

 

 

C.innocuum

64

 

 

 

C.sporogenes

65

 

 

 

E.faecalis

66

 

 

 

E.rhusiopathiae

67

 

 

 

E.thermomarinus

68

 

 

 

H.chlorum

69

 

 

 

M.capricolum

70

 

 

 

M.fermentans

71

 

 

 

M.flocculare

72

 

 

 

M.genitalium

73

 

 

 

M.hyopneumoniae

74

 

 

 

S.bovis

75

 

 

 

S.epidermidis

76

 

 

 

S.gordonii

77

 

 

 

U.urealyticum

 

 

Purple Bacteria

 

 

 

 

 

Alpha

 

78

 

 

 

A.tumefaciens

79

 

 

 

ESH212C

80

 

 

 

LGB41

81

 

 

 

LGW113

82

 

 

 

PS26

83

 

 

 

PS2

84

 

 

 

R.capsulatus

85

 

 

 

R.palustris

86

 

 

 

R.rubrum

87

 

 

 

Wolbachia-sp

 

 

 

Beta

 

88

 

 

 

A.eutrophus

89

 

 

 

C.testosteroni

90

 

 

 

ESH167F

91

 

 

 

ESH183D

92

 

 

 

ESH26-4

93

 

 

 

Mxa1

94

 

 

 

N.meningitidis-Z2491

95

 

 

 

T.thioparus

96

 

 

 

vB11

 

 

 

Gamma

 

97

 

 

 

A.ferrooxidans

98

 

 

 

Buchnera-APS

99

 

 

 

C.vinosum

100

 

 

 

E.agglomerulans

101

 

 

 

E.coli

102

 

 

 

H.influenza

103

 

 

 

K.pneumoniae

104

 

 

 

L.adecarboxylata

105

 

 

 

P.alcalifaciens

106

 

 

 

V.cholera

 

 

 

Delta

 

107

 

 

 

D.desulfuricans

108

 

 

 

D.vulgaris

109

 

 

 

G.sulfurreducens

110

 

 

 

M.xanthus

 

 

 

Epsilon

 

111

 

 

 

C.jejuni

112

 

 

 

H.pylori-26695

 

 

Bacteroides and Relatives

 

 

113

 

 

B.thetaiotaomicron

114

 

 

ESH30-3

 

115

 

 

F.yabuuchiae

116

 

 

LGB27

 

117

 

 

L.weilii

 

118

 

 

P.gingivalis

119

 

 

T.pallidum

 

 

 

Chlamydia

 

 

120

 

 

C.abortus-B577

121

 

 

C.caviae-G

122

 

 

C.felis-FP

 

123

 

 

C.muridarum-MoPn

124

 

 

C.pecorum-H3

125

 

 

C.pneumoniae

126

 

 

C.psittaci

 

127

 

 

C.suis-S45

128

 

 

C.trachomatis

129

 

 

P.acanthamoebae

130

 

 

S.negevensis

 

 

thermolithotrophicus

 

 

131

 

 

Anabaena-ATCC29413

132

 

 

Anabaena-PCC7120

133

 

 

A.nidulans

134

 

 

Calothrix-PCC7601

135

 

 

Dermocarpa-PCC7437

136

 

 

Fischerella-UTEX1829

137

 

 

Nostoc-PCC7107

138

 

 

Oscillatoria-PCC7515

139

 

 

P.hollandica

140

 

 

P.marinus

 

141

 

 

Prochlorococcus-TATL2

142

 

 

Pseudoanabaena-PCC6903

143

 

 

Synechococcus-PCC7001

 

 

Spirochaete

 

 

144

 

 

B.burgdorferi

145

 

 

B.hermsii

 

146

 

 

L.borgpetersenii

 

Eukaryotes

 

 

 

 

 

mitochondrion

 

 

147

 

 

R.americana-mito

 

 

nucleus

 

 

148

 

 

A.obstetricans

149

 

 

A.telluris

 

150

 

 

A.truei

 

151

 

 

C.lusitaniae

152

 

 

E.eschsholtzii

153

 

 

G.gorilla

 

154

 

 

K.polysporus

155

 

 

L.muta-1

 

156

 

 

M.musculus

157

 

 

P.pygmaeus

158

 

 

S.carlsbergensis

159

 

 

S.cerevisiae

160

 

 

T.syrichta

 

161

 

 

X.laevis

 

 

 

plastid

 

 

162

 

 

Cadoxa-cyanelle

163

 

 

N.olivacea-chloroplast

164

 

 

P.purpurea-chloroplast

 

 

Transformation Algorithm

The SimTree method first receives an RNA secondary structure in parenthesis notation and transforms it into a labeled tree representation.

Input: a tree in a parenthesis representation.

Output: a tree object.

Time complexity: O(n) where n is the length of the input sequence.

Assumptions and notations:

1.The parenthesis representation is correct i.e. it represents a viable RNA secondary structure.

2.                        Each node v in the tree contains the following data:

a.                   The type of the node, where type(v) є {‘multi loop’, ‘stem’}

b.                  The node complexity which is defined as:

     

n_bp is number of base pairs in the stem. (n_bp, n_fork) are number of nucleotides in a multi-loop and number of stems branching out of a multi-loop. Other high level structural subunits can now be composed, for instance:

A loop: Type(v) = multi-loop and Complexity(v) = (n_bp,1).

A bulge: Type(v) = multi-loop and Complexity(v) = ({1 or 2},2).

3.                        All trees are rooted at a unique node, notated as ‘Root’, regardless of their topology.

 

Algorithm description:

The algorithm has three major components:

First, the input parenthesis sequence is traversed upon and transformed into a list of triples where each triple is composed of the following fields: [shape, s, index].  Shape , s represents the number of sequential instances of the shape and an index is the position of the shape in the sequence. For example see figure 2.

Figure 2

 
 

 

 

 

 

 

 

 

 

 

 


The time complexity of the above component is O(n)  as it performs a single traverse over the input parenthesis sequence.

 

The second component receives the parenthesis sequence transformation from the first component and returns a stack containing the tree’s node in a pre-order traversal.

This component employs two stacks:

1.      SOP - contains the open parenthesis objects.

2.      SD - contains the dot objects.

 

 

The algorithm determines the nature of a node by counting the number of branches from each node and thus, determining the kind of multi-loop (stems are being determined by complementary parenthesis). The method exploits the observation that when it finds a stem, a multi-loop must precede it. Hence, it pops and counts the number of dot objects in the dots stack and determines the number of branches the multi-loop has. A pseudo dot object is needed in order to count correctly multi-loops with a fork where there are no separating bases between the two branches. A schematic explanation is given in figure 3.

This method has a time complexity of O(n) since in worse case it performs n operations on both stacks.

 

                                                            Figure 3

When a ')' is encountered all objects in the '.' stack that have greater index than the object on the top of the '(' stack are popped. The number of popped objects determines the degree of the Multi-Loop.

 
 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 


The third method takes the stack that contains the nodes in a pre-order from the second component and builds the corresponding tree.

 

It performs one scan of the stack and hence has a time complexity of O(n).