
Chimeric sequences

 
chimeric and parental sequences

  

Looking for cyanobacterial phylogenetic trees? Visit cyanophylogeny.scienceontheweb.net/

 

 INTRODUCTION

 The 16S rRNA sequence alignment, maintained in ARB and containing over 4000 sequences, is an essential part of CyanoPhy for the analysis of cyanobacterial phylogeny. New sequences are downloaded from NCBI weekly to keep the database current. A surprising number have been identified as chimeric by the various detection methods available in CyanoPhy. An example is shown in the figure above, the breakpoint being indicated by the arrow.

Chimeric sequences, composed of parts of two or more true parental sequences, are most likely the result of PCR errors. Most are believed to arise following incomplete extension during a PCR cycle: the partially extended strand may bind to a template derived from a different sequence and then act as a primer, which is extended and amplified during succeeding cycles to produce the chimeric sequence. A more complete description of this process is given on the UCHIME home page.
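As a toy illustration of the outcome described above (not a model of PCR itself), the sketch below joins the 5' part of one aligned parent to the 3' part of another at a chosen breakpoint. The function names, the 20-nt sequences and the breakpoint position are all invented for the example.

```python
def make_chimera(parent_a: str, parent_b: str, breakpoint: int) -> str:
    """Join the 5' part of parent_a to the 3' part of parent_b.

    Both parents are assumed to be aligned and of equal length, so that
    position `breakpoint` refers to the same site in each.
    """
    if len(parent_a) != len(parent_b):
        raise ValueError("parents must be aligned to the same length")
    if not 0 <= breakpoint <= len(parent_a):
        raise ValueError("breakpoint outside the sequence")
    return parent_a[:breakpoint] + parent_b[breakpoint:]


def identity(seq_x: str, seq_y: str) -> float:
    """Percentage of identical positions between two aligned sequences."""
    matches = sum(x == y for x, y in zip(seq_x, seq_y))
    return 100.0 * matches / len(seq_x)


# Hypothetical 20-nt parents, for illustration only.
parent_a = "ACGTACGTACGTACGTACGT"
parent_b = "ACGAACGTTCGTACCTACGA"
chimera = make_chimera(parent_a, parent_b, 10)

# Each half of the chimera matches one parent perfectly, while the
# full-length identity to either parent is lower.
print(identity(chimera[:10], parent_a[:10]), identity(chimera, parent_a))
print(identity(chimera[10:], parent_b[10:]), identity(chimera, parent_b))
```

The same pattern, high identity to the assembled parents but lower identity to either full-length parent, is what reference-based detection methods look for.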

RESULTS AND CONCLUSIONS

Chimera formation is unlikely to be a problem if the organism used for sequencing is in the axenic state, such as strains from the Pasteur Culture Collection of Cyanobacteria (PCC). However, chimeric sequences may be formed in PCR reactions following contamination of such strains in the laboratory, or if non-axenic cultures, offered by most Culture Collections with cyanobacterial holdings, are employed. Another source of chimeric sequences is a culture thought to be clonal (containing a single cyanobacterium) but in fact composed of several cyanobacterial representatives. Finally, as evidenced by the results presented below, environmental samples carry a high risk of chimera formation since they contain several to many different organisms.

Chimeric sequences are difficult to detect and to distinguish from real biological sequences. It is nevertheless essential to exclude them from studies of cyanobacterial phylogeny, because their inclusion can distort the results, causing misplacement of other organisms or clades in the phylogenetic tree and giving the impression that genetic diversity is wider than it really is. In designing CyanoPhy I tested all chimera-detection programmes available, and continue to do so as new detection methods are published. A list of programmes is available here.

These tests revealed two major problems:

1)     Over 60 % of the sites are identical in all sequences of the cyanobacterial alignment and thus tend to mask recombination events; cyanobacterial chimeric sequences therefore escape detection methods such as Pintail, often used to verify the quality of sequences in the public databases.
2)     If the database used for sequence comparison contains few cyanobacterial representatives, a chimeric sequence will not be detected because one or more of the "parental" sequences are absent; this is the case for the Bellerophon server, again frequently employed for the control of sequence quality in the public databases.
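The first problem can be quantified for any alignment by counting the columns in which every sequence carries the same character. The following is a minimal sketch under stated assumptions: the function name and the toy alignment are hypothetical, and gap characters are simply treated like any other symbol.

```python
def invariant_fraction(alignment: list[str]) -> float:
    """Fraction of alignment columns in which all sequences are identical."""
    if not alignment:
        raise ValueError("alignment is empty")
    length = len(alignment[0])
    if any(len(seq) != length for seq in alignment):
        raise ValueError("all sequences must have the same aligned length")
    # zip(*alignment) yields one tuple of characters per column.
    invariant = sum(1 for column in zip(*alignment) if len(set(column)) == 1)
    return invariant / length


# Hypothetical 10-column alignment: columns 5 and 8 vary, the rest do not.
toy_alignment = ["ACGTACGTAC", "ACGTACGAAC", "ACGTTCGTAC"]
print(f"{100 * invariant_fraction(toy_alignment):.0f} % invariant sites")
```

The higher this fraction, the fewer sites remain to discriminate between candidate parents on either side of a breakpoint.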

 Of the programmes that do not employ a sequence database, Mallard appears to give fairly reliable results, but should not be used with more than about 200 sequences per file. LARD (unfortunately discontinued) performs well in localizing the position of a breakpoint - but must be supplied with the chimera and the two potential parental sequences. The Decipher web server has a good database with many cyanobacterial sequences, is rapid and performs reasonably well, but unfortunately depends on the RDP classifier, sadly out of date for Cyanobacteria. SplitsTree4 often gives an unambiguous result in the splits graphs.

I have retained LARD, Pintail, Mallard and SplitsTree4 in CyanoPhy, and added UCHIME. The latter uses a FASTA-formatted database to which I have added over 1000 cyanobacterial 16S rRNA sequences of good length and quality. The standalone version of Decipher has the same problem as the web server and has not been retained.
 
Some cyanobacterial chimeric sequences were initially identified by chance from the marked effect their inclusion has on tree topology. Thus, inclusion of AphaNH5 (Aphanizomenon strain NH5, accession AF425995) causes a major change in the position of the heterocystous Nodularia group, and Ukia101a (Umezakia natans strain TAC101, accession AF516748) may, depending on the choice of other taxa included, cause the entire heterocystous clade to move from the top to the root of the tree when Rhodopseudomonas is included in the outgroup. In contrast, the second sequences of strain NH5 (accession AY196086) and strain TAC101 (accession AY897614), and the sequence of Umezakia natans strain TAC661 (accession AB608023), are not chimeric and behave normally in the tree.

The best detection method for chimeric sequences appears to be UCHIME. I have built a shell script around this programme that first lists and removes sets of identical sequences (if any) from the query infile, then runs UCHIME on the remainder, checks the positive chimeric sequences against the list of identical sequences, reformats the statistical results, adds a list of the identical sequences in the original infile, and finally prints all results to file. The sequences UCHIME has identified as chimeric are included in the Table below; this list should be regarded as tentative, pending further study. A more complicated (and time-consuming) method is to verify potential chimeras with NCBI BLAST: briefly, the sequence is cut into two or more fragments, each of which is subjected to BLAST analysis on NCBI. This has been done for only a few sequences, identified from their effects on tree topology. Note that the local CyanoPhy BLAST database can be queried in place of NCBI.
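The shell script itself is not reproduced on this page. As a rough Python sketch of two of its steps only (removing sets of identical sequences, then checking the sequences reported as chimeric against those sets), one might write something like the following. All function names are hypothetical, the UCHIME run is deliberately left out, and the sequences shown are placeholders rather than real data.

```python
from collections import defaultdict


def split_identical(
    sequences: dict[str, str],
) -> tuple[dict[str, str], dict[str, list[str]]]:
    """Keep one representative per set of identical sequences.

    Returns (unique, duplicates): `unique` maps each retained name to its
    sequence; `duplicates` maps a retained name to the names removed as
    identical to it.
    """
    groups: dict[str, list[str]] = defaultdict(list)
    for name, seq in sequences.items():
        groups[seq.upper()].append(name)

    unique: dict[str, str] = {}
    duplicates: dict[str, list[str]] = {}
    for seq, names in groups.items():
        keeper, *others = names  # the first name seen represents the set
        unique[keeper] = seq
        if others:
            duplicates[keeper] = others
    return unique, duplicates


def annotate_hits(
    hits: list[str], duplicates: dict[str, list[str]]
) -> dict[str, list[str]]:
    """For each sequence reported as chimeric, list its identical counterparts."""
    return {name: duplicates.get(name, []) for name in hits}


# Placeholder sequences; only the names are taken from the Table.
infile = {"PlaH1130": "ACGTACGT", "PlaH662": "acgtacgt", "Other": "ACGTACGA"}
unique, duplicates = split_identical(infile)
print(annotate_hits(["PlaH1130"], duplicates))  # {'PlaH1130': ['PlaH662']}
```

Removing identical sequences before the run avoids testing the same sequence repeatedly, and the cross-check ensures that counterparts of a chimeric sequence are not overlooked afterwards.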

All chimeric sequences found by the detection methods available in CyanoPhy are given in the Table below, and should be excluded from sequence sets used for building trees. They have been left in the ARB database only for informational purposes. 

The Table is followed by several examples of UCHIME output, automatically reformatted by the shell script. Note that the 5-prime and 3-prime "parental sequences" detected by UCHIME are not necessarily the true parents of the chimeric sequences, but their nearest relatives found in the UCHIME database. 

Also shown are four examples of analysis with SplitsTree4. These demonstrate the unusual position of the chimeric sequence Ukia101a, which falls between the phyla Cyanobacteria and Proteobacteria, the anomalous position of the chimeric sequence DspN78 between the heterocystous and Pseudanabaena clades, the anomalous behaviour of the chimeric sequence AphaNH5 within the heterocystous clade, and that of PlaH1128 within the genus Planktothrix.


Chimeric sequences identified by UCHIME, Mallard, Decipher

 

These sequences are listed in the Table below.

Only cyanobacterial sequences from ARB >1249 nt in length and with <6 ambiguous sites were examined. Mallard results are given at support value p=0.05; breakpoints are from UCHIME output.
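The selection criteria above can be expressed as a simple filter. This is a sketch only: it assumes that "ambiguous sites" means IUPAC ambiguity codes (anything other than A, C, G, T or U) and that alignment gap characters are ignored when measuring length; the function name and defaults are hypothetical.

```python
# IUPAC nucleotide ambiguity codes (assumed definition of "ambiguous site").
AMBIGUITY_CODES = set("RYSWKMBDHVN")


def passes_filter(
    sequence: str, min_length: int = 1250, max_ambiguous: int = 5
) -> bool:
    """True if the sequence is >1249 nt long and has <6 ambiguous sites."""
    # Strip alignment gap characters before counting residues.
    residues = sequence.upper().replace("-", "").replace(".", "")
    ambiguous = sum(1 for base in residues if base in AMBIGUITY_CODES)
    return len(residues) >= min_length and ambiguous <= max_ambiguous
```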

List updated 24.05.2014. Of the 4340 total sequences in the CyanoPhy ARB database, 3919 were >1249 nt in length and had <6 ambiguous sites. These contained 731 identical sets, which were removed. The remaining 3188 sequences were checked for potential chimeric origin using the tools available in CyanoPhy. A total of 64 were found to be real (indicated by "+" in the Table below) or possible ("(+)") chimeric sequences, corresponding to a frequency of 2.01 % among the 3188 unique cyanobacterial sequences. Of these, 36 (1.35 %) are among the 2657 unique sequences from cultured isolates, increasing markedly to 5.27 % (28 chimeric sequences) among the 531 uncultured samples included.
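The frequencies quoted above follow directly from the counts given in the same paragraph. A short check, using only those counts (the helper name is arbitrary):

```python
def percent(part: int, whole: int) -> float:
    """Percentage of `part` in `whole`, rounded to two decimals."""
    return round(100 * part / whole, 2)


total_unique = 3919 - 731        # sequences left after removing identical sets
cultured, uncultured = 2657, 531
assert cultured + uncultured == total_unique

print(percent(64, total_unique))  # 2.01
print(percent(36, cultured))      # 1.35
print(percent(28, uncultured))    # 5.27
```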

Three sequences described in the Table have one or more identical counterparts in the 726 sequence dataset:

PlaH1128 (Planktothrix HAB1128, FJ184439) is identical to PlaH417 (Planktothrix HAB417, FJ184441), PlaH1347 (Planktothrix HAB1347, FJ184442) and PlaH1379 (Planktothrix HAB1379, FJ184443);

LepCYN83 (Leptolyngbya CYN83, JF925321) is identical to LepCYN87 (Leptolyngbya CYN87, JF925322), and LepCYN95 (Leptolyngbya CYN95, JF925323);

PlaH1130 (Planktothrix HAB1130, FJ184437) is identical to PlaH662 (Planktothrix HAB662, FJ184438).

Chimeric sequence name, Genus and strain  NCBI accession  UCHIME  Mallard  Decipher (web)  Breakpoint
Aph27S12 Aphanizomenon 2LT27S12            FM177484        -       (+)      -
AphaNH5 Aphanizomenon NH-5                 AF425995        +       -        -               667-751
ArthSp12 Arthrospira Sp-12                 EF432315        +       -        +               316-317
CalAsk3 Calothrix Asko 3                   FJ661007        -       -        +
CalCY100 Calothrix CYN100                  JF925326        +       +        +               734-736
Calo3363 Calothrix CAL3363                 AM230684        +       -        -               395-486
ChrcCC4 Chroococcidiopsis CC4              DQ914866        +       +        +               1077-1087
Cop26206 Cylindrospermopsis PMC262.06      GQ859604        +       (+)      -               838-839
Dsp525 Dolichospermum NRC525-17            AF247597        +       +        -               668-750
DspC202 Dolichospermum CENA202             FJ830575        +       +        +               277-285
DspN78 Dolichospermum NIES 78              AF317627        +       +        +               658-717
Jgm25205 Jaaginema PMC252.05               GQ859646        +       +        -               755-839
LepCYN83 Leptolyngbya CYN83                JF925321        +       +        -               686-721
Lim27206 Limnothrix PMC272.06              GQ859647        +       (+)      -               748-753
Mer26006 Merismopedia PMC260.06            GQ859638        +       (+)      -               711-839
Pho27406 Phormidium PMC274.06              GQ859648        +       (+)      -               755-939
PhorM221 Phormidium M-221                  AB003165        -       +        -
PhorM71 Phormidium M-71                    AB003167        +       +        -               701-715
PlaH1128 Planktothrix HAB1128              FJ184439        (+)     +        -               757-890
PlaH1130 Planktothrix HAB1130              FJ184437        (+)     -        -               543-621
PlaH1131 Planktothrix HAB1131              FJ184440        (+)     +        -               757-890
PluA230 Pleurocapsa HA4302 clone A         KC525078        +       (+)      -               277-386
Plu2B230 Pleurocapsa HA4230 clone 2B       KC525080        +       (+)      -               275-378
Plu2C230 Pleurocapsa HA4230 clone 2C       KC525081        +       (+)      -               275-378
PsaCA530 Pseudanabaena CAWBG530            JX088101        +       +        +               700-708
PsaCA531 Pseudanabaena CAWBG531            JX088102        +       +        +               704-709
Riv5PA11 Rivularia 5PA11                   FJ660987        +       +        +               1063-1064
Riv5PA13 Rivularia 5PA13                   FJ660986        +       (+)      (+)             859-933
Symp642a Symploca VP642a                   AY032932        -       (+)      -
Ukia101a Umezakia TAC101                   AF516748        +       +        +               706-712
UCY0706f uncultured clone B10706F          HQ189065        -       (+)      +
UCY0808a uncultured clone B10808A          HQ189019        +       -        -               739-915
UCY0811f uncultured clone B10811F          HQ189025        +       -        -               191-201
UCY5B150 uncultured clone ART5B_150        JF303683        +       +        -               965-1046
UCYay620 uncultured clone AY6_20           FJ891050        +       +        +               498-505
UCYB364 uncultured clone LB3-64            AF076160        +       (+)      -               746-784
UCYfr032 uncultured clone Fr032            AY151725        +       (+)      -               565-582
UCYfr048 uncultured clone Fr048            AY151726        +       -        -               1166-1170
UCYfr094 uncultured clone Fr094            AY151727        +       (+)      -               678-740
UCYfr297 uncultured clone Fr297            AY151733        +       +        -               295-296
UCYfr304 uncultured clone Fr304            AY151734        +       (+)      -               827-909
UCYfr313 uncultured clone FrE313           AY151736        +       +        +               865-880
UCYh124 uncultured clone H124              HG917263        +       nd       +               726-733
UCYha106 uncultured clone HAVOmat106       EF032780        -       (+)      -
UCYha128 uncultured clone HAVOmat128       EF032783        +       -        -               412-419
UCYP256 uncultured clone FBP256            AY250870        -       -        -
UCYP290 uncultured clone FBP290            AY250874        +       +        +               926-927
UCYP403 uncultured clone FBP403            AY250881        +       +        +               1082-1089
UCYP1113 uncultured clone SWMP11-13        JX006094        +       +        -               218-318
UCYP1114 uncultured clone SWMP11-14        JX006095        +       -        (+)             226-298
UCYSco1 uncultured 1                       DQ131175        +       (+)      +               430-433
UCYuhas5 uncultured clone UHAS5.42         N037922         +       (+)      +               1124-1125
UCYxj112 uncultured clone sw-xj112         GQ302543        +       +        -               896-897
UCYxj279 uncultured clone sw-xj279         GQ302545        +       +        +               903-904
UIDH114 unidentified HI14                  FJ660994        +       -        -               800-805
UIDHI15 unidentified HI15                  FJ660993        -       +        -
UIDtB301 unidentified tBTRCCn 301          DQ471441        +       +        -               737-747
UIDtB302 unidentified tBTRCCn 302          DQ471445        +       -        -               728-751



Limited BLAST studies (see Table) confirmed the chimeric nature of sequences AphaNH5, ChrcCC4, DspN78 and Ukia101a, and suggested 6 additional strains, not found by the other methods:
Mcys42 Microcystis NIES 42, U40335
Mcys43 Microcystis NIES 43, U40336
PluP302b Pleurocapsa VP3-02b, FR798929
PluVP302 Pleurocapsa VP3-02, FR798927
PluVP407 Pleurocapsa VP4-07, FR798930
ScocIR11 Synechococcus IR11, AF448079
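
For the BLAST approach, the sequence must first be cut into fragments that are submitted separately. The sketch below shows one possible way of doing this when a breakpoint region is already suspected. The function name is hypothetical, and leaving the ambiguous breakpoint region out of both fragments is an assumption made here, not a documented part of the procedure.

```python
def split_at_breakpoint(sequence: str, start: int, end: int) -> tuple[str, str]:
    """Return the 5' and 3' fragments flanking a breakpoint region.

    `start` and `end` are 1-based positions of the breakpoint region, in
    the style of the Breakpoint column; the region itself is left out.
    """
    if not 1 <= start <= end <= len(sequence):
        raise ValueError("breakpoint region outside the sequence")
    five_prime = sequence[: start - 1]   # residues 1 .. start-1
    three_prime = sequence[end:]         # residues end+1 .. last
    return five_prime, three_prime


# Toy 15-nt sequence with a breakpoint region at positions 6-10.
left, right = split_at_breakpoint("AAAAACCCCCGGGGG", 6, 10)
print(left, right)  # AAAAA GGGGG
```

Each fragment can then be used as a separate BLAST query; fragments whose best hits fall in different groups support a chimeric origin.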

Examples of UCHIME output for chimeric sequences (extracted from the full listing):


Chimeric (Query) sequences (55) found by UCHIME in
/usr/local/uchime/cyano-0710.fasw
and taxonomic information for their 5' (*****) and 3' (***) parents.

Statistics show: qAB, % identity of the query to the assembled parents, and to
the full length 5'- (qAt) and 3'- (qBt) parents; pAB, % identity of the parents;
total discriminatory sites left (tL) and right (tR) of the breakpoint and their sum (tT); number of discriminatory sites supporting the model (left, pL; right, pR; sum, pT); YES/yes, strong/weak evidence for the chimera.
Breakpoint region: residues conserved in chimera and parents may hinder precise localization.


Query: AphaNH5
*****  NodB9427  Nodularia BCNOD9427 Section 4 AJ224447
***    Ana27S03  Anabaena 1LT27S03 Section 4 FM177478
                         Statistics:  qAB: 99.9   qAt: 97.3   qBt: 97.0   pAB: 94.7 
                 tL: 40    tR: 39    tT: 79    pL: 39    pR: 35    pT: 74      YES
                         Breakpoint region: 667 to 751 of chimera (1427 nt)
--------
Query: ChrcCC4
*****  FiscN592  Fischerella major NIES 592 Section 5 AB093487
***    S000446569  Ochrobactrum grignonense (T); type strain:OgA9a; AJ242581 (Bacteria)
                         Statistics:  qAB: 99.4   qAt: 94.3   qBt: 83.2   pAB: 78.0 
                 tL: 243   tR: 72    tT: 315   pL: 233   pR: 71    pT: 304     YES
                         Breakpoint region: 1077 to 1087 of chimera (1448 nt)
--------
Query: DspC202
*****  S000643548  Flavobacterium aquatile (T); type strain: DSM 1132; AM230485 (Bacteria)
***    DspC207  Dolichospermum crassum CENA207 Section 4 FJ830578
                         Statistics:  qAB: 99.1   qAt: 79.7   qBt: 93.5   pAB: 74.3 
                 tL: 88    tR: 273   tT: 361   pL: 79    pR: 269   pT: 348     YES
                         Breakpoint region: 277 to 285 of chimera (1436 nt)
--------
Query: DspN78
*****  PsaBRG53  Pseudanabaena ABRG5-3 Section 3 AB527076
***    DspN80  Dolichospermum solitarium NIES 80 Section 4 AF247594
                         Statistics:  qAB: 97.9   qAt: 92.7   qBt: 92.6   pAB: 86.7 
                 tL: 118   tR: 71    tT: 189   pL: 89    pR: 69    pT: 158     YES
                         Breakpoint region: 658 to 717 of chimera (1369 nt)
--------
Query: PlaH1128
*****  Plan7S08  Planktothrix agardhii 1LT27S08 Section 3 AJ635435
***    Pla34S02  Planktothrix pseudagardhii 2LT34S02 Section 3 FM177501
                         Statistics:  qAB: 100.0  qAt: 97.8   qBt: 97.7   pAB: 95.4 
                 tL: 32    tR: 31    tT: 63    pL: 32    pR: 30    pT: 62      yes
                         Breakpoint region: 757 to 890 of chimera (1382 nt)
--------
Query: PsaCA530
*****  LimCC29  Limnothrix redekei CCAP 1459/29 Section 3 HE974998
***    S000129976  Stenotrophomonas rhizophila (T); e-p10; AJ293463 (Bacteria)
                         Statistics:  qAB: 99.1   qAt: 91.3   qBt: 89.2   pAB: 82.1 
                 tL: 145   tR: 119   tT: 264   pL: 140   pR: 109   pT: 249     YES
                         Breakpoint region: 700 to 708 of chimera (1400 nt)
--------
Query: Ukia101a
*****  Aph27S04  Aphanizomenon ovalisporum 1LT27S04 Section 4 FM177485
***    S000498560  Rhodopseudomonas faecalis (T); gc; AF123085 (Bacteria)
                         Statistics:  qAB: 98.5   qAt: 87.8   qBt: 87.7   pAB: 76.9 
                 tL: 161   tR: 171   tT: 332   pL: 154   pR: 151   pT: 305     YES
                         Breakpoint region: 706 to 712 of chimera (1418 nt)
--------

Note: one or more sets of identical sequences were removed (see
/usr/local/uchime/cyano-0710-711.deleted).
Those (if any) identical to the queries are shown below within square brackets:

LepCYN83   [ LepCYN87 LepCYN95 ]
PlaH1128   [ PlaH1347 PlaH1379 PlaH417 ]
PlaH1130   [ PlaH662 ]


The above results show that all types of chimera formation occur: between parental sequences from different phyla (Cyanobacteria plus other bacterial phyla), between different clades within the Cyanobacteria (DspN78 involves parents representing heterocystous and non-heterocystous Cyanobacteria), within the same cyanobacterial clade (the parents of AphaNH5 both being heterocystous organisms), and within the same genus. Not surprisingly, the number of discriminatory sites within the chimeric sequence, given as tT or pT values above, decreases as the degree of sequence divergence decreases. The precision with which the breakpoint can be located also decreases, from 6 – 10 nucleotides for the inter-phylum examples (involving a cyanobacterial and a bacterial parental sequence) to 133 for the intra-generic example, because the increase in sequence identity makes precise location progressively more difficult.
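
The breakpoint widths quoted above can be read directly from the reformatted output. A minimal parsing sketch follows, assuming the "Query:" and "Breakpoint region:" line layout shown in the examples; the function name is hypothetical and width is taken as end minus start.

```python
import re

BREAKPOINT = re.compile(
    r"Breakpoint region:\s*(\d+) to (\d+) of chimera \((\d+) nt\)"
)


def breakpoint_widths(report: str) -> dict[str, int]:
    """Map each query name to the width (end - start) of its breakpoint region."""
    widths: dict[str, int] = {}
    query = None
    for line in report.splitlines():
        line = line.strip()
        if line.startswith("Query:"):
            query = line.split(":", 1)[1].strip()
            continue
        match = BREAKPOINT.search(line)
        if match and query is not None:
            widths[query] = int(match.group(2)) - int(match.group(1))
    return widths


# Two records copied from the examples above (statistics lines omitted).
report = """
Query: Ukia101a
    Breakpoint region: 706 to 712 of chimera (1418 nt)
--------
Query: PlaH1128
    Breakpoint region: 757 to 890 of chimera (1382 nt)
"""
print(breakpoint_widths(report))  # {'Ukia101a': 6, 'PlaH1128': 133}
```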

Example splits graphs produced by SplitsTree4 NeighborNet analysis:

 

Inter-phylum chimeric sequence formation. Ukia101a (accession AF516748) is a chimeric sequence involving parents from both cyanobacterial and proteobacterial phyla, represented by a cluster of heterocystous Cyanobacteria and Rhodopseudomonas faecalis strain gc, respectively. 
Chimeric sequence Ukia101a


Inter-clade chimera. The chimeric sequence DspN78 (accession AF313627) of the heterocystous Dolichospermum strain NIES 78 is distinct from its non-chimeric version DspN78n (accession AY701551); the splits show a strong identity with sequences DspN80/DspN78n (Dolichospermum) and with non-heterocystous Pseudanabaena strains represented by sequences PsaBRG53 and Psa7429A. The parental sequences are from organisms that fall into distinct clades of the phylogenetic tree.

Chimeric sequence DspN78
Intra-clade chimeric sequence. The chimeric sequence AphaNH5 (accession AF425995) of the heterocystous Aphanizomenon strain NH-5 is distinct from a second, non-chimeric, sequence (AphNH5, accession AY196086) of the same organism; the splits reveal identity with both Nodularia and Anabaena/Aphanizomenon strains (sequences NodB9427 and Ana27S03/AphNH5). The parental sequences are found within the same major clade of the phylogenetic tree.


Chimeric sequence AphaNH5

Intra-generic chimera formation. Within the genus Planktothrix, the 16S rRNA sequence PlaH1128 (accession FJ184439) occupies an unusual position because it is a chimeric sequence, showing identity to both the P. agardhii strain cluster (e.g. Pla7821) and the P. pseudagardhii cluster (e.g. Pla34S02).
Chimeric sequence PlaH1128

 
Page last updated: 02.04.2017 by Michael Herdman
