Publication: MS4-Multi-Scale Selector of Sequence Signatures: An alignment-free method for classification of biological sequences
| dc.bibliographiccitation.artnumber | 406 | |
| dc.bibliographiccitation.journal | BMC Bioinformatics | |
| dc.bibliographiccitation.volume | 11 | |
| dc.contributor.author | Corel, Eduardo | |
| dc.contributor.author | Pitschi, Florian | |
| dc.contributor.author | Laprevotte, Ivan | |
| dc.contributor.author | Grasseau, Gilles | |
| dc.contributor.author | Didier, Gilles | |
| dc.contributor.author | Devauchelle, Claudine | |
| dc.date.accessioned | 2018-11-07T08:41:15Z | |
| dc.date.available | 2018-11-07T08:41:15Z | |
| dc.date.issued | 2010 | |
| dc.description.abstract | Background: While multiple alignment is the first step of usual classification schemes for biological sequences, alignment-free methods are being increasingly used as alternatives when multiple alignments fail. Subword-based combinatorial methods are popular for their low algorithmic complexity (suffix trees ...) or exhaustivity (motif search), in general with fixed length word and/or number of mismatches. We developed previously a method to detect local similarities (the N-local decoding) based on the occurrences of repeated subwords of fixed length, which does not impose a fixed number of mismatches. The resulting similarities are, for some "good" values of N, sufficiently relevant to form the basis of a reliable alignment-free classification. The aim of this paper is to develop a method that uses the similarities detected by N-local decoding while not imposing a fixed value of N. We present a procedure that selects for every position in the sequences an adaptive value of N, and we implement it as the MS4 classification tool. Results: Among the equivalence classes produced by the N-local decodings for all N, we select a (relatively) small number of "relevant" classes corresponding to variable length subwords that carry enough information to perform the classification. The parameter N, for which correct values are data-dependent and thus hard to guess, is here replaced by the average repetitivity kappa of the sequences. We show that our approach yields classifications of several sets of HIV/SIV sequences that agree with the accepted taxonomy, even on usually discarded repetitive regions (like the non-coding part of LTR). Conclusions: The method MS4 satisfactorily classifies a set of sequences that are notoriously hard to align. This suggests that our approach forms the basis of a reliable alignment-free classification tool. The only parameter kappa of MS4 seems to give reasonable results even for its default value, which can be a great advantage for sequence sets for which little information is available. | |
| dc.description.sponsorship | Genopole; Deutsche Forschungsgemeinschaft [MO 1048/6-1] | |
| dc.identifier.doi | 10.1186/1471-2105-11-406 | |
| dc.identifier.isi | 000281442500003 | |
| dc.identifier.pmid | 20673356 | |
| dc.identifier.purl | https://resolver.sub.uni-goettingen.de/purl?gs-1/5666 | |
| dc.identifier.uri | https://resolver.sub.uni-goettingen.de/purl?gro-2/19426 | |
| dc.item.fulltext | With Fulltext | |
| dc.notes.intern | Merged from goescholar | |
| dc.notes.status | to be checked | |
| dc.notes.submitter | Najko | |
| dc.publisher | Biomed Central Ltd | |
| dc.relation.issn | 1471-2105 | |
| dc.rights | CC BY 2.0 | |
| dc.rights.uri | https://creativecommons.org/licenses/by/2.0 | |
| dc.title | MS4-Multi-Scale Selector of Sequence Signatures: An alignment-free method for classification of biological sequences | |
| dc.type | journal_article | |
| dc.type.internalPublication | yes | |
| dc.type.peerReviewed | yes | |
| dc.type.status | published | |
| dc.type.version | published_version | |
| dspace.entity.type | Publication |
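
The abstract describes similarities built from occurrences of repeated subwords of fixed length N. As a rough illustration of that starting point only, the sketch below indexes every length-N subword in a set of sequences and keeps those that occur at least twice. It is not the N-local decoding or the MS4 selection procedure; the function name `repeated_subwords` and the toy sequences are assumptions made for this example.

```python
from collections import defaultdict


def repeated_subwords(sequences, n):
    """Return each length-n subword that occurs at least twice,
    mapped to its (sequence_index, offset) occurrences.

    Hypothetical helper for illustration; not part of MS4.
    """
    occurrences = defaultdict(list)
    for seq_index, seq in enumerate(sequences):
        # Slide a window of width n over the sequence.
        for offset in range(len(seq) - n + 1):
            occurrences[seq[offset:offset + n]].append((seq_index, offset))
    # Keep only subwords seen at two or more positions.
    return {word: pos for word, pos in occurrences.items() if len(pos) >= 2}


# Toy input, invented for this example.
toy_sequences = ["ACGTACGA", "TTACGTAA"]
repeats = repeated_subwords(toy_sequences, 4)
for word in sorted(repeats):
    print(word, repeats[word])
```

Changing `n` changes which repeats are found, which is the data-dependent choice the abstract says MS4 replaces with the single parameter kappa.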