MS4-Multi-Scale Selector of Sequence Signatures: An alignment-free method for classification of biological sequences

Devauchelle, Claudine

doi:10.1186/1471-2105-11-406


dc.bibliographiccitation.artnumber	406
dc.bibliographiccitation.journal	BMC Bioinformatics
dc.bibliographiccitation.volume	11
dc.contributor.author	Corel, Eduardo
dc.contributor.author	Pitschi, Florian
dc.contributor.author	Laprevotte, Ivan
dc.contributor.author	Grasseau, Gilles
dc.contributor.author	Didier, Gilles
dc.contributor.author	Devauchelle, Claudine
dc.date.accessioned	2018-11-07T08:41:15Z
dc.date.available	2018-11-07T08:41:15Z
dc.date.issued	2010
dc.description.abstract	Background: While multiple alignment is the first step of usual classification schemes for biological sequences, alignment-free methods are increasingly used as alternatives when multiple alignments fail. Subword-based combinatorial methods are popular for their low algorithmic complexity (suffix trees ...) or exhaustivity (motif search), in general with a fixed word length and/or a fixed number of mismatches. We previously developed a method to detect local similarities (the N-local decoding) based on the occurrences of repeated subwords of fixed length, which does not impose a fixed number of mismatches. The resulting similarities are, for some "good" values of N, sufficiently relevant to form the basis of a reliable alignment-free classification. The aim of this paper is to develop a method that uses the similarities detected by N-local decoding without imposing a fixed value of N. We present a procedure that selects an adaptive value of N for every position in the sequences, and we implement it as the MS4 classification tool. Results: Among the equivalence classes produced by the N-local decodings for all N, we select a (relatively) small number of "relevant" classes corresponding to variable-length subwords that carry enough information to perform the classification. The parameter N, for which correct values are data-dependent and thus hard to guess, is here replaced by the average repetitivity kappa of the sequences. We show that our approach yields classifications of several sets of HIV/SIV sequences that agree with the accepted taxonomy, even on usually discarded repetitive regions (like the non-coding part of LTR). Conclusions: The method MS4 satisfactorily classifies a set of sequences that are notoriously hard to align. This suggests that our approach forms the basis of a reliable alignment-free classification tool. The only parameter of MS4, kappa, seems to give reasonable results even at its default value, which can be a great advantage for sequence sets for which little information is available.
dc.description.sponsorship	Genopole; Deutsche Forschungsgemeinschaft [MO 1048/6-1]
dc.identifier.doi	10.1186/1471-2105-11-406
dc.identifier.isi	000281442500003
dc.identifier.pmid	20673356
dc.identifier.purl	https://resolver.sub.uni-goettingen.de/purl?gs-1/5666
dc.identifier.uri	https://resolver.sub.uni-goettingen.de/purl?gro-2/19426
dc.item.fulltext	With Fulltext
dc.notes.intern	Merged from goescholar
dc.notes.status	to be checked
dc.notes.submitter	Najko
dc.publisher	BioMed Central Ltd
dc.relation.issn	1471-2105
dc.rights	CC BY 2.0
dc.rights.uri	https://creativecommons.org/licenses/by/2.0
dc.title	MS4-Multi-Scale Selector of Sequence Signatures: An alignment-free method for classification of biological sequences
dc.type	journal_article
dc.type.internalPublication	yes
dc.type.peerReviewed	yes
dc.type.status	published
dc.type.version	published_version
dspace.entity.type	Publication