Unsupervised binning of environmental genomic fragments based on an error robust selection of l-mers

Yang, B; Peng, Y; Leung, HCM; Yiu, SM; Chen, JC; Chin, FYL

Publisher Website: 10.1186/1471-2105-11-S2-S5
Scopus: eid_2-s2.0-77952894198
PMID: 20406503
WOS: WOS:000276812300005
Appears in Collections:
- Computer Science: Conference papers

Conference Paper: Unsupervised binning of environmental genomic fragments based on an error robust selection of l-mers

Title	Unsupervised binning of environmental genomic fragments based on an error robust selection of l-mers
Authors	Yang, B; Peng, Y; Leung, HCM; Yiu, SM; Chen, JC; Chin, FYL
Issue Date	2010
Publisher	BioMed Central Ltd.
Citation	The 3rd International Workshop on Data and Text Mining in Bioinformatics (DTMBIO 2009), Hong Kong, 6 November 2009. In BMC Bioinformatics, 2010, v. 11 suppl 2, article S5. DOI: http://dx.doi.org/10.1186/1471-2105-11-S2-S5
Abstract	BACKGROUND: With the rapid development of genome sequencing techniques, traditional research methods based on the isolation and cultivation of microorganisms are gradually being replaced by metagenomics, also known as environmental genomics. The first step of metagenomics, which is still a major bottleneck, is the taxonomic characterization of DNA fragments (reads) resulting from sequencing a sample of mixed species. This step is usually referred to as 'binning'. Existing binning methods are based on supervised or semi-supervised approaches, which rely heavily on reference genomes of known microorganisms and phylogenetic marker genes. Due to the limited availability of reference genomes and the bias and instability of marker genes, existing binning methods may not be applicable in many cases. RESULTS: In this paper, we present an unsupervised binning method based on the distribution of a carefully selected set of l-mers (substrings of length l in DNA fragments). Our experiments show that the method can accurately bin DNA fragments of various lengths and relative species abundance ratios without using any reference or training datasets. Another feature of our method is its error robustness: the binning accuracy decreases by less than 1% when the sequencing error rate increases from 0% to 5%. Note that the typical sequencing error rate of existing commercial sequencing platforms is less than 2%. CONCLUSIONS: We provide a new and effective tool to solve the metagenome binning problem without using any reference datasets or marker information from any known reference genomes (species). The source code of our software tool, the reference genomes of the species used to generate the test datasets, and the corresponding test datasets are available at http://i.cs.hku.hk/alse/MetaCluster/.
Persistent Identifier	http://hdl.handle.net/10722/152434
ISSN	1471-2105
2023 Impact Factor	2.9
2023 SCImago Journal Rankings	1.005
PubMed Central ID	PMC3165929
ISI Accession Number ID	WOS:000276812300005
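
The abstract defines l-mers as substrings of length l in DNA fragments and bases the binning on their distribution. As a minimal sketch of that idea only, the Python function below computes the relative frequency of every l-mer in a single read. The function name `lmer_distribution`, the default l = 4, and the choice to skip windows containing characters other than A, C, G, T are assumptions made for illustration; the paper's "carefully selected" subset of l-mers and its clustering step are not described in this record and are not reproduced here.

```python
from collections import Counter
from itertools import product


def lmer_distribution(read: str, l: int = 4) -> dict[str, float]:
    """Return the relative frequency of every l-mer over A/C/G/T in `read`.

    Hypothetical helper for illustration; not the authors' implementation.
    """
    read = read.upper()
    # Slide a window of length l over the read; skip windows that
    # contain ambiguous bases such as N (an assumption of this sketch).
    counts = Counter(
        read[i:i + l]
        for i in range(len(read) - l + 1)
        if set(read[i:i + l]) <= set("ACGT")
    )
    total = sum(counts.values())
    # Emit all 4**l possible l-mers so every read maps to a vector
    # with the same keys, which makes reads directly comparable.
    alphabet = ("".join(p) for p in product("ACGT", repeat=l))
    return {m: (counts[m] / total if total else 0.0) for m in alphabet}
```

For example, `lmer_distribution("ACGTACGT", l=2)` sees seven 2-mers in total and assigns 2/7 to "AC". Reads from the same species would be expected to yield similar vectors, which is the property an unsupervised binning method could exploit.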

DC Field	Value	Language
dc.contributor.author	Yang, B	en_US
dc.contributor.author	Peng, Y	en_US
dc.contributor.author	Leung, HCM	en_US
dc.contributor.author	Yiu, SM	en_US
dc.contributor.author	Chen, JC	en_US
dc.contributor.author	Chin, FYL	en_US
dc.date.accessioned	2012-06-26T06:39:00Z	-
dc.date.available	2012-06-26T06:39:00Z	-
dc.date.issued	2010	en_US
dc.identifier.citation	The 3rd International Workshop on Data and Text Mining in Bioinformatics (DTMBIO 2009), Hong Kong, 6 November 2009. In BMC Bioinformatics, 2010, v. 11 suppl 2, article S5	en_US
dc.identifier.issn	1471-2105	en_US
dc.identifier.uri	http://hdl.handle.net/10722/152434	-
dc.language	eng	en_US
dc.publisher	BioMed Central Ltd.	en_US
dc.relation.ispartof	BMC Bioinformatics	en_US
dc.rights	BMC Bioinformatics. Copyright © BioMed Central Ltd.	-
dc.rights	This work is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License.	-
dc.subject.mesh	Algorithms	en_US
dc.subject.mesh	Cluster Analysis	en_US
dc.subject.mesh	DNA - chemistry	en_US
dc.subject.mesh	Data Mining - methods	en_US
dc.subject.mesh	Databases, Genetic	en_US
dc.subject.mesh	Environmental Microbiology	en_US
dc.subject.mesh	Escherichia coli - genetics	en_US
dc.subject.mesh	Genome, Bacterial - genetics	en_US
dc.subject.mesh	Lactobacillus - genetics	en_US
dc.subject.mesh	Metagenomics - methods	en_US
dc.subject.mesh	Sequence Analysis, DNA - methods	en_US
dc.title	Unsupervised binning of environmental genomic fragments based on an error robust selection of l-mers	en_US
dc.type	Conference_Paper	en_US
dc.identifier.email	Leung, HCM: cmleung2@cs.hku.hk	en_US
dc.identifier.email	Yiu, SM: smyiu@cs.hku.hk	en_US
dc.identifier.email	Chin, FYL: chin@cs.hku.hk	en_US
dc.identifier.authority	Leung, HCM=rp00144	en_US
dc.identifier.authority	Yiu, SM=rp00207	en_US
dc.identifier.authority	Chin, FYL=rp00105	en_US
dc.description.nature	published_or_final_version	en_US
dc.identifier.doi	10.1186/1471-2105-11-S2-S5	en_US
dc.identifier.pmid	20406503	-
dc.identifier.pmcid	PMC3165929	-
dc.identifier.scopus	eid_2-s2.0-77952894198	en_US
dc.identifier.hkuros	177371	-
dc.relation.references	http://www.scopus.com/mlt/select.url?eid=2-s2.0-77952894198&selection=ref&src=s&origin=recordpage	en_US
dc.identifier.volume	11	en_US
dc.identifier.issue	suppl 2	en_US
dc.identifier.isi	WOS:000276812300005	-
dc.publisher.place	United Kingdom	en_US
dc.identifier.scopusauthorid	Chin, FY=7005101915	en_US
dc.identifier.scopusauthorid	Chen, JC=36439015600	en_US
dc.identifier.scopusauthorid	Yiu, SM=7003282240	en_US
dc.identifier.scopusauthorid	Leung, HC=35233742700	en_US
dc.identifier.scopusauthorid	Peng, Y=8713314400	en_US
dc.identifier.scopusauthorid	Yang, B=7404472246	en_US
dc.identifier.citeulike	8210869	-
dc.customcontrol.immutable	sml 140806	-
dc.identifier.issnl	1471-2105	-