Links for fulltext
(May Require Subscription)
- Publisher Website: 10.1186/1471-2105-11-S2-S5
- Scopus: eid_2-s2.0-77952894198
- PMID: 20406503
- WOS: WOS:000276812300005
Conference Paper: Unsupervised binning of environmental genomic fragments based on an error robust selection of l-mers
Title | Unsupervised binning of environmental genomic fragments based on an error robust selection of l-mers |
---|---|
Authors | Yang, B; Peng, Y; Leung, HCM; Yiu, SM; Chen, JC; Chin, FYL |
Issue Date | 2010 |
Publisher | BioMed Central Ltd. |
Citation | The 3rd International Workshop on Data and Text Mining in Bioinformatics (DTMBIO 2009), Hong Kong, 6 November 2009. In BMC Bioinformatics, 2010, v. 11 suppl 2, article S5 |
Abstract | BACKGROUND: With the rapid development of genome sequencing techniques, traditional research methods based on the isolation and cultivation of microorganisms are being gradually replaced by metagenomics, which is also known as environmental genomics. The first step of metagenomics, which is still a major bottleneck, is the taxonomic characterization of DNA fragments (reads) resulting from sequencing a sample of mixed species. This step is usually referred to as 'binning'. Existing binning methods are based on supervised or semi-supervised approaches which rely heavily on reference genomes of known microorganisms and phylogenetic marker genes. Due to the limited availability of reference genomes and the bias and instability of marker genes, existing binning methods may not be applicable in many cases. RESULTS: In this paper, we present an unsupervised binning method based on the distribution of a carefully selected set of l-mers (substrings of length l in DNA fragments). Our experiments show that our method can accurately bin DNA fragments with various lengths and relative species abundance ratios without using any reference or training datasets. Another feature of our method is its error robustness. The binning accuracy decreases by less than 1% when the sequencing error rate increases from 0% to 5%. Note that the typical sequencing error rate of existing commercial sequencing platforms is less than 2%. CONCLUSIONS: We provide a new and effective tool to solve the metagenome binning problem without using any reference datasets or marker information of any known reference genomes (species). The source code of our software tool, the reference genomes of the species for generating the test datasets and the corresponding test datasets are available at http://i.cs.hku.hk/alse/MetaCluster/. |
Persistent Identifier | http://hdl.handle.net/10722/152434 |
ISSN | 1471-2105 (2023 Impact Factor: 2.9; 2023 SCImago Journal Rankings: 1.005) |
PubMed Central ID | PMC3165929 |
ISI Accession Number ID | WOS:000276812300005 |
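
The abstract defines l-mers as substrings of length l in a DNA fragment and bases binning on their distribution. As a rough illustration of that one idea only, the sketch below computes the relative frequency of every l-mer in a single read. It is not the authors' method: this record does not describe how the "carefully selected set" of l-mers is chosen or how fragments are clustered, so the function name `lmer_frequencies`, the default `l = 4`, and the choice to skip windows containing ambiguous bases are all assumptions made for illustration.

```python
from collections import Counter


def lmer_frequencies(read: str, l: int = 4) -> dict[str, float]:
    """Return the relative frequency of each l-mer in one DNA read.

    Hypothetical helper: the value of l and the handling of
    non-ACGT characters are illustrative assumptions, not taken
    from the paper.
    """
    read = read.upper()
    counts: Counter[str] = Counter()
    # Slide a window of length l across the read, one base at a time.
    for i in range(len(read) - l + 1):
        lmer = read[i:i + l]
        # Skip windows containing ambiguous bases such as N.
        if set(lmer) <= set("ACGT"):
            counts[lmer] += 1
    total = sum(counts.values())
    if total == 0:
        return {}
    return {lmer: n / total for lmer, n in counts.items()}


if __name__ == "__main__":
    for lmer, freq in sorted(lmer_frequencies("ACGTACGT").items()):
        print(lmer, freq)
```

Frequency vectors of this kind, computed per fragment, are what an unsupervised method could compare or cluster; the selection and clustering steps themselves are described in the paper, not in this record.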
DC Field | Value | Language |
---|---|---|
dc.contributor.author | Yang, B | en_US |
dc.contributor.author | Peng, Y | en_US |
dc.contributor.author | Leung, HCM | en_US |
dc.contributor.author | Yiu, SM | en_US |
dc.contributor.author | Chen, JC | en_US |
dc.contributor.author | Chin, FYL | en_US |
dc.date.accessioned | 2012-06-26T06:39:00Z | - |
dc.date.available | 2012-06-26T06:39:00Z | - |
dc.date.issued | 2010 | en_US |
dc.identifier.citation | The 3rd International Workshop on Data and Text Mining in Bioinformatics (DTMBIO 2009), Hong Kong, 6 November 2009. In BMC Bioinformatics, 2010, v. 11 suppl 2, article S5 | en_US |
dc.identifier.issn | 1471-2105 | en_US |
dc.identifier.uri | http://hdl.handle.net/10722/152434 | - |
dc.description.abstract | BACKGROUND: With the rapid development of genome sequencing techniques, traditional research methods based on the isolation and cultivation of microorganisms are being gradually replaced by metagenomics, which is also known as environmental genomics. The first step of metagenomics, which is still a major bottleneck, is the taxonomic characterization of DNA fragments (reads) resulting from sequencing a sample of mixed species. This step is usually referred to as 'binning'. Existing binning methods are based on supervised or semi-supervised approaches which rely heavily on reference genomes of known microorganisms and phylogenetic marker genes. Due to the limited availability of reference genomes and the bias and instability of marker genes, existing binning methods may not be applicable in many cases. RESULTS: In this paper, we present an unsupervised binning method based on the distribution of a carefully selected set of l-mers (substrings of length l in DNA fragments). Our experiments show that our method can accurately bin DNA fragments with various lengths and relative species abundance ratios without using any reference or training datasets. Another feature of our method is its error robustness. The binning accuracy decreases by less than 1% when the sequencing error rate increases from 0% to 5%. Note that the typical sequencing error rate of existing commercial sequencing platforms is less than 2%. CONCLUSIONS: We provide a new and effective tool to solve the metagenome binning problem without using any reference datasets or marker information of any known reference genomes (species). The source code of our software tool, the reference genomes of the species for generating the test datasets and the corresponding test datasets are available at http://i.cs.hku.hk/alse/MetaCluster/. | en_US |
dc.language | eng | en_US |
dc.publisher | BioMed Central Ltd. | en_US |
dc.relation.ispartof | BMC Bioinformatics | en_US |
dc.rights | BMC Bioinformatics. Copyright © BioMed Central Ltd. | - |
dc.rights | This work is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License. | - |
dc.subject.mesh | Algorithms | en_US |
dc.subject.mesh | Cluster Analysis | en_US |
dc.subject.mesh | DNA - chemistry | en_US |
dc.subject.mesh | Data Mining - methods | en_US |
dc.subject.mesh | Databases, Genetic | en_US |
dc.subject.mesh | Environmental Microbiology | en_US |
dc.subject.mesh | Escherichia coli - genetics | en_US |
dc.subject.mesh | Genome, Bacterial - genetics | en_US |
dc.subject.mesh | Lactobacillus - genetics | en_US |
dc.subject.mesh | Metagenomics - methods | en_US |
dc.subject.mesh | Sequence Analysis, DNA - methods | en_US |
dc.title | Unsupervised binning of environmental genomic fragments based on an error robust selection of l-mers | en_US |
dc.type | Conference_Paper | en_US |
dc.identifier.email | Leung, HCM: cmleung2@cs.hku.hk | en_US |
dc.identifier.email | Yiu, SM: smyiu@cs.hku.hk | en_US |
dc.identifier.email | Chin, FYL: chin@cs.hku.hk | en_US |
dc.identifier.authority | Leung, HCM=rp00144 | en_US |
dc.identifier.authority | Yiu, SM=rp00207 | en_US |
dc.identifier.authority | Chin, FYL=rp00105 | en_US |
dc.description.nature | published_or_final_version | en_US |
dc.identifier.doi | 10.1186/1471-2105-11-S2-S5 | en_US |
dc.identifier.pmid | 20406503 | - |
dc.identifier.pmcid | PMC3165929 | - |
dc.identifier.scopus | eid_2-s2.0-77952894198 | en_US |
dc.identifier.hkuros | 177371 | - |
dc.relation.references | http://www.scopus.com/mlt/select.url?eid=2-s2.0-77952894198&selection=ref&src=s&origin=recordpage | en_US |
dc.identifier.volume | 11 | en_US |
dc.identifier.issue | suppl 2 | en_US |
dc.identifier.isi | WOS:000276812300005 | - |
dc.publisher.place | United Kingdom | en_US |
dc.identifier.scopusauthorid | Chin, FY=7005101915 | en_US |
dc.identifier.scopusauthorid | Chen, JC=36439015600 | en_US |
dc.identifier.scopusauthorid | Yiu, SM=7003282240 | en_US |
dc.identifier.scopusauthorid | Leung, HC=35233742700 | en_US |
dc.identifier.scopusauthorid | Peng, Y=8713314400 | en_US |
dc.identifier.scopusauthorid | Yang, B=7404472246 | en_US |
dc.identifier.citeulike | 8210869 | - |
dc.customcontrol.immutable | sml 140806 | - |
dc.identifier.issnl | 1471-2105 | - |