Appears in Collections: postgraduate thesis: Memory- and time-efficient solutions for large-scale metagenomic sequence analysis
Title | Memory- and time-efficient solutions for large-scale metagenomic sequence analysis |
---|---|
Authors | Li, Dinghua (李定华) |
Advisors | Lam, TW |
Issue Date | 2017 |
Publisher | The University of Hong Kong (Pokfulam, Hong Kong) |
Citation | Li, D. R. [李定华]. (2017). Memory- and time-efficient solutions for large-scale metagenomic sequence analysis. (Thesis). University of Hong Kong, Pokfulam, Hong Kong SAR. |
Abstract | Metagenomics, the study of genomic material obtained directly from uncultured environments, has greatly benefited from recent advances in next-generation sequencing (NGS) technologies, which can generate millions to billions of short DNA segments (known as reads) from one or more environmental samples in a few days. The large volume and high complexity of metagenomic data pose new computational challenges for analyzing them efficiently and accurately. This thesis presents three software solutions for analyzing large-scale metagenomic NGS data in a memory- and time-efficient manner. All three were tested on a wide range of metagenomic data to demonstrate their flexibility and their advantages over existing methods.
The first tool is MEGAHIT, a de novo assembler for NGS metagenomic sequences. It is the first tool to exploit the succinct de Bruijn graph to achieve a low memory footprint, while simultaneously achieving high speed through a sophisticated parallel design for manipulating this succinct data structure. Before MEGAHIT, the Great Prairie Soil Metagenome datasets (52 Gbp to 597 Gbp) could only be assembled after preprocessing such as partitioning and digital normalization. MEGAHIT is the first metagenome assembler able to handle them without preprocessing and, more importantly, it delivers higher-quality results in terms of assembly completeness and contiguity.
The second tool is MegaGTA, a gene-targeted metagenome assembler. The idea is to use existing gene information to improve the quality of metagenome assembly, especially for high-complexity datasets. MegaGTA improves on the pioneering work Xander in three respects, to fully demonstrate the power of gene-targeted assembly. First, it employs iterative de Bruijn graphs to achieve high sensitivity and accuracy simultaneously. Second, it penalizes error-prone nodes in the de Bruijn graph to reduce assembly errors. Third, it uses succinct de Bruijn graphs in place of the Bloom filters used by Xander, which represent the graph only inexactly. MegaGTA outperforms Xander on both mock and real metagenomic datasets and is much faster. It can assemble large soil metagenome datasets and produces longer gene sequences, and more of them, than MEGAHIT.
Lastly, I present MegaPath, a bioinformatics pipeline for metagenomic short-read classification. Unlike most metagenomic classifiers, which rely on exact matches and trade sensitivity for speed, MegaPath is a more sensitive solution powered by a new NGS short-read aligner, SOAP-M. This aligner combines a refined maximum mappable prefix seeding strategy based on the FM-index with a SIMD-enabled implementation of Smith-Waterman alignment. MegaPath shows higher sensitivity than popular classifiers such as Kraken and Centrifuge on a variety of datasets, especially when the microorganisms in the samples share low identity with the reference genomes. MegaPath also supports finer-grained classification at the protein level, which enables it to identify highly mutated viral species. Even at the protein level, MegaPath can detect bacteria and viruses in clinical samples, which typically contain tens of millions of short reads, in two to three hours on a single server. MegaPath's high sensitivity and speed make it well suited to pathogen detection in real clinical cases. |
Degree | Doctor of Philosophy |
Subject | Metagenomics; Bioinformatics |
Dept/Program | Computer Science |
Persistent Identifier | http://hdl.handle.net/10722/249831 |
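Both MEGAHIT and MegaGTA build on the de Bruijn graph model mentioned in the abstract. As a toy illustration only (my own sketch, not code from the thesis), the snippet below builds a plain hash-based de Bruijn graph from reads and extracts unitigs, i.e. maximal non-branching paths. MEGAHIT's actual contribution is to store this graph succinctly (the succinct de Bruijn graph) and to build and traverse it in parallel; both tools additionally iterate over several k values, feeding contigs from a smaller k into the graph at a larger k.

```python
from collections import defaultdict

def build_graph(reads, k):
    """Record (k-1)-mer -> (k-1)-mer edges implied by each k-mer in the reads."""
    edges = defaultdict(set)
    indeg = defaultdict(int)
    for read in reads:
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            u, v = kmer[:-1], kmer[1:]
            if v not in edges[u]:
                edges[u].add(v)
                indeg[v] += 1
    return edges, indeg

def unitigs(edges, indeg):
    """Walk maximal non-branching paths starting at branch/source nodes.
    (Isolated cycles with no branch point are ignored in this sketch.)"""
    contigs = []
    starts = [u for u in edges if indeg[u] != 1 or len(edges[u]) != 1]
    for s in starts:
        for v in edges[s]:
            path = s + v[-1]
            while len(edges[v]) == 1 and indeg[v] == 1:
                v = next(iter(edges[v]))
                path += v[-1]
            contigs.append(path)
    return contigs

g, indeg = build_graph(["AACCGGTT"], k=4)
print(unitigs(g, indeg))  # → ['AACCGGTT']
```

A real assembler must also handle reverse complements, sequencing errors (tips and bubbles), and coverage-based filtering, all omitted here for brevity.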
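The SOAP-M aligner described in the abstract follows the seed-and-extend paradigm: maximum mappable prefix (MMP) seeds found with an FM-index, then Smith-Waterman extension vectorized with SIMD. The sketch below is a rough illustration of that idea under stated simplifications, not the thesis's implementation: naive substring search stands in for the FM-index, and a scalar dynamic program stands in for the SIMD kernel; all function names are mine.

```python
def mmp_seeds(read, ref, min_len=4):
    """Greedy maximum-mappable-prefix seeding (toy version).
    From each start position, extend the read prefix for as long as it
    still occurs in the reference; real aligners answer these membership
    queries with an FM-index instead of substring search."""
    seeds, p = [], 0
    while p < len(read):
        length = 0
        while p + length < len(read) and read[p:p + length + 1] in ref:
            length += 1
        if length >= min_len:
            seeds.append((p, ref.find(read[p:p + length]), length))
        p += max(length, 1)  # restart after the mappable prefix ends
    return seeds

def smith_waterman(query, ref, match=2, mismatch=-3, gap=-5):
    """Scalar Smith-Waterman local alignment score (no traceback).
    SIMD implementations vectorize this inner loop across DP cells."""
    prev = [0] * (len(ref) + 1)
    best = 0
    for i in range(1, len(query) + 1):
        cur = [0] * (len(ref) + 1)
        for j in range(1, len(ref) + 1):
            score = match if query[i - 1] == ref[j - 1] else mismatch
            cur[j] = max(0, prev[j - 1] + score, prev[j] + gap, cur[j - 1] + gap)
            best = max(best, cur[j])
        prev = cur
    return best

ref = "ACGTAGGCTTACGGA"
read = "ACGTAGCCTT"
print(mmp_seeds(read, ref))       # → [(0, 0, 6)]  (read_pos, ref_pos, length)
print(smith_waterman(read, ref))  # → 15
```

In a full pipeline, each seed would anchor a banded Smith-Waterman extension around the candidate reference locus rather than aligning against the whole reference.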
DC Field | Value | Language |
---|---|---|
dc.contributor.advisor | Lam, TW | - |
dc.contributor.author | Li, Dinghua | - |
dc.contributor.author | 李定华 | - |
dc.date.accessioned | 2017-12-19T09:27:27Z | - |
dc.date.available | 2017-12-19T09:27:27Z | - |
dc.date.issued | 2017 | - |
dc.identifier.citation | Li, D. R. [李定华]. (2017). Memory- and time-efficient solutions for large-scale metagenomic sequence analysis. (Thesis). University of Hong Kong, Pokfulam, Hong Kong SAR. | - |
dc.identifier.uri | http://hdl.handle.net/10722/249831 | - |
dc.language | eng | - |
dc.publisher | The University of Hong Kong (Pokfulam, Hong Kong) | - |
dc.relation.ispartof | HKU Theses Online (HKUTO) | - |
dc.rights | The author retains all proprietary rights (such as patent rights) and the right to use in future works. | - |
dc.rights | This work is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License. | - |
dc.subject.lcsh | Metagenomics | - |
dc.subject.lcsh | Bioinformatics | - |
dc.title | Memory- and time-efficient solutions for large-scale metagenomic sequence analysis | - |
dc.type | PG_Thesis | - |
dc.description.thesisname | Doctor of Philosophy | - |
dc.description.thesislevel | Doctoral | - |
dc.description.thesisdiscipline | Computer Science | - |
dc.description.nature | published_or_final_version | - |
dc.identifier.doi | 10.5353/th_991043976596703414 | - |
dc.date.hkucongregation | 2017 | - |
dc.identifier.mmsid | 991043976596703414 | - |