Appears in Collections: postgraduate thesis: Memory- and time-efficient solutions for large-scale metagenomic sequence analysis
Title | Memory- and time-efficient solutions for large-scale metagenomic sequence analysis |
---|---|
Authors | Li, Dinghua (李定华) |
Advisors | Lam, TW |
Issue Date | 2017 |
Publisher | The University of Hong Kong (Pokfulam, Hong Kong) |
Citation | Li, D. R. [李定华]. (2017). Memory- and time-efficient solutions for large-scale metagenomic sequence analysis. (Thesis). University of Hong Kong, Pokfulam, Hong Kong SAR. |
Abstract | Metagenomics, the study of genomic material obtained directly from uncultured environments, has greatly benefited from recent advances in next-generation sequencing (NGS) technologies, which can generate millions to billions of short DNA segments (known as reads) from one or more environmental samples in a few days. The large volume and high complexity of metagenomic data pose new computational challenges for analyzing them efficiently and accurately. This thesis presents three software solutions for analyzing large-scale metagenomic NGS data in a memory- and time-efficient manner. All three were tested on a wide range of metagenomic data to demonstrate their flexibility and their advantages over existing methods.
The first tool is MEGAHIT, a de novo assembler for NGS metagenomic sequences. It is the first tool to exploit the succinct de Bruijn graph to achieve a low memory footprint, while simultaneously achieving high speed through a sophisticated parallel design for manipulating this succinct data structure. Before MEGAHIT, the Great Prairie Soil Metagenome datasets (52 Gbp to 597 Gbp) could only be assembled after preprocessing such as partitioning and digital normalization. MEGAHIT is the first metagenome assembler able to handle them without preprocessing and, more importantly, it delivers higher-quality results in terms of assembly completeness and contiguity.
The second tool is MegaGTA, a gene-targeted metagenome assembler. The idea is to use existing gene information to improve the quality of metagenome assembly, especially for high-complexity datasets. MegaGTA improves on the pioneering work Xander in three respects, to fully demonstrate the power of gene-targeted assembly. First, it employs iterative de Bruijn graphs to achieve high sensitivity and accuracy simultaneously. Second, it penalizes error-prone nodes in the de Bruijn graph to reduce assembly errors. Third, it uses succinct de Bruijn graphs in place of the Bloom filters used by Xander, which represent the graph only inexactly. MegaGTA outperforms Xander on both mock and real metagenomic datasets and is much faster. It can assemble large soil metagenome datasets and produces longer gene sequences, and more of them, than MEGAHIT.
Lastly, I present MegaPath, a bioinformatics pipeline for metagenomic short-read classification. Unlike most metagenomic classifiers, which rely on exact matches and trade sensitivity for speed, MegaPath is a more sensitive solution powered by a new NGS short-read aligner, SOAP-M. This aligner combines a refined maximum mappable prefix seeding strategy based on the FM-index with a SIMD-enabled implementation of Smith-Waterman alignment. MegaPath shows higher sensitivity than popular classifiers such as Kraken and Centrifuge on a variety of datasets, especially when the microorganisms in the samples share low identity with the reference genomes. MegaPath also supports finer-grained classification at the protein level, which enables it to identify highly mutated viral species. Even at the protein level, MegaPath can detect bacteria and viruses in clinical samples, which typically contain tens of millions of short reads, in two to three hours on a single server. MegaPath's high sensitivity and speed make it well suited to pathogen detection in real clinical cases. |
Degree | Doctor of Philosophy |
Subject | Metagenomics; Bioinformatics |
Dept/Program | Computer Science |
Persistent Identifier | http://hdl.handle.net/10722/249831 |
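Both MEGAHIT and MegaGTA build on the de Bruijn graph model mentioned in the abstract. As a toy illustration only (my own sketch, not code from the thesis), the snippet below builds a plain hash-based de Bruijn graph from reads and extracts unitigs, i.e. maximal non-branching paths. MEGAHIT's actual contribution is to store this graph succinctly (the succinct de Bruijn graph) and to build and traverse it in parallel; both tools additionally iterate over several k values, feeding contigs from a smaller k into the graph at a larger k.

```python
from collections import defaultdict

def build_graph(reads, k):
    """Record (k-1)-mer -> (k-1)-mer edges implied by each k-mer in the reads."""
    edges = defaultdict(set)
    indeg = defaultdict(int)
    for read in reads:
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            u, v = kmer[:-1], kmer[1:]
            if v not in edges[u]:
                edges[u].add(v)
                indeg[v] += 1
    return edges, indeg

def unitigs(edges, indeg):
    """Walk maximal non-branching paths starting at branch/source nodes.
    (Isolated cycles with no branch point are ignored in this sketch.)"""
    contigs = []
    starts = [u for u in edges if indeg[u] != 1 or len(edges[u]) != 1]
    for s in starts:
        for v in edges[s]:
            path = s + v[-1]
            while len(edges[v]) == 1 and indeg[v] == 1:
                v = next(iter(edges[v]))
                path += v[-1]
            contigs.append(path)
    return contigs

g, indeg = build_graph(["AACCGGTT"], k=4)
print(unitigs(g, indeg))  # → ['AACCGGTT']
```

A real assembler must also handle reverse complements, sequencing errors (tips and bubbles), and coverage-based filtering, all omitted here for brevity.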
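The SOAP-M aligner described in the abstract follows the seed-and-extend paradigm: maximum mappable prefix (MMP) seeds found with an FM-index, then Smith-Waterman extension vectorized with SIMD. The sketch below is a rough illustration of that idea under stated simplifications, not the thesis's implementation: naive substring search stands in for the FM-index, and a scalar dynamic program stands in for the SIMD kernel; all function names are mine.

```python
def mmp_seeds(read, ref, min_len=4):
    """Greedy maximum-mappable-prefix seeding (toy version).
    From each start position, extend the read prefix for as long as it
    still occurs in the reference; real aligners answer these membership
    queries with an FM-index instead of substring search."""
    seeds, p = [], 0
    while p < len(read):
        length = 0
        while p + length < len(read) and read[p:p + length + 1] in ref:
            length += 1
        if length >= min_len:
            seeds.append((p, ref.find(read[p:p + length]), length))
        p += max(length, 1)  # restart after the mappable prefix ends
    return seeds

def smith_waterman(query, ref, match=2, mismatch=-3, gap=-5):
    """Scalar Smith-Waterman local alignment score (no traceback).
    SIMD implementations vectorize this inner loop across DP cells."""
    prev = [0] * (len(ref) + 1)
    best = 0
    for i in range(1, len(query) + 1):
        cur = [0] * (len(ref) + 1)
        for j in range(1, len(ref) + 1):
            score = match if query[i - 1] == ref[j - 1] else mismatch
            cur[j] = max(0, prev[j - 1] + score, prev[j] + gap, cur[j - 1] + gap)
            best = max(best, cur[j])
        prev = cur
    return best

ref = "ACGTAGGCTTACGGA"
read = "ACGTAGCCTT"
print(mmp_seeds(read, ref))       # → [(0, 0, 6)]  (read_pos, ref_pos, length)
print(smith_waterman(read, ref))  # → 15
```

In a full pipeline, each seed would anchor a banded Smith-Waterman extension around the candidate reference locus rather than aligning against the whole reference.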
DC Field | Value | Language |
---|---|---|
dc.contributor.advisor | Lam, TW | - |
dc.contributor.author | Li, Dinghua | - |
dc.contributor.author | 李定华 | - |
dc.date.accessioned | 2017-12-19T09:27:27Z | - |
dc.date.available | 2017-12-19T09:27:27Z | - |
dc.date.issued | 2017 | - |
dc.identifier.citation | Li, D. R. [李定华]. (2017). Memory- and time-efficient solutions for large-scale metagenomic sequence analysis. (Thesis). University of Hong Kong, Pokfulam, Hong Kong SAR. | - |
dc.identifier.uri | http://hdl.handle.net/10722/249831 | - |
dc.language | eng | - |
dc.publisher | The University of Hong Kong (Pokfulam, Hong Kong) | - |
dc.relation.ispartof | HKU Theses Online (HKUTO) | - |
dc.rights | The author retains all proprietary rights (such as patent rights) and the right to use in future works. | - |
dc.rights | This work is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License. | - |
dc.subject.lcsh | Metagenomics | - |
dc.subject.lcsh | Bioinformatics | - |
dc.title | Memory- and time-efficient solutions for large-scale metagenomic sequence analysis | - |
dc.type | PG_Thesis | - |
dc.description.thesisname | Doctor of Philosophy | - |
dc.description.thesislevel | Doctoral | - |
dc.description.thesisdiscipline | Computer Science | - |
dc.description.nature | published_or_final_version | - |
dc.identifier.doi | 10.5353/th_991043976596703414 | - |
dc.date.hkucongregation | 2017 | - |
dc.identifier.mmsid | 991043976596703414 | - |