postgraduate thesis: Large genome de novo assembly with bi-directional BWT

Title: Large genome de novo assembly with bi-directional BWT
Authors: Liu, Binghang (劉兵行)
Issue Date: 2015
Publisher: The University of Hong Kong (Pokfulam, Hong Kong)
Citation
Liu, B. [劉兵行]. (2015). Large genome de novo assembly with bi-directional BWT. (Thesis). University of Hong Kong, Pokfulam, Hong Kong SAR. Retrieved from http://dx.doi.org/10.5353/th_b5736698
Abstract: De novo genome assembly is a fundamental problem in genomics research. When assembling large genomes, time is often a very important concern, and one might have no choice but to use a more efficient assembler like SOAPdenovo2 instead of a high-quality but prohibitively slow assembler (e.g., SPAdes). Yet SOAPdenovo2 has inherent difficulty taking full advantage of longer reads (say, 150bp to 250bp from Illumina HiSeq and MiSeq). Other assemblers that are based on string graphs (e.g., SGA), though less popular and also very slow, are indeed more favorable for longer reads. In this thesis, I mainly present a new contig assembler called BASE, based on a seed-extension approach. It exploits an efficient indexing of reads to generate adaptive seeds with a high probability of unique appearance in the genome and high sequencing quality. Guided by these seeds, BASE constructs extension trees and gradually removes the branches with a method called reverse validation, which utilizes information about read coverage and paired-end relationships to obtain consensus sequences of reads sharing the seeds. These consensus sequences are further extended to form high-quality contigs. Benchmarks on several bacterial and human datasets demonstrate the performance advantage of BASE in speed and assembly quality when longer reads are used. Our first benchmark was based on two datasets of deeply sequenced bacterial genomes (240X) with read lengths of 100bp and 250bp. Especially for 250bp reads, BASE performs much better than SOAPdenovo2 and SGA and is similar to SPAdes in performance. Regarding speed, BASE is consistently a few times faster than SPAdes and SGA, but still slower than SOAPdenovo2. We have further compared BASE and SOAPdenovo2 using human genome datasets with read lengths of 100bp, 150bp and 250bp. BASE consistently achieves a higher N50 for all datasets, and the improvement becomes more significant when the read length reaches 250bp. SOAPdenovo2 uses relatively more memory when sequencing error is high. BASE is an efficient assembler for contig construction, with significant improvement in quality for long NGS reads. It could be easily extended to support scaffolding in the near future.
Degree: Master of Philosophy
Subject: Nucleotide sequence - Data processing; Data compression (Computer science)
Dept/Program: Computer Science
Persistent Identifier: http://hdl.handle.net/10722/225227
HKU Library Item ID: b5736698

 

DC Field | Value | Language
dc.contributor.author | Liu, Binghang | -
dc.contributor.author | 劉兵行 | -
dc.date.accessioned | 2016-04-28T06:50:59Z | -
dc.date.available | 2016-04-28T06:50:59Z | -
dc.date.issued | 2015 | -
dc.identifier.citation | Liu, B. [劉兵行]. (2015). Large genome de novo assembly with bi-directional BWT. (Thesis). University of Hong Kong, Pokfulam, Hong Kong SAR. Retrieved from http://dx.doi.org/10.5353/th_b5736698 | -
dc.identifier.uri | http://hdl.handle.net/10722/225227 | -
dc.language | eng | -
dc.publisher | The University of Hong Kong (Pokfulam, Hong Kong) | -
dc.relation.ispartof | HKU Theses Online (HKUTO) | -
dc.rights | The author retains all proprietary rights, (such as patent rights) and the right to use in future works. | -
dc.rights | This work is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License. | -
dc.subject.lcsh | Nucleotide sequence - Data processing | -
dc.subject.lcsh | Data compression (Computer science) | -
dc.title | Large genome de novo assembly with bi-directional BWT | -
dc.type | PG_Thesis | -
dc.identifier.hkul | b5736698 | -
dc.description.thesisname | Master of Philosophy | -
dc.description.thesislevel | Master | -
dc.description.thesisdiscipline | Computer Science | -
dc.description.nature | published_or_final_version | -
dc.identifier.doi | 10.5353/th_b5736698 | -
dc.identifier.mmsid | 991019348549703414 | -
