Appears in Collections:
postgraduate thesis: Genome analyses based on RNA sequences and DNA optical maps
Title | Genome analyses based on RNA sequences and DNA optical maps |
---|---|
Authors | Li, Menglu (李夢璐) |
Advisors | Yiu, SM |
Issue Date | 2018 |
Publisher | The University of Hong Kong (Pokfulam, Hong Kong) |
Citation | Li, M. [李夢璐]. (2018). Genome analyses based on RNA sequences and DNA optical maps. (Thesis). University of Hong Kong, Pokfulam, Hong Kong SAR. |
Abstract | DNA, RNA and proteins are three major macromolecules essential for all known forms of life. The genetic information within a biological system naturally transfers from DNA to RNA and from RNA to protein by transcription and translation. Investigations of DNA optical maps and RNA secondary structures are challenging tasks that expand our understanding of genomics.
An RNA molecule exists in nature as a single strand of nucleotides folding back on itself. This is because some bases are paired up by hydrogen bonds, forming base-pair interactions termed RNA secondary structures. For many RNA molecules, their secondary structures are often more important for correct function (gene expression, regulation, catalysis and cellular signal communication) than their plain sequences. Although the majority of RNAs fold into simple secondary structures, pseudoknots are found in almost all classes of RNAs. Their existence makes secondary structure prediction NP-hard. In order to predict complicated secondary structures, we devised a grammar-based machine learning method to predict secondary structures for all RNA sequences in Rfam. Regarding every structure as a unique operation path that generates it, we are able to train a rule transition probability matrix and a base emission probability matrix. These matrices determine the operation path to generate a secondary structure for a given RNA sequence. Experimental results show that our approach performs well with a high PPV and sensitivity, particularly for highly pseudoknotted RNAs.
DNA molecules are inherently fully paired double helices that store biological information and encode genetic instructions. In recent years, next-generation sequencing technologies have enabled researchers to discover critical findings in genomics with low cost and high efficiency. On the other hand, the short read length remains a major obstacle for thorough structural analyses such as de novo assembly and structural variation detection. As a complement, optical mapping is a high-throughput technique that produces long, high-resolution restriction maps. To lay a good basis for optical map studies, we first conducted a probabilistic error study on the alignment results of BioNano RefAligner on CEU trio maps. Sizing error, false cuts, missing cuts and unknown molecule orientation are carefully modeled using maximum likelihood estimation. Using the trio of samples and simulated datasets, this error model exhibits a better fit to BioNano optical maps than the previous model. In predicting the difficult regions that are prone to higher error rates, our error model is more accurate than other popular error models.
Taking the optical map investigation one step further, an iterative framework is proposed to assemble optical maps into contigs. Each iteration begins with pairwise alignments among all input optical maps. Confident alignments compose an overlap graph. By careful graph correction and path search, each connected graph component yields a contig. The assembly process iterates by taking the resulting contigs as new inputs. The algorithm stops when contigs no longer extend or merge. Experiments on E. coli simulated and real datasets show that our assembler is capable of constructing long and accurate consensus maps without misconnections. |
Degree | Doctor of Philosophy |
Subject | Nucleotide sequence - Data processing; Gene mapping - Data processing |
Dept/Program | Computer Science |
Persistent Identifier | http://hdl.handle.net/10722/255439 |
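
The abstract reports positive predictive value (PPV) and sensitivity for the predicted RNA secondary structures. As a minimal sketch of how these two metrics are commonly computed over base pairs (this is not the thesis's own evaluation code), the Python below compares a predicted and a reference structure written in dot-bracket notation, with square brackets marking a pseudoknotted helix. The dot-bracket encoding, the function names `base_pairs` and `ppv_and_sensitivity`, and the example structures are assumptions made for illustration.

```python
def base_pairs(structure):
    """Return the set of (i, j) base pairs encoded in a dot-bracket string.

    Each bracket type gets its own stack, so '()' and '[]' may cross,
    which is how pseudoknots are usually written in this notation.
    """
    closers = {")": "(", "]": "[", "}": "{", ">": "<"}
    stacks = {opener: [] for opener in closers.values()}
    pairs = set()
    for j, symbol in enumerate(structure):
        if symbol in stacks:
            stacks[symbol].append(j)
        elif symbol in closers:
            stack = stacks[closers[symbol]]
            if not stack:
                raise ValueError(f"unmatched {symbol!r} at position {j}")
            pairs.add((stack.pop(), j))
    if any(stacks.values()):
        raise ValueError("unmatched opening bracket")
    return pairs


def ppv_and_sensitivity(predicted, reference):
    """PPV = TP / predicted pairs; sensitivity = TP / reference pairs."""
    pred, ref = base_pairs(predicted), base_pairs(reference)
    true_positives = len(pred & ref)
    ppv = true_positives / len(pred) if pred else 0.0
    sensitivity = true_positives / len(ref) if ref else 0.0
    return ppv, sensitivity


# Hypothetical example: the prediction recovers the nested helix
# but misses the pseudoknotted one written with square brackets.
reference = "((..[[..))..]]"
predicted = "((......))...."
print(ppv_and_sensitivity(predicted, reference))
```

In this example every predicted pair is correct, so PPV is high, while half of the reference pairs are missed, so sensitivity is lower. That gap is the kind of difference the two metrics are meant to expose for pseudoknotted RNAs.
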
DC Field | Value | Language |
---|---|---|
dc.contributor.advisor | Yiu, SM | - |
dc.contributor.author | Li, Menglu | - |
dc.contributor.author | 李夢璐 | - |
dc.date.accessioned | 2018-07-05T07:43:33Z | - |
dc.date.available | 2018-07-05T07:43:33Z | - |
dc.date.issued | 2018 | - |
dc.identifier.citation | Li, M. [李夢璐]. (2018). Genome analyses based on RNA sequences and DNA optical maps. (Thesis). University of Hong Kong, Pokfulam, Hong Kong SAR. | - |
dc.identifier.uri | http://hdl.handle.net/10722/255439 | - |
dc.description.abstract | DNA, RNA and proteins are three major macromolecules essential for all known forms of life. The genetic information within a biological system naturally transfers from DNA to RNA and from RNA to protein by transcription and translation. Investigations of DNA optical maps and RNA secondary structures are challenging tasks that expand our understanding of genomics. An RNA molecule exists in nature as a single strand of nucleotides folding back on itself. This is because some bases are paired up by hydrogen bonds, forming base-pair interactions termed RNA secondary structures. For many RNA molecules, their secondary structures are often more important for correct function (gene expression, regulation, catalysis and cellular signal communication) than their plain sequences. Although the majority of RNAs fold into simple secondary structures, pseudoknots are found in almost all classes of RNAs. Their existence makes secondary structure prediction NP-hard. In order to predict complicated secondary structures, we devised a grammar-based machine learning method to predict secondary structures for all RNA sequences in Rfam. Regarding every structure as a unique operation path that generates it, we are able to train a rule transition probability matrix and a base emission probability matrix. These matrices determine the operation path to generate a secondary structure for a given RNA sequence. Experimental results show that our approach performs well with a high PPV and sensitivity, particularly for highly pseudoknotted RNAs. DNA molecules are inherently fully paired double helices that store biological information and encode genetic instructions. In recent years, next-generation sequencing technologies have enabled researchers to discover critical findings in genomics with low cost and high efficiency. On the other hand, the short read length remains a major obstacle for thorough structural analyses such as de novo assembly and structural variation detection. As a complement, optical mapping is a high-throughput technique that produces long, high-resolution restriction maps. To lay a good basis for optical map studies, we first conducted a probabilistic error study on the alignment results of BioNano RefAligner on CEU trio maps. Sizing error, false cuts, missing cuts and unknown molecule orientation are carefully modeled using maximum likelihood estimation. Using the trio of samples and simulated datasets, this error model exhibits a better fit to BioNano optical maps than the previous model. In predicting the difficult regions that are prone to higher error rates, our error model is more accurate than other popular error models. Taking the optical map investigation one step further, an iterative framework is proposed to assemble optical maps into contigs. Each iteration begins with pairwise alignments among all input optical maps. Confident alignments compose an overlap graph. By careful graph correction and path search, each connected graph component yields a contig. The assembly process iterates by taking the resulting contigs as new inputs. The algorithm stops when contigs no longer extend or merge. Experiments on E. coli simulated and real datasets show that our assembler is capable of constructing long and accurate consensus maps without misconnections. | - |
dc.language | eng | - |
dc.publisher | The University of Hong Kong (Pokfulam, Hong Kong) | - |
dc.relation.ispartof | HKU Theses Online (HKUTO) | - |
dc.rights | The author retains all proprietary rights, (such as patent rights) and the right to use in future works. | - |
dc.rights | This work is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License. | - |
dc.subject.lcsh | Nucleotide sequence - Data processing | - |
dc.subject.lcsh | Gene mapping - Data processing | - |
dc.title | Genome analyses based on RNA sequences and DNA optical maps | - |
dc.type | PG_Thesis | - |
dc.description.thesisname | Doctor of Philosophy | - |
dc.description.thesislevel | Doctoral | - |
dc.description.thesisdiscipline | Computer Science | - |
dc.description.nature | published_or_final_version | - |
dc.identifier.doi | 10.5353/th_991044019484603414 | - |
dc.date.hkucongregation | 2018 | - |
dc.identifier.mmsid | 991044019484603414 | - |
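
The assembly framework described in the abstract builds an overlap graph from confident pairwise alignments and derives one contig per connected component. The sketch below illustrates only that component-grouping step in Python. It assumes alignments arrive as `(map_a, map_b, score)` tuples and that a hypothetical `min_score` threshold decides which alignments count as confident; neither detail comes from the thesis, and the graph correction, path search and iteration it describes are not reproduced here.

```python
from collections import defaultdict


def connected_components(map_ids, alignments, min_score):
    """Group optical maps linked by confident alignments.

    map_ids    -- iterable of map identifiers (hypothetical names)
    alignments -- iterable of (map_a, map_b, score) tuples (assumed format)
    min_score  -- assumed confidence cutoff; lower-scoring pairs are ignored
    """
    # Build an undirected overlap graph from the confident alignments only.
    graph = defaultdict(set)
    for a, b, score in alignments:
        if score >= min_score:
            graph[a].add(b)
            graph[b].add(a)

    seen = set()
    components = []
    for start in map_ids:
        if start in seen:
            continue
        # Depth-first traversal collects every map reachable from `start`.
        seen.add(start)
        stack = [start]
        component = []
        while stack:
            node = stack.pop()
            component.append(node)
            for neighbour in graph.get(node, ()):
                if neighbour not in seen:
                    seen.add(neighbour)
                    stack.append(neighbour)
        components.append(sorted(component))
    return components


# Hypothetical input: the weak m3-m4 alignment is discarded,
# so the maps split into two candidate contig groups.
maps = ["m1", "m2", "m3", "m4", "m5"]
alignments = [
    ("m1", "m2", 0.90),
    ("m2", "m3", 0.80),
    ("m3", "m4", 0.20),
    ("m4", "m5", 0.95),
]
print(connected_components(maps, alignments, min_score=0.5))
```

In an iterative scheme like the one the abstract outlines, each group would be turned into a consensus map and fed back in as input until no group extends or merges further.
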