Learning local and global context from sequence and matrix inputs

Li, Zhen; 李鎮

DOI: 10.5353/th_991044058178003414

Appears in Collections:
- HKU Theses Online
- Computer Science: Theses

postgraduate thesis: Learning local and global context from sequence and matrix inputs

Title	Learning local and global context from sequence and matrix inputs
Authors	Li, Zhen 李鎮
Advisors	Yu, Y
Issue Date	2018
Publisher	The University of Hong Kong (Pokfulam, Hong Kong)
Citation	Li, Z. [李鎮]. (2018). Learning local and global context from sequence and matrix inputs. (Thesis). University of Hong Kong, Pokfulam, Hong Kong SAR.
Abstract	Learning local and global context from sequence and matrix inputs plays an extremely significant role in bioinformatics and computer vision problems. Taking advantage of big data and data-driven methods, in this thesis we propose novel pipelines for learning local context, global context, and integrated local-global context. First, a novel deep learning pipeline is proposed for protein secondary structure prediction. Specifically, we propose an end-to-end deep network that predicts protein secondary structures from integrated local and global contextual features. Our deep architecture leverages convolutional neural networks with different kernel sizes to extract multiscale local contextual features. In addition, to account for the long-range dependencies in amino acid sequences, we set up a bidirectional neural network consisting of gated recurrent units to capture global contextual features. Furthermore, multi-task learning is utilized to predict secondary structure labels and amino-acid solvent accessibility simultaneously. Our proposed deep network demonstrates its effectiveness by achieving state-of-the-art performance. Inspired by the success of this sequence context learning, a new base-caller, WaveNano, is presented to improve Oxford MinION nanopore basecalling. We further show that the indel issue (insertions and deletions, the main cause of the high error rate) can be significantly reduced via accurate labeling of nucleotides and moves directly from the raw signal. Our bi-directional WaveNet model with residual blocks and skip connections is able to capture the extremely long-range dependencies in the raw sequential signal. Taking the predicted moves as segmentation guidance, we employ Viterbi decoding to obtain the final basecalling results from the smoothed nucleotide probability matrix. Our proposed base-caller, WaveNano, achieves state-of-the-art performance on real MinION sequencing data from Lambda phage.
Although protein contacts contain key information for understanding protein structure, contacts predicted by existing methods that learn context from matrix inputs are still of low quality, especially for membrane proteins (MPs), which lack sufficient solved structures. A low-cost, high-throughput deep transfer learning method is proposed to first predict MP contacts by learning from non-membrane proteins (non-MPs), using integrated local and global context from sequential amino acid features and co-evolutionary matrix features, and then to predict 3D structure models using the predicted contacts as distance restraints. Tested on 510 non-redundant MPs, our method has much better contact prediction accuracy than existing ones. A rigorous blind test in CAMEO and a test on human multi-pass MPs verify the superiority of our method. Finally, we address the RGB-D scene labeling problem, which requires generating pixel-wise and fine-grained label maps from simultaneously sensed photometric (RGB) and depth channels. Our proposed pipeline solves this problem by i) developing a novel Long Short-Term Memorized Context Fusion (LSTM-CF) model that captures and fuses contextual information from multiple channels of photometric and depth data, and ii) incorporating this model into deep convolutional neural networks (CNNs) for end-to-end training. The fused contextual representation is then concatenated with the local convolutional features extracted from the photometric channels in order to improve the accuracy of fine-scale semantic labeling. Our proposed model has set a new state of the art on three main datasets.
Degree	Doctor of Philosophy
Subject	Biometric identification; Nucleotide sequence
Dept/Program	Computer Science
Persistent Identifier	http://hdl.handle.net/10722/265387

DC Field	Value	Language
dc.contributor.advisor	Yu, Y	-
dc.contributor.author	Li, Zhen	-
dc.contributor.author	李鎮	-
dc.date.accessioned	2018-11-29T06:22:32Z	-
dc.date.available	2018-11-29T06:22:32Z	-
dc.date.issued	2018	-
dc.identifier.citation	Li, Z. [李鎮]. (2018). Learning local and global context from sequence and matrix inputs. (Thesis). University of Hong Kong, Pokfulam, Hong Kong SAR.	-
dc.identifier.uri	http://hdl.handle.net/10722/265387	-
dc.description.abstract	Learning local and global context from sequence and matrix inputs plays an extremely significant role in bioinformatics and computer vision problems. Taking advantage of big data and data-driven methods, in this thesis we propose novel pipelines for learning local context, global context, and integrated local-global context. First, a novel deep learning pipeline is proposed for protein secondary structure prediction. Specifically, we propose an end-to-end deep network that predicts protein secondary structures from integrated local and global contextual features. Our deep architecture leverages convolutional neural networks with different kernel sizes to extract multiscale local contextual features. In addition, to account for the long-range dependencies in amino acid sequences, we set up a bidirectional neural network consisting of gated recurrent units to capture global contextual features. Furthermore, multi-task learning is utilized to predict secondary structure labels and amino-acid solvent accessibility simultaneously. Our proposed deep network demonstrates its effectiveness by achieving state-of-the-art performance. Inspired by the success of this sequence context learning, a new base-caller, WaveNano, is presented to improve Oxford MinION nanopore basecalling. We further show that the indel issue (insertions and deletions, the main cause of the high error rate) can be significantly reduced via accurate labeling of nucleotides and moves directly from the raw signal. Our bi-directional WaveNet model with residual blocks and skip connections is able to capture the extremely long-range dependencies in the raw sequential signal. Taking the predicted moves as segmentation guidance, we employ Viterbi decoding to obtain the final basecalling results from the smoothed nucleotide probability matrix.
Our proposed base-caller, WaveNano, achieves state-of-the-art performance on real MinION sequencing data from Lambda phage. Although protein contacts contain key information for understanding protein structure, contacts predicted by existing methods that learn context from matrix inputs are still of low quality, especially for membrane proteins (MPs), which lack sufficient solved structures. A low-cost, high-throughput deep transfer learning method is proposed to first predict MP contacts by learning from non-membrane proteins (non-MPs), using integrated local and global context from sequential amino acid features and co-evolutionary matrix features, and then to predict 3D structure models using the predicted contacts as distance restraints. Tested on 510 non-redundant MPs, our method has much better contact prediction accuracy than existing ones. A rigorous blind test in CAMEO and a test on human multi-pass MPs verify the superiority of our method. Finally, we address the RGB-D scene labeling problem, which requires generating pixel-wise and fine-grained label maps from simultaneously sensed photometric (RGB) and depth channels. Our proposed pipeline solves this problem by i) developing a novel Long Short-Term Memorized Context Fusion (LSTM-CF) model that captures and fuses contextual information from multiple channels of photometric and depth data, and ii) incorporating this model into deep convolutional neural networks (CNNs) for end-to-end training. The fused contextual representation is then concatenated with the local convolutional features extracted from the photometric channels in order to improve the accuracy of fine-scale semantic labeling. Our proposed model has set a new state of the art on three main datasets.	-
dc.language	eng	-
dc.publisher	The University of Hong Kong (Pokfulam, Hong Kong)	-
dc.relation.ispartof	HKU Theses Online (HKUTO)	-
dc.rights	The author retains all proprietary rights, (such as patent rights) and the right to use in future works.	-
dc.rights	This work is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License.	-
dc.subject.lcsh	Biometric identification	-
dc.subject.lcsh	Nucleotide sequence	-
dc.title	Learning local and global context from sequence and matrix inputs	-
dc.type	PG_Thesis	-
dc.description.thesisname	Doctor of Philosophy	-
dc.description.thesislevel	Doctoral	-
dc.description.thesisdiscipline	Computer Science	-
dc.description.nature	published_or_final_version	-
dc.identifier.doi	10.5353/th_991044058178003414	-
dc.date.hkucongregation	2018	-
dc.identifier.mmsid	991044058178003414	-
