Some topics on deep learning and text analytics

Tang, Yaohua; 汤耀华

DOI: 10.5353/th_991043976390303414

Appears in Collections:
- Statistics & Actuarial Science: Theses
- HKU Theses Online

postgraduate thesis: Some topics on deep learning and text analytics

Title	Some topics on deep learning and text analytics
Authors	Tang, Yaohua 汤耀华
Advisors	Yu, PLH
Issue Date	2017
Publisher	The University of Hong Kong (Pokfulam, Hong Kong)
Citation	Tang, Y. [汤耀华]. (2017). Some topics on deep learning and text analytics. (Thesis). University of Hong Kong, Pokfulam, Hong Kong SAR.
Abstract	In this thesis, I study several problems in text analytics and deep learning: grammar analysis, machine translation and realized covariance matrix modelling. The methods used fall mainly under the Bayesian and deep learning frameworks. In the first part, I study the unsupervised grammar induction problem. I propose an unsupervised probabilistic framework to extract hidden probabilistic context-free grammars (PCFGs) that are common across the authors of the articles. Instead of using a single grammar to parse all texts, I assume that there are several PCFGs that share the same CFG but have different rule probabilities. Each article in the corpus is generated from a random mixture of these PCFGs, with proportions drawn from a Dirichlet distribution. The sentences in an article are obtained by repeatedly choosing a PCFG according to these proportions and then applying its rules recursively, from larger to smaller spans. I derive two inference algorithms for the model: a variational Bayes algorithm and a Markov chain Monte Carlo algorithm. In the experiments, the multi-grammar model outperforms both the single-grammar Bayes model and the Inside-Outside algorithm. In the second part, I attempt to incorporate a phrase table into neural machine translation models, which are word-based. I present phraseNet, a neural machine translator with a phrase memory that stores phrase pairs mined from a corpus or specified by human experts. For any given source sentence, phraseNet scans the phrase memory to determine the candidate phrase pairs and integrates the corresponding tagging information into the representation of the source sentence. Two variant decoders are proposed that mix a word-generating component and a phrase-generating component, with a specifically designed strategy to generate a sequence of multiple words all at once. PhraseNet not only takes a step towards incorporating external knowledge into neural machine translation, but also attempts to extend the word-by-word generation mechanism of recurrent neural networks. The empirical study on Chinese-to-English translation shows that, with a carefully chosen phrase table in memory, phraseNet yields a 3.45 BLEU improvement over the generic neural machine translator. Lastly, I consider two Fully Convolutional Network models for modelling and forecasting high-dimensional realized covariance matrices. It is well known that modelling and forecasting covariance matrices of asset returns play a crucial role in finance. The availability of high-frequency intraday data enables the realized covariance matrix to be modelled directly. However, most of the models available in the literature depend on strong structural assumptions and often suffer from the curse of dimensionality. In this thesis, I propose two non-parametric models built on the Convolutional Neural Network that require no distributional or structural assumptions yet can handle high-dimensional realized covariance matrices. The proposed models focus on local structures and learn a nonlinear mapping that connects the historical realized covariance matrices to the future one. My empirical studies demonstrate their excellent forecasting ability compared with several advanced volatility models. I also visualize the learned filters to illustrate the models.
Degree	Doctor of Philosophy
Subject	Mathematical linguistics; Computational linguistics - Statistical methods
Dept/Program	Statistics and Actuarial Science
Persistent Identifier	http://hdl.handle.net/10722/249887
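
The generative process described in the first part of the abstract (a Dirichlet draw of grammar proportions per article, then one PCFG chosen per sentence and expanded recursively) can be illustrated with a minimal sketch. This is not the thesis implementation: the toy CFG, the two sets of rule probabilities, and the names RULES, GRAMMARS, sample_dirichlet, expand and generate_article are all hypothetical, chosen only to make the steps concrete.

```python
import random

# Hypothetical toy CFG shared by every grammar: nonterminal -> right-hand sides.
RULES = {
    "S": [("NP", "VP")],
    "NP": [("dogs",), ("cats",)],
    "VP": [("sleep",), ("chase", "NP")],
}

# Two hypothetical PCFGs over the same CFG; each probability list is
# aligned with the right-hand sides listed in RULES.
GRAMMARS = [
    {"S": [1.0], "NP": [0.9, 0.1], "VP": [0.8, 0.2]},
    {"S": [1.0], "NP": [0.2, 0.8], "VP": [0.3, 0.7]},
]


def sample_dirichlet(alpha, rng):
    """Draw from Dirichlet(alpha) by normalizing independent Gamma draws."""
    draws = [rng.gammavariate(a, 1.0) for a in alpha]
    total = sum(draws)
    return [d / total for d in draws]


def expand(symbol, probs, rng):
    """Rewrite a symbol recursively, from larger to smaller spans, into words."""
    if symbol not in RULES:  # terminal symbol: emit the word itself
        return [symbol]
    rhs = rng.choices(RULES[symbol], weights=probs[symbol], k=1)[0]
    words = []
    for child in rhs:
        words.extend(expand(child, probs, rng))
    return words


def generate_article(n_sentences, alpha, rng):
    """Sample article-level grammar proportions, then one PCFG per sentence."""
    theta = sample_dirichlet(alpha, rng)
    sentences = []
    for _ in range(n_sentences):
        k = rng.choices(range(len(GRAMMARS)), weights=theta, k=1)[0]
        sentences.append((k, expand("S", GRAMMARS[k], rng)))
    return theta, sentences


if __name__ == "__main__":
    rng = random.Random(0)
    theta, sentences = generate_article(5, [1.0, 1.0], rng)
    print("proportions:", theta)
    for k, words in sentences:
        print(f"grammar {k}:", " ".join(words))
```

The sketch covers only the forward (sampling) direction. The variational Bayes and Markov chain Monte Carlo inference algorithms mentioned in the abstract, which recover the proportions and rule probabilities from text, are not shown.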

DC Field	Value	Language
dc.contributor.advisor	Yu, PLH	-
dc.contributor.author	Tang, Yaohua	-
dc.contributor.author	汤耀华	-
dc.date.accessioned	2017-12-19T09:27:38Z	-
dc.date.available	2017-12-19T09:27:38Z	-
dc.date.issued	2017	-
dc.identifier.citation	Tang, Y. [汤耀华]. (2017). Some topics on deep learning and text analytics. (Thesis). University of Hong Kong, Pokfulam, Hong Kong SAR.	-
dc.identifier.uri	http://hdl.handle.net/10722/249887	-
dc.description.abstract	In this thesis, I study several problems in text analytics and deep learning: grammar analysis, machine translation and realized covariance matrix modelling. The methods used fall mainly under the Bayesian and deep learning frameworks. In the first part, I study the unsupervised grammar induction problem. I propose an unsupervised probabilistic framework to extract hidden probabilistic context-free grammars (PCFGs) that are common across the authors of the articles. Instead of using a single grammar to parse all texts, I assume that there are several PCFGs that share the same CFG but have different rule probabilities. Each article in the corpus is generated from a random mixture of these PCFGs, with proportions drawn from a Dirichlet distribution. The sentences in an article are obtained by repeatedly choosing a PCFG according to these proportions and then applying its rules recursively, from larger to smaller spans. I derive two inference algorithms for the model: a variational Bayes algorithm and a Markov chain Monte Carlo algorithm. In the experiments, the multi-grammar model outperforms both the single-grammar Bayes model and the Inside-Outside algorithm. In the second part, I attempt to incorporate a phrase table into neural machine translation models, which are word-based. I present phraseNet, a neural machine translator with a phrase memory that stores phrase pairs mined from a corpus or specified by human experts. For any given source sentence, phraseNet scans the phrase memory to determine the candidate phrase pairs and integrates the corresponding tagging information into the representation of the source sentence. Two variant decoders are proposed that mix a word-generating component and a phrase-generating component, with a specifically designed strategy to generate a sequence of multiple words all at once. PhraseNet not only takes a step towards incorporating external knowledge into neural machine translation, but also attempts to extend the word-by-word generation mechanism of recurrent neural networks. The empirical study on Chinese-to-English translation shows that, with a carefully chosen phrase table in memory, phraseNet yields a 3.45 BLEU improvement over the generic neural machine translator. Lastly, I consider two Fully Convolutional Network models for modelling and forecasting high-dimensional realized covariance matrices. It is well known that modelling and forecasting covariance matrices of asset returns play a crucial role in finance. The availability of high-frequency intraday data enables the realized covariance matrix to be modelled directly. However, most of the models available in the literature depend on strong structural assumptions and often suffer from the curse of dimensionality. In this thesis, I propose two non-parametric models built on the Convolutional Neural Network that require no distributional or structural assumptions yet can handle high-dimensional realized covariance matrices. The proposed models focus on local structures and learn a nonlinear mapping that connects the historical realized covariance matrices to the future one. My empirical studies demonstrate their excellent forecasting ability compared with several advanced volatility models. I also visualize the learned filters to illustrate the models.	-
dc.language	eng	-
dc.publisher	The University of Hong Kong (Pokfulam, Hong Kong)	-
dc.relation.ispartof	HKU Theses Online (HKUTO)	-
dc.rights	The author retains all proprietary rights (such as patent rights) and the right to use in future works.	-
dc.rights	This work is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License.	-
dc.subject.lcsh	Mathematical linguistics	-
dc.subject.lcsh	Computational linguistics - Statistical methods	-
dc.title	Some topics on deep learning and text analytics	-
dc.type	PG_Thesis	-
dc.description.thesisname	Doctor of Philosophy	-
dc.description.thesislevel	Doctoral	-
dc.description.thesisdiscipline	Statistics and Actuarial Science	-
dc.description.nature	published_or_final_version	-
dc.identifier.doi	10.5353/th_991043976390303414	-
dc.date.hkucongregation	2017	-
dc.identifier.mmsid	991043976390303414	-
