Statistical learning by embedding data into computational graphs

Gu, Jiaqi

Appears in Collections:
- HKU Theses Online
- Statistics & Actuarial Science: Theses

postgraduate thesis: Statistical learning by embedding data into computational graphs

Title	Statistical learning by embedding data into computational graphs
Authors	Gu, Jiaqi
Issue Date	2022
Publisher	The University of Hong Kong (Pokfulam, Hong Kong)
Citation	Gu, J. (2022). Statistical learning by embedding data into computational graphs. (Thesis). University of Hong Kong, Pokfulam, Hong Kong SAR.
Abstract	As the dimension of real-world data continues to grow in the era of big data, various techniques have been proposed to obtain embeddings that retain the important information in the data. By mapping observations to embeddings with either smaller dimension or less noise, the signal-to-noise ratio of the data is increased, leading to statistical learning approaches with higher statistical and computational efficiency. This thesis introduces a new framework of data embedding techniques. By representing observations as records or nodes of a computational graph, we develop several statistical learning methods that make statistical inference using the network structures of different types of data. In the first part, a nonparametric node clustering approach with node covariates is developed for relational data. By defining the triangular concordance index between links and latent positions of nodes, we propose triangular concordance learning, which estimates the latent positions by maximizing the penalized triangular concordance function. Without a prespecified number of clusters, the fused penalty shrinks together the node-specific centers of nodes with similar link patterns and provides an estimated community structure of the nodes. In addition, an individualized criterion for linkage of nodes is obtained to predict unobserved or future links in a nonparametric way. In the second part, we discuss the efficient computation of the maximum likelihood estimate (MLE) under generalized multinomial models. From the necessary condition that the gradient of the log-likelihood function equals $0$ at the MLE, we show theoretically that the MLE corresponds to the stationary distribution of an inhomogeneous Markov chain indexed by the MLE itself. Observations under generalized multinomial models are therefore interpreted as win-loss records of a tournament network, and a Markov chain based algorithm is developed to compute the MLE efficiently.
In the third part, we suggest that the Delaunay triangulation implies a geometry-based network structure of datasets with the highest level of smoothness. Based on this interpretation, we incorporate the Delaunay triangulation into nonparametric regression and develop crystallization learning to estimate the conditional expectation function with computational efficiency. Compared to existing approaches, crystallization learning and its variants can select neighbor data points uniformly in all directions and are thus robust to the local geometric structure of the data, leading to better estimation performance on both synthetic and real data. In the final part, a greedy algorithm is developed to compress 3D point cloud data with a triangular network structure. Based on a local retriangulation method, which uses the network structure to fill in the resultant hole and compute the information loss caused by the removal of each point, the proposed algorithm progressively removes the least informative point so that local features of the network structure are polished earlier than global features. In addition, a rank-based procedure is proposed to detect the change point of information loss across iterations and is used to select the optimal compression rate while maintaining the approximation quality of the network structure.
Degree	Doctor of Philosophy
Subject	Mathematical statistics - Data processing; Computer graphics
Dept/Program	Statistics and Actuarial Science
Persistent Identifier	http://hdl.handle.net/10722/325767
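
The second part of the abstract states that the MLE is the stationary distribution of a Markov chain whose transition rates depend on the MLE itself. The following is a minimal sketch of that fixed-point idea, not the thesis's algorithm. It assumes the Bradley-Terry pairwise comparison model as a familiar stand-in for win-loss records (how it relates to the thesis's generalized multinomial models is an assumption), and it uses the well-known iterative spectral ranking scheme: build rates from the current estimate, take the stationary distribution, and repeat. All function names and the example data are hypothetical.

```python
def stationary(rates, steps=2000):
    """Stationary distribution of a chain with off-diagonal jump rates rates[i][j]."""
    n = len(rates)
    out = [sum(row) for row in rates]      # total exit rate of each state
    c = 1.1 * max(out)                     # uniformization constant, above every exit rate
    pi = [1.0 / n] * n
    for _ in range(steps):                 # power iteration on P = I + Q / c
        new = [pi[i] * (1.0 - out[i] / c) for i in range(n)]
        for i in range(n):
            for j in range(n):
                new[j] += pi[i] * rates[i][j] / c
        pi = new
    total = sum(pi)
    return [p / total for p in pi]


def bradley_terry_mle(wins, rounds=100):
    """wins[i][j] = number of times item i beat item j (assumed strongly connected)."""
    n = len(wins)
    pi = [1.0 / n] * n
    for _ in range(rounds):
        # Probability mass flows from i to the items that beat i; rates depend on pi.
        rates = [[0.0 if i == j else wins[j][i] / (pi[i] + pi[j])
                  for j in range(n)] for i in range(n)]
        pi = stationary(rates)
    return pi


def score_residual(wins, pi):
    """Observed wins minus expected wins for each item (zero at the MLE)."""
    n = len(wins)
    return [sum(wins[i][j] - (wins[i][j] + wins[j][i]) * pi[i] / (pi[i] + pi[j])
                for j in range(n) if j != i) for i in range(n)]


# Hypothetical round-robin record among three items.
wins = [[0, 3, 2],
        [1, 0, 3],
        [2, 1, 0]]
pi_hat = bradley_terry_mle(wins)
print(pi_hat, score_residual(wins, pi_hat))
```

At a fixed point, the balance equations of the chain coincide with the condition that the gradient of the Bradley-Terry log-likelihood is zero, which is why the residuals returned by `score_residual` should vanish.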

DC Field	Value	Language
dc.contributor.author	Gu, Jiaqi	-
dc.date.accessioned	2023-03-02T16:32:40Z	-
dc.date.available	2023-03-02T16:32:40Z	-
dc.date.issued	2022	-
dc.identifier.citation	Gu, J. (2022). Statistical learning by embedding data into computational graphs. (Thesis). University of Hong Kong, Pokfulam, Hong Kong SAR.	-
dc.identifier.uri	http://hdl.handle.net/10722/325767	-
dc.language	eng	-
dc.publisher	The University of Hong Kong (Pokfulam, Hong Kong)	-
dc.relation.ispartof	HKU Theses Online (HKUTO)	-
dc.rights	The author retains all proprietary rights (such as patent rights) and the right to use in future works.	-
dc.rights	This work is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License.	-
dc.subject.lcsh	Mathematical statistics - Data processing	-
dc.subject.lcsh	Computer graphics	-
dc.title	Statistical learning by embedding data into computational graphs	-
dc.type	PG_Thesis	-
dc.description.thesisname	Doctor of Philosophy	-
dc.description.thesislevel	Doctoral	-
dc.description.thesisdiscipline	Statistics and Actuarial Science	-
dc.description.nature	published_or_final_version	-
dc.date.hkucongregation	2022	-
dc.identifier.mmsid	991044649996203414	-
