
postgraduate thesis: Statistical learning by embedding data into computational graphs

Title: Statistical learning by embedding data into computational graphs
Authors: Gu, Jiaqi
Issue Date: 2022
Publisher: The University of Hong Kong (Pokfulam, Hong Kong)
Citation: Gu, J. (2022). Statistical learning by embedding data into computational graphs. (Thesis). University of Hong Kong, Pokfulam, Hong Kong SAR.
Abstract: As the dimension of real-world data continues to grow in the era of big data, various techniques have been proposed to obtain embeddings that retain the important information in the data. By mapping observations to embeddings with either lower dimension or less noise, the signal-to-noise ratio of the data is increased, leading to statistical learning approaches with higher statistical and computational efficiency. This thesis introduces a new framework of data embedding techniques. By representing observations as records or nodes of a computational graph, we develop several statistical learning methods that make statistical inference using the network structures of different types of data.

In the first part, a nonparametric node clustering approach with node covariates is developed for relational data. By defining the triangular concordance index between links and latent positions of nodes, we propose triangular concordance learning, which estimates the latent positions by maximizing the penalized triangular concordance function. Without a prespecified number of clusters, the fused penalty shrinks together the node-specific centers of nodes with similar link patterns and provides an estimated community structure of the nodes. In addition, an individualized criterion for the linkage of nodes is obtained to predict unobserved or future links in a nonparametric way.

In the second part, we discuss the efficient computation of the maximum likelihood estimate (MLE) under generalized multinomial models. From the necessary condition that the gradient of the log-likelihood function equals $0$ at the MLE, we show theoretically that the MLE corresponds to the stationary distribution of an inhomogeneous Markov chain indexed by the MLE itself. Therefore, observations under generalized multinomial models are interpreted as win-loss records of a tournament network, and a Markov chain based algorithm is developed to compute the MLE efficiently.

In the third part, we suggest that the Delaunay triangulation implies a geometry-based network structure of datasets with the highest level of smoothness. Based on this interpretation, we incorporate the Delaunay triangulation into nonparametric regression and develop crystallization learning to estimate the conditional expectation function with computational efficiency. Compared to existing approaches, crystallization learning and its variants can select neighbor data points uniformly in all directions and thus are robust to the local geometric structure of the data, leading to better estimation performance on both synthetic and real data.

In the final part, a greedy algorithm is developed to compress 3D point cloud data with a triangular network structure. Based on a local retriangulation method, which uses the network structure to fill in the resultant hole and compute the information loss caused by the removal of each point, the proposed algorithm progressively removes the least informative point, so that local features of the network structure are polished earlier than global features. In addition, a rank-based procedure is proposed to detect the change point of information loss across iterations; it is used to select the optimal compression rate while maintaining the approximation quality of the network structure.
Degree: Doctor of Philosophy
Subject: Mathematical statistics - Data processing; Computer graphics
Dept/Program: Statistics and Actuarial Science
Persistent Identifier: http://hdl.handle.net/10722/325767
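
The second part of the abstract states that the MLE is the stationary distribution of a Markov chain indexed by the MLE itself. The sketch below illustrates that fixed-point idea on the Bradley-Terry pairwise-comparison model, taken here as an assumed special case; it is not the thesis's algorithm, and the function names (`stationary`, `bradley_terry_mle`) and the example win counts are hypothetical. For Bradley-Terry, setting the score equations to zero gives the balance condition p_i * sum_j w_ji / (p_i + p_j) = sum_j p_j * w_ij / (p_i + p_j), where w_ij is the number of wins of i over j, so p is stationary for a chain that moves from i to j at rate w_ji / (p_i + p_j).

```python
def stationary(P, tol=1e-13, max_iter=100_000):
    """Stationary distribution of a row-stochastic matrix P by power iteration."""
    n = len(P)
    pi = [1.0 / n] * n
    for _ in range(max_iter):
        nxt = [sum(pi[i] * P[i][j] for i in range(n)) for j in range(n)]
        converged = max(abs(a - b) for a, b in zip(nxt, pi)) < tol
        pi = nxt
        if converged:
            break
    total = sum(pi)
    return [x / total for x in pi]


def bradley_terry_mle(wins, tol=1e-10, max_outer=500):
    """Fixed-point iteration: p <- stationary distribution of the chain built from p.

    wins[i][j] is the number of times item i beat item j (assumed input format).
    Assumes every item has at least one win and one loss along a connected
    comparison graph, so that the chain is irreducible.
    """
    n = len(wins)
    p = [1.0 / n] * n
    for _ in range(max_outer):
        # Rate of moving i -> j: wins of j over i, scaled by 1 / (p_i + p_j).
        rate = [[0.0 if i == j else wins[j][i] / (p[i] + p[j])
                 for j in range(n)] for i in range(n)]
        # Rescale rates into probabilities; the factor 1.1 leaves a positive
        # self-loop in every row, which keeps the chain aperiodic.
        scale = 1.1 * max(sum(row) for row in rate)
        P = [[rate[i][j] / scale for j in range(n)] for i in range(n)]
        for i in range(n):
            P[i][i] = 1.0 - sum(P[i])
        new_p = stationary(P)
        change = max(abs(a - b) for a, b in zip(new_p, p))
        p = new_p
        if change < tol:
            break
    return p


# Hypothetical win-loss records for three items, ten games per pair.
example_wins = [[0, 7, 5],
                [3, 0, 6],
                [5, 4, 0]]
example_p = bradley_terry_mle(example_wins)
```

For these example counts the score equations can be solved by hand, giving strengths proportional to (3, 2, 2), so `example_p` should be close to (3/7, 2/7, 2/7). Each outer step re-indexes the chain by the current estimate, which is the sense in which the chain is inhomogeneous in this sketch.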

 

DC Field | Value | Language
dc.contributor.author | Gu, Jiaqi | -
dc.date.accessioned | 2023-03-02T16:32:40Z | -
dc.date.available | 2023-03-02T16:32:40Z | -
dc.date.issued | 2022 | -
dc.identifier.citation | Gu, J. (2022). Statistical learning by embedding data into computational graphs. (Thesis). University of Hong Kong, Pokfulam, Hong Kong SAR. | -
dc.identifier.uri | http://hdl.handle.net/10722/325767 | -
dc.description.abstract | (same text as the Abstract field above) | -
dc.language | eng | -
dc.publisher | The University of Hong Kong (Pokfulam, Hong Kong) | -
dc.relation.ispartof | HKU Theses Online (HKUTO) | -
dc.rights | The author retains all proprietary rights, (such as patent rights) and the right to use in future works. | -
dc.rights | This work is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License. | -
dc.subject.lcsh | Mathematical statistics - Data processing | -
dc.subject.lcsh | Computer graphics | -
dc.title | Statistical learning by embedding data into computational graphs | -
dc.type | PG_Thesis | -
dc.description.thesisname | Doctor of Philosophy | -
dc.description.thesislevel | Doctoral | -
dc.description.thesisdiscipline | Statistics and Actuarial Science | -
dc.description.nature | published_or_final_version | -
dc.date.hkucongregation | 2022 | -
dc.identifier.mmsid | 991044649996203414 | -
