- Appears in Collections: postgraduate thesis: Integration and processing of large-scale biomedical data
Title | Integration and processing of large-scale biomedical data |
---|---|
Authors | Zhang, Wenhua (张闻华) |
Advisors | Pan, J; Wang, WP |
Issue Date | 2023 |
Publisher | The University of Hong Kong (Pokfulam, Hong Kong) |
Citation | Zhang, W. [张闻华]. (2023). Integration and processing of large-scale biomedical data. (Thesis). University of Hong Kong, Pokfulam, Hong Kong SAR. |
Abstract | With improved data collection methods, high-quality data abounds in medical imaging and other fields. While many algorithms have emerged to detect, segment, or classify the data, few methods have been proposed to re-organize or mine them. It is therefore important to develop approaches that dig into the data and fully exploit them. This thesis tackles three integration and processing problems for large-scale biomedical data: labeled dataset merging, self-supervised training on unlabeled datasets, and scalable mesh generation for volumetric data. The first part of this thesis addresses the problem of integrating inconsistent datasets. A large amount of labeled data is required to train effective nucleus classification models. However, labeling a large-scale nucleus classification dataset is challenging, since high-quality labeling requires specific domain knowledge and tremendous effort. In addition, existing public datasets are often inconsistently labeled. Due to this inconsistency, conventional models have to be trained on each dataset independently, which limits classification performance. To fully utilize the available annotations, we propose a method to integrate all the annotated datasets. Specifically, we formulate the task as a multi-label classification problem with missing labels, so that all the datasets can be used in a unified framework. Besides the substantial improvement over other methods, the resulting dataset also has a uniform format that can support future research on nucleus classification. The second part of this thesis addresses representation learning for nucleus instance classification. Unlike annotated data, which is limited in scale, unlabeled data is usually available at large scale. We therefore design a self-supervised method for representation learning on unlabeled datasets to alleviate the burden of data annotation. Moreover, previous methods often downplay the contextual information that is critical for classification. To provide this information explicitly, we design a new structured input consisting of a content-rich image patch and a target instance mask. Building on this structured input format, we propose Structured Triplet, a triplet learning framework for unlabeled nucleus instances with customized sampling strategies, and add two auxiliary branches to further improve its performance. Results show that our model reduces the burden of extensive labeling by fully exploiting large-scale unlabeled data. The third part of this thesis considers scalable mesh generation for volumetric data with multiple materials. With improved imaging quality and increased resolution, volumetric datasets have become so large that existing tools are inadequate for processing and analyzing them. We consider the problem of computing tetrahedral meshes to represent these large volumetric datasets. We propose a novel approach, called Marching Windows, that uses a moving window and a disk-swap strategy to reduce the run-time memory footprint. We also devise a new scheme that is guaranteed to preserve the topological structure of the original dataset, and adopt an error-guided optimization technique to reduce the geometric approximation error and improve mesh quality. Extensive experiments show that our method can process very large volumetric datasets beyond the capability of existing methods and produce tetrahedral meshes of high quality. |
Degree | Doctor of Philosophy |
Subject | Medical informatics - Data processing; Biomedical engineering - Data processing |
Dept/Program | Computer Science |
Persistent Identifier | http://hdl.handle.net/10722/328917 |
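The first part of the abstract formulates dataset integration as multi-label classification with missing labels. The thesis does not spell out its loss function here, so the following is only a minimal sketch, assuming one common way to realize that formulation: a binary cross-entropy in which each sample carries a per-class mask marking which classes its source dataset actually annotates, so missing labels contribute nothing to the loss. All names (`masked_bce_loss`, the toy arrays) are illustrative, not from the thesis.

```python
import numpy as np

def masked_bce_loss(logits, labels, mask, eps=1e-7):
    """Binary cross-entropy over nucleus classes, averaged only over
    positions where a label is actually provided (mask == 1).

    logits: (n, c) raw scores; labels: (n, c) in {0, 1};
    mask: (n, c), 1 where the label is known, 0 where missing.
    """
    probs = 1.0 / (1.0 + np.exp(-logits))            # sigmoid
    probs = np.clip(probs, eps, 1.0 - eps)           # numerical safety
    bce = -(labels * np.log(probs) + (1 - labels) * np.log(1 - probs))
    # Missing labels are zeroed out: they affect neither loss nor gradient.
    return (bce * mask).sum() / max(mask.sum(), 1.0)

# Two samples from datasets that annotate different class subsets:
# sample 0's dataset does not label class 2, sample 1's does not label class 0.
logits = np.array([[2.0, -1.0, 0.5], [-0.5, 1.5, -2.0]])
labels = np.array([[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]])
mask   = np.array([[1.0, 1.0, 0.0], [0.0, 1.0, 1.0]])
loss = masked_bce_loss(logits, labels, mask)
```

Under this scheme every dataset can be fed to one shared model, which is the practical point of the unified framework the abstract describes.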
DC Field | Value | Language |
---|---|---|
dc.contributor.advisor | Pan, J | - |
dc.contributor.advisor | Wang, WP | - |
dc.contributor.author | Zhang, Wenhua | - |
dc.contributor.author | 张闻华 | - |
dc.date.accessioned | 2023-08-01T06:48:14Z | - |
dc.date.available | 2023-08-01T06:48:14Z | - |
dc.date.issued | 2023 | - |
dc.identifier.citation | Zhang, W. [张闻华]. (2023). Integration and processing of large-scale biomedical data. (Thesis). University of Hong Kong, Pokfulam, Hong Kong SAR. | - |
dc.identifier.uri | http://hdl.handle.net/10722/328917 | - |
dc.description.abstract | With improved data collection methods, high-quality data abounds in medical imaging and other fields. While many algorithms have emerged to detect, segment, or classify the data, few methods have been proposed to re-organize or mine them. It is therefore important to develop approaches that dig into the data and fully exploit them. This thesis tackles three integration and processing problems for large-scale biomedical data: labeled dataset merging, self-supervised training on unlabeled datasets, and scalable mesh generation for volumetric data. The first part of this thesis addresses the problem of integrating inconsistent datasets. A large amount of labeled data is required to train effective nucleus classification models. However, labeling a large-scale nucleus classification dataset is challenging, since high-quality labeling requires specific domain knowledge and tremendous effort. In addition, existing public datasets are often inconsistently labeled. Due to this inconsistency, conventional models have to be trained on each dataset independently, which limits classification performance. To fully utilize the available annotations, we propose a method to integrate all the annotated datasets. Specifically, we formulate the task as a multi-label classification problem with missing labels, so that all the datasets can be used in a unified framework. Besides the substantial improvement over other methods, the resulting dataset also has a uniform format that can support future research on nucleus classification. The second part of this thesis addresses representation learning for nucleus instance classification. Unlike annotated data, which is limited in scale, unlabeled data is usually available at large scale. We therefore design a self-supervised method for representation learning on unlabeled datasets to alleviate the burden of data annotation. Moreover, previous methods often downplay the contextual information that is critical for classification. To provide this information explicitly, we design a new structured input consisting of a content-rich image patch and a target instance mask. Building on this structured input format, we propose Structured Triplet, a triplet learning framework for unlabeled nucleus instances with customized sampling strategies, and add two auxiliary branches to further improve its performance. Results show that our model reduces the burden of extensive labeling by fully exploiting large-scale unlabeled data. The third part of this thesis considers scalable mesh generation for volumetric data with multiple materials. With improved imaging quality and increased resolution, volumetric datasets have become so large that existing tools are inadequate for processing and analyzing them. We consider the problem of computing tetrahedral meshes to represent these large volumetric datasets. We propose a novel approach, called Marching Windows, that uses a moving window and a disk-swap strategy to reduce the run-time memory footprint. We also devise a new scheme that is guaranteed to preserve the topological structure of the original dataset, and adopt an error-guided optimization technique to reduce the geometric approximation error and improve mesh quality. Extensive experiments show that our method can process very large volumetric datasets beyond the capability of existing methods and produce tetrahedral meshes of high quality. | - |
dc.language | eng | - |
dc.publisher | The University of Hong Kong (Pokfulam, Hong Kong) | - |
dc.relation.ispartof | HKU Theses Online (HKUTO) | - |
dc.rights | The author retains all proprietary rights (such as patent rights) and the right to use in future works. | - |
dc.rights | This work is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License. | - |
dc.subject.lcsh | Medical informatics - Data processing | - |
dc.subject.lcsh | Biomedical engineering - Data processing | - |
dc.title | Integration and processing of large-scale biomedical data | - |
dc.type | PG_Thesis | - |
dc.description.thesisname | Doctor of Philosophy | - |
dc.description.thesislevel | Doctoral | - |
dc.description.thesisdiscipline | Computer Science | - |
dc.description.nature | published_or_final_version | - |
dc.date.hkucongregation | 2023 | - |
dc.identifier.mmsid | 991044705906303414 | - |
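The abstract's second part combines a structured input (a context patch stacked with a target instance mask) with triplet learning. As a rough illustration only, the sketch below pairs the standard triplet margin loss with that channel-stacking idea; the function names, shapes, and toy values are assumptions, and the thesis's actual sampling strategies and auxiliary branches are not modeled here.

```python
import numpy as np

def triplet_margin_loss(anchor, positive, negative, margin=1.0):
    """Standard triplet margin loss on embedding vectors: pull the
    positive toward the anchor, push the negative at least `margin`
    farther away than the positive."""
    d_pos = np.linalg.norm(anchor - positive)
    d_neg = np.linalg.norm(anchor - negative)
    return max(d_pos - d_neg + margin, 0.0)

def structured_input(patch, instance_mask):
    """Stack a content-rich image patch (C, H, W) with a binary mask
    (H, W) of the target nucleus instance, so an embedding network sees
    both the surrounding context and which instance to describe."""
    return np.concatenate([patch, instance_mask[None, ...]], axis=0)

# Toy embeddings: anchor and positive close, negative far away,
# so this triplet is already satisfied and incurs zero loss.
a = np.array([0.0, 1.0])
p = np.array([0.1, 1.0])
n = np.array([3.0, -2.0])
loss = triplet_margin_loss(a, p, n)

# A 3-channel 8x8 patch plus its 1-channel instance mask becomes a
# 4-channel input tensor for the embedding network.
x = structured_input(np.zeros((3, 8, 8)), np.zeros((8, 8)))
```

The mask channel is what makes the input "structured": without it, the network could not tell which of several nuclei in the context patch is the classification target.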