- Appears in Collections: postgraduate thesis: Integration and processing of large-scale biomedical data
Title | Integration and processing of large-scale biomedical data |
---|---|
Authors | Zhang, Wenhua (张闻华) |
Advisors | Pan, J; Wang, WP |
Issue Date | 2023 |
Publisher | The University of Hong Kong (Pokfulam, Hong Kong) |
Citation | Zhang, W. [张闻华]. (2023). Integration and processing of large-scale biomedical data. (Thesis). University of Hong Kong, Pokfulam, Hong Kong SAR. |
Abstract | With improved data collection methods, high-quality data abounds in medical imaging and other fields. While many algorithms have emerged to detect, segment, or classify the data, few methods have been proposed to re-organize or mine them. It is therefore important to develop approaches that dig into the data and fully exploit them. This thesis tackles three integration and processing problems for large-scale biomedical data: labeled dataset merging, self-supervised training on unlabeled datasets, and scalable mesh generation for volumetric data. The first part of this thesis addresses the problem of integrating inconsistent datasets. A large amount of labeled data is required to train effective nucleus classification models. However, labeling a large-scale nucleus classification dataset is challenging, since high-quality labeling requires specific domain knowledge and tremendous effort. In addition, existing public datasets are often inconsistently labeled. Due to this inconsistency, conventional models have to be trained on each dataset independently, which limits classification performance. To fully utilize the available annotations, we propose a method to integrate all the annotated datasets. Specifically, we formulate the task as a multi-label classification problem with missing labels, so that all the datasets can be used in a unified framework. Besides the substantial improvement over other methods, the resulting dataset also has a uniform format that can support future research on nucleus classification. The second part of this thesis addresses representation learning for nucleus instance classification. Unlike annotated data, which is limited in scale, unlabeled data is usually available at large scale. We therefore design a self-supervised method for representation learning on unlabeled datasets to alleviate the burden of data annotation. Moreover, previous methods often downplay the contextual information that is critical for classification. To provide this information explicitly, we design a new structured input consisting of a content-rich image patch and a target instance mask. Building on this structured input format, we propose Structured Triplet, a triplet learning framework for unlabeled nucleus instances with customized sampling strategies, and add two auxiliary branches to further improve its performance. Results show that our model reduces the burden of extensive labeling by fully exploiting large-scale unlabeled data. The third part of this thesis considers scalable mesh generation for volumetric data with multiple materials. With improved imaging quality and increased resolution, volumetric datasets have become so large that existing tools are inadequate for processing and analyzing them. We consider the problem of computing tetrahedral meshes to represent these large volumetric datasets. We propose a novel approach, called Marching Windows, that uses a moving window and a disk-swap strategy to reduce the run-time memory footprint. We also devise a new scheme that is guaranteed to preserve the topological structure of the original dataset, and adopt an error-guided optimization technique to reduce the geometric approximation error and improve mesh quality. Extensive experiments show that our method can process very large volumetric datasets beyond the capability of existing methods and produce tetrahedral meshes of high quality. |
Degree | Doctor of Philosophy |
Subject | Medical informatics - Data processing; Biomedical engineering - Data processing |
Dept/Program | Computer Science |
Persistent Identifier | http://hdl.handle.net/10722/328917 |
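The first part of the abstract formulates dataset integration as multi-label classification with missing labels. The thesis does not spell out its loss function here, so the following is only a minimal sketch, assuming one common way to realize that formulation: a binary cross-entropy in which each sample carries a per-class mask marking which classes its source dataset actually annotates, so missing labels contribute nothing to the loss. All names (`masked_bce_loss`, the toy arrays) are illustrative, not from the thesis.

```python
import numpy as np

def masked_bce_loss(logits, labels, mask, eps=1e-7):
    """Binary cross-entropy over nucleus classes, averaged only over
    positions where a label is actually provided (mask == 1).

    logits: (n, c) raw scores; labels: (n, c) in {0, 1};
    mask: (n, c), 1 where the label is known, 0 where missing.
    """
    probs = 1.0 / (1.0 + np.exp(-logits))            # sigmoid
    probs = np.clip(probs, eps, 1.0 - eps)           # numerical safety
    bce = -(labels * np.log(probs) + (1 - labels) * np.log(1 - probs))
    # Missing labels are zeroed out: they affect neither loss nor gradient.
    return (bce * mask).sum() / max(mask.sum(), 1.0)

# Two samples from datasets that annotate different class subsets:
# sample 0's dataset does not label class 2, sample 1's does not label class 0.
logits = np.array([[2.0, -1.0, 0.5], [-0.5, 1.5, -2.0]])
labels = np.array([[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]])
mask   = np.array([[1.0, 1.0, 0.0], [0.0, 1.0, 1.0]])
loss = masked_bce_loss(logits, labels, mask)
```

Under this scheme every dataset can be fed to one shared model, which is the practical point of the unified framework the abstract describes.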
DC Field | Value | Language |
---|---|---|
dc.contributor.advisor | Pan, J | - |
dc.contributor.advisor | Wang, WP | - |
dc.contributor.author | Zhang, Wenhua | - |
dc.contributor.author | 张闻华 | - |
dc.date.accessioned | 2023-08-01T06:48:14Z | - |
dc.date.available | 2023-08-01T06:48:14Z | - |
dc.date.issued | 2023 | - |
dc.identifier.citation | Zhang, W. [张闻华]. (2023). Integration and processing of large-scale biomedical data. (Thesis). University of Hong Kong, Pokfulam, Hong Kong SAR. | - |
dc.identifier.uri | http://hdl.handle.net/10722/328917 | - |
dc.description.abstract | With improved data collection methods, high-quality data abounds in medical imaging and other fields. While many algorithms have emerged to detect, segment, or classify the data, few methods have been proposed to re-organize or mine them. It is therefore important to develop approaches that dig into the data and fully exploit them. This thesis tackles three integration and processing problems for large-scale biomedical data: labeled dataset merging, self-supervised training on unlabeled datasets, and scalable mesh generation for volumetric data. The first part of this thesis addresses the problem of integrating inconsistent datasets. A large amount of labeled data is required to train effective nucleus classification models. However, labeling a large-scale nucleus classification dataset is challenging, since high-quality labeling requires specific domain knowledge and tremendous effort. In addition, existing public datasets are often inconsistently labeled. Due to this inconsistency, conventional models have to be trained on each dataset independently, which limits classification performance. To fully utilize the available annotations, we propose a method to integrate all the annotated datasets. Specifically, we formulate the task as a multi-label classification problem with missing labels, so that all the datasets can be used in a unified framework. Besides the substantial improvement over other methods, the resulting dataset also has a uniform format that can support future research on nucleus classification. The second part of this thesis addresses representation learning for nucleus instance classification. Unlike annotated data, which is limited in scale, unlabeled data is usually available at large scale. We therefore design a self-supervised method for representation learning on unlabeled datasets to alleviate the burden of data annotation. Moreover, previous methods often downplay the contextual information that is critical for classification. To provide this information explicitly, we design a new structured input consisting of a content-rich image patch and a target instance mask. Building on this structured input format, we propose Structured Triplet, a triplet learning framework for unlabeled nucleus instances with customized sampling strategies, and add two auxiliary branches to further improve its performance. Results show that our model reduces the burden of extensive labeling by fully exploiting large-scale unlabeled data. The third part of this thesis considers scalable mesh generation for volumetric data with multiple materials. With improved imaging quality and increased resolution, volumetric datasets have become so large that existing tools are inadequate for processing and analyzing them. We consider the problem of computing tetrahedral meshes to represent these large volumetric datasets. We propose a novel approach, called Marching Windows, that uses a moving window and a disk-swap strategy to reduce the run-time memory footprint. We also devise a new scheme that is guaranteed to preserve the topological structure of the original dataset, and adopt an error-guided optimization technique to reduce the geometric approximation error and improve mesh quality. Extensive experiments show that our method can process very large volumetric datasets beyond the capability of existing methods and produce tetrahedral meshes of high quality. | - |
dc.language | eng | - |
dc.publisher | The University of Hong Kong (Pokfulam, Hong Kong) | - |
dc.relation.ispartof | HKU Theses Online (HKUTO) | - |
dc.rights | The author retains all proprietary rights (such as patent rights) and the right to use in future works. | - |
dc.rights | This work is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License. | - |
dc.subject.lcsh | Medical informatics - Data processing | - |
dc.subject.lcsh | Biomedical engineering - Data processing | - |
dc.title | Integration and processing of large-scale biomedical data | - |
dc.type | PG_Thesis | - |
dc.description.thesisname | Doctor of Philosophy | - |
dc.description.thesislevel | Doctoral | - |
dc.description.thesisdiscipline | Computer Science | - |
dc.description.nature | published_or_final_version | - |
dc.date.hkucongregation | 2023 | - |
dc.identifier.mmsid | 991044705906303414 | - |
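The abstract's second part combines a structured input (a context patch stacked with a target instance mask) with triplet learning. As a rough illustration only, the sketch below pairs the standard triplet margin loss with that channel-stacking idea; the function names, shapes, and toy values are assumptions, and the thesis's actual sampling strategies and auxiliary branches are not modeled here.

```python
import numpy as np

def triplet_margin_loss(anchor, positive, negative, margin=1.0):
    """Standard triplet margin loss on embedding vectors: pull the
    positive toward the anchor, push the negative at least `margin`
    farther away than the positive."""
    d_pos = np.linalg.norm(anchor - positive)
    d_neg = np.linalg.norm(anchor - negative)
    return max(d_pos - d_neg + margin, 0.0)

def structured_input(patch, instance_mask):
    """Stack a content-rich image patch (C, H, W) with a binary mask
    (H, W) of the target nucleus instance, so an embedding network sees
    both the surrounding context and which instance to describe."""
    return np.concatenate([patch, instance_mask[None, ...]], axis=0)

# Toy embeddings: anchor and positive close, negative far away,
# so this triplet is already satisfied and incurs zero loss.
a = np.array([0.0, 1.0])
p = np.array([0.1, 1.0])
n = np.array([3.0, -2.0])
loss = triplet_margin_loss(a, p, n)

# A 3-channel 8x8 patch plus its 1-channel instance mask becomes a
# 4-channel input tensor for the embedding network.
x = structured_input(np.zeros((3, 8, 8)), np.zeros((8, 8)))
```

The mask channel is what makes the input "structured": without it, the network could not tell which of several nuclei in the context patch is the classification target.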