
Conference Paper: Advancing Fundus-Based Retinal Representations Through Multi-Modal Contrastive Pre-training for Detection of Glaucoma-Related Diseases

Title: Advancing Fundus-Based Retinal Representations Through Multi-Modal Contrastive Pre-training for Detection of Glaucoma-Related Diseases
Authors: Guo, Yawen; Ng, Michelle; Yan, Xu; Hung, Calvin; Lam, Alexander; Leung, Christopher Kai-Shun
Issue Date: 5-Jun-2024
Abstract

Purpose: To develop and evaluate a multi-modal contrastive pretraining strategy for glaucoma detection using fundus photographs and different image-based modalities.

Methods: Two ResNet50 networks were used to extract representations from fundus photographs and other modalities, including red-free photographs, RNFL thickness maps, OCT 3D volumes, and OCT en face images (Fig. 1A). The networks were pretrained on 1703 pairs of images from the same eye using the InfoNCE loss and then fine-tuned on downstream classification tasks using labeled fundus images (Fig. 1B). The performance of the strategy was assessed on two datasets: an internal dataset of 528 eyes with RNFL defects and 975 eyes without RNFL defects, and an external dataset of 3000 eyes with glaucoma and 3000 eyes without glaucoma (80% for training and 20% for testing).
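The abstract does not include code; the following is a minimal, hypothetical PyTorch sketch of the dual-encoder InfoNCE pretraining described above. All names (DualEncoder, proj_dim, the temperature value) are assumptions for illustration, not the authors' implementation.

```python
# Hypothetical sketch of dual-encoder contrastive pretraining (not the authors' code).
import torch
import torch.nn as nn
import torch.nn.functional as F
from torchvision.models import resnet50

class DualEncoder(nn.Module):
    """Two ResNet50 backbones: one for fundus photographs, one for the paired
    modality (e.g. an RNFL thickness map), each ending in a small projection head."""
    def __init__(self, proj_dim: int = 128):
        super().__init__()
        self.fundus_encoder = resnet50(weights=None)
        self.modality_encoder = resnet50(weights=None)
        feat_dim = self.fundus_encoder.fc.in_features  # 2048 for ResNet50
        self.fundus_encoder.fc = nn.Linear(feat_dim, proj_dim)
        self.modality_encoder.fc = nn.Linear(feat_dim, proj_dim)

    def forward(self, fundus, modality):
        z_f = F.normalize(self.fundus_encoder(fundus), dim=1)
        z_m = F.normalize(self.modality_encoder(modality), dim=1)
        return z_f, z_m

def info_nce_loss(z_f, z_m, temperature: float = 0.07):
    """Symmetric InfoNCE: the two images from the same eye are positives,
    all other cross-modal pairs in the batch act as negatives."""
    logits = z_f @ z_m.t() / temperature                     # (B, B) similarity matrix
    targets = torch.arange(z_f.size(0), device=z_f.device)   # matching pairs on the diagonal
    return 0.5 * (F.cross_entropy(logits, targets) +
                  F.cross_entropy(logits.t(), targets))
```

After pretraining, only the fundus branch would be kept and fine-tuned on labeled fundus photographs for the downstream classification task, as described above.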

Results: The performance of different pretraining strategies for the fundus image encoder on glaucoma detection tasks was compared. The fundus encoder pretrained with multi-modal contrastive learning of fundus images and RNFL thickness maps achieved the highest AUCs, 0.942 on the internal dataset and 0.881 on the external dataset, surpassing supervised pretraining on ImageNet (0.852 and 0.828 AUC). Self-supervised learning with only fundus images yielded 0.766 and 0.699 AUC. Other multi-modal combinations, such as fundus images with OCT 3D volumes and fundus images with OCT en face images, obtained AUCs of 0.870/0.834 and 0.818/0.816, respectively (Table 1). The labeling efficiency of the pretraining strategies was determined by using different proportions of the training data in our datasets (Fig. 2). The fundus encoder pretrained with fundus images and OCT RNFL thickness maps achieved an AUC of 0.922 with only 10% of the training data, showing significantly better results than the other methods.
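As a rough illustration of the label-efficiency protocol (fine-tuning on increasing fractions of the labeled training data and measuring AUC on the held-out test split), a hypothetical sketch follows. The fine_tune helper, loader settings, and single-logit output head are assumptions, not the authors' code.

```python
# Illustrative label-efficiency sweep (assumed names; not the authors' code).
import torch
from torch.utils.data import Subset, DataLoader
from sklearn.metrics import roc_auc_score

def evaluate_label_efficiency(pretrained_encoder, train_set, test_loader,
                              fractions=(0.1, 0.25, 0.5, 1.0)):
    results = {}
    for frac in fractions:
        n = int(len(train_set) * frac)
        subset = Subset(train_set, range(n))           # simple prefix subset for illustration
        loader = DataLoader(subset, batch_size=32, shuffle=True)
        model = fine_tune(pretrained_encoder, loader)  # assumed helper: supervised fine-tuning
        model.eval()
        scores, labels = [], []
        with torch.no_grad():
            for x, y in test_loader:
                # assumes a single-logit classification head
                scores.extend(torch.sigmoid(model(x)).squeeze(1).tolist())
                labels.extend(y.tolist())
        results[frac] = roc_auc_score(labels, scores)  # AUC at this label fraction
    return results
```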

Conclusions: The multi-modal contrastive learning approach, which leverages the correlations between fundus images and RNFL thickness maps to pretrain the fundus encoder, demonstrated the best representation learning and labeling efficiency for glaucoma detection, reflecting the benefits of using related multi-modal data to learn more informative retinal representations.


Persistent Identifier: http://hdl.handle.net/10722/347286

 

DC Field                  Value
dc.contributor.author     Guo, Yawen
dc.contributor.author     Ng, Michelle
dc.contributor.author     Yan, Xu
dc.contributor.author     Hung, Calvin
dc.contributor.author     Lam, Alexander
dc.contributor.author     Leung, Christopher Kai-Shun
dc.date.accessioned       2024-09-20T00:31:13Z
dc.date.available         2024-09-20T00:31:13Z
dc.date.issued            2024-06-05
dc.identifier.uri         http://hdl.handle.net/10722/347286
dc.language               eng
dc.relation.ispartof      ARVO 2024 Annual Meeting (05/05/2024-09/05/2024, Seattle, Washington)
dc.title                  Advancing Fundus-Based Retinal Representations Through Multi-Modal Contrastive Pre-training for Detection of Glaucoma-Related Diseases
dc.type                   Conference_Paper
