TP-DRSeg: Improving Diabetic Retinopathy Lesion Segmentation with Explicit Text-Prompts Assisted SAM (2024)

1 Monash-Airdoc Research, Monash University
2 Monash Medical AI Group, Monash University
3 School of Computer Science and Engineering, Sun Yat-sen University
Email: wxli408@gmail.com, {Lie.Ju1, zongyuan.ge}@monash.edu

Wenxue Li 1,2 · Xinyu Xiong 3 · Peng Xia 1,2 · Lie Ju (✉) 1,2 · Zongyuan Ge (✉) 1,2

Abstract

Recent advances in large foundation models, such as the Segment Anything Model (SAM), have demonstrated considerable promise across various tasks. Despite their progress, these models still encounter challenges in specialized medical image analysis, especially in recognizing subtle inter-class differences in Diabetic Retinopathy (DR) lesion segmentation. In this paper, we propose a novel framework that customizes SAM for text-prompted DR lesion segmentation, termed TP-DRSeg. Our core idea involves exploiting language cues to inject medical prior knowledge into the vision-only segmentation network, thereby combining the advantages of different foundation models and enhancing the credibility of segmentation. Specifically, to unleash the potential of vision-language models in the recognition of medical concepts, we propose an explicit prior encoder that transfers implicit medical concepts into explicit prior knowledge, providing explainable clues to excavate low-level features associated with lesions. Furthermore, we design a prior-aligned injector to inject explicit priors into the segmentation process, which can facilitate knowledge sharing across multi-modality features and allow our framework to be trained in a parameter-efficient fashion. Experimental results demonstrate the superiority of our framework over other traditional models and foundation model variants. The code implementations are accessible at https://github.com/wxliii/TP-DRSeg.

Keywords:

Diabetic Retinopathy Segmentation · Prompted Segmentation · Segment Anything · Parameter-Efficient Fine-Tuning

1 Introduction

[Fig. 1: (a) Prompting strategies for SAM: (i) manual prompts, (ii) one-shot in-context prompts, (iii) parameter-efficient fine-tuning, and (iv) our text-prompted scheme; (b) class activation maps generated from implicit class names by the VLM.]

Diabetic retinopathy (DR) is a leading cause of visual impairment and has become one of the world's most serious health challenges. Automatic DR lesion segmentation is an important task in color fundus image analysis. Effective monitoring of certain lesions, including microaneurysms (MAs), hemorrhages (HEs), hard exudates (EXs), and soft exudates (SEs), provides vital assistance for ophthalmologists and significantly boosts early diagnostic accuracy and efficiency.

Existing DR segmentation networks [4, 22, 6, 5, 24, 9] have achieved promising results by designing elaborate attention mechanisms to comprehend the visual features provided by the encoder. However, they still face two main challenges. On the one hand, with small vision backbones (such as ResNet-34) and limited training data, these methods typically require a lengthy training process [5] to learn valuable representations, which is time-consuming and prone to over-fitting. On the other hand, the subtle inter-class differences pose challenges for accurate lesion classification. Existing attempts rely only on vision supervision and lack the guidance of specialized domain knowledge [26].

Recent advances in foundation vision models, such as the Segment Anything Model (SAM) [11], have attracted considerable interest, showcasing impressive capacity in various scenarios [13, 21, 30, 10]. However, SAM still faces limitations when applied to downstream medical tasks. As illustrated in Fig. 1(a(i)), vanilla SAM [11] heavily relies on manual prompts, such as points and boxes; due to the small and numerous nature of DR lesions, manual prompting becomes labor-intensive, rendering such an approach impractical for clinical applications. Some methods introduce in-context prompts to adapt SAM from a global perspective, such as the one-shot prompt (Fig. 1(a(ii))) [32], but they struggle to handle local lesions, resulting in suboptimal performance. Parameter-efficient fine-tuning methods [15, 29] (Fig. 1(a(iii))) adapt SAM for downstream tasks by tuning a limited number of parameters. However, these methods overlook prompt-based strategies for automatic inference. A more flexible prompted segmentation approach would be preferable in practice, allowing physicians to refine results with greater precision through targeted prompts when necessary. Moreover, these SAM-related methods struggle to distinguish fine-grained DR lesion categories and typically generate only class-agnostic masks.

Vision-Language Models (VLMs), with their capability to align images with corresponding textual descriptions, have shown remarkable effectiveness in many downstream applications [17, 2, 25, 23, 19, 28, 27]. This raises a possibility: could VLMs assist visual models in locating lesions through textual cues, further improving the accuracy of distinguishing different lesions and the credibility of segmentation? However, the potential of VLMs remains largely untapped in medical imaging, primarily due to the significant gap between the natural and medical domains. As shown in Fig. 1(b), the class activation maps generated from implicit class names reveal that the VLM (e.g., CLIP [16]) fails to provide useful priors in this context.

In this study, we concentrate on designing a flexible scheme for segmenting DR lesions that allows the direct generation of masks corresponding to specific text-based categories, as illustrated in Fig. 1(a(iv)). Meanwhile, we aim to enhance the model's credibility and its ability to discern inter-class differences in DR lesions by integrating text-based cues via a VLM. Thus, we propose an explicit prior encoder that utilizes explicit descriptions of lesions, rather than implicit class names, to generate explainable cues for segmentation and for distinguishing inter-class differences. Specifically, the morphological appearance of DR lesions can be represented by specific descriptions that VLMs easily understand, such as depicting hard exudates as yellowish-white deposits. These explainable cues improve the trustworthiness of the segmentation process. Further, we introduce a prior-aligned injector into the SAM encoder to inject the text-based external priors into the segmentation process, facilitating knowledge sharing and alignment between the VLM and the vision-only model. Lastly, the class-specific prompt generator produces prompts tailored to the text-prompt input, which are subsequently fed into the SAM decoder to produce the corresponding segmentation mask.

Our main contributions are as follows. First, we propose a novel framework that exploits explainable cues generated from textual prompts, thereby enhancing the reliability of DR segmentation. Second, we introduce an explicit prior encoder that transfers implicit medical concepts into explicit priors, providing explainable global guidance for segmentation and enhancing lesion discrimination. Third, we design a prior-aligned injector that integrates explainable explicit priors into the segmentation process and facilitates knowledge sharing across multiple modalities.

2 Methodology

Problem Definition. Given an input image $\mathbf{I}\in\mathbb{R}^{C\times H\times W}$ and a text class-prompt $t$ of the $i$-th class $c_i$, our goal is to generate the mask $\mathbf{M}_i$ of $c_i$.

Framework Overview. Fig. 2 illustrates the overview of our method, which consists of four key components: a VLM-based explicit prior encoder, a SAM encoder with prior-aligned injectors, a class-specific prompt generator, and a SAM decoder. The explicit prior encoder first encodes the text class-prompt $t$ and produces the explicit prior. Next, the SAM encoder extracts multi-level features of the input image, and the prior-aligned injectors facilitate knowledge sharing between the text-guided explicit prior and the multi-level visual features. Then, the class-specific prompt generator generates prompts based on the explicit prior, which are subsequently fed into the SAM decoder to produce the corresponding mask.
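To make the data flow concrete, the following is a minimal PyTorch sketch of how the four components could be wired together. The module names and constructor signatures are hypothetical illustrations, not the authors' released implementation.

```python
import torch.nn as nn


class TPDRSeg(nn.Module):
    """Minimal sketch of the TP-DRSeg forward pipeline (hypothetical module names)."""

    def __init__(self, prior_encoder, sam_encoder, prompt_generator, sam_decoder):
        super().__init__()
        self.prior_encoder = prior_encoder        # frozen CLIP-based explicit prior encoder
        self.sam_encoder = sam_encoder            # SAM image encoder with prior-aligned injectors
        self.prompt_generator = prompt_generator  # class-specific prompt generator
        self.sam_decoder = sam_decoder            # original SAM mask decoder

    def forward(self, image, text_prompt, class_idx):
        # 1) Text class-prompt -> spatial explicit prior map P_e
        prior = self.prior_encoder(image, text_prompt)
        # 2) SAM encoder extracts features; injectors fuse them with the explicit prior
        image_embed = self.sam_encoder(image, prior)
        # 3) Prior-guided dense/sparse embeddings for the requested class
        dense_embed, sparse_embed = self.prompt_generator(image_embed, prior, class_idx)
        # 4) SAM decoder produces the class-specific mask
        return self.sam_decoder(image_embed, dense_embed, sparse_embed)
```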

[Fig. 2: Overview of the proposed TP-DRSeg framework: the explicit prior encoder, the SAM encoder with prior-aligned injectors, the class-specific prompt generator, and the SAM decoder.]

2.1 Explicit Prior Encoder

Different from existing vision-only DR segmentation approaches, we resort to the language modality to provide external knowledge for segmentation. We delve into using the explicit description instead of the implicit class name to guide segmentation. This strategy entails harnessing external knowledge and preprocessing it through the robust image-text knowledge ingrained in the VLM (e.g., CLIP [16]), ultimately generating what we term an explicit prior. The incorporation of explicit prior information provides explainable cues that enhance the trustworthiness of the segmentation process.

Specifically, the frozen image encoder $E_i(\cdot)$ and text encoder $E_t(\cdot)$ of pre-trained CLIP are used to encode the visual input and the explicit lesion knowledge, yielding the visual prior $\mathbf{P}_v = E_i(\mathbf{I}) \in \mathbb{R}^{H_s \times W_s \times C_t}$ and the text prior $\mathbf{P}_t = E_t(t) \in \mathbb{R}^{1 \times C_t}$, where $C_t$ is the dimension of the features. We then reshape $\mathbf{P}_v$ into $\mathbf{P}_v \in \mathbb{R}^{(H_s W_s) \times C_t}$ and align the text prior and visual prior as:

๐’=๐vโ€–๐vโ€–2โ‹…(๐tโ€–๐tโ€–2)T,๐’โ‹…subscript๐๐‘ฃsubscriptnormsubscript๐๐‘ฃ2superscriptsubscript๐๐‘กsubscriptnormsubscript๐๐‘ก2๐‘‡\small\mathbf{S}=\frac{\mathbf{P}_{v}}{||\mathbf{P}_{v}||_{2}}\cdot(\frac{%\mathbf{P}_{t}}{||\mathbf{P}_{t}||_{2}})^{T},bold_S = divide start_ARG bold_P start_POSTSUBSCRIPT italic_v end_POSTSUBSCRIPT end_ARG start_ARG | | bold_P start_POSTSUBSCRIPT italic_v end_POSTSUBSCRIPT | | start_POSTSUBSCRIPT 2 end_POSTSUBSCRIPT end_ARG โ‹… ( divide start_ARG bold_P start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT end_ARG start_ARG | | bold_P start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT | | start_POSTSUBSCRIPT 2 end_POSTSUBSCRIPT end_ARG ) start_POSTSUPERSCRIPT italic_T end_POSTSUPERSCRIPT ,(1)

where $\|\cdot\|_2$ is the L2 normalization and $\mathbf{S} \in \mathbb{R}^{(H_s W_s) \times 1}$. Next, to generate the explicit prior map, we reshape $\mathbf{S}$ to $\mathbf{S}' \in \mathbb{R}^{H_s \times W_s}$ and map $\mathbf{S}'$ into the explicit prior $\mathbf{P}_e = Norm(\mathbf{S}')$, where $Norm(\cdot)$ denotes min-max normalization. Using explicit lesion descriptions instead of implicit class names unleashes the potential of the VLM, offering global guidance for the subsequent segmentation steps and enhancing the differentiation of inter-class differences.
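A minimal sketch of Eq. (1) and the normalization step is given below, assuming frozen CLIP encoders that return patch-level visual features of shape (Hs·Ws, Ct) and a single text feature of shape (1, Ct); the exact CLIP interface and feature shapes are assumptions.

```python
import torch
import torch.nn.functional as F


@torch.no_grad()
def explicit_prior(clip_image_encoder, clip_text_encoder, image, text_tokens):
    """Cosine similarity between CLIP patch features and the explicit lesion
    description, min-max normalized into the explicit prior map P_e (Eq. 1)."""
    P_v = clip_image_encoder(image)         # (Hs*Ws, Ct) visual prior
    P_t = clip_text_encoder(text_tokens)    # (1, Ct) text prior
    P_v = F.normalize(P_v, dim=-1)          # L2 normalization
    P_t = F.normalize(P_t, dim=-1)
    S = P_v @ P_t.t()                       # (Hs*Ws, 1) similarity map
    Hs = Ws = int(S.shape[0] ** 0.5)        # assume a square patch grid
    S = S.view(Hs, Ws)                      # reshape to spatial map S'
    P_e = (S - S.min()) / (S.max() - S.min() + 1e-6)  # min-max normalization
    return P_e
```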

[Fig. 3: (a) The prior-aligned injector; (b) the class-specific prompt generator.]

2.2 Prior-Aligned Injector

Since the knowledge embedded in the pre-trained SAM and CLIP models does not "see" each other and remains isolated before integration, it is crucial to construct a bridge for interaction that aligns their representations within a unified feature space. Moreover, a mechanism is needed to inject external knowledge into the segmentation process. To address this, we propose a prior-aligned injector (shown in Fig. 3(a)) in each encoder layer, aiming to facilitate knowledge sharing between the segmentation and vision-language models. Formally, for the intermediate $i$-th block in the SAM encoder, the encoded feature $\mathbf{F}_i \in \mathbb{R}^{H_s \times W_s \times C_i}$ is fed into the cross-modality interaction module to interact with the explicit prior $\mathbf{P}_e$. The output of the injector is then fed into the next encoder block.
Cross-Modality Interaction. We first aggregate the encoded feature $\mathbf{F}_i$ with the explicit prior $\mathbf{P}_e$ to obtain the explicit activated feature $\mathbf{F}_i^{act} = \mathbf{F}_i \times \mathbf{P}_e$. This operation allows the accurate location prior to be utilized in the interaction process. Then, we take $\mathbf{F}_i$ as the query and the explicit activated feature $\mathbf{F}_i^{act}$ as both key and value. The query, key, and value are fed into a zoomed projection operation to adjust their resolutions, implemented by a $1 \times 1$ convolutional layer with stride 4, which can be written as:

$$\mathbf{F}_i^{Q} = \phi_q(\mathbf{F}_i^{act}), \quad \mathbf{F}_i^{K} = \phi_k(\mathbf{F}_i^{act}), \quad \mathbf{F}_i^{V} = \phi_v(\mathbf{F}_i^{act}), \qquad (2)$$

where $\phi_q(\cdot)$, $\phi_k(\cdot)$, and $\phi_v(\cdot)$ denote the zoomed projection operations. Then, we employ these projected features to conduct cross-modality interaction:

$$\mathbf{F}_i' = \gamma\mathbf{F}_i + \mathrm{softmax}\!\left(\frac{\mathbf{F}_i^{Q} \cdot {\mathbf{F}_i^{K}}^{T}}{\sqrt{d}}\right)\mathbf{F}_i^{V}, \qquad (3)$$

where $d$ is the dimension of the key vectors and $\gamma$ is a learnable parameter that adjusts the blending ratio between the attention output and the original input. In this module, we introduce residual connections to enhance stability. Finally, we adjust the resolution of $\mathbf{F}_i'$ back to the input size through upsampling, i.e., $\mathbf{F}_i'' = Upsample(\mathbf{F}_i')$. In this way, the injector can model image context under the guidance of the textual global prior without fully fine-tuning the encoder.
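The sketch below illustrates Eqs. (2)-(3) as a PyTorch module, assuming channel-last SAM features rearranged to (B, C, Hs, Ws) and a single attention head; the projection sizes, the initialization of $\gamma$, and the ordering of the residual and upsampling steps are assumptions for illustration rather than the authors' exact implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class PriorAlignedInjector(nn.Module):
    """Sketch of the prior-aligned injector (Eqs. 2-3)."""

    def __init__(self, dim):
        super().__init__()
        # zoomed projections: 1x1 convolutions with stride 4 shrink the spatial size
        self.phi_q = nn.Conv2d(dim, dim, kernel_size=1, stride=4)
        self.phi_k = nn.Conv2d(dim, dim, kernel_size=1, stride=4)
        self.phi_v = nn.Conv2d(dim, dim, kernel_size=1, stride=4)
        self.gamma = nn.Parameter(torch.ones(1))  # learnable blending ratio

    def forward(self, feat, prior):
        # feat: (B, C, Hs, Ws) intermediate SAM feature; prior: (B, 1, Hs, Ws) explicit prior P_e
        B, C, H, W = feat.shape
        act = feat * prior                                  # explicit activated feature F_act
        q = self.phi_q(act)                                 # (B, C, h, w), spatially reduced
        h, w = q.shape[-2:]
        q = q.flatten(2).transpose(1, 2)                    # (B, N, C), N = h*w
        k = self.phi_k(act).flatten(2).transpose(1, 2)
        v = self.phi_v(act).flatten(2).transpose(1, 2)
        attn = torch.softmax(q @ k.transpose(1, 2) / C ** 0.5, dim=-1)
        out = (attn @ v).transpose(1, 2).reshape(B, C, h, w)
        # upsample the attention output back to the input resolution, then blend (Eq. 3)
        out = F.interpolate(out, size=(H, W), mode="bilinear", align_corners=False)
        return self.gamma * feat + out
```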

2.3 Class-specific Prompt Generator

The explicit prior based on the text prompt is also used in this module to guide prompt generation for lesion segmentation, as shown in Fig. 3(b). The feature embeddings generated by the image encoder are denoted as $\mathbf{F}_e \in \mathbb{R}^{H_e W_e \times C_e}$, where $H_e$, $W_e$, and $C_e$ denote the height, width, and channel of the feature, respectively. We reshape it as $\mathbf{F}_e \in \mathbb{R}^{H_e \times W_e \times C_e}$. Then, it interacts with the prior feature $\mathbf{P}_e$ to obtain the prior-guided feature $\mathbf{F}_p = \mathbf{F}_e \times \mathbf{P}_e$. We replicate $\mathbf{F}_p$ $c$ times to get $\mathbf{F}_p' \in \mathbb{R}^{c \times H_e \times W_e \times C_e}$ and assign a specific channel to each category, where $c$ is the total number of classes. By doing so, each channel contains category-specific information. For the given class $c_i$, we keep only the channels relevant to $c_i$ (denoted $\mathbf{F}_p^{c_i}$) and project them to obtain the dense embeddings $\mathbf{E}_d$ and sparse embeddings $\mathbf{E}_s$ as

$$\mathbf{E}_d = \phi_{dense}(\mathbf{F}_p^{c_i}), \quad \mathbf{E}_s = \phi_{sparse}(reshape(\mathbf{F}_p^{c_i})), \qquad (4)$$

where $\phi_{dense}(\cdot)$ is a convolution operation, $\phi_{sparse}(\cdot)$ is a linear projection, and $reshape(\cdot)$ reshapes the feature as $\mathbf{F}_p^{c_i} \in \mathbb{R}^{C_e \times H_e W_e}$. Subsequently, $\mathbf{E}_d$ and $\mathbf{E}_s$ are fed into the SAM decoder, serving as the dense and sparse embeddings of the original SAM decoder. Here, we leverage the original SAM decoder to process the dense and sparse embeddings for lesion segmentation: dense embeddings offer global guidance, while sparse embeddings retain more detailed information about the lesions, further boosting lesion segmentation. Finally, the SAM decoder outputs the prediction map $\mathbf{P}$.
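A possible reading of Eq. (4) is sketched below. The number of sparse tokens, the prompt dimension, and the way the class-specific copy is selected are assumptions; only the prior-guided activation, the per-class replication, and the dense/sparse projections follow the description above.

```python
import torch.nn as nn


class ClassSpecificPromptGenerator(nn.Module):
    """Sketch of the class-specific prompt generator (Eq. 4)."""

    def __init__(self, feat_dim, spatial_size, num_classes, num_sparse_tokens=8, prompt_dim=256):
        super().__init__()
        self.num_classes = num_classes
        self.dense_proj = nn.Conv2d(feat_dim, prompt_dim, kernel_size=1)  # phi_dense
        self.sparse_proj = nn.Linear(spatial_size, num_sparse_tokens)     # phi_sparse over He*We

    def forward(self, feat, prior, class_idx):
        # feat: (B, C_e, H_e, W_e) SAM image embedding; prior: (B, 1, H_e, W_e) explicit prior P_e
        guided = feat * prior                                             # prior-guided feature F_p
        # replicate per class and keep only the copy assigned to the requested class c_i
        per_class = guided.unsqueeze(1).expand(-1, self.num_classes, -1, -1, -1)
        f_ci = per_class[:, class_idx]                                    # (B, C_e, H_e, W_e)
        dense = self.dense_proj(f_ci)                                     # dense embeddings E_d
        # reshape to (B, C_e, He*We), project spatial dim, then transpose to token form
        sparse = self.sparse_proj(f_ci.flatten(2)).transpose(1, 2)        # sparse embeddings E_s
        return dense, sparse
```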

Training Objective. To train our segmentation model, the overall training objective combines the binary cross-entropy loss and the IoU loss, defined as $\mathcal{L} = \frac{1}{N}\sum_{j=1}^{N}\left(\mathcal{L}_{IoU}(\mathbf{P}_j, \mathbf{G}_j) + \mathcal{L}_{BCE}(\mathbf{P}_j, \mathbf{G}_j)\right)$, where $\mathbf{P}_j$ is the prediction map and $\mathbf{G}_j$ is the ground-truth map.
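For reference, a common soft formulation of this combined loss is sketched below; the exact IoU variant used by the authors is not specified, so the smooth term and the soft-IoU definition are assumptions.

```python
import torch
import torch.nn.functional as F


def seg_loss(pred_logits, target, eps=1e-6):
    """Combined BCE + soft IoU loss over a batch of prediction maps."""
    bce = F.binary_cross_entropy_with_logits(pred_logits, target)
    prob = torch.sigmoid(pred_logits)
    inter = (prob * target).sum(dim=(-2, -1))
    union = (prob + target - prob * target).sum(dim=(-2, -1))
    iou_loss = 1.0 - (inter + eps) / (union + eps)
    return bce + iou_loss.mean()
```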

3 Experiments

3.1 Experimental Setup

Dataset. To evaluate the effectiveness of our method, we adopt the IDRiD [14] and DDR [12] datasets, both of which contain annotations for four categories of DR lesions (MA, HE, EX, and SE) in color fundus images.

Implementation Details. Our method is implemented in PyTorch, and all experiments are conducted on 2× NVIDIA RTX 4090 GPUs. We train on the IDRiD dataset for 500 epochs and on the DDR dataset for 200 epochs. Input images are resized to 1024 × 1024 during both training and inference. The AdamW optimizer is adopted with a learning rate of 0.0001. For SAM and CLIP, we use their ViT-B [3] and ViT-B/16 variants, respectively. The explicit descriptions are crafted from ophthalmology literature and have been validated by both relevant experts and GPT-4.
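A minimal optimizer setup consistent with these settings is sketched below; which parameter groups are trainable is an assumption based on the parameter-efficient design (frozen CLIP and SAM backbones, trainable injectors and prompt generator).

```python
import torch


def build_optimizer(model, lr=1e-4):
    # Only parameters left with requires_grad=True (e.g., injectors, prompt generator)
    # are updated; the frozen backbone weights are excluded.
    trainable = [p for p in model.parameters() if p.requires_grad]
    return torch.optim.AdamW(trainable, lr=lr)
```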

3.2 Comparison Study

Firstly, we compare our method with state-of-the-art specialized segmentation models, including U-Net [18] and the Fully Convolutional Transformer (FCT) [20]. Secondly, we compare our method with SAM [11] and its variants, including MedSAM [13], SAMed [31], PerSAM/PerSAM-F [32], and SAM-Adapter [1].

Table 1. Comparison with specialized segmentation models and SAM variants on the IDRiD [14] and DDR [12] datasets.

| Method | IDRiD mDice | IDRiD AUC-ROC | IDRiD AUC-PR | DDR mDice | DDR AUC-ROC | DDR AUC-PR |
|---|---|---|---|---|---|---|
| U-Net [18] | 36.66 | 89.34 | 34.68 | 24.59 | 91.36 | 26.78 |
| FCT [20] | 42.96 | 93.49 | 43.09 | 23.33 | 90.53 | 25.01 |
| Vanilla SAM [11] | 1.35 | 45.35 | 0.61 | 0.51 | 34.56 | 0.28 |
| PerSAM [32] | 1.65 | 61.20 | 0.94 | 0.76 | 69.04 | 0.59 |
| PerSAM-F [32] | 1.64 | 57.98 | 0.73 | 0.72 | 56.41 | 0.57 |
| MedSAM [13] | 25.06 | 87.27 | 26.57 | 17.97 | 89.83 | 19.30 |
| SAMed [31] | 35.18 | 88.96 | 35.37 | 28.32 | 87.70 | 30.03 |
| SAM-Adapter [1] | 34.42 | 88.23 | 34.59 | 29.11 | 88.82 | 29.90 |
| Ours | 49.72 | 95.94 | 50.55 | 38.78 | 94.12 | 39.15 |

Table 2. Comparison of the prior-aligned injector with other parameter-efficient fine-tuning methods on the IDRiD [14] and DDR [12] datasets.

| Method | IDRiD mDice | IDRiD AUC-ROC | IDRiD AUC-PR | DDR mDice | DDR AUC-ROC | DDR AUC-PR |
|---|---|---|---|---|---|---|
| Adapter [7] | 47.46 | 95.55 | 49.67 | 34.20 | 95.71 | 35.99 |
| Multi-scale Adapter [29] | 45.95 | 95.16 | 47.72 | 35.42 | 94.58 | 36.81 |
| LoRA [8] | 44.22 | 95.07 | 47.28 | 36.77 | 94.51 | 37.58 |
| Prior-Aligned Injector (Ours) | 49.72 | 95.94 | 50.55 | 38.78 | 94.12 | 39.15 |

The experimental results in Table 1 show that our TP-DRSeg outperforms other competitive methods on both the IDRiD and DDR datasets. Zero-shot (vanilla) SAM cannot achieve satisfactory performance due to the gap between the medical and natural domains. Meanwhile, in-context prompted SAM [32] is unable to identify fine-grained features and generate accurate prompts, consequently yielding suboptimal segmentation results. The SAM variants [13, 31, 1] improve performance through parameter-efficient fine-tuning but rely only on visual cues, limiting their ability to exploit clues from other modalities. Our approach, integrating visual and language cues, demonstrates better segmentation performance.

Qualitative Analysis. The qualitative results are shown in Fig. 4, where our method exhibits clear advantages in lesion identification. FCT can locate lesion regions but struggles to distinguish HE from SE, showing the value of textual information in capturing inter-class differences. Additionally, we visualize the features with and without the integration of explicit priors. It is evident that text-based explicit priors serve as explainable clues for segmentation and guide the network in recognizing lesions.

[Fig. 4: Qualitative comparison of segmentation results and visualization of features with and without explicit priors.]

Table 3. Ablation study on the prior-aligned injector (Injector) and the explicit prior (EP) in the Injector and the class-specific prompt generator (CPG).

| Method | IDRiD mDice | IDRiD AUC-ROC | IDRiD AUC-PR | DDR mDice | DDR AUC-ROC | DDR AUC-PR |
|---|---|---|---|---|---|---|
| Ours w/o Injector | 35.09 | 90.35 | 33.64 | 24.07 | 91.99 | 22.74 |
| Ours w/o EP in Injector | 44.89 | 94.93 | 46.61 | 37.27 | 93.99 | 38.47 |
| Ours w/o EP in CPG | 45.03 | 94.65 | 47.21 | 35.25 | 92.82 | 36.65 |
| Ours | 49.72 | 95.94 | 50.55 | 38.78 | 94.12 | 39.15 |

Comparison with Other Parameter-Efficient Fine-Tuning Methods. We further compare our prior-aligned injector with other parameter-efficient fine-tuning methods, including Adapter [7], LoRA [8], and Multi-Scale Adapter [29], as shown in Table 2. The compared methods target a single modality, i.e., the vision modality, without considering text-guided information. Our prior-aligned injector leverages explicit textual guidance to help locate DR lesions and demonstrates better segmentation capability.

3.3 Ablation Study

The results of the ablation study are shown in Table 3. We observe performance degradation when the prior-aligned injector (Injector) is removed during training. We then test the model with and without the integration of the explicit prior (EP) in the Injector and in the class-specific prompt generator (CPG). The results demonstrate that the text-based explicit prior is effective in boosting lesion segmentation.

4 Conclusion

In this paper, we focus on how language cues benefit DR lesion segmentation and propose a novel text-prompted framework, termed TP-DRSeg. Specifically, we design an explicit prior encoder to provide explainable clues from text-based prompts. We also introduce a prior-aligned injector that efficiently injects explicit prior knowledge into the segmentation process and enables our framework to be trained in a parameter-efficient fashion. The experimental results demonstrate the superiority and effectiveness of the proposed TP-DRSeg.


Acknowledgements

The work was partially supported by Airdoc medical AI projects donation Phase 2, Monash-Airdoc Research Centre, and in part by the MRFF NCRI GA89126.

Disclosure of Interests

The authors declare that they have no competing interests.

References

  • [1] Chen, T., Zhu, L., Ding, C., Cao, R., Zhang, S., Wang, Y., Li, Z., Sun, L., Mao, P., Zang, Y.: SAM fails to segment anything? – SAM-Adapter: Adapting SAM in underperformed scenes: camouflage, shadow, and more. arXiv preprint arXiv:2304.09148 (2023)
  • [2] Crowson, K., Biderman, S., Kornis, D., Stander, D., Hallahan, E., Castricato, L., Raff, E.: VQGAN-CLIP: Open domain image generation and editing with natural language guidance. In: European Conference on Computer Vision. pp. 88–105. Springer (2022)
  • [3] Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., Dehghani, M., Minderer, M., Heigold, G., Gelly, S., et al.: An image is worth 16x16 words: Transformers for image recognition at scale. In: International Conference on Learning Representations (2020)
  • [4] Guo, S., Li, T., Kang, H., Li, N., Zhang, Y., Wang, K.: L-Seg: An end-to-end unified framework for multi-lesion segmentation of fundus images. Neurocomputing 349, 52–63 (2019)
  • [5] Guo, T., Yang, J., Yu, Q.: Diabetic retinopathy lesion segmentation using deep multi-scale framework. Biomedical Signal Processing and Control 88, 105050 (2024)
  • [6] Guo, Y., Peng, Y.: CARNet: Cascade attentive RefineNet for multi-lesion segmentation of diabetic retinopathy images. Complex & Intelligent Systems 8(2), 1681–1701 (2022)
  • [7] Houlsby, N., Giurgiu, A., Jastrzebski, S., Morrone, B., De Laroussilhe, Q., Gesmundo, A., Attariyan, M., Gelly, S.: Parameter-efficient transfer learning for NLP. In: International Conference on Machine Learning. pp. 2790–2799. PMLR (2019)
  • [8] Hu, E.J., Wallis, P., Allen-Zhu, Z., Li, Y., Wang, S., Wang, L., Chen, W., et al.: LoRA: Low-rank adaptation of large language models. In: International Conference on Learning Representations (2022)
  • [9] Hu, M., Xia, P., Wang, L., Yan, S., Tang, F., Xu, Z., Luo, Y., Song, K., Leitner, J., Cheng, X., Cheng, J., Liu, C., Zhou, K., Ge, Z.: OphNet: A large-scale video benchmark for ophthalmic surgical workflow understanding (2024)
  • [10] Huang, D., Xiong, X., Ma, J., Li, J., Jie, Z., Ma, L., Li, G.: AlignSAM: Aligning segment anything model to open context via reinforcement learning. In: Computer Vision and Pattern Recognition. pp. 3205–3215 (2024)
  • [11] Kirillov, A., Mintun, E., Ravi, N., Mao, H., Rolland, C., Gustafson, L., Xiao, T., Whitehead, S., Berg, A.C., Lo, W.Y., Dollar, P., Girshick, R.: Segment anything. In: International Conference on Computer Vision. pp. 4015–4026 (2023)
  • [12] Li, T., Gao, Y., Wang, K., Guo, S., Liu, H., Kang, H.: Diagnostic assessment of deep learning algorithms for diabetic retinopathy screening. Information Sciences 501, 511–522 (2019)
  • [13] Ma, J., He, Y., Li, F., Han, L., You, C., Wang, B.: Segment anything in medical images. Nature Communications 15(1), 654 (2024)
  • [14] Porwal, P., Pachade, S., Kamble, R., Kokare, M., Deshmukh, G., Sahasrabuddhe, V., Meriaudeau, F.: Indian diabetic retinopathy image dataset (IDRiD): A database for diabetic retinopathy screening research. Data 3(3), 25 (2018)
  • [15] Qiu, Z., Hu, Y., Li, H., Liu, J.: Learnable ophthalmology SAM. arXiv preprint arXiv:2304.13425 (2023)
  • [16] Radford, A., Kim, J.W., Hallacy, C., Ramesh, A., Goh, G., Agarwal, S., Sastry, G., Askell, A., Mishkin, P., Clark, J., et al.: Learning transferable visual models from natural language supervision. In: International Conference on Machine Learning. pp. 8748–8763. PMLR (2021)
  • [17] Rombach, R., Blattmann, A., Lorenz, D., Esser, P., Ommer, B.: High-resolution image synthesis with latent diffusion models. In: Computer Vision and Pattern Recognition. pp. 10684–10695 (2022)
  • [18] Ronneberger, O., Fischer, P., Brox, T.: U-Net: Convolutional networks for biomedical image segmentation. In: International Conference on Medical Image Computing and Computer-Assisted Intervention. pp. 234–241. Springer (2015)
  • [19] Tang, F., Xu, Z., Qu, Z., Feng, W., Jiang, X., Ge, Z.: Hunting attributes: Context prototype-aware learning for weakly supervised semantic segmentation. In: Computer Vision and Pattern Recognition. pp. 3324–3334 (2024)
  • [20] Tragakis, A., Kaul, C., Murray-Smith, R., Husmeier, D.: The fully convolutional transformer for medical image segmentation. In: Winter Conference on Applications of Computer Vision. pp. 3660–3669 (2023)
  • [21] Wang, D., Zhang, J., Du, B., Xu, M., Liu, L., Tao, D., Zhang, L.: SAMRS: Scaling-up remote sensing segmentation dataset with segment anything model. Advances in Neural Information Processing Systems 36 (2024)
  • [22] Wang, X., Fang, Y., Yang, S., Zhu, D., Wang, M., Zhang, J., Zhang, J., Cheng, J., Tong, K.y., Han, X.: CLC-Net: Contextual and local collaborative network for lesion segmentation in diabetic retinopathy images. Neurocomputing 527, 100–109 (2023)
  • [23] Xia, P., Chen, Z., Tian, J., Gong, Y., Hou, R., Xu, Y., Wu, Z., Fan, Z., Zhou, Y., Zhu, K., et al.: CARES: A comprehensive benchmark of trustworthiness in medical vision language models. arXiv preprint arXiv:2406.06007 (2024)
  • [24] Xia, P., Hu, M., Tang, F., Li, W., Zheng, W., Ju, L., Duan, P., Yao, H., Ge, Z.: Generalizing to unseen domains in diabetic retinopathy with disentangled representations. arXiv preprint arXiv:2406.06384 (2024)
  • [25] Xia, P., Yu, X., Hu, M., Ju, L., Wang, Z., Duan, P., Ge, Z.: HGCLIP: Exploring vision-language models with graph representations for hierarchical understanding. arXiv preprint arXiv:2311.14064 (2023)
  • [26] Xie, X., Niu, J., Liu, X., Chen, Z., Tang, S., Yu, S.: A survey on incorporating domain knowledge into deep learning for medical image analysis. Medical Image Analysis 69, 101985 (2021)
  • [27] Xing, Z., Ye, T., Yang, Y., Liu, G., Zhu, L.: SegMamba: Long-range sequential modeling Mamba for 3D medical image segmentation. arXiv preprint arXiv:2401.13560 (2024)
  • [28] Xing, Z., Zhu, L., Yu, L., Xing, Z., Wan, L.: Hybrid masked image modeling for 3D medical image segmentation. IEEE Journal of Biomedical and Health Informatics (2024)
  • [29] Xiong, X., Wang, C., Li, W., Li, G.: Mammo-SAM: Adapting foundation segment anything model for automatic breast mass segmentation in whole mammograms. In: International Workshop on Machine Learning in Medical Imaging. pp. 176–185. Springer (2023)
  • [30] Yue, W., Zhang, J., Hu, K., Xia, Y., Luo, J., Wang, Z.: SurgicalSAM: Efficient class promptable surgical instrument segmentation. arXiv preprint arXiv:2308.08746 (2023)
  • [31] Zhang, K., Liu, D.: Customized segment anything model for medical image segmentation. arXiv preprint arXiv:2304.13785 (2023)
  • [32] Zhang, R., Jiang, Z., Guo, Z., Yan, S., Pan, J., Dong, H., Gao, P., Li, H.: Personalize segment anything model with one shot. In: International Conference on Learning Representations (2024)