C2F-Explainer: Explaining Transformers Better through a Coarse-to-Fine Strategy
Weiping Ding*, Xiaotian Cheng, Yu Geng, Jiashuang Huang, Hengrong Ju
MOTIVATION
As artificial intelligence technology advances, the interpretability of deep learning models has become crucial for deploying AI systems. Explainable artificial intelligence (XAI) helps humans understand and trust a system's decision-making process, especially for opaque, complex models such as Transformers.
In high-stakes fields such as medical diagnosis, interpretability is especially critical. Doctors need to understand why a model made a specific diagnosis in order to make accurate treatment decisions; if the model lacks interpretability, they may doubt its results, reducing patient trust and treatment effectiveness.
Traditional interpretation methods mostly use the final-layer output of the Transformer encoder as masks to generate an explanation map. However, these approaches overlook two crucial aspects. At the coarse-grained level, the masks may contain uncertain information, including unreliable and incomplete object-location data; at the fine-grained level, information loss on the masks introduces spatial noise and loss of detail.
INNOVATION
This study proposes the S3WM module to address the problem that masks generated from the ViT's final-layer feature map contain uncertain information. By applying specific thresholds and conditions, the masks are effectively classified into three types: positive, negative, and uncertain. The positive masks significantly improve model interpretability.
This paper proposes the AF module, which aggregates the attention matrices of each layer in the ViT to generate a relation matrix reflecting the relations among the foreground, background, and edge regions of image blocks. Cosine similarity is used to compute an importance score for each image block, and a weighted fusion is performed on the interpretation results. This effectively addresses the noise caused by information loss in the masks of the preliminary interpretation results, enhancing model interpretability.
The proposed C2F-Explainer is evaluated quantitatively and qualitatively on several datasets, including the ImageNet, COCO 2017, and VOC 2012 natural image datasets and the BraTS brain tumor medical image dataset. The experimental results show that the proposed method achieves better explanations than traditional interpretation methods.
METHOD
The proposed Transformer explanation method, named C2F-Explainer, adopts a coarse-to-fine strategy, as shown in Fig. 1. Fig. 1(a) presents the standard ViT module, Fig. 1(b) shows the S3WM module, and Fig. 1(c) displays the AF module. Because mask quality is uncertain, this study feeds the mask set into the S3WM module. A multi-granularity sequential analysis is conducted to select the positive mask set Mp, which is used to perturb the image and generate a preliminary explanation result S. The result S can accurately locate object positions but often contains background noise, so the interpretation is coarse-grained. Subsequently, this study analyzes the self-attention mechanism, explores the interactions among image blocks, and proposes the AF module. This module uses cross-layer attention information to generate a relation matrix R, which is used to refine the detailed information in the explanation result. Finally, an optimal fine-grained explanation result V is generated, effectively improving the model's interpretability.
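To make the perturbation step concrete, below is a minimal sketch (not the authors' exact formulation) of how a set of positive masks Mp can be turned into a coarse explanation map S: each mask occludes the image, the model scores the occluded input, and the masks are accumulated weighted by those scores. The `model_score` callable, the occlusion-style weighting, and the per-pixel normalization are illustrative assumptions.

```python
import numpy as np

def coarse_explanation(image, positive_masks, model_score):
    """Weight each positive mask by the model's score on the masked image
    and accumulate a pixel-wise importance map S.

    image          : (H, W, C) float array in [0, 1]
    positive_masks : (N, H, W) float array in [0, 1] (the set Mp)
    model_score    : callable mapping an image to a scalar confidence for
                     the target class (stand-in for the ViT classifier head)
    """
    H, W = positive_masks.shape[1:]
    S = np.zeros((H, W), dtype=np.float64)
    for mask in positive_masks:
        score = model_score(image * mask[..., None])  # perturbed input
        S += score * mask                             # score-weighted vote
    S /= positive_masks.sum(axis=0) + 1e-8            # normalize per pixel
    return S

# Toy usage with a dummy scorer (mean intensity of the masked image).
if __name__ == "__main__":
    rng = np.random.default_rng(0)
    img = rng.random((224, 224, 3))
    masks = (rng.random((8, 224, 224)) > 0.5).astype(np.float64)
    S = coarse_explanation(img, masks, lambda x: float(x.mean()))
    print(S.shape, S.min(), S.max())
```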
Sequential Three-way Mask (S3WM) module: The S3WM module reduces the redundancy of the masks generated from the Transformer through agglomerative clustering and then divides the masks into positive, negative, and uncertain masks based on three-way decisions. KL divergence is used to further process the uncertain masks and identify additional high-quality positive masks. Finally, the importance value of each pixel in the image is calculated from these positive masks to improve the model's explanation quality.
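The following is a minimal sketch of the three-way mask partitioning, assuming the decision is made by thresholding the target-class confidence of each masked input and that uncertain masks are promoted to positive when the KL divergence between the clean and masked output distributions is small. The threshold values (alpha, beta, kl_max) are hypothetical, and the agglomerative clustering step is omitted.

```python
import numpy as np

def three_way_mask_selection(mask_probs, clean_probs, target,
                             alpha=0.6, beta=0.3, kl_max=0.5):
    """Split candidate masks into positive / negative / uncertain sets by
    thresholding the target-class confidence of each masked input, then
    promote uncertain masks whose output distribution stays close to the
    clean prediction (small KL divergence).

    mask_probs  : (N, K) softmax outputs of the model on each masked image
    clean_probs : (K,)   softmax output on the unmasked image
    target      : int    index of the explained class
    alpha, beta, kl_max : illustrative thresholds (hypothetical values)
    """
    conf = mask_probs[:, target]
    positive = conf >= alpha
    negative = conf <= beta
    uncertain = ~positive & ~negative

    # KL(clean || masked); only the uncertain masks are screened with it.
    eps = 1e-12
    kl = np.sum(clean_probs * (np.log(clean_probs + eps)
                               - np.log(mask_probs + eps)), axis=1)
    promoted = uncertain & (kl <= kl_max)

    return np.where(positive | promoted)[0], np.where(negative)[0]

# Toy usage: 5 masks, 3 classes, class 0 is the explained class.
if __name__ == "__main__":
    probs = np.array([[0.80, 0.10, 0.10],
                      [0.20, 0.50, 0.30],
                      [0.50, 0.30, 0.20],
                      [0.10, 0.60, 0.30],
                      [0.45, 0.45, 0.10]])
    pos_idx, neg_idx = three_way_mask_selection(
        probs, np.array([0.7, 0.2, 0.1]), target=0)
    print("positive:", pos_idx, "negative:", neg_idx)
```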
Attention Fusion (AF) module: The AF module refines the coarse-grained interpretation generated by the S3WM module. First, it computes a relation matrix R that captures global relationships among image blocks by aggregating attention across the different encoder layers and heads of the Transformer. The module then computes an importance score P for each image block based on its cosine similarity to the preliminary interpretation result S; these scores reflect the correlation between the image block and the object-location information. Finally, the AF module fuses the importance scores with S to generate a more fine-grained importance map V.
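Below is a minimal sketch of this fusion step, assuming a rollout-style chaining of head-averaged attention to build the relation matrix R; the fusion weight gamma and the exact fusion rule are illustrative assumptions rather than the paper's formulation.

```python
import numpy as np

def attention_fusion(attentions, S, gamma=0.5):
    """Refine the coarse patch-level map S with cross-layer attention.

    attentions : (L, h, n, n) self-attention weights over n image blocks,
                 one (h, n, n) tensor per encoder layer
    S          : (n,) coarse block-level importance from the S3WM stage
    gamma      : fusion weight between S and the attention-based scores
                 (hypothetical; the paper's fusion rule may differ)
    """
    # Head-averaged attention per layer, chained across layers to capture
    # global block-to-block relations (rollout-style aggregation).
    n = attentions.shape[-1]
    R = np.eye(n)
    for layer_attn in attentions:
        A = layer_attn.mean(axis=0)            # average over heads
        A = A + np.eye(n)                      # keep residual connections
        A = A / A.sum(axis=-1, keepdims=True)  # re-normalize rows
        R = A @ R                              # accumulate relations

    # Importance score P: cosine similarity between each block's relation
    # vector and the coarse explanation S.
    norm_R = np.linalg.norm(R, axis=1) + 1e-8
    norm_S = np.linalg.norm(S) + 1e-8
    P = (R @ S) / (norm_R * norm_S)

    # Weighted fusion of the coarse map and the attention-based scores.
    V = gamma * S + (1.0 - gamma) * P * S
    return V

# Toy usage: 4 layers, 8 heads, 196 blocks (14x14 patch grid).
if __name__ == "__main__":
    rng = np.random.default_rng(0)
    attn = rng.random((4, 8, 196, 196))
    attn /= attn.sum(axis=-1, keepdims=True)   # row-stochastic attention
    S = rng.random(196)
    V = attention_fusion(attn, S)
    print(V.shape)
```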
Fig. 1. Overall framework of the proposed method.
EXPERIMENTAL RESULTS
TABLE I SEGMENTATION PERFORMANCE COMPARISON RESULTS OF DIFFERENT METHODS ON THE VOC DATASET
TABLE II SEGMENTATION PERFORMANCE COMPARISON RESULTS OF DIFFERENT METHODS ON THE COCO DATASET
TABLE III RESULTS OF THE ABLATION EXPERIMENTS ON THE VOC AND COCO DATASETS
Fig. 2. Comparison of interpretation results on the single-category natural images.
Fig. 3. Comparison of interpretation results on the multi-category natural images. For each image, the interpretation results for two different categories are presented.
Fig. 4. Comparison of interpretation results on the medical images. The last column shows the actual location of the brain tumor.
Fig. 5. ROC curves of different methods on VOC and COCO datasets.
Link: https://doi.org/10.1109/TKDE.2024.3443888