
Fine-tuning Electra and interpreting with Captum Integrated Gradients


This notebook contains an example of fine-tuning an Electra model on the GLUE SST-2 dataset. After fine-tuning, the Integrated Gradients interpretability method is applied to compute token attributions for each target class.
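
A minimal sketch of how such attributions can be computed with Captum's LayerIntegratedGradients on an Electra sequence classifier is shown below; the checkpoint name, baseline construction, and example sentence are illustrative assumptions, not the notebook's exact code.

```python
import torch
from captum.attr import LayerIntegratedGradients
from transformers import ElectraForSequenceClassification, ElectraTokenizerFast

# Placeholder checkpoint; the notebook fine-tunes its own model on SST-2 first.
model_name = "google/electra-small-discriminator"
tokenizer = ElectraTokenizerFast.from_pretrained(model_name)
model = ElectraForSequenceClassification.from_pretrained(model_name)
model.eval()

def forward_func(input_ids, attention_mask):
    # Return logits so Captum can attribute with respect to a chosen class.
    return model(input_ids, attention_mask=attention_mask).logits

text = "a gripping and altogether wonderful film"  # example sentence (assumed)
encoded = tokenizer(text, return_tensors="pt")
input_ids, attention_mask = encoded["input_ids"], encoded["attention_mask"]

# Reference (baseline) input: keep [CLS]/[SEP], replace everything else with [PAD].
baseline_ids = torch.full_like(input_ids, tokenizer.pad_token_id)
baseline_ids[0, 0] = tokenizer.cls_token_id
baseline_ids[0, -1] = tokenizer.sep_token_id

# Attribute through the embedding layer; target=1 is the positive class.
lig = LayerIntegratedGradients(forward_func, model.electra.embeddings)
attributions, delta = lig.attribute(
    inputs=input_ids,
    baselines=baseline_ids,
    additional_forward_args=(attention_mask,),
    target=1,
    return_convergence_delta=True,
)

# Collapse the embedding dimension to get one signed score per token.
scores = attributions.sum(dim=-1).squeeze(0)
scores = scores / torch.norm(scores)
tokens = tokenizer.convert_ids_to_tokens(input_ids[0].tolist())
```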

The notebook is based on the Hugging Face documentation, and the implementation of the Integrated Gradients attribution method is adapted from the Captum.ai tutorial Interpreting BERT Models (Part 1).

Visualization

The Captum visualization library shows in green the tokens that push the prediction towards the target class; tokens driving the score towards the reference value are marked in red. As a result, words perceived as positive appear in green when attribution is performed against class 1 (positive) but are highlighted in red when attribution targets class 0 (negative). Because importance scores are assigned to tokens, not words, some examples show that attribution is highly dependent on tokenization.
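
As an illustration of how this highlighting can be rendered, the sketch below uses Captum's visualization helpers. It reuses the variables from the attribution sketch above (model, tokens, scores, delta), and the gold label and target class are assumptions.

```python
import torch
from captum.attr import visualization as viz

# Predicted class and probability for the example sentence.
with torch.no_grad():
    probs = torch.softmax(model(input_ids, attention_mask=attention_mask).logits, dim=-1)
pred_class = int(probs.argmax(dim=-1))

record = viz.VisualizationDataRecord(
    scores,                       # signed attribution per token
    float(probs[0, pred_class]),  # predicted probability
    pred_class,                   # predicted class
    1,                            # true class of the example (assumed positive)
    1,                            # class the attribution targeted
    float(scores.sum()),          # total attribution score
    tokens,                       # token strings to colour
    delta,                        # convergence delta from Integrated Gradients
)

# Renders an HTML table in the notebook: tokens pushing towards the target
# class in green, tokens pulling towards the reference/baseline in red.
viz.visualize_text([record])
```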

Attributions for a correctly classified positive example


Attributions for a correctly classified negative example


Attributions for a negative example misclassified as positive