Fine-tuning Electra and interpreting with Captum Integrated Gradients

Repo Colab notebook

This notebook contains an example of fine-tuning an Electra model on the GLUE SST-2 dataset. After fine-tuning, the Integrated Gradients interpretability method is applied to compute per-token attributions for each target class.
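To make the attribution step concrete, here is a minimal sketch of Integrated Gradients on a toy two-class linear-softmax classifier, written in plain Python. It is not the notebook's Captum code: the model, the weights, the input and the all-zero baseline are hypothetical values chosen for illustration, and the path integral is approximated with a midpoint Riemann sum.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def logits_for(x, weights):
    """One logit per class: dot product of that class's weight row with x."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in weights]


def class_prob(x, weights, target):
    """Probability the toy classifier assigns to class `target`."""
    return softmax(logits_for(x, weights))[target]


def grad_class_prob(x, weights, target):
    """Analytic gradient of class_prob with respect to the input x."""
    probs = softmax(logits_for(x, weights))
    grads = []
    for j in range(len(x)):
        expected_w = sum(p * row[j] for p, row in zip(probs, weights))
        grads.append(probs[target] * (weights[target][j] - expected_w))
    return grads


def integrated_gradients(x, baseline, weights, target, n_steps=200):
    """Average the gradient along the straight path from baseline to x,
    then scale by (x - baseline)."""
    total = [0.0] * len(x)
    for step in range(n_steps):
        alpha = (step + 0.5) / n_steps  # midpoint of each sub-interval
        point = [b + alpha * (xi - b) for xi, b in zip(x, baseline)]
        grad = grad_class_prob(point, weights, target)
        total = [t + g for t, g in zip(total, grad)]
    return [(xi - b) * t / n_steps for xi, b, t in zip(x, baseline, total)]


# Hypothetical toy values, not taken from the notebook.
weights = [[-1.0, 0.5, 2.0], [1.0, -0.5, -2.0]]  # row 0: negative, row 1: positive
x = [0.2, 1.0, -0.7]
baseline = [0.0, 0.0, 0.0]

attr_positive = integrated_gradients(x, baseline, weights, target=1)
gap_positive = class_prob(x, weights, 1) - class_prob(baseline, weights, 1)
print(attr_positive, sum(attr_positive), gap_positive)
```

The attributions should sum, up to the approximation error, to the difference between the model output at the input and at the baseline. The same check is a useful sanity test on a real model.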

The notebook is based on the Hugging Face documentation; the Integrated Gradients implementation is adapted from the Captum tutorial Interpreting BERT Models (Part 1).

Visualization

The Captum visualization library highlights in green the tokens that push the prediction toward the target class. Tokens that pull the score away from the target class, toward the reference value, are marked in red. As a result, words perceived as positive will appear green when attribution targets class 1 (positive), but red when attribution targets class 0 (negative).
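The color flip between target classes can be sketched with a few lines of plain Python. This is only an illustration of the sign convention, not Captum's rendering code: `highlight_color` is a hypothetical helper, the tokens and scores are made up, and the exact negation between the two classes assumes attributions are computed on the softmax probability of a two-class model, where the two probabilities sum to one.

```python
def highlight_color(score, tolerance=1e-6):
    """Map an attribution score to a color label by its sign."""
    if score > tolerance:
        return "green"   # pushes toward the target class
    if score < -tolerance:
        return "red"     # pushes away from the target class
    return "neutral"


# Made-up tokens and scores for illustration only.
tokens = ["a", "charming", "film"]
scores_class_1 = [0.02, 0.61, 0.0]            # target: positive
scores_class_0 = [-s for s in scores_class_1]  # target: negative (assumed exact negation)

colors_class_1 = [highlight_color(s) for s in scores_class_1]
colors_class_0 = [highlight_color(s) for s in scores_class_0]

for token, c1, c0 in zip(tokens, colors_class_1, colors_class_0):
    print(f"{token:10s} class 1: {c1:8s} class 0: {c0}")
```

The actual visualization also varies color intensity with the magnitude of each score, which this sketch leaves out.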

Because importance scores are assigned to tokens rather than words, some examples show that attribution depends heavily on tokenization.
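One way to read attributions at the word level is to merge WordPiece continuation pieces (those starting with `##`) back into whole words and sum their scores. The sketch below is a possible post-processing step, not part of the notebook: `merge_wordpieces` is a hypothetical helper, and the token split and scores are invented for illustration. Summing is one reasonable choice because Integrated Gradients attributions are additive; taking the mean or the maximum would be alternatives.

```python
def merge_wordpieces(tokens, scores):
    """Merge '##' continuation pieces into words, summing their scores."""
    words, word_scores = [], []
    for token, score in zip(tokens, scores):
        if token.startswith("##") and words:
            words[-1] += token[2:]      # append the piece without the '##' prefix
            word_scores[-1] += score    # accumulate the attribution
        else:
            words.append(token)
            word_scores.append(score)
    return words, word_scores


# Invented tokenization and scores, not output from the notebook.
tokens = ["[CLS]", "un", "##watch", "##able", "film", "[SEP]"]
scores = [0.0, -0.10, -0.25, -0.40, 0.05, 0.0]

words, word_scores = merge_wordpieces(tokens, scores)
for word, score in zip(words, word_scores):
    print(f"{word:12s} {score:+.2f}")
```

In this made-up case no single piece stands out, yet the merged word carries most of the negative attribution, which is the kind of effect tokenization can hide in the per-token view.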

Attributions for a correctly classified positive example

Electra attributions, positive sample classified as positive

Attributions for a correctly classified negative example

Electra attributions, negative sample classified as negative

Attributions for a negative sample misclassified as positive

Electra attributions, negative sample misclassified as positive