This notebook contains an example of fine-tuning an Electra model on the GLUE SST-2 dataset. After fine-tuning, the Integrated Gradients interpretability method is applied to compute tokens' attributions for each target class.
- We instantiate a pre-trained Electra model from the Transformers library.
- The data is downloaded from the nlp library. Input text is tokenized with ElectraTokenizerFast, backed by the HF tokenizers library.
- Fine-tuning for sentiment analysis is handled by the Trainer class.
- After fine-tuning, the Integrated Gradients algorithm assigns importance scores to input tokens, using a PyTorch implementation from Captum.
- The algorithm requires a reference sample (a baseline); attribution is performed based on the model's output as inputs change from reference values to the actual sample.
- Integrated Gradients satisfies the completeness property — the sum of attributions for a sample approximates the prediction's shift from the baseline value.
- The final sections of the notebook contain a colour-coded visualization of attribution results, made with
captum.attr.visualization.
The notebook is based on the Hugging Face documentation; the Integrated Gradients implementation is adapted from Interpreting BERT Models (Part 1).
Visualization
The Captum visualization library shows in green tokens that push the prediction toward the target class. Those driving the score toward the reference value are marked in red. As a result, words perceived as positive will appear green when attribution targets class 1 (positive), but red when attribution targets class 0 (negative).
Because importance scores are assigned to tokens — not words — some examples show that attribution is highly dependent on tokenization.
Attributions for a correctly classified positive example
Attributions for a correctly classified negative example
Attributions for a negative sample misclassified as positive