🪄 Interpreto: A Unified Toolkit for Interpretability of Transformer Models
Blog post from Hugging Face
Interpreto is an open-source library for explaining transformer models in natural language processing (NLP). Explainability matters most in sensitive, high-stakes settings, where understanding a model's predictions is essential for trust and fairness.

Unlike existing libraries that commit to a single explanation paradigm, Interpreto supports both attribution-based and concept-based explanations, and it covers both classification and generative models. The library integrates directly with Hugging Face transformers and ships with evaluation tools for assessing explanation quality.

For attribution, Interpreto provides both inference-based and gradient-based approaches to estimate the importance of each token. Concept-based methods instead aim to identify and interpret higher-level features within model activations: the library includes tools for learning and interpreting concepts, such as Semi-NMF and various sparse autoencoders, along with metrics that evaluate the faithfulness and complexity of the resulting explanations.

Overall, Interpreto aims to make explainability for NLP models practical and accessible, serving researchers and practitioners who need transparent insight into model behavior.
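To make the concept-learning idea concrete, here is a minimal from-scratch NumPy sketch of Semi-NMF applied to an activation matrix. This is not Interpreto's API: the function name `semi_nmf` and its signature are hypothetical, and the multiplicative updates follow the standard Ding-style Semi-NMF scheme, factoring activations `X` into non-negative concept coefficients `U` and unconstrained concept directions `W`.

```python
import numpy as np

def semi_nmf(X, k, n_iter=200, seed=0, eps=1e-9):
    """Semi-NMF sketch: X ≈ U @ W with U >= 0 and W unconstrained.

    X : (n_samples, n_features) activation matrix (e.g. hidden states).
    Returns U (n_samples, k) non-negative concept coefficients and
    W (k, n_features) concept directions.
    """
    rng = np.random.default_rng(seed)
    n, d = X.shape
    U = rng.random((n, k))                    # non-negative initialization
    pos = lambda A: (np.abs(A) + A) / 2       # element-wise positive part
    neg = lambda A: (np.abs(A) - A) / 2       # element-wise negative part
    for _ in range(n_iter):
        # W is the unconstrained least-squares solution of X ≈ U W
        W = np.linalg.pinv(U) @ X
        XWt = X @ W.T                         # (n, k)
        WWt = W @ W.T                         # (k, k)
        # multiplicative update keeps U non-negative throughout
        U *= np.sqrt((pos(XWt) + U @ neg(WWt)) /
                     (neg(XWt) + U @ pos(WWt) + eps))
    return U, W
```

In a concept-based pipeline, each row of `W` would play the role of a concept direction in activation space, and `U` would say how strongly each sample expresses each concept; the non-negativity of `U` is what makes the coefficients readable as concept presence.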