interpretability
some interesting papers on interpretable machine learning, largely organized based on this interpretable ml review (murdoch et al. 2019) and notes from this interpretable ml book (molnar 2019)
reviews
definitions
The definition of interpretability I find most useful is that given in murdoch et al. 2019: basically that interpretability requires a pragmatic approach in order to be useful. As such, interpretability is only defined with respect to a specific audience + problem and an interpretation should be evaluated in terms of how well it benefits a specific context. It has been defined and studied more broadly in a variety of works:
 Explore, Explain and Examine Predictive Models (biecek & burzykowski, in progress)  another book on exploratory analysis with interpretability
 Explanation Methods in Deep Learning: Users, Values, Concerns and Challenges (ras et al. 2018)
 Explainable Deep Learning: A Field Guide for the Uninitiated
 Interpretable Deep Learning in Drug Discovery
 Explainable AI: A Brief Survey on History, Research Areas, Approaches and Challenges
 Explainable Artificial Intelligence (XAI): Concepts, Taxonomies, Opportunities and Challenges toward Responsible AI
 Against Interpretability: a Critical Examination of the Interpretability Problem in Machine Learning  “where possible, discussion should be reformulated in terms of the ends of interpretability”
overviews
 Towards a Generic Framework for Blackbox Explanation Methods (henin & metayer 2019)
 sampling  selection of inputs to submit to the system to be explained
 generation  analysis of links between selected inputs and corresponding outputs to generate explanations
 proxy  approximates model (ex. rule list, linear model)
 explanation generation  explains the proxy (ex. just give most important 2 features in rule list proxy, ex. LIME gives coefficients of linear model, Shap: sums of elements)
 interaction (with the user)
 this is a super useful way to think about explanations (especially local), but doesn’t work for SHAP / CD which are more about how much a variable contributes rather than a local approximation
 feature (variable) importance measurement review (VIM) (wei et al. 2015)
 often termed sensitivity, contribution, or impact
 some of these can be applied to data directly w/out model (e.g. correlation coefficient, rank correlation coefficient, moment-independent VIMs)
 Pitfalls to Avoid when Interpreting Machine Learning Models (molnar et al. 2020)
 Feature Removal Is a Unifying Principle for Model Explanation Methods (covert, lundberg, & lee 2020)
 Interpretable Machine Learning: Fundamental Principles and 10 Grand Challenges (rudin et al. ‘21)
evaluating interpretability
Evaluating interpretability can be very difficult (largely because it rarely makes sense to talk about interpretability outside of a specific context). The best possible evaluation of interpretability requires benchmarking it with respect to the relevant audience in a context. For example, if an interpretation claims to help understand radiology models, it should be tested based on how well it helps radiologists when actually making diagnoses. The papers here try to find more generic alternative ways to evaluate interp methods (or just define desiderata to do so).
 Towards A Rigorous Science of Interpretable Machine Learning (doshi-velez & kim 2017)
 Benchmarking Attribution Methods with Relative Feature Importance (yang & kim 2019)
 train a classifier, add random stuff (like dogs) to the image, classifier should assign them little importance
 Visualizing the Impact of Feature Attribution Baselines
 top-k ablation: identify the top-k pixels by attribution, ablate them, and check that the model’s output actually decreases
 center-of-mass ablation: alternatively, identify the center of mass of the saliency map and blur a box around it (to avoid destroying feature correlations in the model)
 should we be true to the model or true to the data?
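The top-k ablation check above can be sketched as follows. This is a minimal illustration, assuming a toy linear scorer in place of a trained classifier and a zero baseline for ablated features; the names `model` and `ablate_top_k` are hypothetical and not taken from the cited article.

```python
def model(x, weights=(3.0, 0.1, -2.0, 0.5)):
    """Toy linear scorer standing in for a trained classifier's output."""
    return sum(w * xi for w, xi in zip(weights, x))

def ablate_top_k(x, attributions, k, baseline=0.0):
    """Replace the k features with the largest attributions by a baseline value."""
    top = sorted(range(len(x)), key=lambda i: attributions[i], reverse=True)[:k]
    return [baseline if i in top else xi for i, xi in enumerate(x)]

x = [1.0, 1.0, 1.0, 1.0]
# gradient * input attribution, which is exact for this linear model
attributions = [3.0, 0.1, -2.0, 0.5]

original = model(x)
ablated = model(ablate_top_k(x, attributions, k=1))
drop = original - ablated   # a faithful attribution should give a large positive drop
```

A useful comparison is the drop obtained when ablating k randomly chosen features: a good attribution method should beat that baseline.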
 Evaluating Feature Importance Estimates (hooker et al. 2019)
 remove-and-retrain: measure the decrease in test accuracy
 Quantifying Interpretability of Arbitrary Machine Learning Models Through Functional Decomposition (molnar 2019)
 An Evaluation of the Human-Interpretability of Explanation (lage et al. 2019)
 Multi-value Rule Sets for Interpretable Classification with Feature-Efficient Representations (wang, 2018)
 measure how long it takes for people to calculate predictions from different rule-based models
 On the (In)fidelity and Sensitivity for Explanations
 Do Explanations Reflect Decisions? A Machine-centric Strategy to Quantify the Performance of Explainability Algorithms
 On Validating, Repairing and Refining Heuristic ML Explanations
 How Much Can We See? A Note on Quantifying Explainability of Machine Learning Models
 Manipulating and Measuring Model Interpretability (sangdeh et al. … wallach 2019)
 participants who were shown a clear model with a small number of features were better able to simulate the model’s predictions
 no improvements in the degree to which participants followed the model’s predictions when it was beneficial to do so.
 increased transparency hampered people’s ability to detect when the model makes a sizable mistake and correct for it, seemingly due to information overload
 Towards a Framework for Validating Machine Learning Results in Medical Imaging
 An Integrative 3C evaluation framework for Explainable Artificial Intelligence
 Evaluating Explanation Without Ground Truth in Interpretable Machine Learning (yang et al. 2019)
 predictability (does the knowledge in the explanation generalize well)
 fidelity (does explanation reflect the target system well)
 persuasibility (does the human find the explanation satisfying and comprehensible)
basic failures
 Sanity Checks for Saliency Maps (adebayo et al. 2018)
 Model Parameter Randomization Test  attributions should be different for trained vs random model, but they aren’t for many attribution methods
 Rethinking the Role of Gradient-based Attribution Methods for Model Interpretability (srinivas & fleuret, 2021)
 logits can be shifted by an arbitrary function of the input without affecting preds, yet the shift changes gradient-based explanations
 gradient-based explanations, then, don’t necessarily capture info about $p_\theta(y \mid x)$
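A small numeric sketch of the logit-shift argument, assuming a toy two-class model with a single scalar input and an arbitrary shift $g(x) = 5x$ (both are illustrative choices, not from the paper): adding the same input-dependent scalar to every logit leaves the softmax output unchanged while changing the input-gradient of each logit.

```python
import math

def softmax(logits):
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def logits(x):
    """Toy two-class model with one scalar input."""
    return [2.0 * x, -1.0 * x]

def shifted_logits(x):
    """Same model with an input-dependent scalar g(x) = 5x added to every logit."""
    g = 5.0 * x
    return [z + g for z in logits(x)]

def input_gradient(f, x, cls, eps=1e-6):
    """Central finite-difference gradient of logit `cls` with respect to x."""
    return (f(x + eps)[cls] - f(x - eps)[cls]) / (2 * eps)

x = 0.3
same_probs = all(
    abs(p - q) < 1e-12 for p, q in zip(softmax(logits(x)), softmax(shifted_logits(x)))
)
grad_original = input_gradient(logits, x, cls=0)         # slope of 2x
grad_shifted = input_gradient(shifted_logits, x, cls=0)  # slope of 2x + 5x
```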
 Assessing the (Un)Trustworthiness of Saliency Maps for Localizing Abnormalities in Medical Imaging (arun et al. 2020)  CXR images from SIIM-ACR Pneumothorax Segmentation + RSNA Pneumonia Detection
 metrics: localization (do they overlap with GT segs/bounding boxes), variation with model weight randomization, repeatability (i.e. same after retraining?), reproducibility (i.e. same after training different model?)
 Interpretable Deep Learning under Fire (zhang et al. 2019)
 Wittgenstein: “if a lion could speak, we could not understand him.”
adv. vulnerabilities
 How can we fool LIME and SHAP? Adversarial Attacks on Post hoc Explanation Methods
 we can build classifiers which use important features (such as race) but explanations will not reflect that
 basically classifier is different on X which is OOD (and used by LIME and SHAP)
 Interpretation of Neural Networks is Fragile (ghorbani et al. 2018)
 minor perturbations to inputs can drastically change DNN interpretations
 Fooling Neural Network Interpretations via Adversarial Model Manipulation (heo, joo, & moon 2019)  can change model weights so that it keeps predictive accuracy but changes its interpretation
 motivation: could falsely look like a model is “fair” because it places little saliency on sensitive attributes
 output of model can still be checked regardless
 fooled interpretation generalizes to entire validation set
 can force the new saliency to be whatever we like
 passive fooling  highlighting uninformative pixels of the image
 active fooling  highlighting a completely different object, the firetruck
 model does not actually change that much  predictions when manipulating pixels in order of saliency remains similar, very different from random (fig 4)
intrinsic interpretability (i.e. how can we fit simpler models)
For an implementation of many of these models, see the python imodels package.
decision rules overview
For more on rules, see logic notes.
 2 basic concepts for a rule
 coverage = support
 accuracy = confidence = consistency
 measures for rules: precision, info gain, correlation, m-estimate, Laplace estimate
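As a minimal sketch of the two basic concepts, assuming the common definitions (support is the fraction of all rows a rule covers; confidence is the fraction of covered rows with the predicted label) and a hypothetical toy dataset:

```python
def support_and_confidence(rule, label, rows):
    """Support: share of rows the rule covers. Confidence: share of covered rows with `label`."""
    covered = [r for r in rows if rule(r)]
    support = len(covered) / len(rows)
    confidence = sum(r["y"] == label for r in covered) / len(covered) if covered else 0.0
    return support, confidence

rows = [
    {"age": 70, "y": 1},
    {"age": 65, "y": 1},
    {"age": 80, "y": 0},
    {"age": 30, "y": 0},
    {"age": 25, "y": 0},
]
# rule: IF age > 60 THEN y = 1
support, confidence = support_and_confidence(lambda r: r["age"] > 60, 1, rows)
```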
 these algorithms usually don’t support regression, but you can get regression by cutting the outcome into intervals
 why might these be useful?
 The Magical Mystery Four: How is Working Memory Capacity Limited, and Why? (cowan, 2010)  a central memory store limited to 3 to 5 meaningful items in young adults
 connections
 every decision list is a (one-sided) decision tree
 every decision tree can be expressed as an equivalent decision list (by listing each path to a leaf as a decision rule)
 leaves of a decision tree (or a decision list) form a decision set
 recent work directly optimizes the performance metric (e.g., accuracy) with soft or hard sparsity constraints on the tree size, where sparsity is measured by the number of leaves in the tree using:
 mathematical programming, including mixed integer programming (MIP) / SAT solvers
 stochastic search through the space of trees
 customized dynamic programming algorithms that incorporate branch-and-bound techniques for reducing the size of the search space
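The tree-to-list connection noted above (listing each path to a leaf as a decision rule) can be sketched as follows. The nested-dict tree format and the helper names are assumptions made for this example only.

```python
def tree_to_rules(node, conditions=()):
    """List every root-to-leaf path as a (conditions, prediction) pair."""
    if "pred" in node:
        return [(conditions, node["pred"])]
    feat, thr = node["feature"], node["threshold"]
    left = tree_to_rules(node["left"], conditions + ((feat, "<=", thr),))
    right = tree_to_rules(node["right"], conditions + ((feat, ">", thr),))
    return left + right

def predict_tree(node, x):
    while "pred" not in node:
        node = node["left"] if x[node["feature"]] <= node["threshold"] else node["right"]
    return node["pred"]

def predict_rules(rules, x):
    """Return the prediction of the first rule whose conditions all hold."""
    for conditions, pred in rules:
        if all((x[f] <= t) if op == "<=" else (x[f] > t) for f, op, t in conditions):
            return pred
    raise ValueError("no rule fired")

tree = {
    "feature": "age", "threshold": 50,
    "left": {"pred": 0},
    "right": {
        "feature": "bp", "threshold": 140,
        "left": {"pred": 0},
        "right": {"pred": 1},
    },
}
rules = tree_to_rules(tree)
```

Because the paths of a tree partition the input space, exactly one rule fires for any input, so the order of the resulting list does not matter here.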
rule sets
Rule sets commonly look like a series of independent if-then rules. Unlike trees / lists, these rules can overlap and might not cover the whole space. Final predictions can be made via majority vote, using the most accurate rule, or averaging predictions. Sometimes also called rule ensembles.
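A minimal sketch of majority-vote prediction with an overlapping, non-exhaustive rule set. The rules and the `default` fallback for uncovered points are illustrative assumptions; ties are broken arbitrarily here.

```python
from collections import Counter

def predict_rule_set(rules, x, default):
    """Majority vote among the rules that fire; fall back to `default` if none do."""
    votes = Counter(label for condition, label in rules if condition(x))
    if not votes:
        return default
    return votes.most_common(1)[0][0]

rules = [
    (lambda x: x["age"] > 60, 1),
    (lambda x: x["bp"] > 140, 1),
    (lambda x: x["age"] < 40, 0),
]

both_fire = predict_rule_set(rules, {"age": 70, "bp": 150}, default=0)   # two rules vote 1
uncovered = predict_rule_set(rules, {"age": 50, "bp": 120}, default=0)   # no rule fires
```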
 popular ways to learn rule sets
 SLIPPER (cohen & singer, 1999)  repeatedly boosting a simple, greedy rule-builder
 Lightweight Rule Induction (weiss & indurkhya, 2000)  specify number + size of rules and classify via majority vote
 Maximum Likelihood Rule Ensembles (Dembczyński et al. 2008)  MLRules  rule is base estimator in ensemble  build by greedily maximizing log-likelihood
 rulefit (friedman & popescu, 2008)  extract rules from many decision trees, then fit sparse linear model on them
 A statistical approach to rule learning (ruckert & kramer, 2006)  unsupervised objective to mine rules with large margin and low variance before fitting linear model
 Generalized Linear Rule Models (wei et al. 2019)  use column generation (CG) to intelligently search space of rules
 refit GLM as rules are generated, reweighting + discarding
 with large number of columns, can be intractable even to enumerate rules  CG avoids this by fitting a subset and using it to construct most promising next column
 also propose a non-CG algorithm using only 1st-degree rules
 note: from every pair of complementary singleton rules (e.g., $X_j \leq1$, $X_j > 1$), they remove one member as otherwise the pair together is collinear
 Multivariate Adaptive Regression Splines (MARS) (friedman, 1991)  sequentially learn weighted linear sum of ReLUs (or products of ReLUs)
 do backward deletion procedure at the end
 more recent global versions of learning rule sets
 interpretable decision set (lakkaraju et al. 2016)  set of if then rules
 short, accurate, and non-overlapping rules that cover the whole feature space and pay attention to small but important classes
 A Bayesian Framework for Learning Rule Sets for Interpretable Classification (wang et al. 2017)  rules are a bunch of clauses OR’d together (e.g. if (X1>0 AND X2<1) OR (X2<1 AND X3>1) OR … then Y=1)
 they call this method “Bayesian Rule Sets”
 Or’s of And’s for Interpretable Classification, with Application to Context-Aware Recommender Systems (wang et al. 2015)  BOA  Bayesian Or’s of And’s
 Vanishing boosted weights: A consistent algorithm to learn interpretable rules (sokolovska et al. 2021)  simple efficient fine-tuning procedure for decision stumps
 when learning sequentially, often useful to prune at each step (Furnkranz, 1997)
rule lists
 oneR algorithm  select feature that carries most information about the outcome and then split multiple times on that feature
 sequential covering  keep trying to cover more points sequentially
 pre-mining frequent patterns (want them to apply to a large amount of data and not have too many conditions)
 FP-Growth algorithm (borgelt 2005) is fast
 Apriori + Eclat do the same thing, but with different speeds
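The oneR algorithm described above can be sketched as follows, assuming categorical features and using training error as a stand-in for "carries most information about the outcome" (a simplification). The toy weather rows are hypothetical.

```python
from collections import Counter, defaultdict

def one_r(rows, features, target):
    """Pick the single feature whose value -> majority-class table makes the fewest errors."""
    best = None
    for feat in features:
        counts = defaultdict(Counter)
        for row in rows:
            counts[row[feat]][row[target]] += 1
        # one branch per feature value, each predicting its majority class
        table = {value: c.most_common(1)[0][0] for value, c in counts.items()}
        errors = sum(row[target] != table[row[feat]] for row in rows)
        if best is None or errors < best[2]:
            best = (feat, table, errors)
    return best

rows = [
    {"outlook": "sunny", "windy": "no", "play": "no"},
    {"outlook": "sunny", "windy": "yes", "play": "no"},
    {"outlook": "overcast", "windy": "no", "play": "yes"},
    {"outlook": "rainy", "windy": "no", "play": "yes"},
    {"outlook": "rainy", "windy": "yes", "play": "no"},
    {"outlook": "overcast", "windy": "yes", "play": "yes"},
]
feature, table, errors = one_r(rows, ["outlook", "windy"], "play")
```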
 interpretable classifiers using rules and bayesian analysis (letham et al. 2015)
 start by pre-mining frequent pattern rules
 current approach does not allow for negation (e.g. not diabetes) and must split continuous variables into categorical somehow (e.g. quartiles)
 mines things that frequently occur together, but doesn’t look at outcomes in this step  okay (since this is all about finding rules with high support)
 learn rules w/ prior for short rule conditions and short lists
 start w/ random list
 sample new lists by adding/removing/moving a rule
 at the end, return the list that had the highest probability
 scalable bayesian rule lists (yang et al. 2017)  faster algorithm for computing
 doesn’t return entire posterior
 learning certifiably optimal rules lists (angelino et al. 2017)  even faster optimization for categorical feature space
 can get upper / lower bounds for loss = risk + $\lambda$ * listLength
 doesn’t return entire posterior
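A minimal sketch of the regularized objective mentioned above (loss = risk + $\lambda$ * listLength), assuming risk is the misclassification rate and that the list ends in a default prediction. The rule lists and data are hypothetical, and this only evaluates the objective; it does not perform the certifiably optimal search.

```python
def rule_list_predict(rule_list, default, x):
    """Apply the first rule that fires; otherwise use the default prediction."""
    for condition, label in rule_list:
        if condition(x):
            return label
    return default

def objective(rule_list, default, X, y, lam):
    """Misclassification rate plus lam times the number of rules."""
    errors = sum(rule_list_predict(rule_list, default, x) != yi for x, yi in zip(X, y))
    return errors / len(X) + lam * len(rule_list)

X = [{"a": 1, "b": 0}, {"a": 1, "b": 1}, {"a": 0, "b": 1}, {"a": 0, "b": 0}]
y = [1, 1, 1, 0]
short = [(lambda x: x["a"] == 1, 1)]
longer = [(lambda x: x["a"] == 1, 1), (lambda x: x["b"] == 1, 1)]

small_penalty = (objective(short, 0, X, y, 0.1), objective(longer, 0, X, y, 0.1))
large_penalty = (objective(short, 0, X, y, 0.3), objective(longer, 0, X, y, 0.3))
```

With a small $\lambda$ the longer, more accurate list has the lower objective; with a larger $\lambda$ the shorter list wins despite its extra error.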
 Expert-augmented machine learning (gennatas et al. 2019)
 make rule lists, then compare the outcomes for each rule with what clinicians think should be outcome for each rule
 look at rules with biggest disagreement and engineer/improve rules or penalize unreliable rules
 Fast and frugal heuristics: The adaptive toolbox (gigerenzer et al. 1999)  makes rule lists that can split on either node of the tree each time
trees
Trees suffer from the fact that they have to cover the entire decision space and often we end up with replicated subtrees.
 Generalized and Scalable Optimal Sparse Decision Trees (lin et al. 2020)
 optimize for $\min L(X, y) + \lambda \cdot (\text{numLeaves})$
 full decision tree optimization is NP-hard (Hyafil & Rivest, 1976)
 can optimize many different losses (e.g. accuracy, AUC)
 speedups: use dynamic programming, prune the search space with bounds
 if know best loss so far, know we shouldn’t add too many more leaves since each adds $\lambda$ to the total loss
 similar-support bound  if two features are similar, then bounds for splitting on the first can be used to obtain bounds for the second
 hash trees with bit-vectors that represent similar trees using shared subtrees
 tree is a set of leaves
 bounds
 OSDT
 Upper Bound on Number of Leaves
 Leaf Permutation Bound
 GOSDT
 Hierarchical Objective Lower Bound
 Incremental Progress Bound to Determine Splitting
 Lower Bound on Incremental Progress
 Equivalent Points Bound
 Similar Support Bound
 Incremental Similar Support Bound
 Subset Bound
 optimal sparse decision trees (hu et al. 2019)  previous paper, slower
 cost-complexity pruning (breiman et al. 1984 ch 3)  greedily prune while minimizing loss function of loss + $\lambda \cdot (\text{numLeaves})$
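A small sketch of the leaf-penalized objective used in this section, together with the pruning argument that a known best objective caps the number of leaves worth considering. The function names are hypothetical and the bound is a simplified reading of that argument, not the exact bound from any of the papers.

```python
import math

def tree_objective(misclassification_rate, num_leaves, lam):
    """Loss plus lam times the number of leaves."""
    return misclassification_rate + lam * num_leaves

def max_useful_leaves(best_objective_so_far, lam):
    """Trees with more leaves than this pay more in penalty alone than the best known objective."""
    return math.floor(best_objective_so_far / lam)

lam = 0.03
best = tree_objective(misclassification_rate=0.10, num_leaves=4, lam=lam)
bound = max_useful_leaves(best, lam)
# even a perfectly accurate tree with bound + 1 leaves cannot beat `best`
perfect_but_large = tree_objective(0.0, bound + 1, lam)
```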
 optimal classification trees methodology paper (bertsimas & dunn, 2017)  globally optimal decision tree with expensive optimization (solved with mixed-integer optimization)  realistically, usually too slow
 $\begin{array}{cl}
\min & \overbrace{R_{x y}(T)}^{\text{misclassification err}}+\alpha|T| \\
\text { s.t. } & N_{x}(l) \geq N_{\min } \quad \forall l \in \text { leaves }(T) \end{array}$
 $|T|$ is the number of branch nodes in tree $T$; $N_x(l)$ is the number of training points contained in leaf node $l$
 optimal classification trees vs PECARN (bertsimas et al. 2019)
 supplemental tables
 replicated subtree problem (Pagallo & Haussler, 1990)
 use iterative algorithms to try to overcome it
 Building more accurate decision trees with the additive tree (luna et al. 2019)
 present additive tree (AddTree), which builds a single decision tree, which is between a single CART tree and boosted decision stumps
 cart can be seen as a boosting algorithm on stumps
 can rewrite boosted stumps as a tree very easily
 previous work: can grow tree based on Adaboost idea = AdaTree
 extremely randomized trees  randomness goes further, not only feature is selected randomly but also split has some randomness
 Bayesian Treed Models (chipman et al. 2001)  impose priors on tree parameters
 treed models  fit a model (e.g. linear regression) in leaf nodes
 tree structure e.g. depth, splitting criteria
 values in terminal nodes conditioned on tree structure
 residual noise’s standard deviation
 BART: Bayesian additive regression trees (chipman et al. 2008)  learns an ensemble of tree models using MCMC on a distr. imbued with a prior
 prespecify number of trees in ensemble
 MCMC step: add split, remove split, switch split
 cycles through the trees one at a time
 On the price of explainability for some clustering problems (laber et al. 2021)  trees for clustering
 history
 automatic interaction detection (AID) regression trees (Morgan & Sonquist, 1963)
 THeta Automatic Interaction Detection (THAID) classification trees (Messenger & Mandell, 1972)
 Chi-squared Automatic Interaction Detector (CHAID) (Kass, 1980)
 Classification And Regression Trees (CART) (Breiman et al. 1984)
 ID3 (Quinlan, 1986) / C4.5 (Quinlan, 1993)
 new directions
 ensemble methods: Random Forest, GBDT, BART
 global trees: Bennett, Street, Mangasarian (e.g. Global Tree Optimization, 1994)
 improvements in splitting criteria, missing variables
 other problems: longitudinal data, survival curves
linear (+algebraic) models
supersparse models
 four main types of approaches to building scoring systems
 exact solutions using optimization techniques (often use MIP)
 approximation algorithms using linear programming (use L1 penalty instead of L0)
 can also try sampling
 more sophisticated rounding techniques  e.g. random, constrain sum, round each coef sequentially
 computer-aided exploration techniques
 Supersparse linear integer models for optimized medical scoring systems (ustun & rudin 2016)
 2helps2b paper
 note: scoring systems map points to a risk probability
 An Interpretable Model with Globally Consistent Explanations for Credit Risk (chen et al. 2018)  a 2layer linear additive model
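As a sketch of the note that scoring systems map points to a risk probability: the integer points, the intercept, and the logistic link below are all illustrative assumptions, not values or a form taken from any of the cited papers.

```python
import math

def risk_score(points, intercept, patient):
    """Sum integer points for the features present, then map the total to a probability."""
    total = sum(p for feature, p in points.items() if patient.get(feature))
    # assumed logistic link from total score to risk
    return total, 1.0 / (1.0 + math.exp(-(intercept + total)))

points = {"age_over_60": 2, "hypertension": 1, "smoker": 1}   # hypothetical integer coefficients
total, risk = risk_score(points, intercept=-3, patient={"age_over_60": True, "smoker": True})
```

Small integer coefficients are what make such a model usable by hand: the score is a short sum, and the score-to-risk mapping can be precomputed as a lookup table.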
gams (generalized additive models)
 gam takes form $g(\mu) = b + f_0(x_0) + f_1(x_1) + f_2(x_2) + …$
 usually assume some basis for the $f$, like splines or polynomials (and we select how many either manually or with some complexity penalty)
 traditional way to fit  backfitting: each $f_i$ is fitted sequentially to the residuals of the previously fitted $f_0,…,f_{i-1}$ (hastie & tibshirani, 199)
 boosting  fit all $f$ simultaneously, e.g. one tree for each $f_i$ on each iteration
 can make this more interpretable by (1) making the $f$ functions smoother or (2) sparsity in the number of functions
 could also add in interaction terms…
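A minimal backfitting sketch, assuming the simplest possible basis (each $f_j$ is a centered linear function of $x_j$) and the usual cyclic variant in which each component is refit to the partial residuals that exclude all other components. A real GAM would swap the per-feature least-squares fit for a spline or other smoother.

```python
def backfit(X, y, n_iters=200):
    """Cyclically refit each component to the residual left by the intercept and the others."""
    n, p = len(X), len(X[0])
    intercept = sum(y) / n
    means = [sum(row[j] for row in X) / n for j in range(p)]
    Xc = [[row[j] - means[j] for j in range(p)] for row in X]
    slopes = [0.0] * p
    for _ in range(n_iters):
        for j in range(p):
            # partial residual: remove the intercept and every other component
            r = [
                y[i] - intercept - sum(slopes[k] * Xc[i][k] for k in range(p) if k != j)
                for i in range(n)
            ]
            num = sum(Xc[i][j] * r[i] for i in range(n))
            den = sum(Xc[i][j] ** 2 for i in range(n))
            slopes[j] = num / den
    return intercept, means, slopes

def predict(model, x):
    intercept, means, slopes = model
    return intercept + sum(s * (xi - m) for s, xi, m in zip(slopes, x, means))

X = [[0, 1], [1, 0], [2, 2], [3, 1], [4, 3]]
y = [1 + 2 * a - 3 * b for a, b in X]   # noise-free additive target
model = backfit(X, y)
```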
 Demystifying Black-box Models with Symbolic Metamodels
 GAM parameterized with Meijer G-functions (rather than pre-specifying some forms, as is done with symbolic regression)
 Neural Additive Models: Interpretable Machine Learning with Neural Nets  GAM where we learn $f$ with a neural net
 Accuracy, Interpretability, and Differential Privacy via Explainable Boosting (nori, caruana et al. 2021)
symbolic regression
 learn form of the equation using priors on what kinds of things are more difficult
 Logic Regression (ruczinski, kooperberg & leblanc, 2012)  given binary input variables, automatically construct interaction terms and linear model (fit using simulated annealing)
 Building and Evaluating Interpretable Models using Symbolic Regression and Generalized Additive Models
 gams  assume model form is additive combination of some funcs, then solve via GD
 however, if we don’t know the form of the model we must generate it
 Bridging the Gap: Providing Post-Hoc Symbolic Explanations for Sequential Decision-Making Problems with Black Box Simulators
 Model Learning with Personalized Interpretability Estimation (ML-PIE)  use human feedback in the loop to decide which symbolic functions are most interpretable
example-based = case-based (e.g. prototypes, nearest neighbor)
 “this looks like that” prototypes II (chen et al. 2018)
 can have prototypes smaller than original input size
 l2 distance
 require the filters to be identical to the latent representation of some training image patch
 cluster image patches of a particular class around the prototypes of the same class, while separating image patches of different classes
 max-pool class prototypes so spatial size doesn’t matter
 also get heatmap of where prototype was activated (only max really matters)
 train in 3 steps
 train everything: classification + clustering around intra-class prototypes + separation between inter-class prototypes (last layer fixed to 1s / -0.5s)
 project prototypes to data patches
 learn last layer
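A sketch of how a single prototype could be scored against an image's latent patches using the l2 distance and max-pooling noted above. The log-ratio similarity is my recollection of the form used in the paper and should be treated as an assumption, as should the toy two-dimensional patches.

```python
import math

def prototype_activation(patches, prototype, eps=1e-4):
    """Max over patches of log((d + 1) / (d + eps)), where d is the squared L2 distance."""
    def sq_dist(a, b):
        return sum((ai - bi) ** 2 for ai, bi in zip(a, b))

    scores = [
        math.log((sq_dist(p, prototype) + 1) / (sq_dist(p, prototype) + eps))
        for p in patches
    ]
    best = max(range(len(patches)), key=lambda i: scores[i])
    return scores[best], best

patches = [[0.0, 0.0], [1.0, 1.0], [0.9, 1.1]]   # latent vectors, one per spatial location
activation, location = prototype_activation(patches, prototype=[1.0, 1.0])
```

The index of the best-matching patch is what gives the heatmap its peak; only that maximum feeds into the class score.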
 original prototypes paper (li et al. 2017)
 uses encoder/decoder setup
 encourage every prototype to be similar to at least one encoded input
 learned prototypes in fact look like digits
 correct class prototypes go to correct classes
 loss: classification + reconstruction + distance to a training point
 ProtoPShare: Prototype Sharing for Interpretable Image Classification and Similarity Discovery  share some prototypes between classes with data-dependent merge pruning
 merge “similar” prototypes, where similarity is measured as dist of all training patches in repr. space
 Towards Explainable Deep Neural Networks (xDNN) (angelov & soares 2019)  more complex version of using prototypes
 Case-Based Reasoning for Assisting Domain Experts in Processing Fraud Alerts of Black-Box Machine Learning Models
 Explaining Latent Representations with a Corpus of Examples (crabbe, …, van der schaar 2021)  for an individual prediction,
 Which corpus examples explain the prediction issued for a given test example?
 What features of these corpus examples are relevant for the model to relate them to the test example?
 Self-Interpretable Model with Transformation Equivariant Interpretation (wang & wang, 2021)
 generate data-dependent prototypes for each class and formulate the prediction as the inner product between each prototype and the extracted features
 interpretation is hadamard product of prototype and extracted features (prediction is sum of this product)
 interpretations can be easily visualized by upsampling from the prototype space to the input data space
 regularization
 reconstruction regularizer  regularizes the interpretations to be meaningful and comprehensible
 for each image, enforce each prototype to be similar to its corresponding class’s latent repr.
 transformation regularizer  constrains the interpretations to be transformation equivariant
 self-consistency score quantifies the robustness of interpretation by measuring the consistency of interpretations to geometric transformations.
interpretable neural nets
 concepts
 Concept Bottleneck Models (koh et al. 2020)  predict concepts before making final prediction
 Concept Whitening for Interpretable Image Recognition (chen et al. 2020)  force network to separate “concepts” (like in TCAV) along different axes
 Interpretability Beyond Classification Output: Semantic Bottleneck Networks  add an interpretable intermediate bottleneck representation
 How to represent part-whole hierarchies in a neural network (hinton, 2021)
 The idea is simply to use islands of identical vectors to represent the nodes in the parse tree (parse tree would be things like wheel -> cabin -> car)
 each patch / pixel gets representations at different levels (e.g. texture, part of wheel, part of cabin, etc.)
 each repr. is a vector  vector for high-level stuff (e.g. car) will agree for different pixels but low level (e.g. wheel) will differ
 during training, each layer at each location gets information from nearby levels
 hinton assumes weights are shared between locations (maybe don’t need to be)
 also attention mechanism across other locations in same layer
 each location also takes in its positional location (x, y)
 could have the lowest-level repr start w/ a convnet
 iCaps: An Interpretable Classifier via Disentangled Capsule Networks (jung et al. 2020)
 the class capsule also includes classification-irrelevant information
 uses a novel class-supervised disentanglement algorithm
 entities represented by the class capsule overlap
 adds additional regularizer
 localization
 WILDCAT: Weakly Supervised Learning of Deep ConvNets for Image Classification, Pointwise Localization and Segmentation (durand et al. 2017)  constrains architecture
 after extracting conv features, replace linear layers with special pooling layers, which helps with spatial localization
 each class gets a pooling map
 prediction for a class is based on top-k spatial regions for a class
 finally, can combine the predictions for each class
 Approximating CNNs with Bag-of-local-Features models works surprisingly well on ImageNet
 CNN is restricted to look at very local features only and still does well (and produces an in-built saliency measure)
 learn shapes not texture
 code
 Symbolic Semantic Segmentation and Interpretation of COVID-19 Lung Infections in Chest CT volumes based on Emergent Languages (chowdhury et al. 2020)  combine some segmentation with the classifier
 Sparse Epistatic Regularization of Deep Neural Networks for Inferring Fitness Functions (aghazadeh et al. 2020)  directly regularize interactions / high-order freqs in DNNs
 Physics-informed neural networks: A deep learning framework for solving forward and inverse problems involving nonlinear partial differential equations (raissi et al. 2019)  PINN  solve PDEs by constraining neural net to predict specific parameters / derivatives
 MonoNet: Towards Interpretable Models by Learning Monotonic Features  enforce output to be a monotonic function of individual features
 Improved Deep Fuzzy Clustering for Accurate and Interpretable Classifiers  extract features with a DNN then do fuzzy clustering on this
 Towards Robust Interpretability with Self-Explaining Neural Networks (alvarez-melis & jaakkola 2018)  building architectures that explain their predictions
 Harnessing Deep Neural Networks with Logic Rules
connecting dnns with rule-based models
 Distilling a Neural Network Into a Soft Decision Tree (frosst & hinton 2017)  distills DNN into a DNN-like tree in which a sigmoid neuron at each node decides which path to follow
 training on distilled DNN predictions outperforms training on original labels
 to make the decision closer to a hard cut, can multiply by a large scalar before applying sigmoid
 parameters updated with backprop
 regularization to ensure that all paths are taken equally often
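A sketch of the soft routing described above, assuming a hypothetical depth-2 tree with hand-picked weights. Each inner node routes right with probability given by a sigmoid, and multiplying the pre-activation by a large scalar (`sharpness` here) pushes the routing towards a hard cut.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def leaf_probabilities(x, w_root, w_left, w_right, sharpness=1.0):
    """Depth-2 soft tree: each inner node routes right with probability sigmoid(sharpness * w.x)."""
    def go_right(w):
        return sigmoid(sharpness * sum(wi * xi for wi, xi in zip(w, x)))

    r, l, rr = go_right(w_root), go_right(w_left), go_right(w_right)
    # probability of reaching each of the four leaves, left to right
    return [(1 - r) * (1 - l), (1 - r) * l, r * (1 - rr), r * rr]

x = [1.0, -0.5]
weights = dict(w_root=[1.0, 0.0], w_left=[0.0, 1.0], w_right=[0.0, -1.0])
soft = leaf_probabilities(x, **weights)                   # mass spread over several leaves
hard = leaf_probabilities(x, sharpness=50.0, **weights)   # nearly all mass on one leaf
```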
 Neural Random Forests (biau et al. 2018)  convert DNN to RF
 first layer learns a node for each split
 second layer learns a node for each leaf (by only connecting to nodes on leaves in the path)
 finally map each leaf to a value
 relax + retrain
 Deep Neural Decision Forests (2015)
 dnn learns small intermediate representation, which outputs all possible splits in a tree
 these splits are forced into a tree structure and optimized via SGD
 neurons use sigmoid function
 Gradient Boosted Decision Tree Neural Network  build DNN based on decision tree ensemble  basically the same but with gradient-boosted trees
 Neural Decision Trees  treat each neural net like a node in a tree
 Controlling Neural Networks with Rule Representations (seo, …, pfister, 21)
 DEEPCTRL  encodes rules into DNN
 one encoder for rules, one for data
 both are concatenated with stochastic parameter $\alpha$ (which also weights the loss)
 at test-time, can select $\alpha$ to vary the contribution of the rule part (e.g. if rule doesn’t apply to a certain point)
 training
 normalize losses initially to ensure they are on the same scale
 some rules can be made differentiable in a straightforward way: $r(x, \hat y) \leq \tau \to \max (r(x, \hat y) - \tau, 0)$, but can’t do this for everything e.g. decision tree rules
 rule-based loss is defined by looking at predictions for perturbations of the input
 evaluation
 verification ratio
 fraction of samples that satisfy the rule
 see also Lagrangian Duality for Constrained Deep Learning (fioretto et al. 2020)
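A sketch of the pieces described above: the hinge relaxation of a rule $r(x, \hat y) \leq \tau$, a loss weighted by $\alpha$, and the verification ratio. The convex-combination form of the loss and the sample values are assumptions for illustration; the paper's exact formulation may differ.

```python
def rule_violation(r_value, tau):
    """Hinge relaxation of the constraint r(x, y_hat) <= tau."""
    return max(r_value - tau, 0.0)

def combined_loss(task_loss, rule_loss, alpha):
    """Assumed convex combination: alpha weights the rule term, (1 - alpha) the task term."""
    return alpha * rule_loss + (1.0 - alpha) * task_loss

def verification_ratio(r_values, tau):
    """Fraction of samples whose prediction satisfies the rule."""
    return sum(r <= tau for r in r_values) / len(r_values)

r_values = [0.2, 0.9, 0.5, 1.4]   # hypothetical r(x, y_hat) for four samples
tau = 0.5
penalties = [rule_violation(r, tau) for r in r_values]
ratio = verification_ratio(r_values, tau)
```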
 RRL: A Scalable Classifier for Interpretable Rule-Based Representation Learning (wang et al. 2020)
 Rule-based Representation Learner (RRL)  automatically learns interpretable non-fuzzy rules for data representation
 project the RRL to a continuous space and propose a novel training method, called Gradient Grafting, that can directly optimize the discrete model using gradient descent
 Differentiable Pattern Set Mining (fischer & vreeken, 2021)
 use neural autoencoder with binary activations + binarizing weights
 optimizing a data-sparsity-aware reconstruction loss, continuous versions of the weights are learned in small, noisy steps
misc models
 learning AND-OR Templates for Object Recognition and Detection (zhu_13)
 ross et al.  constraining model during training
 scat transform idea (mallat_16 rvw, oyallan_17)
 force interpretable description by piping through something interpretable (ex. tenenbaum scene derendering)
 learn concepts through probabilistic program induction
 force biophysically plausible learning rules
 The Convolutional Tsetlin Machine  uses easy-to-interpret conjunctive clauses
 Beyond Sparsity: Tree Regularization of Deep Models for Interpretability
 regularize so that deep model can be closely modeled by tree w/ few nodes
 Tensor networks  like DNN that only takes boolean inputs and deals with interactions explicitly
 widely used in physics
 Coefficient tree regression: fast, accurate and interpretable predictive modeling (surer, apley, & malthouse, 2021)  iteratively group linear terms with similar coefficients into a bigger term
bayesian models
post-hoc interpretability (i.e. how can we interpret a fitted model)
Note that in this section we also include importances that work directly on the data (e.g. we do not first fit a model, rather we do nonparametric calculations of importance)
programs
 program synthesis  automatically find a program in an underlying programming language that satisfies some user intent
 ex. program induction  given a dataset consisting of input/output pairs, generate a (simple?) program that produces the same pairs
 probabilistic programming  specify graphical models via a programming language
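A toy sketch of program induction as described above: enumerate short programs over a small set of primitives and return the first one that reproduces every input/output pair. The three primitives and the brute-force search are illustrative assumptions; practical systems use far richer languages and guided search.

```python
from itertools import product

# hypothetical primitive operations of the underlying language
PRIMITIVES = {
    "inc": lambda v: v + 1,
    "double": lambda v: v * 2,
    "square": lambda v: v * v,
}

def run(program, value):
    """Apply the primitives named in `program` from left to right."""
    for name in program:
        value = PRIMITIVES[name](value)
    return value

def induce(examples, max_length=3):
    """Return the shortest primitive sequence consistent with all input/output pairs."""
    for length in range(1, max_length + 1):
        for program in product(PRIMITIVES, repeat=length):
            if all(run(program, x) == y for x, y in examples):
                return program
    return None

program = induce([(1, 4), (2, 6), (3, 8)])
```

Searching shorter programs first is one simple way to express the preference for a "simple" program.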
model-agnostic
 local surrogate (LIME)  fit a simple model locally to one point and interpret that
 select data perturbations and get new predictions
 for tabular data, this is just varying the values around the prediction
 for images, this is turning superpixels on/off
 superpixels determined in unsupervised way
 weight the new samples based on their proximity
 train a kernel-weighted, interpretable model on these points
 LEMNA  like lime but uses lasso + small changes
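The local-surrogate steps above (perturb, weight by proximity, fit a kernel-weighted simple model) can be sketched in one dimension. This is not the LIME library's API: it assumes a scalar input, a Gaussian proximity kernel, and a deterministic grid of perturbations in place of random sampling so the result is reproducible.

```python
import math

def local_linear_surrogate(f, x0, width=0.5, n=41, kernel_width=0.25):
    """Perturb x0, weight samples by proximity, and fit a weighted least-squares line."""
    xs = [x0 - width + 2 * width * i / (n - 1) for i in range(n)]   # perturbations around x0
    ys = [f(x) for x in xs]                                        # black-box predictions
    ws = [math.exp(-((x - x0) ** 2) / kernel_width ** 2) for x in xs]
    w_sum = sum(ws)
    x_bar = sum(w * x for w, x in zip(ws, xs)) / w_sum
    y_bar = sum(w * y for w, y in zip(ws, ys)) / w_sum
    slope = (
        sum(w * (x - x_bar) * (y - y_bar) for w, x, y in zip(ws, xs, ys))
        / sum(w * (x - x_bar) ** 2 for w, x in zip(ws, xs))
    )
    return slope, y_bar - slope * x_bar

# black box f(x) = x^2; near x0 = 1 the surrogate slope approximates the local derivative
slope, intercept = local_linear_surrogate(lambda x: x ** 2, x0=1.0)
```

The surrogate's coefficient is the explanation: it describes how the black box behaves near `x0`, not globally.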
 anchors (ribeiro et al. 2018)  find biggest square region of input space that contains input and preserves same output (with high precision)
 does this search via iterative rules
 What made you do this? Understanding black-box decisions with sufficient input subsets
 want to find smallest subsets of features which can produce the prediction
 other features are masked or imputed
 VIN (hooker 04)  variable interaction networks  global explanation based on detecting additive structure in a black box, based on ANOVA
 local-gradient (baehrens et al. 2010)  direction of highest slope towards a particular class / other class
 golden eye (henelius et al. 2014)  randomize different groups of features and search for groups which interact
 shapley value  average marginal contribution of a feature value across all possible sets of feature values
 “how much does prediction change on average when this feature is added?”
 tells us the difference between the actual prediction and the average prediction
 estimating: all possible sets of feature values have to be evaluated with and without the jth feature
 this includes sets of different sizes
 to evaluate, take expectation over all the other variables, fixing this variable’s value
 shapley sampling value  sample instead of exactly computing
 quantitative input influence is similar to this…
 satisfies 3 properties
 local accuracy  basically, explanation scores sum to original prediction
 missingness  features with $x’_i=0$ have 0 impact
 consistency  if a model changes so that some simplified input’s contribution increases or stays the same regardless of the other inputs, that input’s attribution should not decrease.
 interpretation: Given the current set of feature values, the contribution of a feature value to the difference between the actual prediction and the mean prediction is the estimated Shapley value
 recalculate via sampling other features in expectation
 followup propagating shapley values (chen, lundberg, & lee 2019)  can work with stacks of different models
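The "average marginal contribution across all possible sets" definition can be made concrete with a brute-force sketch. One loud assumption: instead of taking an expectation over the absent features, this version replaces them with a single fixed `baseline` point, which keeps the example short but is only one of several possible value functions. The names `shapley_values`, `model`, and `baseline` are illustrative.

```python
from itertools import combinations
from math import factorial


def shapley_values(f, x, baseline):
    """Exact Shapley values of f at x, with absent features set to baseline."""
    d = len(x)

    def value(subset):
        # features in `subset` take their actual value, the rest the baseline
        z = [x[i] if i in subset else baseline[i] for i in range(d)]
        return f(z)

    phi = [0.0] * d
    for j in range(d):
        others = [i for i in range(d) if i != j]
        for size in range(d):
            # weight of a coalition of this size in the Shapley average
            weight = factorial(size) * factorial(d - size - 1) / factorial(d)
            for subset in combinations(others, size):
                s = set(subset)
                phi[j] += weight * (value(s | {j}) - value(s))
    return phi


def model(z):
    # hypothetical model: one additive term and one pairwise interaction
    return 2.0 * z[0] + z[1] * z[2]


x = [1.0, 1.0, 1.0]
baseline = [0.0, 0.0, 0.0]
phi = shapley_values(model, x, baseline)
```

Here the additive feature receives its full effect and the two interacting features split the interaction equally, and the values sum to `model(x) - model(baseline)` (local accuracy). The loop visits every subset, so it is only feasible for a handful of features, which is why sampling-based estimates are used in practice.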
 probes  check if a representation (e.g. BERT embeddings) learned a certain property (e.g. POS tagging) by seeing if we can predict this property (maybe linearly) directly from the representation
 problem: if the posthoc probe is a complex model (e.g. MLP), it can accurately predict a property even if that property isn’t really contained in the representation
 potential solution: benchmark against control tasks, where we construct a new random task to predict given a representation, and see how well the posthoc probe can do on that task
 Explaining individual predictions when features are dependent: More accurate approximations to Shapley values (aas et al. 2019)  tries to more accurately compute conditional expectation
 Feature relevance quantification in explainable AI: A causal problem (janzing et al. 2019)  argues we should just use unconditional expectation
 quantitative input influence  similar to shap but more general
 permutation importance  increase in the prediction error after we permuted the feature’s values
 $\mathbb E[Y] - \mathbb E[Y\vert X_{\sim i}]$
 If features are correlated, the permutation feature importance can be biased by unrealistic data instances (PDP problem)
 not the same as model variance
 Adding a correlated feature can decrease the importance of the associated feature
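A minimal sketch of permutation importance, assuming squared error as the loss and a single shuffle per feature (in practice one would average over several shuffles). The toy `model` and the helper names are hypothetical.

```python
import random


def mse(model, X, y):
    return sum((model(row) - yi) ** 2 for row, yi in zip(X, y)) / len(y)


def permutation_importance(model, X, y, feature, seed=0):
    """Increase in MSE after shuffling one feature column."""
    rng = random.Random(seed)
    column = [row[feature] for row in X]
    rng.shuffle(column)
    X_perm = [row[:feature] + [v] + row[feature + 1:] for row, v in zip(X, column)]
    return mse(model, X_perm, y) - mse(model, X, y)


def model(row):
    return 3.0 * row[0]  # ignores feature 1


rng = random.Random(1)
X = [[rng.uniform(-1, 1), rng.uniform(-1, 1)] for _ in range(200)]
y = [model(row) for row in X]

importance = [permutation_importance(model, X, y, j) for j in range(2)]
```

Shuffling the feature the model ignores leaves the error unchanged, so its importance is exactly zero, while shuffling the used feature raises the error.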
 L2X: information-theoretical local approximation (chen et al. 2018)  locally assign feature importance based on mutual information with function
 Learning Explainable Models Using Attribution Priors + Expected Gradients  like doing integrated gradients in many directions (e.g. by using other points in the training batch as the baseline)
 can use this prior to help improve performance
 Variable Importance Clouds: A Way to Explore Variable Importance for the Set of Good Models
 All Models are Wrong, but Many are Useful: Learning a Variable’s Importance by Studying an Entire Class of Prediction Models Simultaneously (fisher, rudin, & dominici 2018)
 Interpreting Black Box Models via Hypothesis Testing
feature interactions
How interactions are defined and summarized is a very difficult thing to specify. For example, interactions can change based on monotonic transformations of features (e.g. $y= a \cdot b$, $\log y = \log a + \log b$). Nevertheless, when one has a specific question it can make sense to pursue finding and understanding interactions.
 build-up = context-free, less faithful: score is contribution of only variable of interest ignoring other variables
 break-down = occlusion = context-dependent, more faithful: score is contribution of variable of interest given all other variables (e.g. permutation test  randomize var of interest from right distr.)
 H-statistic: 0 for no interaction, 1 for complete interaction
 how much of the variance of the output of the joint partial dependence is explained by the interaction instead of the individuals
 $H^2_{jk} = \underbrace{\sum_i [\overbrace{PD(x_j^{(i)}, x_k^{(i)})}^{\text{interaction}} - \overbrace{PD(x_j^{(i)}) - PD(x_k^{(i)})}^{\text{individual}}]^2}_{\text{sum over data points}} \: / \: \underbrace{\sum_i [PD(x_j^{(i)}, x_k^{(i)})]^2}_{\text{normalization}}$
 alternatively, using ANOVA decomp: $H_{jk}^2 = \sum_i g_{ij}^2 / \sum_i (\mathbb E [Y \vert X_i, X_j])^2$
 same assumptions as PDP: features need to be independent
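The H-statistic can be sketched directly from partial dependence functions. One assumption beyond the formula as written: each partial dependence function is centered to mean zero before comparison (as in Friedman's definition), otherwise a purely additive model would not score 0. The function names and the toy grid are illustrative.

```python
def partial_dependence(f, X, features, values):
    """Average prediction when `features` are forced to `values` for every row."""
    total = 0.0
    for row in X:
        z = list(row)
        for j, v in zip(features, values):
            z[j] = v
        total += f(z)
    return total / len(X)


def h_statistic_squared(f, X, j, k):
    """H^2 for the pair (j, k), using centered partial dependence."""
    pd_jk = [partial_dependence(f, X, (j, k), (r[j], r[k])) for r in X]
    pd_j = [partial_dependence(f, X, (j,), (r[j],)) for r in X]
    pd_k = [partial_dependence(f, X, (k,), (r[k],)) for r in X]

    def center(v):
        m = sum(v) / len(v)
        return [a - m for a in v]

    pd_jk, pd_j, pd_k = center(pd_jk), center(pd_j), center(pd_k)
    num = sum((a - b - c) ** 2 for a, b, c in zip(pd_jk, pd_j, pd_k))
    den = sum(a ** 2 for a in pd_jk)
    return num / den


# all combinations of two features on a small symmetric grid
X = [[a, b] for a in (-1.0, 0.0, 1.0) for b in (-1.0, 0.0, 1.0)]

h2_additive = h_statistic_squared(lambda z: z[0] + z[1], X, 0, 1)
h2_product = h_statistic_squared(lambda z: z[0] * z[1], X, 0, 1)
```

On this grid the additive model gives 0 and the pure product gives 1, matching the "0 for no interaction, 1 for complete interaction" reading. Each call costs on the order of n² model evaluations, which is why the statistic is usually estimated on a subsample.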
 alternatives
 variable interaction networks (Hooker, 2004)  decompose pred into main effects + feature interactions
 PDP-based feature interaction (greenwell et al. 2018)
 feature-screening (feng ruan’s work)
 want to find beta which is positive when a variable is important
 idea: maximize difference between (distances for interclass) and (distances for intraclass)
 using an L1 distance yields better gradients than an L2 distance
 ANOVA  factorial method to detect feature interactions based on differences among group means in a dataset
 Automatic Interaction Detection (AID)  detects interactions by subdividing data into disjoint exhaustive subsets to model an outcome based on categorical features
 Shapley Taylor Interaction Index (STI) (Dhamdhere et al., 2019)  extends shap to all interactions
 retraining
 Additive groves (Sorokina et al. 2008) propose fitting a random forest with and without an interaction (forcibly removed) to detect feature interactions  very slow
 gradient-based methods (originally Friedman and Popescu, 2008, then later used with many models such as logit)
 test if partial derivatives for some subset (e.g. $x_1, …, x_p$) are nonzero: $\mathbb{E}_{\mathbf{x}}\left[\frac{\partial^p f(\mathbf{x})}{\partial x_{i_{1}} \partial x_{i_{2}} \ldots \partial x_{i_p}}\right]^{2}>0$
 doesn’t work well for piecewise functions (e.g. ReLU) and computationally expensive
 include interactions explicitly then run lasso (e.g. bien et al. 2013)
 methods for finding frequent item sets
vim (variable importance measure) framework
 VIM
 a quantitative indicator that quantifies the change of model output value w.r.t. the change or permutation of one or a set of input variables
 an indicator that quantifies the contribution of the uncertainties of one or a set of input variables to the uncertainty of model output variable
 an indicator that quantifies the strength of dependence between the model output variable and one or a set of input variables.
 difference-based  deriv-based methods, local importance measure, morris’ screening method
 LIM (local importance measure)  like LIME
 can normalize weights by values of x, y, or ratios of their standard deviations
 can also decompose variance to get the covariances between different variables
 can approximate derivative via adjoint method or smth else
 morris’ screening method
 take a grid of local derivs and look at the mean / std of these derivs
 can’t distinguish between nonlinearity / interaction
 using the squared derivative allows for a close connection w/ sobol’s total effect index
 can extend this to taking derivs wrt different combinations of variables
 parametric regression
 correlation coefficient, linear reg coefficients
 partial correlation coefficient (PCC)  wipe out correlations due to other variables
 do a linear regression using the other variables (on both X and Y) and then look only at the residuals
 rank regression coefficient  better at capturing nonlinearity
 could also do polynomial regression
 more techniques (e.g. relative importance analysis RIA)
 nonparametric regression
 use something like LOESS, GAM, projection pursuit
 rank variables by doing greedy search (add one var at a time) and seeing which explains the most variance
 hypothesis test
 grid-based hypothesis tests: splitting the sample space (X, Y) into grids and then testing whether the patterns of sample distributions across different grid cells are random
 ex. see if means vary
 ex. look at entropy reduction
 other hypothesis tests include the squared rank difference, 2D kolmogorov-smirnov test, and distance-based tests
 variance-based vim (sobol’s indices)
 ANOVA decomposition  decompose model into conditional expectations $Y = g_0 + \sum_i g_i (X_i) + \sum_i \sum_{j > i} g_{ij} (X_i, X_j) + \dots + g_{1,2,…, p}$
 $g_0 = \mathbf E (Y)$, $g_i = \mathbf E(Y \vert X_i) - g_0$, $g_{ij} = \mathbf E (Y \vert X_i, X_j) - g_i - g_j - g_0$, …
 take variances of these terms
 if there are correlations between variables some of these terms can misbehave
 note: $V(Y) = \sum_i V (g_i) + \sum_i \sum_{j > i} V(g_{ij}) + \dots + V(g_{1,2,…,p})$  the terms are orthogonal, so their variances sum to the total variance
 anova decomposition basics  factor function into means, first-order terms, and interaction terms
 $S_i$: Sobol’s main effect index: $=V(g_i)=V(E(Y \vert X_i))=V(Y) - E(V(Y \vert X_i))$
 small value indicates $X_i$ is noninfluential
 usually used to select important variables
 $S_{Ti}$: Sobol’s total effect index  include all terms (even interactions) involving a variable
 equivalently, $V(Y) - V(E[Y \vert X_{\sim i}])$
 usually used to screen unimportant variables
 it is common to normalize these indices by the total variance $V(Y)$
 three methods for computation  Fourier amplitude sensitivity test, metamodel, MCMC
 when features are correlated, these can be strange (often inflating the main effects)
 can consider $X_i^{\text{Correlated}} = E(X_i \vert X_{\sim i})$ and $X_i^{\text{Uncorrelated}} = X_i - X_i^{\text{Correlated}}$
 this can help us understand the contributions that come from different features, as well as the correlations between features (e.g. $S_i^{\text{Uncorrelated}} = V(E[Y \vert X_i^{\text{Uncorrelated}}])/V(Y)$)
 sobol indices connected to shapley value
 $SHAP_i = \underset{S, i \in S}{\sum} V(g_S) / \vert S \vert$
 efficiently compute SHAP values directly from data (williamson & feng, 2020 icml)
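The main and total effect indices can be checked on a tiny example by exact enumeration. This sketch assumes independent inputs that are each uniform over a short list of levels (so expectations are plain averages over a grid) and reports indices normalized by $V(Y)$; the function names are illustrative, and real applications would use a sampling scheme instead.

```python
from itertools import product


def mean(values):
    return sum(values) / len(values)


def variance(values):
    m = mean(values)
    return mean([(v - m) ** 2 for v in values])


def sobol_indices(f, levels):
    """Normalized main and total effect indices for independent inputs,
    each uniform over its list of levels (exact enumeration)."""
    grid = list(product(*levels))
    total_var = variance([f(x) for x in grid])
    main, total = [], []
    for i, lv in enumerate(levels):
        # V(E[Y | X_i]): average over the other inputs, then take the variance
        cond_on_i = [mean([f(x) for x in grid if x[i] == v]) for v in lv]
        main.append(variance(cond_on_i) / total_var)
        # V(Y) - V(E[Y | X_~i]): condition on every input except X_i
        rest = {x[:i] + x[i + 1:] for x in grid}
        cond_on_rest = [mean([f(x) for x in grid if x[:i] + x[i + 1:] == r])
                        for r in rest]
        total.append((total_var - variance(cond_on_rest)) / total_var)
    return main, total


def model(x):
    # hypothetical model: main effect of x0 plus an x0-x1 interaction
    return x[0] + 2.0 * x[0] * x[1]


main, total = sobol_indices(model, [[-1.0, 1.0], [-1.0, 1.0]])
```

In this example the second input has no main effect, yet its total effect index is large because it acts only through the interaction, which illustrates why total indices are the ones used for screening out unimportant variables.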
 moment-independent vim
 want more than just the variance of the output variables
 e.g. delta index = average dist. between $f_Y(y)$ and $f_{Y \vert X_i}(y)$ when $X_i$ is fixed over its full distr.
 $\delta_i = \frac 1 2 \mathbb E \int \vert f_Y(y) - f_{Y\vert X_i} (y) \vert dy = \frac 1 2 \int \int \vert f_{Y, X_i}(y, x_i) - f_Y(y) f_{X_i}(x_i) \vert dy \,dx_i$
 moment-independent because it depends on the density, not just any moment (like a measure of dependence between $y$ and $X_i$)
 can also look at KL, max dist..
 graphic vim  like curves
 e.g. scatter plot, metamodel plot, regional VIMs, parametric VIMs
 CSM  relative change of model output mean when range of $X_i$ is reduced to any subregion
 CSV  same thing for variance
 A Simple and Effective Model-Based Variable Importance Measure
 measures the feature importance (defined as the variance of the 1D partial dependence function) of one feature conditional on different, fixed points of the other feature. When the variance is high, then the features interact with each other, if it is zero, they don’t interact.
importance curves
 pdp plots  marginals (force value of plotted var to be what you want it to be)
 separate into ice plots  marginals for each instance
 average of ice plots = pdp plot
 sometimes these are centered, sometimes look at derivative
 both pdp and ice suffer from many points possibly not being real
 totalvis: A Principal Components Approach to Visualizing Total Effects in Black Box Models  visualize pdp plots along PC directions
 possible solution: Marginal plots (M-plots) (bad name  uses conditional, not marginal)
 only use points conditioned on certain variable
 problem: this bakes things in (e.g. if two features are correlated and only one important, will say both are important)
 ALE plots  take points conditioned on value of interest, then look at differences in predictions around a window
 this gives pure effect of that var and not the others
 needs an order (i.e. might not work for categorical)
 doesn’t give you individual curves
 recommended very highly by the book…
 they integrate as you go…
 summary: To summarize how each type of plot (PDP, M, ALE) calculates the effect of a feature at a certain grid value v:
 Partial Dependence Plots: “Let me show you what the model predicts on average when each data instance has the value v for that feature. I ignore whether the value v makes sense for all data instances.”
 M-Plots: “Let me show you what the model predicts on average for data instances that have values close to v for that feature. The effect could be due to that feature, but also due to correlated features.”
 ALE plots: “Let me show you how the model predictions change in a small “window” of the feature around v for data instances in that window.”
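The relationship between ICE and PDP curves described above (force the plotted variable to a grid value, then average over instances) can be sketched as follows. The two-row dataset and the toy `model` are hypothetical and chosen so the individual curves differ.

```python
def ice_curves(f, X, feature, grid):
    """One curve per instance: predictions as `feature` sweeps over `grid`."""
    curves = []
    for row in X:
        curve = []
        for v in grid:
            z = list(row)
            z[feature] = v  # force the plotted variable to the grid value
            curve.append(f(z))
        curves.append(curve)
    return curves


def pdp(f, X, feature, grid):
    """Partial dependence = pointwise average of the ICE curves."""
    curves = ice_curves(f, X, feature, grid)
    return [sum(c[g] for c in curves) / len(curves) for g in range(len(grid))]


def model(z):
    return z[0] * z[1]  # effect of feature 0 flips sign with feature 1


X = [[0.5, -1.0], [0.2, 1.0]]
grid = [0.0, 1.0, 2.0]
curves = ice_curves(model, X, 0, grid)
average = pdp(model, X, 0, grid)
```

In this toy case the two ICE curves have opposite slopes and the PDP is flat, a small reminder that the average can hide instance-level effects.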
example-based explanations
 influential instances  want to find important data points
 deletion diagnostics  delete a point and see how much it changed
 influence funcs (koh & liang, 2017): use Hessian (of size $\theta \times \theta$) to give effect of upweighting a point
 influence functions = infinitesimal approach  upweight one person by infinitesimally small weight and see how much estimate changes (e.g. calculate first derivative)
 influential instance  when data point removed, has a strong effect on the model (not necessarily same as an outlier)
 requires access to gradient (e.g. nn, logistic regression)
 take single step with Newton’s method after upweighting loss
 yield change in parameters by removing one point
 yield change in loss at one point by removing a different point (by multiplying above by chain rule)
 yield change in parameters by modifying one point
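Deletion diagnostics are the brute-force counterpart of influence functions: refit without each point and record how much the estimate moves. A minimal sketch for a one-dimensional least-squares slope, with made-up data and illustrative function names:

```python
def fit_slope(points):
    """Least-squares slope of y on x (with intercept)."""
    n = len(points)
    mx = sum(x for x, _ in points) / n
    my = sum(y for _, y in points) / n
    sxy = sum((x - mx) * (y - my) for x, y in points)
    sxx = sum((x - mx) ** 2 for x, _ in points)
    return sxy / sxx


def deletion_influence(points):
    """Change in the fitted slope when each point is left out in turn."""
    full = fit_slope(points)
    return [fit_slope(points[:i] + points[i + 1:]) - full
            for i in range(len(points))]


# points on the line y = 2x, plus one high-leverage point far off the line
data = [(0.0, 0.0), (1.0, 2.0), (2.0, 4.0), (3.0, 6.0), (10.0, 0.0)]
influence = deletion_influence(data)
most_influential = max(range(len(data)), key=lambda i: abs(influence[i]))
```

The off-line point dominates the fit, so removing it changes the slope far more than removing any other point. Refitting once per point is what influence functions avoid by approximating the same quantity from gradients and the Hessian.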
tree ensembles
 mean decrease impurity = MDI = Gini importance
 Breiman proposes permutation tests = MDA: Breiman, Leo. 2001. “Random Forests.” Machine Learning 45 (1). Springer: 5–32
 conditional variable importance for random forests (strobl et al. 2008)
 propose permuting conditioned on the values of variables not being permuted
 to find region in which to permute, define the grid within which the values of $X_j$ are permuted for each tree by means of the partition of the feature space induced by that tree
 many scores (such as MDI, MDA) measure marginal importance, not conditional importance
 as a result, correlated variables get importances which are too high
 Extracting Optimal Explanations for Ensemble Trees via Logical Reasoning (zhang et al. ‘21)  OptExplain  extracts global explanation of tree ensembles using logical reasoning, sampling, + optimization
 treeshap (lundberg, erion & lee, 2019): prediction-level
 individual feature attribution: want to decompose prediction into sum of attributions for each feature
 each thing can depend on all features
 Saabas method: basic thing for tree
 you get a pred at end
 count up change in value at each split for each variable
 three properties
 local acc  decomposition is exact
 missingness  features that are already missing are attributed no importance
 for missing feature, just (weighted) average nodes from each split
 consistency  if F(X) relies more on a certain feature j, $F_j(x)$ should not decrease
 however the Saabas method doesn’t change $F_j(X)$ for $F’(x) = F(x) + x_j$
 these 3 things imply we want shap values
 average increase in func value when selecting i (given all subsets of other features)
 for binary features with totally random splits, same as Saabas
 can cluster based on explanation similarity (fig 4)
 can quantitatively evaluate based on clustering of explanations
 their fig 8  qualitatively can see how different features alter output
 gini importance is like weighting all of the orderings
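The Saabas method mentioned above ("count up change in value at each split for each variable") can be sketched on a single hand-built tree. The nested-dict tree format, the node values, and the function name are all hypothetical; in a fitted tree, each node's value would be the mean training response of the samples reaching it.

```python
def saabas_contributions(tree, x, n_features):
    """Walk the decision path and credit each change in node value
    to the feature that was split on."""
    contributions = [0.0] * n_features
    node = tree
    bias = node["value"]  # mean prediction at the root
    while "feature" in node:
        branch = "left" if x[node["feature"]] <= node["threshold"] else "right"
        child = node[branch]
        contributions[node["feature"]] += child["value"] - node["value"]
        node = child
    return bias, contributions, node["value"]


# hypothetical hand-built tree; "value" is the mean training response in a node
tree = {
    "value": 10.0, "feature": 0, "threshold": 0.5,
    "left": {"value": 6.0},
    "right": {
        "value": 14.0, "feature": 1, "threshold": 0.5,
        "left": {"value": 12.0},
        "right": {"value": 16.0},
    },
}

bias, contributions, prediction = saabas_contributions(tree, [1.0, 1.0], 2)
```

The bias plus the contributions reproduces the prediction exactly (local accuracy), but the credit a feature gets depends on where it appears in the tree, which is the ordering sensitivity that the consistency property is meant to rule out.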
 Explainable AI for Trees: From Local Explanations to Global Understanding (lundberg et al. 2019)
 shap-interaction scores  distribute among pairwise interactions + local effects
 plot lots of local interactions together  helps detect trends
 propose doing shap directly on loss function (identify how features contribute to loss instead of prediction)
 can run supervised clustering (where SHAP score is the label) to get meaningful clusters
 alternatively, could do smth like CCA on the model output
 understanding variable importances in forests of randomized trees (louppe et al. 2013)
 consider fully randomized trees
 assume all categorical
 randomly pick feature at each depth, split on all possibilities
 also studied by biau 2012
 extreme case of random forest w/ binary vars?
 real trees are harder: correlated vars and stuff mask results of other vars lower down
 asymptotically, randomized trees might actually be better
 Actionable Interpretability through Optimizable Counterfactual Explanations for Tree Ensembles (lucic et al. 2019)
 iterative random forest (basu et al. 2018)
 fit RF and get MDI importances
 iteratively refit RF, weighting probability of feature being selected by its previous MDI
 find interactions as features which co-occur on paths (using RIT algorithm)
neural nets (dnns)
dnn visualization
 good summary on distill
 visualize intermediate features
 visualize filters by layer  doesn’t really work past layer 1
 decoded filter  rafegas & vanrell 2016  project filter weights into the image space  pooling layers make this harder
 deep visualization  yosinski 15
 Understanding Deep Image Representations by Inverting Them (mahendran & vedaldi 2014)  generate image given representation
 pruning for identifying critical data routing paths  prune net (while preserving prediction) to identify neurons which result in critical paths
 penalizing activations
 interpretable cnns (zhang et al. 2018)  penalize activations to make filters slightly more interpretable
 could also just use specific filters for specific classes…
 teaching compositionality to cnns  mask features by objects
 approaches based on maximal activation
 images that maximally activate a feature
 deconv nets  Zeiler & Fergus (2014) use deconvnets (zeiler et al. 2011) to map features back to pixel space
 given one image, get the activations (e.g. maxpool indices) and use these to get back to pixel space
 everything else does not depend on the original image
 might want to use optimization to generate image that makes optimal feature instead of picking from training set  before this, erhan et al. did this for unsupervised features  dosovitskiy et al 16  train generative deconv net to create images from neuron activations  aubry & russell 15 do similar thing
 deep dream  reconstruct image from feature map
 could use natural image prior
 could train deconvolutional NN
 also called deep neuronal tuning  GD to find image that optimally excites filters
 neuron feature  weighted average version of a set of maximum activation images that capture essential properties  rafegas_17
 can also define color selectivity index  angle between first PC of color distribution of NF and intensity axis of opponent color space
 class selectivity index  derived from classes of images that make NF
 saliency maps for each image / class
 simonyan et al 2014
 Diagnostic Visualization for Deep Neural Networks Using Stochastic Gradient Langevin Dynamics  sample deep dream images generated by gan
 Zoom In: An Introduction to Circuits (olah et al. 2020)
 study of inceptionV1 (GoogLeNet)
 some interesting neuron clusters: curve detectors, high-low freq detectors (useful for finding background)
 an overview of early vision (olah et al. 2020)
 many groups
 conv2d0: gabor, color-contrast, other
 conv2d1: low-freq, gabor-like, color contrast, multicolor, complex gabor, color, hatch, other
 conv2d2: color contrast, line, shifted line, textures, other, color center-surround, tiny curves, etc.
 curve-detectors (cammarata et al. 2020)
 curve-circuits (cammarata et al. 2021)
 engineering curve circuit from scratch
 post-hoc prototypes
 counterfactual explanations  like adversarial, counterfactual explanation describes smallest change to feature vals that changes the prediction to a predefined output
 maybe change fewest number of variables not their values
 counterfactual should be reasonable (have likely feature values)
 human-friendly
 usually multiple possible counterfactuals (Rashomon effect)
 can use optimization to generate counterfactual
 anchors  opposite of counterfactuals, once we have these other things won’t change the prediction
 prototypes (assumed to be data instances)
 prototype = data instance that is representative of lots of points
 criticism = data instance that is not well represented by the set of prototypes
 examples: k-medoids or MMD-critic
 selects prototypes that minimize the discrepancy between the data + prototype distributions
 Architecture Disentanglement for Deep Neural Networks (hu et al. 2021)  “NAD learns to disentangle a pretrained DNN into subarchitectures according to independent tasks”
 Explaining Deep Learning Models with Constrained Adversarial Examples
 Understanding Deep Architectures by Visual Summaries
 Semantics for Global and Local Interpretation of Deep Neural Networks
 Iterative augmentation of visual evidence for weakly-supervised lesion localization in deep interpretability frameworks
 explaining image classifiers by counterfactual generation
 generate changes (e.g. with GAN infilling) and see if pred actually changes
 can search for smallest sufficient region and smallest destructive region
dnn concept-based explanations
 concept activation vectors
 Given: a user-defined set of examples for a concept (e.g., ‘striped’), and random examples, labeled training-data examples for the studied class (zebras)
 given trained network
 TCAV can quantify the model’s sensitivity to the concept for that class. CAVs are learned by training a linear classifier to distinguish between the activations produced by a concept’s examples and examples in any layer
 CAV  vector orthogonal to the classification boundary
 TCAV uses the derivative of the CAV direction wrt input
 automated concept activation vectors  (a) Given a set of concept discovery images, each image is segmented with different resolutions to find concepts that are captured best at different sizes. (b) After removing duplicate segments, each segment is resized to the original input size resulting in a pool of resized segments of the discovery images. (c) Resized segments are mapped to a model’s activation space at a bottleneck layer. To discover the concepts associated with the target class, clustering with outlier removal is performed. (d) The output of the method is a set of discovered concepts for each class, sorted by their importance in prediction
 On Completeness-aware Concept-Based Explanations in Deep Neural Networks
 Interpretable Basis Decomposition for Visual Explanation (zhou et al. 2018)  decompose activations of the input image into semantically interpretable components pretrained from a large concept corpus
dnn causal-motivated attribution
 Explaining The Behavior Of Black-Box Prediction Algorithms With Causal Learning  specify some interpretable features and learn a causal graph of how the classifier uses these features (sani et al. 2021)
 partial ancestral graph (PAG) (zhang 08) is a graphical representation which includes
 directed edges (X $\to$ Y means X is a causal ancestor of Y)
 bidirected edges (X $\leftrightarrow$ Y means X and Y are both caused by some unmeasured common factor(s), e.g., X ← U → Y )
 partially directed edges (X $\circ \to$ Y or X $\circ - \circ$ Y ) where the circle marks indicate ambiguity about whether the endpoints are arrows or tails
 PAGs may also include additional edge types to represent selection bias
 given a model’s predictions $\hat Y$ and some potential causes $Z$, learn a PAG among them all
 assume $\hat Y$ is a causal non-ancestor of $Z$ (there is no directed path from $\hat Y$ into any element of $Z$)
 search for a PAG and not DAG bc $Z$ might not include all possibly relevant variables
 Neural Network Attributions: A Causal Perspective (Chattopadhyay et al. 2019)
 the neural network architecture is viewed as a Structural Causal Model, and a methodology to compute the causal effect of each feature on the output is presented
 CXPlain: Causal Explanations for Model Interpretation under Uncertainty (schwab & karlen, 2019)
 model-agnostic  efficiently query model to figure out which inputs are most important
 pixel-level attributions
 Amnesic Probing: Behavioral Explanation with Amnesic Counterfactuals (elazar…goldberg, 2020)
 instead of simple probing, generate counterfactuals in representations and see how final prediction changes
 remove a property (e.g. part of speech) from the repr. at a layer using Iterative Nullspace Projection (INLP) (Ravfogel et al., 2020)
 iteratively tries to predict the property linearly, then removes these directions
 Bayesian Interpolants as Explanations for Neural Inferences (mcmillan 20)
 if $A \implies B$, interpolant $I$ satisfies $A\implies I$, $I \implies B$ and $I$ expressed only using variables common to $A$ and $B$
 here, $A$ is model input, $B$ is prediction, $I$ is activation of some hidden layer
 Bayesian interpolants show $P(B \vert A) \geq \alpha^2$ when $P(I \vert A) \geq \alpha$ and $P(B \vert I) \geq \alpha$
dnn feature importance
 saliency maps
 occluding parts of the image  sweep over image and remove patches  which patch removals had highest impact on change in class?
 text usually uses attention maps
 ex. karpathy et al LSTMs
 ex. lei et al.  most relevant sentences in sentiment prediction
 attention is not explanation (jain & wallace, 2019)
 attention is not not explanation (wiegreffe & pinter, 2019)
 influence = pred with a word - pred with a word masked
 attention corresponds to this kind of influence
 deceptive attention  we can successfully train a model to make similar predictions but have different attention
 class-activation map (CAM) (zhou et al. 2016)
 sum the activations across channels (weighted by their weight for a particular class)
 weirdness: drop negative activations (can be okay if using relu), normalize to 0-1 range
 CALM (kim et al. 2021)  fix issues with normalization before by introducing a latent variable on the activations
 RISE (Petsiuk et al. 2018)  randomized input sampling
 randomly mask the images, get prediction
 saliency map = sum of masks weighted by the produced predictions
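The RISE recipe above (random masks, predictions on masked inputs, prediction-weighted sum of masks) can be sketched on a flattened toy "image". This is a simplified illustration: masks here are independent per-pixel coin flips, whereas smoother, upsampled masks are typically used on real images, and the toy `model` and all names are hypothetical.

```python
import random


def rise_saliency(f, image, n_masks=200, keep_prob=0.5, seed=0):
    """Saliency = average of random binary masks, weighted by the model's
    prediction on each masked input."""
    rng = random.Random(seed)
    saliency = [0.0] * len(image)
    for _ in range(n_masks):
        mask = [1.0 if rng.random() < keep_prob else 0.0 for _ in image]
        score = f([p * m for p, m in zip(image, mask)])
        for i, m in enumerate(mask):
            saliency[i] += score * m
    return [s / (n_masks * keep_prob) for s in saliency]


def model(pixels):
    return pixels[2]  # toy "classifier" that only looks at pixel 2


image = [1.0, 1.0, 1.0, 1.0, 1.0]  # a flattened 5-pixel "image"
saliency = rise_saliency(model, image)
top_pixel = max(range(len(image)), key=lambda i: saliency[i])
```

The pixel the model actually reads ends up with the highest score; the other pixels still receive nonzero saliency because they are often kept in the same masks, which is one reason many masks are needed.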
 gradient-based methods  visualize what in image would change class label
 gradient * input
 integrated gradients (sundararajan et al. 2017)  sum up the gradients along a path from some baseline $x'$ to the image $x$, times the difference $x - x'$ (in 1d, this is just $(x - x') \int_0^1 f'(x' + \alpha (x - x')) \, d\alpha = f(x) - f(x')$)
 in higher dimensions, such as images, we pick the path to integrate by starting at some baseline (e.g. all zero) and then get gradients as we interpolate between the zero image and the real image
 if we picture 2 features, we can see that integrating the gradients for one feature will not just yield $f(x) - f(baseline)$, because each time we evaluate the gradient we change both features
 explanation distill article
 ex. any pixels which are same in original image and modified image will be given 0 importance
 lots of different possible choices for baseline (e.g. random Gaussian image, blurred image, random image from the training set)
 multiplying by $x - x'$ is strange, instead can multiply by distr. weight for $x'$
 could also average over distributions of baseline (this yields expected gradients)
 when we do a Gaussian distr., this is very similar to smoothgrad
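A small numerical sketch of integrated gradients along the straight-line path from a baseline to the input. Two assumptions for the sake of a dependency-free example: gradients come from central finite differences rather than autodiff, and the path integral is approximated with a midpoint Riemann sum. The toy `model` and function names are illustrative.

```python
def numerical_gradient(f, x, eps=1e-5):
    """Central finite-difference gradient (stand-in for autodiff)."""
    grad = []
    for i in range(len(x)):
        up = list(x)
        up[i] += eps
        down = list(x)
        down[i] -= eps
        grad.append((f(up) - f(down)) / (2 * eps))
    return grad


def integrated_gradients(f, x, baseline, steps=50):
    """Riemann-sum approximation of integrated gradients along the
    straight line from baseline to x."""
    avg_grad = [0.0] * len(x)
    for s in range(steps):
        alpha = (s + 0.5) / steps  # midpoint rule
        point = [b + alpha * (xi - b) for xi, b in zip(x, baseline)]
        grad = numerical_gradient(f, point)
        avg_grad = [a + g / steps for a, g in zip(avg_grad, grad)]
    # scale the averaged gradient by the input difference (x - baseline)
    return [(xi - b) * g for xi, b, g in zip(x, baseline, avg_grad)]


def model(z):
    return z[0] * z[1] + z[0] ** 2


x = [1.0, 2.0]
baseline = [0.0, 0.0]
attributions = integrated_gradients(model, x, baseline)
```

For this model the attributions come out near 2 and 1: neither equals `model(x) - model(baseline)` on its own, but together they sum to it. Swapping in a different baseline changes the attributions, which is the sensitivity discussed above.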
 lrp
 taylor decomposition
 deeplift
 smoothgrad  average gradients around local perturbations of a point
 guided backpropagation  springenberg et al
 lets you better create maximally specific image
 selvaraju 17  gradCAM
 gradcam++
 competitive gradients (gupta & arora 2019)
 Label “wins” a pixel if either (a) its map assigns that pixel a positive score higher than the scores assigned by every other label or (b) a negative score lower than the scores assigned by every other label.
 final saliency map consists of scores assigned by the chosen label to each pixel it won, with the map containing a score 0 for any pixel it did not win.
 can be applied to any method which satisfies completeness (sum of pixel scores is exactly the logit value)
 Saliency Methods for Explaining Adversarial Attacks
 critiques
 Do Input Gradients Highlight Discriminative Features? (shah et al. 2021)  input gradients often don’t highlight relevant features (they work better for adv. robust models)
 prove/demonstrate this in synthetic dataset where $x=ye_i$ for standard basis vector $e_i$, $y \in \{\pm 1\}$
 newer methods
 Score-CAM: Improved Visual Explanations Via Score-Weighted Class Activation Mapping
 Removing input features via a generative model to explain their attributions to classifier’s decisions
 NeuroMask: Explaining Predictions of Deep Neural Networks through Mask Learning
 Interpreting Undesirable Pixels for Image Classification on Black-Box Models
 Guideline-Based Additive Explanation for Computer-Aided Diagnosis of Lung Nodules
 Learning how to explain neural networks: PatternNet and PatternAttribution  still gradient-based
 Learning Reliable Visual Saliency for Model Explanations
 Neural Network Attributions: A Causal Perspective
 Gradient Weighted Superpixels for Interpretability in CNNs
 Decision Explanation and Feature Importance for Invertible Networks (mundhenk et al. 2019)
 Efficient Saliency Maps for Explainable AI
dnn feature interactions
 hierarchical interpretations for neural network predictions (singh et al. 2019)
 contextual decomposition (murdoch et al. 2018)
 ACD followup work
 Compositional Explanations for Image Classifiers (chockler et al. 21)  use perturbationbased interpretations to greedily search for pixels which increase prediction the most (simpler version of ACD)
 Detecting Statistical Interactions from Neural Network Weights  interacting inputs must follow strongly weighted connections to a common hidden unit before the final output
 Neural interaction transparency (NIT) (tsang et al. 2017)
 Explaining Explanations: Axiomatic Feature Interactions for Deep Networks (janizek et al. 2020)  integrated hessians
 the distinction between main and interaction effects is not clear
 Interpretable Artificial Intelligence through the Lens of Feature Interaction (tsang et al. 2021)
 feature interaction = any non-additive effect between multiple features on an outcome (i.e. cannot be decomposed into a sum of subfunctions of individual variables)
 Learning Global Pairwise Interactions with Bayesian Neural Networks (cui et al. 2020)  Bayesian Group Expected Hessian (GEH)  train bayesian neural net and analyze hessian to understand interactions
 Sparse Epistatic Regularization of Deep Neural Networks for Inferring Fitness Functions (aghazadeh et al. 2020)  penalize a DNN’s spectral representation to limit learning noisy high-order interactions
dnn textual explanations
 Adversarial Inference for MultiSentence Video Description  adversarial techniques during inference for a better multisentence video description
 Object Hallucination in Image Captioning  image relevance metric  assess rate of object hallucination
 CHAIR metric  what proportion of words generated are actually in the image according to gt sentences and object segmentations
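A rough sketch of how a CHAIR-style score could be computed. It assumes the objects mentioned in each caption have already been extracted and mapped to the same vocabulary as the ground-truth object sets (that extraction step is not shown). CHAIR is usually reported as a hallucination rate, i.e. the complement of the "actually in the image" proportion described above. The function name and the example data are hypothetical.

```python
def chair_scores(mentioned_per_caption, present_per_image):
    """Return (per-instance, per-sentence) hallucination rates.

    mentioned_per_caption: list of lists of objects mentioned in each caption.
    present_per_image: list of sets of objects truly in each image
        (from ground-truth captions and segmentations).
    """
    total_mentions = 0
    hallucinated_mentions = 0
    captions_with_hallucination = 0
    for mentioned, present in zip(mentioned_per_caption, present_per_image):
        bad = [obj for obj in mentioned if obj not in present]
        total_mentions += len(mentioned)
        hallucinated_mentions += len(bad)
        captions_with_hallucination += 1 if bad else 0
    chair_i = hallucinated_mentions / total_mentions if total_mentions else 0.0
    chair_s = captions_with_hallucination / len(mentioned_per_caption)
    return chair_i, chair_s


# Hypothetical example: "bench" is mentioned but not in the first image.
mentioned = [["dog", "frisbee", "bench"], ["cat"]]
present = [{"dog", "frisbee"}, {"cat", "sofa"}]
chair_i, chair_s = chair_scores(mentioned, present)
print(chair_i, chair_s)  # 0.25 0.5
```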
 women also snowboard  force caption models to look at people when making genderspecific predictions
 Fooling Vision and Language Models Despite Localization and Attention Mechanism  can do adversarial attacks on captioning and VQA
 Grounding of Textual Phrases in Images by Reconstruction  given text and image provide a bounding box (supervised problem w/ attention)
 Natural Language Explanations of Classifier Behavior
model summarization / distillation
 piecewise linear interp
 Computing Linear Restrictions of Neural Networks  calculate function of neural network restricting its points to lie on a line
 Interpreting CNN Knowledge via an Explanatory Graph (zhang et al. 2017)  create a graph that responds better to things like objects than individual neurons
 model distillation (modelagnostic)
 Trepan  approximate model w/ a decision tree
 BETA (lakkaraju et al. 2017)  approximate model by a rule list
 exact distillation
 Bornagain tree ensembles (vidal et al. 2020)  efficient algorithm to exactly find a minimal tree which reproduces the predictions of a tree ensemble
 Knowledge Distillation as Semiparametric Inference (dao…mackey, 2021)
 background on when kd should succeed
 probabilities more informative than labels (hinton, vinyals, & dean, 2015)
 linear students exactly mimic linear teachers (phuong & lampert, 2019)
 students can learn at a faster rate given knowledge of datapoint difficulty (lopezpaz et al. 2015)
 regularization for kernel ridge regression (mobahi, farajtabar, & bartlett, 2020)
 teacher class probabilities are proxies for the true bayes class probabilities $\mathbb E [Y \mid x]$
 adjustments
 teacher underfitting $\to$ loss correction
 teacher overfitting $\to$ crossfitting (chernozhukov et al. 2018)  like crossvalidation, fit student only to heldout predictions
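To make the cross-fitting adjustment concrete, here is a small sketch in a regression setting (a simplification: the paper concerns class probabilities). The teacher is a hypothetical 1-nearest-neighbour model that memorizes its training data, and the student is a least-squares line. All function names and the data-generating process are assumptions for illustration.

```python
import random


def fit_teacher(xs, ys):
    """Hypothetical overfitting teacher: 1-nearest-neighbour lookup."""
    data = list(zip(xs, ys))
    return lambda x: min(data, key=lambda d: abs(d[0] - x))[1]


def fit_student(xs, targets):
    """Least-squares line fit to the teacher's targets; returns (slope, intercept)."""
    n = len(xs)
    mx, mt = sum(xs) / n, sum(targets) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    slope = sum((x - mx) * (t - mt) for x, t in zip(xs, targets)) / sxx
    return slope, mt - slope * mx


def cross_fit_targets(xs, ys, n_folds=5):
    """For each fold, train the teacher on the other folds and predict the held-out fold."""
    n = len(xs)
    targets = [None] * n
    for k in range(n_folds):
        train = [i for i in range(n) if i % n_folds != k]
        teacher = fit_teacher([xs[i] for i in train], [ys[i] for i in train])
        for i in range(n):
            if i % n_folds == k:
                targets[i] = teacher(xs[i])
    return targets


rng = random.Random(0)
xs = [rng.uniform(0, 1) for _ in range(200)]
ys = [2.0 * x + rng.gauss(0, 0.3) for x in xs]

teacher_all = fit_teacher(xs, ys)
in_sample = [teacher_all(x) for x in xs]  # just reproduces the training labels
held_out = cross_fit_targets(xs, ys)      # teacher never saw these points
slope, intercept = fit_student(xs, held_out)
```

The in-sample teacher predictions equal the labels exactly, so a student fit to them learns nothing beyond the raw data; the held-out predictions avoid that.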
different problems / perspectives
improving models
 Interpretations are useful: penalizing explanations to align neural networks with prior knowledge (rieger et al. 2020)
 Refining Neural Networks with Compositional Explanations (yao et al. 21)  human looks at saliency maps of interactions, gives natural language explanation, this is converted back to interactions (defined using IG), and then regularized
 Right for the Right Reasons: Training Differentiable Models by Constraining their Explanations
 Explain to Fix: A Framework to Interpret and Correct DNN Object Detector Predictions
 Understanding Misclassifications by Attributes
 Improving VQA and its Explanations by Comparing Competing Explanations (wu et al. 2020)
 train to distinguish correct human explanations from competing explanations supporting incorrect answers
 first, predict answer candidates
 second, retrieve/generate explanation for each candidate
 third, predict verification score from explanation (trained on gt explanations)
 fourth, reweight predictions by verification scores
 generated explanations are rated higher by humans
 VQAE: Explaining, Elaborating, and Enhancing Your Answers for Visual Questions (li et al. 2018)  train to jointly predict answer + generate an explanation
 SelfCritical Reasoning for Robust Visual Question Answering (wu & mooney, 2019)  use textual explanations to extract a set of important visual objects
recourse
 actionable recourse in linear classification (ustun et al. 2019)
 want model to provide actionable inputs (e.g. income) rather than immutable variables (e.g. age, marital status)
 inputs that would require drastic changes are effectively immutable
 recourse  can a person obtain a desired prediction from a fixed model by changing actionable input variables (not just standard explainability)
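A simplified illustration of recourse for a linear classifier. The paper works with discrete action sets and integer programming; the sketch below instead uses the closed-form smallest L2 change restricted to actionable features, which is an assumption made to keep the example short. Feature names, weights, and the `margin` parameter are hypothetical.

```python
def score(x, w, b):
    """Linear classifier score; the prediction is positive when score >= 0."""
    return sum(wi * xi for wi, xi in zip(w, x)) + b


def linear_recourse(x, w, b, actionable, margin=1e-6):
    """Smallest L2 change, on actionable features only, that flips a negative prediction.

    Returns a list of per-feature changes, or None if no actionable
    feature has a nonzero weight (no recourse exists).
    """
    s = score(x, w, b)
    if s >= 0:
        return [0.0] * len(x)  # already the desired prediction
    norm_sq = sum(w[j] ** 2 for j in actionable)
    if norm_sq == 0:
        return None
    step = (margin - s) / norm_sq
    return [step * w[j] if j in actionable else 0.0 for j in range(len(x))]


# Hypothetical features: [income, age, number of open accounts]
w, b = [0.5, 0.8, -0.3], -4.0
x = [2.0, 3.0, 4.0]
delta = linear_recourse(x, w, b, actionable={0, 2})  # age (index 1) is immutable
new_x = [xi + di for xi, di in zip(x, delta)]
```

Here the suggested action raises income and reduces the number of open accounts while leaving age untouched.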
interp for rl
 heatmaps
 visualize most interesting states / rollouts
 language explanations
 interpretable intermediate representations (e.g. bounding boxes for autonomous driving)
 policy extraction  distill a simple model from a bigger model (e.g. neural net $\to$ tree)
differential privacy
 differentially private if the outputs of the model do not change (within some epsilon tolerance) when a single datapoint is removed
interpretation over sets / perturbations
These papers don’t quite connect to prediction, but are generally about finding stable interpretations across a set of models / choices.
 Exploring the cloud of variable importance for the set of all good models (dong & rudin, 2020)
 All Models are Wrong, but Many are Useful: Learning a Variable’s Importance by Studying an Entire Class of Prediction Models Simultaneously (fisher, rudin, & dominici, 2019)  also had title Model class reliance: Variable importance measures for any machine learning model class, from the “Rashomon” perspective
 model reliance = MR  like permutation importance, measures how much a model relies on covariates of interest for its accuracy
 defined (for a feature) as the ratio of expected loss after permuting (with all possible permutation pairs) to before permuting
 could also be defined as a difference or using predictions rather than loss
 connects to U-statistics  can show the estimator is unbiased, etc.
 related to Algorithm Reliance (AR)  fitting with/without a feature and measuring the difference in loss (see gevrey et al. 03)
 modelclass reliance = MCR = highest/lowest degree of MR within a class of wellperforming models
 with some assumptions on model class complexity (in the form of a covering number), can create uniform bounds on estimation error
 MCR can be efficiently computed for (regularized) linear / kernel linear models
 Rashomon set = class of wellperforming models
 “Rashomon” effect of statistics  many prediction models may fit the data almost equally well (breiman 01)
 “This set can be thought of as representing models that might be arrived at due to differences in data measurement, processing, filtering, model parameterization, covariate selection, or other analysis choices”
 can study these tools for describing rank of risk predictions, variance of predictions, e.g. confidence intervals
 confidence intervals  can get finitesample interval for anything, not just loss (e.g. norm of coefficients, prediction for a specific point)
 connections to causality
 when function is conditional expectation, then MR is similar to many things studied in causal literature
 conditional importance measures a different notion (takes away things attributed to spurious variables)
 can be hard to do conditional permutation well when some feature pairs are rare so can use weighting, matching, or imputation
 here, application is to see on COMPAS dataset whether one can build an accurate model which doesn’t rely on race / sex (in order to audit blackbox COMPAS models)
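The model reliance (MR) definition above, as a ratio of the expected loss after switching a feature across all pairs to the original loss, can be sketched directly. The model `f`, the data, and the squared loss are hypothetical choices; this is the plain all-pairs estimator, without the bounds or Rashomon-set search from the paper.

```python
import random


def model_reliance(f, X, y, feature, loss):
    """Ratio of the loss after switching `feature` across all pairs (i != j) to the original loss."""
    n = len(y)
    e_orig = sum(loss(f(X[i]), y[i]) for i in range(n)) / n
    total = 0.0
    for i in range(n):
        for j in range(n):
            if i == j:
                continue
            row = list(X[i])
            row[feature] = X[j][feature]  # take feature value from observation j
            total += loss(f(row), y[i])
    e_switch = total / (n * (n - 1))
    return e_switch / e_orig


rng = random.Random(0)
X = [[rng.gauss(0, 1), rng.gauss(0, 1)] for _ in range(60)]
y = [3.0 * row[0] + rng.gauss(0, 0.1) for row in X]


def f(row):
    """Hypothetical fitted model that uses only feature 0."""
    return 3.0 * row[0]


def sq_loss(pred, target):
    return (pred - target) ** 2


mr_used = model_reliance(f, X, y, feature=0, loss=sq_loss)    # much larger than 1
mr_unused = model_reliance(f, X, y, feature=1, loss=sq_loss)  # approximately 1
```

An MR near 1 means the model's accuracy does not depend on that feature; larger values mean heavier reliance.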
 A Theory of Statistical Inference for Ensuring the Robustness of Scientific Results (coker, rudin, & king, 2018)
 Inference = process of using facts we know to learn about facts we do not know
 hacking intervals  the range of a summary statistic one may obtain given a class of possible endogenous manipulations of the data
 prescriptively constrained hacking intervals  explicitly define reasonable analysis perturbations
 ex. hyperparameters (e.g. k in kNN), matching algorithm, adding a new feature
 tethered hacking intervals  take any model with small enough loss on the data
 rather than choosing $\alpha$, we choose error tolerance
 for MLE, equivalent to profile likelihood confidence intervals
 ex. SVM distance from point to boundary, Kernel regression prediction for a specific new point, feature selection
 ex. linear regression ATE, individual treatment effect
 PCS intervals could be seen as slightly broader, including data cleaning and problem translations
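A toy sketch of a tethered hacking interval: take every model whose loss is within a chosen tolerance of the best loss, and report the range of a summary statistic across them. The model class (lines through the origin on a slope grid), the data, and the tolerance value are all assumptions for illustration.

```python
def tethered_hacking_interval(models, loss_fn, statistic, tolerance):
    """Range of `statistic` over all models with loss <= best loss + tolerance."""
    losses = [loss_fn(m) for m in models]
    best = min(losses)
    stats = [statistic(m) for m, l in zip(models, losses) if l <= best + tolerance]
    return min(stats), max(stats)


# Hypothetical data, roughly y = 2x.
xs = [1.0, 2.0, 3.0, 4.0]
ys = [2.1, 3.9, 6.2, 7.8]


def mse(slope):
    """Mean squared error of the model y = slope * x."""
    return sum((y - slope * x) ** 2 for x, y in zip(xs, ys)) / len(xs)


def prediction_at_5(slope):
    """Summary statistic: the model's prediction for a new point x = 5."""
    return slope * 5.0


slopes = [i / 100 for i in range(401)]  # model class: slopes 0.00, 0.01, ..., 4.00
lo, hi = tethered_hacking_interval(slopes, mse, prediction_at_5, tolerance=0.075)
```

With these numbers the least-squares slope is 1.99, and slopes within about 0.1 of it stay inside the loss tolerance, so the interval for the prediction at x = 5 is roughly one unit wide around 9.95.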
 different theories of inference have different counterfactual worlds
 pvalues  data from a superpopulation
 Fisher’s exact pvalues  fix the data and randomize counterfactual treatment assignments
 Causal sensitivity analysis  unmeasured confounders from a defined set
 bayesian credible intervals  redrawing the data from the same data generating process, given the observed data and assumed prior and likelihood model
 hacking intervals  counterfactual researchers making counterfactual analysis choices
 2 approaches to replication
 replicating studies  generally replication is very low
 pcurve approach: look at distr. of pvalues, check if lots of things are near 0.05
 A study in Rashomon curves and volumes: A new perspective on generalization and model simplicity in machine learning (semenova, rudin, & parr, 2020)
 rashomon ratio  ratio of the volume of the set of accurate models to the volume of the hypothesis space
 can use this to perform model selection over different hypothesis spaces using empirical risk v. rashomon ratio (rashomon curve)
 pattern Rashomon ratio  considers unique predictions on the data (called “patterns”) rather than the count of functions themselves.
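A Monte Carlo sketch of both ratios for a one-parameter hypothesis space of threshold classifiers on hypothetical 1-D data. Sampling thresholds uniformly stands in for computing volumes, and the Rashomon set is taken to be the models within `theta` of the best empirical risk; both choices are assumptions made for illustration.

```python
import random


def predict(threshold, xs):
    """Prediction pattern of the classifier 1[x > threshold] on the dataset."""
    return tuple(1 if x > threshold else 0 for x in xs)


def risk(threshold, xs, ys):
    """Empirical 0-1 risk of the threshold classifier."""
    preds = predict(threshold, xs)
    return sum(p != y for p, y in zip(preds, ys)) / len(ys)


def rashomon_ratios(xs, ys, theta, n_models=2000, seed=0):
    """Estimate (Rashomon ratio, pattern Rashomon ratio) by sampling thresholds in [0, 1]."""
    rng = random.Random(seed)
    thresholds = [rng.uniform(0, 1) for _ in range(n_models)]
    risks = [risk(t, xs, ys) for t in thresholds]
    best = min(risks)
    good = [t for t, r in zip(thresholds, risks) if r <= best + theta]
    ratio = len(good) / n_models  # fraction of the hypothesis space that is accurate
    all_patterns = {predict(t, xs) for t in thresholds}
    good_patterns = {predict(t, xs) for t in good}
    pattern_ratio = len(good_patterns) / len(all_patterns)
    return ratio, pattern_ratio


xs = [0.1, 0.2, 0.3, 0.4, 0.6, 0.7, 0.8, 0.9]
ys = [0, 0, 0, 0, 1, 1, 1, 1]
ratio, pattern_ratio = rashomon_ratios(xs, ys, theta=0.0)
```

With `theta=0.0`, thresholds between 0.4 and 0.6 are perfect, so the Rashomon ratio is near 0.2, while only one of the nine distinct prediction patterns is accurate, giving a pattern ratio of 1/9.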
 Underspecification Presents Challenges for Credibility in Modern Machine Learning (D’Amour et al. 2020)
 different models can achieve the same validation accuracy but perform differently wrt different data perturbations
 shortcuts = spurious correlations cause failure because of ambiguity in the data
 stress tests probe a broader set of requirements
 ex. subgroup analyses, domain shift, contrastive evaluations (looking at transformations of an individual example, such as counterfactual notions of fairness)
 suggestions
 need to test models more thoroughly
 need criteria to select among good models (e.g. explanations)
 Predictive Multiplicity in Classification (marx et al. 2020)
 predictive multiplicity = ability of a prediction problem to admit competing models with conflicting predictions
 A general framework for inference on algorithmagnostic variable importance (williamson et al. 2021)
misc new papers
 iNNvestigate neural nets  provides a common interface and outofthebox implementation
 tensorfuzz  debugging
 ICIE 1.0: A Novel Tool for Interactive Contextual Interaction Explanations
 ConvNets and ImageNet Beyond Accuracy: Understanding Mistakes and Uncovering Biases
 cnns are more accurate, robust, and biased than we might expect on imagenet
 Bridging Adversarial Robustness and Gradient Interpretability
 explaining a blackbox w/ deep variational bottleneck
 Global Explanations of Neural Networks: Mapping the Landscape of Predictions
 neural stethoscopes
 xGEMs
 maximally invariant data perturbation
 hard coding
 SSIM layer
 Inverting Supervised Representations with Autoregressive Neural Density Models
 nonparametric var importance
 supervised local modeling
 detect adversarial cnn attacks w/ feature maps
 adaptive dropout
 lesion detection saliency
 How Important Is a Neuron?
 symbolic execution for dnns
 L-Shapley and C-Shapley
 Interpreting Neural Network Judgments via Minimal, Stable, and Symbolic Corrections
 DeepPINK: reproducible feature selection in deep neural networks
 “Explaining Deep Learning Models – A Bayesian Nonparametric Approach”
 Detecting Potential Local Adversarial Examples for HumanInterpretable Defense
 Interpreting Layered Neural Networks via Hierarchical Modular Representation
 Entropic Variable Boosting for Explainability & Interpretability in Machine Learning
 Understanding Individual Decisions of CNNs via Contrastive Backpropagation
 Understanding Impacts of HighOrder Loss Approximations and Features in Deep Learning Interpretation
 A Game Theoretic Approach to Classwise Selective Rationalization
 Additive Explanations for Anomalies Detected from Multivariate Temporal Data
 Asymmetric Shapley values: incorporating causal knowledge into modelagnostic explainability
 Consistent Regression using DataDependent Coverings
 Contextual Local Explanation for Black Box Classifiers
 Contextual Prediction Difference Analysis
 Contrastive Relevance Propagation for Interpreting Predictions by a SingleShot Object Detector
 CXPlain: Causal Explanations for Model Interpretation under Uncertainty
 Ensembles of Locally Independent Prediction Models
 Explaining BlackBox Models Using Interpretable Surrogates
 Explaining Classifiers with Causal Concept Effect (CaCE)
 Generative Counterfactual Introspection for Explainable Deep Learning
 Grid Saliency for Context Explanations of Semantic Segmentation
 Optimal Explanations of Linear Models
 Privacy Risks of Explaining Machine Learning Models
 RLLIM: Reinforcement Learningbased Locally Interpretable Modeling
 The many Shapley values for model explanation
 XDeep: An Interpretation Tool for Deep Neural Networks
 Shapley Decomposition of RSquared in Machine Learning Models
 Understanding Global Feature Contributions Through Additive Importance Measures (covert, lundberg, & lee 2020)
 SAGE score looks at reduction in predictive accuracy due to subsets of features
 Personal insights for altering decisions of treebased ensembles over time
 Gradient-Adjusted Neuron Activation Profiles for Comprehensive Introspection of Convolutional Speech Recognition Models
 Learning Global Transparent Models from Local Contrastive Explanations
 Boosting Algorithms for Estimating Optimal Individualized Treatment Rules
 Explaining Knowledge Distillation by Quantifying the Knowledge
 Adversarial TCAV – Robust and Effective Interpretation of Intermediate Layers in Neural Networks
 Problems with Shapley-value-based explanations as feature importance measures
 When Explanations Lie: Why Many Modified BP Attributions Fail
 Estimating Training Data Influence by Tracking Gradient Descent
 Interpreting Interpretations: Organizing Attribution Methods by Criteria
 Explaining Groups of Points in LowDimensional Representations
 Causal Interpretability for Machine Learning – Problems, Methods and Evaluation
 Cyclic Boosting  An Explainable Supervised Machine Learning Algorithm
 A Causality Analysis for Nonlinear Classification Model with SelfOrganizing Map and Locally Approximation to Linear Model
 BlackBox Saliency Map Generation Using Bayesian Optimisation
 On Network Science and Mutual Information for Explaining Deep Neural Networks