RashomonSet
This class is the core class of the package. It creates the Rashomon Set for a given epsilon value and selected evaluation metric.
Parameters :
leaderboard : pd.DataFrame
Leaderboard in a DataFrame format consisting of models and their evaluation metric scores. (Returned by converters)
predictions : dict
Dictionary with model names as keys and class prediction vectors as values. (Returned by converters)
proba_predictions : dict
Dictionary with model names as keys and class probability prediction DataFrames as values. (Returned by converters)
base_metric : str
Evaluation metric used as the primary value for sorting model performance. Allowed metrics are specified in the METRICS attribute.
epsilon : float
Epsilon parameter specifying the allowable deviation from the best evaluation metric value, within which a model qualifies for inclusion in the Rashomon Set.
Allowed values of the base_metric parameter are:
Binary classification:
'accuracy', 'balanced_accuracy', 'roc_auc', 'average_precision', 'precision', 'precision_macro', 'precision_micro', 'precision_weighted', 'recall', 'recall_macro', 'recall_micro', 'recall_weighted', 'f1', 'f1_macro', 'f1_micro', 'f1_weighted'
Multiclass classification:
'accuracy', 'balanced_accuracy', 'precision_macro', 'precision_micro', 'precision_weighted', 'recall_macro', 'recall_micro', 'recall_weighted', 'f1_macro', 'f1_micro', 'f1_weighted', 'roc_auc_ovo', 'roc_auc_ovo_weighted', 'roc_auc_ovr', 'roc_auc_ovr_micro', 'roc_auc_ovr_weighted'
The epsilon parameter value should be greater than 0, as it represents the absolute difference threshold.
Attributes :
base_model : str
Name of the model that achieved the best score for the evaluation metric specified by the base_metric parameter.
best_score : float
The best value of the base_metric achieved by all models.
worst_score : dict
The worst value of the base_metric achieved by all models.
rashomon_set : list
Names of the models that are included in the Rashomon Set for the given parameters. Obtained from the get_rashomon_set() method. If the set contains fewer than 2 models, the constructor raises a ValueError and asks for different parameters, such as base_metric and epsilon.
rashomon_predictions : dict
A subset of the predictions dict parameter containing only models that are included in the Rashomon Set.
rashomon_proba_predictions : dict
A subset of the proba_predictions dict parameter containing only models that are included in the Rashomon Set.
number_of_classes : int
Number of classes in the task, determined from the predictions vector. Obtained from the determine_number_of_classes() method.
task_type : str
Classification task type, 'binary' or 'multiclass', obtained from the determine_task_type() method.
METRICS : list
All evaluation metrics that are allowed for analysis.
METRICS_GREATER_IS_BETTER : dict
A dictionary mapping classification evaluation metric names to a boolean indicating whether the metric is interpreted as 'greater is better' (e.g. accuracy) or not (e.g. log loss).
leaderboard : pd.DataFrame
leaderboard parameter value
predictions : dict
predictions parameter value
proba_predictions : dict
proba_predictions parameter value
base_metric : str
base_metric parameter value
epsilon : float
epsilon parameter value
Returns whether the task type for a given dataset is binary or multiclass classification based on prediction vectors. If another task type is determined, the method throws ValueError.
Returns :
task_type : str
Method for determining number of classes in the task based on predictions vector.
Returns :
number_of_classes : int
Method for determining number of samples in the task based on predictions vector.
Returns :
number_of_samples : int
Returns the name of the model that achieved the best score for the evaluation metric specified by the base_metric parameter. Uses METRICS_GREATER_IS_BETTER to account for whether a higher or a lower value of the metric is better.
Returns :
base_model : str
Returns the name of the model that achieved the worst score for the evaluation metric specified by the base_metric parameter. Uses METRICS_GREATER_IS_BETTER to account for whether a higher or a lower value of the metric is better.
Returns :
worst_model : str
Returns the number and names of models which achieved the same score for the evaluation metric specified by base_metric parameter as the base_model. Used especially when there are multiple models with the same scores.
Returns :
same_scores_count : int
same_scores_models : list
epsilon)
Returns the names of models that are included in the Rashomon Set for a specified epsilon and base_metric. The Rashomon Set is defined as:
$$
S_\epsilon(h_0) := \lbrace h \in H : R(h) \le R(h_0) + \epsilon \rbrace
$$
where \(h_0\) is the base_model, \(H\) is the set of all models in the hypothesis space and \(R\) is the selected risk function (base_metric).
Read more about the Rashomon Set : Predictive Multiplicity in Classification (Definition 1)
Parameters :
epsilon: float, default = None
When no epsilon value is passed, the method uses the value of the epsilon parameter.
Returns :
rashomon_set : list
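The selection can be sketched as keeping every model whose score lies within epsilon of the best score. This is a minimal illustration that assumes a plain dict of scores instead of the leaderboard DataFrame; the function name `select_rashomon_set` and the `greater_is_better` flag are hypothetical and not part of the package API.

```python
def select_rashomon_set(scores, epsilon, greater_is_better=True):
    """Return names of models whose score is within epsilon of the best score.

    `scores` maps model names to metric values (hypothetical input format).
    """
    if epsilon <= 0:
        raise ValueError("epsilon should be greater than 0")
    if greater_is_better:
        best = max(scores.values())
        return [name for name, s in scores.items() if best - s <= epsilon]
    # For loss-like metrics the best score is the smallest one.
    best = min(scores.values())
    return [name for name, s in scores.items() if s - best <= epsilon]


accuracy = {"rf": 0.91, "xgb": 0.90, "logreg": 0.85, "knn": 0.80}
rashomon_set = select_rashomon_set(accuracy, epsilon=0.02)

log_loss = {"a": 0.30, "b": 0.34, "c": 0.50}
rashomon_set_loss = select_rashomon_set(log_loss, epsilon=0.05, greater_is_better=False)
```

With these made-up scores, only models close to the best one are kept in each case.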
Returns the subsets of predictions and proba_predictions dictionaries containing only models that are included in the Rashomon Set.
Returns :
rashomon_predictions : dict
rashomon_proba_predictions : dict
Returns the subsets of the feature_importance_dict containing only models that are included in the Rashomon Set.
Returns :
rashomon_feature_importances : dict
Calculates binary ambiguity of the Rashomon Set by counting all observations where at least one model made a different class prediction than the base model. Those observations are considered ambiguous. Returns the fraction of ambiguous observations. Ambiguity is defined as:
$$
\displaystyle
{ \alpha(h_0) := \frac{1}{n}\sum_{i=1}^n \max_{h \in S_{\epsilon}} \mathbf{1}[h(x_i) \neq h_0(x_i)]}
$$
where \(S_\epsilon\) is the Rashomon Set and \(h(x_i)\) is the model's prediction for observation \(x_i\).
Read more about binary ambiguity : Predictive Multiplicity in Classification (Definition 3)
Note : Method available only for binary classification task type.
Returns :
binary_ambiguity : float
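As a rough sketch of the ambiguity computation, assuming predictions are plain lists of 0/1 labels (the function and variable names below are illustrative, not the package's implementation):

```python
def binary_ambiguity(base_preds, other_preds):
    """Fraction of observations where any model disagrees with the base model.

    `other_preds` maps model names to label lists (hypothetical input format).
    """
    n = len(base_preds)
    ambiguous = 0
    for i in range(n):
        # One disagreeing model is enough to mark the observation as ambiguous.
        if any(preds[i] != base_preds[i] for preds in other_preds.values()):
            ambiguous += 1
    return ambiguous / n


base = [0, 1, 1, 0, 1]
others = {"m1": [0, 1, 0, 0, 1], "m2": [0, 1, 1, 1, 1]}
ambiguity = binary_ambiguity(base, others)
```

Here the observations at index 2 and index 3 are ambiguous, so two of the five observations count.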
Calculates multiclass ambiguity of the Rashomon Set by counting all observations where at least one model made a different class prediction than the base model. Those observations are considered ambiguous. Returns the fraction of ambiguous observations. Multiclass ambiguity is defined as:
$$
\displaystyle
{ \alpha(h_0) := \frac{1}{n}\sum_{i=1}^n \max_{h \in S_{\epsilon}} \mathbf{1}[\arg\max(h(x_i)) \neq \arg\max(h_0(x_i))]}
$$
where \(S_\epsilon\) is the Rashomon Set, \(h(x_i)\) is the class probability prediction vector for observation \(x_i\), and \(\arg\max(h(x_i))\) is the predicted class.
Read more about multiclass ambiguity : Rashomon Capacity: A Metric for Predictive Multiplicity in Classification (Equation 2)
Note: Method available only for multiclass classification task type.
Returns :
multiclass_ambiguity : float
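For the multiclass case, the same idea can be sketched on class probability vectors, where the predicted class is the index of the largest probability. This assumes plain nested lists rather than the DataFrames the package works with; all names are illustrative.

```python
def argmax(values):
    """Index of the largest value (first one wins on ties)."""
    return max(range(len(values)), key=values.__getitem__)


def multiclass_ambiguity(base_probs, other_probs):
    """Fraction of observations where any model predicts a different class."""
    base_classes = [argmax(row) for row in base_probs]
    ambiguous = 0
    for i, base_class in enumerate(base_classes):
        if any(argmax(probs[i]) != base_class for probs in other_probs.values()):
            ambiguous += 1
    return ambiguous / len(base_classes)


base = [[0.7, 0.2, 0.1], [0.1, 0.6, 0.3], [0.3, 0.3, 0.4]]
others = {
    "m1": [[0.6, 0.3, 0.1], [0.2, 0.3, 0.5], [0.2, 0.3, 0.5]],
    "m2": [[0.5, 0.4, 0.1], [0.1, 0.8, 0.1], [0.1, 0.2, 0.7]],
}
ambiguity = multiclass_ambiguity(base, others)
```

Only the second observation is ambiguous in this toy data, because model m1 predicts class 2 where the base model predicts class 1.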
delta)
Calculates probabilistic ambiguity of the Rashomon Set by counting all observations where at least one model made a different risk prediction than the base model. Risk predictions need to differ by at least delta to be considered conflicting. Returns the fraction of ambiguous observations. The definition of probabilistic ambiguity is as follows:
$$
\displaystyle
{ \alpha(h_0, \delta) := \frac{1}{n}\sum_{i=1}^n \mathbf{1}[\max_{g \in S_{\epsilon}} |g(x_i) - g_0(x_i)| \ge \delta ]}
$$
where \(S_\epsilon\) is the Rashomon Set and \(g(x_i)\) is the risk probability prediction \(P(y_i=1|x_i)\) of model \(g\) for observation \(x_i\), with \(g_0\) denoting the base model.
Read more about probabilistic ambiguity : Predictive Multiplicity in Probabilistic Classification (Definition 3)
Note : Method available only for binary classification task type.
Parameters :
delta : float
delta parameter indicates the minimum difference between two risk probabilities for the predictions to be considered conflicting.
Returns :
probabilistic_ambiguity : float
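A minimal sketch of this definition, assuming each model's positive-class probabilities are given as a plain list (names are illustrative and not taken from the package):

```python
def probabilistic_ambiguity(base_probs, other_probs, delta):
    """Fraction of observations whose largest probability gap to the base model is >= delta."""
    n = len(base_probs)
    conflicting = 0
    for i in range(n):
        # Largest absolute gap between any model and the base model.
        max_gap = max(abs(p[i] - base_probs[i]) for p in other_probs.values())
        if max_gap >= delta:
            conflicting += 1
    return conflicting / n


base = [0.9, 0.2, 0.6, 0.5]
others = {"m1": [0.8, 0.25, 0.9, 0.5], "m2": [0.7, 0.2, 0.65, 0.45]}
ambiguity = probabilistic_ambiguity(base, others, delta=0.15)
```

In this toy data the first and third observations have a gap of at least 0.15, so half of the observations are counted.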
Calculates discrepancy for a binary classification task by counting how many predictions differ between the base model and each other model in the Rashomon Set, then taking the maximum proportion of differing predictions across these models. Discrepancy is defined as:
$$
\displaystyle
{ \delta(h_0) := \max_{h \in S_{\epsilon}} \frac{1}{n}\sum_{i=1}^n \mathbf{1}[h(x_i) \neq h_0(x_i)]}
$$
where \(S_\epsilon\) is the Rashomon Set and \(h(x_i)\) is the model's prediction for observation \(x_i\).
Read more about binary discrepancy : Predictive Multiplicity in Classification (Definition 4)
Note: Method available only for binary classification task type.
Returns :
binary_discrepancy : float
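Discrepancy looks at one competing model at a time and reports the worst case. A small sketch under the assumption that predictions are plain label lists (hypothetical names, not the package's code):

```python
def binary_discrepancy(base_preds, other_preds):
    """Largest fraction of predictions that a single model changes relative to the base model."""
    n = len(base_preds)
    return max(
        sum(p != b for p, b in zip(preds, base_preds)) / n
        for preds in other_preds.values()
    )


base = [0, 1, 1, 0, 1]
others = {"m1": [0, 1, 0, 0, 1], "m2": [1, 1, 1, 1, 0]}
discrepancy = binary_discrepancy(base, others)
```

Model m1 differs on one observation and m2 on three, so the result is driven by m2.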
Calculates discrepancy for multiclass classification tasks by counting how many predictions differ between the base model and each other model in the Rashomon Set, then taking the maximum proportion of differing predictions across these models. Multiclass discrepancy is defined as:
$$
\displaystyle
{ \delta(h_0) := \max_{h \in S_{\epsilon}} \frac{1}{n}\sum_{i=1}^n \mathbf{1}[\arg\max(h(x_i)) \neq \arg\max(h_0(x_i))]}
$$
where \(S_\epsilon\) is the Rashomon Set, \(h(x_i)\) is the class probability prediction vector for observation \(x_i\), and \(\arg\max(h(x_i))\) is the predicted class.
Read more about multiclass discrepancy : Rashomon Capacity: A Metric for Predictive Multiplicity in Classification (Equation 2)
Note : Method available only for multiclass classification task type.
Returns :
multiclass_discrepancy : float
delta)
Calculates discrepancy for a binary classification task by counting how many risk predictions differ between the base model and each other model in the Rashomon Set. For predictions to be considered conflicting, their difference must be at least delta. The method then takes the maximum proportion of conflicting predictions across all models in the Rashomon Set. Probabilistic discrepancy is defined as follows:
$$
\displaystyle
{ \delta_\epsilon(h_0, \delta) := \max_{g \in S_{\epsilon}}\frac{1}{n}\sum_{i=1}^n \mathbf{1}[ |g(x_i) - g_0(x_i)| \ge \delta ]}
$$
where \(S_\epsilon\) is the Rashomon Set and \(g(x_i)\) is the risk probability prediction \(P(y_i=1|x_i)\) of model \(g\) for observation \(x_i\), with \(g_0\) denoting the base model.
Read more about probabilistic discrepancy : Predictive Multiplicity in Probabilistic Classification (Definition 4)
Note : Method available only for binary classification task type.
Parameters :
delta : float
delta parameter indicates the minimum difference between two risk probabilities for the predictions to be considered conflicting.
Returns :
probabilistic_discrepancy : float
Calculates the viable prediction range (VPR) for each observation in the test dataset as the min and max risk probability predicted across all models in the Rashomon Set. A wide prediction range suggests high uncertainty about the observation's prediction. VPR for the given observation is defined as:
$$
\displaystyle
{ V_\epsilon(x_i) := [\min_{g \in S_\epsilon} g(x_i), \max_{g \in S_\epsilon} g(x_i)] }
$$
where \(S_\epsilon\) is the Rashomon Set and \(g(x_i)\) is the risk probability prediction \(P(y_i=1|x_i)\) for observation \(x_i\).
Read more about Viable Prediction Range : Predictive Multiplicity in Probabilistic Classification (Definition 2)
Note : Method available only for binary classification task type.
Returns :
vprs : list[tuple[float, float]]
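The range computation can be sketched as a per-observation min and max over the models' probabilities. The input format (a dict of plain lists) and the function name are assumptions made for illustration.

```python
def viable_prediction_ranges(proba_by_model):
    """Return (min, max) of the positive-class probability for every observation."""
    columns = list(proba_by_model.values())
    # zip(*columns) groups the predictions of all models for one observation.
    return [(min(values), max(values)) for values in zip(*columns)]


probs = {
    "m1": [0.9, 0.2, 0.6],
    "m2": [0.7, 0.25, 0.6],
    "m3": [0.8, 0.1, 0.65],
}
vprs = viable_prediction_ranges(probs)
# → [(0.7, 0.9), (0.1, 0.25), (0.6, 0.65)]
```

The first observation has the widest range in this toy data, which would mark it as the least certain prediction.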
For each observation, calculates the fraction of models that made the same class prediction as the base model. If all models from the Rashomon Set predicted the same class as the base model, the agreement rate for the observation equals 1.
Returns :
agreement_rates : list[float]
Calculates the proportion of observations for which each model from the Rashomon Set predicted the same class as the base model. Returns a dictionary mapping model names to their corresponding agreement percentages with the base model. If a given model made the same predictions for all observations as the base model, its percent agreement equals 100%.
Returns :
percent_agreements : dict[str, float]
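The two agreement measures differ only in the direction of aggregation: agreement rates aggregate over models for each observation, while percent agreements aggregate over observations for each model. The sketch below assumes plain label lists, compares only the non-base models against the base model, and reports percent agreement on a 0 to 100 scale; these are assumptions for illustration, and the names are not the package's API.

```python
def agreement_rates(base_preds, other_preds):
    """Per observation: fraction of models predicting the same class as the base model."""
    models = list(other_preds.values())
    return [
        sum(m[i] == b for m in models) / len(models)
        for i, b in enumerate(base_preds)
    ]


def percent_agreements(base_preds, other_preds):
    """Per model: percentage of observations predicted the same as the base model."""
    n = len(base_preds)
    return {
        name: 100 * sum(p == b for p, b in zip(preds, base_preds)) / n
        for name, preds in other_preds.items()
    }


base = [0, 1, 1, 0]
others = {"m1": [0, 1, 0, 0], "m2": [0, 1, 1, 1]}
rates = agreement_rates(base, others)
percents = percent_agreements(base, others)
```

Each model disagrees with the base model on exactly one observation here, so both reach 75 percent, while the per-observation rates show where the disagreements fall.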
Calculates the Rashomon Ratio, which is the ratio of the Rashomon Set size to the total number of models present in the leaderboard. The Rashomon Ratio is defined as:
$$
\displaystyle
{ \frac{\text{vol}(S_{\epsilon}(h_0))}{\text{vol}(H)} }
$$
where \(\text{vol}(S_{\epsilon}(h_0))\) denotes the "volume" of the Rashomon Set; in our empirical setting this corresponds to the number of models in the Rashomon Set. Similarly, \(\text{vol}(H)\) represents the volume of the hypothesis space, in this case the number of models generated by the AutoML framework.
Read more about the Rashomon Ratio in: A Study in Rashomon Curves and Volumes: A New Perspective on Generalization and Model Simplicity in Machine Learning (Definition 2)
Returns :
rashomon_ratio : float
Method to retrieve all unique prediction patterns from the Rashomon Set, representing the collection of predictions produced by each model in the set. Returns a set of patterns in the Rashomon Set.
Returns :
patterns : set
Method to retrieve all unique prediction patterns from models present in the hypothesis space (leaderboard).
Returns :
patterns : set
Method for calculating the Pattern Rashomon Ratio, defined as the ratio of the number of unique prediction patterns in the Rashomon Set to the total number of unique prediction patterns in the hypothesis space (leaderboard):
$$
\displaystyle
{ \frac{|\lbrace h(X) : h \in S_\epsilon \rbrace|}{|\lbrace h(X) : h \in H \rbrace|} }
$$
where \(X\) denotes the dataset all the models were trained on, and \(h(X)\) denotes all predictions made by model \(h\) across all samples in the dataset.
\(H\) is the hypothesis space (in this case all models present in the leaderboard), and \(S_\epsilon\) is the Rashomon Set.
Read more about the Pattern Rashomon Ratio in: A Study in Rashomon Curves and Volumes: A New Perspective on Generalization and Model Simplicity in Machine Learning (Definition 3)
Returns :
pattern_rashomon_ratio: float
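Both ratios can be illustrated on a toy leaderboard. The sketch assumes predictions are stored as plain lists keyed by model name and that a prediction pattern is the full tuple of a model's predictions; the function names are hypothetical.

```python
def rashomon_ratio(all_predictions, rashomon_set):
    """Number of models in the Rashomon Set divided by the number of all models."""
    return len(rashomon_set) / len(all_predictions)


def pattern_rashomon_ratio(all_predictions, rashomon_set):
    """Unique prediction patterns in the Rashomon Set divided by unique patterns overall."""
    # Tuples are hashable, so identical prediction vectors collapse in a set.
    all_patterns = {tuple(p) for p in all_predictions.values()}
    set_patterns = {tuple(all_predictions[name]) for name in rashomon_set}
    return len(set_patterns) / len(all_patterns)


predictions = {
    "a": [0, 1, 1],
    "b": [0, 1, 1],  # same pattern as model "a"
    "c": [0, 0, 1],
    "d": [1, 0, 1],
}
members = ["a", "b", "c"]
ratio = rashomon_ratio(predictions, members)
pattern_ratio = pattern_rashomon_ratio(predictions, members)
```

Three of four models are in the set, but models a and b share one pattern, so the pattern-based ratio is two out of three.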
sample_index )
Method for calculating Rashomon Capacity for a given sample. The parameter sample_index should be the index of the sample for which the Rashomon Capacity is to be calculated. The output value is a float in the range [1, c], where c is the number of classes in the classification task.
If Rashomon Capacity equals 1, then all the models produced the same outputs for the given sample, so there is no predictive multiplicity.
If Rashomon Capacity equals c, the models produce maximally diverse predictions for the given sample, resulting in the highest predictive multiplicity.
Read more about Rashomon Capacity: Rashomon Capacity: A Metric for Predictive Multiplicity in Classification (Definition 2).
Parameters :
sample_index : int
Returns :
rashomon_capacity : float
sample_index )
Method for generating the transition matrix for a given sample. The transition matrix has shape (m, c), where m is the number of models included in the set and c is the number of classes in the prediction task. This matrix is used by the Blahut–Arimoto algorithm to compute the channel capacity.
Parameters :
sample_index : int
Returns :
transition_matrix : np.ndarray
transition_matrix, max_iterations, tolerance)
Method for computing the channel capacity for a given sample.
Channel capacity is defined as the maximum mutual information between the channel input X (here representing the models in the Rashomon set) and the channel output Y (the corresponding probabilistic predictions), maximized over all possible input distributions p(x).
Intuitively, this algorithm finds the probability distribution over models (inputs) that maximizes how informative their predictions (outputs) are.
For more details and inspiration, see the Rashomon Capacity GitHub repository.
Parameters :
transition_matrix : np.array
Matrix with m rows (models from the Rashomon Set) and c columns (number of classes for the specified task). Each row holds the class probability distribution predicted by one model for the specific sample.
max_iterations : int, default = 1000
Maximum number of algorithm iterations.
tolerance : float, default = 1e-8
Error tolerance at which the algorithm stops iterating.
Returns :
channel_capacity : float
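For orientation, here is a textbook Blahut–Arimoto iteration written in pure Python. It is a generic sketch and not the package's implementation: it assumes the transition matrix is a list of rows that each sum to 1, returns the capacity in bits, and assumes that Rashomon Capacity is obtained as 2 raised to that capacity, which matches the [1, c] range described for the Rashomon Capacity methods.

```python
import math


def channel_capacity(transition_matrix, max_iterations=1000, tolerance=1e-8):
    """Blahut-Arimoto estimate of channel capacity in bits (illustrative sketch)."""
    m = len(transition_matrix)
    c = len(transition_matrix[0])
    r = [1.0 / m] * m  # input distribution over models, initialised uniformly
    lower = 0.0
    for _ in range(max_iterations):
        # Output distribution q(y) = sum_x r(x) * P(y | x).
        q = [sum(r[i] * transition_matrix[i][j] for i in range(m)) for j in range(c)]
        # KL divergence between each row P(. | x) and q, in nats.
        d = [
            sum(p * math.log(p / q[j]) for j, p in enumerate(row) if p > 0)
            for row in transition_matrix
        ]
        weights = [r[i] * math.exp(d[i]) for i in range(m)]
        total = sum(weights)
        lower = math.log(total)  # lower bound on the capacity
        upper = max(d)           # upper bound on the capacity
        r = [w / total for w in weights]
        if upper - lower < tolerance:
            break
    return lower / math.log(2)  # convert nats to bits


identical = [[0.7, 0.3], [0.7, 0.3]]  # both models agree completely
opposite = [[1.0, 0.0], [0.0, 1.0]]   # models are certain and contradict each other
capacity_identical = channel_capacity(identical)
capacity_opposite = channel_capacity(opposite)
rashomon_capacity_opposite = 2 ** capacity_opposite
```

Identical rows give a capacity of about 0 bits, while two fully contradictory models on a two-class task give about 1 bit, the maximum for c = 2.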
sample_index, threshold)
Method for calculating Rashomon Capacity for a given sample and specified threshold. The parameter sample_index should be the index of the sample for which the Rashomon Capacity is to be calculated.
The threshold specifies the probability cutoff: if the probability of the positive class (1) is greater than the threshold, the sample is assigned the positive label (1), otherwise the negative label (0). By default, the threshold is set to 0.5.
The output value is a float in the range [1, c], where c is the number of classes in the classification task.
If Rashomon Capacity equals 1, then all the models produced the same outputs for the given sample, so there is no predictive multiplicity.
If Rashomon Capacity equals c, the models produce maximally diverse predictions for the given sample, resulting in the highest predictive multiplicity.
Note: this method is only available for binary classification tasks.
Read more about Rashomon Capacity: Rashomon Capacity: A Metric for Predictive Multiplicity in Classification (Definition 2).
Parameters :
sample_index : int
threshold : float
Decision threshold for binary classification. If the predicted probability of the positive class (1) is greater than this threshold, the sample is assigned a label 1 (positive); otherwise, it is assigned 0 (negative).
Returns :
rashomon_capacity : float
sample_index )
Method for calculating Rashomon Capacity for a given sample, calculated on predictions with labels instead of probabilistic predictions. The parameter sample_index should be the index of the sample for which the Rashomon Capacity is to be calculated. The output value is a float in the range [1, c], where c is the number of classes in the classification task.
If Rashomon Capacity equals 1, then all the models produced the same outputs for the given sample, so there is no predictive multiplicity.
If Rashomon Capacity equals c, the models produce maximally diverse predictions for the given sample, resulting in the highest predictive multiplicity.
Read more about Rashomon Capacity: Rashomon Capacity: A Metric for Predictive Multiplicity in Classification (Definition 2).
Parameters :
sample_index : int
Returns :
rashomon_capacity : float
Method for calculating Cohen's Kappa metric for every model in the Rashomon Set relative to the base model.
Cohen's Kappa measures agreement between two models' predictions, adjusted for the agreement expected to occur by chance.
Values range from -1 to 1, where 1 indicates perfect agreement between the models' predictions, 0 means the agreement that would be expected by chance, and -1 denotes no agreement, where the models make opposite predictions.
The formula for calculating Cohen's Kappa is defined as follows:
$$
\displaystyle
\kappa = \frac{p_o - p_e}{1 - p_e}
$$
where \(p_o\) denotes observed agreement between two models (proportion of matching predictions generated by both models, see percent agreement metric) and \(p_e\) is the expected agreement by chance, calculated from the marginal proportions of predictions for each class by the models.
Returns :
cohens_kappa_dict : dict
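The formula can be checked by hand on a small example. The sketch below computes Cohen's Kappa for two label vectors in plain Python; the function name is illustrative, and returning 1.0 when both models predict one identical class everywhere (where the formula is undefined) is a convention chosen only for this sketch.

```python
from collections import Counter


def cohens_kappa(preds_a, preds_b):
    """Cohen's Kappa between two prediction vectors of equal length."""
    n = len(preds_a)
    # Observed agreement: share of matching predictions.
    p_o = sum(a == b for a, b in zip(preds_a, preds_b)) / n
    # Expected agreement from the marginal class frequencies of both models.
    count_a, count_b = Counter(preds_a), Counter(preds_b)
    p_e = sum(count_a[k] * count_b[k] for k in count_a) / (n * n)
    if p_e == 1.0:
        return 1.0  # undefined case; convention chosen for this sketch
    return (p_o - p_e) / (1 - p_e)


base = [0, 0, 1, 1]
other = [0, 0, 1, 0]
kappa = cohens_kappa(base, other)
kappa_opposite = cohens_kappa([0, 1, 0, 1], [1, 0, 1, 0])
```

For base and other, the observed agreement is 0.75 and the expected agreement is 0.5, which gives a kappa of 0.5; the fully opposite pair gives -1.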
Method for calculating the Cohen's Kappa metric for every pair of models in the Rashomon Set. It returns a symmetric matrix where the entry at [i, j] represents the Cohen's Kappa score between model i and model j.
Diagonal entries are 1.0, representing a model compared with itself.
Returns :
kappa_matrix : pd.DataFrame
delta )
Method for returning all Rashomon Set related metrics and attributes in a dictionary format. The returned properties are: base model, rashomon set size, task type, number of classes, rashomon ratio, pattern rashomon ratio, ambiguity, discrepancy, probabilistic ambiguity, probabilistic discrepancy, VPRs, agreement rates, percent agreements, mean rashomon capacity, min rashomon capacity, max rashomon capacity, std rashomon capacity.
Parameters :
delta : float, default = 0.1
delta parameter used in the probabilistic_ambiguity() and probabilistic_discrepancy() methods for the binary task type
Returns :
metrics : dict
delta )
Method for printing key Rashomon Set related metrics and attributes on the console. The printed properties are: base metric, base model, models with the same score as base, best and worst score across all rashomon set models, rashomon set size, task type, number of classes, rashomon ratio, pattern rashomon ratio, ambiguity, discrepancy, probabilistic ambiguity, probabilistic discrepancy, min agreement rate, max agreement rate, std agreement rate, percent agreements, mean rashomon capacity, min rashomon capacity, max rashomon capacity, std rashomon capacity.
Parameters :
delta : float, default = 0.1
delta parameter used in the probabilistic_ambiguity() and probabilistic_discrepancy() methods for the binary task type