H2OConverter

class arsa_ml.converters.H2OConverter(models_directory, test_data, target_column, df_name, feature_imp_needed=True)

A subclass of the Converter abstract class, which is used to transform trained H2O models into a leaderboard and dictionaries that can be used to build the Rashomon Set. It relies on the H2O framework, which requires Java to be installed.

Parameters

models_directory : Path
Path object navigating to the directory with saved models from H2O framework.
test_data : H2OFrame
Test data for analysis, provided as an H2OFrame.
target_column : str
Name of the target column. Used to determine the task type - binary or multiclass classification.
df_name : str
The name of the dataset to be used while saving converted results.
feature_imp_needed : bool, default = True
Whether to obtain feature importances from the trained models. Setting it to True can lengthen the runtime of the .convert() method.

Attributes

task_type : str
Name of the task type, either 'binary' or 'multiclass', indicating whether the models were trained for binary or multiclass classification. Obtained from the determine_task_type() method.
loaded_models : list
List containing all models that were loaded from the given models_directory. Obtained from load_all_models() method.
prediction_frames_dict : dict
A dictionary containing loaded models as keys and pd.DataFrames containing predictions returned by H2O models as values. Obtained from the get_prediction_frames_dict() method.
class_prediction_dict : dict
A dictionary containing loaded models as keys and pd.Series containing class predictions obtained from H2O models as values. Calculated by the get_class_predictions_dict() method.
proba_prediction_dict : dict
A dictionary containing loaded models as keys and pd.DataFrames containing class probability predictions obtained from H2O models as values. Calculated by the get_proba_predictions_dict() method.
classes : list
List of all classes found for a given task. Obtained from the test_data.
MULTICLASS_METRICS_DICT : dict
A dictionary mapping the names of typical multiclass classification evaluation metrics to their implementation functions from sklearn.
BINARY_METRICS_DICT : dict
A dictionary mapping the names of typical binary classification evaluation metrics to their implementation functions from sklearn.
models_directory : Path
models_directory parameter value
test_data : H2OFrame
test_data parameter value
target_column : str
target_column parameter value
df_name : str
df_name parameter value
feature_imp_needed : bool, default = True
feature_imp_needed parameter value

Methods

determine_task_type()

Determines whether the task for a given dataset is binary or multiclass classification, based on the target_column and test_data parameters. If any other task type is detected, the method raises a ValueError.

Returns :
task_type : str
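
To illustrate the decision described above, here is a minimal sketch that is not the library's implementation. It assumes the task type is inferred from the number of distinct values in the target column; the function name sketch_task_type is hypothetical, and the real method works on an H2OFrame rather than a plain Python iterable.

```python
from typing import Hashable, Iterable


def sketch_task_type(target_values: Iterable[Hashable]) -> str:
    """Hypothetical stand-in: infer the task type from distinct target values."""
    n_classes = len(set(target_values))
    if n_classes == 2:
        return "binary"
    if n_classes > 2:
        return "multiclass"
    # Fewer than two classes cannot be a classification task.
    raise ValueError(
        f"Expected at least 2 distinct target values, found {n_classes}."
    )
```

For example, sketch_task_type(["yes", "no", "yes"]) gives 'binary', while a target with three or more distinct labels gives 'multiclass'.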

load_all_models()

Loads all H2O model objects from the given models_directory and stores them in a list.

Returns :
loaded_models : list

get_prediction_frames()

Used to obtain predictions returned by all H2O models and save them in a dictionary format with model names as keys and predictions dataframes as values.

Returns :
predictions : dict[str, pd.DataFrame]

get_class_predictions_dict()

Used to obtain class predictions returned by all H2O models and save them in a dictionary format with model names as keys and predictions in a pd.Series format as values.

Returns :
class_predictions_dict : dict[str, pd.Series]

get_proba_predictions_dict()

Used to obtain class probability predictions returned by all H2O models and save them in a dictionary format with model names as keys and probability predictions in a pd.DataFrame format as values.

Returns :
proba_predictions_dict : dict[str, pd.DataFrame]

get_feature_importance_dict()

Used to obtain feature importances from all H2O models and save them in a dictionary format with model names as keys and features sorted in a descending order based on their importance as values.

Returns :
feature_importance_dict : dict[str, list]
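
The shape of the returned dictionary can be pictured with the following sketch. It assumes each model exposes a mapping from feature name to an importance score; the function name rank_features and the sample scores are hypothetical, and the way importances are actually read from H2O models is not shown here.

```python
from typing import Dict, List


def rank_features(importances: Dict[str, float]) -> List[str]:
    """Return feature names sorted by descending importance."""
    return sorted(importances, key=importances.get, reverse=True)


# Hypothetical per-model importance scores, for illustration only.
raw_importances = {
    "model_a": {"age": 0.1, "income": 0.7, "tenure": 0.2},
    "model_b": {"age": 0.5, "income": 0.3, "tenure": 0.2},
}

# Model names as keys, ranked feature lists as values.
feature_importance_dict = {
    name: rank_features(scores) for name, scores in raw_importances.items()
}
```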

extract_target_column()

Used to extract the target column from the test dataset.

Returns :
y_true : pd.DataFrame

calculate_multiclass_metrics()

Calculates multiclass classification evaluation metrics from MULTICLASS_METRICS_DICT for all loaded models based on their prediction vectors and target column. Calculated results are stored in a pd.DataFrame object containing all evaluation metric scores for each model.

Returns :
metrics_df : pd.DataFrame

calculate_binary_metrics()

Calculates binary classification evaluation metrics from BINARY_METRICS_DICT for all loaded models based on their prediction vectors and target column. Calculated results are stored in a pd.DataFrame object containing all evaluation metric scores for each model.

Returns :
metrics_df : pd.DataFrame
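
The pattern shared by both metric methods, a dictionary of metric names mapped to functions that is applied to every model's predictions, can be sketched as follows. This is an illustration under assumptions: the two metric functions are simple pure-Python stand-ins for the sklearn implementations, SKETCH_METRICS_DICT and build_leaderboard are hypothetical names, and the result is a list of row dictionaries rather than a pd.DataFrame.

```python
from typing import Callable, Dict, List, Sequence


def accuracy(y_true: Sequence, y_pred: Sequence) -> float:
    """Fraction of predictions equal to the true label."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def macro_recall(y_true: Sequence, y_pred: Sequence) -> float:
    """Unweighted mean of per-class recall."""
    recalls = []
    for cls in sorted(set(y_true)):
        idx = [i for i, t in enumerate(y_true) if t == cls]
        recalls.append(sum(y_pred[i] == cls for i in idx) / len(idx))
    return sum(recalls) / len(recalls)


# Hypothetical stand-in for BINARY_METRICS_DICT / MULTICLASS_METRICS_DICT.
SKETCH_METRICS_DICT: Dict[str, Callable[[Sequence, Sequence], float]] = {
    "accuracy": accuracy,
    "macro_recall": macro_recall,
}


def build_leaderboard(y_true: Sequence,
                      class_predictions: Dict[str, Sequence]) -> List[dict]:
    """One row per model, one column per metric in SKETCH_METRICS_DICT."""
    rows = []
    for model_name, y_pred in class_predictions.items():
        row = {"model": model_name}
        for metric_name, metric_fn in SKETCH_METRICS_DICT.items():
            row[metric_name] = metric_fn(y_true, y_pred)
        rows.append(row)
    return rows
```

Keeping the metrics in a dictionary means a metric can be added or removed without touching the loop that builds the leaderboard.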

save_results(leaderboard, predictions_dict, proba_predictions_dict, feature_importance_dict, y_true, saving_path)

Saves the created leaderboard, all dictionaries and the target column to disk in .csv and .pickle formats.

Parameters :
leaderboard : pd.DataFrame
created leaderboard to be saved as csv

predictions_dict : dict
created class predictions dict to be saved as pickle

proba_predictions_dict : dict
created proba predictions dict to be saved as pickle

feature_importance_dict : dict
created feature importance dict to be saved as pickle

y_true : pd.DataFrame
extracted target column to be saved as a csv

saving_path : Path
path to a directory where the results should be saved; if not specified, a new directory named from the timestamp and df_name is created by default
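
The default-directory behaviour and the pickle output described above can be sketched as follows. The timestamp format, the underscore separator, the base_dir argument and both function names are assumptions made for illustration; the source only states that a timestamp and df_name are combined into a new directory.

```python
import pickle
from datetime import datetime
from pathlib import Path
from typing import Optional


def resolve_saving_path(df_name: str,
                        saving_path: Optional[Path] = None,
                        base_dir: Path = Path(".")) -> Path:
    """Return the output directory, creating it if it does not exist."""
    if saving_path is None:
        # Assumed naming scheme: <timestamp>_<df_name>
        stamp = datetime.now().strftime("%Y%m%d_%H%M%S")
        saving_path = base_dir / f"{stamp}_{df_name}"
    saving_path = Path(saving_path)
    saving_path.mkdir(parents=True, exist_ok=True)
    return saving_path


def save_dict_as_pickle(obj: dict, saving_path: Path, file_name: str) -> Path:
    """Serialize a dictionary into saving_path/file_name with pickle."""
    target = saving_path / file_name
    with target.open("wb") as fh:
        pickle.dump(obj, fh)
    return target
```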

convert(saving_path)

Final method that creates and saves the leaderboard, built by calculate_binary_metrics() or calculate_multiclass_metrics() depending on the task type, together with class_predictions_dict, proba_predictions_dict and feature_importance_dict created by the corresponding methods. If the feature_imp_needed parameter is False, feature_importance_dict is not created and NaN is returned in its place.

Parameters :
saving_path : Path
path to a directory where the results should be saved; if not specified, a new directory named from the timestamp and df_name is created by default

Returns :
leaderboard : pd.DataFrame

class_predictions_dict : dict[str, pd.Series]

proba_predictions_dict : dict[str, pd.DataFrame]

feature_importance_dict : dict[str, list]

y_true : pd.DataFrame
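
Putting the pieces together, a typical call sequence might look like the sketch below. It is an assumption-laden illustration: the wrapper function run_h2o_conversion and its test_csv argument are hypothetical, the test data is assumed to live in a CSV file, and convert() is assumed to return its five results as a tuple in the order listed above. The constructor arguments follow the class signature documented at the top of this page.

```python
from pathlib import Path


def run_h2o_conversion(models_directory: Path, test_csv: Path,
                       target_column: str, df_name: str, saving_path: Path):
    """Hypothetical wrapper: load test data, convert the models, return results."""
    # Imported lazily so this module loads even without H2O or Java present.
    import h2o
    from arsa_ml.converters import H2OConverter

    h2o.init()  # starts or connects to a local H2O cluster (requires Java)
    test_data = h2o.import_file(str(test_csv))  # returns an H2OFrame

    converter = H2OConverter(
        models_directory=models_directory,
        test_data=test_data,
        target_column=target_column,
        df_name=df_name,
        feature_imp_needed=True,
    )

    # Assumed return order, taken from the Returns list of convert().
    (leaderboard, class_predictions_dict, proba_predictions_dict,
     feature_importance_dict, y_true) = converter.convert(saving_path)
    return (leaderboard, class_predictions_dict, proba_predictions_dict,
            feature_importance_dict, y_true)
```

Passing feature_imp_needed=False instead would skip the feature importance step, in which case the fourth returned value is NaN rather than a dictionary.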