H2OConverter
This is a subclass of the Converter abstract class, which transforms models trained with H2O into a leaderboard and dictionaries that can be used to build the Rashomon Set. It relies on the H2O framework, which requires Java to be installed.
models_directory : Path
Path object navigating to the directory with saved models from H2O framework.
test_data : H2OFrame
Test data for analysis in a H2OFrame format.
target_column : str
Name of the target column. Used to determine the task type - binary or multiclass classification.
df_name : str
The name of the dataset to be used while saving converted results.
feature_imp_needed : bool, default = True
Whether to obtain feature importances from the trained models. Setting this to True can lengthen the runtime of the .convert() method.
task_type : str
Name of the task type, 'binary' or 'multiclass', indicating whether the models were trained for binary or multiclass classification. Obtained from the determine_task_type() method.
loaded_models : list
List containing all models that were loaded from the given models_directory. Obtained from the load_all_models() method.
prediction_frames_dict : dict
A dictionary with loaded models as keys and pd.DataFrames of predictions returned by the H2O models as values. Obtained from the get_prediction_frames_dict() method.
class_prediction_dict : dict
A dictionary with loaded models as keys and class predictions obtained from the H2O models, in pd.Series format, as values. Calculated by the get_class_predictions_dict() method.
proba_prediction_dict : dict
A dictionary with loaded models as keys and class probability predictions obtained from the H2O models, in pd.DataFrame format, as values. Calculated by the get_proba_predictions_dict() method.
classes : list
List of all classes found for a given task. Obtained from the test_data.
MULTICLASS_METRICS_DICT : dict
A dictionary mapping the names of typical multiclass classification evaluation metrics to their implementation functions from sklearn.
BINARY_METRICS_DICT : dict
A dictionary mapping the names of typical binary classification evaluation metrics to their implementation functions from sklearn.
models_directory : Path
Value of the models_directory parameter.
test_data : H2OFrame
Value of the test_data parameter.
target_column : str
Value of the target_column parameter.
df_name : str
Value of the df_name parameter.
feature_imp_needed : bool, default = True
Value of the feature_imp_needed parameter.
Determines whether the task for a given dataset is binary or multiclass classification, based on the target_column and test_data parameters. If any other task type is detected, the method raises a ValueError.
Returns :
task_type : str
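The check described above can be sketched roughly as follows. This is a minimal illustration, not the class's actual implementation: it assumes a pandas DataFrame in place of the H2OFrame and decides purely by the number of distinct target values, whereas the real method may also inspect the column type. The standalone function below is hypothetical.

```python
import pandas as pd


def determine_task_type(test_data: pd.DataFrame, target_column: str) -> str:
    """Classify the task by counting distinct target values (illustrative only)."""
    n_classes = test_data[target_column].nunique()
    if n_classes == 2:
        return "binary"
    if n_classes > 2:
        return "multiclass"
    # Fewer than two classes is not a classification task we can handle.
    raise ValueError(
        f"Expected at least 2 classes in '{target_column}', found {n_classes}."
    )


binary_df = pd.DataFrame({"x": [1, 2, 3, 4], "y": ["a", "b", "a", "b"]})
multi_df = pd.DataFrame({"x": [1, 2, 3], "y": ["a", "b", "c"]})
binary_task = determine_task_type(binary_df, "y")
multi_task = determine_task_type(multi_df, "y")
```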
Loads all H2O model objects from the given models_directory and stores them in a list.
Returns :
loaded_models : list
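One way such a loading step could look is sketched below. To keep the example runnable without Java, the loading function is passed in as a parameter; the helper name, the `loader` argument and the sorted iteration order are all assumptions made for illustration.

```python
import tempfile
from pathlib import Path
from typing import Any, Callable, List


def load_all_models(models_directory: Path, loader: Callable[[str], Any]) -> List[Any]:
    """Apply `loader` to every file in `models_directory`, in sorted path order."""
    model_paths = sorted(p for p in models_directory.iterdir() if p.is_file())
    return [loader(str(path)) for path in model_paths]


# Demonstration with placeholder files and a stand-in loader that
# simply returns the file name instead of a real model object.
with tempfile.TemporaryDirectory() as tmp:
    directory = Path(tmp)
    for name in ("model_b", "model_a"):
        (directory / name).write_text("placeholder")
    loaded_models = load_all_models(directory, loader=lambda path: Path(path).name)
```

With a running H2O cluster (after h2o.init()), h2o.load_model could presumably be passed as the loader, since it accepts the path of a saved model.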
Used to obtain predictions returned by all H2O models and save them in a dictionary format with model names as keys and predictions dataframes as values.
Returns :
predictions : dict[str, pd.DataFrame]
Used to obtain class predictions returned by all H2O models and save them in a dictionary with model names as keys and predictions in pd.Series format as values.
Returns :
class_predictions_dict : dict[str, pd.Series]
Used to obtain class probability predictions returned by all H2O models and save them in a dictionary with model names as keys and probability predictions in pd.DataFrame format as values.
Returns :
proba_predictions_dict : dict[str, pd.DataFrame]
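The relationship between the prediction frames and the two derived dictionaries can be illustrated as below. The sketch assumes each prediction frame has already been converted to pandas and follows the usual H2O classification layout: a `predict` column with the predicted class plus one probability column per class. The helper name and the sample values are hypothetical.

```python
import pandas as pd


def split_prediction_frames(prediction_frames: dict) -> tuple:
    """Split each frame into class predictions (Series) and probabilities (DataFrame)."""
    class_predictions = {}
    proba_predictions = {}
    for model_name, frame in prediction_frames.items():
        # The `predict` column holds the predicted class label.
        class_predictions[model_name] = frame["predict"]
        # Every remaining column is assumed to be a per-class probability.
        proba_predictions[model_name] = frame.drop(columns="predict")
    return class_predictions, proba_predictions


frames = {
    "gbm_1": pd.DataFrame(
        {"predict": [0, 1, 1], "p0": [0.8, 0.3, 0.1], "p1": [0.2, 0.7, 0.9]}
    ),
}
class_preds, proba_preds = split_prediction_frames(frames)
```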
Used to obtain feature importances from all H2O models and save them in a dictionary format with model names as keys and features sorted in a descending order based on their importance as values.
Returns :
feature_importance_dict : dict[str, list]
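As a rough illustration of the ordering step, the sketch below ranks features from a variable importance table. It assumes the table has `variable` and `relative_importance` columns, which is the layout H2O's varimp(use_pandas=True) is expected to return; the helper name and the sample numbers are made up.

```python
import pandas as pd


def rank_features(varimp: pd.DataFrame) -> list:
    """Return feature names sorted from most to least important."""
    ordered = varimp.sort_values("relative_importance", ascending=False)
    return ordered["variable"].tolist()


# Hypothetical importance table for a single model.
varimp_tables = {
    "gbm_1": pd.DataFrame(
        {
            "variable": ["age", "income", "height"],
            "relative_importance": [12.5, 40.0, 3.1],
        }
    ),
}
feature_importance_dict = {
    model_name: rank_features(table) for model_name, table in varimp_tables.items()
}
```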
Used to extract the target column from the test dataset.
Returns :
y_true : pd.DataFrame
Calculates multiclass classification evaluation metrics from MULTICLASS_METRICS_DICT for all loaded models based on their prediction vectors and target column. Calculated results are stored in a pd.DataFrame object containing all evaluation metric scores for each model.
Returns :
metrics_df : pd.DataFrame
Calculates binary classification evaluation metrics from BINARY_METRICS_DICT for all loaded models based on their prediction vectors and target column. Calculated results are stored in a pd.DataFrame object containing all evaluation metric scores for each model.
Returns :
metrics_df : pd.DataFrame
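The metric calculation for both task types follows the same pattern, which might be sketched as follows. The contents of the metrics dictionary here are an assumption (the documented BINARY_METRICS_DICT and MULTICLASS_METRICS_DICT may hold different metrics), and the helper function is hypothetical.

```python
import pandas as pd
from sklearn.metrics import accuracy_score, balanced_accuracy_score, f1_score

# Assumed example metrics; the real dictionaries may differ.
EXAMPLE_BINARY_METRICS = {
    "accuracy": accuracy_score,
    "balanced_accuracy": balanced_accuracy_score,
    "f1": f1_score,
}


def calculate_metrics(y_true, class_predictions: dict, metrics: dict) -> pd.DataFrame:
    """Build a table with one row per model and one column per metric."""
    rows = {
        model_name: {name: fn(y_true, y_pred) for name, fn in metrics.items()}
        for model_name, y_pred in class_predictions.items()
    }
    return pd.DataFrame.from_dict(rows, orient="index")


y_true = pd.Series([0, 1, 1, 0])
class_predictions = {
    "model_a": pd.Series([0, 1, 1, 0]),
    "model_b": pd.Series([0, 1, 0, 0]),
}
metrics_df = calculate_metrics(y_true, class_predictions, EXAMPLE_BINARY_METRICS)
```

Keeping the metric functions in a dictionary means the same loop serves both task types; only the dictionary passed in changes.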
leaderboard, predictions_dict, proba_predictions_dict, feature_importance_dict, y_true, saving_path)
Saves the created leaderboard and all dictionaries to disk in .csv and .pickle formats.
Parameters :
leaderboard: pd.DataFrame
created leaderboard to be saved as csv
predictions_dict : dict
created class predictions dict to be saved as pickle
proba_predictions_dict : dict
created proba predictions dict to be saved as pickle
feature_importance_dict : dict
created feature importance dict to be saved as pickle
y_true : pd.DataFrame
extracted target column to be saved as a csv
saving_path : Path
path to a directory where the results should be saved; if not specified, a new directory named from the timestamp and df_name is created by default
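A minimal sketch of this saving behaviour is shown below for a leaderboard and a single dictionary. The file names, the timestamp format and the `base_dir` argument are assumptions chosen for illustration; only the general idea (CSV for tables, pickle for dictionaries, a timestamp plus df_name default directory) comes from the description above.

```python
import pickle
import tempfile
from datetime import datetime
from pathlib import Path
from typing import Optional

import pandas as pd


def save_results(
    leaderboard: pd.DataFrame,
    predictions_dict: dict,
    df_name: str,
    saving_path: Optional[Path] = None,
    base_dir: Path = Path("."),
) -> Path:
    """Write the leaderboard as CSV and the dictionary as pickle; return the directory."""
    if saving_path is None:
        # Assumed default naming scheme: <timestamp>_<df_name>
        timestamp = datetime.now().strftime("%Y%m%d_%H%M%S")
        saving_path = base_dir / f"{timestamp}_{df_name}"
    saving_path.mkdir(parents=True, exist_ok=True)
    leaderboard.to_csv(saving_path / "leaderboard.csv", index=False)
    with open(saving_path / "predictions_dict.pickle", "wb") as handle:
        pickle.dump(predictions_dict, handle)
    return saving_path


# Round trip inside a temporary directory so nothing is left on disk.
with tempfile.TemporaryDirectory() as tmp:
    out_dir = save_results(
        pd.DataFrame({"model": ["model_a"], "accuracy": [0.9]}),
        {"model_a": [0, 1, 1]},
        df_name="demo",
        base_dir=Path(tmp),
    )
    created_name = out_dir.name
    reloaded_leaderboard = pd.read_csv(out_dir / "leaderboard.csv")
    with open(out_dir / "predictions_dict.pickle", "rb") as handle:
        reloaded_predictions = pickle.load(handle)
```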
saving_path)
Final method, used to create and save the leaderboard produced by calculate_binary_metrics() or calculate_multiclass_metrics(), depending on the task type, together with the class_predictions_dict, proba_predictions_dict and feature_importance_dict created by the corresponding methods. If the feature_imp_needed parameter is False, feature_importance_dict is not created and NaN is returned in its place.
Parameters :
saving_path : Path
path to a directory where the results should be saved; if not specified, a new directory named from the timestamp and df_name is created by default
Returns :
leaderboard : pd.DataFrame
class_predictions_dict : dict[str, pd.Series]
proba_predictions_dict : dict[str, pd.DataFrame]
feature_importance_dict : dict[str, list]
y_true : pd.DataFrame