AutoGluonSelectML.py

Trains models with AutoGluon on the data at the provided paths and returns the feature importance and the model leaderboard.


Parameters

  • gene_data_path (str):

    • Path to the gene expression data CSV file.

    • For example: '../data/gene_tpm.csv'

  • class_data_path (str):

    • Path to the class data CSV file.

    • For example: '../data/tumor_class.csv'

  • label_column (str):

    • Name of the column in the dataset that is the target label for prediction.

  • test_size (float):

    • Proportion of the data to be used as the test set.

  • threshold (float):

    • The threshold used to filter out rows based on the proportion of non-zero values.

  • hyperparameters (dict, optional):

    • Dictionary of hyperparameters for the models.

    • For example: {'GBM': {}, 'RF': {}}

  • random_feature (int, optional):

    • The number of random features to select. If None, no random feature selection is performed.

    • Default is None.

  • num_bag_folds (int, optional):

    • Note: The description of this parameter is taken from the documentation linked in References.

    • Number of folds used for bagging of models. When num_bag_folds = k, training time is roughly increased by a factor of k (set = 0 to disable bagging). Disabled by default (0), but we recommend values between 5-10 to maximize predictive performance. Increasing num_bag_folds will result in models with lower bias but that are more prone to overfitting. num_bag_folds = 1 is an invalid value, and will raise a ValueError. Values > 10 may produce diminishing returns, and can even harm overall results due to overfitting. To further improve predictions, avoid increasing num_bag_folds much beyond 10 and instead increase num_bag_sets.

    • Default is None.

  • num_stack_levels (int, optional):

    • Note: The description of this parameter is taken from the documentation linked in References.

    • Number of stacking levels to use in stack ensemble. Roughly increases model training time by factor of num_stack_levels+1 (set = 0 to disable stack ensembling). Disabled by default (0), but we recommend values between 1-3 to maximize predictive performance. To prevent overfitting, num_bag_folds >= 2 must also be set or else a ValueError will be raised.

    • Default is None.

  • time_limit (int, optional):

    • Time limit for training in seconds.

    • Default is 120.

  • random_state (int, optional):

    • The seed used by the random number generator.

    • Default is 42.
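
The threshold and random_feature parameters describe two preprocessing steps that can be illustrated with a small sketch. This is not the module's actual implementation. The helper names filter_by_nonzero_fraction and pick_random_features are hypothetical, and two points are assumptions: that each row is one gene, and that a row is kept when its proportion of non-zero values is at least the threshold (the description above does not state the direction of the comparison).

```python
import random


def filter_by_nonzero_fraction(rows, threshold):
    """Keep rows whose fraction of non-zero values is at least `threshold`.

    Assumption: `rows` maps a row name (e.g. a gene) to its list of values,
    and rows at or above the threshold are the ones that survive.
    """
    kept = {}
    for name, values in rows.items():
        if not values:
            continue  # skip empty rows rather than divide by zero
        nonzero_fraction = sum(1 for v in values if v != 0) / len(values)
        if nonzero_fraction >= threshold:
            kept[name] = values
    return kept


def pick_random_features(names, random_feature=None, random_state=42):
    """Sample `random_feature` names without replacement; None keeps all names."""
    names = list(names)
    if random_feature is None:
        return names
    rng = random.Random(random_state)  # seeded, so the selection is reproducible
    return rng.sample(names, random_feature)


# Toy expression table (made-up values, for illustration only).
expression = {
    "geneA": [5.0, 3.2, 0.0, 1.1],  # 3 of 4 values are non-zero
    "geneB": [0.0, 0.0, 0.0, 2.4],  # 1 of 4 values is non-zero
    "geneC": [1.0, 2.0, 3.0, 4.0],  # all values are non-zero
}

kept = filter_by_nonzero_fraction(expression, threshold=0.7)
selected = pick_random_features(sorted(kept), random_feature=1, random_state=42)
```

With these toy values, geneB falls below the 0.7 threshold and is dropped, and one of the two remaining genes is then chosen using the seed.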


Returns

  • importance (DataFrame):

    • DataFrame containing feature importance.

  • leaderboard (DataFrame):

    • DataFrame containing model performance on the test data.


References

Scientific Publications

Articles

Documentation

Usage

Autogluon_TimeLimit(gene_data_path='../data/gene_tpm.csv', class_data_path='../data/tumor_class.csv', label_column='sex', test_size=0.3, threshold=0.9, hyperparameters={'GBM': {}, 'RF': {}}, random_feature=None, num_bag_folds=None, num_stack_levels=None, time_limit=120, random_state=42)
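
Before launching a long run, the constraints described for num_bag_folds and num_stack_levels under Parameters can be checked up front. The sketch below uses a hypothetical helper, check_ensemble_settings, which is not part of this module or of AutoGluon. It assumes that None means "disabled" (the same as 0) and that the two rough training-time factors combine multiplicatively. The parameter descriptions only give each factor separately.

```python
def check_ensemble_settings(num_bag_folds=None, num_stack_levels=None):
    """Validate bagging/stacking settings and return a rough time multiplier.

    Assumptions: None is treated like 0 (disabled), and the two cost
    factors from the parameter descriptions are multiplied together.
    """
    folds = num_bag_folds or 0
    levels = num_stack_levels or 0

    if folds < 0 or levels < 0:
        raise ValueError("num_bag_folds and num_stack_levels must be non-negative")
    if folds == 1:
        raise ValueError(
            "num_bag_folds = 1 is invalid; use 0 to disable bagging or a value >= 2"
        )
    if levels > 0 and folds < 2:
        raise ValueError("num_stack_levels > 0 requires num_bag_folds >= 2")

    # Bagging costs roughly a factor of k; stacking roughly num_stack_levels + 1.
    return max(folds, 1) * (levels + 1)


baseline = check_ensemble_settings()                                  # no bagging, no stacking
bagged = check_ensemble_settings(num_bag_folds=5, num_stack_levels=1)
```

The returned number is only an order-of-magnitude guide for choosing time_limit, not a measured cost.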
