6 AutogluonTimeLimit.py
Trains a model with AutoGluon on the data at the provided paths and returns feature importance and a model leaderboard.
6.1 Parameters
- gene_data_path (str):
- Path to the gene expression data CSV file.
- For example: '../data/gene_tpm.csv'
- class_data_path (str):
- Path to the class data CSV file.
- For example: '../data/tumor_class.csv'
- label_column (str):
- Name of the column in the dataset that is the target label for prediction.
- test_size (float):
- Proportion of the data to be used as the test set.
- threshold (float):
- The threshold used to filter out rows based on the proportion of non-zero values (see the sketch after this list).
- random_feature (int, optional):
- The number of random features to select. If None, no random feature selection is performed.
- Default is None.
- num_bag_folds (int, optional):
- Please note: the description of this parameter is adapted from the AutoGluon documentation linked in References.
- Number of folds used for bagging of models. When num_bag_folds = k, training time is roughly increased by a factor of k (set num_bag_folds = 0 to disable bagging). Disabled by default (0), but values between 5-10 are recommended to maximize predictive performance. Increasing num_bag_folds will result in models with lower bias that are more prone to overfitting. num_bag_folds = 1 is an invalid value and will raise a ValueError. Values > 10 may produce diminishing returns and can even harm overall results due to overfitting. To further improve predictions, avoid increasing num_bag_folds much beyond 10 and instead increase num_bag_sets.
- Default is None.
- num_stack_levels (int, optional):
- Please note: the description of this parameter is adapted from the AutoGluon documentation linked in References.
- Number of stacking levels to use in stack ensemble. Roughly increases model training time by a factor of num_stack_levels + 1 (set num_stack_levels = 0 to disable stack ensembling). Disabled by default (0), but values between 1-3 are recommended to maximize predictive performance. To prevent overfitting, num_bag_folds >= 2 must also be set, or a ValueError will be raised.
- Default is None.
- time_limit (int, optional):
- Time limit for training in seconds.
- Default is 120.
- random_state (int, optional):
- The seed used by the random number generator.
- Default is 42.
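The effect of the threshold parameter can be pictured with a small pandas sketch. This is only an illustration of the idea, not the package's internal code; the helper name keep_expressed and its inputs are hypothetical.
import pandas as pd

# Hypothetical illustration of the threshold filter: keep rows (genes)
# whose proportion of non-zero values across samples is at least `threshold`.
def keep_expressed(df: pd.DataFrame, threshold: float) -> pd.DataFrame:
    nonzero_fraction = (df != 0).mean(axis=1)
    return df[nonzero_fraction >= threshold]

# With threshold=0.9, a gene is kept only if it is non-zero in >= 90% of samples.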
6.2 Returns
- importance (DataFrame):
- DataFrame containing feature importance.
- leaderboard (DataFrame):
- DataFrame containing model performance on the test data.
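For orientation, a minimal call might look like the sketch below, using the placeholder paths from the parameter list above; complete, executed examples follow in sections 6.4 and 6.5.
from TransProPy.AutogluonTimeLimit import Autogluon_TimeLimit

importance, leaderboard = Autogluon_TimeLimit(
    gene_data_path='../data/gene_tpm.csv',      # placeholder path
    class_data_path='../data/tumor_class.csv',  # placeholder path
    label_column='class',
    test_size=0.3,
    threshold=0.9,
    time_limit=120,   # default
    random_state=42   # default
)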
6.3 Usage of Autogluon_TimeLimit
Performing training and prediction tasks on tabular data using AutoGluon.
6.3.1 Objectives
6.3.1.1 Model Training and Selection
AutoGluon attempts various models and hyperparameter combinations within the given time limit to find the best-performing model. During training, AutoGluon outputs logs displaying performance metrics and progress information for the different models. The goal is to select the best-performing model (as judged on held-out validation data) for use in subsequent prediction tasks.
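Conceptually, this stage corresponds to AutoGluon's TabularPredictor.fit; the stand-alone sketch below shows the idea, assuming an AutoGluon 0.8-style API and hypothetical pre-made train/test CSVs that contain a 'class' label column.
from autogluon.tabular import TabularPredictor
import pandas as pd

# Hypothetical pre-made splits; the 'class' column holds the labels.
train_data = pd.read_csv('train_split.csv')
test_data = pd.read_csv('test_split.csv')

# AutoGluon searches models and hyperparameters for up to time_limit seconds,
# keeping the model that scores best on its internal validation split.
predictor = TabularPredictor(label='class').fit(train_data, time_limit=120)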
6.3.1.2 Leaderboard
The leaderboard displays performance scores of different models on the test data, typically including metrics like accuracy, precision, recall, and more. The purpose is to assist users in understanding the performance of different models to choose the most suitable model for predictions.
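Continuing the sketch above, the leaderboard on held-out data comes from a single call (silent=True merely suppresses console printing in AutoGluon 0.8):
# Rank all trained models by their score on the test data.
leaderboard = predictor.leaderboard(test_data, silent=True)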
6.3.1.3 Importance
Feature importance indicates which features are most critical for the model’s prediction performance. The purpose is to help users understand the importance of specific features in the data, which can be used for feature selection or further data analysis.
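The importance tables shown later in this section are produced by AutoGluon's permutation-shuffling importance; continuing the same sketch, the underlying call looks roughly like this (num_shuffle_sets=5 matches the "5 shuffle sets" visible in the logs below):
# Permutation importance: shuffle each feature and measure the drop in score.
importance = predictor.feature_importance(test_data, num_shuffle_sets=5)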
6.3.2 Note
Please note that AutoGluon's output may vary depending on your data and task. You can review the generated model leaderboard and feature importance to understand model performance and the significance of specific features in the data. These results can help you make better predictions and decisions.
6.4 Insignificant Correlation
- Please note: Data characteristics: the features have only weak correlation with the classification.
- Randomly shuffling a portion of the class labels simulates reducing this correlation (a minimal sketch of such shuffling follows).
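A minimal sketch of such partial shuffling is shown below; the 50% fraction is hypothetical, and the package's actual procedure for generating random_classification_class.csv may differ.
import pandas as pd

# Hypothetical illustration: re-shuffle the labels of a random 50% of samples
# to weaken the correlation between features and class.
classes = pd.read_csv('../test_TransProPy/data/class.csv', index_col=0)
idx = classes.sample(frac=0.5, random_state=42).index
classes.loc[idx, 'class'] = (
    classes.loc[idx, 'class'].sample(frac=1.0, random_state=42).values
)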
6.4.1 Import the corresponding module
from TransProPy.AutogluonTimeLimit import Autogluon_TimeLimit
6.4.2 Data
import pandas as pd
data_path = '../test_TransProPy/data/four_methods_degs_intersection.csv'
data = pd.read_csv(data_path)
print(data.iloc[:10, :10])
Unnamed: 0 TCGA-D9-A4Z2-01A TCGA-ER-A2NH-06A TCGA-BF-A5EO-01A \
0 A2M 16.808499 16.506184 17.143433
1 A2ML1 1.584963 9.517669 7.434628
2 AADAC 4.000000 2.584963 1.584963
3 AADACL2 1.000000 1.000000 0.000000
4 ABCA12 4.523562 4.321928 3.906891
5 ABCA17P 4.584963 5.169925 3.807355
6 ABCA9 9.753217 6.906891 3.459432
7 ABCB4 9.177420 6.700440 5.000000
8 ABCB5 10.134426 4.169925 9.167418
9 ABCC11 10.092757 6.491853 5.459432
TCGA-D9-A6EA-06A TCGA-D9-A4Z3-01A TCGA-GN-A26A-06A TCGA-D3-A3BZ-06A \
0 17.760739 14.766839 16.263691 16.035207
1 2.584963 1.584963 2.584963 5.285402
2 0.000000 0.000000 0.000000 3.321928
3 0.000000 1.000000 0.000000 0.000000
4 3.459432 1.584963 3.000000 4.321928
5 8.366322 7.228819 7.076816 4.584963
6 2.584963 6.357552 6.475733 7.330917
7 9.342075 10.392317 7.383704 11.032735
8 4.906891 11.340963 3.169925 11.161762
9 6.807355 4.247928 5.459432 5.977280
TCGA-D3-A51G-06A TCGA-EE-A29R-06A
0 18.355114 16.959379
1 2.584963 3.584963
2 1.000000 4.584963
3 0.000000 1.000000
4 4.807355 3.700440
5 6.409391 7.139551
6 7.954196 9.177420
7 10.082149 10.088788
8 4.643856 12.393927
9 5.614710 8.233620
import pandas as pd
data_path = '../test_TransProPy/data/random_classification_class.csv'
data = pd.read_csv(data_path)
print(data.iloc[:10, :10])
Unnamed: 0 class
0 TCGA-D9-A4Z2-01A 2
1 TCGA-ER-A2NH-06A 2
2 TCGA-BF-A5EO-01A 2
3 TCGA-D9-A6EA-06A 2
4 TCGA-D9-A4Z3-01A 1
5 TCGA-GN-A26A-06A 1
6 TCGA-D3-A3BZ-06A 1
7 TCGA-D3-A51G-06A 1
8 TCGA-EE-A29R-06A 1
9 TCGA-D3-A2JE-06A 1
6.4.3 Autogluon_TimeLimit
importance, leaderboard = Autogluon_TimeLimit(
    gene_data_path='../test_TransProPy/data/four_methods_degs_intersection.csv',
    class_data_path='../test_TransProPy/data/random_classification_class.csv',
    label_column='class',
    test_size=0.3,
    threshold=0.9,
    random_feature=None,
    num_bag_folds=None,
    num_stack_levels=None,
    time_limit=1000,
    random_state=42
)
No path specified. Models will be saved in: "AutogluonModels\ag-20240804_044508\"
Beginning AutoGluon training ... Time limit = 1000s
AutoGluon will save models to "AutogluonModels\ag-20240804_044508\"
AutoGluon Version: 0.8.2
Python Version: 3.10.11
Operating System: Windows
Platform Machine: AMD64
Platform Version: 10.0.19044
Disk Space Avail: 208.56 GB / 925.93 GB (22.5%)
Train Data Rows: 896
Train Data Columns: 1605
Label Column: class
Preprocessing data ...
AutoGluon infers your prediction problem is: 'binary' (because only two unique label-values observed).
2 unique label values: [1, 2]
If 'binary' is not the correct problem_type, please manually specify the problem_type parameter during predictor init (You may specify problem_type as one of: ['binary', 'multiclass', 'regression'])
Selected class <--> label mapping: class 1 = 2, class 0 = 1
Note: For your binary classification, AutoGluon arbitrarily selected which label-value represents positive (2) vs negative (1) class.
To explicitly set the positive_class, either rename classes to 1 and 0, or specify positive_class in Predictor init.
Using Feature Generators to preprocess the data ...
Fitting AutoMLPipelineFeatureGenerator...
Available Memory: 13791.23 MB
Train Data (Original) Memory Usage: 11.5 MB (0.1% of available memory)
Inferring data type of each feature based on column values. Set feature_metadata_in to manually specify special dtypes of the features.
Stage 1 Generators:
Fitting AsTypeFeatureGenerator...
Stage 2 Generators:
Fitting FillNaFeatureGenerator...
Stage 3 Generators:
Fitting IdentityFeatureGenerator...
Stage 4 Generators:
Fitting DropUniqueFeatureGenerator...
Stage 5 Generators:
Fitting DropDuplicatesFeatureGenerator...
Types of features in original data (raw dtype, special dtypes):
('float', []) : 1605 | ['A2M', 'A2ML1', 'ABCA12', 'ABCA17P', 'ABCA9', ...]
Types of features in processed data (raw dtype, special dtypes):
('float', []) : 1605 | ['A2M', 'A2ML1', 'ABCA12', 'ABCA17P', 'ABCA9', ...]
1.3s = Fit runtime
1605 features in original data used to generate 1605 features in processed data.
Train Data (Processed) Memory Usage: 11.5 MB (0.1% of available memory)
Data preprocessing and feature engineering runtime = 1.37s ...
AutoGluon will gauge predictive performance using evaluation metric: 'accuracy'
To change this, specify the eval_metric parameter of Predictor()
Automatically generating train/validation split with holdout_frac=0.2, Train Rows: 716, Val Rows: 180
User-specified model hyperparameters to be fit:
{
'NN_TORCH': {},
'GBM': [{'extra_trees': True, 'ag_args': {'name_suffix': 'XT'}}, {}, 'GBMLarge'],
'CAT': {},
'XGB': {},
'FASTAI': {},
'RF': [{'criterion': 'gini', 'ag_args': {'name_suffix': 'Gini', 'problem_types': ['binary', 'multiclass']}}, {'criterion': 'entropy', 'ag_args': {'name_suffix': 'Entr', 'problem_types': ['binary', 'multiclass']}}, {'criterion': 'squared_error', 'ag_args': {'name_suffix': 'MSE', 'problem_types': ['regression', 'quantile']}}],
'XT': [{'criterion': 'gini', 'ag_args': {'name_suffix': 'Gini', 'problem_types': ['binary', 'multiclass']}}, {'criterion': 'entropy', 'ag_args': {'name_suffix': 'Entr', 'problem_types': ['binary', 'multiclass']}}, {'criterion': 'squared_error', 'ag_args': {'name_suffix': 'MSE', 'problem_types': ['regression', 'quantile']}}],
'KNN': [{'weights': 'uniform', 'ag_args': {'name_suffix': 'Unif'}}, {'weights': 'distance', 'ag_args': {'name_suffix': 'Dist'}}],
}
Fitting 13 L1 models ...
Fitting model: KNeighborsUnif ... Training model for up to 998.63s of the 998.62s of remaining time.
0.5722 = Validation score (accuracy)
1.11s = Training runtime
0.15s = Validation runtime
Fitting model: KNeighborsDist ... Training model for up to 997.35s of the 997.34s of remaining time.
0.5722 = Validation score (accuracy)
0.12s = Training runtime
0.03s = Validation runtime
Fitting model: LightGBMXT ... Training model for up to 997.17s of the 997.16s of remaining time.
0.7 = Validation score (accuracy)
2.2s = Training runtime
0.01s = Validation runtime
Fitting model: LightGBM ... Training model for up to 994.93s of the 994.91s of remaining time.
0.6667 = Validation score (accuracy)
12.75s = Training runtime
0.01s = Validation runtime
Fitting model: RandomForestGini ... Training model for up to 982.1s of the 982.08s of remaining time.
0.6278 = Validation score (accuracy)
3.88s = Training runtime
0.07s = Validation runtime
Fitting model: RandomForestEntr ... Training model for up to 978.11s of the 978.09s of remaining time.
0.6167 = Validation score (accuracy)
4.4s = Training runtime
0.06s = Validation runtime
Fitting model: CatBoost ... Training model for up to 973.57s of the 973.55s of remaining time.
0.6667 = Validation score (accuracy)
67.59s = Training runtime
0.04s = Validation runtime
Fitting model: ExtraTreesGini ... Training model for up to 905.91s of the 905.89s of remaining time.
0.6167 = Validation score (accuracy)
1.35s = Training runtime
0.05s = Validation runtime
Fitting model: ExtraTreesEntr ... Training model for up to 904.46s of the 904.44s of remaining time.
0.6222 = Validation score (accuracy)
1.25s = Training runtime
0.05s = Validation runtime
Fitting model: NeuralNetFastAI ... Training model for up to 903.1s of the 903.08s of remaining time.
0.6722 = Validation score (accuracy)
3.06s = Training runtime
0.02s = Validation runtime
Fitting model: XGBoost ... Training model for up to 899.95s of the 899.93s of remaining time.
0.6778 = Validation score (accuracy)
26.11s = Training runtime
0.02s = Validation runtime
Fitting model: NeuralNetTorch ... Training model for up to 873.79s of the 873.77s of remaining time.
0.6889 = Validation score (accuracy)
53.47s = Training runtime
0.1s = Validation runtime
Fitting model: LightGBMLarge ... Training model for up to 820.19s of the 820.17s of remaining time.
0.6556 = Validation score (accuracy)
29.38s = Training runtime
0.01s = Validation runtime
Fitting model: WeightedEnsemble_L2 ... Training model for up to 360.0s of the 790.71s of remaining time.
0.7056 = Validation score (accuracy)
0.75s = Training runtime
0.0s = Validation runtime
AutoGluon training complete, total runtime = 210.09s ... Best model: "WeightedEnsemble_L2"
TabularPredictor saved. To load, use: predictor = TabularPredictor.load("AutogluonModels\ag-20240804_044508\")
Computing feature importance via permutation shuffling for 1605 features using 385 rows with 5 shuffle sets...
704.28s = Expected runtime (140.86s per shuffle set)
215.75s = Actual runtime (Completed 5 of 5 shuffle sets)
model score_test score_val pred_time_test pred_time_val fit_time pred_time_test_marginal pred_time_val_marginal fit_time_marginal stack_level can_infer fit_order
0 LightGBM 0.633766 0.666667 0.026911 0.013913 12.754798 0.026911 0.013913 12.754798 1 True 4
1 LightGBMXT 0.631169 0.700000 0.019933 0.012957 2.199846 0.019933 0.012957 2.199846 1 True 3
2 XGBoost 0.631169 0.677778 0.038872 0.018937 26.106133 0.038872 0.018937 26.106133 1 True 11
3 WeightedEnsemble_L2 0.631169 0.705556 0.073754 0.161193 4.068005 0.002990 0.000997 0.754443 2 True 14
4 ExtraTreesGini 0.631169 0.616667 0.075747 0.054816 1.352519 0.075747 0.054816 1.352519 1 True 8
5 KNeighborsUnif 0.623377 0.572222 0.050830 0.147239 1.113717 0.050830 0.147239 1.113717 1 True 1
6 KNeighborsDist 0.623377 0.572222 0.054817 0.033886 0.123587 0.054817 0.033886 0.123587 1 True 2
7 ExtraTreesEntr 0.623377 0.622222 0.077321 0.053820 1.253270 0.077321 0.053820 1.253270 1 True 9
8 RandomForestEntr 0.620779 0.616667 0.074385 0.064918 4.396520 0.074385 0.064918 4.396520 1 True 6
9 NeuralNetTorch 0.615584 0.688889 0.088704 0.104267 53.465611 0.088704 0.104267 53.465611 1 True 12
10 RandomForestGini 0.610390 0.627778 0.073753 0.065792 3.883219 0.073753 0.065792 3.883219 1 True 5
11 NeuralNetFastAI 0.589610 0.672222 0.039866 0.018937 3.063668 0.039866 0.018937 3.063668 1 True 10
12 LightGBMLarge 0.576623 0.655556 0.016943 0.012806 29.383221 0.016943 0.012806 29.383221 1 True 13
13 CatBoost 0.571429 0.666667 0.083720 0.039867 67.590781 0.083720 0.039867 67.590781 1 True 7
6.4.4 Return
# Filtering the DataFrame to get features where importance > 0 and p_value < 0.05
filtered_importance = importance[(importance['importance'] > 0) & (importance['p_value'] < 0.05)]
# Counting the number of such features
number_of_features = filtered_importance.shape[0]
# Printing the result
print(f"Number of features with importance > 0 and p_value < 0.05: {number_of_features}")
Number of features with importance > 0 and p_value < 0.05: 703
# Display the top 20 rows of the Importance DataFrame
print("Top 20 rows of Importance DataFrame:")
print(importance.head(20))
Top 20 rows of Importance DataFrame:
importance stddev p_value n p99_high p99_low
RP11-641D5.1 0.012987 0.002597 0.000182 5 0.018335 0.007639
CKM 0.012468 0.002173 0.000106 5 0.016942 0.007993
ZBED3-AS1 0.011948 0.003939 0.001234 5 0.020059 0.003837
PPP1R14C 0.010909 0.005323 0.005082 5 0.021869 -0.000051
RPS10-NUDT3 0.010909 0.006201 0.008525 5 0.023677 -0.001859
AOX1 0.010909 0.002173 0.000179 5 0.015384 0.006435
RP11-411B6.6 0.010909 0.003387 0.000985 5 0.017882 0.003936
NTRK1 0.010390 0.007113 0.015453 5 0.025036 -0.004257
GAPDHP1 0.010390 0.001837 0.000112 5 0.014171 0.006608
SPP1 0.009870 0.003387 0.001431 5 0.016843 0.002897
CLDN1 0.009351 0.004718 0.005705 5 0.019066 -0.000365
ADORA3 0.009351 0.003939 0.003027 5 0.017461 0.001240
AC016292.3 0.008831 0.004718 0.006931 5 0.018546 -0.000884
CORIN 0.008312 0.002845 0.001419 5 0.014170 0.002453
GAS6 0.008312 0.001162 0.000045 5 0.010703 0.005920
STMN2 0.007792 0.001837 0.000344 5 0.011574 0.004011
MAPK13 0.007792 0.001837 0.000344 5 0.011574 0.004011
LYPD6B 0.007792 0.003181 0.002704 5 0.014342 0.001242
TBC1D3B 0.007273 0.002173 0.000853 5 0.011747 0.002798
MT1M 0.007273 0.003853 0.006733 5 0.015205 -0.000660
# Display the top 20 rows of the Leaderboard DataFrame
print("\nTop 20 rows of Leaderboard DataFrame:")
print(leaderboard.head(20))
Top 20 rows of Leaderboard DataFrame:
model score_test score_val pred_time_test pred_time_val \
0 LightGBM 0.633766 0.666667 0.026911 0.013913
1 LightGBMXT 0.631169 0.700000 0.019933 0.012957
2 XGBoost 0.631169 0.677778 0.038872 0.018937
3 WeightedEnsemble_L2 0.631169 0.705556 0.073754 0.161193
4 ExtraTreesGini 0.631169 0.616667 0.075747 0.054816
5 KNeighborsUnif 0.623377 0.572222 0.050830 0.147239
6 KNeighborsDist 0.623377 0.572222 0.054817 0.033886
7 ExtraTreesEntr 0.623377 0.622222 0.077321 0.053820
8 RandomForestEntr 0.620779 0.616667 0.074385 0.064918
9 NeuralNetTorch 0.615584 0.688889 0.088704 0.104267
10 RandomForestGini 0.610390 0.627778 0.073753 0.065792
11 NeuralNetFastAI 0.589610 0.672222 0.039866 0.018937
12 LightGBMLarge 0.576623 0.655556 0.016943 0.012806
13 CatBoost 0.571429 0.666667 0.083720 0.039867
fit_time pred_time_test_marginal pred_time_val_marginal \
0 12.754798 0.026911 0.013913
1 2.199846 0.019933 0.012957
2 26.106133 0.038872 0.018937
3 4.068005 0.002990 0.000997
4 1.352519 0.075747 0.054816
5 1.113717 0.050830 0.147239
6 0.123587 0.054817 0.033886
7 1.253270 0.077321 0.053820
8 4.396520 0.074385 0.064918
9 53.465611 0.088704 0.104267
10 3.883219 0.073753 0.065792
11 3.063668 0.039866 0.018937
12 29.383221 0.016943 0.012806
13 67.590781 0.083720 0.039867
fit_time_marginal stack_level can_infer fit_order
0 12.754798 1 True 4
1 2.199846 1 True 3
2 26.106133 1 True 11
3 0.754443 2 True 14
4 1.352519 1 True 8
5 1.113717 1 True 1
6 0.123587 1 True 2
7 1.253270 1 True 9
8 4.396520 1 True 6
9 53.465611 1 True 12
10 3.883219 1 True 5
11 3.063668 1 True 10
12 29.383221 1 True 13
13 67.590781 1 True 7
6.4.5 Save Data
# Save the Importance DataFrame to a CSV file
importance.to_csv('../test_TransProPy/data/Insignificant_correlation_Autogluon_TimeLimit_importance.csv', index=False)
# Save the Leaderboard DataFrame to a CSV file
leaderboard.to_csv('../test_TransProPy/data/Insignificant_correlation_Autogluon_TimeLimit_leaderboard.csv', index=False)
6.5 Significant Correlation
- Please note: Data characteristics: the features have strong correlation with the classification.
6.5.1 Import the corresponding module
from TransProPy.AutogluonTimeLimit import Autogluon_TimeLimit
6.5.2 Data
import pandas as pd
data_path = '../test_TransProPy/data/four_methods_degs_intersection.csv'
data = pd.read_csv(data_path)
print(data.iloc[:10, :10])
Unnamed: 0 TCGA-D9-A4Z2-01A TCGA-ER-A2NH-06A TCGA-BF-A5EO-01A \
0 A2M 16.808499 16.506184 17.143433
1 A2ML1 1.584963 9.517669 7.434628
2 AADAC 4.000000 2.584963 1.584963
3 AADACL2 1.000000 1.000000 0.000000
4 ABCA12 4.523562 4.321928 3.906891
5 ABCA17P 4.584963 5.169925 3.807355
6 ABCA9 9.753217 6.906891 3.459432
7 ABCB4 9.177420 6.700440 5.000000
8 ABCB5 10.134426 4.169925 9.167418
9 ABCC11 10.092757 6.491853 5.459432
TCGA-D9-A6EA-06A TCGA-D9-A4Z3-01A TCGA-GN-A26A-06A TCGA-D3-A3BZ-06A \
0 17.760739 14.766839 16.263691 16.035207
1 2.584963 1.584963 2.584963 5.285402
2 0.000000 0.000000 0.000000 3.321928
3 0.000000 1.000000 0.000000 0.000000
4 3.459432 1.584963 3.000000 4.321928
5 8.366322 7.228819 7.076816 4.584963
6 2.584963 6.357552 6.475733 7.330917
7 9.342075 10.392317 7.383704 11.032735
8 4.906891 11.340963 3.169925 11.161762
9 6.807355 4.247928 5.459432 5.977280
TCGA-D3-A51G-06A TCGA-EE-A29R-06A
0 18.355114 16.959379
1 2.584963 3.584963
2 1.000000 4.584963
3 0.000000 1.000000
4 4.807355 3.700440
5 6.409391 7.139551
6 7.954196 9.177420
7 10.082149 10.088788
8 4.643856 12.393927
9 5.614710 8.233620
import pandas as pd
data_path = '../test_TransProPy/data/class.csv'
data = pd.read_csv(data_path)
print(data.iloc[:10, :10])
Unnamed: 0 class
0 TCGA-D9-A4Z2-01A 2
1 TCGA-ER-A2NH-06A 2
2 TCGA-BF-A5EO-01A 2
3 TCGA-D9-A6EA-06A 2
4 TCGA-D9-A4Z3-01A 2
5 TCGA-GN-A26A-06A 2
6 TCGA-D3-A3BZ-06A 2
7 TCGA-D3-A51G-06A 2
8 TCGA-EE-A29R-06A 2
9 TCGA-D3-A2JE-06A 2
6.5.3 Autogluon_TimeLimit
importance, leaderboard = Autogluon_TimeLimit(
    gene_data_path='../test_TransProPy/data/four_methods_degs_intersection.csv',
    class_data_path='../test_TransProPy/data/class.csv',
    label_column='class',
    test_size=0.3,
    threshold=0.9,
    random_feature=None,
    num_bag_folds=None,
    num_stack_levels=None,
    time_limit=1000,
    random_state=42
)
No path specified. Models will be saved in: "AutogluonModels\ag-20240804_045217\"
Beginning AutoGluon training ... Time limit = 1000s
AutoGluon will save models to "AutogluonModels\ag-20240804_045217\"
AutoGluon Version: 0.8.2
Python Version: 3.10.11
Operating System: Windows
Platform Machine: AMD64
Platform Version: 10.0.19044
Disk Space Avail: 208.52 GB / 925.93 GB (22.5%)
Train Data Rows: 896
Train Data Columns: 1605
Label Column: class
Preprocessing data ...
AutoGluon infers your prediction problem is: 'binary' (because only two unique label-values observed).
2 unique label values: [2, 1]
If 'binary' is not the correct problem_type, please manually specify the problem_type parameter during predictor init (You may specify problem_type as one of: ['binary', 'multiclass', 'regression'])
Selected class <--> label mapping: class 1 = 2, class 0 = 1
Note: For your binary classification, AutoGluon arbitrarily selected which label-value represents positive (2) vs negative (1) class.
To explicitly set the positive_class, either rename classes to 1 and 0, or specify positive_class in Predictor init.
Using Feature Generators to preprocess the data ...
Fitting AutoMLPipelineFeatureGenerator...
Available Memory: 13518.08 MB
Train Data (Original) Memory Usage: 11.5 MB (0.1% of available memory)
Inferring data type of each feature based on column values. Set feature_metadata_in to manually specify special dtypes of the features.
Stage 1 Generators:
Fitting AsTypeFeatureGenerator...
Stage 2 Generators:
Fitting FillNaFeatureGenerator...
Stage 3 Generators:
Fitting IdentityFeatureGenerator...
Stage 4 Generators:
Fitting DropUniqueFeatureGenerator...
Stage 5 Generators:
Fitting DropDuplicatesFeatureGenerator...
Types of features in original data (raw dtype, special dtypes):
('float', []) : 1605 | ['A2M', 'A2ML1', 'ABCA12', 'ABCA17P', 'ABCA9', ...]
Types of features in processed data (raw dtype, special dtypes):
('float', []) : 1605 | ['A2M', 'A2ML1', 'ABCA12', 'ABCA17P', 'ABCA9', ...]
2.2s = Fit runtime
1605 features in original data used to generate 1605 features in processed data.
Train Data (Processed) Memory Usage: 11.5 MB (0.1% of available memory)
Data preprocessing and feature engineering runtime = 2.38s ...
AutoGluon will gauge predictive performance using evaluation metric: 'accuracy'
To change this, specify the eval_metric parameter of Predictor()
Automatically generating train/validation split with holdout_frac=0.2, Train Rows: 716, Val Rows: 180
User-specified model hyperparameters to be fit:
{
'NN_TORCH': {},
'GBM': [{'extra_trees': True, 'ag_args': {'name_suffix': 'XT'}}, {}, 'GBMLarge'],
'CAT': {},
'XGB': {},
'FASTAI': {},
'RF': [{'criterion': 'gini', 'ag_args': {'name_suffix': 'Gini', 'problem_types': ['binary', 'multiclass']}}, {'criterion': 'entropy', 'ag_args': {'name_suffix': 'Entr', 'problem_types': ['binary', 'multiclass']}}, {'criterion': 'squared_error', 'ag_args': {'name_suffix': 'MSE', 'problem_types': ['regression', 'quantile']}}],
'XT': [{'criterion': 'gini', 'ag_args': {'name_suffix': 'Gini', 'problem_types': ['binary', 'multiclass']}}, {'criterion': 'entropy', 'ag_args': {'name_suffix': 'Entr', 'problem_types': ['binary', 'multiclass']}}, {'criterion': 'squared_error', 'ag_args': {'name_suffix': 'MSE', 'problem_types': ['regression', 'quantile']}}],
'KNN': [{'weights': 'uniform', 'ag_args': {'name_suffix': 'Unif'}}, {'weights': 'distance', 'ag_args': {'name_suffix': 'Dist'}}],
}
Fitting 13 L1 models ...
Fitting model: KNeighborsUnif ... Training model for up to 997.62s of the 997.6s of remaining time.
1.0 = Validation score (accuracy)
0.24s = Training runtime
0.04s = Validation runtime
Fitting model: KNeighborsDist ... Training model for up to 997.32s of the 997.3s of remaining time.
1.0 = Validation score (accuracy)
0.25s = Training runtime
0.04s = Validation runtime
Fitting model: LightGBMXT ... Training model for up to 997.01s of the 996.99s of remaining time.
1.0 = Validation score (accuracy)
2.18s = Training runtime
0.01s = Validation runtime
Fitting model: LightGBM ... Training model for up to 994.78s of the 994.76s of remaining time.
1.0 = Validation score (accuracy)
1.8s = Training runtime
0.01s = Validation runtime
Fitting model: RandomForestGini ... Training model for up to 992.94s of the 992.92s of remaining time.
1.0 = Validation score (accuracy)
1.21s = Training runtime
0.05s = Validation runtime
Fitting model: RandomForestEntr ... Training model for up to 991.64s of the 991.62s of remaining time.
1.0 = Validation score (accuracy)
1.27s = Training runtime
0.05s = Validation runtime
Fitting model: CatBoost ... Training model for up to 990.29s of the 990.27s of remaining time.
1.0 = Validation score (accuracy)
64.37s = Training runtime
0.04s = Validation runtime
Fitting model: ExtraTreesGini ... Training model for up to 925.85s of the 925.83s of remaining time.
1.0 = Validation score (accuracy)
1.27s = Training runtime
0.05s = Validation runtime
Fitting model: ExtraTreesEntr ... Training model for up to 924.49s of the 924.47s of remaining time.
1.0 = Validation score (accuracy)
1.17s = Training runtime
0.06s = Validation runtime
Fitting model: NeuralNetFastAI ... Training model for up to 923.22s of the 923.2s of remaining time.
No improvement since epoch 0: early stopping
1.0 = Validation score (accuracy)
2.03s = Training runtime
0.02s = Validation runtime
Fitting model: XGBoost ... Training model for up to 921.11s of the 921.09s of remaining time.
1.0 = Validation score (accuracy)
4.6s = Training runtime
0.02s = Validation runtime
Fitting model: NeuralNetTorch ... Training model for up to 916.46s of the 916.44s of remaining time.
1.0 = Validation score (accuracy)
23.57s = Training runtime
0.1s = Validation runtime
Fitting model: LightGBMLarge ... Training model for up to 892.76s of the 892.74s of remaining time.
1.0 = Validation score (accuracy)
4.51s = Training runtime
0.01s = Validation runtime
Fitting model: WeightedEnsemble_L2 ... Training model for up to 360.0s of the 888.16s of remaining time.
1.0 = Validation score (accuracy)
0.72s = Training runtime
0.0s = Validation runtime
AutoGluon training complete, total runtime = 112.61s ... Best model: "WeightedEnsemble_L2"
TabularPredictor saved. To load, use: predictor = TabularPredictor.load("AutogluonModels\ag-20240804_045217\")
Computing feature importance via permutation shuffling for 1605 features using 385 rows with 5 shuffle sets...
176.07s = Expected runtime (35.21s per shuffle set)
64.21s = Actual runtime (Completed 5 of 5 shuffle sets)
model score_test score_val pred_time_test pred_time_val fit_time pred_time_test_marginal pred_time_val_marginal fit_time_marginal stack_level can_infer fit_order
0 LightGBM 1.000000 1.0 0.013082 0.012956 1.801648 0.013082 0.012956 1.801648 1 True 4
1 LightGBMLarge 1.000000 1.0 0.015946 0.012242 4.505086 0.015946 0.012242 4.505086 1 True 13
2 WeightedEnsemble_L2 1.000000 1.0 0.016943 0.013239 5.222059 0.000997 0.000998 0.716973 2 True 14
3 NeuralNetFastAI 1.000000 1.0 0.039867 0.019933 2.026086 0.039867 0.019933 2.026086 1 True 10
4 KNeighborsUnif 1.000000 1.0 0.054827 0.035884 0.238206 0.054827 0.035884 0.238206 1 True 1
5 KNeighborsDist 1.000000 1.0 0.059789 0.037874 0.248050 0.059789 0.037874 0.248050 1 True 2
6 ExtraTreesEntr 1.000000 1.0 0.069767 0.058803 1.170361 0.069767 0.058803 1.170361 1 True 9
7 RandomForestGini 1.000000 1.0 0.070763 0.049529 1.210922 0.070763 0.049529 1.210922 1 True 5
8 ExtraTreesGini 1.000000 1.0 0.071760 0.053325 1.265165 0.071760 0.053325 1.265165 1 True 8
9 RandomForestEntr 1.000000 1.0 0.072756 0.048836 1.266010 0.072756 0.048836 1.266010 1 True 6
10 NeuralNetTorch 1.000000 1.0 0.080730 0.098805 23.569311 0.080730 0.098805 23.569311 1 True 12
11 LightGBMXT 0.997403 1.0 0.014950 0.011964 2.182093 0.014950 0.011964 2.182093 1 True 3
12 CatBoost 0.997403 1.0 0.084717 0.037174 64.372088 0.084717 0.037174 64.372088 1 True 7
13 XGBoost 0.994805 1.0 0.032890 0.016943 4.599743 0.032890 0.016943 4.599743 1 True 11
6.5.4 Return
# Filtering the DataFrame to get features where importance > 0 and p_value < 0.05
filtered_importance = importance[(importance['importance'] > 0) & (importance['p_value'] < 0.05)]
# Counting the number of such features
number_of_features = filtered_importance.shape[0]
# Printing the result
print(f"Number of features with importance > 0 and p_value < 0.05: {number_of_features}")
Number of features with importance > 0 and p_value < 0.05: 2
# Display the top 20 rows of the Importance DataFrame
print("Top 20 rows of Importance DataFrame:")
print(importance.head(20))
Top 20 rows of Importance DataFrame:
importance stddev p_value n p99_high p99_low
ISY1-RAB43 0.235844 0.017463 0.000004 5 0.2718 0.199888
RP11-231C14.4 0.235844 0.017463 0.000004 5 0.2718 0.199888
PSORS1C1 0.000000 0.000000 0.500000 5 0.0000 0.000000
PSMC1P1 0.000000 0.000000 0.500000 5 0.0000 0.000000
PRTG 0.000000 0.000000 0.500000 5 0.0000 0.000000
PRSS8 0.000000 0.000000 0.500000 5 0.0000 0.000000
PRSS53 0.000000 0.000000 0.500000 5 0.0000 0.000000
PRSS3 0.000000 0.000000 0.500000 5 0.0000 0.000000
PRSS22 0.000000 0.000000 0.500000 5 0.0000 0.000000
PRR19 0.000000 0.000000 0.500000 5 0.0000 0.000000
PRR15L 0.000000 0.000000 0.500000 5 0.0000 0.000000
PROM2 0.000000 0.000000 0.500000 5 0.0000 0.000000
PRODH 0.000000 0.000000 0.500000 5 0.0000 0.000000
A2M 0.000000 0.000000 0.500000 5 0.0000 0.000000
PRKCQ 0.000000 0.000000 0.500000 5 0.0000 0.000000
PRF1 0.000000 0.000000 0.500000 5 0.0000 0.000000
PRAME 0.000000 0.000000 0.500000 5 0.0000 0.000000
PPP2R2C 0.000000 0.000000 0.500000 5 0.0000 0.000000
PPP1R1B 0.000000 0.000000 0.500000 5 0.0000 0.000000
PPP1R14C 0.000000 0.000000 0.500000 5 0.0000 0.000000
# Display the top 20 rows of the Leaderboard DataFrame
print("\nTop 20 rows of Leaderboard DataFrame:")
print(leaderboard.head(20))
Top 20 rows of Leaderboard DataFrame:
model score_test score_val pred_time_test pred_time_val \
0 LightGBM 1.000000 1.0 0.013082 0.012956
1 LightGBMLarge 1.000000 1.0 0.015946 0.012242
2 WeightedEnsemble_L2 1.000000 1.0 0.016943 0.013239
3 NeuralNetFastAI 1.000000 1.0 0.039867 0.019933
4 KNeighborsUnif 1.000000 1.0 0.054827 0.035884
5 KNeighborsDist 1.000000 1.0 0.059789 0.037874
6 ExtraTreesEntr 1.000000 1.0 0.069767 0.058803
7 RandomForestGini 1.000000 1.0 0.070763 0.049529
8 ExtraTreesGini 1.000000 1.0 0.071760 0.053325
9 RandomForestEntr 1.000000 1.0 0.072756 0.048836
10 NeuralNetTorch 1.000000 1.0 0.080730 0.098805
11 LightGBMXT 0.997403 1.0 0.014950 0.011964
12 CatBoost 0.997403 1.0 0.084717 0.037174
13 XGBoost 0.994805 1.0 0.032890 0.016943
fit_time pred_time_test_marginal pred_time_val_marginal \
0 1.801648 0.013082 0.012956
1 4.505086 0.015946 0.012242
2 5.222059 0.000997 0.000998
3 2.026086 0.039867 0.019933
4 0.238206 0.054827 0.035884
5 0.248050 0.059789 0.037874
6 1.170361 0.069767 0.058803
7 1.210922 0.070763 0.049529
8 1.265165 0.071760 0.053325
9 1.266010 0.072756 0.048836
10 23.569311 0.080730 0.098805
11 2.182093 0.014950 0.011964
12 64.372088 0.084717 0.037174
13 4.599743 0.032890 0.016943
fit_time_marginal stack_level can_infer fit_order
0 1.801648 1 True 4
1 4.505086 1 True 13
2 0.716973 2 True 14
3 2.026086 1 True 10
4 0.238206 1 True 1
5 0.248050 1 True 2
6 1.170361 1 True 9
7 1.210922 1 True 5
8 1.265165 1 True 8
9 1.266010 1 True 6
10 23.569311 1 True 12
11 2.182093 1 True 3
12 64.372088 1 True 7
13 4.599743 1 True 11
6.5.5 Save Data
# Save the Importance DataFrame to a CSV file
importance.to_csv('../test_TransProPy/data/significant_correlation_Autogluon_TimeLimit_importance.csv', index=False)
# Save the Leaderboard DataFrame to a CSV file
leaderboard.to_csv('../test_TransProPy/data/significant_correlation_Autogluon_TimeLimit_leaderboard.csv', index=False)
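As a quick cross-check, the two saved importance tables can be compared to see how many features pass the significance filter in each scenario; this short sketch only re-reads the CSVs saved above.
import pandas as pd

# Compare the number of significant features between the two scenarios.
for name in ('Insignificant_correlation', 'significant_correlation'):
    path = f'../test_TransProPy/data/{name}_Autogluon_TimeLimit_importance.csv'
    imp = pd.read_csv(path)
    n_sig = ((imp['importance'] > 0) & (imp['p_value'] < 0.05)).sum()
    print(f'{name}: {n_sig} significant features')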