7  AutogluonSelectML.py

Trains models with AutoGluon on the data at the provided paths and returns feature importances and a model leaderboard.

7.1 Parameters

  • gene_data_path (str):
    • Path to the gene expression data CSV file.
    • For example: '../data/gene_tpm.csv'
  • class_data_path (str):
    • Path to the class data CSV file.
    • For example: '../data/tumor_class.csv'
  • label_column (str):
    • Name of the column in the dataset that is the target label for prediction.
  • test_size (float):
    • Proportion of the data to be used as the test set.
  • threshold (float):
    • The threshold used to filter out rows based on the proportion of non-zero values.
  • hyperparameters (dict, optional):
    • Dictionary of hyperparameters for the models.
    • For example: {'GBM': {}, 'RF': {}}
  • random_feature (int, optional):
    • The number of random features to select. If None, no random feature selection is performed.
    • Default is None.
  • num_bag_folds (int, optional):
    • Please note: this parameter description is adapted from the AutoGluon documentation (see References).
    • Number of folds used for bagging of models. When num_bag_folds = k, training time is roughly increased by a factor of k (set = 0 to disable bagging). Disabled by default (0), but we recommend values between 5-10 to maximize predictive performance. Increasing num_bag_folds will result in models with lower bias but that are more prone to overfitting. num_bag_folds = 1 is an invalid value, and will raise a ValueError. Values > 10 may produce diminishing returns, and can even harm overall results due to overfitting. To further improve predictions, avoid increasing num_bag_folds much beyond 10 and instead increase num_bag_sets.
    • Default is None.
  • num_stack_levels (int, optional):
    • Please note: this parameter description is adapted from the AutoGluon documentation (see References).
    • Number of stacking levels to use in stack ensemble. Roughly increases model training time by factor of num_stack_levels+1 (set = 0 to disable stack ensembling). Disabled by default (0), but we recommend values between 1-3 to maximize predictive performance. To prevent overfitting, num_bag_folds >= 2 must also be set or else a ValueError will be raised.
    • Default is None.
  • time_limit (int, optional):
    • Time limit for training in seconds.
    • Default is 120.
  • random_state (int, optional):
    • The seed used by the random number generator.
    • Default is 42.

7.2 Return

  • importance (DataFrame):
    • DataFrame containing feature importance.
  • leaderboard (DataFrame):
    • DataFrame containing model performance on the test data.
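
A minimal call sketch using the example paths from Section 7.1 (placeholders; adjust the paths, label column, and hyperparameters to your own data):

from TransProPy.AutogluonSelectML import AutoGluon_SelectML

importance, leaderboard = AutoGluon_SelectML(
    gene_data_path='../data/gene_tpm.csv',
    class_data_path='../data/tumor_class.csv',
    label_column='class',
    test_size=0.3,
    threshold=0.9,
    hyperparameters={'GBM': {}, 'RF': {}},
    random_feature=None,
    num_bag_folds=None,
    num_stack_levels=None,
    time_limit=120,
    random_state=42
)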

7.3 Usage of AutoGluon_SelectML

Performing training and prediction tasks on tabular data using AutoGluon.

7.3.1 Objectives

7.3.1.1 Model Training and Selection

AutoGluon tries various models and hyperparameter combinations within the given time limit to find the best-performing model (scored on held-out validation data). During training, AutoGluon prints logs showing performance metrics and progress information for each model. The goal is to select the best-performing model for use in subsequent prediction tasks.
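
For example, the best model saved during training can be reloaded for later prediction. A minimal sketch (the save path is printed in the training log; the path below is taken from the run in Section 7.4.3):

from autogluon.tabular import TabularPredictor

# Load the predictor saved by AutoGluon (path comes from the training log)
predictor = TabularPredictor.load("AutogluonModels/ag-20240804_045523/")
# predictions = predictor.predict(new_data)  # new_data: pandas DataFrame of feature columns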

7.3.1.2 Leaderboard

The leaderboard reports each model's performance scores on the test and validation data (here the evaluation metric is accuracy; other metrics can be configured via eval_metric). Its purpose is to help users compare models and choose the most suitable one for predictions.
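
For instance, the returned leaderboard can be sorted to surface the strongest models. A minimal sketch (column names match the leaderboard shown in Section 7.4.3):

# Sort models by their test-set score, best first
top_models = leaderboard.sort_values('score_test', ascending=False)
print(top_models[['model', 'score_test', 'score_val']].head())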

7.3.1.3 Importance

Feature importance indicates which features contribute most to the model's predictive performance. Its purpose is to help users understand which features in the data matter, which can inform feature selection or further data analysis.
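
As an example, a ranked feature list for downstream selection can be extracted as follows. A minimal sketch (the feature names are stored in the index of the importance DataFrame, as shown in Section 7.4.4):

# Top 50 features by permutation importance (the index holds the gene names)
top_features = importance.sort_values('importance', ascending=False).head(50).index.tolist()
print(top_features[:10])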

7.3.2 Note

Please note that AutoGluon's output may vary depending on your data and task. Review the generated model leaderboard and feature importance to understand model performance and the significance of specific features in the data; these results can support better predictions and decisions.

7.4 Insignificant Correlation

  • Please note: Data characteristics: the features have only weak correlation with the classification.
  • Randomly shuffling a fraction of the class labels simulates this reduced correlation (a sketch follows below).
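
A minimal sketch of how such a weak-correlation class file could be generated by permuting a fraction of the labels (hypothetical: the shipped random_classification_class.csv was produced upstream, and shuffle_frac is an assumed value):

import numpy as np
import pandas as pd

rng = np.random.default_rng(42)
labels = pd.read_csv('../test_TransProPy/data/class.csv', index_col=0)
shuffle_frac = 0.5  # fraction of labels to permute (assumption)
n = int(shuffle_frac * len(labels))
idx = rng.choice(len(labels), size=n, replace=False)
# Permute the selected labels among themselves to weaken the class signal
labels.iloc[idx, 0] = rng.permutation(labels.iloc[idx, 0].to_numpy())
labels.to_csv('../test_TransProPy/data/random_classification_class.csv')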

7.4.1 Import the corresponding module

from TransProPy.AutogluonSelectML import AutoGluon_SelectML

7.4.2 Data

import pandas as pd
data_path = '../test_TransProPy/data/four_methods_degs_intersection.csv'  
data = pd.read_csv(data_path)
print(data.iloc[:10, :10]) 
  Unnamed: 0  TCGA-D9-A4Z2-01A  TCGA-ER-A2NH-06A  TCGA-BF-A5EO-01A  \
0        A2M         16.808499         16.506184         17.143433   
1      A2ML1          1.584963          9.517669          7.434628   
2      AADAC          4.000000          2.584963          1.584963   
3    AADACL2          1.000000          1.000000          0.000000   
4     ABCA12          4.523562          4.321928          3.906891   
5    ABCA17P          4.584963          5.169925          3.807355   
6      ABCA9          9.753217          6.906891          3.459432   
7      ABCB4          9.177420          6.700440          5.000000   
8      ABCB5         10.134426          4.169925          9.167418   
9     ABCC11         10.092757          6.491853          5.459432   

   TCGA-D9-A6EA-06A  TCGA-D9-A4Z3-01A  TCGA-GN-A26A-06A  TCGA-D3-A3BZ-06A  \
0         17.760739         14.766839         16.263691         16.035207   
1          2.584963          1.584963          2.584963          5.285402   
2          0.000000          0.000000          0.000000          3.321928   
3          0.000000          1.000000          0.000000          0.000000   
4          3.459432          1.584963          3.000000          4.321928   
5          8.366322          7.228819          7.076816          4.584963   
6          2.584963          6.357552          6.475733          7.330917   
7          9.342075         10.392317          7.383704         11.032735   
8          4.906891         11.340963          3.169925         11.161762   
9          6.807355          4.247928          5.459432          5.977280   

   TCGA-D3-A51G-06A  TCGA-EE-A29R-06A  
0         18.355114         16.959379  
1          2.584963          3.584963  
2          1.000000          4.584963  
3          0.000000          1.000000  
4          4.807355          3.700440  
5          6.409391          7.139551  
6          7.954196          9.177420  
7         10.082149         10.088788  
8          4.643856         12.393927  
9          5.614710          8.233620  

import pandas as pd
data_path = '../test_TransProPy/data/random_classification_class.csv'  
data = pd.read_csv(data_path)
print(data.iloc[:10, :10]) 
         Unnamed: 0  class
0  TCGA-D9-A4Z2-01A      2
1  TCGA-ER-A2NH-06A      2
2  TCGA-BF-A5EO-01A      2
3  TCGA-D9-A6EA-06A      2
4  TCGA-D9-A4Z3-01A      1
5  TCGA-GN-A26A-06A      1
6  TCGA-D3-A3BZ-06A      1
7  TCGA-D3-A51G-06A      1
8  TCGA-EE-A29R-06A      1
9  TCGA-D3-A2JE-06A      1

7.4.3 AutoGluon_SelectML

  • The core purpose of choosing AutoGluon_SelectML, selecting a larger feature set that includes both important and secondary features, is reflected in the following custom hyperparameters configuration. This setup enlists multiple model types so that a broader range of features is considered.
  • The full configuration can cover neural networks (PyTorch and FastAI, commented out in this run), gradient boosting machines (LightGBM, XGBoost, and CatBoost), random forests (RF), extremely randomized trees (XT), K-nearest neighbors (KNN), and linear models (LR).
importance, leaderboard = AutoGluon_SelectML(
    gene_data_path='../test_TransProPy/data/four_methods_degs_intersection.csv', 
    class_data_path='../test_TransProPy/data/random_classification_class.csv', 
    label_column='class', 
    test_size=0.3, 
    threshold=0.9, 
    hyperparameters={
        'GBM': {}, 
        'RF': {},
        'CAT': {}, 
        'XGB' : {},
        # 'NN_TORCH': {}, 
        # 'FASTAI': {},
        'XT': {}, 
        'KNN': {},
        'LR': {}
        },
    random_feature=None, 
    num_bag_folds=None, 
    num_stack_levels=None, 
    time_limit=1000, 
    random_state=42
    )
No path specified. Models will be saved in: "AutogluonModels\ag-20240804_045523\"
Beginning AutoGluon training ... Time limit = 1000s
AutoGluon will save models to "AutogluonModels\ag-20240804_045523\"
AutoGluon Version:  0.8.2
Python Version:     3.10.11
Operating System:   Windows
Platform Machine:   AMD64
Platform Version:   10.0.19044
Disk Space Avail:   208.49 GB / 925.93 GB (22.5%)
Train Data Rows:    896
Train Data Columns: 1605
Label Column: class
Preprocessing data ...
AutoGluon infers your prediction problem is: 'binary' (because only two unique label-values observed).
    2 unique label values:  [1, 2]
    If 'binary' is not the correct problem_type, please manually specify the problem_type parameter during predictor init (You may specify problem_type as one of: ['binary', 'multiclass', 'regression'])
Selected class <--> label mapping:  class 1 = 2, class 0 = 1
    Note: For your binary classification, AutoGluon arbitrarily selected which label-value represents positive (2) vs negative (1) class.
    To explicitly set the positive_class, either rename classes to 1 and 0, or specify positive_class in Predictor init.
Using Feature Generators to preprocess the data ...
Fitting AutoMLPipelineFeatureGenerator...
    Available Memory:                    13766.09 MB
    Train Data (Original)  Memory Usage: 11.5 MB (0.1% of available memory)
    Inferring data type of each feature based on column values. Set feature_metadata_in to manually specify special dtypes of the features.
    Stage 1 Generators:
        Fitting AsTypeFeatureGenerator...
    Stage 2 Generators:
        Fitting FillNaFeatureGenerator...
    Stage 3 Generators:
        Fitting IdentityFeatureGenerator...
    Stage 4 Generators:
        Fitting DropUniqueFeatureGenerator...
    Stage 5 Generators:
        Fitting DropDuplicatesFeatureGenerator...
    Types of features in original data (raw dtype, special dtypes):
        ('float', []) : 1605 | ['A2M', 'A2ML1', 'ABCA12', 'ABCA17P', 'ABCA9', ...]
    Types of features in processed data (raw dtype, special dtypes):
        ('float', []) : 1605 | ['A2M', 'A2ML1', 'ABCA12', 'ABCA17P', 'ABCA9', ...]
    1.3s = Fit runtime
    1605 features in original data used to generate 1605 features in processed data.
    Train Data (Processed) Memory Usage: 11.5 MB (0.1% of available memory)
Data preprocessing and feature engineering runtime = 1.37s ...
AutoGluon will gauge predictive performance using evaluation metric: 'accuracy'
    To change this, specify the eval_metric parameter of Predictor()
Automatically generating train/validation split with holdout_frac=0.2, Train Rows: 716, Val Rows: 180
User-specified model hyperparameters to be fit:
{
    'GBM': {},
    'RF': {},
    'CAT': {},
    'XGB': {},
    'XT': {},
    'KNN': {},
    'LR': {},
}
Fitting 7 L1 models ...
Fitting model: KNeighbors ... Training model for up to 998.63s of the 998.62s of remaining time.
    0.5722   = Validation score   (accuracy)
    1.09s    = Training   runtime
    0.14s    = Validation runtime
Fitting model: LightGBM ... Training model for up to 997.39s of the 997.37s of remaining time.
    0.6667   = Validation score   (accuracy)
    9.65s    = Training   runtime
    0.01s    = Validation runtime
Fitting model: RandomForest ... Training model for up to 987.68s of the 987.66s of remaining time.
    0.6278   = Validation score   (accuracy)
    3.84s    = Training   runtime
    0.06s    = Validation runtime
Fitting model: CatBoost ... Training model for up to 983.73s of the 983.71s of remaining time.
    0.6667   = Validation score   (accuracy)
    67.73s   = Training   runtime
    0.04s    = Validation runtime
Fitting model: ExtraTrees ... Training model for up to 915.93s of the 915.91s of remaining time.
    0.6167   = Validation score   (accuracy)
    1.25s    = Training   runtime
    0.06s    = Validation runtime
Fitting model: XGBoost ... Training model for up to 914.57s of the 914.55s of remaining time.
    0.6778   = Validation score   (accuracy)
    26.02s   = Training   runtime
    0.02s    = Validation runtime
Fitting model: LinearModel ... Training model for up to 888.5s of the 888.48s of remaining time.
E:\Anaconda\Anaconda\envs\TransPro\lib\site-packages\sklearn\preprocessing\_data.py:2663: UserWarning:

n_quantiles (1000) is greater than the total number of samples (716). n_quantiles is set to n_samples.
    0.6  = Validation score   (accuracy)
    3.03s    = Training   runtime
    0.07s    = Validation runtime
Fitting model: WeightedEnsemble_L2 ... Training model for up to 360.0s of the 885.33s of remaining time.
    0.6833   = Validation score   (accuracy)
    0.41s    = Training   runtime
    0.0s     = Validation runtime
AutoGluon training complete, total runtime = 115.13s ... Best model: "WeightedEnsemble_L2"
TabularPredictor saved. To load, use: predictor = TabularPredictor.load("AutogluonModels\ag-20240804_045523\")
Computing feature importance via permutation shuffling for 1605 features using 385 rows with 5 shuffle sets...
    984.4s  = Expected runtime (196.88s per shuffle set)
    325.48s = Actual runtime (Completed 5 of 5 shuffle sets)
                 model  score_test  score_val  pred_time_test  pred_time_val   fit_time  pred_time_test_marginal  pred_time_val_marginal  fit_time_marginal  stack_level  can_infer  fit_order
0             LightGBM    0.633766   0.666667        0.026909       0.014949   9.649190                 0.026909                0.014949           9.649190            1       True          2
1              XGBoost    0.631169   0.677778        0.037873       0.016792  26.016598                 0.037873                0.016792          26.016598            1       True          6
2           ExtraTrees    0.631169   0.616667        0.080732       0.062928   1.251859                 0.080732                0.062928           1.251859            1       True          5
3  WeightedEnsemble_L2    0.631169   0.683333        0.116609       0.080804  30.268527                 0.001993                0.000997           0.413702            2       True          8
4           KNeighbors    0.623377   0.572222        0.052824       0.137540   1.089551                 0.052824                0.137540           1.089551            1       True          1
5         RandomForest    0.610390   0.627778        0.076743       0.063015   3.838227                 0.076743                0.063015           3.838227            1       True          3
6             CatBoost    0.571429   0.666667        0.083691       0.037873  67.730205                 0.083691                0.037873          67.730205            1       True          4
7          LinearModel    0.558442   0.600000        0.078737       0.068869   3.029000                 0.078737                0.068869           3.029000            1       True          7

7.4.4 Return

# Filtering the DataFrame to get features where importance > 0 and p_value < 0.05
filtered_importance = importance[(importance['importance'] > 0) & (importance['p_value'] < 0.05)]
# Counting the number of such features
number_of_features = filtered_importance.shape[0]
# Printing the result
print(f"Number of features with importance > 0 and p_value < 0.05: {number_of_features}")
Number of features with importance > 0 and p_value < 0.05: 232

# Display the top 20 rows of the Importance DataFrame
print("Top 20 rows of Importance DataFrame:")
print(importance.head(20))
Top 20 rows of Importance DataFrame:
                importance    stddev   p_value  n  p99_high   p99_low
NTRK1             0.020779  0.012044  0.009090  5  0.045577 -0.004019
RP11-641D5.1      0.018701  0.004996  0.000557  5  0.028989  0.008414
HBA2              0.015065  0.006723  0.003718  5  0.028908  0.001222
STMN2             0.013506  0.005631  0.002917  5  0.025101  0.001912
NAIP              0.012987  0.002597  0.000182  5  0.018335  0.007639
HIST2H2BF         0.012468  0.005631  0.003878  5  0.024062  0.000873
SPP1              0.011948  0.005386  0.003852  5  0.023038  0.000858
AC010524.2        0.011429  0.003939  0.001455  5  0.019539  0.003318
CD24              0.010909  0.002845  0.000508  5  0.016768  0.005051
ADAMDEC1          0.010909  0.006723  0.011097  5  0.024752 -0.002934
XIST              0.010909  0.005631  0.006165  5  0.022503 -0.000685
RP11-1212A22.4    0.010390  0.005808  0.008065  5  0.022348 -0.001569
RPS10-NUDT3       0.009351  0.003939  0.003027  5  0.017461  0.001240
JAKMIP3           0.008831  0.002961  0.001314  5  0.014929  0.002733
PPP1R14C          0.008831  0.002961  0.001314  5  0.014929  0.002733
TBC1D3B           0.008831  0.004718  0.006931  5  0.018546 -0.000884
SPINK5            0.008831  0.001423  0.000078  5  0.011760  0.005902
MMP3              0.008831  0.009293  0.050387  5  0.027965 -0.010303
PPP2R2C           0.008831  0.003939  0.003711  5  0.016942  0.000720
SAA2              0.008312  0.002845  0.001419  5  0.014170  0.002453

# Display the top 20 rows of the Leaderboard DataFrame
print("\nTop 20 rows of Leaderboard DataFrame:")
print(leaderboard.head(20))

Top 20 rows of Leaderboard DataFrame:
                 model  score_test  score_val  pred_time_test  pred_time_val  \
0             LightGBM    0.633766   0.666667        0.026909       0.014949   
1              XGBoost    0.631169   0.677778        0.037873       0.016792   
2           ExtraTrees    0.631169   0.616667        0.080732       0.062928   
3  WeightedEnsemble_L2    0.631169   0.683333        0.116609       0.080804   
4           KNeighbors    0.623377   0.572222        0.052824       0.137540   
5         RandomForest    0.610390   0.627778        0.076743       0.063015   
6             CatBoost    0.571429   0.666667        0.083691       0.037873   
7          LinearModel    0.558442   0.600000        0.078737       0.068869   

    fit_time  pred_time_test_marginal  pred_time_val_marginal  \
0   9.649190                 0.026909                0.014949   
1  26.016598                 0.037873                0.016792   
2   1.251859                 0.080732                0.062928   
3  30.268527                 0.001993                0.000997   
4   1.089551                 0.052824                0.137540   
5   3.838227                 0.076743                0.063015   
6  67.730205                 0.083691                0.037873   
7   3.029000                 0.078737                0.068869   

   fit_time_marginal  stack_level  can_infer  fit_order  
0           9.649190            1       True          2  
1          26.016598            1       True          6  
2           1.251859            1       True          5  
3           0.413702            2       True          8  
4           1.089551            1       True          1  
5           3.838227            1       True          3  
6          67.730205            1       True          4  
7           3.029000            1       True          7  

7.4.5 Save Data

# Save the Importance DataFrame to a CSV file
# (keep the index: it holds the feature names)
importance.to_csv('../test_TransProPy/data/Insignificant_correlation_Autogluon_SelectML_importance.csv', index=True)

# Save the Leaderboard DataFrame to a CSV file
leaderboard.to_csv('../test_TransProPy/data/Insignificant_correlation_Autogluon_SelectML_leaderboard.csv', index=False)

7.5 Significant Correlation

  • Please note: Data characteristics: the features have strong correlation with the classification.

7.5.1 Import the corresponding module

from TransProPy.AutogluonSelectML import AutoGluon_SelectML

7.5.2 Data

import pandas as pd
data_path = '../test_TransProPy/data/four_methods_degs_intersection.csv'  
data = pd.read_csv(data_path)
print(data.iloc[:10, :10]) 
  Unnamed: 0  TCGA-D9-A4Z2-01A  TCGA-ER-A2NH-06A  TCGA-BF-A5EO-01A  \
0        A2M         16.808499         16.506184         17.143433   
1      A2ML1          1.584963          9.517669          7.434628   
2      AADAC          4.000000          2.584963          1.584963   
3    AADACL2          1.000000          1.000000          0.000000   
4     ABCA12          4.523562          4.321928          3.906891   
5    ABCA17P          4.584963          5.169925          3.807355   
6      ABCA9          9.753217          6.906891          3.459432   
7      ABCB4          9.177420          6.700440          5.000000   
8      ABCB5         10.134426          4.169925          9.167418   
9     ABCC11         10.092757          6.491853          5.459432   

   TCGA-D9-A6EA-06A  TCGA-D9-A4Z3-01A  TCGA-GN-A26A-06A  TCGA-D3-A3BZ-06A  \
0         17.760739         14.766839         16.263691         16.035207   
1          2.584963          1.584963          2.584963          5.285402   
2          0.000000          0.000000          0.000000          3.321928   
3          0.000000          1.000000          0.000000          0.000000   
4          3.459432          1.584963          3.000000          4.321928   
5          8.366322          7.228819          7.076816          4.584963   
6          2.584963          6.357552          6.475733          7.330917   
7          9.342075         10.392317          7.383704         11.032735   
8          4.906891         11.340963          3.169925         11.161762   
9          6.807355          4.247928          5.459432          5.977280   

   TCGA-D3-A51G-06A  TCGA-EE-A29R-06A  
0         18.355114         16.959379  
1          2.584963          3.584963  
2          1.000000          4.584963  
3          0.000000          1.000000  
4          4.807355          3.700440  
5          6.409391          7.139551  
6          7.954196          9.177420  
7         10.082149         10.088788  
8          4.643856         12.393927  
9          5.614710          8.233620  

import pandas as pd
data_path = '../test_TransProPy/data/class.csv'  
data = pd.read_csv(data_path)
print(data.iloc[:10, :10]) 
         Unnamed: 0  class
0  TCGA-D9-A4Z2-01A      2
1  TCGA-ER-A2NH-06A      2
2  TCGA-BF-A5EO-01A      2
3  TCGA-D9-A6EA-06A      2
4  TCGA-D9-A4Z3-01A      2
5  TCGA-GN-A26A-06A      2
6  TCGA-D3-A3BZ-06A      2
7  TCGA-D3-A51G-06A      2
8  TCGA-EE-A29R-06A      2
9  TCGA-D3-A2JE-06A      2

7.5.3 AutoGluon_SelectML

  • The core purpose of choosing AutoGluon_SelectML, selecting a larger feature set that includes both important and secondary features, is reflected in the following custom hyperparameters configuration. This setup enlists multiple model types so that a broader range of features is considered.
  • The full configuration can cover neural networks (PyTorch and FastAI, commented out in this run), gradient boosting machines (LightGBM, XGBoost, and CatBoost), random forests (RF), extremely randomized trees (XT), K-nearest neighbors (KNN), and linear models (LR).
importance, leaderboard = AutoGluon_SelectML(
    gene_data_path='../test_TransProPy/data/four_methods_degs_intersection.csv', 
    class_data_path='../test_TransProPy/data/class.csv', 
    label_column='class', 
    test_size=0.3, 
    threshold=0.9, 
    hyperparameters={
        'GBM': {}, 
        'RF': {},
        'CAT': {}, 
        'XGB' : {},
        # 'NN_TORCH': {}, 
        # 'FASTAI': {},
        'XT': {}, 
        'KNN': {},
        'LR': {}
        },
    random_feature=None, 
    num_bag_folds=None, 
    num_stack_levels=None, 
    time_limit=1000, 
    random_state=42
    )
No path specified. Models will be saved in: "AutogluonModels\ag-20240804_050246\"
Beginning AutoGluon training ... Time limit = 1000s
AutoGluon will save models to "AutogluonModels\ag-20240804_050246\"
AutoGluon Version:  0.8.2
Python Version:     3.10.11
Operating System:   Windows
Platform Machine:   AMD64
Platform Version:   10.0.19044
Disk Space Avail:   208.48 GB / 925.93 GB (22.5%)
Train Data Rows:    896
Train Data Columns: 1605
Label Column: class
Preprocessing data ...
AutoGluon infers your prediction problem is: 'binary' (because only two unique label-values observed).
    2 unique label values:  [2, 1]
    If 'binary' is not the correct problem_type, please manually specify the problem_type parameter during predictor init (You may specify problem_type as one of: ['binary', 'multiclass', 'regression'])
Selected class <--> label mapping:  class 1 = 2, class 0 = 1
    Note: For your binary classification, AutoGluon arbitrarily selected which label-value represents positive (2) vs negative (1) class.
    To explicitly set the positive_class, either rename classes to 1 and 0, or specify positive_class in Predictor init.
Using Feature Generators to preprocess the data ...
Fitting AutoMLPipelineFeatureGenerator...
    Available Memory:                    13543.99 MB
    Train Data (Original)  Memory Usage: 11.5 MB (0.1% of available memory)
    Inferring data type of each feature based on column values. Set feature_metadata_in to manually specify special dtypes of the features.
    Stage 1 Generators:
        Fitting AsTypeFeatureGenerator...
    Stage 2 Generators:
        Fitting FillNaFeatureGenerator...
    Stage 3 Generators:
        Fitting IdentityFeatureGenerator...
    Stage 4 Generators:
        Fitting DropUniqueFeatureGenerator...
    Stage 5 Generators:
        Fitting DropDuplicatesFeatureGenerator...
    Types of features in original data (raw dtype, special dtypes):
        ('float', []) : 1605 | ['A2M', 'A2ML1', 'ABCA12', 'ABCA17P', 'ABCA9', ...]
    Types of features in processed data (raw dtype, special dtypes):
        ('float', []) : 1605 | ['A2M', 'A2ML1', 'ABCA12', 'ABCA17P', 'ABCA9', ...]
    2.3s = Fit runtime
    1605 features in original data used to generate 1605 features in processed data.
    Train Data (Processed) Memory Usage: 11.5 MB (0.1% of available memory)
Data preprocessing and feature engineering runtime = 2.49s ...
AutoGluon will gauge predictive performance using evaluation metric: 'accuracy'
    To change this, specify the eval_metric parameter of Predictor()
Automatically generating train/validation split with holdout_frac=0.2, Train Rows: 716, Val Rows: 180
User-specified model hyperparameters to be fit:
{
    'GBM': {},
    'RF': {},
    'CAT': {},
    'XGB': {},
    'XT': {},
    'KNN': {},
    'LR': {},
}
Fitting 7 L1 models ...
Fitting model: KNeighbors ... Training model for up to 997.51s of the 997.49s of remaining time.
    1.0  = Validation score   (accuracy)
    0.24s    = Training   runtime
    0.03s    = Validation runtime
Fitting model: LightGBM ... Training model for up to 997.21s of the 997.19s of remaining time.
    1.0  = Validation score   (accuracy)
    1.73s    = Training   runtime
    0.01s    = Validation runtime
Fitting model: RandomForest ... Training model for up to 995.44s of the 995.42s of remaining time.
    1.0  = Validation score   (accuracy)
    1.27s    = Training   runtime
    0.05s    = Validation runtime
Fitting model: CatBoost ... Training model for up to 994.08s of the 994.06s of remaining time.
    1.0  = Validation score   (accuracy)
    65.1s    = Training   runtime
    0.04s    = Validation runtime
Fitting model: ExtraTrees ... Training model for up to 928.91s of the 928.89s of remaining time.
    1.0  = Validation score   (accuracy)
    1.11s    = Training   runtime
    0.05s    = Validation runtime
Fitting model: XGBoost ... Training model for up to 927.71s of the 927.69s of remaining time.
    1.0  = Validation score   (accuracy)
    4.56s    = Training   runtime
    0.02s    = Validation runtime
Fitting model: LinearModel ... Training model for up to 923.11s of the 923.09s of remaining time.
E:\Anaconda\Anaconda\envs\TransPro\lib\site-packages\sklearn\preprocessing\_data.py:2663: UserWarning:

n_quantiles (1000) is greater than the total number of samples (716). n_quantiles is set to n_samples.
    1.0  = Validation score   (accuracy)
    2.78s    = Training   runtime
    0.06s    = Validation runtime
Fitting model: WeightedEnsemble_L2 ... Training model for up to 360.0s of the 920.19s of remaining time.
    1.0  = Validation score   (accuracy)
    0.4s     = Training   runtime
    0.0s     = Validation runtime
AutoGluon training complete, total runtime = 80.26s ... Best model: "WeightedEnsemble_L2"
TabularPredictor saved. To load, use: predictor = TabularPredictor.load("AutogluonModels\ag-20240804_050246\")
Computing feature importance via permutation shuffling for 1605 features using 385 rows with 5 shuffle sets...
    640.26s = Expected runtime (128.05s per shuffle set)
    65.47s  = Actual runtime (Completed 5 of 5 shuffle sets)
                 model  score_test  score_val  pred_time_test  pred_time_val   fit_time  pred_time_test_marginal  pred_time_val_marginal  fit_time_marginal  stack_level  can_infer  fit_order
0             LightGBM    1.000000        1.0        0.016944       0.011959   1.733270                 0.016944                0.011959           1.733270            1       True          2
1           KNeighbors    1.000000        1.0        0.050829       0.033886   0.238504                 0.050829                0.033886           0.238504            1       True          1
2         RandomForest    1.000000        1.0        0.070763       0.049834   1.269384                 0.070763                0.049834           1.269384            1       True          3
3           ExtraTrees    1.000000        1.0        0.070764       0.052823   1.109194                 0.070764                0.052823           1.109194            1       True          5
4          LinearModel    1.000000        1.0        0.071759       0.063786   2.781226                 0.071759                0.063786           2.781226            1       True          7
5  WeightedEnsemble_L2    1.000000        1.0        0.072757       0.053820   1.508564                 0.001993                0.000997           0.399371            2       True          8
6             CatBoost    0.997403        1.0        0.084717       0.036279  65.102499                 0.084717                0.036279          65.102499            1       True          4
7              XGBoost    0.994805        1.0        0.031897       0.015007   4.556480                 0.031897                0.015007           4.556480            1       True          6

7.5.4 Return

# Filtering the DataFrame to get features where importance > 0 and p_value < 0.05
filtered_importance = importance[(importance['importance'] > 0) & (importance['p_value'] < 0.05)]
# Counting the number of such features
number_of_features = filtered_importance.shape[0]
# Printing the result
print(f"Number of features with importance > 0 and p_value < 0.05: {number_of_features}")
Number of features with importance > 0 and p_value < 0.05: 0

# Display the top 20 rows of the Importance DataFrame
print("Top 20 rows of Importance DataFrame:")
print(importance.head(20))
Top 20 rows of Importance DataFrame:
          importance  stddev  p_value  n  p99_high  p99_low
A2M              0.0     0.0      0.5  5       0.0      0.0
PROM2            0.0     0.0      0.5  5       0.0      0.0
PSORS1C2         0.0     0.0      0.5  5       0.0      0.0
PSORS1C1         0.0     0.0      0.5  5       0.0      0.0
PSMC1P1          0.0     0.0      0.5  5       0.0      0.0
PRTG             0.0     0.0      0.5  5       0.0      0.0
PRSS8            0.0     0.0      0.5  5       0.0      0.0
PRSS53           0.0     0.0      0.5  5       0.0      0.0
PRSS3            0.0     0.0      0.5  5       0.0      0.0
PRSS22           0.0     0.0      0.5  5       0.0      0.0
PRR19            0.0     0.0      0.5  5       0.0      0.0
PRR15L           0.0     0.0      0.5  5       0.0      0.0
PRODH            0.0     0.0      0.5  5       0.0      0.0
PI16             0.0     0.0      0.5  5       0.0      0.0
PRKCQ            0.0     0.0      0.5  5       0.0      0.0
PRF1             0.0     0.0      0.5  5       0.0      0.0
PRAME            0.0     0.0      0.5  5       0.0      0.0
PPP2R2C          0.0     0.0      0.5  5       0.0      0.0
PPP1R1B          0.0     0.0      0.5  5       0.0      0.0
PPP1R14C         0.0     0.0      0.5  5       0.0      0.0

# Display the top 20 rows of the Leaderboard DataFrame
print("\nTop 20 rows of Leaderboard DataFrame:")
print(leaderboard.head(20))

Top 20 rows of Leaderboard DataFrame:
                 model  score_test  score_val  pred_time_test  pred_time_val  \
0             LightGBM    1.000000        1.0        0.016944       0.011959   
1           KNeighbors    1.000000        1.0        0.050829       0.033886   
2         RandomForest    1.000000        1.0        0.070763       0.049834   
3           ExtraTrees    1.000000        1.0        0.070764       0.052823   
4          LinearModel    1.000000        1.0        0.071759       0.063786   
5  WeightedEnsemble_L2    1.000000        1.0        0.072757       0.053820   
6             CatBoost    0.997403        1.0        0.084717       0.036279   
7              XGBoost    0.994805        1.0        0.031897       0.015007   

    fit_time  pred_time_test_marginal  pred_time_val_marginal  \
0   1.733270                 0.016944                0.011959   
1   0.238504                 0.050829                0.033886   
2   1.269384                 0.070763                0.049834   
3   1.109194                 0.070764                0.052823   
4   2.781226                 0.071759                0.063786   
5   1.508564                 0.001993                0.000997   
6  65.102499                 0.084717                0.036279   
7   4.556480                 0.031897                0.015007   

   fit_time_marginal  stack_level  can_infer  fit_order  
0           1.733270            1       True          2  
1           0.238504            1       True          1  
2           1.269384            1       True          3  
3           1.109194            1       True          5  
4           2.781226            1       True          7  
5           0.399371            2       True          8  
6          65.102499            1       True          4  
7           4.556480            1       True          6  

7.5.5 Save Data

# Save the Importance DataFrame to a CSV file
# (keep the index: it holds the feature names)
importance.to_csv('../test_TransProPy/data/significant_correlation_Autogluon_SelectML_importance.csv', index=True)

# Save the Leaderboard DataFrame to a CSV file
leaderboard.to_csv('../test_TransProPy/data/significant_correlation_Autogluon_SelectML_leaderboard.csv', index=False)

7.6 References

7.6.1 Documentation

  • AutoGluon documentation (source of the num_bag_folds and num_stack_levels parameter descriptions): https://auto.gluon.ai/