6 AutogluonTimeLimit.py
Trains a model with AutoGluon on the data at the provided paths and returns feature importance and a model leaderboard.
6.1 Parameters
- gene_data_path (str):
- Path to the gene expression data CSV file.
- For example: '../data/gene_tpm.csv'
- class_data_path (str):
- Path to the class data CSV file.
- For example: '../data/tumor_class.csv'
- label_column (str):
- Name of the column in the dataset that is the target label for prediction.
- test_size (float):
- Proportion of the data to be used as the test set.
- threshold (float):
- The threshold used to filter out rows based on the proportion of non-zero values (see the sketch after this list).
- random_feature (int, optional):
- The number of random features to select. If None, no random feature selection is performed.
- Default is None.
- num_bag_folds (int, optional):
- Please note: the description of this parameter is adapted from the AutoGluon documentation linked in References.
- Number of folds used for bagging of models. When num_bag_folds = k, training time is roughly increased by a factor of k (set num_bag_folds = 0 to disable bagging). Disabled by default (0), but values between 5-10 are recommended to maximize predictive performance. Increasing num_bag_folds will result in models with lower bias that are more prone to overfitting. num_bag_folds = 1 is an invalid value and will raise a ValueError. Values > 10 may produce diminishing returns and can even harm overall results due to overfitting. To further improve predictions, avoid increasing num_bag_folds much beyond 10 and instead increase num_bag_sets.
- Default is None.
- num_stack_levels (int, optional):
- Please note: the description of this parameter is adapted from the AutoGluon documentation linked in References.
- Number of stacking levels to use in stack ensemble. Roughly increases model training time by a factor of num_stack_levels + 1 (set num_stack_levels = 0 to disable stack ensembling). Disabled by default (0), but values between 1-3 are recommended to maximize predictive performance. To prevent overfitting, num_bag_folds >= 2 must also be set, or a ValueError will be raised.
- Default is None.
- time_limit (int, optional):
- Time limit for training in seconds.
- Default is 120.
- random_state (int, optional):
- The seed used by the random number generator.
- Default is 42.
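The effect of the threshold parameter can be pictured with a small pandas sketch. This is only an illustration of the idea, not the package's internal code; the helper name keep_expressed and its inputs are hypothetical.
import pandas as pd

# Hypothetical illustration of the threshold filter: keep rows (genes)
# whose proportion of non-zero values across samples is at least `threshold`.
def keep_expressed(df: pd.DataFrame, threshold: float) -> pd.DataFrame:
    nonzero_fraction = (df != 0).mean(axis=1)
    return df[nonzero_fraction >= threshold]

# With threshold=0.9, a gene is kept only if it is non-zero in >= 90% of samples.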
6.2 Returns
- importance (DataFrame):
- DataFrame containing feature importance.
- leaderboard (DataFrame):
- DataFrame containing model performance on the test data.
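For orientation, a minimal call might look like the sketch below, using the placeholder paths from the parameter list above; complete, executed examples follow in sections 6.4 and 6.5.
from TransProPy.AutogluonTimeLimit import Autogluon_TimeLimit

importance, leaderboard = Autogluon_TimeLimit(
    gene_data_path='../data/gene_tpm.csv',      # placeholder path
    class_data_path='../data/tumor_class.csv',  # placeholder path
    label_column='class',
    test_size=0.3,
    threshold=0.9,
    time_limit=120,   # default
    random_state=42   # default
)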
6.3 Usage of Autogluon_TimeLimit
Performing training and prediction tasks on tabular data using AutoGluon.
6.3.1 Objectives
6.3.1.1 Model Training and Selection
AutoGluon attempts various models and hyperparameter combinations within the given time limit to find the best-performing model. During training, AutoGluon outputs logs displaying performance metrics and progress information for the different models. The goal is to select the best-performing model (as judged on held-out validation data) for use in subsequent prediction tasks.
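Conceptually, this stage corresponds to AutoGluon's TabularPredictor.fit; the stand-alone sketch below shows the idea, assuming an AutoGluon 0.8-style API and hypothetical pre-made train/test CSVs that contain a 'class' label column.
from autogluon.tabular import TabularPredictor
import pandas as pd

# Hypothetical pre-made splits; the 'class' column holds the labels.
train_data = pd.read_csv('train_split.csv')
test_data = pd.read_csv('test_split.csv')

# AutoGluon searches models and hyperparameters for up to time_limit seconds,
# keeping the model that scores best on its internal validation split.
predictor = TabularPredictor(label='class').fit(train_data, time_limit=120)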
6.3.1.2 Leaderboard
The leaderboard displays performance scores of different models on the test data, typically including metrics like accuracy, precision, recall, and more. The purpose is to assist users in understanding the performance of different models to choose the most suitable model for predictions.
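Continuing the sketch above, the leaderboard on held-out data comes from a single call (silent=True merely suppresses console printing in AutoGluon 0.8):
# Rank all trained models by their score on the test data.
leaderboard = predictor.leaderboard(test_data, silent=True)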
6.3.1.3 Importance
Feature importance indicates which features are most critical for the model’s prediction performance. The purpose is to help users understand the importance of specific features in the data, which can be used for feature selection or further data analysis.
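The importance tables shown later in this section are produced by AutoGluon's permutation-shuffling importance; continuing the same sketch, the underlying call looks roughly like this (num_shuffle_sets=5 matches the "5 shuffle sets" visible in the logs below):
# Permutation importance: shuffle each feature and measure the drop in score.
importance = predictor.feature_importance(test_data, num_shuffle_sets=5)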
6.3.2 Note
Please note that AutoGluon's output may vary depending on your data and task. You can review the generated model leaderboard and feature importance to understand model performance and the significance of specific features in the data. These results can help you make better predictions and decisions.
6.4 Insignificant Correlation
- Please note: Data characteristics: the features have only weak correlation with the classification.
- Randomly shuffling a portion of the class labels simulates reducing this correlation (a minimal sketch of such shuffling follows).
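A minimal sketch of such partial shuffling is shown below; the 50% fraction is hypothetical, and the package's actual procedure for generating random_classification_class.csv may differ.
import pandas as pd

# Hypothetical illustration: re-shuffle the labels of a random 50% of samples
# to weaken the correlation between features and class.
classes = pd.read_csv('../test_TransProPy/data/class.csv', index_col=0)
idx = classes.sample(frac=0.5, random_state=42).index
classes.loc[idx, 'class'] = (
    classes.loc[idx, 'class'].sample(frac=1.0, random_state=42).values
)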
6.4.1 Import the corresponding module
from TransProPy.AutogluonTimeLimit import Autogluon_TimeLimit
6.4.2 Data
import pandas as pd
data_path = '../test_TransProPy/data/four_methods_degs_intersection.csv'
data = pd.read_csv(data_path)
print(data.iloc[:10, :10])
Unnamed: 0 TCGA-D9-A4Z2-01A TCGA-ER-A2NH-06A TCGA-BF-A5EO-01A \
0 A2M 16.808499 16.506184 17.143433
1 A2ML1 1.584963 9.517669 7.434628
2 AADAC 4.000000 2.584963 1.584963
3 AADACL2 1.000000 1.000000 0.000000
4 ABCA12 4.523562 4.321928 3.906891
5 ABCA17P 4.584963 5.169925 3.807355
6 ABCA9 9.753217 6.906891 3.459432
7 ABCB4 9.177420 6.700440 5.000000
8 ABCB5 10.134426 4.169925 9.167418
9 ABCC11 10.092757 6.491853 5.459432
TCGA-D9-A6EA-06A TCGA-D9-A4Z3-01A TCGA-GN-A26A-06A TCGA-D3-A3BZ-06A \
0 17.760739 14.766839 16.263691 16.035207
1 2.584963 1.584963 2.584963 5.285402
2 0.000000 0.000000 0.000000 3.321928
3 0.000000 1.000000 0.000000 0.000000
4 3.459432 1.584963 3.000000 4.321928
5 8.366322 7.228819 7.076816 4.584963
6 2.584963 6.357552 6.475733 7.330917
7 9.342075 10.392317 7.383704 11.032735
8 4.906891 11.340963 3.169925 11.161762
9 6.807355 4.247928 5.459432 5.977280
TCGA-D3-A51G-06A TCGA-EE-A29R-06A
0 18.355114 16.959379
1 2.584963 3.584963
2 1.000000 4.584963
3 0.000000 1.000000
4 4.807355 3.700440
5 6.409391 7.139551
6 7.954196 9.177420
7 10.082149 10.088788
8 4.643856 12.393927
9 5.614710 8.233620
import pandas as pd
data_path = '../test_TransProPy/data/random_classification_class.csv'
data = pd.read_csv(data_path)
print(data.iloc[:10, :10])
Unnamed: 0 class
0 TCGA-D9-A4Z2-01A 2
1 TCGA-ER-A2NH-06A 2
2 TCGA-BF-A5EO-01A 2
3 TCGA-D9-A6EA-06A 2
4 TCGA-D9-A4Z3-01A 1
5 TCGA-GN-A26A-06A 1
6 TCGA-D3-A3BZ-06A 1
7 TCGA-D3-A51G-06A 1
8 TCGA-EE-A29R-06A 1
9 TCGA-D3-A2JE-06A 1
6.4.3 Autogluon_TimeLimit
importance, leaderboard = Autogluon_TimeLimit(
    gene_data_path='../test_TransProPy/data/four_methods_degs_intersection.csv',
    class_data_path='../test_TransProPy/data/random_classification_class.csv',
    label_column='class',
    test_size=0.3,
    threshold=0.9,
    random_feature=None,
    num_bag_folds=None,
    num_stack_levels=None,
    time_limit=1000,
    random_state=42
)
No path specified. Models will be saved in: "AutogluonModels\ag-20240804_044508\"
Beginning AutoGluon training ... Time limit = 1000s
AutoGluon will save models to "AutogluonModels\ag-20240804_044508\"
AutoGluon Version: 0.8.2
Python Version: 3.10.11
Operating System: Windows
Platform Machine: AMD64
Platform Version: 10.0.19044
Disk Space Avail: 208.56 GB / 925.93 GB (22.5%)
Train Data Rows: 896
Train Data Columns: 1605
Label Column: class
Preprocessing data ...
AutoGluon infers your prediction problem is: 'binary' (because only two unique label-values observed).
2 unique label values: [1, 2]
If 'binary' is not the correct problem_type, please manually specify the problem_type parameter during predictor init (You may specify problem_type as one of: ['binary', 'multiclass', 'regression'])
Selected class <--> label mapping: class 1 = 2, class 0 = 1
Note: For your binary classification, AutoGluon arbitrarily selected which label-value represents positive (2) vs negative (1) class.
To explicitly set the positive_class, either rename classes to 1 and 0, or specify positive_class in Predictor init.
Using Feature Generators to preprocess the data ...
Fitting AutoMLPipelineFeatureGenerator...
Available Memory: 13791.23 MB
Train Data (Original) Memory Usage: 11.5 MB (0.1% of available memory)
Inferring data type of each feature based on column values. Set feature_metadata_in to manually specify special dtypes of the features.
Stage 1 Generators:
Fitting AsTypeFeatureGenerator...
Stage 2 Generators:
Fitting FillNaFeatureGenerator...
Stage 3 Generators:
Fitting IdentityFeatureGenerator...
Stage 4 Generators:
Fitting DropUniqueFeatureGenerator...
Stage 5 Generators:
Fitting DropDuplicatesFeatureGenerator...
Types of features in original data (raw dtype, special dtypes):
('float', []) : 1605 | ['A2M', 'A2ML1', 'ABCA12', 'ABCA17P', 'ABCA9', ...]
Types of features in processed data (raw dtype, special dtypes):
('float', []) : 1605 | ['A2M', 'A2ML1', 'ABCA12', 'ABCA17P', 'ABCA9', ...]
1.3s = Fit runtime
1605 features in original data used to generate 1605 features in processed data.
Train Data (Processed) Memory Usage: 11.5 MB (0.1% of available memory)
Data preprocessing and feature engineering runtime = 1.37s ...
AutoGluon will gauge predictive performance using evaluation metric: 'accuracy'
To change this, specify the eval_metric parameter of Predictor()
Automatically generating train/validation split with holdout_frac=0.2, Train Rows: 716, Val Rows: 180
User-specified model hyperparameters to be fit:
{
'NN_TORCH': {},
'GBM': [{'extra_trees': True, 'ag_args': {'name_suffix': 'XT'}}, {}, 'GBMLarge'],
'CAT': {},
'XGB': {},
'FASTAI': {},
'RF': [{'criterion': 'gini', 'ag_args': {'name_suffix': 'Gini', 'problem_types': ['binary', 'multiclass']}}, {'criterion': 'entropy', 'ag_args': {'name_suffix': 'Entr', 'problem_types': ['binary', 'multiclass']}}, {'criterion': 'squared_error', 'ag_args': {'name_suffix': 'MSE', 'problem_types': ['regression', 'quantile']}}],
'XT': [{'criterion': 'gini', 'ag_args': {'name_suffix': 'Gini', 'problem_types': ['binary', 'multiclass']}}, {'criterion': 'entropy', 'ag_args': {'name_suffix': 'Entr', 'problem_types': ['binary', 'multiclass']}}, {'criterion': 'squared_error', 'ag_args': {'name_suffix': 'MSE', 'problem_types': ['regression', 'quantile']}}],
'KNN': [{'weights': 'uniform', 'ag_args': {'name_suffix': 'Unif'}}, {'weights': 'distance', 'ag_args': {'name_suffix': 'Dist'}}],
}
Fitting 13 L1 models ...
Fitting model: KNeighborsUnif ... Training model for up to 998.63s of the 998.62s of remaining time.
0.5722 = Validation score (accuracy)
1.11s = Training runtime
0.15s = Validation runtime
Fitting model: KNeighborsDist ... Training model for up to 997.35s of the 997.34s of remaining time.
0.5722 = Validation score (accuracy)
0.12s = Training runtime
0.03s = Validation runtime
Fitting model: LightGBMXT ... Training model for up to 997.17s of the 997.16s of remaining time.
0.7 = Validation score (accuracy)
2.2s = Training runtime
0.01s = Validation runtime
Fitting model: LightGBM ... Training model for up to 994.93s of the 994.91s of remaining time.
0.6667 = Validation score (accuracy)
12.75s = Training runtime
0.01s = Validation runtime
Fitting model: RandomForestGini ... Training model for up to 982.1s of the 982.08s of remaining time.
0.6278 = Validation score (accuracy)
3.88s = Training runtime
0.07s = Validation runtime
Fitting model: RandomForestEntr ... Training model for up to 978.11s of the 978.09s of remaining time.
0.6167 = Validation score (accuracy)
4.4s = Training runtime
0.06s = Validation runtime
Fitting model: CatBoost ... Training model for up to 973.57s of the 973.55s of remaining time.
0.6667 = Validation score (accuracy)
67.59s = Training runtime
0.04s = Validation runtime
Fitting model: ExtraTreesGini ... Training model for up to 905.91s of the 905.89s of remaining time.
0.6167 = Validation score (accuracy)
1.35s = Training runtime
0.05s = Validation runtime
Fitting model: ExtraTreesEntr ... Training model for up to 904.46s of the 904.44s of remaining time.
0.6222 = Validation score (accuracy)
1.25s = Training runtime
0.05s = Validation runtime
Fitting model: NeuralNetFastAI ... Training model for up to 903.1s of the 903.08s of remaining time.
0.6722 = Validation score (accuracy)
3.06s = Training runtime
0.02s = Validation runtime
Fitting model: XGBoost ... Training model for up to 899.95s of the 899.93s of remaining time.
0.6778 = Validation score (accuracy)
26.11s = Training runtime
0.02s = Validation runtime
Fitting model: NeuralNetTorch ... Training model for up to 873.79s of the 873.77s of remaining time.
0.6889 = Validation score (accuracy)
53.47s = Training runtime
0.1s = Validation runtime
Fitting model: LightGBMLarge ... Training model for up to 820.19s of the 820.17s of remaining time.
0.6556 = Validation score (accuracy)
29.38s = Training runtime
0.01s = Validation runtime
Fitting model: WeightedEnsemble_L2 ... Training model for up to 360.0s of the 790.71s of remaining time.
0.7056 = Validation score (accuracy)
0.75s = Training runtime
0.0s = Validation runtime
AutoGluon training complete, total runtime = 210.09s ... Best model: "WeightedEnsemble_L2"
TabularPredictor saved. To load, use: predictor = TabularPredictor.load("AutogluonModels\ag-20240804_044508\")
Computing feature importance via permutation shuffling for 1605 features using 385 rows with 5 shuffle sets...
704.28s = Expected runtime (140.86s per shuffle set)
215.75s = Actual runtime (Completed 5 of 5 shuffle sets)
model score_test score_val pred_time_test pred_time_val fit_time pred_time_test_marginal pred_time_val_marginal fit_time_marginal stack_level can_infer fit_order
0 LightGBM 0.633766 0.666667 0.026911 0.013913 12.754798 0.026911 0.013913 12.754798 1 True 4
1 LightGBMXT 0.631169 0.700000 0.019933 0.012957 2.199846 0.019933 0.012957 2.199846 1 True 3
2 XGBoost 0.631169 0.677778 0.038872 0.018937 26.106133 0.038872 0.018937 26.106133 1 True 11
3 WeightedEnsemble_L2 0.631169 0.705556 0.073754 0.161193 4.068005 0.002990 0.000997 0.754443 2 True 14
4 ExtraTreesGini 0.631169 0.616667 0.075747 0.054816 1.352519 0.075747 0.054816 1.352519 1 True 8
5 KNeighborsUnif 0.623377 0.572222 0.050830 0.147239 1.113717 0.050830 0.147239 1.113717 1 True 1
6 KNeighborsDist 0.623377 0.572222 0.054817 0.033886 0.123587 0.054817 0.033886 0.123587 1 True 2
7 ExtraTreesEntr 0.623377 0.622222 0.077321 0.053820 1.253270 0.077321 0.053820 1.253270 1 True 9
8 RandomForestEntr 0.620779 0.616667 0.074385 0.064918 4.396520 0.074385 0.064918 4.396520 1 True 6
9 NeuralNetTorch 0.615584 0.688889 0.088704 0.104267 53.465611 0.088704 0.104267 53.465611 1 True 12
10 RandomForestGini 0.610390 0.627778 0.073753 0.065792 3.883219 0.073753 0.065792 3.883219 1 True 5
11 NeuralNetFastAI 0.589610 0.672222 0.039866 0.018937 3.063668 0.039866 0.018937 3.063668 1 True 10
12 LightGBMLarge 0.576623 0.655556 0.016943 0.012806 29.383221 0.016943 0.012806 29.383221 1 True 13
13 CatBoost 0.571429 0.666667 0.083720 0.039867 67.590781 0.083720 0.039867 67.590781 1 True 7
6.4.4 Return
# Filtering the DataFrame to get features where importance > 0 and p_value < 0.05
filtered_importance = importance[(importance['importance'] > 0) & (importance['p_value'] < 0.05)]
# Counting the number of such features
number_of_features = filtered_importance.shape[0]
# Printing the result
print(f"Number of features with importance > 0 and p_value < 0.05: {number_of_features}")
Number of features with importance > 0 and p_value < 0.05: 703
# Display the top 20 rows of the Importance DataFrame
print("Top 20 rows of Importance DataFrame:")
print(importance.head(20))
Top 20 rows of Importance DataFrame:
importance stddev p_value n p99_high p99_low
RP11-641D5.1 0.012987 0.002597 0.000182 5 0.018335 0.007639
CKM 0.012468 0.002173 0.000106 5 0.016942 0.007993
ZBED3-AS1 0.011948 0.003939 0.001234 5 0.020059 0.003837
PPP1R14C 0.010909 0.005323 0.005082 5 0.021869 -0.000051
RPS10-NUDT3 0.010909 0.006201 0.008525 5 0.023677 -0.001859
AOX1 0.010909 0.002173 0.000179 5 0.015384 0.006435
RP11-411B6.6 0.010909 0.003387 0.000985 5 0.017882 0.003936
NTRK1 0.010390 0.007113 0.015453 5 0.025036 -0.004257
GAPDHP1 0.010390 0.001837 0.000112 5 0.014171 0.006608
SPP1 0.009870 0.003387 0.001431 5 0.016843 0.002897
CLDN1 0.009351 0.004718 0.005705 5 0.019066 -0.000365
ADORA3 0.009351 0.003939 0.003027 5 0.017461 0.001240
AC016292.3 0.008831 0.004718 0.006931 5 0.018546 -0.000884
CORIN 0.008312 0.002845 0.001419 5 0.014170 0.002453
GAS6 0.008312 0.001162 0.000045 5 0.010703 0.005920
STMN2 0.007792 0.001837 0.000344 5 0.011574 0.004011
MAPK13 0.007792 0.001837 0.000344 5 0.011574 0.004011
LYPD6B 0.007792 0.003181 0.002704 5 0.014342 0.001242
TBC1D3B 0.007273 0.002173 0.000853 5 0.011747 0.002798
MT1M 0.007273 0.003853 0.006733 5 0.015205 -0.000660
# Display the top 20 rows of the Leaderboard DataFrame
print("\nTop 20 rows of Leaderboard DataFrame:")
print(leaderboard.head(20))
Top 20 rows of Leaderboard DataFrame:
model score_test score_val pred_time_test pred_time_val \
0 LightGBM 0.633766 0.666667 0.026911 0.013913
1 LightGBMXT 0.631169 0.700000 0.019933 0.012957
2 XGBoost 0.631169 0.677778 0.038872 0.018937
3 WeightedEnsemble_L2 0.631169 0.705556 0.073754 0.161193
4 ExtraTreesGini 0.631169 0.616667 0.075747 0.054816
5 KNeighborsUnif 0.623377 0.572222 0.050830 0.147239
6 KNeighborsDist 0.623377 0.572222 0.054817 0.033886
7 ExtraTreesEntr 0.623377 0.622222 0.077321 0.053820
8 RandomForestEntr 0.620779 0.616667 0.074385 0.064918
9 NeuralNetTorch 0.615584 0.688889 0.088704 0.104267
10 RandomForestGini 0.610390 0.627778 0.073753 0.065792
11 NeuralNetFastAI 0.589610 0.672222 0.039866 0.018937
12 LightGBMLarge 0.576623 0.655556 0.016943 0.012806
13 CatBoost 0.571429 0.666667 0.083720 0.039867
fit_time pred_time_test_marginal pred_time_val_marginal \
0 12.754798 0.026911 0.013913
1 2.199846 0.019933 0.012957
2 26.106133 0.038872 0.018937
3 4.068005 0.002990 0.000997
4 1.352519 0.075747 0.054816
5 1.113717 0.050830 0.147239
6 0.123587 0.054817 0.033886
7 1.253270 0.077321 0.053820
8 4.396520 0.074385 0.064918
9 53.465611 0.088704 0.104267
10 3.883219 0.073753 0.065792
11 3.063668 0.039866 0.018937
12 29.383221 0.016943 0.012806
13 67.590781 0.083720 0.039867
fit_time_marginal stack_level can_infer fit_order
0 12.754798 1 True 4
1 2.199846 1 True 3
2 26.106133 1 True 11
3 0.754443 2 True 14
4 1.352519 1 True 8
5 1.113717 1 True 1
6 0.123587 1 True 2
7 1.253270 1 True 9
8 4.396520 1 True 6
9 53.465611 1 True 12
10 3.883219 1 True 5
11 3.063668 1 True 10
12 29.383221 1 True 13
13 67.590781 1 True 7
6.4.5 Save Data
# Save the Importance DataFrame to a CSV file
importance.to_csv('../test_TransProPy/data/Insignificant_correlation_Autogluon_TimeLimit_importance.csv', index=False)
# Save the Leaderboard DataFrame to a CSV file
leaderboard.to_csv('../test_TransProPy/data/Insignificant_correlation_Autogluon_TimeLimit_leaderboard.csv', index=False)
6.5 Significant Correlation
- Please note: Data characteristics: the features have strong correlation with the classification.
6.5.1 Import the corresponding module
from TransProPy.AutogluonTimeLimit import Autogluon_TimeLimit
6.5.2 Data
import pandas as pd
data_path = '../test_TransProPy/data/four_methods_degs_intersection.csv'
data = pd.read_csv(data_path)
print(data.iloc[:10, :10])
Unnamed: 0 TCGA-D9-A4Z2-01A TCGA-ER-A2NH-06A TCGA-BF-A5EO-01A \
0 A2M 16.808499 16.506184 17.143433
1 A2ML1 1.584963 9.517669 7.434628
2 AADAC 4.000000 2.584963 1.584963
3 AADACL2 1.000000 1.000000 0.000000
4 ABCA12 4.523562 4.321928 3.906891
5 ABCA17P 4.584963 5.169925 3.807355
6 ABCA9 9.753217 6.906891 3.459432
7 ABCB4 9.177420 6.700440 5.000000
8 ABCB5 10.134426 4.169925 9.167418
9 ABCC11 10.092757 6.491853 5.459432
TCGA-D9-A6EA-06A TCGA-D9-A4Z3-01A TCGA-GN-A26A-06A TCGA-D3-A3BZ-06A \
0 17.760739 14.766839 16.263691 16.035207
1 2.584963 1.584963 2.584963 5.285402
2 0.000000 0.000000 0.000000 3.321928
3 0.000000 1.000000 0.000000 0.000000
4 3.459432 1.584963 3.000000 4.321928
5 8.366322 7.228819 7.076816 4.584963
6 2.584963 6.357552 6.475733 7.330917
7 9.342075 10.392317 7.383704 11.032735
8 4.906891 11.340963 3.169925 11.161762
9 6.807355 4.247928 5.459432 5.977280
TCGA-D3-A51G-06A TCGA-EE-A29R-06A
0 18.355114 16.959379
1 2.584963 3.584963
2 1.000000 4.584963
3 0.000000 1.000000
4 4.807355 3.700440
5 6.409391 7.139551
6 7.954196 9.177420
7 10.082149 10.088788
8 4.643856 12.393927
9 5.614710 8.233620
import pandas as pd
data_path = '../test_TransProPy/data/class.csv'
data = pd.read_csv(data_path)
print(data.iloc[:10, :10])
Unnamed: 0 class
0 TCGA-D9-A4Z2-01A 2
1 TCGA-ER-A2NH-06A 2
2 TCGA-BF-A5EO-01A 2
3 TCGA-D9-A6EA-06A 2
4 TCGA-D9-A4Z3-01A 2
5 TCGA-GN-A26A-06A 2
6 TCGA-D3-A3BZ-06A 2
7 TCGA-D3-A51G-06A 2
8 TCGA-EE-A29R-06A 2
9 TCGA-D3-A2JE-06A 2
6.5.3 Autogluon_TimeLimit
importance, leaderboard = Autogluon_TimeLimit(
    gene_data_path='../test_TransProPy/data/four_methods_degs_intersection.csv',
    class_data_path='../test_TransProPy/data/class.csv',
    label_column='class',
    test_size=0.3,
    threshold=0.9,
    random_feature=None,
    num_bag_folds=None,
    num_stack_levels=None,
    time_limit=1000,
    random_state=42
)
No path specified. Models will be saved in: "AutogluonModels\ag-20240804_045217\"
Beginning AutoGluon training ... Time limit = 1000s
AutoGluon will save models to "AutogluonModels\ag-20240804_045217\"
AutoGluon Version: 0.8.2
Python Version: 3.10.11
Operating System: Windows
Platform Machine: AMD64
Platform Version: 10.0.19044
Disk Space Avail: 208.52 GB / 925.93 GB (22.5%)
Train Data Rows: 896
Train Data Columns: 1605
Label Column: class
Preprocessing data ...
AutoGluon infers your prediction problem is: 'binary' (because only two unique label-values observed).
2 unique label values: [2, 1]
If 'binary' is not the correct problem_type, please manually specify the problem_type parameter during predictor init (You may specify problem_type as one of: ['binary', 'multiclass', 'regression'])
Selected class <--> label mapping: class 1 = 2, class 0 = 1
Note: For your binary classification, AutoGluon arbitrarily selected which label-value represents positive (2) vs negative (1) class.
To explicitly set the positive_class, either rename classes to 1 and 0, or specify positive_class in Predictor init.
Using Feature Generators to preprocess the data ...
Fitting AutoMLPipelineFeatureGenerator...
Available Memory: 13518.08 MB
Train Data (Original) Memory Usage: 11.5 MB (0.1% of available memory)
Inferring data type of each feature based on column values. Set feature_metadata_in to manually specify special dtypes of the features.
Stage 1 Generators:
Fitting AsTypeFeatureGenerator...
Stage 2 Generators:
Fitting FillNaFeatureGenerator...
Stage 3 Generators:
Fitting IdentityFeatureGenerator...
Stage 4 Generators:
Fitting DropUniqueFeatureGenerator...
Stage 5 Generators:
Fitting DropDuplicatesFeatureGenerator...
Types of features in original data (raw dtype, special dtypes):
('float', []) : 1605 | ['A2M', 'A2ML1', 'ABCA12', 'ABCA17P', 'ABCA9', ...]
Types of features in processed data (raw dtype, special dtypes):
('float', []) : 1605 | ['A2M', 'A2ML1', 'ABCA12', 'ABCA17P', 'ABCA9', ...]
2.2s = Fit runtime
1605 features in original data used to generate 1605 features in processed data.
Train Data (Processed) Memory Usage: 11.5 MB (0.1% of available memory)
Data preprocessing and feature engineering runtime = 2.38s ...
AutoGluon will gauge predictive performance using evaluation metric: 'accuracy'
To change this, specify the eval_metric parameter of Predictor()
Automatically generating train/validation split with holdout_frac=0.2, Train Rows: 716, Val Rows: 180
User-specified model hyperparameters to be fit:
{
'NN_TORCH': {},
'GBM': [{'extra_trees': True, 'ag_args': {'name_suffix': 'XT'}}, {}, 'GBMLarge'],
'CAT': {},
'XGB': {},
'FASTAI': {},
'RF': [{'criterion': 'gini', 'ag_args': {'name_suffix': 'Gini', 'problem_types': ['binary', 'multiclass']}}, {'criterion': 'entropy', 'ag_args': {'name_suffix': 'Entr', 'problem_types': ['binary', 'multiclass']}}, {'criterion': 'squared_error', 'ag_args': {'name_suffix': 'MSE', 'problem_types': ['regression', 'quantile']}}],
'XT': [{'criterion': 'gini', 'ag_args': {'name_suffix': 'Gini', 'problem_types': ['binary', 'multiclass']}}, {'criterion': 'entropy', 'ag_args': {'name_suffix': 'Entr', 'problem_types': ['binary', 'multiclass']}}, {'criterion': 'squared_error', 'ag_args': {'name_suffix': 'MSE', 'problem_types': ['regression', 'quantile']}}],
'KNN': [{'weights': 'uniform', 'ag_args': {'name_suffix': 'Unif'}}, {'weights': 'distance', 'ag_args': {'name_suffix': 'Dist'}}],
}
Fitting 13 L1 models ...
Fitting model: KNeighborsUnif ... Training model for up to 997.62s of the 997.6s of remaining time.
1.0 = Validation score (accuracy)
0.24s = Training runtime
0.04s = Validation runtime
Fitting model: KNeighborsDist ... Training model for up to 997.32s of the 997.3s of remaining time.
1.0 = Validation score (accuracy)
0.25s = Training runtime
0.04s = Validation runtime
Fitting model: LightGBMXT ... Training model for up to 997.01s of the 996.99s of remaining time.
1.0 = Validation score (accuracy)
2.18s = Training runtime
0.01s = Validation runtime
Fitting model: LightGBM ... Training model for up to 994.78s of the 994.76s of remaining time.
1.0 = Validation score (accuracy)
1.8s = Training runtime
0.01s = Validation runtime
Fitting model: RandomForestGini ... Training model for up to 992.94s of the 992.92s of remaining time.
1.0 = Validation score (accuracy)
1.21s = Training runtime
0.05s = Validation runtime
Fitting model: RandomForestEntr ... Training model for up to 991.64s of the 991.62s of remaining time.
1.0 = Validation score (accuracy)
1.27s = Training runtime
0.05s = Validation runtime
Fitting model: CatBoost ... Training model for up to 990.29s of the 990.27s of remaining time.
1.0 = Validation score (accuracy)
64.37s = Training runtime
0.04s = Validation runtime
Fitting model: ExtraTreesGini ... Training model for up to 925.85s of the 925.83s of remaining time.
1.0 = Validation score (accuracy)
1.27s = Training runtime
0.05s = Validation runtime
Fitting model: ExtraTreesEntr ... Training model for up to 924.49s of the 924.47s of remaining time.
1.0 = Validation score (accuracy)
1.17s = Training runtime
0.06s = Validation runtime
Fitting model: NeuralNetFastAI ... Training model for up to 923.22s of the 923.2s of remaining time.
No improvement since epoch 0: early stopping
1.0 = Validation score (accuracy)
2.03s = Training runtime
0.02s = Validation runtime
Fitting model: XGBoost ... Training model for up to 921.11s of the 921.09s of remaining time.
1.0 = Validation score (accuracy)
4.6s = Training runtime
0.02s = Validation runtime
Fitting model: NeuralNetTorch ... Training model for up to 916.46s of the 916.44s of remaining time.
1.0 = Validation score (accuracy)
23.57s = Training runtime
0.1s = Validation runtime
Fitting model: LightGBMLarge ... Training model for up to 892.76s of the 892.74s of remaining time.
1.0 = Validation score (accuracy)
4.51s = Training runtime
0.01s = Validation runtime
Fitting model: WeightedEnsemble_L2 ... Training model for up to 360.0s of the 888.16s of remaining time.
1.0 = Validation score (accuracy)
0.72s = Training runtime
0.0s = Validation runtime
AutoGluon training complete, total runtime = 112.61s ... Best model: "WeightedEnsemble_L2"
TabularPredictor saved. To load, use: predictor = TabularPredictor.load("AutogluonModels\ag-20240804_045217\")
Computing feature importance via permutation shuffling for 1605 features using 385 rows with 5 shuffle sets...
176.07s = Expected runtime (35.21s per shuffle set)
64.21s = Actual runtime (Completed 5 of 5 shuffle sets)
model score_test score_val pred_time_test pred_time_val fit_time pred_time_test_marginal pred_time_val_marginal fit_time_marginal stack_level can_infer fit_order
0 LightGBM 1.000000 1.0 0.013082 0.012956 1.801648 0.013082 0.012956 1.801648 1 True 4
1 LightGBMLarge 1.000000 1.0 0.015946 0.012242 4.505086 0.015946 0.012242 4.505086 1 True 13
2 WeightedEnsemble_L2 1.000000 1.0 0.016943 0.013239 5.222059 0.000997 0.000998 0.716973 2 True 14
3 NeuralNetFastAI 1.000000 1.0 0.039867 0.019933 2.026086 0.039867 0.019933 2.026086 1 True 10
4 KNeighborsUnif 1.000000 1.0 0.054827 0.035884 0.238206 0.054827 0.035884 0.238206 1 True 1
5 KNeighborsDist 1.000000 1.0 0.059789 0.037874 0.248050 0.059789 0.037874 0.248050 1 True 2
6 ExtraTreesEntr 1.000000 1.0 0.069767 0.058803 1.170361 0.069767 0.058803 1.170361 1 True 9
7 RandomForestGini 1.000000 1.0 0.070763 0.049529 1.210922 0.070763 0.049529 1.210922 1 True 5
8 ExtraTreesGini 1.000000 1.0 0.071760 0.053325 1.265165 0.071760 0.053325 1.265165 1 True 8
9 RandomForestEntr 1.000000 1.0 0.072756 0.048836 1.266010 0.072756 0.048836 1.266010 1 True 6
10 NeuralNetTorch 1.000000 1.0 0.080730 0.098805 23.569311 0.080730 0.098805 23.569311 1 True 12
11 LightGBMXT 0.997403 1.0 0.014950 0.011964 2.182093 0.014950 0.011964 2.182093 1 True 3
12 CatBoost 0.997403 1.0 0.084717 0.037174 64.372088 0.084717 0.037174 64.372088 1 True 7
13 XGBoost 0.994805 1.0 0.032890 0.016943 4.599743 0.032890 0.016943 4.599743 1 True 11
6.5.4 Return
# Filtering the DataFrame to get features where importance > 0 and p_value < 0.05
filtered_importance = importance[(importance['importance'] > 0) & (importance['p_value'] < 0.05)]
# Counting the number of such features
number_of_features = filtered_importance.shape[0]
# Printing the result
print(f"Number of features with importance > 0 and p_value < 0.05: {number_of_features}")
Number of features with importance > 0 and p_value < 0.05: 2
# Display the top 20 rows of the Importance DataFrame
print("Top 20 rows of Importance DataFrame:")
print(importance.head(20))
Top 20 rows of Importance DataFrame:
importance stddev p_value n p99_high p99_low
ISY1-RAB43 0.235844 0.017463 0.000004 5 0.2718 0.199888
RP11-231C14.4 0.235844 0.017463 0.000004 5 0.2718 0.199888
PSORS1C1 0.000000 0.000000 0.500000 5 0.0000 0.000000
PSMC1P1 0.000000 0.000000 0.500000 5 0.0000 0.000000
PRTG 0.000000 0.000000 0.500000 5 0.0000 0.000000
PRSS8 0.000000 0.000000 0.500000 5 0.0000 0.000000
PRSS53 0.000000 0.000000 0.500000 5 0.0000 0.000000
PRSS3 0.000000 0.000000 0.500000 5 0.0000 0.000000
PRSS22 0.000000 0.000000 0.500000 5 0.0000 0.000000
PRR19 0.000000 0.000000 0.500000 5 0.0000 0.000000
PRR15L 0.000000 0.000000 0.500000 5 0.0000 0.000000
PROM2 0.000000 0.000000 0.500000 5 0.0000 0.000000
PRODH 0.000000 0.000000 0.500000 5 0.0000 0.000000
A2M 0.000000 0.000000 0.500000 5 0.0000 0.000000
PRKCQ 0.000000 0.000000 0.500000 5 0.0000 0.000000
PRF1 0.000000 0.000000 0.500000 5 0.0000 0.000000
PRAME 0.000000 0.000000 0.500000 5 0.0000 0.000000
PPP2R2C 0.000000 0.000000 0.500000 5 0.0000 0.000000
PPP1R1B 0.000000 0.000000 0.500000 5 0.0000 0.000000
PPP1R14C 0.000000 0.000000 0.500000 5 0.0000 0.000000
# Display the top 20 rows of the Leaderboard DataFrame
print("\nTop 20 rows of Leaderboard DataFrame:")
print(leaderboard.head(20))
Top 20 rows of Leaderboard DataFrame:
model score_test score_val pred_time_test pred_time_val \
0 LightGBM 1.000000 1.0 0.013082 0.012956
1 LightGBMLarge 1.000000 1.0 0.015946 0.012242
2 WeightedEnsemble_L2 1.000000 1.0 0.016943 0.013239
3 NeuralNetFastAI 1.000000 1.0 0.039867 0.019933
4 KNeighborsUnif 1.000000 1.0 0.054827 0.035884
5 KNeighborsDist 1.000000 1.0 0.059789 0.037874
6 ExtraTreesEntr 1.000000 1.0 0.069767 0.058803
7 RandomForestGini 1.000000 1.0 0.070763 0.049529
8 ExtraTreesGini 1.000000 1.0 0.071760 0.053325
9 RandomForestEntr 1.000000 1.0 0.072756 0.048836
10 NeuralNetTorch 1.000000 1.0 0.080730 0.098805
11 LightGBMXT 0.997403 1.0 0.014950 0.011964
12 CatBoost 0.997403 1.0 0.084717 0.037174
13 XGBoost 0.994805 1.0 0.032890 0.016943
fit_time pred_time_test_marginal pred_time_val_marginal \
0 1.801648 0.013082 0.012956
1 4.505086 0.015946 0.012242
2 5.222059 0.000997 0.000998
3 2.026086 0.039867 0.019933
4 0.238206 0.054827 0.035884
5 0.248050 0.059789 0.037874
6 1.170361 0.069767 0.058803
7 1.210922 0.070763 0.049529
8 1.265165 0.071760 0.053325
9 1.266010 0.072756 0.048836
10 23.569311 0.080730 0.098805
11 2.182093 0.014950 0.011964
12 64.372088 0.084717 0.037174
13 4.599743 0.032890 0.016943
fit_time_marginal stack_level can_infer fit_order
0 1.801648 1 True 4
1 4.505086 1 True 13
2 0.716973 2 True 14
3 2.026086 1 True 10
4 0.238206 1 True 1
5 0.248050 1 True 2
6 1.170361 1 True 9
7 1.210922 1 True 5
8 1.265165 1 True 8
9 1.266010 1 True 6
10 23.569311 1 True 12
11 2.182093 1 True 3
12 64.372088 1 True 7
13 4.599743 1 True 11
6.5.5 Save Data
# Save the Importance DataFrame to a CSV file
importance.to_csv('../test_TransProPy/data/significant_correlation_Autogluon_TimeLimit_importance.csv', index=False)
# Save the Leaderboard DataFrame to a CSV file
leaderboard.to_csv('../test_TransProPy/data/significant_correlation_Autogluon_TimeLimit_leaderboard.csv', index=False)
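As a quick cross-check, the two saved importance tables can be compared to see how many features pass the significance filter in each scenario; this short sketch only re-reads the CSVs saved above.
import pandas as pd

# Compare the number of significant features between the two scenarios.
for name in ('Insignificant_correlation', 'significant_correlation'):
    path = f'../test_TransProPy/data/{name}_Autogluon_TimeLimit_importance.csv'
    imp = pd.read_csv(path)
    n_sig = ((imp['importance'] > 0) & (imp['p_value'] < 0.05)).sum()
    print(f'{name}: {n_sig} significant features')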