3 UtilsFunction3
This section documents the helper functions used by AutoFeatureSelection.py.
3.1 LoadFilterTranspose.py
Remove samples with high zero expression.
3.1.1 Parameters
- data_path: string:
- For example: ‘../data/gene_tpm.csv’
- Please note: The input data matrix should have genes as rows and samples as columns.
- threshold: float:
- For example: 0.9
- The threshold is the proportion of samples with non-zero values among all samples, evaluated for each feature.
3.1.2 Returns
- X (pandas.core.frame.DataFrame):
3.1.3 Usage
load_filter_transpose(
    data_path='../data/gene_tpm.csv',
    threshold=0.9
)
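The filtering rule described under `threshold` can be sketched as follows. This is a minimal illustration and not the package's implementation: it assumes a feature (gene) is kept when its proportion of non-zero samples is at least `threshold`, and it works on an in-memory DataFrame instead of reading `data_path`. The helper name `filter_and_transpose` and the toy values are hypothetical.

```python
import pandas as pd


def filter_and_transpose(data: pd.DataFrame, threshold: float = 0.9) -> pd.DataFrame:
    """Keep genes whose share of non-zero samples is >= threshold, then transpose.

    Assumes genes are rows and samples are columns, as the input format requires.
    """
    # Proportion of samples with a non-zero value, computed per gene (row).
    nonzero_fraction = (data != 0).sum(axis=1) / data.shape[1]
    kept = data.loc[nonzero_fraction >= threshold]
    # Transpose so that samples become rows and genes become columns.
    return kept.T


# Toy matrix: 3 genes x 4 samples (values are made up for illustration).
toy = pd.DataFrame(
    {"s1": [5, 0, 1], "s2": [3, 0, 2], "s3": [8, 0, 0], "s4": [1, 2, 4]},
    index=["geneA", "geneB", "geneC"],
)
X = filter_and_transpose(toy, threshold=0.75)
```

In this toy case geneB is non-zero in only one of four samples and is dropped, while geneA and geneC remain as columns of `X`.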
3.2 LoadEncodeLabels.py
Reads a CSV file containing labels and encodes categorical labels in the specified column to numeric labels.
3.2.1 Parameters
- file_path (str):
- Path to the CSV file containing labels.
- column_name (str):
- Name of the column to be encoded.
3.2.2 Returns
- Y (pd.DataFrame):
- A DataFrame containing the encoded numeric labels.
3.2.3 Usage
load_encode_labels(
    file_path='../data/class.csv',
    column_name='class'
)
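As a rough sketch of the encoding step, the example below maps the categories in one column to integer codes. It assumes codes are assigned in sorted category order and that the first CSV column holds sample identifiers; neither detail is stated above. The helper name `encode_label_column` and the inline CSV text are hypothetical.

```python
import io

import pandas as pd


def encode_label_column(labels: pd.DataFrame, column_name: str) -> pd.DataFrame:
    """Return a one-column DataFrame of integer codes for `column_name`."""
    # factorize(sort=True) maps the sorted unique categories to 0, 1, 2, ...
    codes, _categories = pd.factorize(labels[column_name], sort=True)
    return pd.DataFrame({column_name: codes}, index=labels.index)


# Inline CSV text stands in for a label file such as '../data/class.csv'.
csv_text = "sample,class\ns1,tumor\ns2,normal\ns3,tumor\n"
labels = pd.read_csv(io.StringIO(csv_text), index_col=0)
Y = encode_label_column(labels, "class")
```

Here "normal" becomes 0 and "tumor" becomes 1, and the sample identifiers are preserved as the index of `Y`.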
3.3 ExtractCommonSamples.py
Extracts common samples (rows) from two DataFrames based on their indices.
3.3.1 Parameters
- X (pd.DataFrame):
- First DataFrame.
- Y (pd.DataFrame):
- Second DataFrame.
3.3.2 Returns
- X_common, Y_common (pd.DataFrame):
- Two DataFrames containing only the rows that are common in both.
3.3.3 Usage
extract_common_samples(
X,
Y )
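One way to express index-based matching in pandas is through `Index.intersection`. The sketch below is an assumption about how such a helper could look, not the package's code; the name `common_rows` and the toy frames are hypothetical.

```python
import pandas as pd


def common_rows(X: pd.DataFrame, Y: pd.DataFrame):
    """Return both frames restricted to the index labels they share."""
    # Index.intersection keeps only labels present in both indices.
    shared = X.index.intersection(Y.index)
    return X.loc[shared], Y.loc[shared]


X = pd.DataFrame({"geneA": [1.0, 2.0, 3.0]}, index=["s1", "s2", "s3"])
Y = pd.DataFrame({"class": [0, 1, 1]}, index=["s2", "s3", "s4"])
X_common, Y_common = common_rows(X, Y)
```

Both results contain only the samples s2 and s3, in the same order, so features and labels stay aligned row by row.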
3.4 LoadAndPreprocessData.py
Load and preprocess the data.
3.4.1 Parameters
- feature_file: str:
- Path to the feature data file.
- label_file: str:
- Path to the label data file.
- label_column: str:
- Column name of the labels in the label file.
- threshold: float:
- Threshold for filtering in load_filter_transpose function.
3.4.2 Returns
- X (DataFrame):
- Preprocessed feature data.
- Y (ndarray):
- Preprocessed label data.
3.4.3 Usage
load_and_preprocess_data(
feature_file,
label_file,
label_column,
threshold )
3.5 SetupLoggingAndProgressBar.py
Set up logging and initialize a tqdm progress bar.
3.5.1 Parameters
- n_iter (int):
- Number of iterations for RandomizedSearchCV.
- n_cv (int):
- Number of cross-validation folds.
3.5.2 Returns
- tqdm object
- An initialized tqdm progress bar.
3.5.3 Usage
setup_logging_and_progress_bar(
n_iter,
n_cv )
3.6 UpdateProgressBar.py
Read the number of log entries in the log file and update the tqdm progress bar.
3.6.1 Parameters
- pbar (tqdm):
- The tqdm progress bar object.
- log_file (str):
- Path to the log file, default is ‘progress.log’.
3.6.2 Usage
update_progress_bar(
    pbar,
    log_file='progress.log'
)
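The idea of driving a progress bar from a log file can be sketched with the standard library alone. The example assumes one log entry per non-empty line and a progress bar object that exposes an `n` attribute and an `update()` method, as tqdm bars do. The helper names `count_log_entries` and `sync_progress` are hypothetical.

```python
import os
import tempfile


def count_log_entries(log_file: str = "progress.log") -> int:
    """Count non-empty lines, assuming one log entry per line."""
    if not os.path.exists(log_file):
        return 0
    with open(log_file, "r", encoding="utf-8") as handle:
        return sum(1 for line in handle if line.strip())


def sync_progress(pbar, log_file: str = "progress.log") -> None:
    """Advance a tqdm-like bar so that its count matches the log entries."""
    delta = count_log_entries(log_file) - pbar.n
    if delta > 0:
        pbar.update(delta)


# Demonstration with a temporary log file holding three entries.
with tempfile.TemporaryDirectory() as tmp_dir:
    demo_log = os.path.join(tmp_dir, "progress.log")
    with open(demo_log, "w", encoding="utf-8") as handle:
        handle.write("fit 1\nfit 2\nfit 3\n")
    entries = count_log_entries(demo_log)
```

Updating by the difference rather than by a fixed step keeps the bar correct even if several entries are written between two polls.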
3.7 LoggingCustomScorer.py
Creates a custom scorer function for use in model evaluation processes. This scorer logs both the accuracy score and the time taken for each call.
3.7.1 Parameters
- n_iter (int):
- Number of iterations for the search process. Default is 10.
- n_cv (int):
- Number of cross-validation splits. Default is 5.
3.7.2 Returns
- custom_scorer (function):
- A custom scorer function that logs the accuracy score and time taken for each call.
3.7.3 Usage
logging_custom_scorer(
    n_iter=10,
    n_cv=5
)
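A scorer of this kind can be sketched as a factory that returns a callable with the `(estimator, X, y)` signature scikit-learn accepts for custom scoring. The version below is a simplified assumption: it omits `n_iter` and `n_cv`, computes accuracy by hand, and logs through a logger passed in by the caller. The names `make_logging_scorer` and `AlwaysOne` are hypothetical.

```python
import logging
import time


def make_logging_scorer(logger: logging.Logger):
    """Return a scorer that logs accuracy and elapsed time on every call."""

    def custom_scorer(estimator, X, y):
        start = time.perf_counter()
        predictions = estimator.predict(X)
        # Accuracy: fraction of predictions equal to the true labels.
        correct = sum(int(p == t) for p, t in zip(predictions, y))
        accuracy = correct / len(y)
        elapsed = time.perf_counter() - start
        logger.info("accuracy=%.4f time=%.4fs", accuracy, elapsed)
        return accuracy

    return custom_scorer


class AlwaysOne:
    """Stand-in estimator for the demonstration; predicts 1 for every sample."""

    def predict(self, X):
        return [1 for _ in X]


scorer = make_logging_scorer(logging.getLogger("scorer_demo"))
score = scorer(AlwaysOne(), [[0], [1], [2], [3]], [1, 1, 0, 1])
```

Because each call writes one log record, the number of records can later be used as a progress count for the search.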
3.8 TqdmCustomScorer.py
Creates a custom scorer for model evaluation, integrating a progress bar with tqdm.
3.8.1 Parameters
- n_iter: int (optional):
- Number of iterations for the search process. Default is 10.
- n_cv: int (optional):
- Number of cross-validation splits. Default is 5.
3.8.2 Returns
- function:
- A custom scorer function that can be used with model evaluation methods like RandomizedSearchCV.
3.8.3 Description
The tqdm_custom_scorer function creates a scorer for model evaluation, incorporating a tqdm progress bar to monitor the evaluation process. This scorer is especially useful in processes like RandomizedSearchCV, where it provides real-time feedback on the number of iterations and cross-validation steps completed.
3.8.4 Usage
custom_scorer = tqdm_custom_scorer(n_iter=10, n_cv=5)
# Use this scorer in RandomizedSearchCV or similar methods
3.9 TrainModel.py
Set up and run the model training process.
3.9.1 Parameters
- X: DataFrame:
- Feature data.
- Y: ndarray:
- Label data.
- feature_selection: FeatureUnion:
- The feature selection process.
- parameters: dict:
- Parameters for RandomizedSearchCV.
- n_iter: int:
- Number of iterations for RandomizedSearchCV.
- n_cv: int:
- Number of cross-validation folds.
- n_jobs: int:
- Number of jobs to run in parallel (default is 9).
3.9.2 Returns
- clf
- RandomizedSearchCV object after fitting.
3.9.3 Usage
train_model(
    X,
    Y,
    feature_selection,
    parameters,
    n_iter,
    n_cv,
    n_jobs=9
)
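To make the roles of `feature_selection`, `parameters`, `n_iter`, `n_cv`, and `n_jobs` concrete, here is a small scikit-learn sketch of a randomized search over a pipeline. Several parts are assumptions not confirmed above: that the feature selection step is wrapped in a `Pipeline`, the choice of `LogisticRegression` as final estimator, the use of `f_classif` in a one-member `FeatureUnion`, and the synthetic data.

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import RandomizedSearchCV
from sklearn.pipeline import FeatureUnion, Pipeline

# Synthetic data stands in for the real feature matrix and labels.
X, Y = make_classification(
    n_samples=60, n_features=10, n_informative=4, random_state=0
)

# A one-member FeatureUnion keeps the example small.
feature_selection = FeatureUnion([("kbest", SelectKBest(f_classif))])

pipeline = Pipeline([
    ("feature_selection", feature_selection),
    ("classifier", LogisticRegression(max_iter=1000)),  # assumed final estimator
])

# Nested parameter names follow the step__substep__parameter convention.
parameters = {
    "feature_selection__kbest__k": [3, 5, 8],
    "classifier__C": [0.1, 1.0, 10.0],
}

clf = RandomizedSearchCV(
    pipeline,
    param_distributions=parameters,
    n_iter=4,   # number of sampled parameter settings
    cv=3,       # number of cross-validation folds
    scoring="accuracy",
    n_jobs=1,   # kept at 1 here; the documented default above is 9
    random_state=0,
)
clf.fit(X, Y)
```

After fitting, `clf.best_params_` and `clf.cv_results_` hold the selected setting and the per-setting cross-validation scores.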
3.10 EnsembleForRFE.py
Set up and run the Ensemble model for Recursive Feature Elimination.
3.10.1 Parameters
- svm_C: float:
- Regularization parameter for SVM.
- tree_max_depth: int:
- Maximum depth of the decision tree.
- tree_min_samples_split: int:
- Minimum number of samples required to split an internal node.
- gbm_learning_rate: float:
- Learning rate for gradient boosting.
- gbm_n_estimators: int:
- Number of boosting stages for gradient boosting.
3.10.2 Attributes
- feature_importances_:
- Array of feature importances after fitting the model.
3.10.3 Methods
- fit(X, y):
- Fit the model to data matrix X and target(s) y.
- predict(X):
- Predict class labels for samples in X.
- set_params(**params):
- Set parameters for the ensemble estimator.
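The interface listed above (constructor parameters, `fit`, `predict`, `set_params`, and `feature_importances_`) can be illustrated with a scikit-learn style estimator. The sketch below is an assumption about one possible design, not the package's implementation: it combines a linear SVM, a decision tree, and gradient boosting by hard voting, and averages their normalized importances. The class name `EnsembleSketch` is hypothetical.

```python
import numpy as np
from sklearn.base import BaseEstimator, ClassifierMixin
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier, VotingClassifier
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier


class EnsembleSketch(BaseEstimator, ClassifierMixin):
    """Voting ensemble that exposes feature_importances_ so RFE can rank features."""

    def __init__(self, svm_C=1.0, tree_max_depth=None, tree_min_samples_split=2,
                 gbm_learning_rate=0.1, gbm_n_estimators=100):
        # Storing parameters unchanged lets BaseEstimator provide set_params.
        self.svm_C = svm_C
        self.tree_max_depth = tree_max_depth
        self.tree_min_samples_split = tree_min_samples_split
        self.gbm_learning_rate = gbm_learning_rate
        self.gbm_n_estimators = gbm_n_estimators

    def fit(self, X, y):
        self.ensemble_ = VotingClassifier(
            estimators=[
                ("svm", SVC(kernel="linear", C=self.svm_C)),
                ("tree", DecisionTreeClassifier(
                    max_depth=self.tree_max_depth,
                    min_samples_split=self.tree_min_samples_split,
                    random_state=0)),
                ("gbm", GradientBoostingClassifier(
                    learning_rate=self.gbm_learning_rate,
                    n_estimators=self.gbm_n_estimators,
                    random_state=0)),
            ],
            voting="hard",
        )
        self.ensemble_.fit(X, y)
        self.classes_ = self.ensemble_.classes_
        fitted = self.ensemble_.named_estimators_
        # Linear SVM: use absolute coefficients, normalized to sum to 1.
        svm_weights = np.abs(fitted["svm"].coef_).mean(axis=0)
        svm_importance = svm_weights / svm_weights.sum()
        # Assumed aggregation: plain average over the three members.
        self.feature_importances_ = (
            svm_importance
            + fitted["tree"].feature_importances_
            + fitted["gbm"].feature_importances_
        ) / 3.0
        return self

    def predict(self, X):
        return self.ensemble_.predict(X)


X, y = make_classification(n_samples=80, n_features=8, n_informative=4, random_state=0)
model = EnsembleSketch(gbm_n_estimators=20).fit(X, y)
importances = model.feature_importances_
```

Building the member estimators inside `fit` rather than in `__init__` follows the scikit-learn convention, so values changed through `set_params` take effect on the next fit.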
3.11 SetupFeatureSelection.py
Set up the feature selection process in TransProPy.UtilsFunction3. This function is particularly useful for setting up a feature selection pipeline, especially in models that benefit from ensemble methods and mutual information-based feature selection.
3.11.1 Returns
- feature_selection: FeatureUnion:
- A combined feature selection process.
3.11.2 Description
The setup_feature_selection function initializes and returns a FeatureUnion object for feature selection. This union includes:
- RFECV: Utilizes an EnsembleForRFE estimator with StratifiedKFold(5) for cross-validation, focusing on accuracy.
- SelectKBest: Applies mutual_info_classif for feature scoring.
The combination of these techniques provides a robust approach to feature selection in machine learning models.
3.11.3 Usage
feature_selection = setup_feature_selection()
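A union of RFECV and SelectKBest can be assembled directly with scikit-learn, as in the sketch below. Two choices are assumptions made to keep the example self-contained: a `DecisionTreeClassifier` stands in for the EnsembleForRFE estimator, and `k=5` is an arbitrary value for SelectKBest. The data is synthetic.

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import RFECV, SelectKBest, mutual_info_classif
from sklearn.model_selection import StratifiedKFold
from sklearn.pipeline import FeatureUnion
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(
    n_samples=80, n_features=12, n_informative=4, random_state=0
)

feature_selection = FeatureUnion([
    ("rfecv", RFECV(
        estimator=DecisionTreeClassifier(random_state=0),  # stand-in estimator
        cv=StratifiedKFold(5),
        scoring="accuracy",
    )),
    ("kbest", SelectKBest(score_func=mutual_info_classif, k=5)),  # assumed k
])

# FeatureUnion concatenates the outputs of both selectors column-wise.
X_selected = feature_selection.fit_transform(X, y)
```

Since the outputs are concatenated, a feature chosen by both selectors appears twice in `X_selected`; downstream code that reports selected features may want to deduplicate names.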
3.12 PrintBoxedText.py
Prints a title in a boxed format in the console output.
3.12.1 Parameters
- title: str:
- The text to be displayed inside the box.
3.12.2 Returns
- None. This function directly prints the formatted title to the console.
3.12.3 Description
This function creates a box around the given title text using hash (#) and equals (=) symbols. It prints the title with a border on the top and bottom, making it stand out in the console output. The border line consists of a hash symbol, followed by equals symbols the length of the title plus two (for padding), and then another hash symbol.
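The border rule described above can be written out in a few lines. The layout of the middle line (hash, space, title, space, hash) is an assumption inferred from the "plus two (for padding)" remark, and the helper name `boxed_text` is hypothetical.

```python
def boxed_text(title: str) -> str:
    """Build the boxed title as a single string with three lines."""
    # Border: hash, then '=' repeated len(title) + 2 times, then hash.
    border = "#" + "=" * (len(title) + 2) + "#"
    # Assumed layout of the middle line: one space of padding on each side.
    middle = "# " + title + " #"
    return "\n".join([border, middle, border])


box = boxed_text("Example Title")
print(box)
```

With this layout the border and the title line have the same width, so the box edges line up in a monospaced console.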
3.12.4 Usage Example
print_boxed_text("Example Title")
3.13 ExtractAndSaveResults.py
Extracts and saves results from a trained model. The function uses matplotlib for plotting, pandas for data handling, and a custom print_boxed_text function for formatted output.
3.13.1 Parameters
- clf: trained model (RandomizedSearchCV object):
- The classifier object after training.
- X: DataFrame:
- Feature data used for training.
- save_path: str:
- Base path for saving results.
- show_plot: bool (optional):
- Whether to display the plot. Default is False.
- use_tkagg: bool (optional):
- Whether to use ‘TkAgg’ backend for matplotlib. Generally, choose False when using in PyCharm IDE, and choose True when rendering file.qmd to an HTML file.
3.13.2 Description
This function performs a comprehensive analysis and extraction of results from a trained model. It includes:
- Extracting and plotting cross-validation results.
- Identifying and printing features selected by RFECV and SelectKBest.
- Combining and saving selected features in a CSV file.
- Extracting and saving feature importances from EnsembleForRFE.
- Extracting and saving scores from SelectKBest.
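The step of combining the features selected by the two methods can be sketched with boolean masks, such as those returned by the `get_support()` method of scikit-learn selectors. The example assumes "combining" means taking the union without duplicates; the feature names, the masks, and the output column name `feature` are made up for illustration, and an in-memory buffer replaces the file under `save_path`.

```python
import io

import numpy as np
import pandas as pd

feature_names = pd.Index(["geneA", "geneB", "geneC", "geneD"])
rfecv_mask = np.array([True, False, True, False])  # e.g. from RFECV.get_support()
kbest_mask = np.array([False, False, True, True])  # e.g. from SelectKBest.get_support()

# Union of both selections; a feature picked by both is listed once.
combined = feature_names[rfecv_mask | kbest_mask]

# Write the combined list as CSV (to a buffer here instead of a file on disk).
buffer = io.StringIO()
pd.DataFrame({"feature": combined}).to_csv(buffer, index=False)
csv_text = buffer.getvalue()
```

Working from masks keeps the result in the original column order of the feature matrix, which makes the saved list easy to map back to `X`.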
3.13.3 Usage
extract_and_save_results(clf, X, "path/to/save/", show_plot=True)