3 UtilsFunction3

This section documents the helper functions used by AutoFeatureSelection.py.

3.1 LoadFilterTranspose.py

Loads the expression matrix, filters out features (genes) with a high proportion of zero-expression values, and transposes the result.

3.1.1 Parameters

data_path: string:

For example: ‘../data/gene_tpm.csv’

Please note: The input data matrix should have genes as rows and samples as columns.

threshold: float:

For example: 0.9

The threshold is the proportion of samples with non-zero values, relative to all samples, evaluated separately for each feature (gene).

3.1.2 Returns

X (pandas.core.frame.DataFrame):

The filtered and transposed data.

3.1.3 Usage

load_filter_transpose(
    threshold=0.9, 
    data_path='../data/gene_tpm.csv'
    )
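
As a rough illustration of the filter-then-transpose step, the sketch below applies a non-zero-proportion threshold to a small in-memory matrix. It is not the package's implementation: the helper name filter_and_transpose_sketch, the toy data, and the choice to keep features whose proportion is at or above the threshold are all assumptions.

```python
import pandas as pd


def filter_and_transpose_sketch(data: pd.DataFrame, threshold: float = 0.9) -> pd.DataFrame:
    """Hypothetical helper: genes are rows and samples are columns on input."""
    # Proportion of samples with a non-zero value, computed per gene (row).
    non_zero_fraction = (data != 0).sum(axis=1) / data.shape[1]
    # Assumption: genes at or above the threshold are kept.
    kept = data.loc[non_zero_fraction >= threshold]
    # Transpose so that samples become rows and genes become columns.
    return kept.T


# Toy matrix: 3 genes x 4 samples (values are made up).
toy = pd.DataFrame(
    {
        "s1": [5.0, 0.0, 1.0],
        "s2": [3.0, 0.0, 2.0],
        "s3": [4.0, 7.0, 0.0],
        "s4": [1.0, 0.0, 9.0],
    },
    index=["geneA", "geneB", "geneC"],
)
X = filter_and_transpose_sketch(toy, threshold=0.75)
print(X.shape)
```

In this toy case geneB is non-zero in only one of four samples, so it falls below the 0.75 threshold and is dropped.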

3.2 LoadEncodeLabels.py

Reads a CSV file containing labels and encodes categorical labels in the specified column to numeric labels.

3.2.1 Parameters

file_path (str):

Path to the CSV file containing labels.

column_name (str):

Name of the column to be encoded.

3.2.2 Returns

Y (pd.DataFrame):

A DataFrame containing the encoded numeric labels.

3.2.3 Usage

load_encode_labels(
    file_path='../data/class.csv', 
    column_name='class'
    )
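
The encoding step can be pictured with the minimal sketch below. It assumes the first CSV column holds sample IDs and uses pandas.factorize as a stand-in encoder; the package may use a different encoder, and the helper name encode_labels_sketch and the toy labels are hypothetical.

```python
import io

import pandas as pd


def encode_labels_sketch(csv_source, column_name: str) -> pd.DataFrame:
    """Hypothetical helper: read labels and map each category to an integer."""
    # Assumption: the first column contains the sample IDs.
    labels = pd.read_csv(csv_source, index_col=0)
    # pd.factorize assigns integer codes; sort=True makes the codes follow
    # the sorted order of the category names.
    codes, categories = pd.factorize(labels[column_name], sort=True)
    return pd.DataFrame({column_name: codes}, index=labels.index)


# Toy CSV held in memory; the sample IDs and class names are made up.
toy_csv = io.StringIO("sample,class\ns1,tumor\ns2,normal\ns3,tumor\n")
Y = encode_labels_sketch(toy_csv, "class")
print(Y)
```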

3.3 ExtractCommonSamples.py

Extracts common samples (rows) from two DataFrames based on their indices.

3.3.1 Parameters

X (pd.DataFrame):

First DataFrame.

Y (pd.DataFrame):

Second DataFrame.

3.3.2 Returns

X_common, Y_common (pd.DataFrame):

Two DataFrames containing only the rows that are common in both.

3.3.3 Usage

extract_common_samples(
    X, 
    Y
    )
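
One way to align two DataFrames on their shared indices is sketched below. The helper name common_samples_sketch and the toy frames are hypothetical, and the package's own function may differ in detail.

```python
import pandas as pd


def common_samples_sketch(X: pd.DataFrame, Y: pd.DataFrame):
    """Hypothetical helper: keep only rows whose index appears in both frames."""
    shared = X.index.intersection(Y.index)
    # Selecting with the same index object keeps both outputs in the same order.
    return X.loc[shared], Y.loc[shared]


X = pd.DataFrame({"geneA": [1.0, 2.0, 3.0]}, index=["s1", "s2", "s3"])
Y = pd.DataFrame({"class": [0, 1, 1]}, index=["s2", "s3", "s4"])
X_common, Y_common = common_samples_sketch(X, Y)
print(list(X_common.index))
```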

3.4 LoadAndPreprocessData.py

Load and preprocess the data.

3.4.1 Parameters

feature_file: str:

Path to the feature data file.

label_file: str:

Path to the label data file.

label_column: str:

Column name of the labels in the label file.

threshold: float:

Threshold for filtering in the load_filter_transpose function.

3.4.2 Returns

X (DataFrame):

Preprocessed feature data.

Y (ndarray):

Preprocessed label data.

3.4.3 Usage

load_and_preprocess_data(
    feature_file, 
    label_file, 
    label_column, 
    threshold
    )

3.5 SetupLoggingAndProgressBar.py

Set up logging and initialize a tqdm progress bar.

3.5.1 Parameters

n_iter (int):

Number of iterations for RandomizedSearchCV.

n_cv (int):

Number of cross-validation folds.

3.5.2 Returns

tqdm object

An initialized tqdm progress bar.

3.5.3 Usage

setup_logging_and_progress_bar(
    n_iter, 
    n_cv
    )

3.6 UpdateProgressBar.py

Read the number of log entries in the log file and update the tqdm progress bar.

3.6.1 Parameters

pbar (tqdm):

The tqdm progress bar object.

log_file (str):

Path to the log file, default is ‘progress.log’.

3.6.2 Usage

update_progress_bar(
    pbar, 
    log_file='progress.log'
    )
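
The idea of driving a progress bar from a log file can be sketched as follows. This is only an illustration under stated assumptions: one log line per completed fit, a bar total of n_iter * n_cv, and a hypothetical helper named update_progress_bar_sketch. The log path and its contents are made up for the demo.

```python
import io
import os
import tempfile

from tqdm import tqdm


def update_progress_bar_sketch(pbar, log_file="progress.log"):
    """Hypothetical helper: set the bar position to the number of log lines."""
    with open(log_file, "r") as handle:
        n_entries = sum(1 for _ in handle)
    # tqdm exposes the current count as `n`; refresh() redraws the bar.
    pbar.n = n_entries
    pbar.refresh()
    return n_entries


# Assumption: one log line per completed fit, so total = n_iter * n_cv.
n_iter, n_cv = 2, 3
log_path = os.path.join(tempfile.mkdtemp(), "progress.log")
with open(log_path, "w") as handle:
    handle.write("fit 1 done\nfit 2 done\n")

# The bar is written to an in-memory buffer to keep the demo quiet.
pbar = tqdm(total=n_iter * n_cv, file=io.StringIO())
completed = update_progress_bar_sketch(pbar, log_path)
pbar.close()
print(completed)
```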

3.7 LoggingCustomScorer.py

Creates a custom scorer function for use in model evaluation processes. This scorer logs both the accuracy score and the time taken for each call.

3.7.1 Parameters

n_iter (int):

Number of iterations for the search process. Default is 10.

n_cv (int):

Number of cross-validation splits. Default is 5.

3.7.2 Returns

custom_scorer (function):

A custom scorer function that logs the accuracy score and time taken for each call.

3.7.3 Usage

logging_custom_scorer(
    n_iter=10, 
    n_cv=5
    )
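
scikit-learn accepts any callable with the signature scorer(estimator, X, y) as a scoring argument, so a logging scorer can be built as a closure. The sketch below is an assumption-laden illustration, not the package's code: the factory name logging_scorer_sketch, the log message format, and the toy estimator AlwaysOne are hypothetical.

```python
import logging
import time


def logging_scorer_sketch(n_iter=10, n_cv=5):
    """Hypothetical factory returning a scorer with the (estimator, X, y) signature."""
    total_fits = n_iter * n_cv  # assumption: one scorer call per fit

    def custom_scorer(estimator, X, y):
        start = time.time()
        predictions = estimator.predict(X)
        # Accuracy: fraction of predictions that match the true labels.
        correct = sum(1 for p, t in zip(predictions, y) if p == t)
        accuracy = correct / len(y)
        elapsed = time.time() - start
        logging.info(
            "accuracy=%.4f time=%.4fs (of %d expected calls)",
            accuracy, elapsed, total_fits,
        )
        return accuracy

    return custom_scorer


class AlwaysOne:
    """Toy stand-in estimator that predicts class 1 for every sample."""

    def predict(self, X):
        return [1 for _ in X]


scorer = logging_scorer_sketch(n_iter=2, n_cv=2)
score = scorer(AlwaysOne(), [[0], [1], [2], [3]], [1, 0, 1, 1])
print(score)
```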

3.8 TqdmCustomScorer.py

Creates a custom scorer for model evaluation, integrating a progress bar with tqdm.

3.8.1 Parameters

n_iter: int (optional):

Number of iterations for the search process. Default is 10.

n_cv: int (optional):

Number of cross-validation splits. Default is 5.

3.8.2 Returns

function:

A custom scorer function that can be used with model evaluation methods like RandomizedSearchCV.

3.8.3 Description

The tqdm_custom_scorer function creates a scorer for model evaluation, incorporating a tqdm progress bar to monitor the evaluation process. This scorer is especially useful in processes like RandomizedSearchCV, where it provides real-time feedback on the number of iterations and cross-validation steps completed.

3.8.4 Usage

custom_scorer = tqdm_custom_scorer(n_iter=10, n_cv=5)
# Use this scorer in RandomizedSearchCV or similar methods

3.9 TrainModel.py

Set up and run the model training process.

3.9.1 Parameters

X: DataFrame:

Feature data.

Y: ndarray:

Label data.

feature_selection: FeatureUnion:

The feature selection process.

parameters: dict:

Parameters for RandomizedSearchCV.

n_iter: int:

Number of iterations for RandomizedSearchCV.

n_cv: int:

Number of cross-validation folds.

n_jobs: int:

Number of jobs to run in parallel (default is 9).

3.9.2 Returns

clf

RandomizedSearchCV object after fitting.

3.9.3 Usage

train_model(
    X, 
    Y, 
    feature_selection, 
    parameters, 
    n_iter, 
    n_cv, 
    n_jobs=9
    )

3.10 EnsembleForRFE.py

Set up and run the Ensemble model for Recursive Feature Elimination.

3.10.1 Parameters

svm_C: float:

Regularization parameter for SVM.

tree_max_depth: int:

Maximum depth of the decision tree.

tree_min_samples_split: int:

Minimum number of samples required to split an internal node.

gbm_learning_rate: float:

Learning rate for gradient boosting.

gbm_n_estimators: int:

Number of boosting stages for gradient boosting.

3.10.2 Attributes

feature_importances_:

Array of feature importances after fitting the model.

3.10.3 Methods

fit(X, y):

Fit the model to data matrix X and target(s) y.

predict(X):

Predict class labels for samples in X.

set_params(**params):

Set parameters for the ensemble estimator.

3.11 SetupFeatureSelection.py

Sets up the feature selection process in TransProPy.UtilsFunction3. The function builds a feature selection pipeline suited to models that benefit from ensemble methods and mutual information-based feature selection.

3.11.1 Returns

feature_selection: FeatureUnion:

A combined feature selection process.

3.11.2 Description

The setup_feature_selection function initializes and returns a FeatureUnion object for feature selection. This union includes:

- RFECV: Utilizes an EnsembleForRFE estimator with StratifiedKFold(5) for cross-validation, focusing on accuracy.
- SelectKBest: Applies mutual_info_classif for feature scoring.

The combination of these techniques provides a robust approach to feature selection in machine learning models.

3.11.3 Usage

feature_selection = setup_feature_selection()
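
A union of this shape can be sketched with scikit-learn as shown below. This is not the package's implementation: DecisionTreeClassifier stands in for EnsembleForRFE, the value of k and the synthetic data are assumptions, and the helper name feature_union_sketch is hypothetical.

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import RFECV, SelectKBest, mutual_info_classif
from sklearn.model_selection import StratifiedKFold
from sklearn.pipeline import FeatureUnion
from sklearn.tree import DecisionTreeClassifier


def feature_union_sketch(k=3):
    """Hypothetical stand-in for setup_feature_selection."""
    # DecisionTreeClassifier replaces EnsembleForRFE here; RFECV only needs an
    # estimator that exposes feature_importances_ or coef_ after fitting.
    rfecv = RFECV(
        estimator=DecisionTreeClassifier(random_state=0),
        cv=StratifiedKFold(5),
        scoring="accuracy",
    )
    # k is an assumed value chosen for the toy data.
    kbest = SelectKBest(score_func=mutual_info_classif, k=k)
    return FeatureUnion([("rfecv", rfecv), ("kbest", kbest)])


# Synthetic data: 60 samples, 8 features.
X, y = make_classification(n_samples=60, n_features=8, n_informative=4, random_state=0)
feature_selection = feature_union_sketch(k=3)
X_selected = feature_selection.fit_transform(X, y)
print(X_selected.shape)
```

Note that FeatureUnion concatenates the outputs of its transformers, so a feature chosen by both selectors appears twice in the combined matrix.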

3.12 PrintBoxedText.py

Prints a title in a boxed format in the console output.

3.12.1 Parameters

title: str:

The text to be displayed inside the box.

3.12.2 Returns

None. This function directly prints the formatted title to the console.

3.12.3 Description

This function creates a box around the given title text using hash (#) and equals (=) symbols. It prints the title with a border on the top and bottom, making it stand out in the console output. The border line consists of a hash symbol, followed by equals symbols the length of the title plus two (for padding), and then another hash symbol.

3.12.4 Usage Example

print_boxed_text("Example Title")
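
The border rule described above can be reproduced with a few lines of Python. The sketch below is a hypothetical re-creation (boxed_text_sketch is not part of the package), and the two-space indentation of the title line is an assumption, since only the border is specified.

```python
def boxed_text_sketch(title: str) -> str:
    """Hypothetical helper: build the boxed title as a single string."""
    # Border: '#', then '=' repeated len(title) + 2 times, then '#'.
    border = "#" + "=" * (len(title) + 2) + "#"
    # Assumption: the title is indented by two spaces so it lines up
    # with the padded region between the two hash symbols.
    return "\n".join([border, "  " + title, border])


print(boxed_text_sketch("Example Title"))
```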

3.13 ExtractAndSaveResults.py

Extracts and saves results from a trained model. The function uses matplotlib for plotting, pandas for data handling, and a custom print_boxed_text function for formatted output.

3.13.1 Parameters

clf: trained model (RandomizedSearchCV object):

The classifier object after training.

X: DataFrame:

Feature data used for training.

save_path: str:

Base path for saving results.

show_plot: bool (optional):

Whether to display the plot. Default is False.

use_tkagg: bool (optional):

Whether to use the ‘TkAgg’ backend for matplotlib. Generally, choose False when working in the PyCharm IDE, and choose True when rendering file.qmd to an HTML file.

3.13.2 Description

This function performs a comprehensive analysis and extraction of results from a trained model. It includes:

- Extracting and plotting cross-validation results.
- Identifying and printing features selected by RFECV and SelectKBest.
- Combining and saving selected features in a CSV file.
- Extracting and saving feature importances from EnsembleForRFE.
- Extracting and saving scores from SelectKBest.

3.13.3 Usage

extract_and_save_results(clf, X, "path/to/save/", show_plot=True)