3  UtilsFunction3

This section documents the helper functions used by AutoFeatureSelection.py.

3.1 LoadFilterTranspose.py

Load the expression matrix, filter out features with a high proportion of zero expression, and transpose it.

3.1.1 Parameters

  • data_path: string:
    • For example: ‘../data/gene_tpm.csv’
    • Please note: The input data matrix should have genes as rows and samples as columns.
  • threshold: float:
    • For example: 0.9
    • The threshold on the proportion of samples with non-zero values among all samples, evaluated per feature.

3.1.2 Returns

  • X (pandas.core.frame.DataFrame):

3.1.3 Usage

load_filter_transpose(
    threshold=0.9, 
    data_path='../data/gene_tpm.csv'
    )
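
The steps suggested by the function name and the threshold description can be sketched as follows. This is a minimal sketch, not the package's implementation: the name load_filter_transpose_sketch, the use of the first CSV column as the gene index, and the strict ">" comparison are all assumptions made for illustration.

```python
import io

import pandas as pd


def load_filter_transpose_sketch(data_path, threshold=0.9):
    """Hypothetical re-implementation for illustration only."""
    # Assumption: the first CSV column holds the gene identifiers.
    data = pd.read_csv(data_path, index_col=0)  # genes as rows, samples as columns
    data = data.T                               # samples as rows, genes as columns
    # Fraction of samples with a non-zero value, computed per gene (feature).
    nonzero_fraction = (data != 0).mean(axis=0)
    # Assumption: a gene is kept when its fraction exceeds the threshold.
    return data.loc[:, nonzero_fraction > threshold]


# Tiny in-memory example: G2 is zero in 3 of 4 samples, so it is dropped.
csv_text = "gene,s1,s2,s3,s4\nG1,1,2,3,4\nG2,0,0,0,5\nG3,1,0,2,3\n"
X = load_filter_transpose_sketch(io.StringIO(csv_text), threshold=0.7)
print(X)
```

The returned frame has samples as rows, which is the orientation the later helpers (for example extract_common_samples) operate on.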

3.2 LoadEncodeLabels.py

Reads a CSV file containing labels and encodes categorical labels in the specified column to numeric labels.

3.2.1 Parameters

  • file_path (str):
    • Path to the CSV file containing labels.
  • column_name (str):
    • Name of the column to be encoded.

3.2.2 Returns

  • Y (pd.DataFrame):
    • A DataFrame containing the encoded numeric labels.

3.2.3 Usage

load_encode_labels(
    file_path='../data/class.csv', 
    column_name='class'
    )
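
One way to encode a label column is scikit-learn's LabelEncoder, as in the sketch below. Whether the package uses LabelEncoder is not stated here, so treat this as an assumption; the function name load_encode_labels_sketch and the choice of the first CSV column as the sample index are also hypothetical.

```python
import io

import pandas as pd
from sklearn.preprocessing import LabelEncoder


def load_encode_labels_sketch(file_path, column_name):
    """Hypothetical re-implementation for illustration only."""
    # Assumption: the first CSV column holds the sample identifiers.
    labels = pd.read_csv(file_path, index_col=0)
    # LabelEncoder maps the sorted unique categories to 0, 1, 2, ...
    encoded = LabelEncoder().fit_transform(labels[column_name])
    return pd.DataFrame({column_name: encoded}, index=labels.index)


csv_text = "sample,class\ns1,tumor\ns2,normal\ns3,tumor\n"
Y = load_encode_labels_sketch(io.StringIO(csv_text), column_name="class")
print(Y)
```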

3.3 ExtractCommonSamples.py

Extracts common samples (rows) from two DataFrames based on their indices.

3.3.1 Parameters

  • X (pd.DataFrame):
    • First DataFrame.
  • Y (pd.DataFrame):
    • Second DataFrame.

3.3.2 Returns

  • X_common, Y_common (pd.DataFrame):
    • Two DataFrames containing only the rows whose indices are common to both.

3.3.3 Usage

extract_common_samples(
    X, 
    Y
    )
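
Index-based alignment of this kind can be written with pandas Index.intersection. The sketch below is an illustration under that assumption; the name extract_common_samples_sketch is hypothetical and the package's own code may differ.

```python
import pandas as pd


def extract_common_samples_sketch(X, Y):
    """Keep only the rows whose index labels appear in both frames."""
    common = X.index.intersection(Y.index)
    # Selecting with the same label list keeps both frames in the same row order.
    return X.loc[common], Y.loc[common]


X = pd.DataFrame({"G1": [1, 2, 3]}, index=["s1", "s2", "s3"])
Y = pd.DataFrame({"class": [0, 1, 1]}, index=["s2", "s3", "s4"])
X_common, Y_common = extract_common_samples_sketch(X, Y)
print(X_common)
print(Y_common)
```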

3.4 LoadAndPreprocessData.py

Load and preprocess the data.

3.4.1 Parameters

  • feature_file: str:
    • Path to the feature data file.
  • label_file: str:
    • Path to the label data file.
  • label_column: str:
    • Column name of the labels in the label file.
  • threshold: float:
    • Threshold passed to the load_filter_transpose function for filtering.

3.4.2 Returns

  • X (DataFrame):
    • Preprocessed feature data.
  • Y (ndarray):
    • Preprocessed label data.

3.4.3 Usage

load_and_preprocess_data(
    feature_file, 
    label_file, 
    label_column, 
    threshold
    )

3.5 SetupLoggingAndProgressBar.py

Set up logging and initialize a tqdm progress bar.

3.5.1 Parameters

  • n_iter (int):
    • Number of iterations for RandomizedSearchCV.
  • n_cv (int):
    • Number of cross-validation folds.

3.5.2 Returns

  • tqdm object
    • An initialized tqdm progress bar.

3.5.3 Usage

setup_logging_and_progress_bar(
    n_iter, 
    n_cv
    )

3.6 UpdateProgressBar.py

Read the number of log entries in the log file and update the tqdm progress bar.

3.6.1 Parameters

  • pbar (tqdm):
    • The tqdm progress bar object.
  • log_file (str):
    • Path to the log file. Default is ‘progress.log’.

3.6.2 Usage

update_progress_bar(
    pbar, 
    log_file='progress.log'
    )
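
A minimal sketch of the idea, assuming one log entry per non-empty line: count the lines in the log file, then set the bar's counter to that number. The function names below are hypothetical. The sketch only relies on the n attribute and the refresh() method, both of which a tqdm bar provides.

```python
import logging
import os
import tempfile


def count_log_entries(log_file="progress.log"):
    """Count log entries, assuming one entry per non-empty line."""
    with open(log_file, "r", encoding="utf-8") as handle:
        return sum(1 for line in handle if line.strip())


def update_progress_bar_sketch(pbar, log_file="progress.log"):
    """Set the bar position to the number of entries logged so far."""
    pbar.n = count_log_entries(log_file)  # tqdm keeps its counter in `n`
    pbar.refresh()                        # redraw without incrementing


# Demo: write three entries to a temporary log file and count them.
log_path = os.path.join(tempfile.mkdtemp(), "progress.log")
logger = logging.getLogger("progress_sketch")
logger.setLevel(logging.INFO)
handler = logging.FileHandler(log_path)
logger.addHandler(handler)
for fold in range(3):
    logger.info("finished fold %d", fold)
handler.close()
logger.removeHandler(handler)

n_entries = count_log_entries(log_path)
print(n_entries)
```

With a real bar, the call would look like update_progress_bar_sketch(pbar, log_path), where pbar is the object returned by setup_logging_and_progress_bar.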

3.7 LoggingCustomScorer.py

Creates a custom scorer function for use in model evaluation processes. This scorer logs both the accuracy score and the time taken for each call.

3.7.1 Parameters

  • n_iter (int):
    • Number of iterations for the search process. Default is 10.
  • n_cv (int):
    • Number of cross-validation splits. Default is 5.

3.7.2 Returns

  • custom_scorer (function):
    • A custom scorer function that logs the accuracy score and time taken for each call.

3.7.3 Usage

logging_custom_scorer(
    n_iter=10, 
    n_cv=5
    )
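
scikit-learn accepts any callable with the signature (estimator, X, y) as a scorer, so a logging scorer can be built as a closure. The sketch below is an assumption about the general shape, not the package's code: the name logging_custom_scorer_sketch, the log message format, and the way n_iter and n_cv are used are all hypothetical.

```python
import logging
import time

from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score


def logging_custom_scorer_sketch(n_iter=10, n_cv=5):
    """Return a scorer that logs accuracy and elapsed time on every call."""
    expected_calls = n_iter * n_cv  # assumed use of the two arguments

    def custom_scorer(estimator, X, y):
        start = time.time()
        score = accuracy_score(y, estimator.predict(X))
        elapsed = time.time() - start
        logging.info(
            "accuracy=%.4f time=%.3fs (expected calls: %d)",
            score, elapsed, expected_calls,
        )
        return score

    return custom_scorer


# Demo with a trivial model that always predicts the majority class.
X = [[0], [1], [2], [3]]
y = [0, 0, 0, 1]
model = DummyClassifier(strategy="most_frequent").fit(X, y)
scorer = logging_custom_scorer_sketch(n_iter=2, n_cv=2)
score = scorer(model, X, y)
print(score)
```

A scorer built this way can be passed as the scoring argument of RandomizedSearchCV.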

3.8 TqdmCustomScorer.py

Creates a custom scorer for model evaluation, integrating a progress bar with tqdm.

3.8.1 Parameters

  • n_iter: int (optional):
    • Number of iterations for the search process. Default is 10.
  • n_cv: int (optional):
    • Number of cross-validation splits. Default is 5.

3.8.2 Returns

  • function:
    • A custom scorer function that can be used with model evaluation methods like RandomizedSearchCV.

3.8.3 Description

The tqdm_custom_scorer function creates a scorer for model evaluation, incorporating a tqdm progress bar to monitor the evaluation process. This scorer is especially useful in processes like RandomizedSearchCV, where it provides real-time feedback on the number of iterations and cross-validation steps completed.

3.8.4 Usage

custom_scorer = tqdm_custom_scorer(n_iter=10, n_cv=5)
# Use this scorer in RandomizedSearchCV or similar methods

3.9 TrainModel.py

Set up and run the model training process.

3.9.1 Parameters

  • X: DataFrame:
    • Feature data.
  • Y: ndarray:
    • Label data.
  • feature_selection: FeatureUnion:
    • The feature selection process.
  • parameters: dict:
    • Parameters for RandomizedSearchCV.
  • n_iter: int:
    • Number of iterations for RandomizedSearchCV.
  • n_cv: int:
    • Number of cross-validation folds.
  • n_jobs: int:
    • Number of jobs to run in parallel (default is 9).

3.9.2 Returns

  • clf
    • RandomizedSearchCV object after fitting.

3.9.3 Usage

train_model(
    X, 
    Y, 
    feature_selection, 
    parameters, 
    n_iter, 
    n_cv, 
    n_jobs=9
    )
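
The general pattern described by these parameters is a Pipeline that starts with the feature selection step and is tuned by RandomizedSearchCV. The sketch below illustrates that pattern only. The final estimator is not named in this section, so LogisticRegression is a hypothetical stand-in, as are the step names, the function name train_model_sketch, and the example parameter grid.

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import RandomizedSearchCV, StratifiedKFold
from sklearn.pipeline import FeatureUnion, Pipeline


def train_model_sketch(X, Y, feature_selection, parameters, n_iter, n_cv, n_jobs=9):
    """Hypothetical training routine: feature selection, then a classifier."""
    pipeline = Pipeline([
        ("feature_selection", feature_selection),
        # Assumption: stand-in classifier, not the one used by the package.
        ("classifier", LogisticRegression(max_iter=1000)),
    ])
    clf = RandomizedSearchCV(
        pipeline,
        param_distributions=parameters,
        n_iter=n_iter,
        cv=StratifiedKFold(n_cv),
        scoring="accuracy",
        n_jobs=n_jobs,
        random_state=0,
    )
    clf.fit(X, Y)
    return clf


# Small synthetic demo; a single SelectKBest stands in for the full union.
X, Y = make_classification(n_samples=60, n_features=10, n_informative=4, random_state=0)
feature_selection = FeatureUnion([("kbest", SelectKBest(f_classif, k=4))])
parameters = {"classifier__C": [0.1, 1.0, 10.0]}
clf = train_model_sketch(X, Y, feature_selection, parameters, n_iter=2, n_cv=3, n_jobs=1)
print(clf.best_params_)
```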

3.10 EnsembleForRFE.py

Set up and run the Ensemble model for Recursive Feature Elimination.

3.10.1 Parameters

  • svm_C: float:
    • Regularization parameter for SVM.
  • tree_max_depth: int:
    • Maximum depth of the decision tree.
  • tree_min_samples_split: int:
    • Minimum number of samples required to split an internal node.
  • gbm_learning_rate: float:
    • Learning rate for gradient boosting.
  • gbm_n_estimators: int:
    • Number of boosting stages for gradient boosting.

3.10.2 Attributes

  • feature_importances_:
    • Array of feature importances after fitting the model.

3.10.3 Methods

  • fit(X, y):
    • Fit the model to data matrix X and target(s) y.
  • predict(X):
    • Predict class labels for samples in X.
  • set_params(**params):
    • Set parameters for the ensemble estimator.
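
The parameters above suggest an ensemble of an SVM, a decision tree, and a gradient boosting model. The sketch below shows one way such an estimator could expose feature_importances_ so that RFE or RFECV can rank features. It is not the package's implementation: the class name, the default values, the soft-voting combination, and the way importances are averaged are all assumptions.

```python
import numpy as np
from sklearn.base import BaseEstimator, ClassifierMixin
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier, VotingClassifier
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier


class EnsembleForRFESketch(ClassifierMixin, BaseEstimator):
    """Hypothetical ensemble exposing feature_importances_ for use with RFE."""

    def __init__(self, svm_C=1.0, tree_max_depth=None, tree_min_samples_split=2,
                 gbm_learning_rate=0.1, gbm_n_estimators=100):
        # Defaults are assumptions; storing plain attributes lets the inherited
        # get_params/set_params work.
        self.svm_C = svm_C
        self.tree_max_depth = tree_max_depth
        self.tree_min_samples_split = tree_min_samples_split
        self.gbm_learning_rate = gbm_learning_rate
        self.gbm_n_estimators = gbm_n_estimators

    def fit(self, X, y):
        self.ensemble_ = VotingClassifier(
            estimators=[
                ("svm", SVC(kernel="linear", C=self.svm_C, probability=True,
                            random_state=0)),
                ("tree", DecisionTreeClassifier(
                    max_depth=self.tree_max_depth,
                    min_samples_split=self.tree_min_samples_split,
                    random_state=0)),
                ("gbm", GradientBoostingClassifier(
                    learning_rate=self.gbm_learning_rate,
                    n_estimators=self.gbm_n_estimators,
                    random_state=0)),
            ],
            voting="soft",
        )
        self.ensemble_.fit(X, y)
        self.classes_ = self.ensemble_.classes_
        # Assumed aggregation: average the normalized importances of each model,
        # using absolute linear-SVM coefficients where no importances exist.
        per_model = []
        for model in self.ensemble_.named_estimators_.values():
            if hasattr(model, "feature_importances_"):
                importances = np.asarray(model.feature_importances_, dtype=float)
            else:
                importances = np.abs(model.coef_).mean(axis=0)
            total = importances.sum()
            per_model.append(importances / total if total > 0 else importances)
        self.feature_importances_ = np.mean(per_model, axis=0)
        return self

    def predict(self, X):
        return self.ensemble_.predict(X)


X, y = make_classification(n_samples=80, n_features=6, n_informative=3, random_state=0)
model = EnsembleForRFESketch(gbm_n_estimators=20).fit(X, y)
predictions = model.predict(X)
print(model.feature_importances_)
```

Building the sub-estimators inside fit keeps the constructor free of logic, which is the scikit-learn convention that allows set_params and cloning to behave predictably during cross-validation.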

3.11 SetupFeatureSelection.py

Set up the feature selection process in TransProPy.UtilsFunction3. This function is useful for building a feature selection pipeline for models that benefit from ensemble methods and mutual-information-based feature selection.

3.11.1 Returns

  • feature_selection: FeatureUnion:
    • A combined feature selection process.

3.11.2 Description

The setup_feature_selection function initializes and returns a FeatureUnion object for feature selection. This union includes:

  • RFECV: Utilizes an EnsembleForRFE estimator with StratifiedKFold(5) for cross-validation, focusing on accuracy.
  • SelectKBest: Applies mutual_info_classif for feature scoring.

The combination of these techniques provides a robust approach to feature selection in machine learning models.

3.11.3 Usage

feature_selection = setup_feature_selection()
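
The union described above can be sketched with scikit-learn's FeatureUnion, RFECV, and SelectKBest. In this sketch a DecisionTreeClassifier stands in for EnsembleForRFE so the example stays self-contained; the function name, the step names "rfecv" and "kbest", and the k argument are hypothetical.

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import RFECV, SelectKBest, mutual_info_classif
from sklearn.model_selection import StratifiedKFold
from sklearn.pipeline import FeatureUnion
from sklearn.tree import DecisionTreeClassifier


def setup_feature_selection_sketch(estimator=None, k=10):
    """Hypothetical union of RFECV and mutual-information SelectKBest."""
    if estimator is None:
        # Assumption: stand-in for the EnsembleForRFE estimator.
        estimator = DecisionTreeClassifier(random_state=0)
    rfecv = RFECV(estimator=estimator, cv=StratifiedKFold(5), scoring="accuracy")
    kbest = SelectKBest(score_func=mutual_info_classif, k=k)
    return FeatureUnion([("rfecv", rfecv), ("kbest", kbest)])


X, y = make_classification(n_samples=60, n_features=8, n_informative=3, random_state=0)
feature_selection = setup_feature_selection_sketch(k=3)
X_selected = feature_selection.fit_transform(X, y)
print(X_selected.shape)
```

A FeatureUnion concatenates the outputs of its transformers column-wise, so a feature chosen by both selectors appears twice in the transformed matrix.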

3.12 PrintBoxedText.py

Prints a title in a boxed format in the console output.

3.12.1 Parameters

  • title: str:
    • The text to be displayed inside the box.

3.12.2 Returns

  • None. This function directly prints the formatted title to the console.

3.12.3 Description

This function creates a box around the given title text using hash (#) and equals (=) symbols. It prints the title with a border on the top and bottom, making it stand out in the console output. The border line consists of a hash symbol, followed by equals symbols the length of the title plus two (for padding), and then another hash symbol.

3.12.4 Usage

print_boxed_text("Example Title")
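
The border rule described above translates directly into code. The sketch below follows that rule for the top and bottom lines; the layout of the middle line (a hash, a space, the title, a space, a hash) and the helper name boxed_lines are assumptions.

```python
def boxed_lines(title):
    """Build the three lines of the box as strings."""
    # Border: '#', then '=' repeated len(title) + 2 times, then '#'.
    border = "#" + "=" * (len(title) + 2) + "#"
    # Assumed middle line: title padded with one space on each side.
    return [border, f"# {title} #", border]


def print_boxed_text_sketch(title):
    """Print the boxed title to the console."""
    for line in boxed_lines(title):
        print(line)


print_boxed_text_sketch("Example Title")
```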

3.13 ExtractAndSaveResults.py

Extracts and saves results from a trained model. The function uses matplotlib for plotting, pandas for data handling, and the custom print_boxed_text function for formatted output.

3.13.1 Parameters

  • clf: trained model (RandomizedSearchCV object):
    • The classifier object after training.
  • X: DataFrame:
    • Feature data used for training.
  • save_path: str:
    • Base path for saving results.
  • show_plot: bool (optional):
    • Whether to display the plot. Default is False.
  • use_tkagg: bool (optional):
    • Whether to use the ‘TkAgg’ backend for matplotlib. Generally, set this to False when running in the PyCharm IDE and to True when rendering file.qmd to an HTML file.

3.13.2 Description

This function performs a comprehensive analysis and extraction of results from a trained model. It includes:

  • Extracting and plotting cross-validation results.
  • Identifying and printing features selected by RFECV and SelectKBest.
  • Combining and saving selected features in a CSV file.
  • Extracting and saving feature importances from EnsembleForRFE.
  • Extracting and saving scores from SelectKBest.

3.13.3 Usage

extract_and_save_results(clf, X, "path/to/save/", show_plot=True)
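
The "identify, combine, and save selected features" part of the list above can be sketched with the get_support() mask that scikit-learn selectors provide. This is an illustration only: two SelectKBest selectors stand in for the RFECV and SelectKBest pair, and the function name, step names, and output file name are hypothetical.

```python
import os
import tempfile

import pandas as pd
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, f_classif, mutual_info_classif
from sklearn.pipeline import FeatureUnion


def selected_features_sketch(feature_union, feature_names):
    """Map each selector name in a fitted FeatureUnion to its kept features."""
    selected = {}
    for name, selector in feature_union.transformer_list:
        mask = selector.get_support()  # boolean mask over the input features
        selected[name] = [f for f, keep in zip(feature_names, mask) if keep]
    return selected


X_arr, y = make_classification(n_samples=60, n_features=6, n_informative=3, random_state=0)
X = pd.DataFrame(X_arr, columns=[f"gene_{i}" for i in range(6)])

# Assumption: stand-in selectors instead of the RFECV + SelectKBest union.
union = FeatureUnion([
    ("kbest_f", SelectKBest(f_classif, k=2)),
    ("kbest_mi", SelectKBest(mutual_info_classif, k=3)),
]).fit(X, y)

selected = selected_features_sketch(union, X.columns)
combined = sorted(set(selected["kbest_f"]) | set(selected["kbest_mi"]))

# Save the combined feature list to a CSV file in a temporary directory.
out_path = os.path.join(tempfile.mkdtemp(), "combined_features.csv")
pd.DataFrame({"feature": combined}).to_csv(out_path, index=False)
print(selected)
```

For a fitted RandomizedSearchCV object, the fitted union would typically be reached through clf.best_estimator_ and the pipeline's step name, which depends on how the pipeline was defined.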