3 UtilsFunction3
This section documents the helper functions that support AutoFeatureSelection.py.
3.1 LoadFilterTranspose.py
Remove samples with high zero expression.
3.1.1 Parameters
- data_path: string:
- For example: '../data/gene_tpm.csv'
- Please note: The input data matrix should have genes as rows and samples as columns.
- threshold: float:
- For example: 0.9
- The threshold is the proportion of samples with a non-zero value, out of all samples, computed for each feature.
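The thresholding rule above can be sketched as follows. This is a minimal illustration rather than the package's implementation: the function name `filter_and_transpose` is hypothetical, it works on an in-memory DataFrame instead of reading `data_path`, and it assumes a feature is kept when its non-zero proportion is at least `threshold` and that the result is transposed so samples become rows.

```python
import pandas as pd


def filter_and_transpose(data: pd.DataFrame, threshold: float = 0.9) -> pd.DataFrame:
    """Keep genes (rows) whose share of non-zero samples is >= threshold, then transpose.

    Hypothetical helper; the ">=" comparison and the transpose are assumptions.
    """
    # Proportion of samples with a non-zero value, computed per gene (row).
    nonzero_fraction = (data != 0).mean(axis=1)
    kept = data.loc[nonzero_fraction >= threshold]
    # Transpose so that samples become rows and genes become columns.
    return kept.T


# Toy matrix: genes as rows, samples as columns.
toy = pd.DataFrame(
    {
        "s1": [5.0, 0.0, 1.0],
        "s2": [3.0, 0.0, 0.0],
        "s3": [2.0, 4.0, 7.0],
        "s4": [1.0, 0.0, 2.0],
    },
    index=["geneA", "geneB", "geneC"],
)
X = filter_and_transpose(toy, threshold=0.75)
```

In this toy matrix geneB is non-zero in only one of four samples, so it is dropped, while geneA and geneC remain as columns of `X`.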
3.1.2 Returns
- X (pandas.core.frame.DataFrame):
3.1.3 Usage
load_filter_transpose(
threshold=0.9,
data_path='../data/gene_tpm.csv'
)
3.2 LoadEncodeLabels.py
Reads a CSV file containing labels and encodes categorical labels in the specified column to numeric labels.
3.2.1 Parameters
- file_path (str):
- Path to the CSV file containing labels.
- column_name (str):
- Name of the column to be encoded.
3.2.2 Returns
- Y (pd.DataFrame):
- A DataFrame containing the encoded numeric labels.
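The encoding step can be illustrated with a short sketch. It assumes the labels are already loaded into a DataFrame and uses `pandas.factorize` with sorted categories; the package may use a different encoder (for example scikit-learn's `LabelEncoder`), and the name `encode_labels` is hypothetical.

```python
import pandas as pd


def encode_labels(labels: pd.DataFrame, column_name: str) -> pd.DataFrame:
    """Map the categories in `column_name` to integer codes (hypothetical helper)."""
    # sort=True assigns codes in sorted category order, so the mapping is reproducible.
    codes, categories = pd.factorize(labels[column_name], sort=True)
    return pd.DataFrame({column_name: codes}, index=labels.index)


# Toy label table indexed by sample name.
labels = pd.DataFrame(
    {"class": ["tumor", "normal", "tumor"]},
    index=["s1", "s2", "s3"],
)
Y = encode_labels(labels, "class")
```

Here "normal" is encoded as 0 and "tumor" as 1 because the categories are sorted alphabetically before codes are assigned.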
3.2.3 Usage
load_encode_labels(
file_path='../data/class.csv',
column_name='class'
)
3.3 ExtractCommonSamples.py
Extracts common samples (rows) from two DataFrames based on their indices.
3.3.1 Parameters
- X (pd.DataFrame):
- First DataFrame.
- Y (pd.DataFrame):
- Second DataFrame.
3.3.2 Returns
- X_common, Y_common (pd.DataFrame):
- Two DataFrames containing only the rows that are common in both.
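Index-based alignment of this kind can be sketched with `Index.intersection`. This is an illustration under the assumption that both DataFrames are indexed by sample name; the helper name `common_rows` is hypothetical.

```python
import pandas as pd


def common_rows(X: pd.DataFrame, Y: pd.DataFrame):
    """Return X and Y restricted to the index labels they share (hypothetical helper)."""
    shared = X.index.intersection(Y.index)
    # Selecting with the same label list keeps both outputs in the same row order.
    return X.loc[shared], Y.loc[shared]


features = pd.DataFrame({"geneA": [1.0, 2.0, 3.0]}, index=["s1", "s2", "s3"])
labels = pd.DataFrame({"class": [0, 1, 1]}, index=["s2", "s3", "s4"])
X_common, Y_common = common_rows(features, labels)
```

Only s2 and s3 appear in both toy tables, so both outputs contain exactly those two rows in matching order.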
3.3.3 Usage
extract_common_samples(
X,
Y
)
3.4 LoadAndPreprocessData.py
Load and preprocess the data.
3.4.1 Parameters
- feature_file: str:
- Path to the feature data file.
- label_file: str:
- Path to the label data file.
- label_column: str:
- Column name of the labels in the label file.
- threshold: float:
- Threshold for filtering in load_filter_transpose function.
3.4.2 Returns
- X (DataFrame):
- Preprocessed feature data.
- Y (ndarray):
- Preprocessed label data.
3.4.3 Usage
load_and_preprocess_data(
feature_file,
label_file,
label_column,
threshold
)
3.5 SetupLoggingAndProgressBar.py
Set up logging and initialize a tqdm progress bar.
3.5.1 Parameters
- n_iter (int):
- Number of iterations for RandomizedSearchCV.
- n_cv (int):
- Number of cross-validation folds.
3.5.2 Returns
- tqdm object
- An initialized tqdm progress bar.
3.5.3 Usage
setup_logging_and_progress_bar(
n_iter,
n_cv
)
3.6 UpdateProgressBar.py
Read the number of log entries in the log file and update the tqdm progress bar.
3.6.1 Parameters
- pbar (tqdm):
- The tqdm progress bar object.
- log_file (str):
- Path to the log file. Default is 'progress.log'.
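One way to keep a progress bar in step with a log file is sketched below. It assumes each non-empty line of the log is one entry, and it relies only on the `n` counter and `update()` method that tqdm bars expose. The names `count_log_entries` and `sync_progress_bar` are hypothetical and not part of the package.

```python
from pathlib import Path


def count_log_entries(log_file="progress.log"):
    """Count non-empty lines in the log file; a missing file counts as zero entries."""
    path = Path(log_file)
    if not path.exists():
        return 0
    with path.open() as handle:
        return sum(1 for line in handle if line.strip())


def sync_progress_bar(pbar, log_file="progress.log"):
    """Advance `pbar` so that its counter matches the number of log entries.

    `pbar` can be a tqdm bar or any object with an `n` attribute and an
    `update(k)` method that adds k to it.
    """
    completed = count_log_entries(log_file)
    if completed > pbar.n:
        # update() takes an increment, so pass only the difference.
        pbar.update(completed - pbar.n)
```

Passing the difference rather than the total matters because `update()` is incremental; calling the function repeatedly on an unchanged log leaves the bar where it is.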
3.6.2 Usage
update_progress_bar(
pbar,
log_file='progress.log'
)
3.7 LoggingCustomScorer.py
Creates a custom scorer function for use in model evaluation processes. This scorer logs both the accuracy score and the time taken for each call.
3.7.1 Parameters
- n_iter (int):
- Number of iterations for the search process. Default is 10.
- n_cv (int):
- Number of cross-validation splits. Default is 5.
3.7.2 Returns
- custom_scorer (function):
- A custom scorer function that logs the accuracy score and time taken for each call.
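A scorer of this kind can be sketched as a closure that follows scikit-learn's callable-scorer convention, `scorer(estimator, X, y)`. The sketch below is an assumption about the general shape, not the package's code: the logger name `scorer_sketch`, the function name `logging_scorer_sketch`, and the choice to time only prediction and scoring are all illustrative.

```python
import logging
import time

from sklearn.metrics import accuracy_score

# Hypothetical logger name. Entries are only persisted once logging is configured,
# e.g. logging.basicConfig(filename="progress.log", level=logging.INFO).
logger = logging.getLogger("scorer_sketch")


def logging_scorer_sketch(n_iter=10, n_cv=5):
    """Build a scorer that logs accuracy and elapsed time on every call."""
    # A randomized search scores each of n_iter candidates on each of n_cv folds.
    expected_calls = n_iter * n_cv

    def custom_scorer(estimator, X, y):
        start = time.perf_counter()
        score = accuracy_score(y, estimator.predict(X))
        elapsed = time.perf_counter() - start
        logger.info(
            "accuracy=%.4f time=%.3fs (expecting %d scoring calls in total)",
            score, elapsed, expected_calls,
        )
        return score

    return custom_scorer
```

The returned function can be passed as the `scoring` argument of a search object, since scikit-learn accepts any callable with this signature.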
3.7.3 Usage
logging_custom_scorer(
n_iter=10,
n_cv=5
)
3.8 TqdmCustomScorer.py
Creates a custom scorer for model evaluation, integrating a progress bar with tqdm.
3.8.1 Parameters
- n_iter: int (optional):
- Number of iterations for the search process. Default is 10.
- n_cv: int (optional):
- Number of cross-validation splits. Default is 5.
3.8.2 Returns
- function:
- A custom scorer function that can be used with model evaluation methods like RandomizedSearchCV.
3.8.3 Description
The tqdm_custom_scorer function creates a scorer for model evaluation, incorporating a tqdm progress bar to monitor the evaluation process. This scorer is especially useful in processes like RandomizedSearchCV, where it provides real-time feedback on the number of iterations and cross-validation steps completed.
3.8.4 Usage
custom_scorer = tqdm_custom_scorer(n_iter=10, n_cv=5)
# Use this scorer in RandomizedSearchCV or similar methods
3.9 TrainModel.py
Set up and run the model training process.
3.9.1 Parameters
- X: DataFrame:
- feature data.
- Y: ndarray:
- label data.
- feature_selection: FeatureUnion:
- the feature selection process.
- parameters: dict:
- parameters for RandomizedSearchCV.
- n_iter: int:
- number of iterations for RandomizedSearchCV.
- n_cv: int:
- number of cross-validation folds.
- n_jobs: int:
- number of jobs to run in parallel (default is 9).
3.9.2 Returns
- clf
- RandomizedSearchCV object after fitting.
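The training step can be pictured as a pipeline wrapped in a randomized search. The sketch below is an assumption about the overall shape only: the source does not name the final classifier, so `LogisticRegression` is a stand-in, the step names `feature_selection` and `classifier` are hypothetical, and a lightweight `SelectKBest` union replaces the real feature selection to keep the example fast.

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import RandomizedSearchCV
from sklearn.pipeline import FeatureUnion, Pipeline


def train_model_sketch(X, Y, feature_selection, parameters, n_iter, n_cv, n_jobs=9):
    """Fit a RandomizedSearchCV over a feature-selection + classifier pipeline."""
    pipeline = Pipeline([
        ("feature_selection", feature_selection),
        # Stand-in classifier; the source does not say which model is used.
        ("classifier", LogisticRegression(max_iter=1000)),
    ])
    clf = RandomizedSearchCV(
        pipeline,
        parameters,
        n_iter=n_iter,
        cv=n_cv,
        scoring="accuracy",
        n_jobs=n_jobs,
        random_state=0,
    )
    clf.fit(X, Y)
    return clf


# Synthetic data and a small search space, for illustration only.
X, Y = make_classification(n_samples=60, n_features=10, n_informative=4, random_state=0)
selection = FeatureUnion([("kbest", SelectKBest(f_classif))])
parameters = {
    "feature_selection__kbest__k": [2, 4, 6],
    "classifier__C": [0.1, 1.0, 10.0],
}
clf = train_model_sketch(X, Y, selection, parameters, n_iter=3, n_cv=3, n_jobs=1)
```

Parameter names in the search space use the `step__substep__param` convention, which is how scikit-learn addresses settings nested inside a pipeline and a FeatureUnion.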
3.9.3 Usage
train_model(
X,
Y,
feature_selection,
parameters,
n_iter,
n_cv,
n_jobs=9
)
3.10 EnsembleForRFE.py
Set up and run the Ensemble model for Recursive Feature Elimination.
3.10.1 Parameters
- svm_C: float:
- Regularization parameter for SVM.
- tree_max_depth: int:
- Maximum depth of the decision tree.
- tree_min_samples_split: int:
- Minimum number of samples required to split an internal node.
- gbm_learning_rate: float:
- Learning rate for gradient boosting.
- gbm_n_estimators: int:
- Number of boosting stages for gradient boosting.
3.10.2 Attributes
- feature_importances_:
- Array of feature importances after fitting the model.
3.10.3 Methods
- fit(X, y):
- Fit the model to data matrix X and target(s) y.
- predict(X):
- Predict class labels for samples in X.
- set_params(**params):
- Set parameters for the ensemble estimator.
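To show how an estimator with this interface can work with recursive feature elimination, here is a heavily hedged sketch. The source does not describe how the three base models are combined or how feature_importances_ is computed, so hard voting and a plain average of normalized importances are assumptions, as are the default parameter values. The class name `EnsembleSketch` is hypothetical.

```python
import numpy as np
from sklearn.base import BaseEstimator, ClassifierMixin
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier, VotingClassifier
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier


class EnsembleSketch(ClassifierMixin, BaseEstimator):
    """Illustrative SVM + tree + gradient boosting ensemble (not the package's class)."""

    def __init__(self, svm_C=1.0, tree_max_depth=None, tree_min_samples_split=2,
                 gbm_learning_rate=0.1, gbm_n_estimators=100):
        self.svm_C = svm_C
        self.tree_max_depth = tree_max_depth
        self.tree_min_samples_split = tree_min_samples_split
        self.gbm_learning_rate = gbm_learning_rate
        self.gbm_n_estimators = gbm_n_estimators

    def fit(self, X, y):
        # Build the base models here so set_params() changes take effect on refit.
        self.ensemble_ = VotingClassifier(
            estimators=[
                ("svm", SVC(kernel="linear", C=self.svm_C)),
                ("tree", DecisionTreeClassifier(
                    max_depth=self.tree_max_depth,
                    min_samples_split=self.tree_min_samples_split)),
                ("gbm", GradientBoostingClassifier(
                    learning_rate=self.gbm_learning_rate,
                    n_estimators=self.gbm_n_estimators)),
            ],
            voting="hard",  # assumption: majority vote
        )
        self.ensemble_.fit(X, y)
        self.classes_ = self.ensemble_.classes_
        fitted = self.ensemble_.named_estimators_
        # Assumption: use absolute linear-SVM weights, normalized to sum to 1.
        svm_weights = np.abs(fitted["svm"].coef_).mean(axis=0)
        total = svm_weights.sum()
        if total > 0:
            svm_weights = svm_weights / total
        # Assumption: average the three importance vectors with equal weight.
        self.feature_importances_ = np.mean(
            [svm_weights,
             fitted["tree"].feature_importances_,
             fitted["gbm"].feature_importances_],
            axis=0,
        )
        return self

    def predict(self, X):
        return self.ensemble_.predict(X)


X, y = make_classification(n_samples=80, n_features=6, random_state=0)
model = EnsembleSketch(gbm_n_estimators=10).fit(X, y)
```

Inheriting from `BaseEstimator` supplies `get_params` and `set_params`, and exposing `feature_importances_` after fitting is what lets selectors such as RFECV rank features with the estimator.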
3.11 SetupFeatureSelection.py
Set up the feature selection process in TransProPy.UtilsFunction3. This function is particularly useful for setting up a feature selection pipeline, especially in models that benefit from ensemble methods and mutual information-based feature selection.
3.11.1 Returns
- feature_selection: FeatureUnion:
- A combined feature selection process.
3.11.2 Description
The setup_feature_selection function initializes and returns a FeatureUnion object for feature selection. This union includes:
- RFECV: Utilizes an EnsembleForRFE estimator with StratifiedKFold(5) for cross-validation, focusing on accuracy.
- SelectKBest: Applies mutual_info_classif for feature scoring.
The combination of these techniques provides a robust approach to feature selection in machine learning models.
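The union described above can be sketched with scikit-learn directly. To keep the example self-contained, a `DecisionTreeClassifier` stands in for EnsembleForRFE, and the number of features kept by SelectKBest (`k`) is an assumed parameter since the source does not state it. The function name `feature_selection_sketch` is hypothetical.

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import RFECV, SelectKBest, mutual_info_classif
from sklearn.model_selection import StratifiedKFold
from sklearn.pipeline import FeatureUnion
from sklearn.tree import DecisionTreeClassifier


def feature_selection_sketch(k=5):
    """Combine RFECV and SelectKBest in a FeatureUnion (illustrative only)."""
    rfecv = RFECV(
        # Stand-in estimator; the documented function uses EnsembleForRFE here.
        estimator=DecisionTreeClassifier(random_state=0),
        cv=StratifiedKFold(5),
        scoring="accuracy",
    )
    kbest = SelectKBest(score_func=mutual_info_classif, k=k)
    return FeatureUnion([("rfecv", rfecv), ("kbest", kbest)])


# Synthetic data, for illustration only.
X, y = make_classification(n_samples=100, n_features=12, n_informative=4, random_state=0)
union = feature_selection_sketch(k=5)
X_selected = union.fit_transform(X, y)

# After fitting, each selector reports its chosen columns through get_support().
fitted = dict(union.transformer_list)
rfecv_mask = fitted["rfecv"].get_support()
kbest_mask = fitted["kbest"].get_support()
```

A FeatureUnion concatenates the outputs of its transformers side by side, so a feature chosen by both selectors appears twice in `X_selected`.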
3.11.3 Usage
feature_selection = setup_feature_selection()
3.12 PrintBoxedText.py
Prints a title in a boxed format in the console output.
3.12.1 Parameters
- title: str:
- The text to be displayed inside the box.
3.12.2 Returns
- None. This function directly prints the formatted title to the console.
3.12.3 Description
This function creates a box around the given title text using hash (#) and equals (=) symbols. It prints the title with a border on the top and bottom, making it stand out in the console output. The border line consists of a hash symbol, followed by equals symbols the length of the title plus two (for padding), and then another hash symbol.
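The border rule described above can be written out in a few lines. In this sketch the helper name `boxed_text_lines` is hypothetical, and the layout of the middle line (a hash, a space of padding on each side of the title, and a closing hash) is an assumption, since the description only specifies the border.

```python
def boxed_text_lines(title):
    """Return the three lines of a boxed title (illustrative helper)."""
    # Border: '#', then '=' repeated len(title) + 2 times, then '#'.
    border = "#" + "=" * (len(title) + 2) + "#"
    # Assumed middle line: one space of padding on each side of the title.
    middle = "# " + title + " #"
    return [border, middle, border]


for line in boxed_text_lines("Example Title"):
    print(line)
```

With one space of padding on each side, the middle line has the same width as the border, which is why the border uses the title length plus two.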
3.12.4 Usage Example
print_boxed_text("Example Title")
3.13 ExtractAndSaveResults.py
The function uses matplotlib for plotting, pandas for data handling, and a custom print_boxed_text function for formatted output.
3.13.1 Parameters
- clf: trained model (RandomizedSearchCV object):
- The classifier object after training.
- X: DataFrame:
- Feature data used for training.
- save_path: str:
- Base path for saving results.
- show_plot: bool (optional):
- Whether to display the plot. Default is False.
- use_tkagg: bool (optional):
- Whether to use the 'TkAgg' backend for matplotlib. Generally, choose False when working in the PyCharm IDE, and choose True when rendering file.qmd to an HTML file.
3.13.2 Description
This function performs a comprehensive analysis and extraction of results from a trained model. It includes:
- Extracting and plotting cross-validation results.
- Identifying and printing features selected by RFECV and SelectKBest.
- Combining and saving selected features in a CSV file.
- Extracting and saving feature importances from EnsembleForRFE.
- Extracting and saving scores from SelectKBest.
3.13.3 Usage
extract_and_save_results(clf, X, "path/to/save/", show_plot=True)