1  UtilsFunction1

This section documents the helper functions used by the MACFCmain function.

1.1 Auc.py

Assists the MACFCmain function in calculating AUC, obtaining Feature Frequency, and performing sorting.

1.1.1 Introduction

  • In this function, a feature that appears with high frequency is one that is present in multiple optimal feature sets.
  • Each optimal feature set is determined by calculating its Area Under the Receiver Operating Characteristic (ROC) Curve (AUC), which is a common measure for evaluating classifier performance.
  • During each iteration of the loop, an optimal feature set with the highest average AUC value is selected.
  • Features from this set are then appended to a ranked list, ‘ranklist’, and, when necessary, also added to a set named ‘rankset’.
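
As an illustration of the AUC measure described above, the following minimal sketch computes the AUC of a single feature from its values in the two classes, using the pairwise-comparison (Mann-Whitney) formulation. The function name single_feature_auc and its arguments are hypothetical and are not taken from Auc.py; the actual auc(tlofe, ne, n0, n1) function may be implemented differently.

```python
def single_feature_auc(neg_values, pos_values):
    """Return the AUC of one feature for a two-class problem.

    The AUC equals the probability that a randomly chosen positive sample
    has a larger value than a randomly chosen negative sample; ties count
    as one half. Hypothetical helper, not the package's auc() function.
    """
    if not neg_values or not pos_values:
        raise ValueError("both classes need at least one sample")
    wins = 0.0
    for p in pos_values:
        for n in neg_values:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos_values) * len(neg_values))


# Illustrative values only: one feature measured in three samples per class.
neg = [0.1, 0.4, 0.35]
pos = [0.8, 0.4, 0.9]
feature_auc = single_feature_auc(neg, pos)
print(feature_auc)
```

An AUC of 0.5 means the feature does not separate the classes at all, while an AUC of 1.0 means every positive sample has a larger value than every negative sample.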

1.1.2 Usage

auc(tlofe, ne, n0, n1)

1.2 AutoNorm.py

Normalization function.

The auto_norm function normalizes a two-dimensional array (matrix). Normalization generally brings all features into the same numerical range, which facilitates subsequent analysis or model training.

1.2.1 Parameters

  • data: ndarray
    • Order requirements for the input data:
    • 1. Rows must represent individual samples and columns must represent features, so each row vector is one sample containing multiple features.
    • 2. Each column of the matrix is normalized independently, so different features must be placed in separate columns.

1.2.2 Returns

  • norm_data: ndarray
    • It is the normalized data.

1.2.3 Usage

auto_norm(data)
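
The column-wise behavior described above can be sketched as follows. This sketch assumes min-max scaling to the range [0, 1]; the source does not state which scheme auto_norm uses, so this is an assumption. Plain nested lists stand in for the ndarray so that the example has no dependencies, and the name min_max_normalize_columns is hypothetical.

```python
def min_max_normalize_columns(rows):
    """Scale each column of a samples-by-features table to [0, 1].

    Assumption: min-max scaling. auto_norm itself may use another scheme.
    """
    columns = list(zip(*rows))  # one tuple per feature (column)
    lows = [min(col) for col in columns]
    spans = [max(col) - min(col) for col in columns]
    return [
        [
            (value - low) / span if span else 0.0  # constant column maps to 0.0
            for value, low, span in zip(row, lows, spans)
        ]
        for row in rows
    ]


# Rows are samples, columns are features (illustrative values only).
data = [
    [1.0, 200.0],
    [2.0, 400.0],
    [3.0, 300.0],
]
normalized = min_max_normalize_columns(data)
print(normalized)
```

Because each column is scaled using only its own minimum and maximum, features measured on very different scales end up in the same range.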

1.3 FeatureRanking.py

1.3.1 Introduction

  • High-Frequency Features and Performance: Because features in each set are chosen based on their contribution to classifier performance, high-frequency features are likely to perform well. In other words, if a feature appears in multiple optimal feature sets, it may have a significant impact on the performance of the classifier.
  • Note on Low-Frequency Features: A low frequency does not necessarily mean that a feature is unimportant. The importance of a feature may depend on how it combines with other features. In addition, the outcome of feature selection may be influenced by the characteristics of the dataset and by random factors. The frequency provided by this function should therefore be used only as a reference, not as an absolute indicator of feature performance.
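
The notion of feature frequency discussed above can be illustrated with a short sketch that counts how many optimal feature sets each feature appears in. The function name feature_frequencies and the representation of each optimal set as a list of feature indices are assumptions made for illustration; they are not taken from FeatureRanking.py.

```python
from collections import Counter


def feature_frequencies(optimal_sets):
    """Count in how many optimal feature sets each feature appears.

    Hypothetical helper: each optimal set is assumed to be an iterable of
    feature indices. Returns (feature, count) pairs, most frequent first.
    """
    counts = Counter()
    for feature_set in optimal_sets:
        counts.update(set(feature_set))  # count a feature once per set
    return counts.most_common()


# Illustrative optimal sets from three iterations.
optimal_sets = [[3, 7, 1], [7, 2], [7, 3, 5]]
frequency = feature_frequencies(optimal_sets)
print(frequency)
```

In this example feature 7 appears in all three sets and feature 3 in two of them, while the remaining features appear only once.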

1.3.2 Returns

  • FName
  • Fauc
  • rankset
  • ranklist

1.3.3 Usage

feature_ranking(f, c, max_rank, pos, neg, n0, n1)

1.4 NewFeatureRanking.py

1.4.1 Change Summary

  • Store features whose AUC is greater than AUC_threshold, together with their AUC values.
  • Exclude features whose AUC is greater than AUC_threshold from the original set.
  • Sort and process the remaining features.
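
The three changes listed above can be sketched as follows. The function name split_by_auc_threshold and the dictionary input are hypothetical, and sorting the remaining features by descending AUC is an assumption; the actual processing in NewFeatureRanking.py may differ.

```python
def split_by_auc_threshold(feature_auc, auc_threshold):
    """Separate high-AUC features from the rest.

    Hypothetical helper. feature_auc maps a feature name to its AUC.
    Returns (high_auc_features, ranked_remaining), where ranked_remaining
    is sorted by descending AUC (an assumption about the sort order).
    """
    high_auc_features = {
        name: auc for name, auc in feature_auc.items() if auc > auc_threshold
    }
    remaining = {
        name: auc for name, auc in feature_auc.items() if auc <= auc_threshold
    }
    ranked_remaining = sorted(
        remaining.items(), key=lambda item: item[1], reverse=True
    )
    return high_auc_features, ranked_remaining


# Illustrative AUC values only.
aucs = {"GeneA": 0.95, "GeneB": 0.62, "GeneC": 0.81, "GeneD": 0.91}
high, ranked_rest = split_by_auc_threshold(aucs, 0.9)
print(high)
print(ranked_rest)
```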

1.4.2 Returns

  • high_auc_features
  • FName
  • Fauc
  • rankset
  • ranklist

1.4.3 Usage

feature_ranking(f, c, AUC_threshold, max_rank, pos, neg, n0, n1)

1.5 LoadData.py

Data Reading and Transformation.

1.5.1 Introduction

  • Data normalization for constant value.
  • Extract matrix data and categorical data.

1.5.2 Parameters

  • lable_name: string
    • For example: gender, age, altitude, temperature, quality, and other categorical variable names.
  • data_path: string
    • For example: ‘../data/gene_tpm.csv’
    • Please note: Preprocess the input data in advance to remove samples that contain too many missing values or zeros.
    • The input data matrix should have genes as rows and samples as columns.
  • label_path: string
    • For example: ‘../data/tumor_class.csv’
    • Please note: The input CSV data should have rows representing sample names and columns representing class names.
    • The input sample categories must be in a numerical binary format, such as: 1,2,1,1,2,2,1.
    • In this case, the numerical values represent the following classifications: 1: male; 2: female.
  • threshold: float
    • For example: 0.9
    • The threshold is the proportion of samples with non-zero values, relative to all samples, for each feature.

1.5.3 Returns

  • transpose(f): ndarray
    • A transposed feature-sample matrix.
  • c: ndarray
    • A NumPy array containing classification labels.

1.5.4 Usage

load_data(
    lable_name, 
    threshold, 
    data_path='../data/gene_tpm.csv', 
    label_path='../data/tumor_class.csv'
    )
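
The input layout described in the parameters above (genes as rows and samples as columns in the data file, samples as rows and class names as columns in the label file) can be illustrated with a minimal standard-library sketch. The function name read_expression_and_labels, the in-memory CSV text, and the assumption that the first column of each file holds the gene or sample name are all hypothetical; this is not the implementation of load_data, and it omits the threshold filtering.

```python
import csv
import io


def read_expression_and_labels(data_text, label_text, label_name):
    """Return (samples-by-genes matrix, labels) from two CSV texts.

    Assumed layout: the data CSV has a header row of sample names and one
    row per gene; the label CSV has one row per sample and one column per
    class name, with the sample name in its first column.
    """
    data_rows = list(csv.reader(io.StringIO(data_text)))
    sample_names = data_rows[0][1:]
    gene_by_sample = [[float(v) for v in row[1:]] for row in data_rows[1:]]
    # Transpose so that rows are samples and columns are features.
    sample_by_gene = [list(col) for col in zip(*gene_by_sample)]

    reader = csv.DictReader(io.StringIO(label_text))
    sample_column = reader.fieldnames[0]
    label_of = {row[sample_column]: int(row[label_name]) for row in reader}
    # Align the labels with the sample order of the data matrix.
    labels = [label_of[name] for name in sample_names]
    return sample_by_gene, labels


# Illustrative file contents only.
data_text = "gene,S1,S2,S3\nGeneA,1.0,2.0,3.0\nGeneB,0.0,5.0,6.0\n"
label_text = "sample,gender\nS1,1\nS2,2\nS3,1\n"
matrix, labels = read_expression_and_labels(data_text, label_text, "gender")
print(matrix)
print(labels)
```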

1.6 PrintResults.py

1.6.1 Returns

  • fr: list of strings
    • Representing ranked features.
  • fre1: dictionary
    • Feature names as keys and their frequencies as values.
  • frequency: list of tuples
    • Feature names and their frequencies.
  • len(FName): integer
    • Count of AUC values greater than 0.5.
  • FName: array of strings
    • Feature names after ranking with AUC > 0.5.
  • Fauc: array of floats
    • AUC values corresponding to the ranked feature names.

1.6.2 Usage

print_results(fr, fre1, frequency, len_FName, FName, Fauc)

1.7 FilterSamples.py

Removes samples with a high proportion of zero expression values.

1.7.1 Parameters

  • data_path: string
    • For example: ‘../data/gene_tpm.csv’
    • Please note: The input data matrix should have genes as rows and samples as columns.
  • threshold: float
    • For example: 0.9
    • The threshold is the proportion of samples with non-zero values, relative to all samples, for each feature.

1.7.2 Returns

  • X: pandas.core.frame.DataFrame

1.7.3 Usage

filter_samples(threshold, data_path='../data/gene_tpm.csv')
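
The threshold described above can be illustrated with a small sketch that computes the proportion of non-zero values per feature and keeps the features that reach the threshold. The function name keep_features_by_nonzero_fraction, the dictionary input, and the choice to keep features at or above the threshold are assumptions for illustration; filter_samples itself returns a pandas DataFrame and may apply the threshold differently.

```python
def keep_features_by_nonzero_fraction(expression, threshold):
    """Keep features whose share of non-zero samples reaches the threshold.

    Hypothetical helper. expression maps a gene name to its values across
    all samples. Keeping features at or above the threshold is an
    assumption about how the threshold is applied.
    """
    kept = {}
    for gene, values in expression.items():
        nonzero_fraction = sum(1 for v in values if v != 0) / len(values)
        if nonzero_fraction >= threshold:
            kept[gene] = values
    return kept


# Illustrative expression values for four samples.
expression = {
    "GeneA": [1.2, 0.0, 3.4, 2.2],  # 3 of 4 samples are non-zero (0.75)
    "GeneB": [0.0, 0.0, 0.0, 5.0],  # 1 of 4 samples is non-zero (0.25)
    "GeneC": [4.1, 2.0, 0.7, 9.9],  # all samples are non-zero (1.0)
}
filtered = keep_features_by_nonzero_fraction(expression, 0.7)
print(sorted(filtered))
```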

1.8 GeneNames.py

Extracts the gene names from the input data.

1.8.1 Parameters

  • data_path: string
    • For example: ‘../data/gene_tpm.csv’
    • Please note: Preprocess the input data in advance to remove samples that contain too many missing values or zeros.
    • The input data matrix should have genes as rows and samples as columns.

1.8.2 Returns

  • gene_names: list

1.8.3 Usage

gene_name(data_path='../data/gene_tpm.csv')

1.9 GeneToFeatureMapping.py

Maps gene names to ranked features.

1.9.1 Parameters

  • gene_names: list
    • A list of strings.
    • For example: ['GeneA', 'GeneB', 'GeneC', 'GeneD', 'GeneE']
  • ranked_features: list
    • A list of integers.
    • For example: [2, 0, 1]

1.9.2 Returns

  • gene_to_feature_mapping: dictionary
    • A Python dictionary that maps gene names to their corresponding features (or ranked features).

1.9.3 Usage

gene_map_feature(gene_names, ranked_features)
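
Using the example inputs from the parameters above, the mapping can be sketched as follows. This sketch assumes that each integer in ranked_features is an index into gene_names and that the dictionary maps each gene name to that index; the function name map_genes_to_ranked_features is hypothetical, and the actual gene_map_feature function may build the dictionary differently.

```python
def map_genes_to_ranked_features(gene_names, ranked_features):
    """Map the name of each ranked feature to its feature index.

    Hypothetical helper. Assumes every entry of ranked_features is a valid
    index into gene_names. Entries follow the ranking order.
    """
    return {gene_names[index]: index for index in ranked_features}


gene_names = ["GeneA", "GeneB", "GeneC", "GeneD", "GeneE"]
ranked_features = [2, 0, 1]
mapping = map_genes_to_ranked_features(gene_names, ranked_features)
print(mapping)
```

Genes that do not appear in ranked_features (here GeneD and GeneE) are absent from the resulting dictionary.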

1.10 References