4 MACFCmain.py

Applies MACFC selection to identify feature genes relevant to classification.

4.1 Parameters

max_rank: int

The total number of gene combinations you want to obtain.

lable_name: string

The name of the categorical variable to classify by, for example: gender, age, altitude, temperature, or quality.

data_path: string

For example: '../data/gene_tpm.csv'

Please note: Preprocess the input data in advance to remove samples that contain too many missing values or zeros.

The input data matrix should have genes as rows and samples as columns.

label_path: string

For example: '../data/tumor_class.csv'

Please note: The input sample categories must be in a numerical binary format, such as: 1,2,1,1,2,2,1.

In this case, the numerical values represent the following classifications: 1: male; 2: female.
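
If the raw class labels are text, they need to be converted to this numeric form before being written to the label file. The following is a minimal sketch in plain Python; the label values and the `label_codes` mapping are hypothetical and only mirror the male/female example above.

```python
# Hypothetical raw labels; the 1/2 coding follows the convention described above.
raw_labels = ["male", "female", "male", "male", "female", "female", "male"]
label_codes = {"male": 1, "female": 2}

# Translate each text label into its numeric class.
encoded = [label_codes[label] for label in raw_labels]
print(encoded)  # [1, 2, 1, 1, 2, 2, 1]

# Sanity check: exactly two distinct numeric classes should remain.
assert set(encoded) == {1, 2}
```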

threshold: float

For example: 0.9

The threshold refers to the proportion of samples with non-zero values among all samples, computed separately for each feature.
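
To make the threshold concrete, the sketch below computes the non-zero proportion for a few features and keeps those that reach the threshold. It is an illustration in plain Python, not the package's own filtering code: the toy matrix, the helper `nonzero_fraction`, and the choice of `>=` for the comparison are all assumptions.

```python
def nonzero_fraction(values):
    """Return the share of samples with a non-zero value for one feature."""
    return sum(1 for v in values if v != 0) / len(values)

# Hypothetical toy matrix: genes as rows, samples as columns.
expression = {
    "geneA": [5.1, 3.2, 4.8, 6.0],  # 4 of 4 samples non-zero
    "geneB": [0.0, 2.1, 0.0, 1.3],  # 2 of 4 samples non-zero
    "geneC": [1.0, 0.0, 2.2, 3.1],  # 3 of 4 samples non-zero
}
threshold = 0.75

# Assumption: a feature is kept when its non-zero fraction reaches the threshold.
kept = [gene for gene, values in expression.items()
        if nonzero_fraction(values) >= threshold]
print(kept)  # ['geneA', 'geneC']
```

With a threshold of 0.9 or 0.95, as used in the examples in this section, only features expressed in nearly all samples would pass such a filter.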

4.2 Returns

fr: list of strings

Representing ranked features.

fre1: dictionary

Feature names as keys and their frequencies as values.

frequency: list of tuples

Feature names and their frequencies.

frequency is a list sorted by occurrence frequency in descending order. It contains only those entries of the dictionary fre1 (the counted frequencies of the features) whose frequency is greater than 1, together with their frequencies.
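
The relationship between fre1 and frequency can be sketched as follows. The dictionary contents are hypothetical, and the snippet only reproduces the filtering and sorting described above rather than the package's internal code.

```python
# Hypothetical frequency dictionary in the shape described for fre1.
fre1 = {"355": 1, "68": 3, "867": 2, "78": 1}

# Keep features seen more than once, sorted by frequency (descending).
frequency = sorted(
    ((name, count) for name, count in fre1.items() if count > 1),
    key=lambda item: item[1],
    reverse=True,
)
print(frequency)  # [('68', 3), ('867', 2)]
```

When every feature has a frequency of 1, the resulting list is empty.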

len(FName): integer

Number of features with an AUC value greater than 0.5.

FName: array of strings

Feature names after ranking with AUC > 0.5.

Fauc: array of floats

AUC values corresponding to the ranked feature names.

4.3 Function Principle Explanation

Feature Frequency and AUC: In this function, features that appear with high frequency indicate their presence in multiple optimal feature sets. Each optimal feature set is determined by calculating its Area Under the Receiver Operating Characteristic (ROC) Curve (AUC), which is a common measure for evaluating classifier performance. During each iteration of the loop, an optimal feature set with the highest average AUC value is selected. Features from this set are then added to a rank list, known as ‘ranklist’, and when necessary, also to a set named ‘rankset’.

High-Frequency Features and Performance: Because features in each set are chosen based on their contribution to classifier performance, high-frequency features are likely to perform well. In other words, if a feature appears in multiple optimal feature sets, it may have a significant impact on the performance of the classifier.

Note on Low-Frequency Features: However, it’s important to note that a low frequency of a feature does not necessarily mean it is unimportant. The importance of a feature may depend on how it combines with other features. Additionally, the outcome of feature selection may be influenced by the characteristics of the dataset and random factors. Therefore, the frequency provided by this function should only be used as a reference and is not an absolute indicator of feature performance.

Further Evaluation Methods: If you wish to explore feature performance more deeply, you may need to employ other methods for assessing feature importance. This could include model-based importance metrics or statistical tests to evaluate the relationship between features and the target variable.

4.4 Usage Workflow

FName is a list of feature names sorted based on their AUC (Area Under the Curve) values. In this sorting method, the primary consideration is the AUC value, followed by the feature name. All features included in FName have an AUC value greater than 0.5.
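
The ordering described for FName can be illustrated with a small, self-contained sketch. The toy data, the helper `auc`, and the treatment of class 2 as the positive class are assumptions made for illustration; the package's own AUC computation may differ.

```python
def auc(scores, labels, positive=2):
    """AUC as the probability that a positive sample outscores a negative one.

    Ties between a positive and a negative sample count as 0.5.
    """
    pos = [s for s, y in zip(scores, labels) if y == positive]
    neg = [s for s, y in zip(scores, labels) if y != positive]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# Hypothetical labels (1 or 2) and three toy features.
labels = [1, 1, 1, 2, 2, 2]
features = {
    "g1": [1.0, 2.0, 3.0, 4.0, 5.0, 6.0],  # separates the classes perfectly
    "g2": [1.0, 5.0, 3.0, 4.0, 2.0, 6.0],  # separates them only partly
    "g3": [6.0, 5.0, 4.0, 3.0, 2.0, 1.0],  # inverted, AUC below 0.5
}

aucs = {name: auc(values, labels) for name, values in features.items()}

# Keep AUC > 0.5, then sort by AUC (descending) and by name to break ties.
ranked = sorted(
    ((name, value) for name, value in aucs.items() if value > 0.5),
    key=lambda item: (-item[1], item[0]),
)
FName = [name for name, _ in ranked]
Fauc = [value for _, value in ranked]
print(FName)  # ['g1', 'g2']
```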

fr is the result of another sorting method. In this method, the primary consideration is the “combined” AUC of the features, followed by their individual AUC values. This means that some features, despite having lower individual AUC values, may produce a higher combined AUC when paired with other features. Therefore, their position in the fr list may be higher than in the FName list.

The code for fr employs a more complex logic to select and combine features to optimize their combined AUC values. In this process, features are not solely selected and sorted based on their individual AUC values; the effect of their combination with other features is also considered. Consequently, the sorting logic for fr (or rankset) differs from that of FName.

Please note: While the code takes into account both individual AUC values and combined AUC values, the sorting of the fr list (i.e., rankset) starts from individual AUC values. This is because at the beginning of each external loop iteration, the first element of fs is the next feature sorted by its individual AUC value. The list is then further optimized by evaluating the combination effects with other features.
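
The difference between the two orderings can be seen in a toy greedy selection. This sketch is not the MACFC algorithm: it uses the sum of the selected feature values as a stand-in "combined" score, and the data and the helpers `auc` and `greedy_rank` are hypothetical. It only shows how a feature with a lower individual AUC can move up once combinations are taken into account.

```python
def auc(scores, labels, positive=2):
    """Probability that a positive sample outscores a negative one (ties count 0.5)."""
    pos = [s for s, y in zip(scores, labels) if y == positive]
    neg = [s for s, y in zip(scores, labels) if y != positive]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def greedy_rank(features, labels):
    """Repeatedly add the feature that maximizes the AUC of the summed score."""
    combined = [0.0] * len(labels)
    remaining = dict(features)
    order = []
    while remaining:
        def score(name):
            summed = [c + v for c, v in zip(combined, remaining[name])]
            return auc(summed, labels)
        best = max(remaining, key=score)
        combined = [c + v for c, v in zip(combined, remaining.pop(best))]
        order.append(best)
    return order

labels = [1, 1, 1, 2, 2, 2]
features = {
    "a": [1, 2, 4, 3, 5, 6],   # individual AUC 8/9
    "b": [1, 3, 4, 2, 5, 6],   # individual AUC 7/9, largely redundant with "a"
    "c": [3, 0, -2, 2, 0, 3],  # individual AUC 6/9, complements "a"
}

# Ordering by individual AUC only.
individual = sorted(features, key=lambda name: -auc(features[name], labels))
# Ordering when each step considers the combination with earlier picks.
greedy = greedy_rank(features, labels)

print(individual)  # ['a', 'b', 'c']
print(greedy)      # ['a', 'c', 'b']
```

Both orderings start with the feature that has the best individual AUC, but "c" overtakes "b" in the greedy ordering because adding it to "a" separates the classes completely.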

4.5 Usage of MACFCmain (Significant correlation)

This function uses the MACFC method to select feature genes relevant to classification and ranks them based on their corresponding weights.

Please note: Data characteristics: the features are strongly correlated with the classification.

4.5.1 Import the corresponding module

import TransProPy.MACFCmain as Tr
import TransProPy.UtilsFunction1.GeneNames as TUG
import TransProPy.UtilsFunction1.GeneToFeatureMapping as TUGM

4.5.2 Data

import pandas as pd
data_path = '../test_TransProPy/data/four_methods_degs_intersection.csv'  
data = pd.read_csv(data_path)
print(data.iloc[:10, :10])

  Unnamed: 0  TCGA-D9-A4Z2-01A  TCGA-ER-A2NH-06A  TCGA-BF-A5EO-01A  \
0        A2M         16.808499         16.506184         17.143433   
1      A2ML1          1.584963          9.517669          7.434628   
2      AADAC          4.000000          2.584963          1.584963   
3    AADACL2          1.000000          1.000000          0.000000   
4     ABCA12          4.523562          4.321928          3.906891   
5    ABCA17P          4.584963          5.169925          3.807355   
6      ABCA9          9.753217          6.906891          3.459432   
7      ABCB4          9.177420          6.700440          5.000000   
8      ABCB5         10.134426          4.169925          9.167418   
9     ABCC11         10.092757          6.491853          5.459432   

   TCGA-D9-A6EA-06A  TCGA-D9-A4Z3-01A  TCGA-GN-A26A-06A  TCGA-D3-A3BZ-06A  \
0         17.760739         14.766839         16.263691         16.035207   
1          2.584963          1.584963          2.584963          5.285402   
2          0.000000          0.000000          0.000000          3.321928   
3          0.000000          1.000000          0.000000          0.000000   
4          3.459432          1.584963          3.000000          4.321928   
5          8.366322          7.228819          7.076816          4.584963   
6          2.584963          6.357552          6.475733          7.330917   
7          9.342075         10.392317          7.383704         11.032735   
8          4.906891         11.340963          3.169925         11.161762   
9          6.807355          4.247928          5.459432          5.977280   

   TCGA-D3-A51G-06A  TCGA-EE-A29R-06A  
0         18.355114         16.959379  
1          2.584963          3.584963  
2          1.000000          4.584963  
3          0.000000          1.000000  
4          4.807355          3.700440  
5          6.409391          7.139551  
6          7.954196          9.177420  
7         10.082149         10.088788  
8          4.643856         12.393927  
9          5.614710          8.233620

import pandas as pd
data_path = '../test_TransProPy/data/class.csv'  
data = pd.read_csv(data_path)
print(data.iloc[:10, :10])

         Unnamed: 0  class
0  TCGA-D9-A4Z2-01A      2
1  TCGA-ER-A2NH-06A      2
2  TCGA-BF-A5EO-01A      2
3  TCGA-D9-A6EA-06A      2
4  TCGA-D9-A4Z3-01A      2
5  TCGA-GN-A26A-06A      2
6  TCGA-D3-A3BZ-06A      2
7  TCGA-D3-A51G-06A      2
8  TCGA-EE-A29R-06A      2
9  TCGA-D3-A2JE-06A      2

4.5.3 MACFCmain

ranked_features, fre1, frequency, len_FName, FName, Fauc = Tr.MACFCmain(
    100, 
    "class", 
    0.95, 
    data_path='../test_TransProPy/data/four_methods_degs_intersection.csv', 
    label_path='../test_TransProPy/data/class.csv'
    )

4.5.4 Result

# Print the first 20 Ranked Features
print("\nFirst 20 Ranked Features:")
for i, feature in enumerate(ranked_features[:20], 1):
    print(f"{i}. {feature}")


First 20 Ranked Features:
1. 355
2. 68
3. 867
4. 78
5. 97
6. 90
7. 432
8. 313
9. 497
10. 511
11. 66
12. 172
13. 544
14. 1162
15. 487
16. 317
17. 1283
18. 930
19. 1290
20. 1170

# Print the first 20 Feature Frequencies (fre1)
print("\nFirst 20 Feature Frequencies:")
for i, (feature, freq) in enumerate(list(fre1.items())[:20], 1):
    print(f"{i}. Feature: {feature}, Frequency: {freq}")


First 20 Feature Frequencies:
1. Feature: 355, Frequency: 1
2. Feature: 68, Frequency: 1
3. Feature: 867, Frequency: 1
4. Feature: 78, Frequency: 1
5. Feature: 97, Frequency: 1
6. Feature: 90, Frequency: 1
7. Feature: 432, Frequency: 1
8. Feature: 313, Frequency: 1
9. Feature: 497, Frequency: 1
10. Feature: 511, Frequency: 1
11. Feature: 66, Frequency: 1
12. Feature: 172, Frequency: 1
13. Feature: 544, Frequency: 1
14. Feature: 1162, Frequency: 1
15. Feature: 487, Frequency: 1
16. Feature: 317, Frequency: 1
17. Feature: 1283, Frequency: 1
18. Feature: 930, Frequency: 1
19. Feature: 1290, Frequency: 1
20. Feature: 1170, Frequency: 1

# Print the Features with a frequency greater than 1 
print("\nFeatures with a frequency greater than 1 :")
for i, (feature, freq) in enumerate(frequency[:20], 1):
    print(f"{i}. Feature: {feature}, Frequency: {freq}")


Features with a frequency greater than 1 :

# Print the length of FName (len_FName)
print("\nCount of Features with AUC > 0.5 (len_FName):")
print(len_FName)


Count of Features with AUC > 0.5 (len_FName):
1

# Print the first 20 Features with AUC > 0.5 (FName)
print("\nFirst few Features with AUC > 0.5:")
for i, feature in enumerate(FName[:20], 1):
    print(f"{i}. {feature}")


First few Features with AUC > 0.5:
1. 355

# Print the first 20 AUC Values for Ranked Features (Fauc)
print("\nFirst few AUC Values for Ranked Features:")
for i, auc in enumerate(Fauc[:20], 1):
    print(f"{i}. AUC: {auc}")


First few AUC Values for Ranked Features:
1. AUC: 1.0

4.5.5 gene_name

gene_names = TUG.gene_name(data_path='../test_TransProPy/data/four_methods_degs_intersection.csv')

# Print the first 20 gene names
print("First 20 Gene Names:")
for i, gene_name in enumerate(gene_names[:20], 1):
    print(f"{i}. {gene_name}")

First 20 Gene Names:
1. A2M
2. A2ML1
3. AADAC
4. AADACL2
5. ABCA12
6. ABCA17P
7. ABCA9
8. ABCB4
9. ABCB5
10. ABCC11
11. ABCC3
12. ABCD1
13. ABI3BP
14. AC002116.8
15. AC002398.9
16. AC004057.1
17. AC004231.2
18. AC004540.5
19. AC004623.3
20. AC004951.5

4.5.6 gene_map_feature

gene_to_feature_mapping = TUGM.gene_map_feature(gene_names, ranked_features)

4.5.6.1 AUC>0.5

# Generating gene_to_feature_mapping
# Note: genes, features, and AUC values are paired by their position in each list
gene_to_feature_mapping = {}
for gene, feature, auc_value in zip(gene_names, FName, Fauc):
    # Store the feature name and AUC value under the gene name
    gene_to_feature_mapping[gene] = (feature, auc_value)

# Print the first 20 mappings
print("\nFirst 20 Gene to Feature Mappings with AUC Values:")
for i, (gene, (feature, auc)) in enumerate(list(gene_to_feature_mapping.items())[:20], 1):
    print(f"{i}. Gene: {gene}, Feature: {feature}, AUC: {auc}")


First 20 Gene to Feature Mappings with AUC Values:
1. Gene: A2M, Feature: 355, AUC: 1.0
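
The mapping above pairs gene names and features by their position in each list. If the feature labels are instead zero-based row positions in the input data, a lookup by index would recover the gene behind each feature. The sketch below shows that alternative under this assumption, which is not confirmed by this documentation; the inputs are hypothetical, and any filtering applied before ranking (for example by threshold) could shift the positions.

```python
# Hypothetical inputs standing in for gene_names and ranked_features.
gene_names = ["A2M", "A2ML1", "AADAC", "AADACL2", "ABCA12"]
ranked_features = ["3", "0", "4"]

# Assumption: each feature label is the zero-based row position of a gene.
feature_to_gene = {f: gene_names[int(f)] for f in ranked_features}
print(feature_to_gene)  # {'3': 'AADACL2', '0': 'A2M', '4': 'ABCA12'}
```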

4.6 Usage of MACFCmain (Insignificant correlation)

This function uses the MACFC method to select feature genes relevant to classification and ranks them based on their corresponding weights.

Please note: Data characteristics: the features are weakly correlated with the classification.

Randomly shuffling the class labels simulates, to a certain extent, a reduced correlation.
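
How the shuffled label file was produced is not described here. One possible way to shuffle part of the labels is sketched below; the helper `shuffle_fraction`, the fraction, and the seed are hypothetical choices.

```python
import random

def shuffle_fraction(labels, fraction, seed=0):
    """Return a copy of labels in which a random subset of positions is permuted."""
    rng = random.Random(seed)
    shuffled = list(labels)
    # Number of positions whose labels take part in the permutation.
    k = int(len(labels) * fraction)
    positions = rng.sample(range(len(labels)), k)
    values = [shuffled[i] for i in positions]
    rng.shuffle(values)
    for i, value in zip(positions, values):
        shuffled[i] = value
    return shuffled

# Hypothetical labels: five samples of class 1, five of class 2.
labels = [1] * 5 + [2] * 5
noisy = shuffle_fraction(labels, fraction=0.6, seed=42)

# Class counts are preserved; only the assignment to samples changes.
assert sorted(noisy) == sorted(labels)
```

A larger fraction weakens the link between features and labels further, while a fraction of 0 leaves the labels unchanged.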

4.6.1 Data

import pandas as pd
data_path = '../test_TransProPy/data/four_methods_degs_intersection.csv'  
data = pd.read_csv(data_path)
print(data.iloc[:10, :10])

  Unnamed: 0  TCGA-D9-A4Z2-01A  TCGA-ER-A2NH-06A  TCGA-BF-A5EO-01A  \
0        A2M         16.808499         16.506184         17.143433   
1      A2ML1          1.584963          9.517669          7.434628   
2      AADAC          4.000000          2.584963          1.584963   
3    AADACL2          1.000000          1.000000          0.000000   
4     ABCA12          4.523562          4.321928          3.906891   
5    ABCA17P          4.584963          5.169925          3.807355   
6      ABCA9          9.753217          6.906891          3.459432   
7      ABCB4          9.177420          6.700440          5.000000   
8      ABCB5         10.134426          4.169925          9.167418   
9     ABCC11         10.092757          6.491853          5.459432   

   TCGA-D9-A6EA-06A  TCGA-D9-A4Z3-01A  TCGA-GN-A26A-06A  TCGA-D3-A3BZ-06A  \
0         17.760739         14.766839         16.263691         16.035207   
1          2.584963          1.584963          2.584963          5.285402   
2          0.000000          0.000000          0.000000          3.321928   
3          0.000000          1.000000          0.000000          0.000000   
4          3.459432          1.584963          3.000000          4.321928   
5          8.366322          7.228819          7.076816          4.584963   
6          2.584963          6.357552          6.475733          7.330917   
7          9.342075         10.392317          7.383704         11.032735   
8          4.906891         11.340963          3.169925         11.161762   
9          6.807355          4.247928          5.459432          5.977280   

   TCGA-D3-A51G-06A  TCGA-EE-A29R-06A  
0         18.355114         16.959379  
1          2.584963          3.584963  
2          1.000000          4.584963  
3          0.000000          1.000000  
4          4.807355          3.700440  
5          6.409391          7.139551  
6          7.954196          9.177420  
7         10.082149         10.088788  
8          4.643856         12.393927  
9          5.614710          8.233620

import pandas as pd
data_path = '../test_TransProPy/data/random_classification_class.csv'  
data = pd.read_csv(data_path)
print(data.iloc[:10, :10])

         Unnamed: 0  class
0  TCGA-D9-A4Z2-01A      2
1  TCGA-ER-A2NH-06A      2
2  TCGA-BF-A5EO-01A      2
3  TCGA-D9-A6EA-06A      2
4  TCGA-D9-A4Z3-01A      1
5  TCGA-GN-A26A-06A      1
6  TCGA-D3-A3BZ-06A      1
7  TCGA-D3-A51G-06A      1
8  TCGA-EE-A29R-06A      1
9  TCGA-D3-A2JE-06A      1

4.6.2 MACFCmain

ranked_features, fre1, frequency, len_FName, FName, Fauc = Tr.MACFCmain(
    100, 
    "class", 
    0.95, 
    data_path='../test_TransProPy/data/four_methods_degs_intersection.csv', 
    label_path='../test_TransProPy/data/random_classification_class.csv'
    )

4.6.3 Result

# Print the first 20 Ranked Features
print("\nFirst 20 Ranked Features:")
for i, feature in enumerate(ranked_features[:20], 1):
    print(f"{i}. {feature}")


First 20 Ranked Features:
1. 1147
2. 605
3. 140
4. 845
5. 546
6. 1052
7. 188
8. 431
9. 182
10. 120
11. 362
12. 998
13. 1122
14. 246
15. 23
16. 383
17. 258
18. 189
19. 746
20. 1064

# Print the first 20 Feature Frequencies (fre1)
print("\nFirst 20 Feature Frequencies:")
for i, (feature, freq) in enumerate(list(fre1.items())[:20], 1):
    print(f"{i}. Feature: {feature}, Frequency: {freq}")


First 20 Feature Frequencies:
1. Feature: 1147, Frequency: 1
2. Feature: 605, Frequency: 1
3. Feature: 140, Frequency: 1
4. Feature: 845, Frequency: 1
5. Feature: 546, Frequency: 1
6. Feature: 1052, Frequency: 1
7. Feature: 188, Frequency: 1
8. Feature: 431, Frequency: 1
9. Feature: 182, Frequency: 1
10. Feature: 120, Frequency: 1
11. Feature: 362, Frequency: 1
12. Feature: 998, Frequency: 1
13. Feature: 1122, Frequency: 1
14. Feature: 246, Frequency: 1
15. Feature: 23, Frequency: 1
16. Feature: 383, Frequency: 1
17. Feature: 258, Frequency: 1
18. Feature: 189, Frequency: 1
19. Feature: 746, Frequency: 1
20. Feature: 1064, Frequency: 1

# Print the Features with a frequency greater than 1 
print("\nFeatures with a frequency greater than 1 :")
for i, (feature, freq) in enumerate(frequency[:20], 1):
    print(f"{i}. Feature: {feature}, Frequency: {freq}")


Features with a frequency greater than 1 :

# Print the length of FName (len_FName)
print("\nCount of Features with AUC > 0.5 (len_FName):")
print(len_FName)


Count of Features with AUC > 0.5 (len_FName):
757

# Print the first 20 Features with AUC > 0.5 (FName)
print("\nFirst few Features with AUC > 0.5:")
for i, feature in enumerate(FName[:20], 1):
    print(f"{i}. {feature}")


First few Features with AUC > 0.5:
1. 1147
2. 140
3. 605
4. 518
5. 1080
6. 826
7. 541
8. 695
9. 1266
10. 0
11. 864
12. 188
13. 842
14. 344
15. 824
16. 1208
17. 1086
18. 602
19. 295
20. 1261

# Print the first 20 AUC Values for Ranked Features (Fauc)
print("\nFirst few AUC Values for Ranked Features:")
for i, auc in enumerate(Fauc[:20], 1):
    print(f"{i}. AUC: {auc}")


First few AUC Values for Ranked Features:
1. AUC: 0.6469530885995243
2. AUC: 0.6465975134260281
3. AUC: 0.6447509747664238
4. AUC: 0.6415704161455651
5. AUC: 0.6405110473528042
6. AUC: 0.6403761740111332
7. AUC: 0.6400941661149121
8. AUC: 0.6398293239167219
9. AUC: 0.6387871208219917
10. AUC: 0.6381814169057604
11. AUC: 0.6376174011133181
12. AUC: 0.6373378454596729
13. AUC: 0.6371956153902744
14. AUC: 0.6371931631476986
15. AUC: 0.6370018882267834
16. AUC: 0.6367713774246548
17. AUC: 0.636531057652223
18. AUC: 0.6361656735084235
19. AUC: 0.6359670418597808
20. AUC: 0.6358419774884132

4.6.4 gene_name

gene_names = TUG.gene_name(data_path='../test_TransProPy/data/four_methods_degs_intersection.csv')

# Print the first 20 gene names
print("First 20 Gene Names:")
for i, gene_name in enumerate(gene_names[:20], 1):
    print(f"{i}. {gene_name}")

First 20 Gene Names:
1. A2M
2. A2ML1
3. AADAC
4. AADACL2
5. ABCA12
6. ABCA17P
7. ABCA9
8. ABCB4
9. ABCB5
10. ABCC11
11. ABCC3
12. ABCD1
13. ABI3BP
14. AC002116.8
15. AC002398.9
16. AC004057.1
17. AC004231.2
18. AC004540.5
19. AC004623.3
20. AC004951.5

4.6.5 gene_map_feature

gene_to_feature_mapping = TUGM.gene_map_feature(gene_names, ranked_features)

4.6.5.1 AUC>0.5

# Generating gene_to_feature_mapping
# Note: genes, features, and AUC values are paired by their position in each list
gene_to_feature_mapping = {}
for gene, feature, auc_value in zip(gene_names, FName, Fauc):
    # Store the feature name and AUC value under the gene name
    gene_to_feature_mapping[gene] = (feature, auc_value)

# Print the first 20 mappings
print("\nFirst 20 Gene to Feature Mappings with AUC Values:")
for i, (gene, (feature, auc)) in enumerate(list(gene_to_feature_mapping.items())[:20], 1):
    print(f"{i}. Gene: {gene}, Feature: {feature}, AUC: {auc}")


First 20 Gene to Feature Mappings with AUC Values:
1. Gene: A2M, Feature: 1147, AUC: 0.6469530885995243
2. Gene: A2ML1, Feature: 140, AUC: 0.6465975134260281
3. Gene: AADAC, Feature: 605, AUC: 0.6447509747664238
4. Gene: AADACL2, Feature: 518, AUC: 0.6415704161455651
5. Gene: ABCA12, Feature: 1080, AUC: 0.6405110473528042
6. Gene: ABCA17P, Feature: 826, AUC: 0.6403761740111332
7. Gene: ABCA9, Feature: 541, AUC: 0.6400941661149121
8. Gene: ABCB4, Feature: 695, AUC: 0.6398293239167219
9. Gene: ABCB5, Feature: 1266, AUC: 0.6387871208219917
10. Gene: ABCC11, Feature: 0, AUC: 0.6381814169057604
11. Gene: ABCC3, Feature: 864, AUC: 0.6376174011133181
12. Gene: ABCD1, Feature: 188, AUC: 0.6373378454596729
13. Gene: ABI3BP, Feature: 842, AUC: 0.6371956153902744
14. Gene: AC002116.8, Feature: 344, AUC: 0.6371931631476986
15. Gene: AC002398.9, Feature: 824, AUC: 0.6370018882267834
16. Gene: AC004057.1, Feature: 1208, AUC: 0.6367713774246548
17. Gene: AC004231.2, Feature: 1086, AUC: 0.636531057652223
18. Gene: AC004540.5, Feature: 602, AUC: 0.6361656735084235
19. Gene: AC004623.3, Feature: 295, AUC: 0.6359670418597808
20. Gene: AC004951.5, Feature: 1261, AUC: 0.6358419774884132

4.7 References

Su, Y., Du, K., Wang, J., Wei, J. and Liu, J. (2022) Multi-variable AUC for sifting complementary features and its biomedical application. Briefings in Bioinformatics, 23, bbac029.