5  NewMACFCmain.py

Applies MACFC selection to identify feature genes relevant to classification.

Change Focus: In practice, once a single feature reaches an AUC of 1, adding more features cannot raise the AUC further, but it can still improve the interpretability or robustness of the model. A feature with an AUC of 1 classifies the current data perfectly, yet other features may matter more in different data subsets or under different conditions. Exploring combinations of the remaining features therefore still has value.

Selection Guideline: Use MACFCmain when the association between features and class labels is weak. Use New_MACFCmain when the association is strong, or when the data has been processed to enhance it.

5.1 Parameters

  • AUC_threshold: float
    • AUC threshold for feature selection. Features with AUC values higher than this threshold are recorded but not used in subsequent calculations.
  • max_rank: int
    • The total number of gene combinations you want to obtain.
  • lable_name: string
    • Name of the label variable used for classification. For example: gender, age, altitude, temperature, quality, and other categorical variable names.
  • threshold: float
    • For example: 0.9
    • The threshold refers to the proportion of samples with non-zero values, relative to all samples, for each feature.
  • data_path: string
    • For example: ‘../data/gene_tpm.csv’
    • Please note: Preprocess the input data in advance to remove samples that contain too many missing values or zeros.
    • The input data matrix should have genes as rows and samples as columns.
  • label_path: string
    • For example: ‘../data/tumor_class.csv’
    • Please note: The input sample categories must be in a numerical binary format, such as: 1,2,1,1,2,2,1.
    • In this case, the numerical values represent the following classifications: 1: male; 2: female.
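
To make the threshold parameter concrete, the following is a minimal sketch of how the proportion of non-zero samples per feature could be computed. The helper name nonzero_proportion, the toy matrix, and the rule of keeping a gene when its proportion exceeds the threshold are illustrative assumptions, not TransProPy's actual implementation.

```python
# Illustrative sketch only: `nonzero_proportion` is a hypothetical helper,
# not part of TransProPy.

def nonzero_proportion(values):
    """Return the fraction of samples whose value is non-zero."""
    return sum(1 for v in values if v != 0) / len(values)

# Toy expression matrix: genes as rows, samples as columns.
expression = {
    "GENE_A": [6.7, 4.0, 5.0, 5.2, 6.0],  # non-zero in 5 of 5 samples
    "GENE_B": [1.0, 1.0, 0.0, 0.0, 1.0],  # non-zero in 3 of 5 samples
    "GENE_C": [0.0, 0.0, 0.0, 0.0, 0.0],  # non-zero in 0 of 5 samples
}

threshold = 0.9  # same role as the `threshold` parameter described above

proportions = {gene: nonzero_proportion(v) for gene, v in expression.items()}

# Assumption: a gene is kept when its proportion exceeds the threshold.
kept = [gene for gene, p in proportions.items() if p > threshold]

print(proportions)
print(kept)
```

With a threshold of 0.9, only genes expressed in more than 90% of the samples survive in this sketch, which is why sparse genes should be expected to drop out before any AUC is computed.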

5.2 Returns

  • high_auc_features: list of tuples
    • This list contains tuples of feature indices and their corresponding AUC values, where the AUC value is greater than AUC_threshold. Each tuple consists of the feature’s index in string format and its AUC value as a float. This signifies that these features are highly predictive, with a strong ability to distinguish between different classes in the classification task.
  • fr: list of strings
    • The ranked features.
  • fre1: dictionary
    • Feature names as keys and their frequencies as values.
  • frequency: list of tuples
    • Feature names and their frequencies.
    • frequency is a list sorted by occurrence frequency in descending order. It includes only those elements of the dictionary fre1 (the counted frequencies of elements in the original data) that occur more than once, along with their frequencies.
  • len(FName): integer
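    • Number of features in FName, that is, features with an AUC greater than 0.5 (features above AUC_threshold are reported separately in high_auc_features).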
  • FName: array of strings
    • Feature names after ranking with AUC > 0.5.
  • Fauc: array of floats
    • AUC values corresponding to the ranked feature names.
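
The relationship between fre1 and frequency described above can be sketched as follows. This is an illustration under stated assumptions: the input list ranklist is hypothetical toy data, and the use of collections.Counter is a convenience here, not necessarily how TransProPy counts internally.

```python
from collections import Counter

# Hypothetical example: features collected from several optimal feature sets.
ranklist = ["2096", "2460", "2096", "482", "2460", "2096", "1417", "1417"]

# fre1: feature name -> number of occurrences, in first-seen order.
fre1 = dict(Counter(ranklist))

# frequency: only features seen more than once, sorted by count (descending).
frequency = sorted(
    ((name, count) for name, count in fre1.items() if count > 1),
    key=lambda item: item[1],
    reverse=True,
)

print(fre1)
print(frequency)
```

Feature "482" appears in fre1 with a count of 1 but is absent from frequency, which matches the behaviour described for the two return values.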

5.3 Function Principle Explanation

  1. Feature Frequency and AUC: In this function, features that appear with high frequency indicate their presence in multiple optimal feature sets. Each optimal feature set is determined by calculating its Area Under the Receiver Operating Characteristic (ROC) Curve (AUC), which is a common measure for evaluating classifier performance. During each iteration of the loop, an optimal feature set with the highest average AUC value is selected. Features from this set are then added to a rank list, known as ‘ranklist,’ and when necessary, also to a set named ‘rankset’.
  2. High-Frequency Features and Performance: Because features in each set are chosen based on their contribution to classifier performance, high-frequency features are likely to perform well. In other words, if a feature appears in multiple optimal feature sets, it may have a significant impact on the performance of the classifier.
  3. Note on Low-Frequency Features: However, it’s important to note that a low frequency of a feature does not necessarily mean it is unimportant. The importance of a feature may depend on how it combines with other features. Additionally, the outcome of feature selection may be influenced by the characteristics of the dataset and random factors. Therefore, the frequency provided by this function should only be used as a reference and is not an absolute indicator of feature performance.
  4. Further Evaluation Methods: If you wish to explore feature performance more deeply, you may need to employ other methods for assessing feature importance. This could include model-based importance metrics or statistical tests to evaluate the relationship between features and the target variable.
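
For readers unfamiliar with AUC as a per-feature score, the sketch below computes it for a single feature using the standard pairwise (Mann-Whitney) formulation: the probability that a sample from the positive class has a higher value than a sample from the other class. The function name auc_single_feature and the toy values are hypothetical, and the source does not state which AUC routine New_MACFCmain uses internally.

```python
def auc_single_feature(values, labels, positive):
    """AUC of one feature: the probability that a positive sample has a
    higher value than a negative one (ties count as one half)."""
    pos = [v for v, y in zip(values, labels) if y == positive]
    neg = [v for v, y in zip(values, labels) if y != positive]
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))

# Binary labels in the numeric format expected by label_path.
labels = [1, 2, 1, 1, 2, 2, 1]

# Toy expression values for two hypothetical genes.
gene_x = [0.2, 3.1, 0.5, 0.1, 2.8, 3.5, 0.4]  # class 2 is always higher
gene_y = [1.0, 1.2, 0.9, 1.3, 1.1, 0.8, 1.4]  # classes overlap

auc_x = auc_single_feature(gene_x, labels, positive=2)
auc_y = auc_single_feature(gene_y, labels, positive=2)

print(auc_x)
print(auc_y)
```

Here gene_x separates the two classes completely and reaches an AUC of 1, while gene_y overlaps between classes and scores far lower. A feature like gene_x is the kind that the AUC_threshold parameter sets aside.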

5.4 Usage Workflow

  • FName is a list of feature names sorted based on their AUC (Area Under the Curve) values. In this sorting method, the primary consideration is the AUC value, followed by the feature name. All features included in FName have an AUC value greater than 0.5.
  • fr is the result of another sorting method. In this method, the primary consideration is the “combined” AUC of the features, followed by their individual AUC values. This means that some features, despite having lower individual AUC values, may produce a higher combined AUC when paired with other features. Therefore, their position in the fr list may be higher than in the FName list.
  • The code for fr employs a more complex logic to select and combine features to optimize their combined AUC values. In this process, features are not solely selected and sorted based on their individual AUC values; the effect of their combination with other features is also considered. Consequently, the sorting logic for fr (or rankset) differs from that of FName.
  • Please note: While the code takes into account both individual AUC values and combined AUC values, the sorting of the fr list (i.e., rankset) initially starts based on individual AUC values. This is because at the beginning of each external loop iteration, the first element of fs is the next feature sorted by its individual AUC value. The list is then further optimized by evaluating the combination effects with other features.
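
The individual-AUC ordering behind FName, and the way features above AUC_threshold are set aside, can be sketched as below. The AUC values are hypothetical, and the tie-breaking rule (ascending string comparison of feature names) is an assumption made for illustration. The combined-AUC search that produces fr is not reproduced here.

```python
# Hypothetical per-feature AUC values (feature name -> AUC).
feature_auc = {
    "26": 1.0, "704": 0.95, "2440": 0.8999,
    "482": 0.8998, "848": 0.8998, "7": 0.5, "9": 0.42,
}

AUC_threshold = 0.9

def by_auc_then_name(item):
    """Sort key: AUC descending, then feature name ascending."""
    name, auc = item
    return (-auc, name)

# Features above the threshold are recorded separately and set aside.
high_auc_features = sorted(
    ((n, a) for n, a in feature_auc.items() if a > AUC_threshold),
    key=by_auc_then_name,
)

# Remaining features with AUC > 0.5, ranked by individual AUC.
remaining = sorted(
    ((n, a) for n, a in feature_auc.items() if 0.5 < a <= AUC_threshold),
    key=by_auc_then_name,
)
FName = [name for name, _ in remaining]
Fauc = [auc for _, auc in remaining]

print(high_auc_features)
print(FName)
print(Fauc)
```

In this toy example "482" and "848" share the same AUC, so their relative order is decided by name alone. Features "7" and "9" are dropped because their AUC does not exceed 0.5.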

5.5 Usage of New_MACFCmain (four_methods_degs_union)

This function uses the MACFC method to select feature genes relevant to classification and ranks them based on their corresponding weights.

5.5.1 Import the corresponding module

import TransProPy.NewMACFCmain as TN
import TransProPy.UtilsFunction1.GeneNames as TUG
import TransProPy.UtilsFunction1.GeneToFeatureMapping as TUGM

5.5.2 Data

import pandas as pd
data_path = '../test_TransProPy/data/four_methods_degs_union.csv'  
data = pd.read_csv(data_path)
print(data.iloc[:10, :10]) 
            Unnamed: 0  TCGA-D9-A4Z2-01A  TCGA-ER-A2NH-06A  TCGA-BF-A5EO-01A  \
0                 A1BG          6.754888          4.000000          5.044394   
1                  A2M         16.808499         16.506184         17.143433   
2                A2ML1          1.584963          9.517669          7.434628   
3                AADAC          4.000000          2.584963          1.584963   
4              AADACL2          1.000000          1.000000          0.000000   
5              AADACL3          0.000000          0.000000          0.000000   
6              AADACL4          0.000000          0.000000          0.000000   
7          AB019440.50          0.000000          0.000000          0.000000   
8          AB019441.29          6.392317          4.954196          6.629357   
9  ABC12-47964100C23.1          0.000000          0.000000          0.000000   

   TCGA-D9-A6EA-06A  TCGA-D9-A4Z3-01A  TCGA-GN-A26A-06A  TCGA-D3-A3BZ-06A  \
0          5.247928          5.977280          5.044394          5.491853   
1         17.760739         14.766839         16.263691         16.035207   
2          2.584963          1.584963          2.584963          5.285402   
3          0.000000          0.000000          0.000000          3.321928   
4          0.000000          1.000000          0.000000          0.000000   
5          0.000000          1.000000          0.000000          0.000000   
6          0.000000          0.000000          0.000000          0.000000   
7          0.000000          0.000000          0.000000          0.000000   
8          6.988685          8.625709          6.614710          6.845490   
9          0.000000          0.000000          0.000000          0.000000   

   TCGA-D3-A51G-06A  TCGA-EE-A29R-06A  
0          5.754888          6.357552  
1         18.355114         16.959379  
2          2.584963          3.584963  
3          1.000000          4.584963  
4          0.000000          1.000000  
5          4.169925          0.000000  
6          0.000000          0.000000  
7          0.000000          0.000000  
8          7.845490          6.507795  
9          1.584963          0.000000  

import pandas as pd
data_path = '../test_TransProPy/data/class.csv'  
data = pd.read_csv(data_path)
print(data.iloc[:10, :10]) 
         Unnamed: 0  class
0  TCGA-D9-A4Z2-01A      2
1  TCGA-ER-A2NH-06A      2
2  TCGA-BF-A5EO-01A      2
3  TCGA-D9-A6EA-06A      2
4  TCGA-D9-A4Z3-01A      2
5  TCGA-GN-A26A-06A      2
6  TCGA-D3-A3BZ-06A      2
7  TCGA-D3-A51G-06A      2
8  TCGA-EE-A29R-06A      2
9  TCGA-D3-A2JE-06A      2

5.5.3 New_MACFCmain

high_auc_features, ranked_features, fre1, frequency, len_FName, FName, Fauc = TN.New_MACFCmain(
    0.9,
    100, 
    "class", 
    0.9, 
    data_path='../test_TransProPy/data/four_methods_degs_union.csv', 
    label_path='../test_TransProPy/data/class.csv'
    )

5.5.4 Result

5.5.4.1 Features with AUC greater than 0.9 and their AUC values

# Print features with AUC greater than 0.9 and their AUC values
print('\nFeatures with AUC greater than 0.9:')
total_features = len(high_auc_features)
print(f"Total features: {total_features}")

# Determine the number of features to display
num_to_display = min(total_features, 20)

for i in range(num_to_display):
    feature, auc_value = high_auc_features[i]
    print(f"Feature: {feature}, AUC: {auc_value}")

Features with AUC greater than 0.9:
Total features: 1421
Feature: 26, AUC: 1.0
Feature: 704, AUC: 1.0
Feature: 717, AUC: 1.0
Feature: 1172, AUC: 1.0
Feature: 1899, AUC: 1.0
Feature: 1948, AUC: 1.0
Feature: 2338, AUC: 1.0
Feature: 2596, AUC: 0.9999973764986752
Feature: 582, AUC: 0.9999947529973503
Feature: 2453, AUC: 0.9999895059947005
Feature: 1563, AUC: 0.9999868824933756
Feature: 786, AUC: 0.9999842589920508
Feature: 2419, AUC: 0.9999842589920508
Feature: 204, AUC: 0.9999763884880762
Feature: 1002, AUC: 0.9999763884880762
Feature: 291, AUC: 0.9999658944827767
Feature: 237, AUC: 0.9999370359682032
Feature: 124, AUC: 0.9999239184615788
Feature: 1561, AUC: 0.9999081774536296
Feature: 2171, AUC: 0.9999081774536296

5.5.4.2 New_MACFCmain

# Print the first 20 Ranked Features
print("\nFirst 20 Ranked Features:")
for i, feature in enumerate(ranked_features[:20], 1):
    print(f"{i}. {feature}")

First 20 Ranked Features:
1. 2440
2. 2460
3. 2096
4. 482
5. 2223
6. 848
7. 1501
8. 519
9. 1417
10. 1939
11. 1914
12. 937
13. 1340
14. 100
15. 1978
16. 1558
17. 413
18. 1809
19. 2031
20. 1466

# Print the first 20 Feature Frequencies (fre1)
print("\nFirst 20 Feature Frequencies:")
for i, (feature, freq) in enumerate(list(fre1.items())[:20], 1):
    print(f"{i}. Feature: {feature}, Frequency: {freq}")

First 20 Feature Frequencies:
1. Feature: 2440, Frequency: 1
2. Feature: 2460, Frequency: 16
3. Feature: 2096, Frequency: 26
4. Feature: 482, Frequency: 1
5. Feature: 2223, Frequency: 13
6. Feature: 848, Frequency: 1
7. Feature: 1501, Frequency: 1
8. Feature: 519, Frequency: 1
9. Feature: 1417, Frequency: 2
10. Feature: 1939, Frequency: 1
11. Feature: 1914, Frequency: 1
12. Feature: 937, Frequency: 1
13. Feature: 1340, Frequency: 1
14. Feature: 100, Frequency: 1
15. Feature: 1978, Frequency: 1
16. Feature: 1558, Frequency: 1
17. Feature: 413, Frequency: 1
18. Feature: 1809, Frequency: 1
19. Feature: 2031, Frequency: 1
20. Feature: 1466, Frequency: 11

# Print the Features with a frequency greater than 1 
print("\nFeatures with a frequency greater than 1 :")
for i, (feature, freq) in enumerate(frequency[:20], 1):
    print(f"{i}. Feature: {feature}, Frequency: {freq}")

Features with a frequency greater than 1 :
1. Feature: 2096, Frequency: 26
2. Feature: 2460, Frequency: 16
3. Feature: 2223, Frequency: 13
4. Feature: 1466, Frequency: 11
5. Feature: 1773, Frequency: 3
6. Feature: 900, Frequency: 2
7. Feature: 2620, Frequency: 2
8. Feature: 176, Frequency: 2
9. Feature: 1417, Frequency: 2
10. Feature: 1136, Frequency: 2
11. Feature: 1080, Frequency: 2

# Print the length of FName (len_FName)
print("\nCount of Features with AUC > 0.5 (len_FName):")
print(len_FName)

Count of Features with AUC > 0.5 (len_FName):
809

# Print the first 20 Features with AUC > 0.5 (FName)
print("\nFirst few Features with AUC > 0.5:")
for i, feature in enumerate(FName[:20], 1):
    print(f"{i}. {feature}")

First few Features with AUC > 0.5:
1. 2440
2. 482
3. 848
4. 1501
5. 519
6. 1939
7. 1914
8. 937
9. 1340
10. 100
11. 1978
12. 1558
13. 413
14. 1809
15. 2031
16. 780
17. 712
18. 1362
19. 1136
20. 2486

# Print the first 20 AUC Values for Ranked Features (Fauc)
print("\nFirst few AUC Values for Ranked Features:")
for i, auc in enumerate(Fauc[:20], 1):
    print(f"{i}. AUC: {auc}")

First few AUC Values for Ranked Features:
1. AUC: 0.8999134244562793
2. AUC: 0.8998950599470052
3. AUC: 0.899879318939056
4. AUC: 0.8995041582496
5. AUC: 0.8994700527323767
6. AUC: 0.8994595587270772
7. AUC: 0.8992864076396359
8. AUC: 0.8992785371356613
9. AUC: 0.8990555395230475
10. AUC: 0.8990424220164231
11. AUC: 0.8989715874806516
12. AUC: 0.8988351654117586
13. AUC: 0.8988325419104337
14. AUC: 0.8987433428653882
15. AUC: 0.8987276018574389
16. AUC: 0.8987249783561141
17. AUC: 0.8986725083296168
18. AUC: 0.8985649447752971
19. AUC: 0.8985150982501247
20. AUC: 0.898449510717003

5.5.5 gene_name

gene_names = TUG.gene_name(data_path='../test_TransProPy/data/four_methods_degs_union.csv')

# Print the first 20 gene names
print("First 20 Gene Names:")
for i, gene_name in enumerate(gene_names[:20], 1):
    print(f"{i}. {gene_name}")
First 20 Gene Names:
1. A1BG
2. A2M
3. A2ML1
4. AADAC
5. AADACL2
6. AADACL3
7. AADACL4
8. AB019440.50
9. AB019441.29
10. ABC12-47964100C23.1
11. ABC12-49244600F4.4
12. ABCA10
13. ABCA12
14. ABCA17P
15. ABCA6
16. ABCA8
17. ABCA9
18. ABCB11
19. ABCB4
20. ABCB5

5.5.6 gene_map_feature

5.5.6.1 high_auc_features_result (AUC>0.9)

# Extract feature indices from high_auc_features
high_ranked_features = [feature for feature, auc_value in high_auc_features]

# Utilize the TUGM.gene_map_feature function
gene_to_feature_mapping_0_9 = TUGM.gene_map_feature(gene_names, high_ranked_features)

# Creating a dictionary to store gene, feature, and AUC mapping
gene_feature_auc_mapping = {}

# Iterate over each gene and its corresponding feature
for gene, feature in gene_to_feature_mapping_0_9.items():
    feature = str(feature)  # Adjust this based on your data format
    # Find the corresponding AUC value for the feature
    auc_value = next((auc for feat, auc in high_auc_features if str(feat) == feature), None)
    # Store the gene, feature, and AUC in the mapping
    gene_feature_auc_mapping[gene] = (feature, auc_value)

# Print the first 20 gene to feature mappings along with AUC values
print("\nFirst 20 Gene to Feature Mappings with AUC Values:")
for i, (gene, (feature, auc)) in enumerate(list(gene_feature_auc_mapping.items())[:20], 1):
    print(f"{i}. Gene: {gene}, Feature: {feature}, AUC: {auc}")

First 20 Gene to Feature Mappings with AUC Values:
1. Gene: ABCD1, Feature: 26, AUC: 1.0
2. Gene: ANGPTL5, Feature: 704, AUC: 1.0
3. Gene: ANKRD20A10P, Feature: 717, AUC: 1.0
4. Gene: CAPN8, Feature: 1172, AUC: 1.0
5. Gene: CTD-2340E1.2, Feature: 1899, AUC: 1.0
6. Gene: CTD-2562J17.2, Feature: 1948, AUC: 1.0
7. Gene: EPGN, Feature: 2338, AUC: 1.0
8. Gene: FOSB, Feature: 2596, AUC: 0.9999973764986752
9. Gene: AF064858.8, Feature: 582, AUC: 0.9999947529973503
10. Gene: FAM229A, Feature: 2453, AUC: 0.9999895059947005
11. Gene: COL9A1, Feature: 1563, AUC: 0.9999868824933756
12. Gene: AP001107.1, Feature: 786, AUC: 0.9999842589920508
13. Gene: FAM155B, Feature: 2419, AUC: 0.9999842589920508
14. Gene: AC010547.9, Feature: 204, AUC: 0.9999763884880762
15. Gene: bP-21201H5.1, Feature: 1002, AUC: 0.9999763884880762
16. Gene: AC023590.1, Feature: 291, AUC: 0.9999658944827767
17. Gene: AC012512.1, Feature: 237, AUC: 0.9999370359682032
18. Gene: AC006486.10, Feature: 124, AUC: 0.9999239184615788
19. Gene: COL7A1, Feature: 1561, AUC: 0.9999081774536296
20. Gene: DNAH2, Feature: 2171, AUC: 0.9999081774536296

5.5.6.2 NewMACFCmain_result (0.9>AUC>0.5)

import numpy as np
# Generating gene_to_feature_mapping
gene_to_feature_mapping = {}
for gene, feature in zip(gene_names, FName):
    # Find the index of the feature in FName
    index = np.where(FName == feature)[0][0]
    # Find the corresponding AUC value using the index
    auc_value = Fauc[index]
    # Store the gene name, feature name, and AUC value in the mapping
    gene_to_feature_mapping[gene] = (feature, auc_value)

# Print the first 20 mappings
print("\nFirst 20 Gene to Feature Mappings with AUC Values:")
for i, (gene, (feature, auc)) in enumerate(list(gene_to_feature_mapping.items())[:20], 1):
    print(f"{i}. Gene: {gene}, Feature: {feature}, AUC: {auc}")

First 20 Gene to Feature Mappings with AUC Values:
1. Gene: A1BG, Feature: 2440, AUC: 0.8999134244562793
2. Gene: A2M, Feature: 482, AUC: 0.8998950599470052
3. Gene: A2ML1, Feature: 848, AUC: 0.899879318939056
4. Gene: AADAC, Feature: 1501, AUC: 0.8995041582496
5. Gene: AADACL2, Feature: 519, AUC: 0.8994700527323767
6. Gene: AADACL3, Feature: 1939, AUC: 0.8994595587270772
7. Gene: AADACL4, Feature: 1914, AUC: 0.8992864076396359
8. Gene: AB019440.50, Feature: 937, AUC: 0.8992785371356613
9. Gene: AB019441.29, Feature: 1340, AUC: 0.8990555395230475
10. Gene: ABC12-47964100C23.1, Feature: 100, AUC: 0.8990424220164231
11. Gene: ABC12-49244600F4.4, Feature: 1978, AUC: 0.8989715874806516
12. Gene: ABCA10, Feature: 1558, AUC: 0.8988351654117586
13. Gene: ABCA12, Feature: 413, AUC: 0.8988325419104337
14. Gene: ABCA17P, Feature: 1809, AUC: 0.8987433428653882
15. Gene: ABCA6, Feature: 2031, AUC: 0.8987276018574389
16. Gene: ABCA8, Feature: 780, AUC: 0.8987249783561141
17. Gene: ABCA9, Feature: 712, AUC: 0.8986725083296168
18. Gene: ABCB11, Feature: 1362, AUC: 0.8985649447752971
19. Gene: ABCB4, Feature: 1136, AUC: 0.8985150982501247
20. Gene: ABCB5, Feature: 2486, AUC: 0.898449510717003
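
Note that the loop in this subsection pairs gene_names and FName positionally through zip, so the i-th gene name is printed next to the i-th ranked feature. If the entries of FName are zero-based row indices into gene_names, as the index-based mapping in the previous subsection suggests, a lookup by index would attach each feature to its own gene. A minimal sketch under that assumption, using small hypothetical stand-ins for gene_names, FName, and Fauc:

```python
# Hypothetical stand-ins for the real gene_names, FName, and Fauc.
gene_names = ["A1BG", "A2M", "A2ML1", "AADAC", "AADACL2"]
FName = ["3", "0", "4"]          # feature indices as strings, ranked by AUC
Fauc = [0.8999, 0.8998, 0.8995]  # AUC values aligned with FName

gene_to_feature_mapping = {}
for feature, auc_value in zip(FName, Fauc):
    index = int(feature)  # assumption: feature names are zero-based row indices
    if 0 <= index < len(gene_names):
        gene_to_feature_mapping[gene_names[index]] = (feature, auc_value)

for gene, (feature, auc) in gene_to_feature_mapping.items():
    print(f"Gene: {gene}, Feature: {feature}, AUC: {auc}")
```

Whether this matches the intended behaviour depends on how New_MACFCmain names its features, so it is worth checking a few entries against the input matrix before relying on either mapping.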

5.5.7 Save Data

5.5.7.1 high_auc_features_result (AUC>0.9)

import pandas as pd
# Convert gene_feature_auc_mapping to a DataFrame
gene_feature_auc_df = pd.DataFrame.from_dict(gene_feature_auc_mapping, orient='index', columns=['Feature', 'AUC'])

# Reset the index to make gene names a separate column
gene_feature_auc_df.reset_index(inplace=True)
gene_feature_auc_df.rename(columns={'index': 'Gene'}, inplace=True)

# Save the DataFrame to a CSV file
gene_feature_auc_df.to_csv('../test_TransProPy/result/all_degs_count_exp_gene_feature_auc_mapping_0.9.csv', index=False)

5.5.7.2 NewMACFCmain_result (0.9>AUC>0.5)

import pandas as pd
# Convert gene_to_feature_mapping to a DataFrame
gene_to_feature_df = pd.DataFrame.from_dict(gene_to_feature_mapping, orient='index', columns=['Feature', 'AUC'])

# Reset the index to make gene names a separate column
gene_to_feature_df.reset_index(inplace=True)
gene_to_feature_df.rename(columns={'index': 'Gene'}, inplace=True)

# Save the DataFrame to a CSV file
gene_to_feature_df.to_csv('../test_TransProPy/result/all_degs_count_exp_gene_feature_auc_mapping_0.5_0.9.csv', index=False)

5.6 Usage of New_MACFCmain (all_count_exp)

This function uses the MACFC method to select feature genes relevant to classification and ranks them based on their corresponding weights.

5.6.1 Import the corresponding module

import TransProPy.NewMACFCmain as TN
import TransProPy.UtilsFunction1.GeneNames as TUG
import TransProPy.UtilsFunction1.GeneToFeatureMapping as TUGM

5.6.2 Data

import pandas as pd
data_path = '../test_TransProPy/data/all_count_exp.csv'  
data = pd.read_csv(data_path)
print(data.iloc[:10, :10]) 
  Unnamed: 0  TCGA-D9-A4Z2-01A  TCGA-ER-A2NH-06A  TCGA-BF-A5EO-01A  \
0  5_8S_rRNA                 0                 0                 0   
1    5S_rRNA                 0                 0                 0   
2        7SK                 0                 0                 0   
3       A1BG               107                15                32   
4   A1BG-AS1               373               112               363   
5       A1CF                10                 0                 0   
6        A2M            114778             93079            144772   
7    A2M-AS1               538               140                50   
8      A2ML1                 2               732               172   
9  A2ML1-AS1                 0                 0                 0   

   TCGA-D9-A6EA-06A  TCGA-D9-A4Z3-01A  TCGA-GN-A26A-06A  TCGA-D3-A3BZ-06A  \
0                 0                 0                 0                 0   
1                 0                 0                 0                 0   
2                 0                 0                 0                 0   
3                37                62                32                44   
4               347               222               114               225   
5                 1                 3                 0                 0   
6            222082             27877             78678             67154   
7               211                28               332               136   
8                 5                 2                 5                38   
9                 0                 3                 0                 0   

   TCGA-D3-A51G-06A  TCGA-EE-A29R-06A  
0                 0                 0  
1                 0                 0  
2                 0                 0  
3                53                81  
4               284               396  
5                 0                 1  
6            335304            127432  
7                54               225  
8                 5                11  
9                 0                 1  

import pandas as pd
data_path = '../test_TransProPy/data/class.csv'  
data = pd.read_csv(data_path)
print(data.iloc[:10, :10]) 
         Unnamed: 0  class
0  TCGA-D9-A4Z2-01A      2
1  TCGA-ER-A2NH-06A      2
2  TCGA-BF-A5EO-01A      2
3  TCGA-D9-A6EA-06A      2
4  TCGA-D9-A4Z3-01A      2
5  TCGA-GN-A26A-06A      2
6  TCGA-D3-A3BZ-06A      2
7  TCGA-D3-A51G-06A      2
8  TCGA-EE-A29R-06A      2
9  TCGA-D3-A2JE-06A      2

5.6.3 New_MACFCmain

high_auc_features, ranked_features, fre1, frequency, len_FName, FName, Fauc = TN.New_MACFCmain(
    0.9,
    100, 
    "class", 
    0.9, 
    data_path='../test_TransProPy/data/all_count_exp.csv', 
    label_path='../test_TransProPy/data/class.csv'
    )

5.6.4 Result

5.6.4.1 Features with AUC greater than 0.9 and their AUC values

# Print features with AUC greater than 0.9 and their AUC values
print('\nFeatures with AUC greater than 0.9:')
total_features = len(high_auc_features)
print(f"Total features: {total_features}")

# Determine the number of features to display
num_to_display = min(total_features, 20)

for i in range(num_to_display):
    feature, auc_value = high_auc_features[i]
    print(f"Feature: {feature}, AUC: {auc_value}")

Features with AUC greater than 0.9:
Total features: 4903
Feature: 129, AUC: 1.0
Feature: 4701, AUC: 1.0
Feature: 4788, AUC: 1.0
Feature: 7317, AUC: 1.0
Feature: 12536, AUC: 1.0
Feature: 12784, AUC: 1.0
Feature: 15361, AUC: 1.0
Feature: 17372, AUC: 0.9999973764986752
Feature: 3731, AUC: 0.9999947529973503
Feature: 16018, AUC: 0.9999921294960255
Feature: 16355, AUC: 0.9999895059947005
Feature: 9932, AUC: 0.9999868824933756
Feature: 5222, AUC: 0.9999842589920508
Feature: 1242, AUC: 0.9999763884880762
Feature: 6418, AUC: 0.9999763884880762
Feature: 1915, AUC: 0.9999658944827767
Feature: 1491, AUC: 0.9999370359682032
Feature: 724, AUC: 0.9999239184615788
Feature: 9930, AUC: 0.9999081774536296
Feature: 14212, AUC: 0.9999081774536296

5.6.4.2 New_MACFCmain

# Print the first 20 Ranked Features
print("\nFirst 20 Ranked Features:")
for i, feature in enumerate(ranked_features[:20], 1):
    print(f"{i}. {feature}")

First 20 Ranked Features:
1. 15431
2. 13741
3. 6034
4. 12954
5. 16231
6. 14367
7. 10409
8. 8777
9. 4031
10. 5595
11. 3066
12. 15959
13. 3436
14. 12999
15. 15583
16. 6533
17. 173
18. 6943
19. 8390
20. 5298

# Print the first 20 Feature Frequencies (fre1)
print("\nFirst 20 Feature Frequencies:")
for i, (feature, freq) in enumerate(list(fre1.items())[:20], 1):
    print(f"{i}. Feature: {feature}, Frequency: {freq}")

First 20 Feature Frequencies:
1. Feature: 15431, Frequency: 1
2. Feature: 13741, Frequency: 62
3. Feature: 6034, Frequency: 51
4. Feature: 12954, Frequency: 1
5. Feature: 16231, Frequency: 2
6. Feature: 14367, Frequency: 12
7. Feature: 10409, Frequency: 2
8. Feature: 8777, Frequency: 1
9. Feature: 4031, Frequency: 3
10. Feature: 5595, Frequency: 1
11. Feature: 3066, Frequency: 1
12. Feature: 15959, Frequency: 1
13. Feature: 3436, Frequency: 1
14. Feature: 12999, Frequency: 1
15. Feature: 15583, Frequency: 2
16. Feature: 6533, Frequency: 4
17. Feature: 173, Frequency: 2
18. Feature: 6943, Frequency: 2
19. Feature: 8390, Frequency: 2
20. Feature: 5298, Frequency: 1

# Print the Features with a frequency greater than 1 
print("\nFeatures with a frequency greater than 1 :")
for i, (feature, freq) in enumerate(frequency[:20], 1):
    print(f"{i}. Feature: {feature}, Frequency: {freq}")

Features with a frequency greater than 1 :
1. Feature: 13741, Frequency: 62
2. Feature: 6034, Frequency: 51
3. Feature: 14367, Frequency: 12
4. Feature: 10844, Frequency: 8
5. Feature: 16078, Frequency: 7
6. Feature: 6533, Frequency: 4
7. Feature: 4031, Frequency: 3
8. Feature: 8390, Frequency: 2
9. Feature: 6943, Frequency: 2
10. Feature: 586, Frequency: 2
11. Feature: 5249, Frequency: 2
12. Feature: 173, Frequency: 2
13. Feature: 16231, Frequency: 2
14. Feature: 16178, Frequency: 2
15. Feature: 15583, Frequency: 2
16. Feature: 10409, Frequency: 2

# Print the length of FName (len_FName)
print("\nCount of Features with AUC > 0.5 (len_FName):")
print(len_FName)

Count of Features with AUC > 0.5 (len_FName):
7726

# Print the first 20 Features with AUC > 0.5 (FName)
print("\nFirst few Features with AUC > 0.5:")
for i, feature in enumerate(FName[:20], 1):
    print(f"{i}. {feature}")

First few Features with AUC > 0.5:
1. 15431
2. 12954
3. 16231
4. 8777
5. 5595
6. 3066
7. 15959
8. 3436
9. 12999
10. 17823
11. 17634
12. 14506
13. 2797
14. 5249
15. 17163
16. 10844
17. 16206
18. 5614
19. 5193
20. 4601

# Print the first 20 AUC Values for Ranked Features (Fauc)
print("\nFirst few AUC Values for Ranked Features:")
for i, auc in enumerate(Fauc[:20], 1):
    print(f"{i}. AUC: {auc}")

First few AUC Values for Ranked Features:
1. AUC: 0.8999763884880762
2. AUC: 0.8999685179841016
3. AUC: 0.8999658944827768
4. AUC: 0.8999580239788021
5. AUC: 0.8998924364456804
6. AUC: 0.8998740719364063
7. AUC: 0.899860954429782
8. AUC: 0.8998583309284571
9. AUC: 0.8998557074271323
10. AUC: 0.8998058609019598
11. AUC: 0.8997953668966603
12. AUC: 0.8997560143767872
13. AUC: 0.8997035443502899
14. AUC: 0.8996930503449905
15. AUC: 0.8996878033423407
16. AUC: 0.8996510743237925
17. AUC: 0.8996064748012698
18. AUC: 0.8996064748012698
19. AUC: 0.8995933572946454
20. AUC: 0.8995382637668232

5.6.5 gene_name

gene_names = TUG.gene_name(data_path='../test_TransProPy/data/all_count_exp.csv')

# Print the first 20 gene names
print("First 20 Gene Names:")
for i, gene_name in enumerate(gene_names[:20], 1):
    print(f"{i}. {gene_name}")
First 20 Gene Names:
1. 5_8S_rRNA
2. 5S_rRNA
3. 7SK
4. A1BG
5. A1BG-AS1
6. A1CF
7. A2M
8. A2M-AS1
9. A2ML1
10. A2ML1-AS1
11. A2ML1-AS2
12. A2MP1
13. A3GALT2
14. A4GALT
15. A4GNT
16. AA06
17. AAAS
18. AACS
19. AACSP1
20. AADAC

5.6.6 gene_map_feature

5.6.6.1 high_auc_features_result (AUC>0.9)

# Extract feature indices from high_auc_features
high_ranked_features = [feature for feature, auc_value in high_auc_features]

# Utilize the TUGM.gene_map_feature function
gene_to_feature_mapping_0_9 = TUGM.gene_map_feature(gene_names, high_ranked_features)

# Creating a dictionary to store gene, feature, and AUC mapping
gene_feature_auc_mapping = {}

# Iterate over each gene and its corresponding feature
for gene, feature in gene_to_feature_mapping_0_9.items():
    feature = str(feature)  # Adjust this based on your data format
    # Find the corresponding AUC value for the feature
    auc_value = next((auc for feat, auc in high_auc_features if str(feat) == feature), None)
    # Store the gene, feature, and AUC in the mapping
    gene_feature_auc_mapping[gene] = (feature, auc_value)

# Print the first 20 gene to feature mappings along with AUC values
print("\nFirst 20 Gene to Feature Mappings with AUC Values:")
for i, (gene, (feature, auc)) in enumerate(list(gene_feature_auc_mapping.items())[:20], 1):
    print(f"{i}. Gene: {gene}, Feature: {feature}, AUC: {auc}")

First 20 Gene to Feature Mappings with AUC Values:
1. Gene: ABHD15, Feature: 129, AUC: 1.0
2. Gene: AF064858.11, Feature: 4701, AUC: 1.0
3. Gene: AFF4, Feature: 4788, AUC: 1.0
4. Gene: AY269186.1, Feature: 7317, AUC: 1.0
5. Gene: CTD-2530H12.5, Feature: 12536, AUC: 1.0
6. Gene: CTD-2650P22.2, Feature: 12784, AUC: 1.0
7. Gene: FAM197Y8, Feature: 15361, AUC: 1.0
8. Gene: GS1-25M2.1, Feature: 17372, AUC: 0.9999973764986752
9. Gene: AC107016.1, Feature: 3731, AUC: 0.9999947529973503
10. Gene: FLJ42102, Feature: 16018, AUC: 0.9999921294960255
11. Gene: FUT9, Feature: 16355, AUC: 0.9999895059947005
12. Gene: CICP26, Feature: 9932, AUC: 0.9999868824933756
13. Gene: AL133475.1, Feature: 5222, AUC: 0.9999842589920508
14. Gene: AC008703.2, Feature: 1242, AUC: 0.9999763884880762
15. Gene: AP001464.4, Feature: 6418, AUC: 0.9999763884880762
16. Gene: AC016561.1, Feature: 1915, AUC: 0.9999658944827767
17. Gene: AC010148.1, Feature: 1491, AUC: 0.9999370359682032
18. Gene: AC005822.1, Feature: 724, AUC: 0.9999239184615788
19. Gene: CICP23, Feature: 9930, AUC: 0.9999081774536296
20. Gene: DUSP12P1, Feature: 14212, AUC: 0.9999081774536296

5.6.6.2 NewMACFCmain_result (0.9>AUC>0.5)

import numpy as np
# Generating gene_to_feature_mapping
gene_to_feature_mapping = {}
for gene, feature in zip(gene_names, FName):
    # Find the index of the feature in FName
    index = np.where(FName == feature)[0][0]
    # Find the corresponding AUC value using the index
    auc_value = Fauc[index]
    # Store the gene name, feature name, and AUC value in the mapping
    gene_to_feature_mapping[gene] = (feature, auc_value)

# Print the first 20 mappings
print("\nFirst 20 Gene to Feature Mappings with AUC Values:")
for i, (gene, (feature, auc)) in enumerate(list(gene_to_feature_mapping.items())[:20], 1):
    print(f"{i}. Gene: {gene}, Feature: {feature}, AUC: {auc}")

First 20 Gene to Feature Mappings with AUC Values:
1. Gene: 5_8S_rRNA, Feature: 15431, AUC: 0.8999763884880762
2. Gene: 5S_rRNA, Feature: 12954, AUC: 0.8999685179841016
3. Gene: 7SK, Feature: 16231, AUC: 0.8999658944827768
4. Gene: A1BG, Feature: 8777, AUC: 0.8999580239788021
5. Gene: A1BG-AS1, Feature: 5595, AUC: 0.8998924364456804
6. Gene: A1CF, Feature: 3066, AUC: 0.8998740719364063
7. Gene: A2M, Feature: 15959, AUC: 0.899860954429782
8. Gene: A2M-AS1, Feature: 3436, AUC: 0.8998583309284571
9. Gene: A2ML1, Feature: 12999, AUC: 0.8998557074271323
10. Gene: A2ML1-AS1, Feature: 17823, AUC: 0.8998058609019598
11. Gene: A2ML1-AS2, Feature: 17634, AUC: 0.8997953668966603
12. Gene: A2MP1, Feature: 14506, AUC: 0.8997560143767872
13. Gene: A3GALT2, Feature: 2797, AUC: 0.8997035443502899
14. Gene: A4GALT, Feature: 5249, AUC: 0.8996930503449905
15. Gene: A4GNT, Feature: 17163, AUC: 0.8996878033423407
16. Gene: AA06, Feature: 10844, AUC: 0.8996510743237925
17. Gene: AAAS, Feature: 16206, AUC: 0.8996064748012698
18. Gene: AACS, Feature: 5614, AUC: 0.8996064748012698
19. Gene: AACSP1, Feature: 5193, AUC: 0.8995933572946454
20. Gene: AADAC, Feature: 4601, AUC: 0.8995382637668232

5.6.7 Save Data

5.6.7.1 high_auc_features_result (AUC>0.9)

import pandas as pd
# Convert gene_feature_auc_mapping to a DataFrame
gene_feature_auc_df = pd.DataFrame.from_dict(gene_feature_auc_mapping, orient='index', columns=['Feature', 'AUC'])

# Reset the index to make gene names a separate column
gene_feature_auc_df.reset_index(inplace=True)
gene_feature_auc_df.rename(columns={'index': 'Gene'}, inplace=True)

# Save the DataFrame to a CSV file
gene_feature_auc_df.to_csv('../test_TransProPy/result/all_count_exp_gene_feature_auc_mapping_0.9.csv', index=False)

5.6.7.2 NewMACFCmain_result (0.9>AUC>0.5)

import pandas as pd
# Convert gene_to_feature_mapping to a DataFrame
gene_to_feature_df = pd.DataFrame.from_dict(gene_to_feature_mapping, orient='index', columns=['Feature', 'AUC'])

# Reset the index to make gene names a separate column
gene_to_feature_df.reset_index(inplace=True)
gene_to_feature_df.rename(columns={'index': 'Gene'}, inplace=True)

# Save the DataFrame to a CSV file
gene_to_feature_df.to_csv('../test_TransProPy/result/all_count_exp_gene_feature_auc_mapping_0.5_0.9.csv', index=False)

5.7 References