import TransProPy.MACFCmain as Tr
import TransProPy.UtilsFunction1.GeneNames as TUG
import TransProPy.UtilsFunction1.GeneToFeatureMapping as TUGM
4 MACFCmain.py
Applying the MACFC selection for relevant feature genes in classification.
4.1 Parameters
- max_rank: int
  - The total number of gene combinations you want to obtain.
- lable_name: string
  - The name of the categorical variable used as the class label, for example: gender, age, altitude, temperature, or quality.
- data_path: string
  - For example: ‘../data/gene_tpm.csv’
  - Please note: Preprocess the input data in advance to remove samples that contain too many missing values or zeros.
  - The input data matrix should have genes as rows and samples as columns.
- label_path: string
  - For example: ‘../data/tumor_class.csv’
  - Please note: The input sample categories must be in a numerical binary format, such as: 1,2,1,1,2,2,1.
  - In this case, the numerical values represent the following classifications: 1: male; 2: female.
- threshold: float
  - For example: 0.9
  - The threshold is the proportion of samples with non-zero values among all samples, evaluated for each feature.
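To make the threshold parameter concrete, here is a minimal sketch of the kind of filter it describes. It is not the TransProPy implementation: the helper name filter_by_nonzero_fraction, the dictionary layout, and the use of >= for the comparison are assumptions made for illustration. The three rows are copied from the first nine samples printed in section 4.5.2.

```python
def filter_by_nonzero_fraction(matrix, threshold):
    """Keep features whose share of non-zero samples reaches the threshold.

    matrix: dict mapping a gene name to its expression values across samples
            (hypothetical layout, chosen to keep the sketch dependency-free).
    threshold: float between 0 and 1.
    """
    kept = {}
    for gene, values in matrix.items():
        if not values:
            continue  # nothing to evaluate for an empty row
        nonzero_fraction = sum(1 for v in values if v != 0) / len(values)
        # Assumption: a feature is kept when the fraction is >= threshold.
        if nonzero_fraction >= threshold:
            kept[gene] = values
    return kept


# Rows taken from the data preview in section 4.5.2 (nine samples each).
expression = {
    "A2M": [16.808499, 16.506184, 17.143433, 17.760739, 14.766839,
            16.263691, 16.035207, 18.355114, 16.959379],
    "AADAC": [4.0, 2.584963, 1.584963, 0.0, 0.0, 0.0, 3.321928, 1.0, 4.584963],
    "AADACL2": [1.0, 1.0, 0.0, 0.0, 1.0, 0.0, 0.0, 0.0, 1.0],
}

kept = filter_by_nonzero_fraction(expression, threshold=0.9)
print(list(kept))
```

With a threshold of 0.9, only rows that are non-zero in at least nine out of ten samples survive, so lowering the value keeps more sparse genes.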
4.2 Returns
- fr: list of strings
  - The ranked features.
- fre1: dictionary
  - Feature names as keys and their frequencies as values.
- frequency: list of tuples
  - Feature names and their frequencies.
  - The list is sorted by frequency in descending order and contains only the entries of fre1 (the counted frequencies of the selected features) whose frequency is greater than 1.
- len(FName): integer
  - The number of features with an AUC greater than 0.5.
- FName: array of strings
  - Feature names after ranking with AUC > 0.5.
- Fauc: array of floats
  - AUC values corresponding to the ranked feature names.
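The relationship between fre1 and frequency can be sketched as follows. This is an illustration under stated assumptions, not the library code: the helper name count_frequencies and the toy selection list are hypothetical, and only the documented behaviour (count, keep counts above 1, sort descending) is reproduced.

```python
from collections import Counter


def count_frequencies(selected_features):
    """Return (fre1, frequency) for a list of selected feature names.

    fre1: dict of feature name -> number of occurrences.
    frequency: list of (feature, count) tuples with count > 1,
               sorted by count in descending order.
    """
    fre1 = dict(Counter(selected_features))
    repeated = [(name, count) for name, count in fre1.items() if count > 1]
    frequency = sorted(repeated, key=lambda item: item[1], reverse=True)
    return fre1, frequency


# Hypothetical selection history in which two features recur.
fre1, frequency = count_frequencies(["355", "68", "355", "867", "68", "355"])
print(fre1)
print(frequency)

# When every feature is selected exactly once, frequency is empty.
_, no_repeats = count_frequencies(["355", "68", "867"])
print(no_repeats)
```

The last case is worth noting: if all counts in fre1 equal 1, the frequency list has no entries, which is consistent with the empty "frequency greater than 1" output in the result sections.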
4.3 Function Principle Explanation
- Feature Frequency and AUC: In this function, features that appear with high frequency indicate their presence in multiple optimal feature sets. Each optimal feature set is determined by calculating its Area Under the Receiver Operating Characteristic (ROC) Curve (AUC), which is a common measure for evaluating classifier performance. During each iteration of the loop, an optimal feature set with the highest average AUC value is selected. Features from this set are then added to a rank list, known as ‘ranklist,’ and when necessary, also to a set named ‘rankset’.
- High-Frequency Features and Performance: Because features in each set are chosen based on their contribution to classifier performance, high-frequency features are likely to perform well. In other words, if a feature appears in multiple optimal feature sets, it may have a significant impact on the performance of the classifier.
- Note on Low-Frequency Features: However, it’s important to note that a low frequency of a feature does not necessarily mean it is unimportant. The importance of a feature may depend on how it combines with other features. Additionally, the outcome of feature selection may be influenced by the characteristics of the dataset and random factors. Therefore, the frequency provided by this function should only be used as a reference and is not an absolute indicator of feature performance.
- Further Evaluation Methods: If you wish to explore feature performance more deeply, you may need to employ other methods for assessing feature importance. This could include model-based importance metrics or statistical tests to evaluate the relationship between features and the target variable.
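As a reminder of what an AUC value measures for a single feature, the sketch below uses the pairwise definition: the probability that a randomly chosen sample from the positive class has a higher value than a randomly chosen sample from the other class, with ties counted as one half. How MACFCmain computes AUC internally is not described here, so the function name pairwise_auc and the toy values are assumptions for illustration only.

```python
def pairwise_auc(values, labels, positive):
    """AUC of one feature as a score for the class `positive`.

    values: feature values, one per sample.
    labels: class labels, one per sample (for example 1 and 2).
    positive: the label treated as the positive class.
    """
    pos = [v for v, y in zip(values, labels) if y == positive]
    neg = [v for v, y in zip(values, labels) if y != positive]
    if not pos or not neg:
        raise ValueError("both classes must be present")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5  # ties count as half a win
    return wins / (len(pos) * len(neg))


labels = [1, 1, 2, 2]
# Perfectly separated feature: every class-2 value exceeds every class-1 value.
print(pairwise_auc([1.0, 2.0, 3.0, 4.0], labels, positive=2))
# Partially overlapping feature.
print(pairwise_auc([1.0, 3.0, 2.0, 4.0], labels, positive=2))
```

An AUC of 0.5 corresponds to a feature that separates the classes no better than chance, which is why 0.5 is the cut-off used for FName.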
4.4 Usage Workflow
- FName is a list of feature names sorted by their AUC (Area Under the Curve) values. In this sorting method, the primary consideration is the AUC value, followed by the feature name. All features included in FName have an AUC value greater than 0.5.
- fr is the result of another sorting method. In this method, the primary consideration is the “combined” AUC of the features, followed by their individual AUC values. This means that some features, despite having lower individual AUC values, may produce a higher combined AUC when paired with other features. Therefore, their position in the fr list may be higher than in the FName list.
- The code for fr employs a more complex logic to select and combine features to optimize their combined AUC values. In this process, features are not solely selected and sorted based on their individual AUC values; the effect of their combination with other features is also considered. Consequently, the sorting logic for fr (or rankset) differs from that of FName.
- Please note: While the code takes into account both individual AUC values and combined AUC values, the sorting of the fr list (i.e., rankset) initially starts based on individual AUC values. This is because at the beginning of each external loop iteration, the first element of fs is the next feature sorted by its individual AUC value. The list is then further optimized by evaluating the combination effects with other features.
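The ordering described for FName can be sketched as a filter followed by a two-key sort. This is an illustration, not the library code: the helper name rank_by_auc is hypothetical, and the choice of descending AUC with ascending feature name as the tie-break is an assumption, since the text only says that AUC comes first and the name second.

```python
def rank_by_auc(auc_by_feature):
    """Return (FName, Fauc) for features with AUC strictly above 0.5.

    auc_by_feature: dict of feature name -> individual AUC value.
    Sorting: AUC descending, then feature name ascending (assumed tie-break).
    """
    kept = [(name, auc) for name, auc in auc_by_feature.items() if auc > 0.5]
    kept.sort(key=lambda item: (-item[1], item[0]))
    fname = [name for name, _ in kept]
    fauc = [auc for _, auc in kept]
    return fname, fauc


# Hypothetical AUC values; "140" and "605" tie and are ordered by name.
fname, fauc = rank_by_auc(
    {"355": 1.0, "68": 0.5, "605": 0.64, "140": 0.64, "0": 0.3}
)
print(fname)
print(fauc)
print(len(fname))
```

The fr list is not reproduced here because it depends on the combined AUC of feature sets, which requires the classifier and loop logic inside MACFCmain.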
4.5 Usage of MACFCmain (Significant correlation)
This function uses the MACFC method to select feature genes relevant to classification and ranks them based on their corresponding weights.
- Please note: Data characteristics: the features are strongly correlated with the classification.
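4.5.1 Import the corresponding module
import TransProPy.MACFCmain as Tr
import TransProPy.UtilsFunction1.GeneNames as TUG
import TransProPy.UtilsFunction1.GeneToFeatureMapping as TUGM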
4.5.2 Data
import pandas as pd
data_path = '../test_TransProPy/data/four_methods_degs_intersection.csv'
data = pd.read_csv(data_path)
print(data.iloc[:10, :10])
Unnamed: 0 TCGA-D9-A4Z2-01A TCGA-ER-A2NH-06A TCGA-BF-A5EO-01A \
0 A2M 16.808499 16.506184 17.143433
1 A2ML1 1.584963 9.517669 7.434628
2 AADAC 4.000000 2.584963 1.584963
3 AADACL2 1.000000 1.000000 0.000000
4 ABCA12 4.523562 4.321928 3.906891
5 ABCA17P 4.584963 5.169925 3.807355
6 ABCA9 9.753217 6.906891 3.459432
7 ABCB4 9.177420 6.700440 5.000000
8 ABCB5 10.134426 4.169925 9.167418
9 ABCC11 10.092757 6.491853 5.459432
TCGA-D9-A6EA-06A TCGA-D9-A4Z3-01A TCGA-GN-A26A-06A TCGA-D3-A3BZ-06A \
0 17.760739 14.766839 16.263691 16.035207
1 2.584963 1.584963 2.584963 5.285402
2 0.000000 0.000000 0.000000 3.321928
3 0.000000 1.000000 0.000000 0.000000
4 3.459432 1.584963 3.000000 4.321928
5 8.366322 7.228819 7.076816 4.584963
6 2.584963 6.357552 6.475733 7.330917
7 9.342075 10.392317 7.383704 11.032735
8 4.906891 11.340963 3.169925 11.161762
9 6.807355 4.247928 5.459432 5.977280
TCGA-D3-A51G-06A TCGA-EE-A29R-06A
0 18.355114 16.959379
1 2.584963 3.584963
2 1.000000 4.584963
3 0.000000 1.000000
4 4.807355 3.700440
5 6.409391 7.139551
6 7.954196 9.177420
7 10.082149 10.088788
8 4.643856 12.393927
9 5.614710 8.233620
import pandas as pd
data_path = '../test_TransProPy/data/class.csv'
data = pd.read_csv(data_path)
print(data.iloc[:10, :10])
Unnamed: 0 class
0 TCGA-D9-A4Z2-01A 2
1 TCGA-ER-A2NH-06A 2
2 TCGA-BF-A5EO-01A 2
3 TCGA-D9-A6EA-06A 2
4 TCGA-D9-A4Z3-01A 2
5 TCGA-GN-A26A-06A 2
6 TCGA-D3-A3BZ-06A 2
7 TCGA-D3-A51G-06A 2
8 TCGA-EE-A29R-06A 2
9 TCGA-D3-A2JE-06A 2
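Before running the selection, it can help to confirm that the sample identifiers in the two files agree. The sketch below is a hypothetical pre-check, not part of TransProPy: the helper name check_sample_alignment and its return format are assumptions. With the files above, the first list would come from the column headers of the expression matrix (skipping the gene-name column) and the second from the first column of the label file.

```python
def check_sample_alignment(matrix_samples, label_samples):
    """Compare sample IDs from the expression matrix and the label file.

    Returns a dict listing IDs missing on either side and whether the
    two lists are in exactly the same order.
    """
    matrix_set = set(matrix_samples)
    label_set = set(label_samples)
    return {
        "missing_labels": sorted(matrix_set - label_set),
        "missing_expression": sorted(label_set - matrix_set),
        "same_order": list(matrix_samples) == list(label_samples),
    }


# Sample IDs copied from the two previews above (first three of each).
matrix_samples = ["TCGA-D9-A4Z2-01A", "TCGA-ER-A2NH-06A", "TCGA-BF-A5EO-01A"]
label_samples = ["TCGA-D9-A4Z2-01A", "TCGA-ER-A2NH-06A", "TCGA-BF-A5EO-01A"]

report = check_sample_alignment(matrix_samples, label_samples)
print(report)
```

Whether MACFCmain aligns samples by identifier or by position is not stated on this page, so treating a mismatch as a reason to stop and inspect the inputs is the conservative choice.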
4.5.3 MACFCmain
ranked_features, fre1, frequency, len_FName, FName, Fauc = Tr.MACFCmain(
    100,
    "class",
    0.95,
    data_path='../test_TransProPy/data/four_methods_degs_intersection.csv',
    label_path='../test_TransProPy/data/class.csv'
)
4.5.4 Result
# Print the first 20 Ranked Features
print("\nFirst 20 Ranked Features:")
for i, feature in enumerate(ranked_features[:20], 1):
    print(f"{i}. {feature}")
First 20 Ranked Features:
1. 355
2. 68
3. 867
4. 78
5. 97
6. 90
7. 432
8. 313
9. 497
10. 511
11. 66
12. 172
13. 544
14. 1162
15. 487
16. 317
17. 1283
18. 930
19. 1290
20. 1170
# Print the first 20 Feature Frequencies (fre1)
print("\nFirst 20 Feature Frequencies:")
for i, (feature, freq) in enumerate(list(fre1.items())[:20], 1):
    print(f"{i}. Feature: {feature}, Frequency: {freq}")
First 20 Feature Frequencies:
1. Feature: 355, Frequency: 1
2. Feature: 68, Frequency: 1
3. Feature: 867, Frequency: 1
4. Feature: 78, Frequency: 1
5. Feature: 97, Frequency: 1
6. Feature: 90, Frequency: 1
7. Feature: 432, Frequency: 1
8. Feature: 313, Frequency: 1
9. Feature: 497, Frequency: 1
10. Feature: 511, Frequency: 1
11. Feature: 66, Frequency: 1
12. Feature: 172, Frequency: 1
13. Feature: 544, Frequency: 1
14. Feature: 1162, Frequency: 1
15. Feature: 487, Frequency: 1
16. Feature: 317, Frequency: 1
17. Feature: 1283, Frequency: 1
18. Feature: 930, Frequency: 1
19. Feature: 1290, Frequency: 1
20. Feature: 1170, Frequency: 1
# Print the Features with a frequency greater than 1
print("\nFeatures with a frequency greater than 1 :")
for i, (feature, freq) in enumerate(frequency[:20], 1):
    print(f"{i}. Feature: {feature}, Frequency: {freq}")
Features with a frequency greater than 1 :
# Print the length of FName (len_FName)
print("\nCount of Features with AUC > 0.5 (len_FName):")
print(len_FName)
Count of Features with AUC > 0.5 (len_FName):
1
# Print up to the first 20 Features with AUC > 0.5 (FName)
print("\nFirst few Features with AUC > 0.5:")
for i, feature in enumerate(FName[:20], 1):
    print(f"{i}. {feature}")
First few Features with AUC > 0.5:
1. 355
# Print up to the first 20 AUC Values for Ranked Features (Fauc)
print("\nFirst few AUC Values for Ranked Features:")
for i, auc in enumerate(Fauc[:20], 1):
    print(f"{i}. AUC: {auc}")
First few AUC Values for Ranked Features:
1. AUC: 1.0
4.5.5 gene_name
gene_names = TUG.gene_name(data_path='../test_TransProPy/data/four_methods_degs_intersection.csv')
# Print the first 20 gene names
print("First 20 Gene Names:")
for i, gene_name in enumerate(gene_names[:20], 1):
    print(f"{i}. {gene_name}")
First 20 Gene Names:
1. A2M
2. A2ML1
3. AADAC
4. AADACL2
5. ABCA12
6. ABCA17P
7. ABCA9
8. ABCB4
9. ABCB5
10. ABCC11
11. ABCC3
12. ABCD1
13. ABI3BP
14. AC002116.8
15. AC002398.9
16. AC004057.1
17. AC004231.2
18. AC004540.5
19. AC004623.3
20. AC004951.5
4.5.6 gene_map_feature
gene_to_feature_mapping = TUGM.gene_map_feature(gene_names, ranked_features)
4.5.6.1 AUC>0.5
import numpy as np
# Generating gene_to_feature_mapping
# Note: zip pairs the i-th gene name with the i-th entry of FName (positional pairing)
gene_to_feature_mapping = {}
for gene, feature in zip(gene_names, FName):
    # Find the index of the feature in FName
    index = np.where(FName == feature)[0][0]
    # Find the corresponding AUC value using the index
    auc_value = Fauc[index]
    # Store the gene name, feature name, and AUC value in the mapping
    gene_to_feature_mapping[gene] = (feature, auc_value)
# Print the first 20 mappings
print("\nFirst 20 Gene to Feature Mappings with AUC Values:")
for i, (gene, (feature, auc)) in enumerate(list(gene_to_feature_mapping.items())[:20], 1):
    print(f"{i}. Gene: {gene}, Feature: {feature}, AUC: {auc}")
First 20 Gene to Feature Mappings with AUC Values:
1. Gene: A2M, Feature: 355, AUC: 1.0
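The loop above pairs gene names and features by position: the first gene name goes with the first entry of FName, and so on. If the feature identifiers returned by MACFCmain are zero-based row positions in the input matrix, the gene name for each feature could instead be looked up by that position. This is an assumption that this page does not confirm, so check the TransProPy source before relying on it. The helper name names_for_features is hypothetical.

```python
def names_for_features(gene_names, features):
    """Look up a gene name for each feature identifier.

    Assumption (not confirmed by the documentation): each identifier is
    the zero-based row position of the gene in the input matrix.
    Identifiers outside the range of gene_names are skipped.
    """
    lookup = {}
    for feature in features:
        position = int(feature)
        if 0 <= position < len(gene_names):
            lookup[feature] = gene_names[position]
    return lookup


# First ten gene names from section 4.5.5; the identifiers are illustrative.
gene_names_preview = ["A2M", "A2ML1", "AADAC", "AADACL2", "ABCA12",
                      "ABCA17P", "ABCA9", "ABCB4", "ABCB5", "ABCC11"]
lookup = names_for_features(gene_names_preview, ["0", "2", "9", "355"])
print(lookup)
```

In this toy call the identifier 355 is skipped only because the preview list holds ten names; with the full list of gene names it would resolve to whichever gene sits at that row, if the assumption holds.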
4.6 Usage of MACFCmain (Insignificant correlation)
This function uses the MACFC method to select feature genes relevant to classification and ranks them based on their corresponding weights.
- Please note: Data characteristics: the features are weakly correlated with the classification.
- Randomly shuffling the class labels to a certain extent simulates reducing the correlation.
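One way to produce such a label file is to shuffle a chosen fraction of the labels among themselves. The sketch below is a hypothetical illustration: how random_classification_class.csv was actually generated is not stated, and the helper name shuffle_labels, the fraction argument, and the fixed seed are all assumptions.

```python
import random


def shuffle_labels(labels, fraction, seed=0):
    """Return a copy of labels with a fraction of positions shuffled.

    labels: list of class labels (for example 1 and 2).
    fraction: share of positions (0 to 1) whose labels are permuted.
    seed: seed for a private random generator, for reproducibility.
    """
    rng = random.Random(seed)
    n = len(labels)
    k = int(round(fraction * n))
    # Pick k positions, then permute the labels found at those positions.
    positions = rng.sample(range(n), k)
    picked = [labels[i] for i in positions]
    rng.shuffle(picked)
    shuffled = list(labels)
    for position, value in zip(positions, picked):
        shuffled[position] = value
    return shuffled


original = [2, 2, 2, 2, 2, 1, 1, 1, 1, 1]
weakened = shuffle_labels(original, fraction=0.5, seed=42)
print(original)
print(weakened)
```

Because labels are only moved between positions, the class balance stays the same; a larger fraction weakens the link between expression values and labels further.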
4.6.1 Data
import pandas as pd
data_path = '../test_TransProPy/data/four_methods_degs_intersection.csv'
data = pd.read_csv(data_path)
print(data.iloc[:10, :10])
Unnamed: 0 TCGA-D9-A4Z2-01A TCGA-ER-A2NH-06A TCGA-BF-A5EO-01A \
0 A2M 16.808499 16.506184 17.143433
1 A2ML1 1.584963 9.517669 7.434628
2 AADAC 4.000000 2.584963 1.584963
3 AADACL2 1.000000 1.000000 0.000000
4 ABCA12 4.523562 4.321928 3.906891
5 ABCA17P 4.584963 5.169925 3.807355
6 ABCA9 9.753217 6.906891 3.459432
7 ABCB4 9.177420 6.700440 5.000000
8 ABCB5 10.134426 4.169925 9.167418
9 ABCC11 10.092757 6.491853 5.459432
TCGA-D9-A6EA-06A TCGA-D9-A4Z3-01A TCGA-GN-A26A-06A TCGA-D3-A3BZ-06A \
0 17.760739 14.766839 16.263691 16.035207
1 2.584963 1.584963 2.584963 5.285402
2 0.000000 0.000000 0.000000 3.321928
3 0.000000 1.000000 0.000000 0.000000
4 3.459432 1.584963 3.000000 4.321928
5 8.366322 7.228819 7.076816 4.584963
6 2.584963 6.357552 6.475733 7.330917
7 9.342075 10.392317 7.383704 11.032735
8 4.906891 11.340963 3.169925 11.161762
9 6.807355 4.247928 5.459432 5.977280
TCGA-D3-A51G-06A TCGA-EE-A29R-06A
0 18.355114 16.959379
1 2.584963 3.584963
2 1.000000 4.584963
3 0.000000 1.000000
4 4.807355 3.700440
5 6.409391 7.139551
6 7.954196 9.177420
7 10.082149 10.088788
8 4.643856 12.393927
9 5.614710 8.233620
import pandas as pd
data_path = '../test_TransProPy/data/random_classification_class.csv'
data = pd.read_csv(data_path)
print(data.iloc[:10, :10])
Unnamed: 0 class
0 TCGA-D9-A4Z2-01A 2
1 TCGA-ER-A2NH-06A 2
2 TCGA-BF-A5EO-01A 2
3 TCGA-D9-A6EA-06A 2
4 TCGA-D9-A4Z3-01A 1
5 TCGA-GN-A26A-06A 1
6 TCGA-D3-A3BZ-06A 1
7 TCGA-D3-A51G-06A 1
8 TCGA-EE-A29R-06A 1
9 TCGA-D3-A2JE-06A 1
4.6.2 MACFCmain
ranked_features, fre1, frequency, len_FName, FName, Fauc = Tr.MACFCmain(
    100,
    "class",
    0.95,
    data_path='../test_TransProPy/data/four_methods_degs_intersection.csv',
    label_path='../test_TransProPy/data/random_classification_class.csv'
)
4.6.3 Result
# Print the first 20 Ranked Features
print("\nFirst 20 Ranked Features:")
for i, feature in enumerate(ranked_features[:20], 1):
    print(f"{i}. {feature}")
First 20 Ranked Features:
1. 1147
2. 605
3. 140
4. 845
5. 546
6. 1052
7. 188
8. 431
9. 182
10. 120
11. 362
12. 998
13. 1122
14. 246
15. 23
16. 383
17. 258
18. 189
19. 746
20. 1064
# Print the first 20 Feature Frequencies (fre1)
print("\nFirst 20 Feature Frequencies:")
for i, (feature, freq) in enumerate(list(fre1.items())[:20], 1):
    print(f"{i}. Feature: {feature}, Frequency: {freq}")
First 20 Feature Frequencies:
1. Feature: 1147, Frequency: 1
2. Feature: 605, Frequency: 1
3. Feature: 140, Frequency: 1
4. Feature: 845, Frequency: 1
5. Feature: 546, Frequency: 1
6. Feature: 1052, Frequency: 1
7. Feature: 188, Frequency: 1
8. Feature: 431, Frequency: 1
9. Feature: 182, Frequency: 1
10. Feature: 120, Frequency: 1
11. Feature: 362, Frequency: 1
12. Feature: 998, Frequency: 1
13. Feature: 1122, Frequency: 1
14. Feature: 246, Frequency: 1
15. Feature: 23, Frequency: 1
16. Feature: 383, Frequency: 1
17. Feature: 258, Frequency: 1
18. Feature: 189, Frequency: 1
19. Feature: 746, Frequency: 1
20. Feature: 1064, Frequency: 1
# Print the Features with a frequency greater than 1
print("\nFeatures with a frequency greater than 1 :")
for i, (feature, freq) in enumerate(frequency[:20], 1):
    print(f"{i}. Feature: {feature}, Frequency: {freq}")
Features with a frequency greater than 1 :
# Print the length of FName (len_FName)
print("\nCount of Features with AUC > 0.5 (len_FName):")
print(len_FName)
Count of Features with AUC > 0.5 (len_FName):
757
# Print up to the first 20 Features with AUC > 0.5 (FName)
print("\nFirst few Features with AUC > 0.5:")
for i, feature in enumerate(FName[:20], 1):
    print(f"{i}. {feature}")
First few Features with AUC > 0.5:
1. 1147
2. 140
3. 605
4. 518
5. 1080
6. 826
7. 541
8. 695
9. 1266
10. 0
11. 864
12. 188
13. 842
14. 344
15. 824
16. 1208
17. 1086
18. 602
19. 295
20. 1261
# Print up to the first 20 AUC Values for Ranked Features (Fauc)
print("\nFirst few AUC Values for Ranked Features:")
for i, auc in enumerate(Fauc[:20], 1):
    print(f"{i}. AUC: {auc}")
First few AUC Values for Ranked Features:
1. AUC: 0.6469530885995243
2. AUC: 0.6465975134260281
3. AUC: 0.6447509747664238
4. AUC: 0.6415704161455651
5. AUC: 0.6405110473528042
6. AUC: 0.6403761740111332
7. AUC: 0.6400941661149121
8. AUC: 0.6398293239167219
9. AUC: 0.6387871208219917
10. AUC: 0.6381814169057604
11. AUC: 0.6376174011133181
12. AUC: 0.6373378454596729
13. AUC: 0.6371956153902744
14. AUC: 0.6371931631476986
15. AUC: 0.6370018882267834
16. AUC: 0.6367713774246548
17. AUC: 0.636531057652223
18. AUC: 0.6361656735084235
19. AUC: 0.6359670418597808
20. AUC: 0.6358419774884132
4.6.4 gene_name
gene_names = TUG.gene_name(data_path='../test_TransProPy/data/four_methods_degs_intersection.csv')
# Print the first 20 gene names
print("First 20 Gene Names:")
for i, gene_name in enumerate(gene_names[:20], 1):
    print(f"{i}. {gene_name}")
First 20 Gene Names:
1. A2M
2. A2ML1
3. AADAC
4. AADACL2
5. ABCA12
6. ABCA17P
7. ABCA9
8. ABCB4
9. ABCB5
10. ABCC11
11. ABCC3
12. ABCD1
13. ABI3BP
14. AC002116.8
15. AC002398.9
16. AC004057.1
17. AC004231.2
18. AC004540.5
19. AC004623.3
20. AC004951.5
4.6.5 gene_map_feature
gene_to_feature_mapping = TUGM.gene_map_feature(gene_names, ranked_features)
4.6.5.1 AUC>0.5
import numpy as np
# Generating gene_to_feature_mapping
# Note: zip pairs the i-th gene name with the i-th entry of FName (positional pairing)
gene_to_feature_mapping = {}
for gene, feature in zip(gene_names, FName):
    # Find the index of the feature in FName
    index = np.where(FName == feature)[0][0]
    # Find the corresponding AUC value using the index
    auc_value = Fauc[index]
    # Store the gene name, feature name, and AUC value in the mapping
    gene_to_feature_mapping[gene] = (feature, auc_value)
# Print the first 20 mappings
print("\nFirst 20 Gene to Feature Mappings with AUC Values:")
for i, (gene, (feature, auc)) in enumerate(list(gene_to_feature_mapping.items())[:20], 1):
    print(f"{i}. Gene: {gene}, Feature: {feature}, AUC: {auc}")
First 20 Gene to Feature Mappings with AUC Values:
1. Gene: A2M, Feature: 1147, AUC: 0.6469530885995243
2. Gene: A2ML1, Feature: 140, AUC: 0.6465975134260281
3. Gene: AADAC, Feature: 605, AUC: 0.6447509747664238
4. Gene: AADACL2, Feature: 518, AUC: 0.6415704161455651
5. Gene: ABCA12, Feature: 1080, AUC: 0.6405110473528042
6. Gene: ABCA17P, Feature: 826, AUC: 0.6403761740111332
7. Gene: ABCA9, Feature: 541, AUC: 0.6400941661149121
8. Gene: ABCB4, Feature: 695, AUC: 0.6398293239167219
9. Gene: ABCB5, Feature: 1266, AUC: 0.6387871208219917
10. Gene: ABCC11, Feature: 0, AUC: 0.6381814169057604
11. Gene: ABCC3, Feature: 864, AUC: 0.6376174011133181
12. Gene: ABCD1, Feature: 188, AUC: 0.6373378454596729
13. Gene: ABI3BP, Feature: 842, AUC: 0.6371956153902744
14. Gene: AC002116.8, Feature: 344, AUC: 0.6371931631476986
15. Gene: AC002398.9, Feature: 824, AUC: 0.6370018882267834
16. Gene: AC004057.1, Feature: 1208, AUC: 0.6367713774246548
17. Gene: AC004231.2, Feature: 1086, AUC: 0.636531057652223
18. Gene: AC004540.5, Feature: 602, AUC: 0.6361656735084235
19. Gene: AC004623.3, Feature: 295, AUC: 0.6359670418597808
20. Gene: AC004951.5, Feature: 1261, AUC: 0.6358419774884132