TCGA-BRCA Demo
Dataset Source
Omics Data: FireHose BRCA
Clinical and PAM50 Data: TCGAbiolinks
Dataset Overview
Original Data:
Methylation: 20,107 × 885
mRNA: 18,321 × 1,212
miRNA: 503 × 1,189
PAM50: 1,087 × 1
Clinical: 1,098 × 101
Note: Omics matrices are features × samples; clinical matrices are samples × fields.
PAM50 Subtype Counts (After Patient-Level Intersection)
LumA: 419
LumB: 140
Basal: 130
Her2: 46
Normal: 34
Patients in Every Dataset
Total patients present in methylation, mRNA, miRNA, PAM50, and clinical: 769
Final Shapes (Per-Patient)
After aggregating multiple aliquots by mean, all modalities align on 769 patients:
Methylation: 769 × 20,107
mRNA: 769 × 18,321
miRNA: 769 × 503
PAM50: 769 × 1
Clinical: 769 × 119
Data Summary Table
| Stage | Clinical | Methylation | miRNA | mRNA | PAM50 (Subtype Counts) | Notes |
|---|---|---|---|---|---|---|
| Original Raw Data | 1,098 × 101 | 20,107 × 885 | 503 × 1,189 | 18,321 × 1,212 | LumA: 509, LumB: 209, Basal: 192, Her2: 82, Normal: 40 | Raw FireHose & TCGAbiolinks files |
| Patient-Level Intersection | 769 × 119 | 769 × 20,107 | 769 × 503 | 769 × 18,321 | LumA: 419, LumB: 140, Basal: 130, Her2: 46, Normal: 34 | Patients with complete data in all sets |
Let's take a look at the FireHose data directly after download.
[1]:
import pandas as pd
from pathlib import Path
root = Path("/home/vicente/Github/BioNeuralNet/TCGA_BRCA_DATA")
mirna_raw = pd.read_csv(root/"BRCA.miRseq_RPKM_log2.txt", sep="\t",index_col=0,low_memory=False)
rna_raw = pd.read_csv(root / "BRCA.uncv2.mRNAseq_RSEM_normalized_log2.txt", sep="\t",index_col=0,low_memory=False)
meth_raw = pd.read_csv(root/"BRCA.meth.by_mean.data.txt", sep='\t',index_col=0,low_memory=False)
clinical_raw = pd.read_csv(root / "BRCA.clin.merged.picked.txt",sep="\t", index_col=0, low_memory=False)
print(f"mirna shape: {mirna_raw.shape}, rna shape: {rna_raw.shape}, meth shape: {meth_raw.shape}, clinical shape: {clinical_raw.shape}")
display(mirna_raw.head())
display(rna_raw.head())
display(meth_raw.head())
display(clinical_raw.head())
mirna shape: (503, 1189), rna shape: (18321, 1212), meth shape: (20107, 885), clinical shape: (18, 1097)
| TCGA-3C-AAAU-01 | TCGA-3C-AALI-01 | TCGA-3C-AALJ-01 | TCGA-3C-AALK-01 | TCGA-4H-AAAK-01 | TCGA-5L-AAT0-01 | TCGA-5L-AAT1-01 | TCGA-5T-A9QA-01 | TCGA-A1-A0SB-01 | TCGA-A1-A0SD-01 | ... | TCGA-BH-A0WA-01 | TCGA-E2-A105-01 | TCGA-E2-A106-01 | TCGA-E2-A107-01 | TCGA-E2-A108-01 | TCGA-E2-A109-01 | TCGA-E2-A10B-01 | TCGA-E2-A10C-01 | TCGA-E2-A10E-01 | TCGA-E2-A10F-01 | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| gene | |||||||||||||||||||||
| hsa-let-7a-1 | 13.129765 | 12.918069 | 13.012033 | 13.144697 | 13.411684 | 13.316301 | 13.445230 | 13.727850 | 13.601504 | 13.598739 | ... | 12.225132 | 13.938134 | 13.609853 | 13.508290 | 13.406359 | 13.730647 | 13.198426 | 12.793350 | 14.060268 | 12.990403 |
| hsa-let-7a-2 | 14.117933 | 13.922300 | 14.010002 | 14.141721 | 14.413518 | 14.310917 | 14.448556 | 14.714551 | 14.608693 | 14.606942 | ... | 13.235065 | 14.930021 | 14.603389 | 14.525026 | 14.402735 | 14.719166 | 14.200523 | 13.796623 | 15.047592 | 14.006035 |
| hsa-let-7a-3 | 13.147714 | 12.913194 | 13.028483 | 13.151281 | 13.420481 | 13.327144 | 13.446806 | 13.736891 | 13.613105 | 13.606224 | ... | 12.261971 | 13.972011 | 13.643274 | 13.549981 | 13.438737 | 13.732070 | 13.212367 | 12.793350 | 14.074978 | 13.018659 |
| hsa-let-7b | 14.595135 | 14.512657 | 13.419612 | 14.667196 | 14.438548 | 14.576493 | 14.611137 | 15.098805 | 16.505758 | 15.638855 | ... | 14.684912 | 15.230457 | 15.357655 | 15.112011 | 15.040315 | 15.806771 | 15.645910 | 14.724106 | 16.370741 | 15.439239 |
| hsa-let-7c | 8.414890 | 9.646536 | 9.312455 | 11.511431 | 11.693927 | 11.138419 | 11.284446 | 9.197514 | 13.392164 | 11.419823 | ... | 10.565698 | 10.483745 | 11.159056 | 12.473340 | 12.405828 | 10.613712 | 11.395452 | 9.087202 | 10.885520 | 11.385638 |
5 rows × 1189 columns
| TCGA-3C-AAAU-01 | TCGA-3C-AALI-01 | TCGA-3C-AALJ-01 | TCGA-3C-AALK-01 | TCGA-4H-AAAK-01 | TCGA-5L-AAT0-01 | TCGA-5L-AAT1-01 | TCGA-5T-A9QA-01 | TCGA-A1-A0SB-01 | TCGA-A1-A0SD-01 | ... | TCGA-UL-AAZ6-01 | TCGA-UU-A93S-01 | TCGA-V7-A7HQ-01 | TCGA-W8-A86G-01 | TCGA-WT-AB41-01 | TCGA-WT-AB44-01 | TCGA-XX-A899-01 | TCGA-XX-A89A-01 | TCGA-Z7-A8R5-01 | TCGA-Z7-A8R6-01 | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| gene | |||||||||||||||||||||
| ?|100133144 | 4.032489 | 3.211931 | 3.538886 | 3.595671 | 2.775430 | 1.995991 | NaN | 0.550310 | 3.939189 | 3.250628 | ... | -1.324816 | 2.108558 | NaN | 2.475707 | NaN | NaN | 3.846574 | 4.480524 | 1.178747 | 2.783771 |
| ?|100134869 | 3.692829 | 4.119273 | 3.206237 | 3.469873 | 3.850979 | 3.766489 | 3.405298 | 3.169252 | 3.847346 | 3.501324 | ... | 3.845189 | 3.443978 | 1.622556 | 3.845099 | 2.657434 | 1.703987 | 4.422294 | 4.769476 | 2.866572 | 4.631075 |
| ?|10357 | 5.704604 | 6.124231 | 7.269570 | 7.168565 | 6.395968 | 6.836141 | 6.857961 | 6.749035 | 6.862786 | 5.913201 | ... | 7.083470 | 7.088829 | 4.906766 | 7.003547 | 5.744909 | 5.401368 | 7.106177 | 6.003213 | 6.410173 | 7.388457 |
| ?|10431 | 8.672694 | 9.139279 | 10.410275 | 9.757450 | 9.581922 | 9.657753 | 10.114256 | 10.472185 | 9.360367 | 9.933569 | ... | 10.616682 | 11.495054 | 10.749770 | 9.446410 | 10.282241 | 10.874534 | 9.350400 | 9.497295 | 10.155173 | 9.970921 |
| ?|155060 | 10.213110 | 9.011343 | 9.209506 | 9.110487 | 8.027083 | 8.110023 | 7.704865 | 6.254741 | 8.128052 | 6.387132 | ... | 8.052478 | 7.516236 | 9.280761 | 9.631306 | 8.137225 | 9.460539 | 8.738651 | 8.556414 | 7.977670 | 7.894918 |
5 rows × 1212 columns
| TCGA-3C-AAAU-01 | TCGA-3C-AALI-01 | TCGA-3C-AALJ-01 | TCGA-3C-AALK-01 | TCGA-4H-AAAK-01 | TCGA-5L-AAT0-01 | TCGA-5L-AAT1-01 | TCGA-5T-A9QA-01 | TCGA-A1-A0SB-01 | TCGA-A1-A0SE-01 | ... | TCGA-UL-AAZ6-01 | TCGA-UU-A93S-01 | TCGA-V7-A7HQ-01 | TCGA-W8-A86G-01 | TCGA-WT-AB41-01 | TCGA-WT-AB44-01 | TCGA-XX-A899-01 | TCGA-XX-A89A-01 | TCGA-Z7-A8R5-01 | TCGA-Z7-A8R6-01 | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Hybridization REF | |||||||||||||||||||||
| Composite Element REF | Beta_Value | Beta_Value | Beta_Value | Beta_Value | Beta_Value | Beta_Value | Beta_Value | Beta_Value | Beta_Value | Beta_Value | ... | Beta_Value | Beta_Value | Beta_Value | Beta_Value | Beta_Value | Beta_Value | Beta_Value | Beta_Value | Beta_Value | Beta_Value |
| A1BG | 0.483716119676 | 0.637191226131 | 0.656092398242 | 0.615194471357 | 0.612080370511 | 0.469600740678 | 0.582188239422 | 0.66617073097 | 0.659965611959 | 0.641701155202 | ... | 0.631413241724 | 0.64952294395 | 0.596585169597 | 0.615558357651 | 0.580837880262 | 0.615814023324 | 0.589897794957 | 0.572606636128 | 0.617859586161 | 0.568150149265 |
| A1CF | 0.295827203492 | 0.458972998571 | 0.489725289638 | 0.625765223243 | 0.507736509665 | 0.514770866326 | 0.549850958729 | 0.381038654448 | 0.826312156393 | 0.606699429409 | ... | 0.383469192855 | 0.183354853938 | 0.403909161312 | 0.716980255014 | 0.613131295074 | 0.665043713213 | 0.705153725375 | 0.494848686021 | 0.691835387189 | 0.224696596211 |
| A2BP1 | 0.187699869591 | 0.240515847704 | 0.279087851226 | 0.488888510474 | 0.463845494635 | 0.504450855353 | 0.480885816745 | 0.622832399216 | 0.474678831563 | 0.339829506578 | ... | 0.130529915536 | 0.319855310743 | 0.335517456053 | 0.512185396638 | 0.563519806811 | 0.507364324635 | 0.520542747167 | 0.412562068574 | 0.522169978143 | 0.33955834608 |
| A2LD1 | 0.62958551322 | 0.666272288675 | 0.755630499986 | 0.74575121287 | 0.698515739124 | 0.706812706661 | 0.759017355996 | 0.694010939885 | 0.847837522256 | 0.786662091353 | ... | 0.587475995313 | 0.667969642321 | 0.689140211036 | 0.791381283524 | 0.680499323148 | 0.660476360054 | 0.745725420412 | 0.74390049875 | 0.791229999577 | 0.637764188841 |
5 rows × 885 columns
| tcga-5l-aat0 | tcga-5l-aat1 | tcga-a1-a0sp | tcga-a2-a04v | tcga-a2-a04y | tcga-a2-a0cq | tcga-a2-a1g4 | tcga-a2-a25a | tcga-a7-a0cd | tcga-a7-a13g | ... | tcga-s3-aa11 | tcga-s3-aa14 | tcga-s3-aa15 | tcga-ul-aaz6 | tcga-uu-a93s | tcga-v7-a7hq | tcga-wt-ab44 | tcga-xx-a899 | tcga-xx-a89a | tcga-z7-a8r6 | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Hybridization REF | |||||||||||||||||||||
| Composite Element REF | value | value | value | value | value | value | value | value | value | value | ... | value | value | value | value | value | value | value | value | value | value |
| years_to_birth | 42 | 63 | 40 | 39 | 53 | 62 | 71 | 44 | 66 | 79 | ... | 67 | 47 | 51 | 73 | 63 | 75 | NaN | 46 | 68 | 46 |
| vital_status | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 |
| days_to_death | NaN | NaN | NaN | 1920 | NaN | NaN | NaN | NaN | NaN | NaN | ... | NaN | NaN | NaN | NaN | 116 | NaN | NaN | NaN | NaN | NaN |
| days_to_last_followup | 1477 | 1471 | 584 | NaN | 1099 | 2695 | 595 | 3276 | 1165 | 718 | ... | 421 | 529 | 525 | 518 | NaN | 2033 | 883 | 467 | 488 | 3256 |
5 rows × 1097 columns
TCGAbiolinks
This section demonstrates how to use the TCGAbiolinks R package to download clinical and molecular subtype data. The script first ensures TCGAbiolinks is installed and loads it. It then retrieves PAM50 molecular subtype labels with TCGAquery_subtype() and writes them to a CSV file. Finally, it downloads clinical data with GDCquery_clinic(), which already returns a patient-level data frame, and saves the result as a second CSV file.
# Install TCGAbiolinks if it is missing
if (!requireNamespace("TCGAbiolinks", quietly = TRUE)) {
  if (!requireNamespace("BiocManager", quietly = TRUE))
    install.packages("BiocManager")
  BiocManager::install("TCGAbiolinks")
}
# Load the library
library(TCGAbiolinks)
# Download PAM50 subtype labels
pam50_df <- TCGAquery_subtype(tumor = "BRCA")[ , c("patient", "BRCA_Subtype_PAM50")]
write.csv(pam50_df, file = "BRCA_PAM50_labels.csv", row.names = FALSE, quote = FALSE)
# Download clinical data; GDCquery_clinic() returns a patient-level data frame directly
clin_df <- GDCquery_clinic(project = "TCGA-BRCA", type = "clinical")
# Keep the default quoting: fields such as "Breast, NOS" contain commas
write.csv(clin_df, file = "BRCA_clinical_data.csv", row.names = FALSE)
[2]:
import pandas as pd
# from Firehose
mirna = pd.read_csv(root/"BRCA.miRseq_RPKM_log2.txt", sep="\t",index_col=0,low_memory=False)
meth = pd.read_csv(root/"BRCA.meth.by_mean.data.txt", sep='\t',index_col=0,low_memory=False)
rna = pd.read_csv(root / "BRCA.uncv2.mRNAseq_RSEM_normalized_log2.txt", sep="\t",index_col=0,low_memory=False)
clinical_firehose = pd.read_csv(root / "BRCA.clin.merged.picked.txt",sep="\t", index_col=0, low_memory=False).T
# from TCGABiolinks
pam50 = pd.read_csv(root /"BRCA_PAM50_labels.csv",index_col=0)
clinical_biolinks = pd.read_csv(root /"BRCA_clinical_data.csv",index_col=1)
print("Initial shapes")
print(f"meth: {meth.shape}")
print(f"rna: {rna.shape}")
print(f"mirna: {mirna.shape}")
print(f"pam50: {pam50.shape}")
print(f"clinical TCGABioLinks: {clinical_biolinks.shape}")
print(f"clinical FireHose: {clinical_firehose.shape}")
meth = meth.T
rna = rna.T
mirna = mirna.T
print("\nAfter transpose")
print(f"meth: {meth.shape}")
print(f"rna: {rna.shape}")
print(f"mirna: {mirna.shape}")
def trim(idx):
    return idx.to_series().str.extract(r'(^TCGA-\w\w-\w\w\w\w)')[0]
meth.index = trim(meth.index)
rna.index = trim(rna.index)
mirna.index = trim(mirna.index)
pam50.index = pam50.index.str.upper()
clinical_biolinks.index = clinical_biolinks.index.str.upper()
clinical_firehose.index = clinical_firehose.index.str.upper()
idx1 = clinical_biolinks.index
idx2 = clinical_firehose.index
# intersection and unique counts
common = idx1.intersection(idx2)
only_in_1 = idx1.difference(idx2)
only_in_2 = idx2.difference(idx1)
print(f"Patients in both clinical datasets: {len(common)}")
common = clinical_biolinks.index.intersection(clinical_firehose.index)
clinical_biolinks = clinical_biolinks.loc[common]
clinical_firehose = clinical_firehose.loc[common]
clinical = pd.concat([clinical_biolinks, clinical_firehose], axis=1)
print(f"Combined Clinical shape {clinical.shape}")
common = sorted(set(meth.index) & set(rna.index) & set(mirna.index) & set(pam50.index) & set(clinical.index))
print(f"Patients in every dataset: {len(common)}")
meth = meth.loc[common]
rna = rna.loc[common]
mirna = mirna.loc[common]
pam50 = pam50.loc[common]
clinical = clinical.loc[common]
print("\nFinal shapes:")
print(f"meth: {meth.shape}")
print(f"rna: {rna.shape}")
print(f"mirna: {mirna.shape}")
print(f"pam50: {pam50.shape}")
print(f"clinical: {clinical.shape}\n")
Initial shapes
meth: (20107, 885)
rna: (18321, 1212)
mirna: (503, 1189)
pam50: (1087, 1)
clinical TCGABioLinks: (1098, 101)
clinical FireHose: (1097, 18)
After transpose
meth: (885, 20107)
rna: (1212, 18321)
mirna: (1189, 503)
Patients in both clinical datasets: 1097
Combined Clinical shape (1097, 119)
Patients in every dataset: 769
Final shapes:
meth: (863, 20107)
rna: (865, 18321)
mirna: (855, 503)
pam50: (769, 1)
clinical: (769, 119)
Handling Multiple Aliquots per Sample
This section addresses cases where some patients have multiple aliquots in the meth, rna, and mirna datasets. It first identifies and counts patients with duplicate entries. It then coerces all data to numeric types and aggregates the duplicates by taking the mean across aliquots for each patient, leaving exactly one row per patient. After aggregation, the datasets are aligned by keeping only the patients common to all five datasets (meth, rna, mirna, pam50, and clinical). The result is a set of matched samples ready for integrated analysis.
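Before running the aggregation on the real matrices, the idea can be seen on a toy table. The following is a minimal sketch with made-up barcodes and values (the `toy` frame and its gene names are illustrative assumptions, not TCGA data). It trims barcodes to the patient level and averages rows that share a patient ID.

```python
import pandas as pd

# Toy sample-level table: two rows for one patient, one row for another.
# Barcodes and values are made up for illustration.
toy = pd.DataFrame(
    {"geneA": [1.0, 3.0, 5.0], "geneB": [2.0, 4.0, 6.0]},
    index=["TCGA-A1-A0SB-01", "TCGA-A1-A0SB-06", "TCGA-3C-AAAU-01"],
)

# Trim each barcode to the 12-character patient level (TCGA-XX-XXXX).
toy.index = toy.index.str.extract(r"^(TCGA-\w{2}-\w{4})", expand=False)

# Selecting by patient label returns every row for that patient, which is
# why a table can have more rows than there are unique patients.
rows_for_patient = toy.loc[["TCGA-A1-A0SB"]]

# Average the duplicate rows so each patient contributes exactly one row.
per_patient = toy.groupby(level=0).mean()
print(per_patient)
```

Averaging is only one possible choice; it treats all rows for a patient as equally informative.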
[3]:
for name, df in [("meth", meth), ("rna", rna), ("mirna", mirna)]:
    counts = df.index.value_counts()
    n_multiple = (counts > 1).sum()
    total_duplicates = counts[counts > 1].sum() - n_multiple
    print(f"{name}:")
    print(f"patients with >1 aliquot: {n_multiple}")
    print(f"total duplicate rows: {total_duplicates}\n")
meth = meth.apply(pd.to_numeric, errors="coerce")
rna = rna.apply(pd.to_numeric, errors="coerce")
mirna = mirna.apply(pd.to_numeric, errors="coerce")
meth = meth.groupby(level=0).mean()
rna = rna.groupby(level=0).mean()
mirna = mirna.groupby(level=0).mean()
# Now each has one row per patient
print("Post-aggregation shapes:")
print(f"meth: {meth.shape}")
print(f"rna: {rna.shape}")
print(f"mirna: {mirna.shape}")
common = sorted(set(meth.index) & set(rna.index) & set(mirna.index) & set(pam50.index) & set(clinical.index))
print(f"Patients in every dataset: {len(common)}")
meth = meth.loc[common]
rna = rna.loc[common]
mirna = mirna.loc[common]
pam50 = pam50.loc[common]
clinical = clinical.loc[common]
print("\nFinal shapes")
print(f"meth: {meth.shape}")
print(f"rna: {rna.shape}")
print(f"mirna: {mirna.shape}")
print(f"pam50: {pam50.shape}")
print(f"clinical:{clinical.shape}")
meth:
patients with >1 aliquot: 91
total duplicate rows: 94
rna:
patients with >1 aliquot: 93
total duplicate rows: 96
mirna:
patients with >1 aliquot: 84
total duplicate rows: 86
Post-aggregation shapes:
meth: (769, 20107)
rna: (769, 18321)
mirna: (769, 503)
Patients in every dataset: 769
Final shapes
meth: (769, 20107)
rna: (769, 18321)
mirna: (769, 503)
pam50: (769, 1)
clinical:(769, 119)
Review the first few rows of each file
[4]:
display(meth.head())
display(rna.head())
display(mirna.head())
display(clinical.head())
display(pam50.value_counts())
| Hybridization REF | Composite Element REF | A1BG | A1CF | A2BP1 | A2LD1 | A2M | A2ML1 | A4GALT | A4GNT | AAA1 | ... | ZWILCH | ZWINT | ZXDC | ZYG11A | ZYG11B | ZYX | ZZEF1 | ZZZ3 | psiTPTE22 | tAKR |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | |||||||||||||||||||||
| TCGA-3C-AAAU | NaN | 0.483716 | 0.295827 | 0.187700 | 0.629586 | 0.559654 | 0.835412 | 0.484800 | 0.690217 | 0.807805 | ... | 0.112978 | 0.053939 | 0.287665 | 0.328087 | 0.502935 | 0.220683 | 0.482044 | 0.107396 | 0.247304 | 0.506404 |
| TCGA-3C-AALI | NaN | 0.637191 | 0.458973 | 0.240516 | 0.666272 | 0.607505 | 0.842391 | 0.550047 | 0.749890 | 0.395290 | ... | 0.111834 | 0.046160 | 0.265322 | 0.405851 | 0.434024 | 0.236362 | 0.458847 | 0.119652 | 0.163022 | 0.623865 |
| TCGA-3C-AALJ | NaN | 0.656092 | 0.489725 | 0.279088 | 0.755630 | 0.662360 | 0.829020 | 0.476107 | 0.653756 | 0.795102 | ... | 0.113218 | 0.042657 | 0.272103 | 0.391326 | 0.449525 | 0.210976 | 0.482641 | 0.102385 | 0.252328 | 0.504451 |
| TCGA-3C-AALK | NaN | 0.615194 | 0.625765 | 0.488889 | 0.745751 | 0.727982 | 0.835365 | 0.556016 | 0.652005 | 0.816423 | ... | 0.145133 | 0.047022 | 0.301284 | 0.410348 | 0.446571 | 0.220185 | 0.485944 | 0.112941 | 0.471956 | 0.682468 |
| TCGA-4H-AAAK | NaN | 0.612080 | 0.507737 | 0.463845 | 0.698516 | 0.692364 | 0.802388 | 0.504870 | 0.531183 | 0.851114 | ... | 0.118928 | 0.045057 | 0.300647 | 0.379998 | 0.487929 | 0.233324 | 0.490736 | 0.115646 | 0.314877 | 0.744877 |
5 rows × 20107 columns
| gene | ?|100133144 | ?|100134869 | ?|10357 | ?|10431 | ?|155060 | ?|26823 | ?|340602 | ?|388795 | ?|390284 | ?|391343 | ... | ZWINT|11130 | ZXDA|7789 | ZXDB|158586 | ZXDC|79364 | ZYG11A|440590 | ZYG11B|79699 | ZYX|7791 | ZZEF1|23140 | ZZZ3|26009 | psiTPTE22|387590 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | |||||||||||||||||||||
| TCGA-3C-AAAU | 4.032489 | 3.692829 | 5.704604 | 8.672694 | 10.213110 | NaN | 0.785174 | -1.536587 | 2.048201 | NaN | ... | 9.864120 | 7.017830 | 9.976968 | 10.695662 | 8.013988 | 10.238851 | 11.776124 | 10.887932 | 10.205129 | 0.785174 |
| TCGA-3C-AALI | 3.211931 | 4.119273 | 6.124231 | 9.139279 | 9.011343 | 0.121015 | 7.170928 | 2.291014 | 0.706022 | 3.027968 | ... | 9.914682 | 5.902438 | 8.809329 | 10.391374 | 7.632831 | 9.237422 | 12.426428 | 10.364848 | 8.667973 | 9.855788 |
| TCGA-3C-AALJ | 3.538886 | 3.206237 | 7.269570 | 10.410275 | 9.209506 | NaN | NaN | 1.443554 | 1.443554 | NaN | ... | 11.305650 | 5.143969 | 9.060691 | 9.586488 | 8.374267 | 9.055784 | 12.414355 | 9.880935 | 8.992994 | 5.143969 |
| TCGA-3C-AALK | 3.595671 | 3.469873 | 7.168565 | 9.757450 | 9.110487 | -1.273343 | NaN | 1.048724 | 2.186215 | NaN | ... | 9.384994 | 5.782065 | 8.773906 | 9.754688 | 7.454703 | 9.246419 | 12.474556 | 9.609426 | 9.453001 | 6.057699 |
| TCGA-4H-AAAK | 2.775430 | 3.850979 | 6.395968 | 9.581922 | 8.027083 | -1.232769 | -1.232769 | 1.574683 | 1.574683 | NaN | ... | 9.397606 | 5.612830 | 8.728789 | 10.035881 | 3.811738 | 9.599438 | 11.980747 | 9.700292 | 9.784147 | 7.548699 |
5 rows × 18321 columns
| gene | hsa-let-7a-1 | hsa-let-7a-2 | hsa-let-7a-3 | hsa-let-7b | hsa-let-7c | hsa-let-7d | hsa-let-7e | hsa-let-7f-1 | hsa-let-7f-2 | hsa-let-7g | ... | hsa-mir-937 | hsa-mir-939 | hsa-mir-940 | hsa-mir-942 | hsa-mir-944 | hsa-mir-95 | hsa-mir-96 | hsa-mir-98 | hsa-mir-99a | hsa-mir-99b |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | |||||||||||||||||||||
| TCGA-3C-AAAU | 13.129765 | 14.117933 | 13.147714 | 14.595135 | 8.414890 | 8.665921 | 10.521777 | 3.879392 | 11.824817 | 8.597744 | ... | 0.906699 | -0.093302 | 2.672234 | 2.467414 | 1.044202 | 2.044202 | 6.906699 | 5.754696 | 7.024602 | 15.506461 |
| TCGA-3C-AALI | 12.918069 | 13.922300 | 12.913194 | 14.512657 | 9.646536 | 9.003653 | 9.131760 | 4.386952 | 12.678841 | 8.455144 | ... | 1.579597 | -0.083367 | 0.139024 | 3.032109 | -0.668331 | 0.331670 | 5.912870 | 6.427066 | 7.885299 | 13.626182 |
| TCGA-3C-AALJ | 13.012033 | 14.010002 | 13.028483 | 13.419612 | 9.312455 | 9.276943 | 11.395711 | 5.314692 | 13.530255 | 9.230563 | ... | 3.270298 | -2.189134 | 0.395828 | 1.855261 | -0.381778 | 0.717757 | 6.603657 | 6.878301 | 7.580704 | 15.013822 |
| TCGA-3C-AALK | 13.144697 | 14.141721 | 13.151281 | 14.667196 | 11.511431 | 8.384763 | 10.368981 | 4.159182 | 12.652559 | 8.471503 | ... | 0.923965 | -0.660997 | -0.076034 | 1.798435 | 1.798435 | 0.798435 | 6.181354 | 5.377922 | 10.031619 | 14.554783 |
| TCGA-4H-AAAK | 13.411684 | 14.413518 | 13.420481 | 14.438548 | 11.693927 | 8.453747 | 10.741371 | 4.494537 | 13.009499 | 8.381220 | ... | 0.182950 | -0.624403 | -1.624403 | 1.076036 | 0.182950 | -0.302475 | 4.318110 | 5.103516 | 10.078201 | 14.650338 |
5 rows × 503 columns
| project | synchronous_malignancy | ajcc_pathologic_stage | days_to_diagnosis | laterality | created_datetime | last_known_disease_status | tissue_or_organ_of_origin | days_to_last_follow_up | age_at_diagnosis | ... | pathology_N_stage | pathology_M_stage | gender | date_of_initial_pathologic_diagnosis | days_to_last_known_alive | radiation_therapy | histological_type | number_of_lymph_nodes | race | ethnicity | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| TCGA-3C-AAAU | TCGA-BRCA | No | Stage X | 0.0 | Left | NaN | NaN | Breast, NOS | NaN | 20211.0 | ... | nx | mx | female | 2004 | NaN | no | infiltrating lobular carcinoma | 4 | white | not hispanic or latino |
| TCGA-3C-AALI | TCGA-BRCA | No | Stage IIB | 0.0 | Right | NaN | NaN | Breast, NOS | NaN | 18538.0 | ... | n1a | m0 | female | 2003 | NaN | yes | infiltrating ductal carcinoma | 1 | black or african american | not hispanic or latino |
| TCGA-3C-AALJ | TCGA-BRCA | No | Stage IIB | 0.0 | Right | NaN | NaN | Breast, NOS | NaN | 22848.0 | ... | n1a | m0 | female | 2011 | NaN | no | infiltrating ductal carcinoma | 1 | black or african american | not hispanic or latino |
| TCGA-3C-AALK | TCGA-BRCA | No | Stage IA | 0.0 | Right | NaN | NaN | Breast, NOS | NaN | 19074.0 | ... | n0 (i+) | m0 | female | 2011 | NaN | no | infiltrating ductal carcinoma | 0 | black or african american | not hispanic or latino |
| TCGA-4H-AAAK | TCGA-BRCA | No | Stage IIIA | 0.0 | Left | NaN | NaN | Breast, NOS | NaN | 18371.0 | ... | n2a | m0 | female | 2013 | NaN | no | infiltrating lobular carcinoma | 4 | white | not hispanic or latino |
5 rows × 119 columns
BRCA_Subtype_PAM50
LumA 419
LumB 140
Basal 130
Her2 46
Normal 34
Name: count, dtype: int64
Preprocessing
After reviewing the data above, we applied the following steps before further analysis.
Methylation ($\beta$-value → M-value):
Clip $\beta$-values to $[\epsilon, 1 - \epsilon]$ and apply the logit transform: $M = \log_2\left(\frac{\beta}{1 - \beta}\right)$.
Drop the original Composite Element REF column.
mRNA & miRNA:
Already on a $\log_2$ scale (RSEM-normalized and RPKM, respectively), so no further transform is applied.
Quality Control:
Count samples whose rows are all zero or all NaN in each modality.
Replace NaNs in the omics tables with 0, then recount NaNs in every table (clinical NaNs are left untouched).
Column Name Cleaning:
Replace all - and | characters with _.
Replace ? with unknown.
Label Encoding:
Map PAM50 subtypes to integers: Normal=0, Basal=1, Her2=2, LumA=3, LumB=4.
Alignment & Aggregation:
Trim barcodes to patient level.
Aggregate duplicate aliquots by mean per patient.
Drop the project column from clinical.
Subset all tables to the common patient set (no missing or all-zero samples).
Final Output Shapes:
Methylation M-value: 769 × 20,106
mRNA ($\log_2$): 769 × 18,321
miRNA ($\log_2$): 769 × 503
PAM50 labels: 769 × 1
Clinical covariates: 769 × 118
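To make the methylation step concrete, here is a small worked sketch of the clipped logit transform on a few hand-picked values. The helper name `beta_to_m_values` and the input list are illustrative assumptions. The default of 1e-6 for the clipping constant mirrors the `eps` default of the `beta_to_m` function in the notebook code.

```python
import numpy as np

def beta_to_m_values(beta, eps=1e-6):
    """Clip beta-values to [eps, 1 - eps] and return log2(beta / (1 - beta))."""
    b = np.clip(np.asarray(beta, dtype=float), eps, 1.0 - eps)
    return np.log2(b / (1.0 - b))

betas = [0.0, 0.2, 0.5, 0.8, 1.0]
m_values = beta_to_m_values(betas)
# 0.5 maps to 0, while 0.2 and 0.8 map to -2 and +2. The clipped endpoints
# map to large but finite values of roughly -19.93 and +19.93.
print(np.round(m_values, 2))
```

Without the clipping step, values of exactly 0 or 1 would produce infinite M-values, so the choice of the clipping constant bounds the range of the transformed data.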
[ ]:
import numpy as np
import pandas as pd
def beta_to_m(df, eps=1e-6):
    B = np.clip(df.values, eps, 1.0 - eps)
    M = np.log2(B / (1 - B))
    return pd.DataFrame(M, index=df.index, columns=df.columns)
# find rows that are all 0s
zeros_meth = (meth == 0).all(axis=1).sum()
zeros_rna = (rna == 0).all(axis=1).sum()
zeros_mirna = (mirna == 0).all(axis=1).sum()
print(f"All zeros: meth: {zeros_meth}, rna: {zeros_rna}, mirna: {zeros_mirna}")
# find rows with all nans
nan_meth = meth.isna().all(axis=1).sum()
nan_rna = rna.isna().all(axis=1).sum()
nan_mirna = mirna.isna().all(axis=1).sum()
nan_clinical = clinical.isna().all(axis=1).sum()
nan_pam50 = pam50.isna().all(axis=1).sum()
print(f"nan_meth: {nan_meth}, nan_rna: {nan_rna}, nan_mirna: {nan_mirna}, nan_clinical: {nan_clinical}, nan_pam50: {nan_pam50}")
# map PAM50 subtypes to integers
mapping = {"Normal":0, "Basal":1, "Her2":2, "LumA":3, "LumB":4}
pam50 = pam50["BRCA_Subtype_PAM50"].map(mapping).to_frame(name="pam50")
# drop and transform methylation
meth_clean = meth.drop(columns=["Composite Element REF"], errors="ignore")
meth_m = beta_to_m(meth_clean)
clinical = clinical.drop(columns=["project"], errors="ignore")
# clean column names and fill nans
for df in [meth_m, rna, mirna]:
    # "?" -> "unknown"; "-" and "|" -> "_"; then collapse repeated underscores
    df.columns = df.columns.str.replace("?", "unknown", regex=False)
    df.columns = df.columns.str.replace(r"[-|]", "_", regex=True)
    df.columns = df.columns.str.replace(r"_+", "_", regex=True)
    df.fillna(0, inplace=True)
# check for nans after filling
print("NaN counts after filling:")
print(meth_m.isna().sum().sum(),rna.isna().sum().sum(),mirna.isna().sum().sum(),clinical.isna().sum().sum(),pam50.isna().sum().sum())
# align index to PAM50
X_meth = meth_m.loc[pam50.index]
X_rna = rna.loc[pam50.index]
X_mirna = mirna.loc[pam50.index]
clinical= clinical.loc[pam50.index]
print(f"new shapes: meth: {X_meth.shape}, rna: {X_rna.shape}, mirna: {X_mirna.shape}, pam50: {pam50.shape}, clinical: {clinical.shape}")
display(X_meth.head())
display(X_rna.head())
display(X_mirna.head())
display(clinical.head())
display(pam50.value_counts())
All zeros: meth: 0, rna: 0, mirna: 0
nan_meth: 0, nan_rna: 0, nan_mirna: 0, nan_clinical: 0, nan_pam50: 0
NaN counts after filling:
0 0 0 46476 0
new shapes: meth: (769, 20106), rna: (769, 18321), mirna: (769, 503), pam50: (769, 1), clinical: (769, 118)
| Hybridization REF | A1BG | A1CF | A2BP1 | A2LD1 | A2M | A2ML1 | A4GALT | A4GNT | AAA1 | AAAS | ... | ZWILCH | ZWINT | ZXDC | ZYG11A | ZYG11B | ZYX | ZZEF1 | ZZZ3 | psiTPTE22 | tAKR |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| patient | |||||||||||||||||||||
| TCGA-3C-AAAU | -0.094004 | -1.251175 | -2.113585 | 0.765262 | 0.345896 | 2.343631 | -0.087741 | 1.155791 | 2.071436 | -2.650851 | ... | -2.972923 | -4.132523 | -1.308165 | -1.034199 | 0.016935 | -1.820233 | -0.103662 | -3.055084 | -1.605783 | 0.036955 |
| TCGA-3C-AALI | 0.812517 | -0.237291 | -1.658888 | 0.997440 | 0.630221 | 2.418135 | 0.289780 | 1.584114 | -0.613329 | -4.072465 | ... | -2.989465 | -4.369032 | -1.469365 | -0.549876 | -0.382967 | -1.691887 | -0.238022 | -2.879231 | -2.360128 | 0.729981 |
| TCGA-3C-AALJ | 0.931878 | -0.059301 | -1.369104 | 1.628617 | 0.972130 | 2.277584 | -0.137988 | 0.916964 | 1.956230 | -3.781647 | ... | -2.969472 | -4.488190 | -1.419578 | -0.637297 | -0.292273 | -1.902991 | -0.100215 | -3.132087 | -1.567104 | 0.025686 |
| TCGA-3C-AALK | 0.676913 | 0.741678 | -0.064133 | 1.552454 | 1.420200 | 2.343133 | 0.324621 | 0.905816 | 2.152928 | -3.894574 | ... | -2.558319 | -4.341028 | -1.213585 | -0.523013 | -0.309506 | -1.824419 | -0.081137 | -2.973455 | -0.162004 | 1.103860 |
| TCGA-4H-AAAK | 0.657963 | 0.044649 | -0.209004 | 1.212210 | 1.170304 | 2.021628 | 0.028103 | 0.180184 | 2.515149 | -3.885526 | ... | -2.889175 | -4.405580 | -1.217950 | -0.706284 | -0.069670 | -1.716283 | -0.053464 | -2.934908 | -1.121575 | 1.545812 |
5 rows × 20106 columns
| gene | unknown_100133144 | unknown_100134869 | unknown_10357 | unknown_10431 | unknown_155060 | unknown_26823 | unknown_340602 | unknown_388795 | unknown_390284 | unknown_391343 | ... | ZWINT_11130 | ZXDA_7789 | ZXDB_158586 | ZXDC_79364 | ZYG11A_440590 | ZYG11B_79699 | ZYX_7791 | ZZEF1_23140 | ZZZ3_26009 | psiTPTE22_387590 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| patient | |||||||||||||||||||||
| TCGA-3C-AAAU | 4.032489 | 3.692829 | 5.704604 | 8.672694 | 10.213110 | 0.000000 | 0.785174 | -1.536587 | 2.048201 | 0.000000 | ... | 9.864120 | 7.017830 | 9.976968 | 10.695662 | 8.013988 | 10.238851 | 11.776124 | 10.887932 | 10.205129 | 0.785174 |
| TCGA-3C-AALI | 3.211931 | 4.119273 | 6.124231 | 9.139279 | 9.011343 | 0.121015 | 7.170928 | 2.291014 | 0.706022 | 3.027968 | ... | 9.914682 | 5.902438 | 8.809329 | 10.391374 | 7.632831 | 9.237422 | 12.426428 | 10.364848 | 8.667973 | 9.855788 |
| TCGA-3C-AALJ | 3.538886 | 3.206237 | 7.269570 | 10.410275 | 9.209506 | 0.000000 | 0.000000 | 1.443554 | 1.443554 | 0.000000 | ... | 11.305650 | 5.143969 | 9.060691 | 9.586488 | 8.374267 | 9.055784 | 12.414355 | 9.880935 | 8.992994 | 5.143969 |
| TCGA-3C-AALK | 3.595671 | 3.469873 | 7.168565 | 9.757450 | 9.110487 | -1.273343 | 0.000000 | 1.048724 | 2.186215 | 0.000000 | ... | 9.384994 | 5.782065 | 8.773906 | 9.754688 | 7.454703 | 9.246419 | 12.474556 | 9.609426 | 9.453001 | 6.057699 |
| TCGA-4H-AAAK | 2.775430 | 3.850979 | 6.395968 | 9.581922 | 8.027083 | -1.232769 | -1.232769 | 1.574683 | 1.574683 | 0.000000 | ... | 9.397606 | 5.612830 | 8.728789 | 10.035881 | 3.811738 | 9.599438 | 11.980747 | 9.700292 | 9.784147 | 7.548699 |
5 rows × 18321 columns
| gene | hsa_let_7a_1 | hsa_let_7a_2 | hsa_let_7a_3 | hsa_let_7b | hsa_let_7c | hsa_let_7d | hsa_let_7e | hsa_let_7f_1 | hsa_let_7f_2 | hsa_let_7g | ... | hsa_mir_937 | hsa_mir_939 | hsa_mir_940 | hsa_mir_942 | hsa_mir_944 | hsa_mir_95 | hsa_mir_96 | hsa_mir_98 | hsa_mir_99a | hsa_mir_99b |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| patient | |||||||||||||||||||||
| TCGA-3C-AAAU | 13.129765 | 14.117933 | 13.147714 | 14.595135 | 8.414890 | 8.665921 | 10.521777 | 3.879392 | 11.824817 | 8.597744 | ... | 0.906699 | -0.093302 | 2.672234 | 2.467414 | 1.044202 | 2.044202 | 6.906699 | 5.754696 | 7.024602 | 15.506461 |
| TCGA-3C-AALI | 12.918069 | 13.922300 | 12.913194 | 14.512657 | 9.646536 | 9.003653 | 9.131760 | 4.386952 | 12.678841 | 8.455144 | ... | 1.579597 | -0.083367 | 0.139024 | 3.032109 | -0.668331 | 0.331670 | 5.912870 | 6.427066 | 7.885299 | 13.626182 |
| TCGA-3C-AALJ | 13.012033 | 14.010002 | 13.028483 | 13.419612 | 9.312455 | 9.276943 | 11.395711 | 5.314692 | 13.530255 | 9.230563 | ... | 3.270298 | -2.189134 | 0.395828 | 1.855261 | -0.381778 | 0.717757 | 6.603657 | 6.878301 | 7.580704 | 15.013822 |
| TCGA-3C-AALK | 13.144697 | 14.141721 | 13.151281 | 14.667196 | 11.511431 | 8.384763 | 10.368981 | 4.159182 | 12.652559 | 8.471503 | ... | 0.923965 | -0.660997 | -0.076034 | 1.798435 | 1.798435 | 0.798435 | 6.181354 | 5.377922 | 10.031619 | 14.554783 |
| TCGA-4H-AAAK | 13.411684 | 14.413518 | 13.420481 | 14.438548 | 11.693927 | 8.453747 | 10.741371 | 4.494537 | 13.009499 | 8.381220 | ... | 0.182950 | -0.624403 | -1.624403 | 1.076036 | 0.182950 | -0.302475 | 4.318110 | 5.103516 | 10.078201 | 14.650338 |
5 rows × 503 columns
| synchronous_malignancy | ajcc_pathologic_stage | days_to_diagnosis | laterality | created_datetime | last_known_disease_status | tissue_or_organ_of_origin | days_to_last_follow_up | age_at_diagnosis | primary_diagnosis | ... | pathology_N_stage | pathology_M_stage | gender | date_of_initial_pathologic_diagnosis | days_to_last_known_alive | radiation_therapy | histological_type | number_of_lymph_nodes | race | ethnicity | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| patient | |||||||||||||||||||||
| TCGA-3C-AAAU | No | Stage X | 0.0 | Left | NaN | NaN | Breast, NOS | NaN | 20211.0 | Lobular carcinoma, NOS | ... | nx | mx | female | 2004 | NaN | no | infiltrating lobular carcinoma | 4 | white | not hispanic or latino |
| TCGA-3C-AALI | No | Stage IIB | 0.0 | Right | NaN | NaN | Breast, NOS | NaN | 18538.0 | Infiltrating duct carcinoma, NOS | ... | n1a | m0 | female | 2003 | NaN | yes | infiltrating ductal carcinoma | 1 | black or african american | not hispanic or latino |
| TCGA-3C-AALJ | No | Stage IIB | 0.0 | Right | NaN | NaN | Breast, NOS | NaN | 22848.0 | Infiltrating duct carcinoma, NOS | ... | n1a | m0 | female | 2011 | NaN | no | infiltrating ductal carcinoma | 1 | black or african american | not hispanic or latino |
| TCGA-3C-AALK | No | Stage IA | 0.0 | Right | NaN | NaN | Breast, NOS | NaN | 19074.0 | Infiltrating duct carcinoma, NOS | ... | n0 (i+) | m0 | female | 2011 | NaN | no | infiltrating ductal carcinoma | 0 | black or african american | not hispanic or latino |
| TCGA-4H-AAAK | No | Stage IIIA | 0.0 | Left | NaN | NaN | Breast, NOS | NaN | 18371.0 | Lobular carcinoma, NOS | ... | n2a | m0 | female | 2013 | NaN | no | infiltrating lobular carcinoma | 4 | white | not hispanic or latino |
5 rows × 118 columns
pam50
3 419
4 140
1 130
2 46
0 34
Name: count, dtype: int64
[6]:
# Set a common index name for all tables, then save them to CSV
X_meth.index.name = "patient"
X_rna.index.name = "patient"
X_mirna.index.name = "patient"
pam50.index.name = "patient"
clinical.index.name = "patient"
X_meth.to_csv(root / "meth.csv", index=True)
X_rna.to_csv(root / "rna.csv", index=True)
X_mirna.to_csv(root / "mirna.csv", index=True)
pam50.to_csv(root / "pam50.csv", index=True)
clinical.to_csv(root / "clinical.csv", index=True)
Optional: Load the data we just saved to make sure it looks okay.
[ ]:
meth = pd.read_csv(root / "meth.csv", index_col=0)
rna = pd.read_csv(root / "rna.csv", index_col=0)
mirna = pd.read_csv(root / "mirna.csv", index_col=0)
pam50 = pd.read_csv(root / "pam50.csv", index_col=0)
clinical = pd.read_csv(root / "clinical.csv", index_col=0)
display(meth.head())
display(rna.head())
display(mirna.head())
display(clinical.head())
display(pam50.head())
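Beyond inspecting `head()`, it can be worth confirming that every table still shares the same patient index after the round trip through CSV. The following is a minimal sketch on small stand-in frames rather than the real files; the helper name `check_alignment` is hypothetical and is not part of BioNeuralNet.

```python
import pandas as pd

def check_alignment(tables: dict[str, pd.DataFrame]) -> pd.Index:
    """Return the shared patient index, or raise if any table disagrees."""
    names = list(tables)
    reference = tables[names[0]].index
    for name in names:
        idx = tables[name].index
        if not idx.is_unique:
            raise ValueError(f"{name}: duplicate patient IDs")
        # Same set of patients, regardless of row order.
        if not reference.sort_values().equals(idx.sort_values()):
            raise ValueError(f"{name}: patient IDs differ from {names[0]}")
    return reference

# Toy stand-ins for meth / rna / pam50 (patient IDs taken from the tables above;
# the feature names and values are made up for illustration).
patients = pd.Index(["TCGA-3C-AAAU", "TCGA-3C-AALI", "TCGA-3C-AALJ"], name="patient")
toy = {
    "meth": pd.DataFrame({"cg_a": [0.1, 0.2, 0.3]}, index=patients),
    "rna": pd.DataFrame({"gene_a": [5.0, 6.0, 7.0]}, index=patients),
    "pam50": pd.DataFrame({"pam50": [3, 4, 1]}, index=patients[::-1]),
}
shared = check_alignment(toy)
print(f"{len(shared)} patients shared by all tables")
```

With the real data, the same helper could be called on a dictionary of the five frames loaded above.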
Easy Access via DatasetLoader
To make this data easier to work with, we provide it through our DatasetLoader component. If you have additional pre-processed or raw datasets you would like to include, feel free to reach out; we are happy to help add new datasets to the platform.
[8]:
from bioneuralnet.datasets import DatasetLoader
tcga_brca = DatasetLoader("brca")
print(f"TCGA BRCA dataset shape: {tcga_brca.shape}")
brca_meth = tcga_brca.data["meth"]
brca_rna = tcga_brca.data["rna"]
brca_mirna = tcga_brca.data["mirna"]
brca_clinical = tcga_brca.data["clinical"]
brca_pam50 = tcga_brca.data["pam50"]
TCGA BRCA dataset shape: {'mirna': (769, 503), 'pam50': (769, 1), 'clinical': (769, 118), 'meth': (769, 20106), 'rna': (769, 18321)}
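The shape dictionary printed above is enough for a quick sanity check before going further. The sketch below copies that dictionary as a literal (so it runs without the package installed) and checks that every table has one row per patient; it does not call any BioNeuralNet API.

```python
# Shapes copied from the DatasetLoader output above.
shapes = {
    "mirna": (769, 503),
    "pam50": (769, 1),
    "clinical": (769, 118),
    "meth": (769, 20106),
    "rna": (769, 18321),
}

# Every table should have the same number of rows (one per patient).
row_counts = {n_rows for n_rows, _ in shapes.values()}
assert len(row_counts) == 1, f"row counts differ: {row_counts}"
n_patients = next(iter(row_counts))

# Total omics features available before any feature selection.
omics_features = sum(shapes[name][1] for name in ("meth", "rna", "mirna"))
print(f"patients: {n_patients}, omics features: {omics_features}")
```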
[ ]:
from bioneuralnet.utils.preprocess import preprocess_clinical
# Shapes
print(f"RNA shape: {brca_rna.shape}")
print(f"METH shape: {brca_meth.shape}")
print(f"miRNA shape: {brca_mirna.shape}")
print(f"Clinical shape: {brca_clinical.shape}")
print(f"Phenotype shape: {brca_pam50.shape}")
print(f"Phenotype counts:\n{brca_pam50.value_counts()}")
# Review the min and max values of each omics dataset
for name, df in {"rna": brca_rna, "meth": brca_meth, "mirna": brca_mirna}.items():
    min_val = df.min().min()
    max_val = df.max().max()
    print(f"\n{name.upper()}:")
    print(f"Min: {min_val:.4f}")
    print(f"Max: {max_val:.4f}")
# Check for NaNs in pam50, then keep only patients that have a label
print(f"Nan values in pam50 {brca_pam50.isna().sum().sum()}")
brca_pam50 = brca_pam50.dropna()
X_rna = brca_rna.loc[brca_pam50.index]
X_meth = brca_meth.loc[brca_pam50.index]
X_mirna = brca_mirna.loc[brca_pam50.index]
clinical = brca_clinical.loc[brca_pam50.index]
# For more details on the preprocessing function, see bioneuralnet.utils.preprocess
clinical = preprocess_clinical(
    clinical,
    brca_pam50,
    top_k=15,
    scale=True,
    ignore_columns=["days_to_birth", "age_at_diagnosis", "days_to_last_followup", "age_at_index", "years_to_birth"],
)
display(clinical.head())
2025-05-16 10:31:09,364 - bioneuralnet.utils.preprocess - INFO - [Inf]: Replaced 0 infinite values
2025-05-16 10:31:09,365 - bioneuralnet.utils.preprocess - INFO - [NaN]: Replaced 31384 NaNs after median imputation
2025-05-16 10:31:09,365 - bioneuralnet.utils.preprocess - INFO - [Zero-Var]: 39 columns dropped due to zero variance
RNA shape: (769, 18321)
METH shape: (769, 20106)
miRNA shape: (769, 503)
Clinical shape: (769, 118)
Phenotype shape: (769, 1)
Phenotype counts:
pam50
3 419
4 140
1 130
2 46
0 34
Name: count, dtype: int64
RNA:
Min: -8.5873
Max: 20.9784
METH:
Min: -7.1642
Max: 6.9710
MIRNA:
Min: -4.4631
Max: 19.3838
Nan values in pam50 0
2025-05-16 10:31:09,752 - bioneuralnet.utils.preprocess - INFO - Selected top 15 features by RandomForest importance
| days_to_birth | age_at_diagnosis | days_to_last_followup | age_at_index | years_to_birth | year_of_diagnosis | number_of_lymph_nodes | date_of_initial_pathologic_diagnosis | histological_type_infiltrating lobular carcinoma | primary_diagnosis_Lobular carcinoma, NOS | morphology_8520/3 | race.1_white | days_to_death.1 | laterality_Right | primary_diagnosis_Infiltrating duct carcinoma, NOS | country_of_residence_at_enrollment_United States | sites_of_involvement_Breast, NOS | days_to_death | race_white | ajcc_staging_system_edition_6th | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| patient | ||||||||||||||||||||
| TCGA-3C-AAAU | -20211.0 | 20211.0 | 4047.0 | 55.0 | 55.0 | -1.50 | 1.5 | -1.50 | True | True | True | True | 0.0 | False | False | True | False | 0.0 | True | True |
| TCGA-3C-AALI | -18538.0 | 18538.0 | 4005.0 | 50.0 | 50.0 | -1.75 | 0.0 | -1.75 | False | False | False | False | 0.0 | True | True | True | False | 0.0 | False | True |
| TCGA-3C-AALJ | -22848.0 | 22848.0 | 1474.0 | 62.0 | 62.0 | 0.25 | 0.0 | 0.25 | False | False | False | False | 0.0 | True | True | True | True | 0.0 | False | False |
| TCGA-3C-AALK | -19074.0 | 19074.0 | 1448.0 | 52.0 | 52.0 | 0.25 | -0.5 | 0.25 | False | False | False | False | 0.0 | True | True | True | True | 0.0 | False | False |
| TCGA-4H-AAAK | -18371.0 | 18371.0 | 348.0 | 50.0 | 50.0 | 0.75 | 1.5 | 0.75 | True | True | True | True | 0.0 | False | False | False | False | 0.0 | True | False |
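The log lines above mention three cleaning steps: replacing infinite values, median imputation of NaNs, and dropping zero-variance columns. The sketch below shows one way those steps could look in pandas on a toy frame. It is an illustration under assumptions, not the implementation of `preprocess_clinical`; the helper name `basic_clean` is hypothetical, and the scaling and RandomForest-based selection steps are left out.

```python
import numpy as np
import pandas as pd

def basic_clean(df: pd.DataFrame) -> pd.DataFrame:
    """Replace infinities, median-impute NaNs, and drop constant columns."""
    out = df.replace([np.inf, -np.inf], np.nan)
    # Fill each numeric column with its own median.
    out = out.fillna(out.median(numeric_only=True))
    # A column with a single distinct value carries no signal for a classifier.
    constant = [col for col in out.columns if out[col].nunique(dropna=False) <= 1]
    return out.drop(columns=constant)

# Toy clinical table; the values are made up for illustration.
toy = pd.DataFrame(
    {
        "number_of_lymph_nodes": [4.0, 1.0, np.nan, 0.0],
        "age_at_index": [55.0, 50.0, 62.0, np.inf],
        "gender_female": [1.0, 1.0, 1.0, 1.0],  # constant, so it is dropped
    },
    index=["p1", "p2", "p3", "p4"],
)
cleaned = basic_clean(toy)
print(cleaned)
```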
Preparing Multi-Omics Data for downstream tasks
Check sample overlap: confirm that each omics table and the clinical table cover the same patients as the PAM50 labels.
Select top features: an ANOVA F-test selects the most relevant features for classification from each omics dataset.
Combine datasets: the selected RNA, methylation, and miRNA features are concatenated column-wise into a single dataset.
Clean missing values: any missing (NaN) values in the combined dataset are counted and the affected rows are dropped.
Build similarity graph: a k-nearest neighbors graph is built from the transposed feature matrix, so the graph nodes are features rather than patients.
Note: For more details on preprocessing functions and graph generation algorithms, see the Utils documentation.
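To make the feature-selection step concrete, here is a small NumPy sketch of ranking features by a one-way ANOVA F statistic and keeping the top k. It is a from-scratch illustration on synthetic data, not the code behind `top_anova_f_features`; the function names `anova_f_scores` and `top_k_features` are hypothetical.

```python
import numpy as np

def anova_f_scores(X: np.ndarray, y: np.ndarray) -> np.ndarray:
    """One-way ANOVA F statistic for each column of X against class labels y."""
    classes = np.unique(y)
    n_samples, n_classes = X.shape[0], len(classes)
    grand_mean = X.mean(axis=0)
    ss_between = np.zeros(X.shape[1])
    ss_within = np.zeros(X.shape[1])
    for c in classes:
        group = X[y == c]
        group_mean = group.mean(axis=0)
        ss_between += len(group) * (group_mean - grand_mean) ** 2
        ss_within += ((group - group_mean) ** 2).sum(axis=0)
    ms_between = ss_between / (n_classes - 1)
    ms_within = ss_within / (n_samples - n_classes)
    return ms_between / ms_within

def top_k_features(X: np.ndarray, y: np.ndarray, k: int) -> np.ndarray:
    """Indices of the k columns with the largest F statistic, best first."""
    scores = anova_f_scores(X, y)
    return np.argsort(scores)[::-1][:k]

# Synthetic data: 90 samples, 3 classes, 6 features.
rng = np.random.default_rng(0)
y = np.repeat([0, 1, 2], 30)
X = rng.normal(size=(90, 6))
X[:, 2] += y * 2.0  # feature 2 separates the classes strongly
X[:, 4] += y * 1.0  # feature 4 separates them more weakly
selected = top_k_features(X, y, k=2)
print(selected)
```

A feature whose class means differ a lot relative to its within-class spread gets a large F value, which is why the two shifted columns rise to the top here.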
[ ]:
from sklearn.metrics import accuracy_score, f1_score
from bioneuralnet.utils.preprocess import top_anova_f_features
from bioneuralnet.utils.graph import gen_similarity_graph
# Count the samples each table shares with the PAM50 labels
print("Intersection of samples:")
print(f"RNA: {len(set(X_rna.index) & set(brca_pam50.index))}")
print(f"METH: {len(set(X_meth.index) & set(brca_pam50.index))}")
print(f"miRNA: {len(set(X_mirna.index) & set(brca_pam50.index))}")
print(f"Clinical: {len(set(clinical.index) & set(brca_pam50.index))}")
meth_sel = top_anova_f_features(X_meth, brca_pam50, max_features=1000, task="classification")
rna_sel = top_anova_f_features(X_rna, brca_pam50, max_features=1000, task="classification")
mirna_sel = top_anova_f_features(X_mirna, brca_pam50, max_features=503, task="classification")
X_train_full = pd.concat([meth_sel, rna_sel, mirna_sel], axis=1)
# Count NaN values
print(f"Nan values in X_train_full: {X_train_full.isna().sum().sum()}")
# Drop rows that contain NaN values
X_train_full = X_train_full.dropna()
# Confirm that no NaN values remain
print(f"Nan value in X_train_full after dropping: {X_train_full.isna().sum().sum()}")
print(f"X_train_full shape: {X_train_full.shape}")
# Transpose so that the graph nodes are features rather than patients
A_train = gen_similarity_graph(X_train_full.T, k=15)
print(f"Network shape: {A_train.shape}")
Intersection of samples:
RNA: 769
METH: 769
miRNA: 769
Clinical: 769
2025-05-16 10:31:12,677 - bioneuralnet.utils.preprocess - INFO - [Inf]: Replaced 0 infinite values
2025-05-16 10:31:12,678 - bioneuralnet.utils.preprocess - INFO - [NaN]: Replaced 0 NaNs after median imputation
2025-05-16 10:31:12,678 - bioneuralnet.utils.preprocess - INFO - [Zero-Var]: 0 columns dropped due to zero variance
2025-05-16 10:31:12,835 - bioneuralnet.utils.preprocess - INFO - Selected 1000 features by ANOVA (task=classification), 17514 significant, 0 padded
2025-05-16 10:31:15,470 - bioneuralnet.utils.preprocess - INFO - [Inf]: Replaced 0 infinite values
2025-05-16 10:31:15,471 - bioneuralnet.utils.preprocess - INFO - [NaN]: Replaced 0 NaNs after median imputation
2025-05-16 10:31:15,471 - bioneuralnet.utils.preprocess - INFO - [Zero-Var]: 0 columns dropped due to zero variance
2025-05-16 10:31:15,635 - bioneuralnet.utils.preprocess - INFO - Selected 1000 features by ANOVA (task=classification), 16864 significant, 0 padded
2025-05-16 10:31:15,714 - bioneuralnet.utils.preprocess - INFO - [Inf]: Replaced 0 infinite values
2025-05-16 10:31:15,715 - bioneuralnet.utils.preprocess - INFO - [NaN]: Replaced 0 NaNs after median imputation
2025-05-16 10:31:15,715 - bioneuralnet.utils.preprocess - INFO - [Zero-Var]: 0 columns dropped due to zero variance
2025-05-16 10:31:15,718 - bioneuralnet.utils.preprocess - INFO - Selected 503 features by ANOVA (task=classification), 465 significant, 38 padded
Nan values in X_train_full: 0
Nan value in X_train_full after dropping: 0
X_train_full shape: (769, 2503)
Network shape: (2503, 2503)
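The network above has one node per selected feature (2,503), because the feature matrix is transposed before the graph is built. The sketch below shows one common way to build such a k-nearest-neighbors graph in NumPy. The cosine metric, the symmetrization rule, and the function name `knn_cosine_graph` are all assumptions made for illustration; this section does not show which metric or options `gen_similarity_graph` actually uses.

```python
import numpy as np

def knn_cosine_graph(nodes: np.ndarray, k: int) -> np.ndarray:
    """Symmetric k-nearest-neighbour adjacency weighted by cosine similarity.

    `nodes` has one row per graph node, so pass the transposed
    (features x patients) matrix to connect omics features.
    """
    norms = np.linalg.norm(nodes, axis=1, keepdims=True)
    unit = nodes / np.where(norms == 0, 1.0, norms)
    sim = unit @ unit.T
    np.fill_diagonal(sim, -np.inf)  # exclude self-matches from the ranking
    # Column indices of the k most similar nodes for every row.
    neighbours = np.argpartition(-sim, kth=k - 1, axis=1)[:, :k]
    mask = np.zeros(sim.shape, dtype=bool)
    rows = np.repeat(np.arange(sim.shape[0]), k)
    mask[rows, neighbours.ravel()] = True
    mask |= mask.T  # keep an edge if either endpoint selected it
    return np.where(mask, sim, 0.0)

# Synthetic stand-in: 20 patients x 8 features.
rng = np.random.default_rng(1)
X_toy = rng.normal(size=(20, 8))
A_toy = knn_cosine_graph(X_toy.T, k=3)
print(A_toy.shape)
```

Because edges are kept when either endpoint selects the other, some nodes can end up with more than k neighbours.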
[ ]:
from bioneuralnet.downstream_task import DPMON
save = Path("/home/vicente/Github/BioNeuralNet/TCGA_BRCA/results")
# Rename the label column to "phenotype" before passing it to DPMON
brca_pam50 = brca_pam50.rename(columns={"pam50": "phenotype"})
dpmon = DPMON(
    adjacency_matrix=A_train,
    omics_list=[meth_sel, rna_sel, mirna_sel],
    phenotype_data=brca_pam50,
    clinical_data=clinical,
    repeat_num=3,
    tune=True,
    gpu=True,
    cuda=0,
    output_dir=save / "results1",
)
predictions_df, avg_accuracy = dpmon.run()
actual = predictions_df["Actual"]
pred = predictions_df["Predicted"]
# Each metric is stored as a (score, 0) pair; the second slot is left at 0 here
dp_acc = (accuracy_score(actual, pred), 0)
dp_f1w = (f1_score(actual, pred, average="weighted"), 0)
dp_f1m = (f1_score(actual, pred, average="macro"), 0)
print("DPMON results:")
print(f"Accuracy: {dp_acc[0]}")
print(f"F1 weighted: {dp_f1w[0]}")
print(f"F1 macro: {dp_f1m[0]}")
2025-05-16 15:22:49,397 - bioneuralnet.downstream_task.dpmon - INFO - Output directory set to: /home/vicente/Github/BioNeuralNet/TCGA_BRCA/results/results1
2025-05-16 15:22:49,397 - bioneuralnet.downstream_task.dpmon - INFO - Initialized DPMON with the provided parameters.
2025-05-16 15:22:49,397 - bioneuralnet.downstream_task.dpmon - INFO - Starting DPMON run.
2025-05-16 15:22:49,411 - bioneuralnet.downstream_task.dpmon - INFO - Running hyperparameter tuning for DPMON.
2025-05-16 15:22:49,412 - bioneuralnet.downstream_task.dpmon - INFO - Using GPU 0
2025-05-16 15:22:49,412 - bioneuralnet.downstream_task.dpmon - INFO - Slicing omics dataset based on network nodes.
2025-05-16 15:22:49,415 - bioneuralnet.downstream_task.dpmon - INFO - Building PyTorch Geometric Data object from adjacency matrix.
2025-05-16 15:22:49,487 - bioneuralnet.downstream_task.dpmon - INFO - Number of nodes in network: 2503
2025-05-16 15:22:49,487 - bioneuralnet.downstream_task.dpmon - INFO - Using clinical vars for node features: ['days_to_birth', 'age_at_diagnosis', 'days_to_last_followup', 'age_at_index', 'years_to_birth', 'year_of_diagnosis', 'number_of_lymph_nodes', 'date_of_initial_pathologic_diagnosis', 'histological_type_infiltrating lobular carcinoma', 'primary_diagnosis_Lobular carcinoma, NOS', 'morphology_8520/3', 'race.1_white', 'days_to_death.1', 'laterality_Right', 'primary_diagnosis_Infiltrating duct carcinoma, NOS', 'country_of_residence_at_enrollment_United States', 'sites_of_involvement_Breast, NOS', 'days_to_death', 'race_white', 'ajcc_staging_system_edition_6th']
2025-05-16 15:22:53,816 - bioneuralnet.downstream_task.dpmon - INFO - Starting hyperparameter tuning for dataset shape: (769, 2504)
2025-05-16 15:22:53,817 INFO tune.py:616 -- [output] This uses the legacy output and progress reporter, as Jupyter notebooks are not supported by the new engine, yet. For more information, please see https://github.com/ray-project/ray/issues/36949
2025-05-16 15:23:37,056 INFO tune.py:1009 -- Wrote the latest version of all result files and experiment state to '/home/vicente/ray_results/tune_dp' in 0.0110s.
2025-05-16 15:23:37,074 - bioneuralnet.downstream_task.dpmon - INFO - Best trial config: {'gnn_layer_num': 2, 'gnn_hidden_dim': 128, 'lr': 0.005072810911305633, 'weight_decay': 0.00396813275339707, 'nn_hidden_dim1': 4, 'nn_hidden_dim2': 4, 'num_epochs': 512}
2025-05-16 15:23:37,075 - bioneuralnet.downstream_task.dpmon - INFO - Best trial final loss: 1.1080904006958008
2025-05-16 15:23:37,075 - bioneuralnet.downstream_task.dpmon - INFO - Best trial final accuracy: 0.8179453836150845
2025-05-16 15:23:37,077 - bioneuralnet.downstream_task.dpmon - INFO - gnn_layer_num gnn_hidden_dim lr weight_decay nn_hidden_dim1 \
0 2 128 0.005073 0.003968 4
nn_hidden_dim2 num_epochs
0 4 512
2025-05-16 15:23:37,079 - bioneuralnet.downstream_task.dpmon - INFO - Best tuned parameters: {'gnn_layer_num': 2, 'gnn_hidden_dim': 128, 'lr': 0.005072810911305633, 'weight_decay': 0.00396813275339707, 'nn_hidden_dim1': 4, 'nn_hidden_dim2': 4, 'num_epochs': 512}
2025-05-16 15:23:37,079 - bioneuralnet.downstream_task.dpmon - INFO - Best tuned parameters: {'gnn_layer_num': 2, 'gnn_hidden_dim': 128, 'lr': 0.005072810911305633, 'weight_decay': 0.00396813275339707, 'nn_hidden_dim1': 4, 'nn_hidden_dim2': 4, 'num_epochs': 512}
2025-05-16 15:23:37,079 - bioneuralnet.downstream_task.dpmon - INFO - Running standard training with tuned parameters.
2025-05-16 15:23:37,079 - bioneuralnet.downstream_task.dpmon - INFO - Using GPU 0
2025-05-16 15:23:37,080 - bioneuralnet.downstream_task.dpmon - INFO - Slicing omics dataset based on network nodes.
2025-05-16 15:23:37,083 - bioneuralnet.downstream_task.dpmon - INFO - Building PyTorch Geometric Data object from adjacency matrix.
2025-05-16 15:23:37,159 - bioneuralnet.downstream_task.dpmon - INFO - Number of nodes in network: 2503
2025-05-16 15:23:37,159 - bioneuralnet.downstream_task.dpmon - INFO - Using clinical vars for node features: ['days_to_birth', 'age_at_diagnosis', 'days_to_last_followup', 'age_at_index', 'years_to_birth', 'year_of_diagnosis', 'number_of_lymph_nodes', 'date_of_initial_pathologic_diagnosis', 'histological_type_infiltrating lobular carcinoma', 'primary_diagnosis_Lobular carcinoma, NOS', 'morphology_8520/3', 'race.1_white', 'days_to_death.1', 'laterality_Right', 'primary_diagnosis_Infiltrating duct carcinoma, NOS', 'country_of_residence_at_enrollment_United States', 'sites_of_involvement_Breast, NOS', 'days_to_death', 'race_white', 'ajcc_staging_system_edition_6th']
2025-05-16 15:23:41,530 - bioneuralnet.downstream_task.dpmon - INFO - Training iteration 1/3
2025-05-16 15:23:41,563 - bioneuralnet.downstream_task.dpmon - INFO - Epoch [1/512], Loss: 1.6182
2025-05-16 15:23:41,650 - bioneuralnet.downstream_task.dpmon - INFO - Epoch [10/512], Loss: 1.5531
2025-05-16 15:23:41,745 - bioneuralnet.downstream_task.dpmon - INFO - Epoch [20/512], Loss: 1.5067
2025-05-16 15:23:41,840 - bioneuralnet.downstream_task.dpmon - INFO - Epoch [30/512], Loss: 1.4441
2025-05-16 15:23:41,939 - bioneuralnet.downstream_task.dpmon - INFO - Epoch [40/512], Loss: 1.3666
2025-05-16 15:23:42,033 - bioneuralnet.downstream_task.dpmon - INFO - Epoch [50/512], Loss: 1.2763
2025-05-16 15:23:42,128 - bioneuralnet.downstream_task.dpmon - INFO - Epoch [60/512], Loss: 1.2001
2025-05-16 15:23:42,224 - bioneuralnet.downstream_task.dpmon - INFO - Epoch [70/512], Loss: 1.1394
2025-05-16 15:23:42,317 - bioneuralnet.downstream_task.dpmon - INFO - Epoch [80/512], Loss: 1.0970
2025-05-16 15:23:42,412 - bioneuralnet.downstream_task.dpmon - INFO - Epoch [90/512], Loss: 1.0614
2025-05-16 15:23:42,505 - bioneuralnet.downstream_task.dpmon - INFO - Epoch [100/512], Loss: 1.0399
2025-05-16 15:23:42,599 - bioneuralnet.downstream_task.dpmon - INFO - Epoch [110/512], Loss: 1.0265
2025-05-16 15:23:42,693 - bioneuralnet.downstream_task.dpmon - INFO - Epoch [120/512], Loss: 1.1995
2025-05-16 15:23:42,787 - bioneuralnet.downstream_task.dpmon - INFO - Epoch [130/512], Loss: 1.0792
2025-05-16 15:23:42,881 - bioneuralnet.downstream_task.dpmon - INFO - Epoch [140/512], Loss: 1.0485
2025-05-16 15:23:42,976 - bioneuralnet.downstream_task.dpmon - INFO - Epoch [150/512], Loss: 1.0294
2025-05-16 15:23:43,077 - bioneuralnet.downstream_task.dpmon - INFO - Epoch [160/512], Loss: 1.0155
2025-05-16 15:23:43,171 - bioneuralnet.downstream_task.dpmon - INFO - Epoch [170/512], Loss: 1.0073
2025-05-16 15:23:43,266 - bioneuralnet.downstream_task.dpmon - INFO - Epoch [180/512], Loss: 1.0014
2025-05-16 15:23:43,361 - bioneuralnet.downstream_task.dpmon - INFO - Epoch [190/512], Loss: 0.9978
2025-05-16 15:23:43,455 - bioneuralnet.downstream_task.dpmon - INFO - Epoch [200/512], Loss: 0.9956
2025-05-16 15:23:43,549 - bioneuralnet.downstream_task.dpmon - INFO - Epoch [210/512], Loss: 0.9942
2025-05-16 15:23:43,643 - bioneuralnet.downstream_task.dpmon - INFO - Epoch [220/512], Loss: 0.9932
2025-05-16 15:23:43,737 - bioneuralnet.downstream_task.dpmon - INFO - Epoch [230/512], Loss: 1.0023
2025-05-16 15:23:43,831 - bioneuralnet.downstream_task.dpmon - INFO - Epoch [240/512], Loss: 0.9951
2025-05-16 15:23:43,925 - bioneuralnet.downstream_task.dpmon - INFO - Epoch [250/512], Loss: 0.9916
2025-05-16 15:23:44,019 - bioneuralnet.downstream_task.dpmon - INFO - Epoch [260/512], Loss: 0.9900
2025-05-16 15:23:44,113 - bioneuralnet.downstream_task.dpmon - INFO - Epoch [270/512], Loss: 0.9888
2025-05-16 15:23:44,207 - bioneuralnet.downstream_task.dpmon - INFO - Epoch [280/512], Loss: 0.9878
2025-05-16 15:23:44,302 - bioneuralnet.downstream_task.dpmon - INFO - Epoch [290/512], Loss: 0.9872
2025-05-16 15:23:44,396 - bioneuralnet.downstream_task.dpmon - INFO - Epoch [300/512], Loss: 1.0332
2025-05-16 15:23:44,490 - bioneuralnet.downstream_task.dpmon - INFO - Epoch [310/512], Loss: 1.0558
2025-05-16 15:23:44,584 - bioneuralnet.downstream_task.dpmon - INFO - Epoch [320/512], Loss: 1.0264
2025-05-16 15:23:44,678 - bioneuralnet.downstream_task.dpmon - INFO - Epoch [330/512], Loss: 1.0068
2025-05-16 15:23:44,772 - bioneuralnet.downstream_task.dpmon - INFO - Epoch [340/512], Loss: 0.9968
2025-05-16 15:23:44,866 - bioneuralnet.downstream_task.dpmon - INFO - Epoch [350/512], Loss: 0.9917
2025-05-16 15:23:44,963 - bioneuralnet.downstream_task.dpmon - INFO - Epoch [360/512], Loss: 0.9890
2025-05-16 15:23:45,059 - bioneuralnet.downstream_task.dpmon - INFO - Epoch [370/512], Loss: 1.0214
2025-05-16 15:23:45,153 - bioneuralnet.downstream_task.dpmon - INFO - Epoch [380/512], Loss: 1.0091
2025-05-16 15:23:45,247 - bioneuralnet.downstream_task.dpmon - INFO - Epoch [390/512], Loss: 0.9997
2025-05-16 15:23:45,341 - bioneuralnet.downstream_task.dpmon - INFO - Epoch [400/512], Loss: 0.9948
2025-05-16 15:23:45,436 - bioneuralnet.downstream_task.dpmon - INFO - Epoch [410/512], Loss: 0.9907
2025-05-16 15:23:45,530 - bioneuralnet.downstream_task.dpmon - INFO - Epoch [420/512], Loss: 0.9878
2025-05-16 15:23:45,624 - bioneuralnet.downstream_task.dpmon - INFO - Epoch [430/512], Loss: 0.9851
2025-05-16 15:23:45,717 - bioneuralnet.downstream_task.dpmon - INFO - Epoch [440/512], Loss: 0.9828
2025-05-16 15:23:45,815 - bioneuralnet.downstream_task.dpmon - INFO - Epoch [450/512], Loss: 0.9802
2025-05-16 15:23:45,909 - bioneuralnet.downstream_task.dpmon - INFO - Epoch [460/512], Loss: 0.9773
2025-05-16 15:23:46,003 - bioneuralnet.downstream_task.dpmon - INFO - Epoch [470/512], Loss: 0.9736
2025-05-16 15:23:46,097 - bioneuralnet.downstream_task.dpmon - INFO - Epoch [480/512], Loss: 0.9691
2025-05-16 15:23:46,192 - bioneuralnet.downstream_task.dpmon - INFO - Epoch [490/512], Loss: 0.9645
2025-05-16 15:23:46,287 - bioneuralnet.downstream_task.dpmon - INFO - Epoch [500/512], Loss: 0.9643
2025-05-16 15:23:46,384 - bioneuralnet.downstream_task.dpmon - INFO - Epoch [510/512], Loss: 1.0325
2025-05-16 15:23:46,407 - bioneuralnet.downstream_task.dpmon - INFO - Training Accuracy: 0.6658
2025-05-16 15:23:46,409 - bioneuralnet.downstream_task.dpmon - INFO - Model saved to /home/vicente/Github/BioNeuralNet/TCGA_BRCA/results/results1/dpm_model_iter_1.pth
2025-05-16 15:23:46,413 - bioneuralnet.downstream_task.dpmon - INFO - Training iteration 2/3
2025-05-16 15:23:46,435 - bioneuralnet.downstream_task.dpmon - INFO - Epoch [1/512], Loss: 1.6234
2025-05-16 15:23:46,531 - bioneuralnet.downstream_task.dpmon - INFO - Epoch [10/512], Loss: 1.5598
2025-05-16 15:23:46,626 - bioneuralnet.downstream_task.dpmon - INFO - Epoch [20/512], Loss: 1.5203
2025-05-16 15:23:46,720 - bioneuralnet.downstream_task.dpmon - INFO - Epoch [30/512], Loss: 1.4727
2025-05-16 15:23:46,814 - bioneuralnet.downstream_task.dpmon - INFO - Epoch [40/512], Loss: 1.4078
2025-05-16 15:23:46,908 - bioneuralnet.downstream_task.dpmon - INFO - Epoch [50/512], Loss: 1.3256
2025-05-16 15:23:47,002 - bioneuralnet.downstream_task.dpmon - INFO - Epoch [60/512], Loss: 1.2342
2025-05-16 15:23:47,096 - bioneuralnet.downstream_task.dpmon - INFO - Epoch [70/512], Loss: 1.1537
2025-05-16 15:23:47,190 - bioneuralnet.downstream_task.dpmon - INFO - Epoch [80/512], Loss: 1.1014
2025-05-16 15:23:47,284 - bioneuralnet.downstream_task.dpmon - INFO - Epoch [90/512], Loss: 1.0631
2025-05-16 15:23:47,378 - bioneuralnet.downstream_task.dpmon - INFO - Epoch [100/512], Loss: 1.0422
2025-05-16 15:23:47,472 - bioneuralnet.downstream_task.dpmon - INFO - Epoch [110/512], Loss: 1.0257
2025-05-16 15:23:47,565 - bioneuralnet.downstream_task.dpmon - INFO - Epoch [120/512], Loss: 1.0252
2025-05-16 15:23:47,659 - bioneuralnet.downstream_task.dpmon - INFO - Epoch [130/512], Loss: 1.0136
2025-05-16 15:23:47,757 - bioneuralnet.downstream_task.dpmon - INFO - Epoch [140/512], Loss: 1.0068
2025-05-16 15:23:47,852 - bioneuralnet.downstream_task.dpmon - INFO - Epoch [150/512], Loss: 1.0016
2025-05-16 15:23:47,946 - bioneuralnet.downstream_task.dpmon - INFO - Epoch [160/512], Loss: 0.9965
2025-05-16 15:23:48,040 - bioneuralnet.downstream_task.dpmon - INFO - Epoch [170/512], Loss: 1.0520
2025-05-16 15:23:48,134 - bioneuralnet.downstream_task.dpmon - INFO - Epoch [180/512], Loss: 1.0251
2025-05-16 15:23:48,229 - bioneuralnet.downstream_task.dpmon - INFO - Epoch [190/512], Loss: 1.0158
2025-05-16 15:23:48,322 - bioneuralnet.downstream_task.dpmon - INFO - Epoch [200/512], Loss: 1.0110
2025-05-16 15:23:48,416 - bioneuralnet.downstream_task.dpmon - INFO - Epoch [210/512], Loss: 1.0074
2025-05-16 15:23:48,523 - bioneuralnet.downstream_task.dpmon - INFO - Epoch [220/512], Loss: 1.0034
2025-05-16 15:23:48,618 - bioneuralnet.downstream_task.dpmon - INFO - Epoch [230/512], Loss: 1.0010
2025-05-16 15:23:48,712 - bioneuralnet.downstream_task.dpmon - INFO - Epoch [240/512], Loss: 0.9988
2025-05-16 15:23:48,806 - bioneuralnet.downstream_task.dpmon - INFO - Epoch [250/512], Loss: 0.9974
2025-05-16 15:23:48,900 - bioneuralnet.downstream_task.dpmon - INFO - Epoch [260/512], Loss: 0.9960
2025-05-16 15:23:48,994 - bioneuralnet.downstream_task.dpmon - INFO - Epoch [270/512], Loss: 0.9946
2025-05-16 15:23:49,089 - bioneuralnet.downstream_task.dpmon - INFO - Epoch [280/512], Loss: 0.9932
2025-05-16 15:23:49,183 - bioneuralnet.downstream_task.dpmon - INFO - Epoch [290/512], Loss: 0.9924
2025-05-16 15:23:49,277 - bioneuralnet.downstream_task.dpmon - INFO - Epoch [300/512], Loss: 0.9905
2025-05-16 15:23:49,373 - bioneuralnet.downstream_task.dpmon - INFO - Epoch [310/512], Loss: 0.9880
2025-05-16 15:23:49,467 - bioneuralnet.downstream_task.dpmon - INFO - Epoch [320/512], Loss: 0.9850
2025-05-16 15:23:49,561 - bioneuralnet.downstream_task.dpmon - INFO - Epoch [330/512], Loss: 0.9789
2025-05-16 15:23:49,656 - bioneuralnet.downstream_task.dpmon - INFO - Epoch [340/512], Loss: 0.9995
2025-05-16 15:23:49,750 - bioneuralnet.downstream_task.dpmon - INFO - Epoch [350/512], Loss: 0.9807
2025-05-16 15:23:49,844 - bioneuralnet.downstream_task.dpmon - INFO - Epoch [360/512], Loss: 0.9594
2025-05-16 15:23:49,938 - bioneuralnet.downstream_task.dpmon - INFO - Epoch [370/512], Loss: 0.9547
2025-05-16 15:23:50,031 - bioneuralnet.downstream_task.dpmon - INFO - Epoch [380/512], Loss: 0.9526
2025-05-16 15:23:50,126 - bioneuralnet.downstream_task.dpmon - INFO - Epoch [390/512], Loss: 0.9514
2025-05-16 15:23:50,220 - bioneuralnet.downstream_task.dpmon - INFO - Epoch [400/512], Loss: 0.9506
2025-05-16 15:23:50,314 - bioneuralnet.downstream_task.dpmon - INFO - Epoch [410/512], Loss: 0.9505
2025-05-16 15:23:50,417 - bioneuralnet.downstream_task.dpmon - INFO - Epoch [420/512], Loss: 0.9510
2025-05-16 15:23:50,511 - bioneuralnet.downstream_task.dpmon - INFO - Epoch [430/512], Loss: 0.9494
2025-05-16 15:23:50,605 - bioneuralnet.downstream_task.dpmon - INFO - Epoch [440/512], Loss: 0.9485
2025-05-16 15:23:50,699 - bioneuralnet.downstream_task.dpmon - INFO - Epoch [450/512], Loss: 0.9477
2025-05-16 15:23:50,793 - bioneuralnet.downstream_task.dpmon - INFO - Epoch [460/512], Loss: 0.9473
2025-05-16 15:23:50,887 - bioneuralnet.downstream_task.dpmon - INFO - Epoch [470/512], Loss: 0.9470
2025-05-16 15:23:50,981 - bioneuralnet.downstream_task.dpmon - INFO - Epoch [480/512], Loss: 0.9455
2025-05-16 15:23:51,074 - bioneuralnet.downstream_task.dpmon - INFO - Epoch [490/512], Loss: 0.9464
2025-05-16 15:23:51,168 - bioneuralnet.downstream_task.dpmon - INFO - Epoch [500/512], Loss: 0.9468
2025-05-16 15:23:51,262 - bioneuralnet.downstream_task.dpmon - INFO - Epoch [510/512], Loss: 0.9451
2025-05-16 15:23:51,285 - bioneuralnet.downstream_task.dpmon - INFO - Training Accuracy: 0.9545
2025-05-16 15:23:51,287 - bioneuralnet.downstream_task.dpmon - INFO - Model saved to /home/vicente/Github/BioNeuralNet/TCGA_BRCA/results/results1/dpm_model_iter_2.pth
2025-05-16 15:23:51,291 - bioneuralnet.downstream_task.dpmon - INFO - Training iteration 3/3
2025-05-16 15:23:51,309 - bioneuralnet.downstream_task.dpmon - INFO - Epoch [1/512], Loss: 1.6141
2025-05-16 15:23:51,394 - bioneuralnet.downstream_task.dpmon - INFO - Epoch [10/512], Loss: 1.5460
2025-05-16 15:23:51,488 - bioneuralnet.downstream_task.dpmon - INFO - Epoch [20/512], Loss: 1.5051
2025-05-16 15:23:51,582 - bioneuralnet.downstream_task.dpmon - INFO - Epoch [30/512], Loss: 1.4469
2025-05-16 15:23:51,676 - bioneuralnet.downstream_task.dpmon - INFO - Epoch [40/512], Loss: 1.3636
2025-05-16 15:23:51,771 - bioneuralnet.downstream_task.dpmon - INFO - Epoch [50/512], Loss: 1.2702
2025-05-16 15:23:51,865 - bioneuralnet.downstream_task.dpmon - INFO - Epoch [60/512], Loss: 1.1908
2025-05-16 15:23:51,963 - bioneuralnet.downstream_task.dpmon - INFO - Epoch [70/512], Loss: 1.1343
2025-05-16 15:23:52,057 - bioneuralnet.downstream_task.dpmon - INFO - Epoch [80/512], Loss: 1.1024
2025-05-16 15:23:52,152 - bioneuralnet.downstream_task.dpmon - INFO - Epoch [90/512], Loss: 1.0812
2025-05-16 15:23:52,247 - bioneuralnet.downstream_task.dpmon - INFO - Epoch [100/512], Loss: 1.1784
2025-05-16 15:23:52,343 - bioneuralnet.downstream_task.dpmon - INFO - Epoch [110/512], Loss: 1.1234
2025-05-16 15:23:52,437 - bioneuralnet.downstream_task.dpmon - INFO - Epoch [120/512], Loss: 1.1016
2025-05-16 15:23:52,531 - bioneuralnet.downstream_task.dpmon - INFO - Epoch [130/512], Loss: 1.0903
2025-05-16 15:23:52,625 - bioneuralnet.downstream_task.dpmon - INFO - Epoch [140/512], Loss: 1.0835
2025-05-16 15:23:52,724 - bioneuralnet.downstream_task.dpmon - INFO - Epoch [150/512], Loss: 1.0627
2025-05-16 15:23:52,818 - bioneuralnet.downstream_task.dpmon - INFO - Epoch [160/512], Loss: 1.0492
2025-05-16 15:23:52,911 - bioneuralnet.downstream_task.dpmon - INFO - Epoch [170/512], Loss: 1.0344
2025-05-16 15:23:53,005 - bioneuralnet.downstream_task.dpmon - INFO - Epoch [180/512], Loss: 1.0256
2025-05-16 15:23:53,099 - bioneuralnet.downstream_task.dpmon - INFO - Epoch [190/512], Loss: 1.0206
2025-05-16 15:23:53,193 - bioneuralnet.downstream_task.dpmon - INFO - Epoch [200/512], Loss: 1.0164
2025-05-16 15:23:53,287 - bioneuralnet.downstream_task.dpmon - INFO - Epoch [210/512], Loss: 1.0128
2025-05-16 15:23:53,383 - bioneuralnet.downstream_task.dpmon - INFO - Epoch [220/512], Loss: 1.0165
2025-05-16 15:23:53,480 - bioneuralnet.downstream_task.dpmon - INFO - Epoch [230/512], Loss: 1.0111
2025-05-16 15:23:53,574 - bioneuralnet.downstream_task.dpmon - INFO - Epoch [240/512], Loss: 1.0075
2025-05-16 15:23:53,670 - bioneuralnet.downstream_task.dpmon - INFO - Epoch [250/512], Loss: 1.0027
2025-05-16 15:23:53,764 - bioneuralnet.downstream_task.dpmon - INFO - Epoch [260/512], Loss: 0.9996
2025-05-16 15:23:53,868 - bioneuralnet.downstream_task.dpmon - INFO - Epoch [270/512], Loss: 0.9979
2025-05-16 15:23:53,962 - bioneuralnet.downstream_task.dpmon - INFO - Epoch [280/512], Loss: 1.0081
2025-05-16 15:23:54,056 - bioneuralnet.downstream_task.dpmon - INFO - Epoch [290/512], Loss: 1.0096
2025-05-16 15:23:54,150 - bioneuralnet.downstream_task.dpmon - INFO - Epoch [300/512], Loss: 1.0006
2025-05-16 15:23:54,243 - bioneuralnet.downstream_task.dpmon - INFO - Epoch [310/512], Loss: 0.9946
2025-05-16 15:23:54,337 - bioneuralnet.downstream_task.dpmon - INFO - Epoch [320/512], Loss: 0.9909
2025-05-16 15:23:54,431 - bioneuralnet.downstream_task.dpmon - INFO - Epoch [330/512], Loss: 0.9882
2025-05-16 15:23:54,525 - bioneuralnet.downstream_task.dpmon - INFO - Epoch [340/512], Loss: 0.9877
2025-05-16 15:23:54,619 - bioneuralnet.downstream_task.dpmon - INFO - Epoch [350/512], Loss: 0.9849
2025-05-16 15:23:54,712 - bioneuralnet.downstream_task.dpmon - INFO - Epoch [360/512], Loss: 0.9833
2025-05-16 15:23:54,806 - bioneuralnet.downstream_task.dpmon - INFO - Epoch [370/512], Loss: 0.9818
2025-05-16 15:23:54,900 - bioneuralnet.downstream_task.dpmon - INFO - Epoch [380/512], Loss: 0.9807
2025-05-16 15:23:54,993 - bioneuralnet.downstream_task.dpmon - INFO - Epoch [390/512], Loss: 1.0491
2025-05-16 15:23:55,087 - bioneuralnet.downstream_task.dpmon - INFO - Epoch [400/512], Loss: 0.9906
2025-05-16 15:23:55,181 - bioneuralnet.downstream_task.dpmon - INFO - Epoch [410/512], Loss: 0.9831
2025-05-16 15:23:55,275 - bioneuralnet.downstream_task.dpmon - INFO - Epoch [420/512], Loss: 0.9797
2025-05-16 15:23:55,370 - bioneuralnet.downstream_task.dpmon - INFO - Epoch [430/512], Loss: 0.9777
2025-05-16 15:23:55,464 - bioneuralnet.downstream_task.dpmon - INFO - Epoch [440/512], Loss: 0.9759
2025-05-16 15:23:55,558 - bioneuralnet.downstream_task.dpmon - INFO - Epoch [450/512], Loss: 0.9749
2025-05-16 15:23:55,652 - bioneuralnet.downstream_task.dpmon - INFO - Epoch [460/512], Loss: 0.9742
2025-05-16 15:23:55,747 - bioneuralnet.downstream_task.dpmon - INFO - Epoch [470/512], Loss: 0.9737
2025-05-16 15:23:55,851 - bioneuralnet.downstream_task.dpmon - INFO - Epoch [480/512], Loss: 0.9738
2025-05-16 15:23:55,944 - bioneuralnet.downstream_task.dpmon - INFO - Epoch [490/512], Loss: 0.9735
2025-05-16 15:23:56,038 - bioneuralnet.downstream_task.dpmon - INFO - Epoch [500/512], Loss: 0.9727
2025-05-16 15:23:56,132 - bioneuralnet.downstream_task.dpmon - INFO - Epoch [510/512], Loss: 0.9720
2025-05-16 15:23:56,155 - bioneuralnet.downstream_task.dpmon - INFO - Training Accuracy: 0.9558
2025-05-16 15:23:56,157 - bioneuralnet.downstream_task.dpmon - INFO - Model saved to /home/vicente/Github/BioNeuralNet/TCGA_BRCA/results/results1/dpm_model_iter_3.pth
2025-05-16 15:23:56,161 - bioneuralnet.downstream_task.dpmon - INFO - Best Accuracy: 0.9558
2025-05-16 15:23:56,162 - bioneuralnet.downstream_task.dpmon - INFO - Average Accuracy: 0.8587
2025-05-16 15:23:56,162 - bioneuralnet.downstream_task.dpmon - INFO - Standard Deviation of Accuracy: 0.1670
DPMON results:
Accuracy: 0.9557867360208062
F1 weighted: 0.9360974742812752
F1 macro: 0.7772360237077294
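Two points help when reading these numbers. First, the log labels the per-iteration figures as training accuracy, and the final accuracy printed here (0.9558) matches the best of the three iterations, which suggests that `predictions_df` holds the best run's predictions rather than an average. Second, the gap between weighted and macro F1 is what one would expect with imbalanced classes, since macro F1 gives the small classes the same weight as the large ones. The sketch below reproduces the logged run summary from the three logged accuracies (up to rounding) and illustrates the two averaging schemes. The per-class F1 values in it are hypothetical, chosen only to show the effect; they are not results from this run.

```python
from statistics import mean, stdev

# Per-iteration training accuracies copied from the log above.
run_accuracies = [0.6658, 0.9545, 0.9558]
avg_acc = mean(run_accuracies)
std_acc = stdev(run_accuracies)  # sample standard deviation (n - 1 denominator)
print(f"average={avg_acc:.4f} std={std_acc:.4f} best={max(run_accuracies):.4f}")

# Class sizes from the PAM50 counts above (codes 3, 4, 1, 2, 0).
support = [419, 140, 130, 46, 34]
# HYPOTHETICAL per-class F1 scores, for illustration only.
per_class_f1 = [0.98, 0.95, 0.97, 0.60, 0.40]

# Macro F1: every class counts equally.
macro_f1 = mean(per_class_f1)
# Weighted F1: each class counts in proportion to its number of samples.
weighted_f1 = sum(f * n for f, n in zip(per_class_f1, support)) / sum(support)
print(f"macro={macro_f1:.3f} weighted={weighted_f1:.3f}")
```

With the two smallest classes scoring poorly, the macro average drops well below the weighted one, even though most patients are classified correctly.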