{
 "cells": [
  {
   "cell_type": "markdown",
   "id": "d182cc95",
   "metadata": {},
   "source": [
    "# TCGA-BRCA Demo\n",
    "\n",
    "## Dataset Source\n",
    "\n",
    "- **Omics Data**: [FireHose BRCA](http://firebrowse.org/?cohort=BRCA)\n",
    "- **Clinical and PAM50 Data**: [TCGAbiolinks](http://bioconductor.org/packages/release/bioc/html/TCGAbiolinks.html)\n",
    "\n",
    "## Dataset Overview\n",
    "\n",
    "**Original Data**:\n",
    "\n",
    "- **Methylation**: 20,107 × 885\n",
    "- **mRNA**: 18,321 × 1,212\n",
    "- **miRNA**: 503 × 1,189\n",
    "- **PAM50**: 1,087 × 1\n",
    "- **Clinical**: 1,098 × 101\n",
    "\n",
    "**Note:** Omics matrices are features × samples; clinical matrices are samples × fields.\n",
    "\n",
    "### PAM50 Subtype Counts (Patient-Level Intersection)\n",
    "\n",
    "- **LumA**: 419\n",
    "- **LumB**: 140\n",
    "- **Basal**: 130\n",
    "- **Her2**: 46\n",
    "- **Normal**: 34\n",
    "\n",
    "## Patients in Every Dataset\n",
    "\n",
    "- Total patients present in methylation, mRNA, miRNA, PAM50, and clinical: **769**\n",
    "\n",
    "## Final Shapes (Per-Patient)\n",
    "\n",
    "After aggregating multiple aliquots by mean, all modalities align on 769 patients:\n",
    "\n",
    "- **Methylation**: 769 × 20,107\n",
    "- **mRNA**: 769 × 20,531\n",
    "- **miRNA**: 769 × 503\n",
    "- **PAM50**: 769 × 1\n",
    "- **Clinical**: 769 × 119\n",
    "\n",
    "## Data Summary Table\n",
    "\n",
    "| Stage                          | Clinical    | Methylation  | miRNA       | mRNA           | PAM50 (Subtype Counts)                                         | Notes                                   |\n",
    "| ------------------------------ | ----------- | ------------ | ----------- | -------------- | -------------------------------------------------------------- | --------------------------------------- |\n",
    "| **Original Raw Data**          | 1,098 × 101 | 20,107 × 885 | 503 × 1,189 | 18,321 × 1,212 | LumA: 509<br>LumB: 209<br>Basal: 192<br>Her2: 82<br>Normal: 40 | Raw FireHose & TCGAbiolinks files       |\n",
    "| **Patient-Level Intersection** | 769 × 101   | 769 × 20,107 | 769 × 503   | 769 × 20,531   | LumA: 419<br>LumB: 140<br>Basal: 130<br>Her2: 46<br>Normal: 34 | Patients with complete data in all sets |\n",
    "\n",
    "## Reference Links\n",
    "\n",
    "- [FireHose BRCA](http://firebrowse.org/?cohort=BRCA)\n",
    "- [TCGAbiolinks](http://bioconductor.org/packages/release/bioc/html/TCGAbiolinks.html)\n",
    "- [Direct Download BRCA](http://firebrowse.org/?cohort=BRCA&download_dialog=true)\n"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "c9698b74",
   "metadata": {},
   "source": [
    "### Let's take a look at the data from FireHose directly after download"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "id": "9c0bda23",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "mirna shape: (503, 1189), rna shape: (18321, 1212), meth shape: (20107, 885), clinical shape: (18, 1097)\n"
     ]
    },
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>TCGA-3C-AAAU-01</th>\n",
       "      <th>TCGA-3C-AALI-01</th>\n",
       "      <th>TCGA-3C-AALJ-01</th>\n",
       "      <th>TCGA-3C-AALK-01</th>\n",
       "      <th>TCGA-4H-AAAK-01</th>\n",
       "      <th>TCGA-5L-AAT0-01</th>\n",
       "      <th>TCGA-5L-AAT1-01</th>\n",
       "      <th>TCGA-5T-A9QA-01</th>\n",
       "      <th>TCGA-A1-A0SB-01</th>\n",
       "      <th>TCGA-A1-A0SD-01</th>\n",
       "      <th>...</th>\n",
       "      <th>TCGA-BH-A0WA-01</th>\n",
       "      <th>TCGA-E2-A105-01</th>\n",
       "      <th>TCGA-E2-A106-01</th>\n",
       "      <th>TCGA-E2-A107-01</th>\n",
       "      <th>TCGA-E2-A108-01</th>\n",
       "      <th>TCGA-E2-A109-01</th>\n",
       "      <th>TCGA-E2-A10B-01</th>\n",
       "      <th>TCGA-E2-A10C-01</th>\n",
       "      <th>TCGA-E2-A10E-01</th>\n",
       "      <th>TCGA-E2-A10F-01</th>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>gene</th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>hsa-let-7a-1</th>\n",
       "      <td>13.129765</td>\n",
       "      <td>12.918069</td>\n",
       "      <td>13.012033</td>\n",
       "      <td>13.144697</td>\n",
       "      <td>13.411684</td>\n",
       "      <td>13.316301</td>\n",
       "      <td>13.445230</td>\n",
       "      <td>13.727850</td>\n",
       "      <td>13.601504</td>\n",
       "      <td>13.598739</td>\n",
       "      <td>...</td>\n",
       "      <td>12.225132</td>\n",
       "      <td>13.938134</td>\n",
       "      <td>13.609853</td>\n",
       "      <td>13.508290</td>\n",
       "      <td>13.406359</td>\n",
       "      <td>13.730647</td>\n",
       "      <td>13.198426</td>\n",
       "      <td>12.793350</td>\n",
       "      <td>14.060268</td>\n",
       "      <td>12.990403</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>hsa-let-7a-2</th>\n",
       "      <td>14.117933</td>\n",
       "      <td>13.922300</td>\n",
       "      <td>14.010002</td>\n",
       "      <td>14.141721</td>\n",
       "      <td>14.413518</td>\n",
       "      <td>14.310917</td>\n",
       "      <td>14.448556</td>\n",
       "      <td>14.714551</td>\n",
       "      <td>14.608693</td>\n",
       "      <td>14.606942</td>\n",
       "      <td>...</td>\n",
       "      <td>13.235065</td>\n",
       "      <td>14.930021</td>\n",
       "      <td>14.603389</td>\n",
       "      <td>14.525026</td>\n",
       "      <td>14.402735</td>\n",
       "      <td>14.719166</td>\n",
       "      <td>14.200523</td>\n",
       "      <td>13.796623</td>\n",
       "      <td>15.047592</td>\n",
       "      <td>14.006035</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>hsa-let-7a-3</th>\n",
       "      <td>13.147714</td>\n",
       "      <td>12.913194</td>\n",
       "      <td>13.028483</td>\n",
       "      <td>13.151281</td>\n",
       "      <td>13.420481</td>\n",
       "      <td>13.327144</td>\n",
       "      <td>13.446806</td>\n",
       "      <td>13.736891</td>\n",
       "      <td>13.613105</td>\n",
       "      <td>13.606224</td>\n",
       "      <td>...</td>\n",
       "      <td>12.261971</td>\n",
       "      <td>13.972011</td>\n",
       "      <td>13.643274</td>\n",
       "      <td>13.549981</td>\n",
       "      <td>13.438737</td>\n",
       "      <td>13.732070</td>\n",
       "      <td>13.212367</td>\n",
       "      <td>12.793350</td>\n",
       "      <td>14.074978</td>\n",
       "      <td>13.018659</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>hsa-let-7b</th>\n",
       "      <td>14.595135</td>\n",
       "      <td>14.512657</td>\n",
       "      <td>13.419612</td>\n",
       "      <td>14.667196</td>\n",
       "      <td>14.438548</td>\n",
       "      <td>14.576493</td>\n",
       "      <td>14.611137</td>\n",
       "      <td>15.098805</td>\n",
       "      <td>16.505758</td>\n",
       "      <td>15.638855</td>\n",
       "      <td>...</td>\n",
       "      <td>14.684912</td>\n",
       "      <td>15.230457</td>\n",
       "      <td>15.357655</td>\n",
       "      <td>15.112011</td>\n",
       "      <td>15.040315</td>\n",
       "      <td>15.806771</td>\n",
       "      <td>15.645910</td>\n",
       "      <td>14.724106</td>\n",
       "      <td>16.370741</td>\n",
       "      <td>15.439239</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>hsa-let-7c</th>\n",
       "      <td>8.414890</td>\n",
       "      <td>9.646536</td>\n",
       "      <td>9.312455</td>\n",
       "      <td>11.511431</td>\n",
       "      <td>11.693927</td>\n",
       "      <td>11.138419</td>\n",
       "      <td>11.284446</td>\n",
       "      <td>9.197514</td>\n",
       "      <td>13.392164</td>\n",
       "      <td>11.419823</td>\n",
       "      <td>...</td>\n",
       "      <td>10.565698</td>\n",
       "      <td>10.483745</td>\n",
       "      <td>11.159056</td>\n",
       "      <td>12.473340</td>\n",
       "      <td>12.405828</td>\n",
       "      <td>10.613712</td>\n",
       "      <td>11.395452</td>\n",
       "      <td>9.087202</td>\n",
       "      <td>10.885520</td>\n",
       "      <td>11.385638</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "<p>5 rows × 1189 columns</p>\n",
       "</div>"
      ],
      "text/plain": [
       "              TCGA-3C-AAAU-01  TCGA-3C-AALI-01  TCGA-3C-AALJ-01  \\\n",
       "gene                                                              \n",
       "hsa-let-7a-1        13.129765        12.918069        13.012033   \n",
       "hsa-let-7a-2        14.117933        13.922300        14.010002   \n",
       "hsa-let-7a-3        13.147714        12.913194        13.028483   \n",
       "hsa-let-7b          14.595135        14.512657        13.419612   \n",
       "hsa-let-7c           8.414890         9.646536         9.312455   \n",
       "\n",
       "              TCGA-3C-AALK-01  TCGA-4H-AAAK-01  TCGA-5L-AAT0-01  \\\n",
       "gene                                                              \n",
       "hsa-let-7a-1        13.144697        13.411684        13.316301   \n",
       "hsa-let-7a-2        14.141721        14.413518        14.310917   \n",
       "hsa-let-7a-3        13.151281        13.420481        13.327144   \n",
       "hsa-let-7b          14.667196        14.438548        14.576493   \n",
       "hsa-let-7c          11.511431        11.693927        11.138419   \n",
       "\n",
       "              TCGA-5L-AAT1-01  TCGA-5T-A9QA-01  TCGA-A1-A0SB-01  \\\n",
       "gene                                                              \n",
       "hsa-let-7a-1        13.445230        13.727850        13.601504   \n",
       "hsa-let-7a-2        14.448556        14.714551        14.608693   \n",
       "hsa-let-7a-3        13.446806        13.736891        13.613105   \n",
       "hsa-let-7b          14.611137        15.098805        16.505758   \n",
       "hsa-let-7c          11.284446         9.197514        13.392164   \n",
       "\n",
       "              TCGA-A1-A0SD-01  ...  TCGA-BH-A0WA-01  TCGA-E2-A105-01  \\\n",
       "gene                           ...                                     \n",
       "hsa-let-7a-1        13.598739  ...        12.225132        13.938134   \n",
       "hsa-let-7a-2        14.606942  ...        13.235065        14.930021   \n",
       "hsa-let-7a-3        13.606224  ...        12.261971        13.972011   \n",
       "hsa-let-7b          15.638855  ...        14.684912        15.230457   \n",
       "hsa-let-7c          11.419823  ...        10.565698        10.483745   \n",
       "\n",
       "              TCGA-E2-A106-01  TCGA-E2-A107-01  TCGA-E2-A108-01  \\\n",
       "gene                                                              \n",
       "hsa-let-7a-1        13.609853        13.508290        13.406359   \n",
       "hsa-let-7a-2        14.603389        14.525026        14.402735   \n",
       "hsa-let-7a-3        13.643274        13.549981        13.438737   \n",
       "hsa-let-7b          15.357655        15.112011        15.040315   \n",
       "hsa-let-7c          11.159056        12.473340        12.405828   \n",
       "\n",
       "              TCGA-E2-A109-01  TCGA-E2-A10B-01  TCGA-E2-A10C-01  \\\n",
       "gene                                                              \n",
       "hsa-let-7a-1        13.730647        13.198426        12.793350   \n",
       "hsa-let-7a-2        14.719166        14.200523        13.796623   \n",
       "hsa-let-7a-3        13.732070        13.212367        12.793350   \n",
       "hsa-let-7b          15.806771        15.645910        14.724106   \n",
       "hsa-let-7c          10.613712        11.395452         9.087202   \n",
       "\n",
       "              TCGA-E2-A10E-01  TCGA-E2-A10F-01  \n",
       "gene                                            \n",
       "hsa-let-7a-1        14.060268        12.990403  \n",
       "hsa-let-7a-2        15.047592        14.006035  \n",
       "hsa-let-7a-3        14.074978        13.018659  \n",
       "hsa-let-7b          16.370741        15.439239  \n",
       "hsa-let-7c          10.885520        11.385638  \n",
       "\n",
       "[5 rows x 1189 columns]"
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    },
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>TCGA-3C-AAAU-01</th>\n",
       "      <th>TCGA-3C-AALI-01</th>\n",
       "      <th>TCGA-3C-AALJ-01</th>\n",
       "      <th>TCGA-3C-AALK-01</th>\n",
       "      <th>TCGA-4H-AAAK-01</th>\n",
       "      <th>TCGA-5L-AAT0-01</th>\n",
       "      <th>TCGA-5L-AAT1-01</th>\n",
       "      <th>TCGA-5T-A9QA-01</th>\n",
       "      <th>TCGA-A1-A0SB-01</th>\n",
       "      <th>TCGA-A1-A0SD-01</th>\n",
       "      <th>...</th>\n",
       "      <th>TCGA-UL-AAZ6-01</th>\n",
       "      <th>TCGA-UU-A93S-01</th>\n",
       "      <th>TCGA-V7-A7HQ-01</th>\n",
       "      <th>TCGA-W8-A86G-01</th>\n",
       "      <th>TCGA-WT-AB41-01</th>\n",
       "      <th>TCGA-WT-AB44-01</th>\n",
       "      <th>TCGA-XX-A899-01</th>\n",
       "      <th>TCGA-XX-A89A-01</th>\n",
       "      <th>TCGA-Z7-A8R5-01</th>\n",
       "      <th>TCGA-Z7-A8R6-01</th>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>gene</th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>?|100133144</th>\n",
       "      <td>4.032489</td>\n",
       "      <td>3.211931</td>\n",
       "      <td>3.538886</td>\n",
       "      <td>3.595671</td>\n",
       "      <td>2.775430</td>\n",
       "      <td>1.995991</td>\n",
       "      <td>NaN</td>\n",
       "      <td>0.550310</td>\n",
       "      <td>3.939189</td>\n",
       "      <td>3.250628</td>\n",
       "      <td>...</td>\n",
       "      <td>-1.324816</td>\n",
       "      <td>2.108558</td>\n",
       "      <td>NaN</td>\n",
       "      <td>2.475707</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>3.846574</td>\n",
       "      <td>4.480524</td>\n",
       "      <td>1.178747</td>\n",
       "      <td>2.783771</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>?|100134869</th>\n",
       "      <td>3.692829</td>\n",
       "      <td>4.119273</td>\n",
       "      <td>3.206237</td>\n",
       "      <td>3.469873</td>\n",
       "      <td>3.850979</td>\n",
       "      <td>3.766489</td>\n",
       "      <td>3.405298</td>\n",
       "      <td>3.169252</td>\n",
       "      <td>3.847346</td>\n",
       "      <td>3.501324</td>\n",
       "      <td>...</td>\n",
       "      <td>3.845189</td>\n",
       "      <td>3.443978</td>\n",
       "      <td>1.622556</td>\n",
       "      <td>3.845099</td>\n",
       "      <td>2.657434</td>\n",
       "      <td>1.703987</td>\n",
       "      <td>4.422294</td>\n",
       "      <td>4.769476</td>\n",
       "      <td>2.866572</td>\n",
       "      <td>4.631075</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>?|10357</th>\n",
       "      <td>5.704604</td>\n",
       "      <td>6.124231</td>\n",
       "      <td>7.269570</td>\n",
       "      <td>7.168565</td>\n",
       "      <td>6.395968</td>\n",
       "      <td>6.836141</td>\n",
       "      <td>6.857961</td>\n",
       "      <td>6.749035</td>\n",
       "      <td>6.862786</td>\n",
       "      <td>5.913201</td>\n",
       "      <td>...</td>\n",
       "      <td>7.083470</td>\n",
       "      <td>7.088829</td>\n",
       "      <td>4.906766</td>\n",
       "      <td>7.003547</td>\n",
       "      <td>5.744909</td>\n",
       "      <td>5.401368</td>\n",
       "      <td>7.106177</td>\n",
       "      <td>6.003213</td>\n",
       "      <td>6.410173</td>\n",
       "      <td>7.388457</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>?|10431</th>\n",
       "      <td>8.672694</td>\n",
       "      <td>9.139279</td>\n",
       "      <td>10.410275</td>\n",
       "      <td>9.757450</td>\n",
       "      <td>9.581922</td>\n",
       "      <td>9.657753</td>\n",
       "      <td>10.114256</td>\n",
       "      <td>10.472185</td>\n",
       "      <td>9.360367</td>\n",
       "      <td>9.933569</td>\n",
       "      <td>...</td>\n",
       "      <td>10.616682</td>\n",
       "      <td>11.495054</td>\n",
       "      <td>10.749770</td>\n",
       "      <td>9.446410</td>\n",
       "      <td>10.282241</td>\n",
       "      <td>10.874534</td>\n",
       "      <td>9.350400</td>\n",
       "      <td>9.497295</td>\n",
       "      <td>10.155173</td>\n",
       "      <td>9.970921</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>?|155060</th>\n",
       "      <td>10.213110</td>\n",
       "      <td>9.011343</td>\n",
       "      <td>9.209506</td>\n",
       "      <td>9.110487</td>\n",
       "      <td>8.027083</td>\n",
       "      <td>8.110023</td>\n",
       "      <td>7.704865</td>\n",
       "      <td>6.254741</td>\n",
       "      <td>8.128052</td>\n",
       "      <td>6.387132</td>\n",
       "      <td>...</td>\n",
       "      <td>8.052478</td>\n",
       "      <td>7.516236</td>\n",
       "      <td>9.280761</td>\n",
       "      <td>9.631306</td>\n",
       "      <td>8.137225</td>\n",
       "      <td>9.460539</td>\n",
       "      <td>8.738651</td>\n",
       "      <td>8.556414</td>\n",
       "      <td>7.977670</td>\n",
       "      <td>7.894918</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "<p>5 rows × 1212 columns</p>\n",
       "</div>"
      ],
      "text/plain": [
       "             TCGA-3C-AAAU-01  TCGA-3C-AALI-01  TCGA-3C-AALJ-01  \\\n",
       "gene                                                             \n",
       "?|100133144         4.032489         3.211931         3.538886   \n",
       "?|100134869         3.692829         4.119273         3.206237   \n",
       "?|10357             5.704604         6.124231         7.269570   \n",
       "?|10431             8.672694         9.139279        10.410275   \n",
       "?|155060           10.213110         9.011343         9.209506   \n",
       "\n",
       "             TCGA-3C-AALK-01  TCGA-4H-AAAK-01  TCGA-5L-AAT0-01  \\\n",
       "gene                                                             \n",
       "?|100133144         3.595671         2.775430         1.995991   \n",
       "?|100134869         3.469873         3.850979         3.766489   \n",
       "?|10357             7.168565         6.395968         6.836141   \n",
       "?|10431             9.757450         9.581922         9.657753   \n",
       "?|155060            9.110487         8.027083         8.110023   \n",
       "\n",
       "             TCGA-5L-AAT1-01  TCGA-5T-A9QA-01  TCGA-A1-A0SB-01  \\\n",
       "gene                                                             \n",
       "?|100133144              NaN         0.550310         3.939189   \n",
       "?|100134869         3.405298         3.169252         3.847346   \n",
       "?|10357             6.857961         6.749035         6.862786   \n",
       "?|10431            10.114256        10.472185         9.360367   \n",
       "?|155060            7.704865         6.254741         8.128052   \n",
       "\n",
       "             TCGA-A1-A0SD-01  ...  TCGA-UL-AAZ6-01  TCGA-UU-A93S-01  \\\n",
       "gene                          ...                                     \n",
       "?|100133144         3.250628  ...        -1.324816         2.108558   \n",
       "?|100134869         3.501324  ...         3.845189         3.443978   \n",
       "?|10357             5.913201  ...         7.083470         7.088829   \n",
       "?|10431             9.933569  ...        10.616682        11.495054   \n",
       "?|155060            6.387132  ...         8.052478         7.516236   \n",
       "\n",
       "             TCGA-V7-A7HQ-01  TCGA-W8-A86G-01  TCGA-WT-AB41-01  \\\n",
       "gene                                                             \n",
       "?|100133144              NaN         2.475707              NaN   \n",
       "?|100134869         1.622556         3.845099         2.657434   \n",
       "?|10357             4.906766         7.003547         5.744909   \n",
       "?|10431            10.749770         9.446410        10.282241   \n",
       "?|155060            9.280761         9.631306         8.137225   \n",
       "\n",
       "             TCGA-WT-AB44-01  TCGA-XX-A899-01  TCGA-XX-A89A-01  \\\n",
       "gene                                                             \n",
       "?|100133144              NaN         3.846574         4.480524   \n",
       "?|100134869         1.703987         4.422294         4.769476   \n",
       "?|10357             5.401368         7.106177         6.003213   \n",
       "?|10431            10.874534         9.350400         9.497295   \n",
       "?|155060            9.460539         8.738651         8.556414   \n",
       "\n",
       "             TCGA-Z7-A8R5-01  TCGA-Z7-A8R6-01  \n",
       "gene                                           \n",
       "?|100133144         1.178747         2.783771  \n",
       "?|100134869         2.866572         4.631075  \n",
       "?|10357             6.410173         7.388457  \n",
       "?|10431            10.155173         9.970921  \n",
       "?|155060            7.977670         7.894918  \n",
       "\n",
       "[5 rows x 1212 columns]"
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    },
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>TCGA-3C-AAAU-01</th>\n",
       "      <th>TCGA-3C-AALI-01</th>\n",
       "      <th>TCGA-3C-AALJ-01</th>\n",
       "      <th>TCGA-3C-AALK-01</th>\n",
       "      <th>TCGA-4H-AAAK-01</th>\n",
       "      <th>TCGA-5L-AAT0-01</th>\n",
       "      <th>TCGA-5L-AAT1-01</th>\n",
       "      <th>TCGA-5T-A9QA-01</th>\n",
       "      <th>TCGA-A1-A0SB-01</th>\n",
       "      <th>TCGA-A1-A0SE-01</th>\n",
       "      <th>...</th>\n",
       "      <th>TCGA-UL-AAZ6-01</th>\n",
       "      <th>TCGA-UU-A93S-01</th>\n",
       "      <th>TCGA-V7-A7HQ-01</th>\n",
       "      <th>TCGA-W8-A86G-01</th>\n",
       "      <th>TCGA-WT-AB41-01</th>\n",
       "      <th>TCGA-WT-AB44-01</th>\n",
       "      <th>TCGA-XX-A899-01</th>\n",
       "      <th>TCGA-XX-A89A-01</th>\n",
       "      <th>TCGA-Z7-A8R5-01</th>\n",
       "      <th>TCGA-Z7-A8R6-01</th>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>Hybridization REF</th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>Composite Element REF</th>\n",
       "      <td>Beta_Value</td>\n",
       "      <td>Beta_Value</td>\n",
       "      <td>Beta_Value</td>\n",
       "      <td>Beta_Value</td>\n",
       "      <td>Beta_Value</td>\n",
       "      <td>Beta_Value</td>\n",
       "      <td>Beta_Value</td>\n",
       "      <td>Beta_Value</td>\n",
       "      <td>Beta_Value</td>\n",
       "      <td>Beta_Value</td>\n",
       "      <td>...</td>\n",
       "      <td>Beta_Value</td>\n",
       "      <td>Beta_Value</td>\n",
       "      <td>Beta_Value</td>\n",
       "      <td>Beta_Value</td>\n",
       "      <td>Beta_Value</td>\n",
       "      <td>Beta_Value</td>\n",
       "      <td>Beta_Value</td>\n",
       "      <td>Beta_Value</td>\n",
       "      <td>Beta_Value</td>\n",
       "      <td>Beta_Value</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>A1BG</th>\n",
       "      <td>0.483716119676</td>\n",
       "      <td>0.637191226131</td>\n",
       "      <td>0.656092398242</td>\n",
       "      <td>0.615194471357</td>\n",
       "      <td>0.612080370511</td>\n",
       "      <td>0.469600740678</td>\n",
       "      <td>0.582188239422</td>\n",
       "      <td>0.66617073097</td>\n",
       "      <td>0.659965611959</td>\n",
       "      <td>0.641701155202</td>\n",
       "      <td>...</td>\n",
       "      <td>0.631413241724</td>\n",
       "      <td>0.64952294395</td>\n",
       "      <td>0.596585169597</td>\n",
       "      <td>0.615558357651</td>\n",
       "      <td>0.580837880262</td>\n",
       "      <td>0.615814023324</td>\n",
       "      <td>0.589897794957</td>\n",
       "      <td>0.572606636128</td>\n",
       "      <td>0.617859586161</td>\n",
       "      <td>0.568150149265</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>A1CF</th>\n",
       "      <td>0.295827203492</td>\n",
       "      <td>0.458972998571</td>\n",
       "      <td>0.489725289638</td>\n",
       "      <td>0.625765223243</td>\n",
       "      <td>0.507736509665</td>\n",
       "      <td>0.514770866326</td>\n",
       "      <td>0.549850958729</td>\n",
       "      <td>0.381038654448</td>\n",
       "      <td>0.826312156393</td>\n",
       "      <td>0.606699429409</td>\n",
       "      <td>...</td>\n",
       "      <td>0.383469192855</td>\n",
       "      <td>0.183354853938</td>\n",
       "      <td>0.403909161312</td>\n",
       "      <td>0.716980255014</td>\n",
       "      <td>0.613131295074</td>\n",
       "      <td>0.665043713213</td>\n",
       "      <td>0.705153725375</td>\n",
       "      <td>0.494848686021</td>\n",
       "      <td>0.691835387189</td>\n",
       "      <td>0.224696596211</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>A2BP1</th>\n",
       "      <td>0.187699869591</td>\n",
       "      <td>0.240515847704</td>\n",
       "      <td>0.279087851226</td>\n",
       "      <td>0.488888510474</td>\n",
       "      <td>0.463845494635</td>\n",
       "      <td>0.504450855353</td>\n",
       "      <td>0.480885816745</td>\n",
       "      <td>0.622832399216</td>\n",
       "      <td>0.474678831563</td>\n",
       "      <td>0.339829506578</td>\n",
       "      <td>...</td>\n",
       "      <td>0.130529915536</td>\n",
       "      <td>0.319855310743</td>\n",
       "      <td>0.335517456053</td>\n",
       "      <td>0.512185396638</td>\n",
       "      <td>0.563519806811</td>\n",
       "      <td>0.507364324635</td>\n",
       "      <td>0.520542747167</td>\n",
       "      <td>0.412562068574</td>\n",
       "      <td>0.522169978143</td>\n",
       "      <td>0.33955834608</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>A2LD1</th>\n",
       "      <td>0.62958551322</td>\n",
       "      <td>0.666272288675</td>\n",
       "      <td>0.755630499986</td>\n",
       "      <td>0.74575121287</td>\n",
       "      <td>0.698515739124</td>\n",
       "      <td>0.706812706661</td>\n",
       "      <td>0.759017355996</td>\n",
       "      <td>0.694010939885</td>\n",
       "      <td>0.847837522256</td>\n",
       "      <td>0.786662091353</td>\n",
       "      <td>...</td>\n",
       "      <td>0.587475995313</td>\n",
       "      <td>0.667969642321</td>\n",
       "      <td>0.689140211036</td>\n",
       "      <td>0.791381283524</td>\n",
       "      <td>0.680499323148</td>\n",
       "      <td>0.660476360054</td>\n",
       "      <td>0.745725420412</td>\n",
       "      <td>0.74390049875</td>\n",
       "      <td>0.791229999577</td>\n",
       "      <td>0.637764188841</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "<p>5 rows × 885 columns</p>\n",
       "</div>"
      ],
      "text/plain": [
       "                      TCGA-3C-AAAU-01 TCGA-3C-AALI-01 TCGA-3C-AALJ-01  \\\n",
       "Hybridization REF                                                       \n",
       "Composite Element REF      Beta_Value      Beta_Value      Beta_Value   \n",
       "A1BG                   0.483716119676  0.637191226131  0.656092398242   \n",
       "A1CF                   0.295827203492  0.458972998571  0.489725289638   \n",
       "A2BP1                  0.187699869591  0.240515847704  0.279087851226   \n",
       "A2LD1                   0.62958551322  0.666272288675  0.755630499986   \n",
       "\n",
       "                      TCGA-3C-AALK-01 TCGA-4H-AAAK-01 TCGA-5L-AAT0-01  \\\n",
       "Hybridization REF                                                       \n",
       "Composite Element REF      Beta_Value      Beta_Value      Beta_Value   \n",
       "A1BG                   0.615194471357  0.612080370511  0.469600740678   \n",
       "A1CF                   0.625765223243  0.507736509665  0.514770866326   \n",
       "A2BP1                  0.488888510474  0.463845494635  0.504450855353   \n",
       "A2LD1                   0.74575121287  0.698515739124  0.706812706661   \n",
       "\n",
       "                      TCGA-5L-AAT1-01 TCGA-5T-A9QA-01 TCGA-A1-A0SB-01  \\\n",
       "Hybridization REF                                                       \n",
       "Composite Element REF      Beta_Value      Beta_Value      Beta_Value   \n",
       "A1BG                   0.582188239422   0.66617073097  0.659965611959   \n",
       "A1CF                   0.549850958729  0.381038654448  0.826312156393   \n",
       "A2BP1                  0.480885816745  0.622832399216  0.474678831563   \n",
       "A2LD1                  0.759017355996  0.694010939885  0.847837522256   \n",
       "\n",
       "                      TCGA-A1-A0SE-01  ... TCGA-UL-AAZ6-01 TCGA-UU-A93S-01  \\\n",
       "Hybridization REF                      ...                                   \n",
       "Composite Element REF      Beta_Value  ...      Beta_Value      Beta_Value   \n",
       "A1BG                   0.641701155202  ...  0.631413241724   0.64952294395   \n",
       "A1CF                   0.606699429409  ...  0.383469192855  0.183354853938   \n",
       "A2BP1                  0.339829506578  ...  0.130529915536  0.319855310743   \n",
       "A2LD1                  0.786662091353  ...  0.587475995313  0.667969642321   \n",
       "\n",
       "                      TCGA-V7-A7HQ-01 TCGA-W8-A86G-01 TCGA-WT-AB41-01  \\\n",
       "Hybridization REF                                                       \n",
       "Composite Element REF      Beta_Value      Beta_Value      Beta_Value   \n",
       "A1BG                   0.596585169597  0.615558357651  0.580837880262   \n",
       "A1CF                   0.403909161312  0.716980255014  0.613131295074   \n",
       "A2BP1                  0.335517456053  0.512185396638  0.563519806811   \n",
       "A2LD1                  0.689140211036  0.791381283524  0.680499323148   \n",
       "\n",
       "                      TCGA-WT-AB44-01 TCGA-XX-A899-01 TCGA-XX-A89A-01  \\\n",
       "Hybridization REF                                                       \n",
       "Composite Element REF      Beta_Value      Beta_Value      Beta_Value   \n",
       "A1BG                   0.615814023324  0.589897794957  0.572606636128   \n",
       "A1CF                   0.665043713213  0.705153725375  0.494848686021   \n",
       "A2BP1                  0.507364324635  0.520542747167  0.412562068574   \n",
       "A2LD1                  0.660476360054  0.745725420412   0.74390049875   \n",
       "\n",
       "                      TCGA-Z7-A8R5-01 TCGA-Z7-A8R6-01  \n",
       "Hybridization REF                                      \n",
       "Composite Element REF      Beta_Value      Beta_Value  \n",
       "A1BG                   0.617859586161  0.568150149265  \n",
       "A1CF                   0.691835387189  0.224696596211  \n",
       "A2BP1                  0.522169978143   0.33955834608  \n",
       "A2LD1                  0.791229999577  0.637764188841  \n",
       "\n",
       "[5 rows x 885 columns]"
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    },
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>tcga-5l-aat0</th>\n",
       "      <th>tcga-5l-aat1</th>\n",
       "      <th>tcga-a1-a0sp</th>\n",
       "      <th>tcga-a2-a04v</th>\n",
       "      <th>tcga-a2-a04y</th>\n",
       "      <th>tcga-a2-a0cq</th>\n",
       "      <th>tcga-a2-a1g4</th>\n",
       "      <th>tcga-a2-a25a</th>\n",
       "      <th>tcga-a7-a0cd</th>\n",
       "      <th>tcga-a7-a13g</th>\n",
       "      <th>...</th>\n",
       "      <th>tcga-s3-aa11</th>\n",
       "      <th>tcga-s3-aa14</th>\n",
       "      <th>tcga-s3-aa15</th>\n",
       "      <th>tcga-ul-aaz6</th>\n",
       "      <th>tcga-uu-a93s</th>\n",
       "      <th>tcga-v7-a7hq</th>\n",
       "      <th>tcga-wt-ab44</th>\n",
       "      <th>tcga-xx-a899</th>\n",
       "      <th>tcga-xx-a89a</th>\n",
       "      <th>tcga-z7-a8r6</th>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>Hybridization REF</th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>Composite Element REF</th>\n",
       "      <td>value</td>\n",
       "      <td>value</td>\n",
       "      <td>value</td>\n",
       "      <td>value</td>\n",
       "      <td>value</td>\n",
       "      <td>value</td>\n",
       "      <td>value</td>\n",
       "      <td>value</td>\n",
       "      <td>value</td>\n",
       "      <td>value</td>\n",
       "      <td>...</td>\n",
       "      <td>value</td>\n",
       "      <td>value</td>\n",
       "      <td>value</td>\n",
       "      <td>value</td>\n",
       "      <td>value</td>\n",
       "      <td>value</td>\n",
       "      <td>value</td>\n",
       "      <td>value</td>\n",
       "      <td>value</td>\n",
       "      <td>value</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>years_to_birth</th>\n",
       "      <td>42</td>\n",
       "      <td>63</td>\n",
       "      <td>40</td>\n",
       "      <td>39</td>\n",
       "      <td>53</td>\n",
       "      <td>62</td>\n",
       "      <td>71</td>\n",
       "      <td>44</td>\n",
       "      <td>66</td>\n",
       "      <td>79</td>\n",
       "      <td>...</td>\n",
       "      <td>67</td>\n",
       "      <td>47</td>\n",
       "      <td>51</td>\n",
       "      <td>73</td>\n",
       "      <td>63</td>\n",
       "      <td>75</td>\n",
       "      <td>NaN</td>\n",
       "      <td>46</td>\n",
       "      <td>68</td>\n",
       "      <td>46</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>vital_status</th>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>1</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>...</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>1</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>days_to_death</th>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>1920</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>...</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>116</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>days_to_last_followup</th>\n",
       "      <td>1477</td>\n",
       "      <td>1471</td>\n",
       "      <td>584</td>\n",
       "      <td>NaN</td>\n",
       "      <td>1099</td>\n",
       "      <td>2695</td>\n",
       "      <td>595</td>\n",
       "      <td>3276</td>\n",
       "      <td>1165</td>\n",
       "      <td>718</td>\n",
       "      <td>...</td>\n",
       "      <td>421</td>\n",
       "      <td>529</td>\n",
       "      <td>525</td>\n",
       "      <td>518</td>\n",
       "      <td>NaN</td>\n",
       "      <td>2033</td>\n",
       "      <td>883</td>\n",
       "      <td>467</td>\n",
       "      <td>488</td>\n",
       "      <td>3256</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "<p>5 rows × 1097 columns</p>\n",
       "</div>"
      ],
      "text/plain": [
       "                      tcga-5l-aat0 tcga-5l-aat1 tcga-a1-a0sp tcga-a2-a04v  \\\n",
       "Hybridization REF                                                           \n",
       "Composite Element REF        value        value        value        value   \n",
       "years_to_birth                  42           63           40           39   \n",
       "vital_status                     0            0            0            1   \n",
       "days_to_death                  NaN          NaN          NaN         1920   \n",
       "days_to_last_followup         1477         1471          584          NaN   \n",
       "\n",
       "                      tcga-a2-a04y tcga-a2-a0cq tcga-a2-a1g4 tcga-a2-a25a  \\\n",
       "Hybridization REF                                                           \n",
       "Composite Element REF        value        value        value        value   \n",
       "years_to_birth                  53           62           71           44   \n",
       "vital_status                     0            0            0            0   \n",
       "days_to_death                  NaN          NaN          NaN          NaN   \n",
       "days_to_last_followup         1099         2695          595         3276   \n",
       "\n",
       "                      tcga-a7-a0cd tcga-a7-a13g  ... tcga-s3-aa11  \\\n",
       "Hybridization REF                                ...                \n",
       "Composite Element REF        value        value  ...        value   \n",
       "years_to_birth                  66           79  ...           67   \n",
       "vital_status                     0            0  ...            0   \n",
       "days_to_death                  NaN          NaN  ...          NaN   \n",
       "days_to_last_followup         1165          718  ...          421   \n",
       "\n",
       "                      tcga-s3-aa14 tcga-s3-aa15 tcga-ul-aaz6 tcga-uu-a93s  \\\n",
       "Hybridization REF                                                           \n",
       "Composite Element REF        value        value        value        value   \n",
       "years_to_birth                  47           51           73           63   \n",
       "vital_status                     0            0            0            1   \n",
       "days_to_death                  NaN          NaN          NaN          116   \n",
       "days_to_last_followup          529          525          518          NaN   \n",
       "\n",
       "                      tcga-v7-a7hq tcga-wt-ab44 tcga-xx-a899 tcga-xx-a89a  \\\n",
       "Hybridization REF                                                           \n",
       "Composite Element REF        value        value        value        value   \n",
       "years_to_birth                  75          NaN           46           68   \n",
       "vital_status                     0            0            0            0   \n",
       "days_to_death                  NaN          NaN          NaN          NaN   \n",
       "days_to_last_followup         2033          883          467          488   \n",
       "\n",
       "                      tcga-z7-a8r6  \n",
       "Hybridization REF                   \n",
       "Composite Element REF        value  \n",
       "years_to_birth                  46  \n",
       "vital_status                     0  \n",
       "days_to_death                  NaN  \n",
       "days_to_last_followup         3256  \n",
       "\n",
       "[5 rows x 1097 columns]"
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    }
   ],
   "source": [
    "import pandas as pd\n",
    "from pathlib import Path\n",
    "root = Path(\"/home/vicente/Github/BioNeuralNet/TCGA_BRCA_DATA\")\n",
    "\n",
    "mirna_raw = pd.read_csv(root/\"BRCA.miRseq_RPKM_log2.txt\", sep=\"\\t\",index_col=0,low_memory=False)                            \n",
    "rna_raw = pd.read_csv(root / \"BRCA.uncv2.mRNAseq_RSEM_normalized_log2.txt\", sep=\"\\t\",index_col=0,low_memory=False)\n",
    "meth_raw = pd.read_csv(root/\"BRCA.meth.by_mean.data.txt\", sep='\\t',index_col=0,low_memory=False)\n",
    "clinical_raw = pd.read_csv(root / \"BRCA.clin.merged.picked.txt\",sep=\"\\t\", index_col=0, low_memory=False)\n",
    "\n",
    "print(f\"mirna shape: {mirna_raw.shape}, rna shape: {rna_raw.shape}, meth shape: {meth_raw.shape}, clinical shape: {clinical_raw.shape}\")\n",
    "display(mirna_raw.head())\n",
    "display(rna_raw.head())\n",
    "display(meth_raw.head())\n",
    "display(clinical_raw.head())"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "aacae339",
   "metadata": {},
   "source": [
    "## TCGAbiolinks\n",
    "\n",
    "This section demonstrates how to use the `TCGAbiolinks` R package to access and download clinical and molecular subtype data. It begins by ensuring `TCGAbiolinks` is installed, then loads the package. It retrieves PAM50 molecular subtype labels using `TCGAquery_subtype()` and writes them to a CSV file. Additionally, it downloads clinical data using `GDCquery_clinic()` and formats it with `GDCprepare_clinic()`, saving the result as another CSV file."
   ]
  },
  {
   "cell_type": "markdown",
   "id": "a445601f",
   "metadata": {},
   "source": [
    "```R\n",
    "  # Install TCGAbiolinks\n",
    "  if (!requireNamespace(\"TCGAbiolinks\", quietly = TRUE)) {\n",
    "    if (!requireNamespace(\"BiocManager\", quietly = TRUE))\n",
    "      install.packages(\"BiocManager\")\n",
    "    BiocManager::install(\"TCGAbiolinks\")\n",
    "  }\n",
    "\n",
    "  # Load the library\n",
    "  library(TCGAbiolinks)\n",
    "\n",
    "  # Download PAM50 subtype labels\n",
    "  pam50_df <- TCGAquery_subtype(tumor = \"BRCA\")[ , c(\"patient\", \"BRCA_Subtype_PAM50\")]\n",
    "  write.csv(pam50_df, file = \"BRCA_PAM50_labels.csv\", row.names = FALSE, quote = FALSE)\n",
    "\n",
    "  # Download clinical data\n",
    "  clin_raw <- GDCquery_clinic(project = \"TCGA-BRCA\", type = \"clinical\")\n",
    "  clin_df <- GDCprepare_clinic(clin_raw, clinical.info = \"patient\")\n",
    "  write.csv(clin_df, file = \"BRCA_clinical_data.csv\", row.names = FALSE, quote = FALSE)\n",
    "```"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "id": "128f63dd",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Initial shapes\n",
      "meth: (20107, 885)\n",
      "rna: (18321, 1212)\n",
      "mirna: (503, 1189)\n",
      "pam50: (1087, 1)\n",
      "clinical TCGABioLinks: (1098, 101)\n",
      "clinical FireHose: (1097, 18)\n",
      "\n",
      "After tranpose\n",
      "meth: (885, 20107)\n",
      "rna: (1212, 18321)\n",
      "mirna: (1189, 503)\n",
      "Patients in both clinical datasets: 1097\n",
      "Combined Clinical shape (1097, 119)\n",
      "Patients in every dataset: 769\n",
      "\n",
      "Final shapes:\n",
      "meth: (863, 20107)\n",
      "rna: (865, 18321)\n",
      "mirna: (855, 503)\n",
      "pam50: (769, 1)\n",
      "clinical: (769, 119)\n",
      "\n"
     ]
    }
   ],
   "source": [
    "import pandas as pd\n",
    "\n",
    "# from Firehose\n",
    "mirna = pd.read_csv(root/\"BRCA.miRseq_RPKM_log2.txt\", sep=\"\\t\",index_col=0,low_memory=False)\n",
    "meth = pd.read_csv(root/\"BRCA.meth.by_mean.data.txt\", sep='\\t',index_col=0,low_memory=False)                             \n",
    "rna = pd.read_csv(root / \"BRCA.uncv2.mRNAseq_RSEM_normalized_log2.txt\", sep=\"\\t\",index_col=0,low_memory=False)\n",
    "clinical_firehose = pd.read_csv(root / \"BRCA.clin.merged.picked.txt\",sep=\"\\t\", index_col=0, low_memory=False).T\n",
    "\n",
    "# from TCGABiolinks\n",
    "pam50 = pd.read_csv(root /\"BRCA_PAM50_labels.csv\",index_col=0)\n",
    "clinical_biolinks = pd.read_csv(root /\"BRCA_clinical_data.csv\",index_col=1)\n",
    "\n",
    "print(\"Initial shapes\")\n",
    "print(f\"meth: {meth.shape}\")\n",
    "print(f\"rna: {rna.shape}\")\n",
    "print(f\"mirna: {mirna.shape}\")\n",
    "print(f\"pam50: {pam50.shape}\")\n",
    "print(f\"clinical TCGABioLinks: {clinical_biolinks.shape}\")\n",
    "print(f\"clinical FireHose: {clinical_firehose.shape}\")\n",
    "\n",
    "meth = meth.T\n",
    "rna = rna.T\n",
    "mirna = mirna.T\n",
    "\n",
    "print(\"\\nAfter tranpose\")\n",
    "print(f\"meth: {meth.shape}\")\n",
    "print(f\"rna: {rna.shape}\")\n",
    "print(f\"mirna: {mirna.shape}\")\n",
    "\n",
    "def trim(idx):\n",
    "    return idx.to_series().str.extract(r'(^TCGA-\\w\\w-\\w\\w\\w\\w)')[0]\n",
    "\n",
    "meth.index = trim(meth.index)\n",
    "rna.index = trim(rna.index)\n",
    "mirna.index = trim(mirna.index)\n",
    "pam50.index = pam50.index.str.upper()\n",
    "clinical_biolinks.index = clinical_biolinks.index.str.upper()\n",
    "clinical_firehose.index = clinical_firehose.index.str.upper()\n",
    "\n",
    "idx1 = clinical_biolinks.index\n",
    "idx2 = clinical_firehose.index\n",
    "\n",
    "# intersection and unique counts\n",
    "common = idx1.intersection(idx2)\n",
    "only_in_1 = idx1.difference(idx2)\n",
    "only_in_2 = idx2.difference(idx1)\n",
    "\n",
    "print(f\"Patients in both clinical datasets: {len(common)}\")\n",
    "common = clinical_biolinks.index.intersection(clinical_firehose.index)\n",
    "clinical_biolinks = clinical_biolinks.loc[common]\n",
    "clinical_firehose = clinical_firehose.loc[common]\n",
    "\n",
    "clinical = pd.concat([clinical_biolinks, clinical_firehose], axis=1)\n",
    "\n",
    "print(f\"Combined Clinical shape {clinical.shape}\")\n",
    "\n",
    "common = sorted(set(meth.index) & set(rna.index) & set(mirna.index) & set(pam50.index) & set(clinical.index))\n",
    "print(f\"Patients in every dataset: {len(common)}\")\n",
    "\n",
    "meth = meth.loc[common]\n",
    "rna = rna.loc[common]\n",
    "mirna = mirna.loc[common]\n",
    "pam50 = pam50.loc[common]\n",
    "clinical = clinical.loc[common]\n",
    "\n",
    "print(\"\\nFinal shapes:\")\n",
    "print(f\"meth: {meth.shape}\")\n",
    "print(f\"rna: {rna.shape}\")\n",
    "print(f\"mirna: {mirna.shape}\")\n",
    "print(f\"pam50: {pam50.shape}\")\n",
    "print(f\"clinical: {clinical.shape}\\n\")"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "32ba4b2c",
   "metadata": {},
   "source": [
    "### Handling Multiple Aliquots per Sample\n",
    "\n",
    "This section addresses cases where some patients have multiple aliquots per sample in the `meth`, `rna`, and `mirna` datasets. It first identifies and counts patients with duplicate entries. Then, it coerces all data to numeric types and aggregates the duplicates by computing the mean across aliquots for each patient, ensuring only one row per patient. After aggregation, the datasets are aligned by keeping only the patients that are common across all five datasets (`meth`, `rna`, `mirna`, `pam50`, and `clinical`). The result is s set of matched samples ready for integrated analysis."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "id": "b841497a",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "meth:\n",
      "patients with >1 aliquot: 91\n",
      "total duplicate rows: 94\n",
      "\n",
      "rna:\n",
      "patients with >1 aliquot: 93\n",
      "total duplicate rows: 96\n",
      "\n",
      "mirna:\n",
      "patients with >1 aliquot: 84\n",
      "total duplicate rows: 86\n",
      "\n",
      "Post-aggregation shapes:\n",
      "meth: (769, 20107)\n",
      "rna: (769, 18321)\n",
      "mirna: (769, 503)\n",
      "Patients in every dataset: 769\n",
      "\n",
      "Final shapes\n",
      "meth: (769, 20107)\n",
      "rna: (769, 18321)\n",
      "mirna: (769, 503)\n",
      "pam50: (769, 1)\n",
      "clinical:(769, 119)\n"
     ]
    }
   ],
   "source": [
    "for name, df in [(\"meth\", meth), (\"rna\", rna), (\"mirna\", mirna)]:\n",
    "    counts = df.index.value_counts()\n",
    "    n_multiple = (counts > 1).sum()\n",
    "    total_duplicates = counts[counts > 1].sum() - n_multiple\n",
    "    \n",
    "    print(f\"{name}:\")\n",
    "    print(f\"patients with >1 aliquot: {n_multiple}\")\n",
    "    print(f\"total duplicate rows: {total_duplicates}\\n\")\n",
    "\n",
    "meth = meth.apply(pd.to_numeric, errors=\"coerce\")\n",
    "rna = rna .apply(pd.to_numeric, errors=\"coerce\")\n",
    "mirna = mirna.apply(pd.to_numeric, errors=\"coerce\")\n",
    "\n",
    "meth = meth.groupby(level=0).mean()\n",
    "rna = rna.groupby(level=0).mean()\n",
    "mirna = mirna.groupby(level=0).mean()\n",
    "\n",
    "# Now each has one row per patient\n",
    "print(\"Post-aggregation shapes:\")\n",
    "print(f\"meth: {meth.shape}\")\n",
    "print(f\"rna: {rna.shape}\")\n",
    "print(f\"mirna: {mirna.shape}\")\n",
    "\n",
    "common = sorted( set(meth.index) & set(rna.index) & set(mirna.index)& set(pam50.index) & set(clinical.index) )\n",
    "print(f\"Patients in every dataset: {len(common)}\")\n",
    "\n",
    "meth = meth.loc[common]\n",
    "rna = rna.loc[common]\n",
    "mirna = mirna.loc[common]\n",
    "pam50 = pam50.loc[common]\n",
    "clinical = clinical.loc[common]\n",
    "\n",
    "print(\"\\nFinal shapes\")\n",
    "print(f\"meth: {meth.shape}\")\n",
    "print(f\"rna: {rna.shape}\")\n",
    "print(f\"mirna: {mirna.shape}\")\n",
    "print(f\"pam50: {pam50.shape}\")\n",
    "print(f\"clinical:{clinical.shape}\")"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "9d8dac23",
   "metadata": {},
   "source": [
    "### Review the first few rows of each file"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "id": "4f35bd67",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th>Hybridization REF</th>\n",
       "      <th>Composite Element REF</th>\n",
       "      <th>A1BG</th>\n",
       "      <th>A1CF</th>\n",
       "      <th>A2BP1</th>\n",
       "      <th>A2LD1</th>\n",
       "      <th>A2M</th>\n",
       "      <th>A2ML1</th>\n",
       "      <th>A4GALT</th>\n",
       "      <th>A4GNT</th>\n",
       "      <th>AAA1</th>\n",
       "      <th>...</th>\n",
       "      <th>ZWILCH</th>\n",
       "      <th>ZWINT</th>\n",
       "      <th>ZXDC</th>\n",
       "      <th>ZYG11A</th>\n",
       "      <th>ZYG11B</th>\n",
       "      <th>ZYX</th>\n",
       "      <th>ZZEF1</th>\n",
       "      <th>ZZZ3</th>\n",
       "      <th>psiTPTE22</th>\n",
       "      <th>tAKR</th>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>TCGA-3C-AAAU</th>\n",
       "      <td>NaN</td>\n",
       "      <td>0.483716</td>\n",
       "      <td>0.295827</td>\n",
       "      <td>0.187700</td>\n",
       "      <td>0.629586</td>\n",
       "      <td>0.559654</td>\n",
       "      <td>0.835412</td>\n",
       "      <td>0.484800</td>\n",
       "      <td>0.690217</td>\n",
       "      <td>0.807805</td>\n",
       "      <td>...</td>\n",
       "      <td>0.112978</td>\n",
       "      <td>0.053939</td>\n",
       "      <td>0.287665</td>\n",
       "      <td>0.328087</td>\n",
       "      <td>0.502935</td>\n",
       "      <td>0.220683</td>\n",
       "      <td>0.482044</td>\n",
       "      <td>0.107396</td>\n",
       "      <td>0.247304</td>\n",
       "      <td>0.506404</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>TCGA-3C-AALI</th>\n",
       "      <td>NaN</td>\n",
       "      <td>0.637191</td>\n",
       "      <td>0.458973</td>\n",
       "      <td>0.240516</td>\n",
       "      <td>0.666272</td>\n",
       "      <td>0.607505</td>\n",
       "      <td>0.842391</td>\n",
       "      <td>0.550047</td>\n",
       "      <td>0.749890</td>\n",
       "      <td>0.395290</td>\n",
       "      <td>...</td>\n",
       "      <td>0.111834</td>\n",
       "      <td>0.046160</td>\n",
       "      <td>0.265322</td>\n",
       "      <td>0.405851</td>\n",
       "      <td>0.434024</td>\n",
       "      <td>0.236362</td>\n",
       "      <td>0.458847</td>\n",
       "      <td>0.119652</td>\n",
       "      <td>0.163022</td>\n",
       "      <td>0.623865</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>TCGA-3C-AALJ</th>\n",
       "      <td>NaN</td>\n",
       "      <td>0.656092</td>\n",
       "      <td>0.489725</td>\n",
       "      <td>0.279088</td>\n",
       "      <td>0.755630</td>\n",
       "      <td>0.662360</td>\n",
       "      <td>0.829020</td>\n",
       "      <td>0.476107</td>\n",
       "      <td>0.653756</td>\n",
       "      <td>0.795102</td>\n",
       "      <td>...</td>\n",
       "      <td>0.113218</td>\n",
       "      <td>0.042657</td>\n",
       "      <td>0.272103</td>\n",
       "      <td>0.391326</td>\n",
       "      <td>0.449525</td>\n",
       "      <td>0.210976</td>\n",
       "      <td>0.482641</td>\n",
       "      <td>0.102385</td>\n",
       "      <td>0.252328</td>\n",
       "      <td>0.504451</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>TCGA-3C-AALK</th>\n",
       "      <td>NaN</td>\n",
       "      <td>0.615194</td>\n",
       "      <td>0.625765</td>\n",
       "      <td>0.488889</td>\n",
       "      <td>0.745751</td>\n",
       "      <td>0.727982</td>\n",
       "      <td>0.835365</td>\n",
       "      <td>0.556016</td>\n",
       "      <td>0.652005</td>\n",
       "      <td>0.816423</td>\n",
       "      <td>...</td>\n",
       "      <td>0.145133</td>\n",
       "      <td>0.047022</td>\n",
       "      <td>0.301284</td>\n",
       "      <td>0.410348</td>\n",
       "      <td>0.446571</td>\n",
       "      <td>0.220185</td>\n",
       "      <td>0.485944</td>\n",
       "      <td>0.112941</td>\n",
       "      <td>0.471956</td>\n",
       "      <td>0.682468</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>TCGA-4H-AAAK</th>\n",
       "      <td>NaN</td>\n",
       "      <td>0.612080</td>\n",
       "      <td>0.507737</td>\n",
       "      <td>0.463845</td>\n",
       "      <td>0.698516</td>\n",
       "      <td>0.692364</td>\n",
       "      <td>0.802388</td>\n",
       "      <td>0.504870</td>\n",
       "      <td>0.531183</td>\n",
       "      <td>0.851114</td>\n",
       "      <td>...</td>\n",
       "      <td>0.118928</td>\n",
       "      <td>0.045057</td>\n",
       "      <td>0.300647</td>\n",
       "      <td>0.379998</td>\n",
       "      <td>0.487929</td>\n",
       "      <td>0.233324</td>\n",
       "      <td>0.490736</td>\n",
       "      <td>0.115646</td>\n",
       "      <td>0.314877</td>\n",
       "      <td>0.744877</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "<p>5 rows × 20107 columns</p>\n",
       "</div>"
      ],
      "text/plain": [
       "Hybridization REF  Composite Element REF      A1BG      A1CF     A2BP1  \\\n",
       "0                                                                        \n",
       "TCGA-3C-AAAU                         NaN  0.483716  0.295827  0.187700   \n",
       "TCGA-3C-AALI                         NaN  0.637191  0.458973  0.240516   \n",
       "TCGA-3C-AALJ                         NaN  0.656092  0.489725  0.279088   \n",
       "TCGA-3C-AALK                         NaN  0.615194  0.625765  0.488889   \n",
       "TCGA-4H-AAAK                         NaN  0.612080  0.507737  0.463845   \n",
       "\n",
       "Hybridization REF     A2LD1       A2M     A2ML1    A4GALT     A4GNT      AAA1  \\\n",
       "0                                                                               \n",
       "TCGA-3C-AAAU       0.629586  0.559654  0.835412  0.484800  0.690217  0.807805   \n",
       "TCGA-3C-AALI       0.666272  0.607505  0.842391  0.550047  0.749890  0.395290   \n",
       "TCGA-3C-AALJ       0.755630  0.662360  0.829020  0.476107  0.653756  0.795102   \n",
       "TCGA-3C-AALK       0.745751  0.727982  0.835365  0.556016  0.652005  0.816423   \n",
       "TCGA-4H-AAAK       0.698516  0.692364  0.802388  0.504870  0.531183  0.851114   \n",
       "\n",
       "Hybridization REF  ...    ZWILCH     ZWINT      ZXDC    ZYG11A    ZYG11B  \\\n",
       "0                  ...                                                     \n",
       "TCGA-3C-AAAU       ...  0.112978  0.053939  0.287665  0.328087  0.502935   \n",
       "TCGA-3C-AALI       ...  0.111834  0.046160  0.265322  0.405851  0.434024   \n",
       "TCGA-3C-AALJ       ...  0.113218  0.042657  0.272103  0.391326  0.449525   \n",
       "TCGA-3C-AALK       ...  0.145133  0.047022  0.301284  0.410348  0.446571   \n",
       "TCGA-4H-AAAK       ...  0.118928  0.045057  0.300647  0.379998  0.487929   \n",
       "\n",
       "Hybridization REF       ZYX     ZZEF1      ZZZ3  psiTPTE22      tAKR  \n",
       "0                                                                     \n",
       "TCGA-3C-AAAU       0.220683  0.482044  0.107396   0.247304  0.506404  \n",
       "TCGA-3C-AALI       0.236362  0.458847  0.119652   0.163022  0.623865  \n",
       "TCGA-3C-AALJ       0.210976  0.482641  0.102385   0.252328  0.504451  \n",
       "TCGA-3C-AALK       0.220185  0.485944  0.112941   0.471956  0.682468  \n",
       "TCGA-4H-AAAK       0.233324  0.490736  0.115646   0.314877  0.744877  \n",
       "\n",
       "[5 rows x 20107 columns]"
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    },
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th>gene</th>\n",
       "      <th>?|100133144</th>\n",
       "      <th>?|100134869</th>\n",
       "      <th>?|10357</th>\n",
       "      <th>?|10431</th>\n",
       "      <th>?|155060</th>\n",
       "      <th>?|26823</th>\n",
       "      <th>?|340602</th>\n",
       "      <th>?|388795</th>\n",
       "      <th>?|390284</th>\n",
       "      <th>?|391343</th>\n",
       "      <th>...</th>\n",
       "      <th>ZWINT|11130</th>\n",
       "      <th>ZXDA|7789</th>\n",
       "      <th>ZXDB|158586</th>\n",
       "      <th>ZXDC|79364</th>\n",
       "      <th>ZYG11A|440590</th>\n",
       "      <th>ZYG11B|79699</th>\n",
       "      <th>ZYX|7791</th>\n",
       "      <th>ZZEF1|23140</th>\n",
       "      <th>ZZZ3|26009</th>\n",
       "      <th>psiTPTE22|387590</th>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>TCGA-3C-AAAU</th>\n",
       "      <td>4.032489</td>\n",
       "      <td>3.692829</td>\n",
       "      <td>5.704604</td>\n",
       "      <td>8.672694</td>\n",
       "      <td>10.213110</td>\n",
       "      <td>NaN</td>\n",
       "      <td>0.785174</td>\n",
       "      <td>-1.536587</td>\n",
       "      <td>2.048201</td>\n",
       "      <td>NaN</td>\n",
       "      <td>...</td>\n",
       "      <td>9.864120</td>\n",
       "      <td>7.017830</td>\n",
       "      <td>9.976968</td>\n",
       "      <td>10.695662</td>\n",
       "      <td>8.013988</td>\n",
       "      <td>10.238851</td>\n",
       "      <td>11.776124</td>\n",
       "      <td>10.887932</td>\n",
       "      <td>10.205129</td>\n",
       "      <td>0.785174</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>TCGA-3C-AALI</th>\n",
       "      <td>3.211931</td>\n",
       "      <td>4.119273</td>\n",
       "      <td>6.124231</td>\n",
       "      <td>9.139279</td>\n",
       "      <td>9.011343</td>\n",
       "      <td>0.121015</td>\n",
       "      <td>7.170928</td>\n",
       "      <td>2.291014</td>\n",
       "      <td>0.706022</td>\n",
       "      <td>3.027968</td>\n",
       "      <td>...</td>\n",
       "      <td>9.914682</td>\n",
       "      <td>5.902438</td>\n",
       "      <td>8.809329</td>\n",
       "      <td>10.391374</td>\n",
       "      <td>7.632831</td>\n",
       "      <td>9.237422</td>\n",
       "      <td>12.426428</td>\n",
       "      <td>10.364848</td>\n",
       "      <td>8.667973</td>\n",
       "      <td>9.855788</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>TCGA-3C-AALJ</th>\n",
       "      <td>3.538886</td>\n",
       "      <td>3.206237</td>\n",
       "      <td>7.269570</td>\n",
       "      <td>10.410275</td>\n",
       "      <td>9.209506</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>1.443554</td>\n",
       "      <td>1.443554</td>\n",
       "      <td>NaN</td>\n",
       "      <td>...</td>\n",
       "      <td>11.305650</td>\n",
       "      <td>5.143969</td>\n",
       "      <td>9.060691</td>\n",
       "      <td>9.586488</td>\n",
       "      <td>8.374267</td>\n",
       "      <td>9.055784</td>\n",
       "      <td>12.414355</td>\n",
       "      <td>9.880935</td>\n",
       "      <td>8.992994</td>\n",
       "      <td>5.143969</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>TCGA-3C-AALK</th>\n",
       "      <td>3.595671</td>\n",
       "      <td>3.469873</td>\n",
       "      <td>7.168565</td>\n",
       "      <td>9.757450</td>\n",
       "      <td>9.110487</td>\n",
       "      <td>-1.273343</td>\n",
       "      <td>NaN</td>\n",
       "      <td>1.048724</td>\n",
       "      <td>2.186215</td>\n",
       "      <td>NaN</td>\n",
       "      <td>...</td>\n",
       "      <td>9.384994</td>\n",
       "      <td>5.782065</td>\n",
       "      <td>8.773906</td>\n",
       "      <td>9.754688</td>\n",
       "      <td>7.454703</td>\n",
       "      <td>9.246419</td>\n",
       "      <td>12.474556</td>\n",
       "      <td>9.609426</td>\n",
       "      <td>9.453001</td>\n",
       "      <td>6.057699</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>TCGA-4H-AAAK</th>\n",
       "      <td>2.775430</td>\n",
       "      <td>3.850979</td>\n",
       "      <td>6.395968</td>\n",
       "      <td>9.581922</td>\n",
       "      <td>8.027083</td>\n",
       "      <td>-1.232769</td>\n",
       "      <td>-1.232769</td>\n",
       "      <td>1.574683</td>\n",
       "      <td>1.574683</td>\n",
       "      <td>NaN</td>\n",
       "      <td>...</td>\n",
       "      <td>9.397606</td>\n",
       "      <td>5.612830</td>\n",
       "      <td>8.728789</td>\n",
       "      <td>10.035881</td>\n",
       "      <td>3.811738</td>\n",
       "      <td>9.599438</td>\n",
       "      <td>11.980747</td>\n",
       "      <td>9.700292</td>\n",
       "      <td>9.784147</td>\n",
       "      <td>7.548699</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "<p>5 rows × 18321 columns</p>\n",
       "</div>"
      ],
      "text/plain": [
       "gene          ?|100133144  ?|100134869   ?|10357    ?|10431   ?|155060  \\\n",
       "0                                                                        \n",
       "TCGA-3C-AAAU     4.032489     3.692829  5.704604   8.672694  10.213110   \n",
       "TCGA-3C-AALI     3.211931     4.119273  6.124231   9.139279   9.011343   \n",
       "TCGA-3C-AALJ     3.538886     3.206237  7.269570  10.410275   9.209506   \n",
       "TCGA-3C-AALK     3.595671     3.469873  7.168565   9.757450   9.110487   \n",
       "TCGA-4H-AAAK     2.775430     3.850979  6.395968   9.581922   8.027083   \n",
       "\n",
       "gene           ?|26823  ?|340602  ?|388795  ?|390284  ?|391343  ...  \\\n",
       "0                                                               ...   \n",
       "TCGA-3C-AAAU       NaN  0.785174 -1.536587  2.048201       NaN  ...   \n",
       "TCGA-3C-AALI  0.121015  7.170928  2.291014  0.706022  3.027968  ...   \n",
       "TCGA-3C-AALJ       NaN       NaN  1.443554  1.443554       NaN  ...   \n",
       "TCGA-3C-AALK -1.273343       NaN  1.048724  2.186215       NaN  ...   \n",
       "TCGA-4H-AAAK -1.232769 -1.232769  1.574683  1.574683       NaN  ...   \n",
       "\n",
       "gene          ZWINT|11130  ZXDA|7789  ZXDB|158586  ZXDC|79364  ZYG11A|440590  \\\n",
       "0                                                                              \n",
       "TCGA-3C-AAAU     9.864120   7.017830     9.976968   10.695662       8.013988   \n",
       "TCGA-3C-AALI     9.914682   5.902438     8.809329   10.391374       7.632831   \n",
       "TCGA-3C-AALJ    11.305650   5.143969     9.060691    9.586488       8.374267   \n",
       "TCGA-3C-AALK     9.384994   5.782065     8.773906    9.754688       7.454703   \n",
       "TCGA-4H-AAAK     9.397606   5.612830     8.728789   10.035881       3.811738   \n",
       "\n",
       "gene          ZYG11B|79699   ZYX|7791  ZZEF1|23140  ZZZ3|26009  \\\n",
       "0                                                                \n",
       "TCGA-3C-AAAU     10.238851  11.776124    10.887932   10.205129   \n",
       "TCGA-3C-AALI      9.237422  12.426428    10.364848    8.667973   \n",
       "TCGA-3C-AALJ      9.055784  12.414355     9.880935    8.992994   \n",
       "TCGA-3C-AALK      9.246419  12.474556     9.609426    9.453001   \n",
       "TCGA-4H-AAAK      9.599438  11.980747     9.700292    9.784147   \n",
       "\n",
       "gene          psiTPTE22|387590  \n",
       "0                               \n",
       "TCGA-3C-AAAU          0.785174  \n",
       "TCGA-3C-AALI          9.855788  \n",
       "TCGA-3C-AALJ          5.143969  \n",
       "TCGA-3C-AALK          6.057699  \n",
       "TCGA-4H-AAAK          7.548699  \n",
       "\n",
       "[5 rows x 18321 columns]"
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    },
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th>gene</th>\n",
       "      <th>hsa-let-7a-1</th>\n",
       "      <th>hsa-let-7a-2</th>\n",
       "      <th>hsa-let-7a-3</th>\n",
       "      <th>hsa-let-7b</th>\n",
       "      <th>hsa-let-7c</th>\n",
       "      <th>hsa-let-7d</th>\n",
       "      <th>hsa-let-7e</th>\n",
       "      <th>hsa-let-7f-1</th>\n",
       "      <th>hsa-let-7f-2</th>\n",
       "      <th>hsa-let-7g</th>\n",
       "      <th>...</th>\n",
       "      <th>hsa-mir-937</th>\n",
       "      <th>hsa-mir-939</th>\n",
       "      <th>hsa-mir-940</th>\n",
       "      <th>hsa-mir-942</th>\n",
       "      <th>hsa-mir-944</th>\n",
       "      <th>hsa-mir-95</th>\n",
       "      <th>hsa-mir-96</th>\n",
       "      <th>hsa-mir-98</th>\n",
       "      <th>hsa-mir-99a</th>\n",
       "      <th>hsa-mir-99b</th>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>TCGA-3C-AAAU</th>\n",
       "      <td>13.129765</td>\n",
       "      <td>14.117933</td>\n",
       "      <td>13.147714</td>\n",
       "      <td>14.595135</td>\n",
       "      <td>8.414890</td>\n",
       "      <td>8.665921</td>\n",
       "      <td>10.521777</td>\n",
       "      <td>3.879392</td>\n",
       "      <td>11.824817</td>\n",
       "      <td>8.597744</td>\n",
       "      <td>...</td>\n",
       "      <td>0.906699</td>\n",
       "      <td>-0.093302</td>\n",
       "      <td>2.672234</td>\n",
       "      <td>2.467414</td>\n",
       "      <td>1.044202</td>\n",
       "      <td>2.044202</td>\n",
       "      <td>6.906699</td>\n",
       "      <td>5.754696</td>\n",
       "      <td>7.024602</td>\n",
       "      <td>15.506461</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>TCGA-3C-AALI</th>\n",
       "      <td>12.918069</td>\n",
       "      <td>13.922300</td>\n",
       "      <td>12.913194</td>\n",
       "      <td>14.512657</td>\n",
       "      <td>9.646536</td>\n",
       "      <td>9.003653</td>\n",
       "      <td>9.131760</td>\n",
       "      <td>4.386952</td>\n",
       "      <td>12.678841</td>\n",
       "      <td>8.455144</td>\n",
       "      <td>...</td>\n",
       "      <td>1.579597</td>\n",
       "      <td>-0.083367</td>\n",
       "      <td>0.139024</td>\n",
       "      <td>3.032109</td>\n",
       "      <td>-0.668331</td>\n",
       "      <td>0.331670</td>\n",
       "      <td>5.912870</td>\n",
       "      <td>6.427066</td>\n",
       "      <td>7.885299</td>\n",
       "      <td>13.626182</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>TCGA-3C-AALJ</th>\n",
       "      <td>13.012033</td>\n",
       "      <td>14.010002</td>\n",
       "      <td>13.028483</td>\n",
       "      <td>13.419612</td>\n",
       "      <td>9.312455</td>\n",
       "      <td>9.276943</td>\n",
       "      <td>11.395711</td>\n",
       "      <td>5.314692</td>\n",
       "      <td>13.530255</td>\n",
       "      <td>9.230563</td>\n",
       "      <td>...</td>\n",
       "      <td>3.270298</td>\n",
       "      <td>-2.189134</td>\n",
       "      <td>0.395828</td>\n",
       "      <td>1.855261</td>\n",
       "      <td>-0.381778</td>\n",
       "      <td>0.717757</td>\n",
       "      <td>6.603657</td>\n",
       "      <td>6.878301</td>\n",
       "      <td>7.580704</td>\n",
       "      <td>15.013822</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>TCGA-3C-AALK</th>\n",
       "      <td>13.144697</td>\n",
       "      <td>14.141721</td>\n",
       "      <td>13.151281</td>\n",
       "      <td>14.667196</td>\n",
       "      <td>11.511431</td>\n",
       "      <td>8.384763</td>\n",
       "      <td>10.368981</td>\n",
       "      <td>4.159182</td>\n",
       "      <td>12.652559</td>\n",
       "      <td>8.471503</td>\n",
       "      <td>...</td>\n",
       "      <td>0.923965</td>\n",
       "      <td>-0.660997</td>\n",
       "      <td>-0.076034</td>\n",
       "      <td>1.798435</td>\n",
       "      <td>1.798435</td>\n",
       "      <td>0.798435</td>\n",
       "      <td>6.181354</td>\n",
       "      <td>5.377922</td>\n",
       "      <td>10.031619</td>\n",
       "      <td>14.554783</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>TCGA-4H-AAAK</th>\n",
       "      <td>13.411684</td>\n",
       "      <td>14.413518</td>\n",
       "      <td>13.420481</td>\n",
       "      <td>14.438548</td>\n",
       "      <td>11.693927</td>\n",
       "      <td>8.453747</td>\n",
       "      <td>10.741371</td>\n",
       "      <td>4.494537</td>\n",
       "      <td>13.009499</td>\n",
       "      <td>8.381220</td>\n",
       "      <td>...</td>\n",
       "      <td>0.182950</td>\n",
       "      <td>-0.624403</td>\n",
       "      <td>-1.624403</td>\n",
       "      <td>1.076036</td>\n",
       "      <td>0.182950</td>\n",
       "      <td>-0.302475</td>\n",
       "      <td>4.318110</td>\n",
       "      <td>5.103516</td>\n",
       "      <td>10.078201</td>\n",
       "      <td>14.650338</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "<p>5 rows × 503 columns</p>\n",
       "</div>"
      ],
      "text/plain": [
       "gene          hsa-let-7a-1  hsa-let-7a-2  hsa-let-7a-3  hsa-let-7b  \\\n",
       "0                                                                    \n",
       "TCGA-3C-AAAU     13.129765     14.117933     13.147714   14.595135   \n",
       "TCGA-3C-AALI     12.918069     13.922300     12.913194   14.512657   \n",
       "TCGA-3C-AALJ     13.012033     14.010002     13.028483   13.419612   \n",
       "TCGA-3C-AALK     13.144697     14.141721     13.151281   14.667196   \n",
       "TCGA-4H-AAAK     13.411684     14.413518     13.420481   14.438548   \n",
       "\n",
       "gene          hsa-let-7c  hsa-let-7d  hsa-let-7e  hsa-let-7f-1  hsa-let-7f-2  \\\n",
       "0                                                                              \n",
       "TCGA-3C-AAAU    8.414890    8.665921   10.521777      3.879392     11.824817   \n",
       "TCGA-3C-AALI    9.646536    9.003653    9.131760      4.386952     12.678841   \n",
       "TCGA-3C-AALJ    9.312455    9.276943   11.395711      5.314692     13.530255   \n",
       "TCGA-3C-AALK   11.511431    8.384763   10.368981      4.159182     12.652559   \n",
       "TCGA-4H-AAAK   11.693927    8.453747   10.741371      4.494537     13.009499   \n",
       "\n",
       "gene          hsa-let-7g  ...  hsa-mir-937  hsa-mir-939  hsa-mir-940  \\\n",
       "0                         ...                                          \n",
       "TCGA-3C-AAAU    8.597744  ...     0.906699    -0.093302     2.672234   \n",
       "TCGA-3C-AALI    8.455144  ...     1.579597    -0.083367     0.139024   \n",
       "TCGA-3C-AALJ    9.230563  ...     3.270298    -2.189134     0.395828   \n",
       "TCGA-3C-AALK    8.471503  ...     0.923965    -0.660997    -0.076034   \n",
       "TCGA-4H-AAAK    8.381220  ...     0.182950    -0.624403    -1.624403   \n",
       "\n",
       "gene          hsa-mir-942  hsa-mir-944  hsa-mir-95  hsa-mir-96  hsa-mir-98  \\\n",
       "0                                                                            \n",
       "TCGA-3C-AAAU     2.467414     1.044202    2.044202    6.906699    5.754696   \n",
       "TCGA-3C-AALI     3.032109    -0.668331    0.331670    5.912870    6.427066   \n",
       "TCGA-3C-AALJ     1.855261    -0.381778    0.717757    6.603657    6.878301   \n",
       "TCGA-3C-AALK     1.798435     1.798435    0.798435    6.181354    5.377922   \n",
       "TCGA-4H-AAAK     1.076036     0.182950   -0.302475    4.318110    5.103516   \n",
       "\n",
       "gene          hsa-mir-99a  hsa-mir-99b  \n",
       "0                                       \n",
       "TCGA-3C-AAAU     7.024602    15.506461  \n",
       "TCGA-3C-AALI     7.885299    13.626182  \n",
       "TCGA-3C-AALJ     7.580704    15.013822  \n",
       "TCGA-3C-AALK    10.031619    14.554783  \n",
       "TCGA-4H-AAAK    10.078201    14.650338  \n",
       "\n",
       "[5 rows x 503 columns]"
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    },
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>project</th>\n",
       "      <th>synchronous_malignancy</th>\n",
       "      <th>ajcc_pathologic_stage</th>\n",
       "      <th>days_to_diagnosis</th>\n",
       "      <th>laterality</th>\n",
       "      <th>created_datetime</th>\n",
       "      <th>last_known_disease_status</th>\n",
       "      <th>tissue_or_organ_of_origin</th>\n",
       "      <th>days_to_last_follow_up</th>\n",
       "      <th>age_at_diagnosis</th>\n",
       "      <th>...</th>\n",
       "      <th>pathology_N_stage</th>\n",
       "      <th>pathology_M_stage</th>\n",
       "      <th>gender</th>\n",
       "      <th>date_of_initial_pathologic_diagnosis</th>\n",
       "      <th>days_to_last_known_alive</th>\n",
       "      <th>radiation_therapy</th>\n",
       "      <th>histological_type</th>\n",
       "      <th>number_of_lymph_nodes</th>\n",
       "      <th>race</th>\n",
       "      <th>ethnicity</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>TCGA-3C-AAAU</th>\n",
       "      <td>TCGA-BRCA</td>\n",
       "      <td>No</td>\n",
       "      <td>Stage X</td>\n",
       "      <td>0.0</td>\n",
       "      <td>Left</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>Breast, NOS</td>\n",
       "      <td>NaN</td>\n",
       "      <td>20211.0</td>\n",
       "      <td>...</td>\n",
       "      <td>nx</td>\n",
       "      <td>mx</td>\n",
       "      <td>female</td>\n",
       "      <td>2004</td>\n",
       "      <td>NaN</td>\n",
       "      <td>no</td>\n",
       "      <td>infiltrating lobular carcinoma</td>\n",
       "      <td>4</td>\n",
       "      <td>white</td>\n",
       "      <td>not hispanic or latino</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>TCGA-3C-AALI</th>\n",
       "      <td>TCGA-BRCA</td>\n",
       "      <td>No</td>\n",
       "      <td>Stage IIB</td>\n",
       "      <td>0.0</td>\n",
       "      <td>Right</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>Breast, NOS</td>\n",
       "      <td>NaN</td>\n",
       "      <td>18538.0</td>\n",
       "      <td>...</td>\n",
       "      <td>n1a</td>\n",
       "      <td>m0</td>\n",
       "      <td>female</td>\n",
       "      <td>2003</td>\n",
       "      <td>NaN</td>\n",
       "      <td>yes</td>\n",
       "      <td>infiltrating ductal carcinoma</td>\n",
       "      <td>1</td>\n",
       "      <td>black or african american</td>\n",
       "      <td>not hispanic or latino</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>TCGA-3C-AALJ</th>\n",
       "      <td>TCGA-BRCA</td>\n",
       "      <td>No</td>\n",
       "      <td>Stage IIB</td>\n",
       "      <td>0.0</td>\n",
       "      <td>Right</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>Breast, NOS</td>\n",
       "      <td>NaN</td>\n",
       "      <td>22848.0</td>\n",
       "      <td>...</td>\n",
       "      <td>n1a</td>\n",
       "      <td>m0</td>\n",
       "      <td>female</td>\n",
       "      <td>2011</td>\n",
       "      <td>NaN</td>\n",
       "      <td>no</td>\n",
       "      <td>infiltrating ductal carcinoma</td>\n",
       "      <td>1</td>\n",
       "      <td>black or african american</td>\n",
       "      <td>not hispanic or latino</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>TCGA-3C-AALK</th>\n",
       "      <td>TCGA-BRCA</td>\n",
       "      <td>No</td>\n",
       "      <td>Stage IA</td>\n",
       "      <td>0.0</td>\n",
       "      <td>Right</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>Breast, NOS</td>\n",
       "      <td>NaN</td>\n",
       "      <td>19074.0</td>\n",
       "      <td>...</td>\n",
       "      <td>n0 (i+)</td>\n",
       "      <td>m0</td>\n",
       "      <td>female</td>\n",
       "      <td>2011</td>\n",
       "      <td>NaN</td>\n",
       "      <td>no</td>\n",
       "      <td>infiltrating ductal carcinoma</td>\n",
       "      <td>0</td>\n",
       "      <td>black or african american</td>\n",
       "      <td>not hispanic or latino</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>TCGA-4H-AAAK</th>\n",
       "      <td>TCGA-BRCA</td>\n",
       "      <td>No</td>\n",
       "      <td>Stage IIIA</td>\n",
       "      <td>0.0</td>\n",
       "      <td>Left</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>Breast, NOS</td>\n",
       "      <td>NaN</td>\n",
       "      <td>18371.0</td>\n",
       "      <td>...</td>\n",
       "      <td>n2a</td>\n",
       "      <td>m0</td>\n",
       "      <td>female</td>\n",
       "      <td>2013</td>\n",
       "      <td>NaN</td>\n",
       "      <td>no</td>\n",
       "      <td>infiltrating lobular carcinoma</td>\n",
       "      <td>4</td>\n",
       "      <td>white</td>\n",
       "      <td>not hispanic or latino</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "<p>5 rows × 119 columns</p>\n",
       "</div>"
      ],
      "text/plain": [
       "                project synchronous_malignancy ajcc_pathologic_stage  \\\n",
       "TCGA-3C-AAAU  TCGA-BRCA                     No               Stage X   \n",
       "TCGA-3C-AALI  TCGA-BRCA                     No             Stage IIB   \n",
       "TCGA-3C-AALJ  TCGA-BRCA                     No             Stage IIB   \n",
       "TCGA-3C-AALK  TCGA-BRCA                     No              Stage IA   \n",
       "TCGA-4H-AAAK  TCGA-BRCA                     No            Stage IIIA   \n",
       "\n",
       "              days_to_diagnosis laterality  created_datetime  \\\n",
       "TCGA-3C-AAAU                0.0       Left               NaN   \n",
       "TCGA-3C-AALI                0.0      Right               NaN   \n",
       "TCGA-3C-AALJ                0.0      Right               NaN   \n",
       "TCGA-3C-AALK                0.0      Right               NaN   \n",
       "TCGA-4H-AAAK                0.0       Left               NaN   \n",
       "\n",
       "             last_known_disease_status tissue_or_organ_of_origin  \\\n",
       "TCGA-3C-AAAU                       NaN               Breast, NOS   \n",
       "TCGA-3C-AALI                       NaN               Breast, NOS   \n",
       "TCGA-3C-AALJ                       NaN               Breast, NOS   \n",
       "TCGA-3C-AALK                       NaN               Breast, NOS   \n",
       "TCGA-4H-AAAK                       NaN               Breast, NOS   \n",
       "\n",
       "              days_to_last_follow_up  age_at_diagnosis  ... pathology_N_stage  \\\n",
       "TCGA-3C-AAAU                     NaN           20211.0  ...                nx   \n",
       "TCGA-3C-AALI                     NaN           18538.0  ...               n1a   \n",
       "TCGA-3C-AALJ                     NaN           22848.0  ...               n1a   \n",
       "TCGA-3C-AALK                     NaN           19074.0  ...           n0 (i+)   \n",
       "TCGA-4H-AAAK                     NaN           18371.0  ...               n2a   \n",
       "\n",
       "             pathology_M_stage  gender date_of_initial_pathologic_diagnosis  \\\n",
       "TCGA-3C-AAAU                mx  female                                 2004   \n",
       "TCGA-3C-AALI                m0  female                                 2003   \n",
       "TCGA-3C-AALJ                m0  female                                 2011   \n",
       "TCGA-3C-AALK                m0  female                                 2011   \n",
       "TCGA-4H-AAAK                m0  female                                 2013   \n",
       "\n",
       "              days_to_last_known_alive radiation_therapy  \\\n",
       "TCGA-3C-AAAU                       NaN                no   \n",
       "TCGA-3C-AALI                       NaN               yes   \n",
       "TCGA-3C-AALJ                       NaN                no   \n",
       "TCGA-3C-AALK                       NaN                no   \n",
       "TCGA-4H-AAAK                       NaN                no   \n",
       "\n",
       "                           histological_type number_of_lymph_nodes  \\\n",
       "TCGA-3C-AAAU  infiltrating lobular carcinoma                     4   \n",
       "TCGA-3C-AALI   infiltrating ductal carcinoma                     1   \n",
       "TCGA-3C-AALJ   infiltrating ductal carcinoma                     1   \n",
       "TCGA-3C-AALK   infiltrating ductal carcinoma                     0   \n",
       "TCGA-4H-AAAK  infiltrating lobular carcinoma                     4   \n",
       "\n",
       "                                   race               ethnicity  \n",
       "TCGA-3C-AAAU                      white  not hispanic or latino  \n",
       "TCGA-3C-AALI  black or african american  not hispanic or latino  \n",
       "TCGA-3C-AALJ  black or african american  not hispanic or latino  \n",
       "TCGA-3C-AALK  black or african american  not hispanic or latino  \n",
       "TCGA-4H-AAAK                      white  not hispanic or latino  \n",
       "\n",
       "[5 rows x 119 columns]"
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    },
    {
     "data": {
      "text/plain": [
       "BRCA_Subtype_PAM50\n",
       "LumA                  419\n",
       "LumB                  140\n",
       "Basal                 130\n",
       "Her2                   46\n",
       "Normal                 34\n",
       "Name: count, dtype: int64"
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    }
   ],
   "source": [
    "display(meth.head())\n",
    "display(rna.head())\n",
    "display(mirna.head())\n",
    "display(clinical.head())\n",
    "display(pam50.value_counts())"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "17f7d599",
   "metadata": {},
   "source": [
    "# Preprocessing\n",
    "\n",
    "After reviewing the data above, we applied the following steps before further analysis.\n",
    "\n",
    "1. Methylation (β-value -> M-value):\n",
    "   - Clip β-values to \\[ε, 1-ε] and apply the logit transform: M = log_2(β / (1-β)).\n",
    "   - Drop the original `Composite Element REF` column.\n",
    "\n",
    "2. mRNA & miRNA:\n",
    "   - Already on a log_2 scale (RSEM-normalized mRNA and RPKM miRNA, respectively).\n",
    "\n",
    "3. Quality Control:\n",
    "   - Count samples with all-zero rows in each modality.\n",
    "   - Compute NaN counts post-transformation, then replace NaNs in the omics tables with 0 (the clinical table keeps its NaNs, as the output below shows).\n",
    "\n",
    "4. Column Name Cleaning:\n",
    "   - Replace all `-` and `|` characters with `_`.\n",
    "   - Replace `?` with `unknown`.\n",
    "\n",
    "5. Label Encoding:\n",
    "   - Map PAM50 subtypes to integers: Normal=0, Basal=1, Her2=2, LumA=3, LumB=4\n",
    "\n",
    "6. Alignment & Aggregation:\n",
    "   - Trim barcodes to patient level.\n",
    "   - Aggregate duplicate aliquots by mean per patient.\n",
    "   - Drop the `project` column from clinical.\n",
    "   - Subset all tables to the common patient set (no missing or all-zero samples).\n",
    "\n",
    "7. Final Output Shapes:\n",
    "   - Methylation M-value: 769 × 20,106\n",
    "   - mRNA (log_2): 769 × 18,321\n",
    "   - miRNA (log_2): 769 × 503\n",
    "   - PAM50 labels: 769 × 1\n",
    "   - Clinical covariates: 769 × 118"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "5bb6450e",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "All zeros: meth: 0, rna: 0, mirna: 0\n",
      "nan_meth: 0, nan_rna: 0, nan_mirna: 0, nan_clinical: 0, nan_pam50: 0\n",
      "NaN counts after filling:\n",
      "0 0 0 46476 0\n",
      "new shapes: meth: (769, 20106), rna: (769, 18321), mirna: (769, 503), pam50: (769, 1), clinical: (769, 118)\n"
     ]
    },
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th>Hybridization REF</th>\n",
       "      <th>A1BG</th>\n",
       "      <th>A1CF</th>\n",
       "      <th>A2BP1</th>\n",
       "      <th>A2LD1</th>\n",
       "      <th>A2M</th>\n",
       "      <th>A2ML1</th>\n",
       "      <th>A4GALT</th>\n",
       "      <th>A4GNT</th>\n",
       "      <th>AAA1</th>\n",
       "      <th>AAAS</th>\n",
       "      <th>...</th>\n",
       "      <th>ZWILCH</th>\n",
       "      <th>ZWINT</th>\n",
       "      <th>ZXDC</th>\n",
       "      <th>ZYG11A</th>\n",
       "      <th>ZYG11B</th>\n",
       "      <th>ZYX</th>\n",
       "      <th>ZZEF1</th>\n",
       "      <th>ZZZ3</th>\n",
       "      <th>psiTPTE22</th>\n",
       "      <th>tAKR</th>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>patient</th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>TCGA-3C-AAAU</th>\n",
       "      <td>-0.094004</td>\n",
       "      <td>-1.251175</td>\n",
       "      <td>-2.113585</td>\n",
       "      <td>0.765262</td>\n",
       "      <td>0.345896</td>\n",
       "      <td>2.343631</td>\n",
       "      <td>-0.087741</td>\n",
       "      <td>1.155791</td>\n",
       "      <td>2.071436</td>\n",
       "      <td>-2.650851</td>\n",
       "      <td>...</td>\n",
       "      <td>-2.972923</td>\n",
       "      <td>-4.132523</td>\n",
       "      <td>-1.308165</td>\n",
       "      <td>-1.034199</td>\n",
       "      <td>0.016935</td>\n",
       "      <td>-1.820233</td>\n",
       "      <td>-0.103662</td>\n",
       "      <td>-3.055084</td>\n",
       "      <td>-1.605783</td>\n",
       "      <td>0.036955</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>TCGA-3C-AALI</th>\n",
       "      <td>0.812517</td>\n",
       "      <td>-0.237291</td>\n",
       "      <td>-1.658888</td>\n",
       "      <td>0.997440</td>\n",
       "      <td>0.630221</td>\n",
       "      <td>2.418135</td>\n",
       "      <td>0.289780</td>\n",
       "      <td>1.584114</td>\n",
       "      <td>-0.613329</td>\n",
       "      <td>-4.072465</td>\n",
       "      <td>...</td>\n",
       "      <td>-2.989465</td>\n",
       "      <td>-4.369032</td>\n",
       "      <td>-1.469365</td>\n",
       "      <td>-0.549876</td>\n",
       "      <td>-0.382967</td>\n",
       "      <td>-1.691887</td>\n",
       "      <td>-0.238022</td>\n",
       "      <td>-2.879231</td>\n",
       "      <td>-2.360128</td>\n",
       "      <td>0.729981</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>TCGA-3C-AALJ</th>\n",
       "      <td>0.931878</td>\n",
       "      <td>-0.059301</td>\n",
       "      <td>-1.369104</td>\n",
       "      <td>1.628617</td>\n",
       "      <td>0.972130</td>\n",
       "      <td>2.277584</td>\n",
       "      <td>-0.137988</td>\n",
       "      <td>0.916964</td>\n",
       "      <td>1.956230</td>\n",
       "      <td>-3.781647</td>\n",
       "      <td>...</td>\n",
       "      <td>-2.969472</td>\n",
       "      <td>-4.488190</td>\n",
       "      <td>-1.419578</td>\n",
       "      <td>-0.637297</td>\n",
       "      <td>-0.292273</td>\n",
       "      <td>-1.902991</td>\n",
       "      <td>-0.100215</td>\n",
       "      <td>-3.132087</td>\n",
       "      <td>-1.567104</td>\n",
       "      <td>0.025686</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>TCGA-3C-AALK</th>\n",
       "      <td>0.676913</td>\n",
       "      <td>0.741678</td>\n",
       "      <td>-0.064133</td>\n",
       "      <td>1.552454</td>\n",
       "      <td>1.420200</td>\n",
       "      <td>2.343133</td>\n",
       "      <td>0.324621</td>\n",
       "      <td>0.905816</td>\n",
       "      <td>2.152928</td>\n",
       "      <td>-3.894574</td>\n",
       "      <td>...</td>\n",
       "      <td>-2.558319</td>\n",
       "      <td>-4.341028</td>\n",
       "      <td>-1.213585</td>\n",
       "      <td>-0.523013</td>\n",
       "      <td>-0.309506</td>\n",
       "      <td>-1.824419</td>\n",
       "      <td>-0.081137</td>\n",
       "      <td>-2.973455</td>\n",
       "      <td>-0.162004</td>\n",
       "      <td>1.103860</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>TCGA-4H-AAAK</th>\n",
       "      <td>0.657963</td>\n",
       "      <td>0.044649</td>\n",
       "      <td>-0.209004</td>\n",
       "      <td>1.212210</td>\n",
       "      <td>1.170304</td>\n",
       "      <td>2.021628</td>\n",
       "      <td>0.028103</td>\n",
       "      <td>0.180184</td>\n",
       "      <td>2.515149</td>\n",
       "      <td>-3.885526</td>\n",
       "      <td>...</td>\n",
       "      <td>-2.889175</td>\n",
       "      <td>-4.405580</td>\n",
       "      <td>-1.217950</td>\n",
       "      <td>-0.706284</td>\n",
       "      <td>-0.069670</td>\n",
       "      <td>-1.716283</td>\n",
       "      <td>-0.053464</td>\n",
       "      <td>-2.934908</td>\n",
       "      <td>-1.121575</td>\n",
       "      <td>1.545812</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "<p>5 rows × 20106 columns</p>\n",
       "</div>"
      ],
      "text/plain": [
       "Hybridization REF      A1BG      A1CF     A2BP1     A2LD1       A2M     A2ML1  \\\n",
       "patient                                                                         \n",
       "TCGA-3C-AAAU      -0.094004 -1.251175 -2.113585  0.765262  0.345896  2.343631   \n",
       "TCGA-3C-AALI       0.812517 -0.237291 -1.658888  0.997440  0.630221  2.418135   \n",
       "TCGA-3C-AALJ       0.931878 -0.059301 -1.369104  1.628617  0.972130  2.277584   \n",
       "TCGA-3C-AALK       0.676913  0.741678 -0.064133  1.552454  1.420200  2.343133   \n",
       "TCGA-4H-AAAK       0.657963  0.044649 -0.209004  1.212210  1.170304  2.021628   \n",
       "\n",
       "Hybridization REF    A4GALT     A4GNT      AAA1      AAAS  ...    ZWILCH  \\\n",
       "patient                                                    ...             \n",
       "TCGA-3C-AAAU      -0.087741  1.155791  2.071436 -2.650851  ... -2.972923   \n",
       "TCGA-3C-AALI       0.289780  1.584114 -0.613329 -4.072465  ... -2.989465   \n",
       "TCGA-3C-AALJ      -0.137988  0.916964  1.956230 -3.781647  ... -2.969472   \n",
       "TCGA-3C-AALK       0.324621  0.905816  2.152928 -3.894574  ... -2.558319   \n",
       "TCGA-4H-AAAK       0.028103  0.180184  2.515149 -3.885526  ... -2.889175   \n",
       "\n",
       "Hybridization REF     ZWINT      ZXDC    ZYG11A    ZYG11B       ZYX     ZZEF1  \\\n",
       "patient                                                                         \n",
       "TCGA-3C-AAAU      -4.132523 -1.308165 -1.034199  0.016935 -1.820233 -0.103662   \n",
       "TCGA-3C-AALI      -4.369032 -1.469365 -0.549876 -0.382967 -1.691887 -0.238022   \n",
       "TCGA-3C-AALJ      -4.488190 -1.419578 -0.637297 -0.292273 -1.902991 -0.100215   \n",
       "TCGA-3C-AALK      -4.341028 -1.213585 -0.523013 -0.309506 -1.824419 -0.081137   \n",
       "TCGA-4H-AAAK      -4.405580 -1.217950 -0.706284 -0.069670 -1.716283 -0.053464   \n",
       "\n",
       "Hybridization REF      ZZZ3  psiTPTE22      tAKR  \n",
       "patient                                           \n",
       "TCGA-3C-AAAU      -3.055084  -1.605783  0.036955  \n",
       "TCGA-3C-AALI      -2.879231  -2.360128  0.729981  \n",
       "TCGA-3C-AALJ      -3.132087  -1.567104  0.025686  \n",
       "TCGA-3C-AALK      -2.973455  -0.162004  1.103860  \n",
       "TCGA-4H-AAAK      -2.934908  -1.121575  1.545812  \n",
       "\n",
       "[5 rows x 20106 columns]"
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    },
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th>gene</th>\n",
       "      <th>unknown_100133144</th>\n",
       "      <th>unknown_100134869</th>\n",
       "      <th>unknown_10357</th>\n",
       "      <th>unknown_10431</th>\n",
       "      <th>unknown_155060</th>\n",
       "      <th>unknown_26823</th>\n",
       "      <th>unknown_340602</th>\n",
       "      <th>unknown_388795</th>\n",
       "      <th>unknown_390284</th>\n",
       "      <th>unknown_391343</th>\n",
       "      <th>...</th>\n",
       "      <th>ZWINTunknown_11130</th>\n",
       "      <th>ZXDAunknown_7789</th>\n",
       "      <th>ZXDBunknown_158586</th>\n",
       "      <th>ZXDCunknown_79364</th>\n",
       "      <th>ZYG11Aunknown_440590</th>\n",
       "      <th>ZYG11Bunknown_79699</th>\n",
       "      <th>ZYXunknown_7791</th>\n",
       "      <th>ZZEF1unknown_23140</th>\n",
       "      <th>ZZZ3unknown_26009</th>\n",
       "      <th>psiTPTE22unknown_387590</th>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>patient</th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>TCGA-3C-AAAU</th>\n",
       "      <td>4.032489</td>\n",
       "      <td>3.692829</td>\n",
       "      <td>5.704604</td>\n",
       "      <td>8.672694</td>\n",
       "      <td>10.213110</td>\n",
       "      <td>0.000000</td>\n",
       "      <td>0.785174</td>\n",
       "      <td>-1.536587</td>\n",
       "      <td>2.048201</td>\n",
       "      <td>0.000000</td>\n",
       "      <td>...</td>\n",
       "      <td>9.864120</td>\n",
       "      <td>7.017830</td>\n",
       "      <td>9.976968</td>\n",
       "      <td>10.695662</td>\n",
       "      <td>8.013988</td>\n",
       "      <td>10.238851</td>\n",
       "      <td>11.776124</td>\n",
       "      <td>10.887932</td>\n",
       "      <td>10.205129</td>\n",
       "      <td>0.785174</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>TCGA-3C-AALI</th>\n",
       "      <td>3.211931</td>\n",
       "      <td>4.119273</td>\n",
       "      <td>6.124231</td>\n",
       "      <td>9.139279</td>\n",
       "      <td>9.011343</td>\n",
       "      <td>0.121015</td>\n",
       "      <td>7.170928</td>\n",
       "      <td>2.291014</td>\n",
       "      <td>0.706022</td>\n",
       "      <td>3.027968</td>\n",
       "      <td>...</td>\n",
       "      <td>9.914682</td>\n",
       "      <td>5.902438</td>\n",
       "      <td>8.809329</td>\n",
       "      <td>10.391374</td>\n",
       "      <td>7.632831</td>\n",
       "      <td>9.237422</td>\n",
       "      <td>12.426428</td>\n",
       "      <td>10.364848</td>\n",
       "      <td>8.667973</td>\n",
       "      <td>9.855788</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>TCGA-3C-AALJ</th>\n",
       "      <td>3.538886</td>\n",
       "      <td>3.206237</td>\n",
       "      <td>7.269570</td>\n",
       "      <td>10.410275</td>\n",
       "      <td>9.209506</td>\n",
       "      <td>0.000000</td>\n",
       "      <td>0.000000</td>\n",
       "      <td>1.443554</td>\n",
       "      <td>1.443554</td>\n",
       "      <td>0.000000</td>\n",
       "      <td>...</td>\n",
       "      <td>11.305650</td>\n",
       "      <td>5.143969</td>\n",
       "      <td>9.060691</td>\n",
       "      <td>9.586488</td>\n",
       "      <td>8.374267</td>\n",
       "      <td>9.055784</td>\n",
       "      <td>12.414355</td>\n",
       "      <td>9.880935</td>\n",
       "      <td>8.992994</td>\n",
       "      <td>5.143969</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>TCGA-3C-AALK</th>\n",
       "      <td>3.595671</td>\n",
       "      <td>3.469873</td>\n",
       "      <td>7.168565</td>\n",
       "      <td>9.757450</td>\n",
       "      <td>9.110487</td>\n",
       "      <td>-1.273343</td>\n",
       "      <td>0.000000</td>\n",
       "      <td>1.048724</td>\n",
       "      <td>2.186215</td>\n",
       "      <td>0.000000</td>\n",
       "      <td>...</td>\n",
       "      <td>9.384994</td>\n",
       "      <td>5.782065</td>\n",
       "      <td>8.773906</td>\n",
       "      <td>9.754688</td>\n",
       "      <td>7.454703</td>\n",
       "      <td>9.246419</td>\n",
       "      <td>12.474556</td>\n",
       "      <td>9.609426</td>\n",
       "      <td>9.453001</td>\n",
       "      <td>6.057699</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>TCGA-4H-AAAK</th>\n",
       "      <td>2.775430</td>\n",
       "      <td>3.850979</td>\n",
       "      <td>6.395968</td>\n",
       "      <td>9.581922</td>\n",
       "      <td>8.027083</td>\n",
       "      <td>-1.232769</td>\n",
       "      <td>-1.232769</td>\n",
       "      <td>1.574683</td>\n",
       "      <td>1.574683</td>\n",
       "      <td>0.000000</td>\n",
       "      <td>...</td>\n",
       "      <td>9.397606</td>\n",
       "      <td>5.612830</td>\n",
       "      <td>8.728789</td>\n",
       "      <td>10.035881</td>\n",
       "      <td>3.811738</td>\n",
       "      <td>9.599438</td>\n",
       "      <td>11.980747</td>\n",
       "      <td>9.700292</td>\n",
       "      <td>9.784147</td>\n",
       "      <td>7.548699</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "<p>5 rows × 18321 columns</p>\n",
       "</div>"
      ],
      "text/plain": [
       "gene          unknown_100133144  unknown_100134869  unknown_10357  \\\n",
       "patient                                                             \n",
       "TCGA-3C-AAAU           4.032489           3.692829       5.704604   \n",
       "TCGA-3C-AALI           3.211931           4.119273       6.124231   \n",
       "TCGA-3C-AALJ           3.538886           3.206237       7.269570   \n",
       "TCGA-3C-AALK           3.595671           3.469873       7.168565   \n",
       "TCGA-4H-AAAK           2.775430           3.850979       6.395968   \n",
       "\n",
       "gene          unknown_10431  unknown_155060  unknown_26823  unknown_340602  \\\n",
       "patient                                                                      \n",
       "TCGA-3C-AAAU       8.672694       10.213110       0.000000        0.785174   \n",
       "TCGA-3C-AALI       9.139279        9.011343       0.121015        7.170928   \n",
       "TCGA-3C-AALJ      10.410275        9.209506       0.000000        0.000000   \n",
       "TCGA-3C-AALK       9.757450        9.110487      -1.273343        0.000000   \n",
       "TCGA-4H-AAAK       9.581922        8.027083      -1.232769       -1.232769   \n",
       "\n",
       "gene          unknown_388795  unknown_390284  unknown_391343  ...  \\\n",
       "patient                                                       ...   \n",
       "TCGA-3C-AAAU       -1.536587        2.048201        0.000000  ...   \n",
       "TCGA-3C-AALI        2.291014        0.706022        3.027968  ...   \n",
       "TCGA-3C-AALJ        1.443554        1.443554        0.000000  ...   \n",
       "TCGA-3C-AALK        1.048724        2.186215        0.000000  ...   \n",
       "TCGA-4H-AAAK        1.574683        1.574683        0.000000  ...   \n",
       "\n",
       "gene          ZWINTunknown_11130  ZXDAunknown_7789  ZXDBunknown_158586  \\\n",
       "patient                                                                  \n",
       "TCGA-3C-AAAU            9.864120          7.017830            9.976968   \n",
       "TCGA-3C-AALI            9.914682          5.902438            8.809329   \n",
       "TCGA-3C-AALJ           11.305650          5.143969            9.060691   \n",
       "TCGA-3C-AALK            9.384994          5.782065            8.773906   \n",
       "TCGA-4H-AAAK            9.397606          5.612830            8.728789   \n",
       "\n",
       "gene          ZXDCunknown_79364  ZYG11Aunknown_440590  ZYG11Bunknown_79699  \\\n",
       "patient                                                                      \n",
       "TCGA-3C-AAAU          10.695662              8.013988            10.238851   \n",
       "TCGA-3C-AALI          10.391374              7.632831             9.237422   \n",
       "TCGA-3C-AALJ           9.586488              8.374267             9.055784   \n",
       "TCGA-3C-AALK           9.754688              7.454703             9.246419   \n",
       "TCGA-4H-AAAK          10.035881              3.811738             9.599438   \n",
       "\n",
       "gene          ZYXunknown_7791  ZZEF1unknown_23140  ZZZ3unknown_26009  \\\n",
       "patient                                                                \n",
       "TCGA-3C-AAAU        11.776124           10.887932          10.205129   \n",
       "TCGA-3C-AALI        12.426428           10.364848           8.667973   \n",
       "TCGA-3C-AALJ        12.414355            9.880935           8.992994   \n",
       "TCGA-3C-AALK        12.474556            9.609426           9.453001   \n",
       "TCGA-4H-AAAK        11.980747            9.700292           9.784147   \n",
       "\n",
       "gene          psiTPTE22unknown_387590  \n",
       "patient                                \n",
       "TCGA-3C-AAAU                 0.785174  \n",
       "TCGA-3C-AALI                 9.855788  \n",
       "TCGA-3C-AALJ                 5.143969  \n",
       "TCGA-3C-AALK                 6.057699  \n",
       "TCGA-4H-AAAK                 7.548699  \n",
       "\n",
       "[5 rows x 18321 columns]"
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    },
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th>gene</th>\n",
       "      <th>hsa_let_7a_1</th>\n",
       "      <th>hsa_let_7a_2</th>\n",
       "      <th>hsa_let_7a_3</th>\n",
       "      <th>hsa_let_7b</th>\n",
       "      <th>hsa_let_7c</th>\n",
       "      <th>hsa_let_7d</th>\n",
       "      <th>hsa_let_7e</th>\n",
       "      <th>hsa_let_7f_1</th>\n",
       "      <th>hsa_let_7f_2</th>\n",
       "      <th>hsa_let_7g</th>\n",
       "      <th>...</th>\n",
       "      <th>hsa_mir_937</th>\n",
       "      <th>hsa_mir_939</th>\n",
       "      <th>hsa_mir_940</th>\n",
       "      <th>hsa_mir_942</th>\n",
       "      <th>hsa_mir_944</th>\n",
       "      <th>hsa_mir_95</th>\n",
       "      <th>hsa_mir_96</th>\n",
       "      <th>hsa_mir_98</th>\n",
       "      <th>hsa_mir_99a</th>\n",
       "      <th>hsa_mir_99b</th>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>patient</th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>TCGA-3C-AAAU</th>\n",
       "      <td>13.129765</td>\n",
       "      <td>14.117933</td>\n",
       "      <td>13.147714</td>\n",
       "      <td>14.595135</td>\n",
       "      <td>8.414890</td>\n",
       "      <td>8.665921</td>\n",
       "      <td>10.521777</td>\n",
       "      <td>3.879392</td>\n",
       "      <td>11.824817</td>\n",
       "      <td>8.597744</td>\n",
       "      <td>...</td>\n",
       "      <td>0.906699</td>\n",
       "      <td>-0.093302</td>\n",
       "      <td>2.672234</td>\n",
       "      <td>2.467414</td>\n",
       "      <td>1.044202</td>\n",
       "      <td>2.044202</td>\n",
       "      <td>6.906699</td>\n",
       "      <td>5.754696</td>\n",
       "      <td>7.024602</td>\n",
       "      <td>15.506461</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>TCGA-3C-AALI</th>\n",
       "      <td>12.918069</td>\n",
       "      <td>13.922300</td>\n",
       "      <td>12.913194</td>\n",
       "      <td>14.512657</td>\n",
       "      <td>9.646536</td>\n",
       "      <td>9.003653</td>\n",
       "      <td>9.131760</td>\n",
       "      <td>4.386952</td>\n",
       "      <td>12.678841</td>\n",
       "      <td>8.455144</td>\n",
       "      <td>...</td>\n",
       "      <td>1.579597</td>\n",
       "      <td>-0.083367</td>\n",
       "      <td>0.139024</td>\n",
       "      <td>3.032109</td>\n",
       "      <td>-0.668331</td>\n",
       "      <td>0.331670</td>\n",
       "      <td>5.912870</td>\n",
       "      <td>6.427066</td>\n",
       "      <td>7.885299</td>\n",
       "      <td>13.626182</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>TCGA-3C-AALJ</th>\n",
       "      <td>13.012033</td>\n",
       "      <td>14.010002</td>\n",
       "      <td>13.028483</td>\n",
       "      <td>13.419612</td>\n",
       "      <td>9.312455</td>\n",
       "      <td>9.276943</td>\n",
       "      <td>11.395711</td>\n",
       "      <td>5.314692</td>\n",
       "      <td>13.530255</td>\n",
       "      <td>9.230563</td>\n",
       "      <td>...</td>\n",
       "      <td>3.270298</td>\n",
       "      <td>-2.189134</td>\n",
       "      <td>0.395828</td>\n",
       "      <td>1.855261</td>\n",
       "      <td>-0.381778</td>\n",
       "      <td>0.717757</td>\n",
       "      <td>6.603657</td>\n",
       "      <td>6.878301</td>\n",
       "      <td>7.580704</td>\n",
       "      <td>15.013822</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>TCGA-3C-AALK</th>\n",
       "      <td>13.144697</td>\n",
       "      <td>14.141721</td>\n",
       "      <td>13.151281</td>\n",
       "      <td>14.667196</td>\n",
       "      <td>11.511431</td>\n",
       "      <td>8.384763</td>\n",
       "      <td>10.368981</td>\n",
       "      <td>4.159182</td>\n",
       "      <td>12.652559</td>\n",
       "      <td>8.471503</td>\n",
       "      <td>...</td>\n",
       "      <td>0.923965</td>\n",
       "      <td>-0.660997</td>\n",
       "      <td>-0.076034</td>\n",
       "      <td>1.798435</td>\n",
       "      <td>1.798435</td>\n",
       "      <td>0.798435</td>\n",
       "      <td>6.181354</td>\n",
       "      <td>5.377922</td>\n",
       "      <td>10.031619</td>\n",
       "      <td>14.554783</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>TCGA-4H-AAAK</th>\n",
       "      <td>13.411684</td>\n",
       "      <td>14.413518</td>\n",
       "      <td>13.420481</td>\n",
       "      <td>14.438548</td>\n",
       "      <td>11.693927</td>\n",
       "      <td>8.453747</td>\n",
       "      <td>10.741371</td>\n",
       "      <td>4.494537</td>\n",
       "      <td>13.009499</td>\n",
       "      <td>8.381220</td>\n",
       "      <td>...</td>\n",
       "      <td>0.182950</td>\n",
       "      <td>-0.624403</td>\n",
       "      <td>-1.624403</td>\n",
       "      <td>1.076036</td>\n",
       "      <td>0.182950</td>\n",
       "      <td>-0.302475</td>\n",
       "      <td>4.318110</td>\n",
       "      <td>5.103516</td>\n",
       "      <td>10.078201</td>\n",
       "      <td>14.650338</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "<p>5 rows × 503 columns</p>\n",
       "</div>"
      ],
      "text/plain": [
       "gene          hsa_let_7a_1  hsa_let_7a_2  hsa_let_7a_3  hsa_let_7b  \\\n",
       "patient                                                              \n",
       "TCGA-3C-AAAU     13.129765     14.117933     13.147714   14.595135   \n",
       "TCGA-3C-AALI     12.918069     13.922300     12.913194   14.512657   \n",
       "TCGA-3C-AALJ     13.012033     14.010002     13.028483   13.419612   \n",
       "TCGA-3C-AALK     13.144697     14.141721     13.151281   14.667196   \n",
       "TCGA-4H-AAAK     13.411684     14.413518     13.420481   14.438548   \n",
       "\n",
       "gene          hsa_let_7c  hsa_let_7d  hsa_let_7e  hsa_let_7f_1  hsa_let_7f_2  \\\n",
       "patient                                                                        \n",
       "TCGA-3C-AAAU    8.414890    8.665921   10.521777      3.879392     11.824817   \n",
       "TCGA-3C-AALI    9.646536    9.003653    9.131760      4.386952     12.678841   \n",
       "TCGA-3C-AALJ    9.312455    9.276943   11.395711      5.314692     13.530255   \n",
       "TCGA-3C-AALK   11.511431    8.384763   10.368981      4.159182     12.652559   \n",
       "TCGA-4H-AAAK   11.693927    8.453747   10.741371      4.494537     13.009499   \n",
       "\n",
       "gene          hsa_let_7g  ...  hsa_mir_937  hsa_mir_939  hsa_mir_940  \\\n",
       "patient                   ...                                          \n",
       "TCGA-3C-AAAU    8.597744  ...     0.906699    -0.093302     2.672234   \n",
       "TCGA-3C-AALI    8.455144  ...     1.579597    -0.083367     0.139024   \n",
       "TCGA-3C-AALJ    9.230563  ...     3.270298    -2.189134     0.395828   \n",
       "TCGA-3C-AALK    8.471503  ...     0.923965    -0.660997    -0.076034   \n",
       "TCGA-4H-AAAK    8.381220  ...     0.182950    -0.624403    -1.624403   \n",
       "\n",
       "gene          hsa_mir_942  hsa_mir_944  hsa_mir_95  hsa_mir_96  hsa_mir_98  \\\n",
       "patient                                                                      \n",
       "TCGA-3C-AAAU     2.467414     1.044202    2.044202    6.906699    5.754696   \n",
       "TCGA-3C-AALI     3.032109    -0.668331    0.331670    5.912870    6.427066   \n",
       "TCGA-3C-AALJ     1.855261    -0.381778    0.717757    6.603657    6.878301   \n",
       "TCGA-3C-AALK     1.798435     1.798435    0.798435    6.181354    5.377922   \n",
       "TCGA-4H-AAAK     1.076036     0.182950   -0.302475    4.318110    5.103516   \n",
       "\n",
       "gene          hsa_mir_99a  hsa_mir_99b  \n",
       "patient                                 \n",
       "TCGA-3C-AAAU     7.024602    15.506461  \n",
       "TCGA-3C-AALI     7.885299    13.626182  \n",
       "TCGA-3C-AALJ     7.580704    15.013822  \n",
       "TCGA-3C-AALK    10.031619    14.554783  \n",
       "TCGA-4H-AAAK    10.078201    14.650338  \n",
       "\n",
       "[5 rows x 503 columns]"
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    },
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>synchronous_malignancy</th>\n",
       "      <th>ajcc_pathologic_stage</th>\n",
       "      <th>days_to_diagnosis</th>\n",
       "      <th>laterality</th>\n",
       "      <th>created_datetime</th>\n",
       "      <th>last_known_disease_status</th>\n",
       "      <th>tissue_or_organ_of_origin</th>\n",
       "      <th>days_to_last_follow_up</th>\n",
       "      <th>age_at_diagnosis</th>\n",
       "      <th>primary_diagnosis</th>\n",
       "      <th>...</th>\n",
       "      <th>pathology_N_stage</th>\n",
       "      <th>pathology_M_stage</th>\n",
       "      <th>gender</th>\n",
       "      <th>date_of_initial_pathologic_diagnosis</th>\n",
       "      <th>days_to_last_known_alive</th>\n",
       "      <th>radiation_therapy</th>\n",
       "      <th>histological_type</th>\n",
       "      <th>number_of_lymph_nodes</th>\n",
       "      <th>race</th>\n",
       "      <th>ethnicity</th>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>patient</th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>TCGA-3C-AAAU</th>\n",
       "      <td>No</td>\n",
       "      <td>Stage X</td>\n",
       "      <td>0.0</td>\n",
       "      <td>Left</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>Breast, NOS</td>\n",
       "      <td>NaN</td>\n",
       "      <td>20211.0</td>\n",
       "      <td>Lobular carcinoma, NOS</td>\n",
       "      <td>...</td>\n",
       "      <td>nx</td>\n",
       "      <td>mx</td>\n",
       "      <td>female</td>\n",
       "      <td>2004</td>\n",
       "      <td>NaN</td>\n",
       "      <td>no</td>\n",
       "      <td>infiltrating lobular carcinoma</td>\n",
       "      <td>4</td>\n",
       "      <td>white</td>\n",
       "      <td>not hispanic or latino</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>TCGA-3C-AALI</th>\n",
       "      <td>No</td>\n",
       "      <td>Stage IIB</td>\n",
       "      <td>0.0</td>\n",
       "      <td>Right</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>Breast, NOS</td>\n",
       "      <td>NaN</td>\n",
       "      <td>18538.0</td>\n",
       "      <td>Infiltrating duct carcinoma, NOS</td>\n",
       "      <td>...</td>\n",
       "      <td>n1a</td>\n",
       "      <td>m0</td>\n",
       "      <td>female</td>\n",
       "      <td>2003</td>\n",
       "      <td>NaN</td>\n",
       "      <td>yes</td>\n",
       "      <td>infiltrating ductal carcinoma</td>\n",
       "      <td>1</td>\n",
       "      <td>black or african american</td>\n",
       "      <td>not hispanic or latino</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>TCGA-3C-AALJ</th>\n",
       "      <td>No</td>\n",
       "      <td>Stage IIB</td>\n",
       "      <td>0.0</td>\n",
       "      <td>Right</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>Breast, NOS</td>\n",
       "      <td>NaN</td>\n",
       "      <td>22848.0</td>\n",
       "      <td>Infiltrating duct carcinoma, NOS</td>\n",
       "      <td>...</td>\n",
       "      <td>n1a</td>\n",
       "      <td>m0</td>\n",
       "      <td>female</td>\n",
       "      <td>2011</td>\n",
       "      <td>NaN</td>\n",
       "      <td>no</td>\n",
       "      <td>infiltrating ductal carcinoma</td>\n",
       "      <td>1</td>\n",
       "      <td>black or african american</td>\n",
       "      <td>not hispanic or latino</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>TCGA-3C-AALK</th>\n",
       "      <td>No</td>\n",
       "      <td>Stage IA</td>\n",
       "      <td>0.0</td>\n",
       "      <td>Right</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>Breast, NOS</td>\n",
       "      <td>NaN</td>\n",
       "      <td>19074.0</td>\n",
       "      <td>Infiltrating duct carcinoma, NOS</td>\n",
       "      <td>...</td>\n",
       "      <td>n0 (i+)</td>\n",
       "      <td>m0</td>\n",
       "      <td>female</td>\n",
       "      <td>2011</td>\n",
       "      <td>NaN</td>\n",
       "      <td>no</td>\n",
       "      <td>infiltrating ductal carcinoma</td>\n",
       "      <td>0</td>\n",
       "      <td>black or african american</td>\n",
       "      <td>not hispanic or latino</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>TCGA-4H-AAAK</th>\n",
       "      <td>No</td>\n",
       "      <td>Stage IIIA</td>\n",
       "      <td>0.0</td>\n",
       "      <td>Left</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>Breast, NOS</td>\n",
       "      <td>NaN</td>\n",
       "      <td>18371.0</td>\n",
       "      <td>Lobular carcinoma, NOS</td>\n",
       "      <td>...</td>\n",
       "      <td>n2a</td>\n",
       "      <td>m0</td>\n",
       "      <td>female</td>\n",
       "      <td>2013</td>\n",
       "      <td>NaN</td>\n",
       "      <td>no</td>\n",
       "      <td>infiltrating lobular carcinoma</td>\n",
       "      <td>4</td>\n",
       "      <td>white</td>\n",
       "      <td>not hispanic or latino</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "<p>5 rows × 118 columns</p>\n",
       "</div>"
      ],
      "text/plain": [
       "             synchronous_malignancy ajcc_pathologic_stage  days_to_diagnosis  \\\n",
       "patient                                                                        \n",
       "TCGA-3C-AAAU                     No               Stage X                0.0   \n",
       "TCGA-3C-AALI                     No             Stage IIB                0.0   \n",
       "TCGA-3C-AALJ                     No             Stage IIB                0.0   \n",
       "TCGA-3C-AALK                     No              Stage IA                0.0   \n",
       "TCGA-4H-AAAK                     No            Stage IIIA                0.0   \n",
       "\n",
       "             laterality  created_datetime last_known_disease_status  \\\n",
       "patient                                                               \n",
       "TCGA-3C-AAAU       Left               NaN                       NaN   \n",
       "TCGA-3C-AALI      Right               NaN                       NaN   \n",
       "TCGA-3C-AALJ      Right               NaN                       NaN   \n",
       "TCGA-3C-AALK      Right               NaN                       NaN   \n",
       "TCGA-4H-AAAK       Left               NaN                       NaN   \n",
       "\n",
       "             tissue_or_organ_of_origin  days_to_last_follow_up  \\\n",
       "patient                                                          \n",
       "TCGA-3C-AAAU               Breast, NOS                     NaN   \n",
       "TCGA-3C-AALI               Breast, NOS                     NaN   \n",
       "TCGA-3C-AALJ               Breast, NOS                     NaN   \n",
       "TCGA-3C-AALK               Breast, NOS                     NaN   \n",
       "TCGA-4H-AAAK               Breast, NOS                     NaN   \n",
       "\n",
       "              age_at_diagnosis                 primary_diagnosis  ...  \\\n",
       "patient                                                           ...   \n",
       "TCGA-3C-AAAU           20211.0            Lobular carcinoma, NOS  ...   \n",
       "TCGA-3C-AALI           18538.0  Infiltrating duct carcinoma, NOS  ...   \n",
       "TCGA-3C-AALJ           22848.0  Infiltrating duct carcinoma, NOS  ...   \n",
       "TCGA-3C-AALK           19074.0  Infiltrating duct carcinoma, NOS  ...   \n",
       "TCGA-4H-AAAK           18371.0            Lobular carcinoma, NOS  ...   \n",
       "\n",
       "             pathology_N_stage pathology_M_stage  gender  \\\n",
       "patient                                                    \n",
       "TCGA-3C-AAAU                nx                mx  female   \n",
       "TCGA-3C-AALI               n1a                m0  female   \n",
       "TCGA-3C-AALJ               n1a                m0  female   \n",
       "TCGA-3C-AALK           n0 (i+)                m0  female   \n",
       "TCGA-4H-AAAK               n2a                m0  female   \n",
       "\n",
       "              date_of_initial_pathologic_diagnosis days_to_last_known_alive  \\\n",
       "patient                                                                       \n",
       "TCGA-3C-AAAU                                  2004                      NaN   \n",
       "TCGA-3C-AALI                                  2003                      NaN   \n",
       "TCGA-3C-AALJ                                  2011                      NaN   \n",
       "TCGA-3C-AALK                                  2011                      NaN   \n",
       "TCGA-4H-AAAK                                  2013                      NaN   \n",
       "\n",
       "             radiation_therapy               histological_type  \\\n",
       "patient                                                          \n",
       "TCGA-3C-AAAU                no  infiltrating lobular carcinoma   \n",
       "TCGA-3C-AALI               yes   infiltrating ductal carcinoma   \n",
       "TCGA-3C-AALJ                no   infiltrating ductal carcinoma   \n",
       "TCGA-3C-AALK                no   infiltrating ductal carcinoma   \n",
       "TCGA-4H-AAAK                no  infiltrating lobular carcinoma   \n",
       "\n",
       "              number_of_lymph_nodes                       race  \\\n",
       "patient                                                          \n",
       "TCGA-3C-AAAU                      4                      white   \n",
       "TCGA-3C-AALI                      1  black or african american   \n",
       "TCGA-3C-AALJ                      1  black or african american   \n",
       "TCGA-3C-AALK                      0  black or african american   \n",
       "TCGA-4H-AAAK                      4                      white   \n",
       "\n",
       "                           ethnicity  \n",
       "patient                               \n",
       "TCGA-3C-AAAU  not hispanic or latino  \n",
       "TCGA-3C-AALI  not hispanic or latino  \n",
       "TCGA-3C-AALJ  not hispanic or latino  \n",
       "TCGA-3C-AALK  not hispanic or latino  \n",
       "TCGA-4H-AAAK  not hispanic or latino  \n",
       "\n",
       "[5 rows x 118 columns]"
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    },
    {
     "data": {
      "text/plain": [
       "pam50\n",
       "3        419\n",
       "4        140\n",
       "1        130\n",
       "2         46\n",
       "0         34\n",
       "Name: count, dtype: int64"
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    }
   ],
   "source": [
    "import numpy as np\n",
    "import pandas as pd\n",
    "\n",
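    "def beta_to_m(df, eps=1e-6):\n",
    "    \"\"\"Convert methylation beta values to M-values: M = log2(beta / (1 - beta)).\"\"\"\n",
    "    # Clip betas away from exactly 0 and 1 so the log ratio stays finite\n",
    "    B = np.clip(df.values, eps, 1.0 - eps)\n",
    "    M = np.log2(B / (1 - B))\n",
    "    return pd.DataFrame(M, index=df.index, columns=df.columns)\n",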
    "\n",
    "# count rows that are all zeros\n",
    "zeros_meth = (meth == 0).all(axis=1).sum()\n",
    "zeros_rna = (rna == 0).all(axis=1).sum()\n",
    "zeros_mirna = (mirna == 0).all(axis=1).sum()\n",
    "print(f\"All zeros: meth: {zeros_meth}, rna: {zeros_rna}, mirna: {zeros_mirna}\")\n",
    "\n",
    "# find rows with all nans\n",
    "nan_meth = meth.isna().all(axis=1).sum()\n",
    "nan_rna = rna.isna().all(axis=1).sum()\n",
    "nan_mirna = mirna.isna().all(axis=1).sum()\n",
    "nan_clinical = clinical.isna().all(axis=1).sum()\n",
    "nan_pam50 = pam50.isna().all(axis=1).sum()\n",
    "print(f\"nan_meth: {nan_meth}, nan_rna: {nan_rna}, nan_mirna: {nan_mirna}, nan_clinical: {nan_clinical}, nan_pam50: {nan_pam50}\")\n",
    "\n",
    "# map PAM50 subtypes to integers\n",
    "mapping = {\"Normal\":0, \"Basal\":1, \"Her2\":2, \"LumA\":3, \"LumB\":4}\n",
    "pam50 = pam50[\"BRCA_Subtype_PAM50\"].map(mapping).to_frame(name=\"pam50\")\n",
    "\n",
    "# drop and transform methylation\n",
    "meth_clean = meth.drop(columns=[\"Composite Element REF\"], errors=\"ignore\")\n",
    "meth_m = beta_to_m(meth_clean)\n",
    "clinical = clinical.drop(columns=[\"project\"], errors=\"ignore\")\n",
    "\n",
    "# clean column names and fill nans\n",
    "for df in [meth_m, rna, mirna]:\n",
    "    df.columns = df.columns.str.replace(r\"\\?\\|\", \"unknown_\", regex=True)\n",
    "    df.columns = df.columns.str.replace(r\"[?|]\", \"unknown_\", regex=True)\n",
    "    df.columns = df.columns.str.replace(\"-\", \"_\", regex=False)\n",
    "    df.columns = df.columns.str.replace(r\"_+\", \"_\", regex=True)\n",
    "    df.fillna(0, inplace=True)\n",
    "\n",
    "# check for NaNs after filling (order: meth, rna, mirna, clinical, pam50)\n",
    "print(\"NaN counts after filling:\")\n",
    "print(\n",
    "    meth_m.isna().sum().sum(),\n",
    "    rna.isna().sum().sum(),\n",
    "    mirna.isna().sum().sum(),\n",
    "    clinical.isna().sum().sum(),\n",
    "    pam50.isna().sum().sum(),\n",
    ")\n",
    "\n",
    "# align index to PAM50\n",
    "X_meth = meth_m.loc[pam50.index]\n",
    "X_rna = rna.loc[pam50.index]\n",
    "X_mirna = mirna.loc[pam50.index]\n",
    "clinical = clinical.loc[pam50.index]\n",
    "\n",
    "print(f\"new shapes: meth: {X_meth.shape}, rna: {X_rna.shape}, mirna: {X_mirna.shape}, pam50: {pam50.shape}, clinical: {clinical.shape}\")\n",
    "display(X_meth.head())\n",
    "display(X_rna.head())\n",
    "display(X_mirna.head())\n",
    "display(clinical.head())\n",
    "display(pam50.value_counts())"
   ]
  },
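  {
   "cell_type": "markdown",
   "id": "b7e0a1c4",
   "metadata": {},
   "source": [
    "The `beta_to_m` helper above applies the logit-style transform M = log2(beta / (1 - beta)) to the methylation beta values. The sketch below repeats that transform on a small made-up array and adds a hypothetical inverse, `m_to_beta`, which is not part of the notebook, to show that the mapping is reversible away from the clipped endpoints.\n",
    "\n",
    "```python\n",
    "import numpy as np\n",
    "\n",
    "def beta_to_m(beta, eps=1e-6):\n",
    "    # Clip away exact 0 and 1 so the log ratio stays finite.\n",
    "    b = np.clip(np.asarray(beta, dtype=float), eps, 1.0 - eps)\n",
    "    return np.log2(b / (1.0 - b))\n",
    "\n",
    "def m_to_beta(m):\n",
    "    # Hypothetical inverse for illustration: beta = 2^M / (2^M + 1).\n",
    "    m = np.asarray(m, dtype=float)\n",
    "    return 2.0 ** m / (2.0 ** m + 1.0)\n",
    "\n",
    "# Made-up beta values, including the two endpoints that need clipping.\n",
    "beta = np.array([0.0, 0.2, 0.5, 0.8, 1.0])\n",
    "m = beta_to_m(beta)\n",
    "beta_back = m_to_beta(m)\n",
    "\n",
    "print(m)\n",
    "print(beta_back)\n",
    "```\n",
    "\n",
    "Beta values of 0.2, 0.5, and 0.8 map to M-values of -2, 0, and 2, while exact 0 and 1 are clipped and land at finite values of roughly -19.9 and +19.9 instead of negative or positive infinity."
   ]
  },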
  {
   "cell_type": "code",
   "execution_count": 6,
   "id": "2f0714e8",
   "metadata": {},
   "outputs": [],
   "source": [
    "# Set a common index name for all tables, then save them to CSV\n",
    "X_meth.index.name = \"patient\"\n",
    "X_rna.index.name = \"patient\"\n",
    "X_mirna.index.name = \"patient\"\n",
    "pam50.index.name = \"patient\"\n",
    "clinical.index.name = \"patient\"\n",
    "\n",
    "X_meth.to_csv(root / \"meth.csv\", index=True)\n",
    "X_rna.to_csv(root / \"rna.csv\", index=True)\n",
    "X_mirna.to_csv(root / \"mirna.csv\", index=True)\n",
    "pam50.to_csv(root / \"pam50.csv\", index=True)\n",
    "clinical.to_csv(root / \"clinical.csv\", index=True)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "a387bfa4",
   "metadata": {},
   "source": [
    "# Optional: Load the data we just saved to make sure it looks okay."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "ef2982ef",
   "metadata": {},
   "outputs": [],
   "source": [
    "meth = pd.read_csv(root / \"meth.csv\", index_col=0)\n",
    "rna = pd.read_csv(root / \"rna.csv\", index_col=0)\n",
    "mirna = pd.read_csv(root / \"mirna.csv\", index_col=0)\n",
    "pam50 = pd.read_csv(root / \"pam50.csv\", index_col=0)\n",
    "clinical = pd.read_csv(root / \"clinical.csv\", index_col=0)\n",
    "\n",
    "display(meth.head())\n",
    "display(rna.head())\n",
    "display(mirna.head())\n",
    "display(clinical.head())\n",
    "display(pam50.head())"
   ]
  },
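  {
   "cell_type": "markdown",
   "id": "d3f19a52",
   "metadata": {},
   "source": [
    "One way to make the reload check explicit is to assert that a table survives the CSV round trip unchanged. The sketch below uses a small made-up table and a temporary directory rather than the real files under `root`; the column names `f1` and `f2` and their values are hypothetical.\n",
    "\n",
    "```python\n",
    "import tempfile\n",
    "from pathlib import Path\n",
    "\n",
    "import numpy as np\n",
    "import pandas as pd\n",
    "\n",
    "# Tiny stand-in table indexed by patient barcode (values are made up).\n",
    "df = pd.DataFrame(\n",
    "    {'f1': [0.1, 0.2], 'f2': [1.5, -2.0]},\n",
    "    index=pd.Index(['TCGA-3C-AAAU', 'TCGA-3C-AALI'], name='patient'),\n",
    ")\n",
    "\n",
    "# Write and read back the same way the notebook does.\n",
    "with tempfile.TemporaryDirectory() as tmp:\n",
    "    path = Path(tmp) / 'demo.csv'\n",
    "    df.to_csv(path, index=True)\n",
    "    reloaded = pd.read_csv(path, index_col=0)\n",
    "\n",
    "# Shape, patient order, and numeric values should all be preserved.\n",
    "assert reloaded.shape == df.shape\n",
    "assert reloaded.index.equals(df.index)\n",
    "assert np.allclose(reloaded.values, df.values)\n",
    "print('round trip OK')\n",
    "```\n",
    "\n",
    "The same three assertions could be applied to each numeric table from this notebook, for example comparing the reloaded `meth` against the in-memory `X_meth`."
   ]
  },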
  {
   "cell_type": "markdown",
   "id": "c2168980",
   "metadata": {},
   "source": [
    "# Easy Access via DatasetLoader\n",
    "\n",
    "To make this data easier to work with, it is available through our `DatasetLoader` component. If you have additional preprocessed or raw datasets that you would like included, feel free to reach out; we are happy to help add new datasets to the platform."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 8,
   "id": "2d0340f6",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "TCGA BRCA dataset shape: {'mirna': (769, 503), 'pam50': (769, 1), 'clinical': (769, 118), 'meth': (769, 20106), 'rna': (769, 18321)}\n"
     ]
    }
   ],
   "source": [
    "from bioneuralnet.datasets import DatasetLoader\n",
    "\n",
    "tgca_brca = DatasetLoader(\"brca\")\n",
    "\n",
    "print(f\"TCGA BRCA dataset shape: {tgca_brca.shape}\")\n",
    "brca_meth = tgca_brca.data[\"meth\"]\n",
    "brca_rna = tgca_brca.data[\"rna\"]\n",
    "brca_mirna = tgca_brca.data[\"mirna\"]\n",
    "brca_clinical = tgca_brca.data[\"clinical\"]\n",
    "brca_pam50 = tgca_brca.data[\"pam50\"]\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "338dc995",
   "metadata": {},
   "outputs": [
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "2025-05-16 10:31:09,364 - bioneuralnet.utils.preprocess - INFO - [Inf]: Replaced 0 infinite values\n",
      "2025-05-16 10:31:09,365 - bioneuralnet.utils.preprocess - INFO - [NaN]: Replaced 31384 NaNs after median imputation\n",
      "2025-05-16 10:31:09,365 - bioneuralnet.utils.preprocess - INFO - [Zero-Var]: 39 columns dropped due to zero variance\n"
     ]
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "RNA shape: (769, 18321)\n",
      "METH shape: (769, 20106)\n",
      "miRNA shape: (769, 503)\n",
      "Clinical shape: (769, 118)\n",
      "Phenotype shape: (769, 1)\n",
      "Phenotype counts:\n",
      "pam50\n",
      "3        419\n",
      "4        140\n",
      "1        130\n",
      "2         46\n",
      "0         34\n",
      "Name: count, dtype: int64\n",
      "\n",
      "RNA:\n",
      "Min: -8.5873\n",
      "Max: 20.9784\n",
      "\n",
      "METH:\n",
      "Min: -7.1642\n",
      "Max: 6.9710\n",
      "\n",
      "MIRNA:\n",
      "Min: -4.4631\n",
      "Max: 19.3838\n",
      "Nan values in pam50 0\n"
     ]
    },
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "2025-05-16 10:31:09,752 - bioneuralnet.utils.preprocess - INFO - Selected top 15 features by RandomForest importance\n"
     ]
    },
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>days_to_birth</th>\n",
       "      <th>age_at_diagnosis</th>\n",
       "      <th>days_to_last_followup</th>\n",
       "      <th>age_at_index</th>\n",
       "      <th>years_to_birth</th>\n",
       "      <th>year_of_diagnosis</th>\n",
       "      <th>number_of_lymph_nodes</th>\n",
       "      <th>date_of_initial_pathologic_diagnosis</th>\n",
       "      <th>histological_type_infiltrating lobular carcinoma</th>\n",
       "      <th>primary_diagnosis_Lobular carcinoma, NOS</th>\n",
       "      <th>morphology_8520/3</th>\n",
       "      <th>race.1_white</th>\n",
       "      <th>days_to_death.1</th>\n",
       "      <th>laterality_Right</th>\n",
       "      <th>primary_diagnosis_Infiltrating duct carcinoma, NOS</th>\n",
       "      <th>country_of_residence_at_enrollment_United States</th>\n",
       "      <th>sites_of_involvement_Breast, NOS</th>\n",
       "      <th>days_to_death</th>\n",
       "      <th>race_white</th>\n",
       "      <th>ajcc_staging_system_edition_6th</th>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>patient</th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>TCGA-3C-AAAU</th>\n",
       "      <td>-20211.0</td>\n",
       "      <td>20211.0</td>\n",
       "      <td>4047.0</td>\n",
       "      <td>55.0</td>\n",
       "      <td>55.0</td>\n",
       "      <td>-1.50</td>\n",
       "      <td>1.5</td>\n",
       "      <td>-1.50</td>\n",
       "      <td>True</td>\n",
       "      <td>True</td>\n",
       "      <td>True</td>\n",
       "      <td>True</td>\n",
       "      <td>0.0</td>\n",
       "      <td>False</td>\n",
       "      <td>False</td>\n",
       "      <td>True</td>\n",
       "      <td>False</td>\n",
       "      <td>0.0</td>\n",
       "      <td>True</td>\n",
       "      <td>True</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>TCGA-3C-AALI</th>\n",
       "      <td>-18538.0</td>\n",
       "      <td>18538.0</td>\n",
       "      <td>4005.0</td>\n",
       "      <td>50.0</td>\n",
       "      <td>50.0</td>\n",
       "      <td>-1.75</td>\n",
       "      <td>0.0</td>\n",
       "      <td>-1.75</td>\n",
       "      <td>False</td>\n",
       "      <td>False</td>\n",
       "      <td>False</td>\n",
       "      <td>False</td>\n",
       "      <td>0.0</td>\n",
       "      <td>True</td>\n",
       "      <td>True</td>\n",
       "      <td>True</td>\n",
       "      <td>False</td>\n",
       "      <td>0.0</td>\n",
       "      <td>False</td>\n",
       "      <td>True</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>TCGA-3C-AALJ</th>\n",
       "      <td>-22848.0</td>\n",
       "      <td>22848.0</td>\n",
       "      <td>1474.0</td>\n",
       "      <td>62.0</td>\n",
       "      <td>62.0</td>\n",
       "      <td>0.25</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.25</td>\n",
       "      <td>False</td>\n",
       "      <td>False</td>\n",
       "      <td>False</td>\n",
       "      <td>False</td>\n",
       "      <td>0.0</td>\n",
       "      <td>True</td>\n",
       "      <td>True</td>\n",
       "      <td>True</td>\n",
       "      <td>True</td>\n",
       "      <td>0.0</td>\n",
       "      <td>False</td>\n",
       "      <td>False</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>TCGA-3C-AALK</th>\n",
       "      <td>-19074.0</td>\n",
       "      <td>19074.0</td>\n",
       "      <td>1448.0</td>\n",
       "      <td>52.0</td>\n",
       "      <td>52.0</td>\n",
       "      <td>0.25</td>\n",
       "      <td>-0.5</td>\n",
       "      <td>0.25</td>\n",
       "      <td>False</td>\n",
       "      <td>False</td>\n",
       "      <td>False</td>\n",
       "      <td>False</td>\n",
       "      <td>0.0</td>\n",
       "      <td>True</td>\n",
       "      <td>True</td>\n",
       "      <td>True</td>\n",
       "      <td>True</td>\n",
       "      <td>0.0</td>\n",
       "      <td>False</td>\n",
       "      <td>False</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>TCGA-4H-AAAK</th>\n",
       "      <td>-18371.0</td>\n",
       "      <td>18371.0</td>\n",
       "      <td>348.0</td>\n",
       "      <td>50.0</td>\n",
       "      <td>50.0</td>\n",
       "      <td>0.75</td>\n",
       "      <td>1.5</td>\n",
       "      <td>0.75</td>\n",
       "      <td>True</td>\n",
       "      <td>True</td>\n",
       "      <td>True</td>\n",
       "      <td>True</td>\n",
       "      <td>0.0</td>\n",
       "      <td>False</td>\n",
       "      <td>False</td>\n",
       "      <td>False</td>\n",
       "      <td>False</td>\n",
       "      <td>0.0</td>\n",
       "      <td>True</td>\n",
       "      <td>False</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "              days_to_birth  age_at_diagnosis  days_to_last_followup  \\\n",
       "patient                                                                \n",
       "TCGA-3C-AAAU       -20211.0           20211.0                 4047.0   \n",
       "TCGA-3C-AALI       -18538.0           18538.0                 4005.0   \n",
       "TCGA-3C-AALJ       -22848.0           22848.0                 1474.0   \n",
       "TCGA-3C-AALK       -19074.0           19074.0                 1448.0   \n",
       "TCGA-4H-AAAK       -18371.0           18371.0                  348.0   \n",
       "\n",
       "              age_at_index  years_to_birth  year_of_diagnosis  \\\n",
       "patient                                                         \n",
       "TCGA-3C-AAAU          55.0            55.0              -1.50   \n",
       "TCGA-3C-AALI          50.0            50.0              -1.75   \n",
       "TCGA-3C-AALJ          62.0            62.0               0.25   \n",
       "TCGA-3C-AALK          52.0            52.0               0.25   \n",
       "TCGA-4H-AAAK          50.0            50.0               0.75   \n",
       "\n",
       "              number_of_lymph_nodes  date_of_initial_pathologic_diagnosis  \\\n",
       "patient                                                                     \n",
       "TCGA-3C-AAAU                    1.5                                 -1.50   \n",
       "TCGA-3C-AALI                    0.0                                 -1.75   \n",
       "TCGA-3C-AALJ                    0.0                                  0.25   \n",
       "TCGA-3C-AALK                   -0.5                                  0.25   \n",
       "TCGA-4H-AAAK                    1.5                                  0.75   \n",
       "\n",
       "              histological_type_infiltrating lobular carcinoma  \\\n",
       "patient                                                          \n",
       "TCGA-3C-AAAU                                              True   \n",
       "TCGA-3C-AALI                                             False   \n",
       "TCGA-3C-AALJ                                             False   \n",
       "TCGA-3C-AALK                                             False   \n",
       "TCGA-4H-AAAK                                              True   \n",
       "\n",
       "              primary_diagnosis_Lobular carcinoma, NOS  morphology_8520/3  \\\n",
       "patient                                                                     \n",
       "TCGA-3C-AAAU                                      True               True   \n",
       "TCGA-3C-AALI                                     False              False   \n",
       "TCGA-3C-AALJ                                     False              False   \n",
       "TCGA-3C-AALK                                     False              False   \n",
       "TCGA-4H-AAAK                                      True               True   \n",
       "\n",
       "              race.1_white  days_to_death.1  laterality_Right  \\\n",
       "patient                                                         \n",
       "TCGA-3C-AAAU          True              0.0             False   \n",
       "TCGA-3C-AALI         False              0.0              True   \n",
       "TCGA-3C-AALJ         False              0.0              True   \n",
       "TCGA-3C-AALK         False              0.0              True   \n",
       "TCGA-4H-AAAK          True              0.0             False   \n",
       "\n",
       "              primary_diagnosis_Infiltrating duct carcinoma, NOS  \\\n",
       "patient                                                            \n",
       "TCGA-3C-AAAU                                              False    \n",
       "TCGA-3C-AALI                                               True    \n",
       "TCGA-3C-AALJ                                               True    \n",
       "TCGA-3C-AALK                                               True    \n",
       "TCGA-4H-AAAK                                              False    \n",
       "\n",
       "              country_of_residence_at_enrollment_United States  \\\n",
       "patient                                                          \n",
       "TCGA-3C-AAAU                                              True   \n",
       "TCGA-3C-AALI                                              True   \n",
       "TCGA-3C-AALJ                                              True   \n",
       "TCGA-3C-AALK                                              True   \n",
       "TCGA-4H-AAAK                                             False   \n",
       "\n",
       "              sites_of_involvement_Breast, NOS  days_to_death  race_white  \\\n",
       "patient                                                                     \n",
       "TCGA-3C-AAAU                             False            0.0        True   \n",
       "TCGA-3C-AALI                             False            0.0       False   \n",
       "TCGA-3C-AALJ                              True            0.0       False   \n",
       "TCGA-3C-AALK                              True            0.0       False   \n",
       "TCGA-4H-AAAK                             False            0.0        True   \n",
       "\n",
       "              ajcc_staging_system_edition_6th  \n",
       "patient                                        \n",
       "TCGA-3C-AAAU                             True  \n",
       "TCGA-3C-AALI                             True  \n",
       "TCGA-3C-AALJ                            False  \n",
       "TCGA-3C-AALK                            False  \n",
       "TCGA-4H-AAAK                            False  "
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    }
   ],
   "source": [
    "from bioneuralnet.utils.preprocess import preprocess_clinical\n",
    "\n",
    "#shapes\n",
    "print(f\"RNA shape: {brca_rna.shape}\")\n",
    "print(f\"METH shape: {brca_meth.shape}\")\n",
    "print(f\"miRNA shape: {brca_mirna.shape}\")\n",
    "print(f\"Clinical shape: {brca_clinical.shape}\")\n",
    "print(f\"Phenotype shape: {brca_pam50.shape}\")\n",
    "print(f\"Phenotype counts:\\n{brca_pam50.value_counts()}\")\n",
    "\n",
    "# review min and max values from the datasets\n",
    "for name, df in {\"rna\": brca_rna, \"meth\": brca_meth, \"mirna\": brca_mirna}.items():\n",
    "    min_val = df.min().min()\n",
    "    max_val = df.max().max()\n",
    "    print(f\"\\n{name.upper()}:\")\n",
    "    print(f\"Min: {min_val:.4f}\")\n",
    "    print(f\"Max: {max_val:.4f}\")\n",
    "\n",
    "#check nans in pam50\n",
    "print(f\"Nan values in pam50 {brca_pam50.isna().sum().sum()}\")\n",
    "\n",
    "brca_pam50 = brca_pam50.dropna()\n",
    "X_rna = brca_rna.loc[brca_pam50.index]\n",
    "X_meth = brca_meth.loc[brca_pam50.index]\n",
    "X_mirna = brca_mirna.loc[brca_pam50.index]\n",
    "clinical = brca_clinical.loc[brca_pam50.index]\n",
    "\n",
    "# for more details on the preprocessing function, see bioneuralnet.utils.preprocess\n",
    "clinical = preprocess_clinical(clinical, brca_pam50, top_k=15, scale=True, ignore_columns=[\"days_to_birth\", \"age_at_diagnosis\", \"days_to_last_followup\", \"age_at_index\", \"years_to_birth\"])\n",
    "display(clinical.head())"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "89cb8500",
   "metadata": {},
   "source": [
    "# Preparing Multi-Omics Data for Downstream Tasks\n",
    "\n",
    "1. Check sample overlap.\n",
    "\n",
    "    - Confirms that the RNA, methylation, miRNA, and clinical tables all share the same patients as the PAM50 labels.\n",
    "\n",
    "2. Select top features.\n",
    "\n",
    "    - Uses an ANOVA F-test to select the features most relevant to PAM50 classification from each omics dataset (up to 1,000 each for methylation and RNA, and all 503 for miRNA).\n",
    "\n",
    "3. Combine datasets.\n",
    "\n",
    "    - Selected features from methylation, RNA, and miRNA are concatenated column-wise into a single matrix (769 × 2,503).\n",
    "\n",
    "4. Clean missing values.\n",
    "\n",
    "    - Counts missing (NaN) values in the combined matrix and drops any samples that contain them.\n",
    "\n",
    "5. Build similarity graph.\n",
    "\n",
    "    - Builds a k-nearest-neighbors graph (k = 15) from the transposed feature matrix, so each node is a feature rather than a patient.\n",
    "\n",
    "Note: For more details on the preprocessing functions and graph-generation algorithms, see the [Utils documentation](https://bioneuralnet.readthedocs.io/en/latest/utils.html)."
   ]
  },
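  {
   "cell_type": "markdown",
   "id": "a94c6e07",
   "metadata": {},
   "source": [
    "The feature-selection and graph steps listed above can be sketched with scikit-learn on synthetic data. This is a minimal illustration under stated assumptions, not BioNeuralNet's implementation: `SelectKBest` with `f_classif` stands in for `top_anova_f_features`, `kneighbors_graph` stands in for `gen_similarity_graph`, and the matrix sizes, class labels, and `feat_*` names are made up for the example.\n",
    "\n",
    "```python\n",
    "import numpy as np\n",
    "import pandas as pd\n",
    "from sklearn.feature_selection import SelectKBest, f_classif\n",
    "from sklearn.neighbors import kneighbors_graph\n",
    "\n",
    "rng = np.random.default_rng(0)\n",
    "\n",
    "# Synthetic stand-in for one omics table: 60 samples x 40 features, 3 classes.\n",
    "data = rng.normal(size=(60, 40))\n",
    "y = np.repeat([0, 1, 2], 20)\n",
    "\n",
    "# Make the first five features informative by shifting them per class.\n",
    "data[:, :5] += 2.0 * y[:, None]\n",
    "X = pd.DataFrame(data, columns=[f'feat_{i}' for i in range(40)])\n",
    "\n",
    "# Step 2: keep the 10 features with the largest ANOVA F-statistic.\n",
    "selector = SelectKBest(score_func=f_classif, k=10).fit(X, y)\n",
    "X_sel = X.loc[:, selector.get_support()]\n",
    "\n",
    "# Step 5: transpose so each feature is a node, then connect every\n",
    "# feature to its 3 nearest neighbours in sample space.\n",
    "knn = kneighbors_graph(X_sel.T.values, n_neighbors=3,\n",
    "                       mode='connectivity', include_self=False)\n",
    "\n",
    "# Symmetrise so the adjacency matrix describes an undirected graph.\n",
    "A = knn.maximum(knn.T).toarray()\n",
    "\n",
    "print(X_sel.shape)\n",
    "print(A.shape)\n",
    "```\n",
    "\n",
    "Transposing before building the graph is what makes the adjacency matrix features × features; the BioNeuralNet helpers may differ in distance metric, edge weighting, and how they pad or filter features."
   ]
  },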
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "b4646135",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Intersection of samples:\n",
      "RNA: 769\n",
      "METH: 769\n",
      "miRNA: 769\n",
      "Clinical: 769\n"
     ]
    },
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "2025-05-16 10:31:12,677 - bioneuralnet.utils.preprocess - INFO - [Inf]: Replaced 0 infinite values\n",
      "2025-05-16 10:31:12,678 - bioneuralnet.utils.preprocess - INFO - [NaN]: Replaced 0 NaNs after median imputation\n",
      "2025-05-16 10:31:12,678 - bioneuralnet.utils.preprocess - INFO - [Zero-Var]: 0 columns dropped due to zero variance\n",
      "2025-05-16 10:31:12,835 - bioneuralnet.utils.preprocess - INFO - Selected 1000 features by ANOVA (task=classification), 17514 significant, 0 padded\n",
      "2025-05-16 10:31:15,470 - bioneuralnet.utils.preprocess - INFO - [Inf]: Replaced 0 infinite values\n",
      "2025-05-16 10:31:15,471 - bioneuralnet.utils.preprocess - INFO - [NaN]: Replaced 0 NaNs after median imputation\n",
      "2025-05-16 10:31:15,471 - bioneuralnet.utils.preprocess - INFO - [Zero-Var]: 0 columns dropped due to zero variance\n",
      "2025-05-16 10:31:15,635 - bioneuralnet.utils.preprocess - INFO - Selected 1000 features by ANOVA (task=classification), 16864 significant, 0 padded\n",
      "2025-05-16 10:31:15,714 - bioneuralnet.utils.preprocess - INFO - [Inf]: Replaced 0 infinite values\n",
      "2025-05-16 10:31:15,715 - bioneuralnet.utils.preprocess - INFO - [NaN]: Replaced 0 NaNs after median imputation\n",
      "2025-05-16 10:31:15,715 - bioneuralnet.utils.preprocess - INFO - [Zero-Var]: 0 columns dropped due to zero variance\n",
      "2025-05-16 10:31:15,718 - bioneuralnet.utils.preprocess - INFO - Selected 503 features by ANOVA (task=classification), 465 significant, 38 padded\n"
     ]
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Nan values in X_train_full: 0\n",
      "Nan value in X_train_full after dropping: 0\n",
      "X_train_full shape: (769, 2503)\n",
      "Network shape: (2503, 2503)\n"
     ]
    }
   ],
   "source": [
    "from sklearn.metrics import accuracy_score, f1_score\n",
    "from bioneuralnet.utils.preprocess import top_anova_f_features\n",
    "from bioneuralnet.utils.graph import gen_similarity_graph\n",
    "\n",
    "#count intersection of samples\n",
    "print(\"Intersection of samples:\")\n",
    "print(f\"RNA: {len(set(X_rna.index) & set(pam50.index))}\")\n",
    "print(f\"METH: {len(set(X_meth.index) & set(pam50.index))}\")\n",
    "print(f\"miRNA: {len(set(X_mirna.index) & set(pam50.index))}\")\n",
    "print(f\"Clinical: {len(set(clinical.index) & set(pam50.index))}\")\n",
    "\n",
    "meth_sel = top_anova_f_features(X_meth, brca_pam50, max_features=1000, task=\"classification\")\n",
    "rna_sel = top_anova_f_features(X_rna, brca_pam50 ,max_features=1000, task=\"classification\")\n",
    "mirna_sel = top_anova_f_features(X_mirna, brca_pam50,max_features=503, task=\"classification\")\n",
    "X_train_full = pd.concat([meth_sel, rna_sel, mirna_sel], axis=1)\n",
    "\n",
    "#count nan values\n",
    "print(f\"Nan values in X_train_full: {X_train_full.isna().sum().sum()}\")\n",
    "\n",
    "#drop nan values\n",
    "X_train_full = X_train_full.dropna()\n",
    "\n",
    "#check if there are any nan values\n",
    "print(f\"Nan value in X_train_full after dropping: {X_train_full.isna().sum().sum()}\")\n",
    "\n",
    "print(f\"X_train_full shape: {X_train_full.shape}\")\n",
    "A_train = gen_similarity_graph(X_train_full.T, k=15)\n",
    "\n",
    "print(f\"Network shape: {A_train.shape}\")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "423f2808",
   "metadata": {},
   "outputs": [
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "2025-05-16 15:22:49,397 - bioneuralnet.downstream_task.dpmon - INFO - Output directory set to: /home/vicente/Github/BioNeuralNet/TCGA_BRCA/results/results1\n",
      "2025-05-16 15:22:49,397 - bioneuralnet.downstream_task.dpmon - INFO - Initialized DPMON with the provided parameters.\n",
      "2025-05-16 15:22:49,397 - bioneuralnet.downstream_task.dpmon - INFO - Starting DPMON run.\n",
      "2025-05-16 15:22:49,411 - bioneuralnet.downstream_task.dpmon - INFO - Running hyperparameter tuning for DPMON.\n",
      "2025-05-16 15:22:49,412 - bioneuralnet.downstream_task.dpmon - INFO - Using GPU 0\n",
      "2025-05-16 15:22:49,412 - bioneuralnet.downstream_task.dpmon - INFO - Slicing omics dataset based on network nodes.\n",
      "2025-05-16 15:22:49,415 - bioneuralnet.downstream_task.dpmon - INFO - Building PyTorch Geometric Data object from adjacency matrix.\n",
      "2025-05-16 15:22:49,487 - bioneuralnet.downstream_task.dpmon - INFO - Number of nodes in network: 2503\n",
      "2025-05-16 15:22:49,487 - bioneuralnet.downstream_task.dpmon - INFO - Using clinical vars for node features: ['days_to_birth', 'age_at_diagnosis', 'days_to_last_followup', 'age_at_index', 'years_to_birth', 'year_of_diagnosis', 'number_of_lymph_nodes', 'date_of_initial_pathologic_diagnosis', 'histological_type_infiltrating lobular carcinoma', 'primary_diagnosis_Lobular carcinoma, NOS', 'morphology_8520/3', 'race.1_white', 'days_to_death.1', 'laterality_Right', 'primary_diagnosis_Infiltrating duct carcinoma, NOS', 'country_of_residence_at_enrollment_United States', 'sites_of_involvement_Breast, NOS', 'days_to_death', 'race_white', 'ajcc_staging_system_edition_6th']\n",
      "2025-05-16 15:22:53,816 - bioneuralnet.downstream_task.dpmon - INFO - Starting hyperparameter tuning for dataset shape: (769, 2504)\n",
      "2025-05-16 15:22:53,817\tINFO tune.py:616 -- [output] This uses the legacy output and progress reporter, as Jupyter notebooks are not supported by the new engine, yet. For more information, please see https://github.com/ray-project/ray/issues/36949\n",
      "2025-05-16 15:23:37,056\tINFO tune.py:1009 -- Wrote the latest version of all result files and experiment state to '/home/vicente/ray_results/tune_dp' in 0.0110s.\n",
      "2025-05-16 15:23:37,074 - bioneuralnet.downstream_task.dpmon - INFO - Best trial config: {'gnn_layer_num': 2, 'gnn_hidden_dim': 128, 'lr': 0.005072810911305633, 'weight_decay': 0.00396813275339707, 'nn_hidden_dim1': 4, 'nn_hidden_dim2': 4, 'num_epochs': 512}\n",
      "2025-05-16 15:23:37,075 - bioneuralnet.downstream_task.dpmon - INFO - Best trial final loss: 1.1080904006958008\n",
      "2025-05-16 15:23:37,075 - bioneuralnet.downstream_task.dpmon - INFO - Best trial final accuracy: 0.8179453836150845\n",
      "2025-05-16 15:23:37,077 - bioneuralnet.downstream_task.dpmon - INFO -    gnn_layer_num  gnn_hidden_dim        lr  weight_decay  nn_hidden_dim1  \\\n",
      "0              2             128  0.005073      0.003968               4   \n",
      "\n",
      "   nn_hidden_dim2  num_epochs  \n",
      "0               4         512  \n",
      "2025-05-16 15:23:37,079 - bioneuralnet.downstream_task.dpmon - INFO - Best tuned parameters: {'gnn_layer_num': 2, 'gnn_hidden_dim': 128, 'lr': 0.005072810911305633, 'weight_decay': 0.00396813275339707, 'nn_hidden_dim1': 4, 'nn_hidden_dim2': 4, 'num_epochs': 512}\n",
      "2025-05-16 15:23:37,079 - bioneuralnet.downstream_task.dpmon - INFO - Best tuned parameters: {'gnn_layer_num': 2, 'gnn_hidden_dim': 128, 'lr': 0.005072810911305633, 'weight_decay': 0.00396813275339707, 'nn_hidden_dim1': 4, 'nn_hidden_dim2': 4, 'num_epochs': 512}\n",
      "2025-05-16 15:23:37,079 - bioneuralnet.downstream_task.dpmon - INFO - Running standard training with tuned parameters.\n",
      "2025-05-16 15:23:37,079 - bioneuralnet.downstream_task.dpmon - INFO - Using GPU 0\n",
      "2025-05-16 15:23:37,080 - bioneuralnet.downstream_task.dpmon - INFO - Slicing omics dataset based on network nodes.\n",
      "2025-05-16 15:23:37,083 - bioneuralnet.downstream_task.dpmon - INFO - Building PyTorch Geometric Data object from adjacency matrix.\n",
      "2025-05-16 15:23:37,159 - bioneuralnet.downstream_task.dpmon - INFO - Number of nodes in network: 2503\n",
      "2025-05-16 15:23:37,159 - bioneuralnet.downstream_task.dpmon - INFO - Using clinical vars for node features: ['days_to_birth', 'age_at_diagnosis', 'days_to_last_followup', 'age_at_index', 'years_to_birth', 'year_of_diagnosis', 'number_of_lymph_nodes', 'date_of_initial_pathologic_diagnosis', 'histological_type_infiltrating lobular carcinoma', 'primary_diagnosis_Lobular carcinoma, NOS', 'morphology_8520/3', 'race.1_white', 'days_to_death.1', 'laterality_Right', 'primary_diagnosis_Infiltrating duct carcinoma, NOS', 'country_of_residence_at_enrollment_United States', 'sites_of_involvement_Breast, NOS', 'days_to_death', 'race_white', 'ajcc_staging_system_edition_6th']\n",
      "2025-05-16 15:23:41,530 - bioneuralnet.downstream_task.dpmon - INFO - Training iteration 1/3\n",
      "2025-05-16 15:23:41,563 - bioneuralnet.downstream_task.dpmon - INFO - Epoch [1/512], Loss: 1.6182\n",
      "2025-05-16 15:23:41,650 - bioneuralnet.downstream_task.dpmon - INFO - Epoch [10/512], Loss: 1.5531\n",
      "2025-05-16 15:23:41,745 - bioneuralnet.downstream_task.dpmon - INFO - Epoch [20/512], Loss: 1.5067\n",
      "2025-05-16 15:23:41,840 - bioneuralnet.downstream_task.dpmon - INFO - Epoch [30/512], Loss: 1.4441\n",
      "2025-05-16 15:23:41,939 - bioneuralnet.downstream_task.dpmon - INFO - Epoch [40/512], Loss: 1.3666\n",
      "2025-05-16 15:23:42,033 - bioneuralnet.downstream_task.dpmon - INFO - Epoch [50/512], Loss: 1.2763\n",
      "2025-05-16 15:23:42,128 - bioneuralnet.downstream_task.dpmon - INFO - Epoch [60/512], Loss: 1.2001\n",
      "2025-05-16 15:23:42,224 - bioneuralnet.downstream_task.dpmon - INFO - Epoch [70/512], Loss: 1.1394\n",
      "2025-05-16 15:23:42,317 - bioneuralnet.downstream_task.dpmon - INFO - Epoch [80/512], Loss: 1.0970\n",
      "2025-05-16 15:23:42,412 - bioneuralnet.downstream_task.dpmon - INFO - Epoch [90/512], Loss: 1.0614\n",
      "2025-05-16 15:23:42,505 - bioneuralnet.downstream_task.dpmon - INFO - Epoch [100/512], Loss: 1.0399\n",
      "2025-05-16 15:23:42,599 - bioneuralnet.downstream_task.dpmon - INFO - Epoch [110/512], Loss: 1.0265\n",
      "2025-05-16 15:23:42,693 - bioneuralnet.downstream_task.dpmon - INFO - Epoch [120/512], Loss: 1.1995\n",
      "2025-05-16 15:23:42,787 - bioneuralnet.downstream_task.dpmon - INFO - Epoch [130/512], Loss: 1.0792\n",
      "2025-05-16 15:23:42,881 - bioneuralnet.downstream_task.dpmon - INFO - Epoch [140/512], Loss: 1.0485\n",
      "2025-05-16 15:23:42,976 - bioneuralnet.downstream_task.dpmon - INFO - Epoch [150/512], Loss: 1.0294\n",
      "2025-05-16 15:23:43,077 - bioneuralnet.downstream_task.dpmon - INFO - Epoch [160/512], Loss: 1.0155\n",
      "2025-05-16 15:23:43,171 - bioneuralnet.downstream_task.dpmon - INFO - Epoch [170/512], Loss: 1.0073\n",
      "2025-05-16 15:23:43,266 - bioneuralnet.downstream_task.dpmon - INFO - Epoch [180/512], Loss: 1.0014\n",
      "2025-05-16 15:23:43,361 - bioneuralnet.downstream_task.dpmon - INFO - Epoch [190/512], Loss: 0.9978\n",
      "2025-05-16 15:23:43,455 - bioneuralnet.downstream_task.dpmon - INFO - Epoch [200/512], Loss: 0.9956\n",
      "2025-05-16 15:23:43,549 - bioneuralnet.downstream_task.dpmon - INFO - Epoch [210/512], Loss: 0.9942\n",
      "2025-05-16 15:23:43,643 - bioneuralnet.downstream_task.dpmon - INFO - Epoch [220/512], Loss: 0.9932\n",
      "2025-05-16 15:23:43,737 - bioneuralnet.downstream_task.dpmon - INFO - Epoch [230/512], Loss: 1.0023\n",
      "2025-05-16 15:23:43,831 - bioneuralnet.downstream_task.dpmon - INFO - Epoch [240/512], Loss: 0.9951\n",
      "2025-05-16 15:23:43,925 - bioneuralnet.downstream_task.dpmon - INFO - Epoch [250/512], Loss: 0.9916\n",
      "2025-05-16 15:23:44,019 - bioneuralnet.downstream_task.dpmon - INFO - Epoch [260/512], Loss: 0.9900\n",
      "2025-05-16 15:23:44,113 - bioneuralnet.downstream_task.dpmon - INFO - Epoch [270/512], Loss: 0.9888\n",
      "2025-05-16 15:23:44,207 - bioneuralnet.downstream_task.dpmon - INFO - Epoch [280/512], Loss: 0.9878\n",
      "2025-05-16 15:23:44,302 - bioneuralnet.downstream_task.dpmon - INFO - Epoch [290/512], Loss: 0.9872\n",
      "2025-05-16 15:23:44,396 - bioneuralnet.downstream_task.dpmon - INFO - Epoch [300/512], Loss: 1.0332\n",
      "2025-05-16 15:23:44,490 - bioneuralnet.downstream_task.dpmon - INFO - Epoch [310/512], Loss: 1.0558\n",
      "2025-05-16 15:23:44,584 - bioneuralnet.downstream_task.dpmon - INFO - Epoch [320/512], Loss: 1.0264\n",
      "2025-05-16 15:23:44,678 - bioneuralnet.downstream_task.dpmon - INFO - Epoch [330/512], Loss: 1.0068\n",
      "2025-05-16 15:23:44,772 - bioneuralnet.downstream_task.dpmon - INFO - Epoch [340/512], Loss: 0.9968\n",
      "2025-05-16 15:23:44,866 - bioneuralnet.downstream_task.dpmon - INFO - Epoch [350/512], Loss: 0.9917\n",
      "2025-05-16 15:23:44,963 - bioneuralnet.downstream_task.dpmon - INFO - Epoch [360/512], Loss: 0.9890\n",
      "2025-05-16 15:23:45,059 - bioneuralnet.downstream_task.dpmon - INFO - Epoch [370/512], Loss: 1.0214\n",
      "2025-05-16 15:23:45,153 - bioneuralnet.downstream_task.dpmon - INFO - Epoch [380/512], Loss: 1.0091\n",
      "2025-05-16 15:23:45,247 - bioneuralnet.downstream_task.dpmon - INFO - Epoch [390/512], Loss: 0.9997\n",
      "2025-05-16 15:23:45,341 - bioneuralnet.downstream_task.dpmon - INFO - Epoch [400/512], Loss: 0.9948\n",
      "2025-05-16 15:23:45,436 - bioneuralnet.downstream_task.dpmon - INFO - Epoch [410/512], Loss: 0.9907\n",
      "2025-05-16 15:23:45,530 - bioneuralnet.downstream_task.dpmon - INFO - Epoch [420/512], Loss: 0.9878\n",
      "2025-05-16 15:23:45,624 - bioneuralnet.downstream_task.dpmon - INFO - Epoch [430/512], Loss: 0.9851\n",
      "2025-05-16 15:23:45,717 - bioneuralnet.downstream_task.dpmon - INFO - Epoch [440/512], Loss: 0.9828\n",
      "2025-05-16 15:23:45,815 - bioneuralnet.downstream_task.dpmon - INFO - Epoch [450/512], Loss: 0.9802\n",
      "2025-05-16 15:23:45,909 - bioneuralnet.downstream_task.dpmon - INFO - Epoch [460/512], Loss: 0.9773\n",
      "2025-05-16 15:23:46,003 - bioneuralnet.downstream_task.dpmon - INFO - Epoch [470/512], Loss: 0.9736\n",
      "2025-05-16 15:23:46,097 - bioneuralnet.downstream_task.dpmon - INFO - Epoch [480/512], Loss: 0.9691\n",
      "2025-05-16 15:23:46,192 - bioneuralnet.downstream_task.dpmon - INFO - Epoch [490/512], Loss: 0.9645\n",
      "2025-05-16 15:23:46,287 - bioneuralnet.downstream_task.dpmon - INFO - Epoch [500/512], Loss: 0.9643\n",
      "2025-05-16 15:23:46,384 - bioneuralnet.downstream_task.dpmon - INFO - Epoch [510/512], Loss: 1.0325\n",
      "2025-05-16 15:23:46,407 - bioneuralnet.downstream_task.dpmon - INFO - Training Accuracy: 0.6658\n",
      "2025-05-16 15:23:46,409 - bioneuralnet.downstream_task.dpmon - INFO - Model saved to /home/vicente/Github/BioNeuralNet/TCGA_BRCA/results/results1/dpm_model_iter_1.pth\n",
      "2025-05-16 15:23:46,413 - bioneuralnet.downstream_task.dpmon - INFO - Training iteration 2/3\n",
      "2025-05-16 15:23:46,435 - bioneuralnet.downstream_task.dpmon - INFO - Epoch [1/512], Loss: 1.6234\n",
      "2025-05-16 15:23:46,531 - bioneuralnet.downstream_task.dpmon - INFO - Epoch [10/512], Loss: 1.5598\n",
      "2025-05-16 15:23:46,626 - bioneuralnet.downstream_task.dpmon - INFO - Epoch [20/512], Loss: 1.5203\n",
      "2025-05-16 15:23:46,720 - bioneuralnet.downstream_task.dpmon - INFO - Epoch [30/512], Loss: 1.4727\n",
      "2025-05-16 15:23:46,814 - bioneuralnet.downstream_task.dpmon - INFO - Epoch [40/512], Loss: 1.4078\n",
      "2025-05-16 15:23:46,908 - bioneuralnet.downstream_task.dpmon - INFO - Epoch [50/512], Loss: 1.3256\n",
      "2025-05-16 15:23:47,002 - bioneuralnet.downstream_task.dpmon - INFO - Epoch [60/512], Loss: 1.2342\n",
      "2025-05-16 15:23:47,096 - bioneuralnet.downstream_task.dpmon - INFO - Epoch [70/512], Loss: 1.1537\n",
      "2025-05-16 15:23:47,190 - bioneuralnet.downstream_task.dpmon - INFO - Epoch [80/512], Loss: 1.1014\n",
      "2025-05-16 15:23:47,284 - bioneuralnet.downstream_task.dpmon - INFO - Epoch [90/512], Loss: 1.0631\n",
      "2025-05-16 15:23:47,378 - bioneuralnet.downstream_task.dpmon - INFO - Epoch [100/512], Loss: 1.0422\n",
      "2025-05-16 15:23:47,472 - bioneuralnet.downstream_task.dpmon - INFO - Epoch [110/512], Loss: 1.0257\n",
      "2025-05-16 15:23:47,565 - bioneuralnet.downstream_task.dpmon - INFO - Epoch [120/512], Loss: 1.0252\n",
      "2025-05-16 15:23:47,659 - bioneuralnet.downstream_task.dpmon - INFO - Epoch [130/512], Loss: 1.0136\n",
      "2025-05-16 15:23:47,757 - bioneuralnet.downstream_task.dpmon - INFO - Epoch [140/512], Loss: 1.0068\n",
      "2025-05-16 15:23:47,852 - bioneuralnet.downstream_task.dpmon - INFO - Epoch [150/512], Loss: 1.0016\n",
      "2025-05-16 15:23:47,946 - bioneuralnet.downstream_task.dpmon - INFO - Epoch [160/512], Loss: 0.9965\n",
      "2025-05-16 15:23:48,040 - bioneuralnet.downstream_task.dpmon - INFO - Epoch [170/512], Loss: 1.0520\n",
      "2025-05-16 15:23:48,134 - bioneuralnet.downstream_task.dpmon - INFO - Epoch [180/512], Loss: 1.0251\n",
      "2025-05-16 15:23:48,229 - bioneuralnet.downstream_task.dpmon - INFO - Epoch [190/512], Loss: 1.0158\n",
      "2025-05-16 15:23:48,322 - bioneuralnet.downstream_task.dpmon - INFO - Epoch [200/512], Loss: 1.0110\n",
      "2025-05-16 15:23:48,416 - bioneuralnet.downstream_task.dpmon - INFO - Epoch [210/512], Loss: 1.0074\n",
      "2025-05-16 15:23:48,523 - bioneuralnet.downstream_task.dpmon - INFO - Epoch [220/512], Loss: 1.0034\n",
      "2025-05-16 15:23:48,618 - bioneuralnet.downstream_task.dpmon - INFO - Epoch [230/512], Loss: 1.0010\n",
      "2025-05-16 15:23:48,712 - bioneuralnet.downstream_task.dpmon - INFO - Epoch [240/512], Loss: 0.9988\n",
      "2025-05-16 15:23:48,806 - bioneuralnet.downstream_task.dpmon - INFO - Epoch [250/512], Loss: 0.9974\n",
      "2025-05-16 15:23:48,900 - bioneuralnet.downstream_task.dpmon - INFO - Epoch [260/512], Loss: 0.9960\n",
      "2025-05-16 15:23:48,994 - bioneuralnet.downstream_task.dpmon - INFO - Epoch [270/512], Loss: 0.9946\n",
      "2025-05-16 15:23:49,089 - bioneuralnet.downstream_task.dpmon - INFO - Epoch [280/512], Loss: 0.9932\n",
      "2025-05-16 15:23:49,183 - bioneuralnet.downstream_task.dpmon - INFO - Epoch [290/512], Loss: 0.9924\n",
      "2025-05-16 15:23:49,277 - bioneuralnet.downstream_task.dpmon - INFO - Epoch [300/512], Loss: 0.9905\n",
      "2025-05-16 15:23:49,373 - bioneuralnet.downstream_task.dpmon - INFO - Epoch [310/512], Loss: 0.9880\n",
      "2025-05-16 15:23:49,467 - bioneuralnet.downstream_task.dpmon - INFO - Epoch [320/512], Loss: 0.9850\n",
      "2025-05-16 15:23:49,561 - bioneuralnet.downstream_task.dpmon - INFO - Epoch [330/512], Loss: 0.9789\n",
      "2025-05-16 15:23:49,656 - bioneuralnet.downstream_task.dpmon - INFO - Epoch [340/512], Loss: 0.9995\n",
      "2025-05-16 15:23:49,750 - bioneuralnet.downstream_task.dpmon - INFO - Epoch [350/512], Loss: 0.9807\n",
      "2025-05-16 15:23:49,844 - bioneuralnet.downstream_task.dpmon - INFO - Epoch [360/512], Loss: 0.9594\n",
      "2025-05-16 15:23:49,938 - bioneuralnet.downstream_task.dpmon - INFO - Epoch [370/512], Loss: 0.9547\n",
      "2025-05-16 15:23:50,031 - bioneuralnet.downstream_task.dpmon - INFO - Epoch [380/512], Loss: 0.9526\n",
      "2025-05-16 15:23:50,126 - bioneuralnet.downstream_task.dpmon - INFO - Epoch [390/512], Loss: 0.9514\n",
      "2025-05-16 15:23:50,220 - bioneuralnet.downstream_task.dpmon - INFO - Epoch [400/512], Loss: 0.9506\n",
      "2025-05-16 15:23:50,314 - bioneuralnet.downstream_task.dpmon - INFO - Epoch [410/512], Loss: 0.9505\n",
      "2025-05-16 15:23:50,417 - bioneuralnet.downstream_task.dpmon - INFO - Epoch [420/512], Loss: 0.9510\n",
      "2025-05-16 15:23:50,511 - bioneuralnet.downstream_task.dpmon - INFO - Epoch [430/512], Loss: 0.9494\n",
      "2025-05-16 15:23:50,605 - bioneuralnet.downstream_task.dpmon - INFO - Epoch [440/512], Loss: 0.9485\n",
      "2025-05-16 15:23:50,699 - bioneuralnet.downstream_task.dpmon - INFO - Epoch [450/512], Loss: 0.9477\n",
      "2025-05-16 15:23:50,793 - bioneuralnet.downstream_task.dpmon - INFO - Epoch [460/512], Loss: 0.9473\n",
      "2025-05-16 15:23:50,887 - bioneuralnet.downstream_task.dpmon - INFO - Epoch [470/512], Loss: 0.9470\n",
      "2025-05-16 15:23:50,981 - bioneuralnet.downstream_task.dpmon - INFO - Epoch [480/512], Loss: 0.9455\n",
      "2025-05-16 15:23:51,074 - bioneuralnet.downstream_task.dpmon - INFO - Epoch [490/512], Loss: 0.9464\n",
      "2025-05-16 15:23:51,168 - bioneuralnet.downstream_task.dpmon - INFO - Epoch [500/512], Loss: 0.9468\n",
      "2025-05-16 15:23:51,262 - bioneuralnet.downstream_task.dpmon - INFO - Epoch [510/512], Loss: 0.9451\n",
      "2025-05-16 15:23:51,285 - bioneuralnet.downstream_task.dpmon - INFO - Training Accuracy: 0.9545\n",
      "2025-05-16 15:23:51,287 - bioneuralnet.downstream_task.dpmon - INFO - Model saved to /home/vicente/Github/BioNeuralNet/TCGA_BRCA/results/results1/dpm_model_iter_2.pth\n",
      "2025-05-16 15:23:51,291 - bioneuralnet.downstream_task.dpmon - INFO - Training iteration 3/3\n",
      "2025-05-16 15:23:51,309 - bioneuralnet.downstream_task.dpmon - INFO - Epoch [1/512], Loss: 1.6141\n",
      "2025-05-16 15:23:51,394 - bioneuralnet.downstream_task.dpmon - INFO - Epoch [10/512], Loss: 1.5460\n",
      "2025-05-16 15:23:51,488 - bioneuralnet.downstream_task.dpmon - INFO - Epoch [20/512], Loss: 1.5051\n",
      "2025-05-16 15:23:51,582 - bioneuralnet.downstream_task.dpmon - INFO - Epoch [30/512], Loss: 1.4469\n",
      "2025-05-16 15:23:51,676 - bioneuralnet.downstream_task.dpmon - INFO - Epoch [40/512], Loss: 1.3636\n",
      "2025-05-16 15:23:51,771 - bioneuralnet.downstream_task.dpmon - INFO - Epoch [50/512], Loss: 1.2702\n",
      "2025-05-16 15:23:51,865 - bioneuralnet.downstream_task.dpmon - INFO - Epoch [60/512], Loss: 1.1908\n",
      "2025-05-16 15:23:51,963 - bioneuralnet.downstream_task.dpmon - INFO - Epoch [70/512], Loss: 1.1343\n",
      "2025-05-16 15:23:52,057 - bioneuralnet.downstream_task.dpmon - INFO - Epoch [80/512], Loss: 1.1024\n",
      "2025-05-16 15:23:52,152 - bioneuralnet.downstream_task.dpmon - INFO - Epoch [90/512], Loss: 1.0812\n",
      "2025-05-16 15:23:52,247 - bioneuralnet.downstream_task.dpmon - INFO - Epoch [100/512], Loss: 1.1784\n",
      "2025-05-16 15:23:52,343 - bioneuralnet.downstream_task.dpmon - INFO - Epoch [110/512], Loss: 1.1234\n",
      "2025-05-16 15:23:52,437 - bioneuralnet.downstream_task.dpmon - INFO - Epoch [120/512], Loss: 1.1016\n",
      "2025-05-16 15:23:52,531 - bioneuralnet.downstream_task.dpmon - INFO - Epoch [130/512], Loss: 1.0903\n",
      "2025-05-16 15:23:52,625 - bioneuralnet.downstream_task.dpmon - INFO - Epoch [140/512], Loss: 1.0835\n",
      "2025-05-16 15:23:52,724 - bioneuralnet.downstream_task.dpmon - INFO - Epoch [150/512], Loss: 1.0627\n",
      "2025-05-16 15:23:52,818 - bioneuralnet.downstream_task.dpmon - INFO - Epoch [160/512], Loss: 1.0492\n",
      "2025-05-16 15:23:52,911 - bioneuralnet.downstream_task.dpmon - INFO - Epoch [170/512], Loss: 1.0344\n",
      "2025-05-16 15:23:53,005 - bioneuralnet.downstream_task.dpmon - INFO - Epoch [180/512], Loss: 1.0256\n",
      "2025-05-16 15:23:53,099 - bioneuralnet.downstream_task.dpmon - INFO - Epoch [190/512], Loss: 1.0206\n",
      "2025-05-16 15:23:53,193 - bioneuralnet.downstream_task.dpmon - INFO - Epoch [200/512], Loss: 1.0164\n",
      "2025-05-16 15:23:53,287 - bioneuralnet.downstream_task.dpmon - INFO - Epoch [210/512], Loss: 1.0128\n",
      "2025-05-16 15:23:53,383 - bioneuralnet.downstream_task.dpmon - INFO - Epoch [220/512], Loss: 1.0165\n",
      "2025-05-16 15:23:53,480 - bioneuralnet.downstream_task.dpmon - INFO - Epoch [230/512], Loss: 1.0111\n",
      "2025-05-16 15:23:53,574 - bioneuralnet.downstream_task.dpmon - INFO - Epoch [240/512], Loss: 1.0075\n",
      "2025-05-16 15:23:53,670 - bioneuralnet.downstream_task.dpmon - INFO - Epoch [250/512], Loss: 1.0027\n",
      "2025-05-16 15:23:53,764 - bioneuralnet.downstream_task.dpmon - INFO - Epoch [260/512], Loss: 0.9996\n",
      "2025-05-16 15:23:53,868 - bioneuralnet.downstream_task.dpmon - INFO - Epoch [270/512], Loss: 0.9979\n",
      "2025-05-16 15:23:53,962 - bioneuralnet.downstream_task.dpmon - INFO - Epoch [280/512], Loss: 1.0081\n",
      "2025-05-16 15:23:54,056 - bioneuralnet.downstream_task.dpmon - INFO - Epoch [290/512], Loss: 1.0096\n",
      "2025-05-16 15:23:54,150 - bioneuralnet.downstream_task.dpmon - INFO - Epoch [300/512], Loss: 1.0006\n",
      "2025-05-16 15:23:54,243 - bioneuralnet.downstream_task.dpmon - INFO - Epoch [310/512], Loss: 0.9946\n",
      "2025-05-16 15:23:54,337 - bioneuralnet.downstream_task.dpmon - INFO - Epoch [320/512], Loss: 0.9909\n",
      "2025-05-16 15:23:54,431 - bioneuralnet.downstream_task.dpmon - INFO - Epoch [330/512], Loss: 0.9882\n",
      "2025-05-16 15:23:54,525 - bioneuralnet.downstream_task.dpmon - INFO - Epoch [340/512], Loss: 0.9877\n",
      "2025-05-16 15:23:54,619 - bioneuralnet.downstream_task.dpmon - INFO - Epoch [350/512], Loss: 0.9849\n",
      "2025-05-16 15:23:54,712 - bioneuralnet.downstream_task.dpmon - INFO - Epoch [360/512], Loss: 0.9833\n",
      "2025-05-16 15:23:54,806 - bioneuralnet.downstream_task.dpmon - INFO - Epoch [370/512], Loss: 0.9818\n",
      "2025-05-16 15:23:54,900 - bioneuralnet.downstream_task.dpmon - INFO - Epoch [380/512], Loss: 0.9807\n",
      "2025-05-16 15:23:54,993 - bioneuralnet.downstream_task.dpmon - INFO - Epoch [390/512], Loss: 1.0491\n",
      "2025-05-16 15:23:55,087 - bioneuralnet.downstream_task.dpmon - INFO - Epoch [400/512], Loss: 0.9906\n",
      "2025-05-16 15:23:55,181 - bioneuralnet.downstream_task.dpmon - INFO - Epoch [410/512], Loss: 0.9831\n",
      "2025-05-16 15:23:55,275 - bioneuralnet.downstream_task.dpmon - INFO - Epoch [420/512], Loss: 0.9797\n",
      "2025-05-16 15:23:55,370 - bioneuralnet.downstream_task.dpmon - INFO - Epoch [430/512], Loss: 0.9777\n",
      "2025-05-16 15:23:55,464 - bioneuralnet.downstream_task.dpmon - INFO - Epoch [440/512], Loss: 0.9759\n",
      "2025-05-16 15:23:55,558 - bioneuralnet.downstream_task.dpmon - INFO - Epoch [450/512], Loss: 0.9749\n",
      "2025-05-16 15:23:55,652 - bioneuralnet.downstream_task.dpmon - INFO - Epoch [460/512], Loss: 0.9742\n",
      "2025-05-16 15:23:55,747 - bioneuralnet.downstream_task.dpmon - INFO - Epoch [470/512], Loss: 0.9737\n",
      "2025-05-16 15:23:55,851 - bioneuralnet.downstream_task.dpmon - INFO - Epoch [480/512], Loss: 0.9738\n",
      "2025-05-16 15:23:55,944 - bioneuralnet.downstream_task.dpmon - INFO - Epoch [490/512], Loss: 0.9735\n",
      "2025-05-16 15:23:56,038 - bioneuralnet.downstream_task.dpmon - INFO - Epoch [500/512], Loss: 0.9727\n",
      "2025-05-16 15:23:56,132 - bioneuralnet.downstream_task.dpmon - INFO - Epoch [510/512], Loss: 0.9720\n",
      "2025-05-16 15:23:56,155 - bioneuralnet.downstream_task.dpmon - INFO - Training Accuracy: 0.9558\n",
      "2025-05-16 15:23:56,157 - bioneuralnet.downstream_task.dpmon - INFO - Model saved to /home/vicente/Github/BioNeuralNet/TCGA_BRCA/results/results1/dpm_model_iter_3.pth\n",
      "2025-05-16 15:23:56,161 - bioneuralnet.downstream_task.dpmon - INFO - Best Accuracy: 0.9558\n",
      "2025-05-16 15:23:56,162 - bioneuralnet.downstream_task.dpmon - INFO - Average Accuracy: 0.8587\n",
      "2025-05-16 15:23:56,162 - bioneuralnet.downstream_task.dpmon - INFO - Standard Deviation of Accuracy: 0.1670\n"
     ]
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "DPMON results:\n",
      "Accuracy: 0.9557867360208062\n",
      "F1 weighted: 0.9360974742812752\n",
      "F1 macro: 0.7772360237077294\n"
     ]
    }
   ],
   "source": [
    "from bioneuralnet.downstream_task import DPMON\n",
    "\n",
    "save = Path(\"/home/vicente/Github/BioNeuralNet/TCGA_BRCA/results\")\n",
    "brca_pam50 = brca_pam50.rename(columns={\"pam50\": \"phenotype\"})\n",
    "\n",
    "dpmon = DPMON(\n",
    "    adjacency_matrix=A_train,\n",
    "    omics_list=[meth_sel, rna_sel, mirna_sel],\n",
    "    phenotype_data=brca_pam50,\n",
    "    clinical_data=clinical,\n",
    "    repeat_num=3,\n",
    "    tune=True, gpu=True, cuda=0,\n",
    "    output_dir=Path(save/\"results1\"),\n",
    ")\n",
    "\n",
    "predictions_df, avg_accuracy = dpmon.run()\n",
    "actual = predictions_df[\"Actual\"]\n",
    "pred = predictions_df[\"Predicted\"]\n",
    "dp_acc = (accuracy_score(actual, pred), 0)\n",
    "dp_f1w = (f1_score(actual, pred, average='weighted'), 0)\n",
    "dp_f1m = (f1_score(actual, pred, average='macro'), 0)\n",
    "\n",
    "print(f\"DPMON results:\")\n",
    "print(f\"Accuracy: {dp_acc[0]}\")\n",
    "print(f\"F1 weighted: {dp_f1w[0]}\")\n",
    "print(f\"F1 macro: {dp_f1m[0]}\")"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": ".testing",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.12.3"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 5
}
