Metadata-Version: 2.1
Name: sgt
Version: 2.0.3
Summary: Sequence Graph Transform (SGT) is a sequence embedding function. SGT extracts the short- and long-term sequence features and embeds them in a finite-dimensional feature space. The long and short term patterns embedded in SGT can be tuned without any increase in the computation.
Home-page: https://github.com/cran2367/sgt
Author: Chitta Ranjan
Author-email: cran2367@gmail.com
License: UNKNOWN
Platform: UNKNOWN
Classifier: Programming Language :: Python :: 3
Classifier: License :: OSI Approved :: MIT License
Classifier: Operating System :: OS Independent
Description-Content-Type: text/markdown

# Sequence Graph Transform (SGT) &mdash; Sequence Embedding for Clustering, Classification, and Search

#### Maintained by: Chitta Ranjan 
Email: <cran2367@gmail.com>
| LinkedIn: [https://www.linkedin.com/in/chitta-ranjan-b0851911/](https://www.linkedin.com/in/chitta-ranjan-b0851911/)


The following will cover,

1. [SGT Class Definition](#sgt-class-def)
2. [Installation](#install-sgt)
3. [Test Examples](#installation-test-examples)
4. [Sequence Clustering Example](#sequence-clustering)
5. [Sequence Classification Example](#sequence-classification)
6. [Sequence Search Example](#sequence-search)
7. [SGT - Spark for Distributed Computing](#sgt-spark)


## <a name="sgt-class-def"></a> SGT Class Definition

Sequence Graph Transform (SGT) is a sequence embedding function. SGT extracts the short- and long-term sequence features and embeds them in a finite-dimensional feature space. The long and short term patterns embedded in SGT can be tuned without any increase in the computation."


```
class SGT():
    '''
    Compute embedding of a single or a collection of discrete item 
    sequences. A discrete item sequence is a sequence made from a set
    discrete elements, also known as alphabet set. For example,
    suppose the alphabet set is the set of roman letters, 
    {A, B, ..., Z}. This set is made of discrete elements. Examples of
    sequences from such a set are AABADDSA, UADSFJPFFFOIHOUGD, etc.
    Such sequence datasets are commonly found in online industry,
    for example, item purchase history, where the alphabet set is
    the set of all product items. Sequence datasets are abundant in
    bioinformatics as protein sequences.
    Using the embeddings created here, classification and clustering
    models can be built for sequence datasets.
    Read more in https://arxiv.org/pdf/1608.03533.pdf
    '''

    Parameters
    ----------
    Input:

    alphabets       Optional, except if mode is Spark. 
                    The set of alphabets that make up all 
                    the sequences in the dataset. If not passed, the
                    alphabet set is automatically computed as the 
                    unique set of elements that make all the sequences.
                    A list or 1d-array of the set of elements that make up the      
                    sequences. For example, np.array(["A", "B", "C"].
                    If mode is 'spark', the alphabets are necessary.

    kappa           Tuning parameter, kappa > 0, to change the extraction of 
                    long-term dependency. Higher the value the lesser
                    the long-term dependency captured in the embedding.
                    Typical values for kappa are 1, 5, 10.

    lengthsensitive Default false. This is set to true if the embedding of
                    should have the information of the length of the sequence.
                    If set to false then the embedding of two sequences with
                    similar pattern but different lengths will be the same.
                    lengthsensitive = false is similar to length-normalization.

    flatten         Default True. If True the SGT embedding is flattened and returned as
                    a vector. Otherwise, it is returned as a matrix with the row and col
                    names same as the alphabets. The matrix form is used for            
                    interpretation purposes. Especially, to understand how the alphabets
                    are "related". Otherwise, for applying machine learning or deep
                    learning algorithms, the embedding vectors are required.

    mode            Choices in {'default', 'multiprocessing'}. Note: 'multiprocessing' 
                    mode requires pandas==1.0.3+ and pandarallel libraries.

    processors      Used if mode is 'multiprocessing'. By default, the 
                    number of processors used in multiprocessing is
                    number of available - 1.
    '''


    Attributes
    ----------
    def fit(sequence)

    Extract Sequence Graph Transform features using Algorithm-2 in https://arxiv.org/abs/1608.03533.
    Input:
    sequence        An array of discrete elements. For example,
                    np.array(["B","B","A","C","A","C","A","A","B","A"].

    Output: 
    sgt embedding   sgt matrix or vector (depending on Flatten==False or True) of the sequence


    --
    def fit_transform(corpus)

    Extract SGT embeddings for all sequences in a corpus. It finds
    the alphabets encompassing all the sequences in the corpus, if not inputted. 
    However, if the mode is 'spark', then the alphabets list has to be
    explicitly given in Sgt object declaration.

    Input:
    corpus          A list of sequences. Each sequence is a list of alphabets.

    Output:
    sgt embedding of all sequences in the corpus.


    --
    def transform(corpus)

    Find SGT embeddings of a new data sample belonging to the same population
    of the corpus that was fitted initially.
```

## <a name="install-sgt"></a> Install SGT

Install SGT in Python by running,

```$ pip install sgt```


```python
import sgt
sgt.__version__
from sgt import SGT

```




    '2.0.0'





```python
# -*- coding: utf-8 -*-
# Authors: Chitta Ranjan <cran2367@gmail.com>
#
# License: BSD 3 clause
```


## <a name="installation-test-examples"></a> Installation Test Examples

In the following, there are a few test examples to verify the installation.


```python
# Learning a sgt embedding as a matrix with 
# rows and columns as the sequence alphabets. 
# This embedding shows the relationship between 
# the alphabets. The higher the value the 
# stronger the relationship.

sgt = SGT(flatten=False)
sequence = np.array(["B","B","A","C","A","C","A","A","B","A"])
sgt.fit(sequence)
```




<div>
<table border="1" class="dataframe">
  <thead>
    <tr style="text-align: right;">
      <th></th>
      <th>A</th>
      <th>B</th>
      <th>C</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <th>A</th>
      <td>0.090616</td>
      <td>0.131002</td>
      <td>0.261849</td>
    </tr>
    <tr>
      <th>B</th>
      <td>0.086569</td>
      <td>0.123042</td>
      <td>0.052544</td>
    </tr>
    <tr>
      <th>C</th>
      <td>0.137142</td>
      <td>0.028263</td>
      <td>0.135335</td>
    </tr>
  </tbody>
</table>
</div>




```python
# SGT embedding to a vector. The vector
# embedding is useful for directly applying
# a machine learning algorithm.

sgt = SGT(flatten=True)
sequence = np.array(["B","B","A","C","A","C","A","A","B","A"])
sgt.fit(sequence)
```




    (A, A)    0.090616
    (A, B)    0.131002
    (A, C)    0.261849
    (B, A)    0.086569
    (B, B)    0.123042
    (B, C)    0.052544
    (C, A)    0.137142
    (C, B)    0.028263
    (C, C)    0.135335
    dtype: float64




```python
'''
SGT embedding on a corpus of sequences.
Test the two processing modes within the
SGT class: 'default', 'multiprocessing'.

'''

# A sample corpus of two sequences.
corpus = pd.DataFrame([[1, ["B","B","A","C","A","C","A","A","B","A"]], 
                       [2, ["C", "Z", "Z", "Z", "D"]]], 
                      columns=['id', 'sequence'])
corpus
```




<div>
<table border="1" class="dataframe">
  <thead>
    <tr style="text-align: right;">
      <th></th>
      <th>id</th>
      <th>sequence</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <th>0</th>
      <td>1</td>
      <td>[B, B, A, C, A, C, A, A, B, A]</td>
    </tr>
    <tr>
      <th>1</th>
      <td>2</td>
      <td>[C, Z, Z, Z, D]</td>
    </tr>
  </tbody>
</table>
</div>




```python
# Learning the sgt embeddings as vector for
# all sequences in a corpus.
# mode: 'default'
sgt = SGT(kappa=1, 
          flatten=True, 
          lengthsensitive=False, 
          mode='default')
sgt.fit_transform(corpus)
```




<div>
<table border="1" class="dataframe">
  <thead>
    <tr style="text-align: right;">
      <th></th>
      <th>id</th>
      <th>(A, A)</th>
      <th>(A, B)</th>
      <th>(A, C)</th>
      <th>(A, D)</th>
      <th>(A, Z)</th>
      <th>(B, A)</th>
      <th>(B, B)</th>
      <th>(B, C)</th>
      <th>(B, D)</th>
      <th>...</th>
      <th>(D, A)</th>
      <th>(D, B)</th>
      <th>(D, C)</th>
      <th>(D, D)</th>
      <th>(D, Z)</th>
      <th>(Z, A)</th>
      <th>(Z, B)</th>
      <th>(Z, C)</th>
      <th>(Z, D)</th>
      <th>(Z, Z)</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <th>0</th>
      <td>1.0</td>
      <td>0.090616</td>
      <td>0.131002</td>
      <td>0.261849</td>
      <td>0.0</td>
      <td>0.0</td>
      <td>0.086569</td>
      <td>0.123042</td>
      <td>0.052544</td>
      <td>0.0</td>
      <td>...</td>
      <td>0.0</td>
      <td>0.0</td>
      <td>0.0</td>
      <td>0.0</td>
      <td>0.0</td>
      <td>0.0</td>
      <td>0.0</td>
      <td>0.0</td>
      <td>0.000000</td>
      <td>0.000000</td>
    </tr>
    <tr>
      <th>1</th>
      <td>2.0</td>
      <td>0.000000</td>
      <td>0.000000</td>
      <td>0.000000</td>
      <td>0.0</td>
      <td>0.0</td>
      <td>0.000000</td>
      <td>0.000000</td>
      <td>0.000000</td>
      <td>0.0</td>
      <td>...</td>
      <td>0.0</td>
      <td>0.0</td>
      <td>0.0</td>
      <td>0.0</td>
      <td>0.0</td>
      <td>0.0</td>
      <td>0.0</td>
      <td>0.0</td>
      <td>0.184334</td>
      <td>0.290365</td>
    </tr>
  </tbody>
</table>
<p>2 rows × 26 columns</p>
</div>




```python
# Learning the sgt embeddings as vector for
# all sequences in a corpus.
# mode: 'multiprocessing'

import pandarallel  # required library for multiprocessing

sgt = SGT(kappa=1, 
          flatten=True, 
          lengthsensitive=False,
          mode='multiprocessing')

sgt.fit_transform(corpus)
```

    INFO: Pandarallel will run on 7 workers.
    INFO: Pandarallel will use standard multiprocessing data transfer (pipe) to transfer data between the main process and workers.





<div>
<table border="1" class="dataframe">
  <thead>
    <tr style="text-align: right;">
      <th></th>
      <th>id</th>
      <th>(A, A)</th>
      <th>(A, B)</th>
      <th>(A, C)</th>
      <th>(A, D)</th>
      <th>(A, Z)</th>
      <th>(B, A)</th>
      <th>(B, B)</th>
      <th>(B, C)</th>
      <th>(B, D)</th>
      <th>...</th>
      <th>(D, A)</th>
      <th>(D, B)</th>
      <th>(D, C)</th>
      <th>(D, D)</th>
      <th>(D, Z)</th>
      <th>(Z, A)</th>
      <th>(Z, B)</th>
      <th>(Z, C)</th>
      <th>(Z, D)</th>
      <th>(Z, Z)</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <th>0</th>
      <td>1.0</td>
      <td>0.090616</td>
      <td>0.131002</td>
      <td>0.261849</td>
      <td>0.0</td>
      <td>0.0</td>
      <td>0.086569</td>
      <td>0.123042</td>
      <td>0.052544</td>
      <td>0.0</td>
      <td>...</td>
      <td>0.0</td>
      <td>0.0</td>
      <td>0.0</td>
      <td>0.0</td>
      <td>0.0</td>
      <td>0.0</td>
      <td>0.0</td>
      <td>0.0</td>
      <td>0.000000</td>
      <td>0.000000</td>
    </tr>
    <tr>
      <th>1</th>
      <td>2.0</td>
      <td>0.000000</td>
      <td>0.000000</td>
      <td>0.000000</td>
      <td>0.0</td>
      <td>0.0</td>
      <td>0.000000</td>
      <td>0.000000</td>
      <td>0.000000</td>
      <td>0.0</td>
      <td>...</td>
      <td>0.0</td>
      <td>0.0</td>
      <td>0.0</td>
      <td>0.0</td>
      <td>0.0</td>
      <td>0.0</td>
      <td>0.0</td>
      <td>0.0</td>
      <td>0.184334</td>
      <td>0.290365</td>
    </tr>
  </tbody>
</table>
<p>2 rows × 26 columns</p>
</div>

## Load Libraries for Illustrative Examples


```python
from sgt import SGT

import numpy as np
import pandas as pd
from itertools import chain
from itertools import product as iterproduct
import warnings

import pickle

########
from sklearn.preprocessing import LabelEncoder
import tensorflow as tf
from keras.datasets import imdb
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense
from tensorflow.keras.layers import LSTM
from tensorflow.keras.layers import Dropout
from tensorflow.keras.layers import Activation
from tensorflow.keras.layers import Flatten
from tensorflow.keras.layers import Embedding
from tensorflow.keras.preprocessing import sequence

from sklearn.model_selection import train_test_split
from sklearn.model_selection import KFold
from sklearn.model_selection import StratifiedKFold
import sklearn.metrics
import time

from sklearn.decomposition import PCA
from sklearn.cluster import KMeans

import matplotlib.pyplot as plt
%matplotlib inline

np.random.seed(7) # fix random seed for reproducibility

# from sgt import Sgt
```


## <a name="sequence-clustering"></a> Sequence Clustering

A form of unsupervised learning from sequences is clustering. For example, in 

- user weblogs sequences: clustering the weblogs segments users into groups with similar browsing behavior. This helps in targeted marketing, anomaly detection, and other web customizations.

- protein sequences: clustering proteins with similar structures help researchers study the commonalities between species. It also helps in faster search in some search algorithms.

In the following, clustering on a protein sequence dataset will be shown.



### Protein Sequence Clustering

The data used here is taken from www.uniprot.org. This is a public database for proteins. The data contains the protein sequences and their functions.


```python
# Loading data
corpus = pd.read_csv('data/protein_classification.csv')

# Data preprocessing
corpus = corpus.loc[:,['Entry','Sequence']]
corpus.columns = ['id', 'sequence']
corpus['sequence'] = corpus['sequence'].map(list)
corpus
```




<div>
<table border="1" class="dataframe">
  <thead>
    <tr style="text-align: right;">
      <th></th>
      <th>id</th>
      <th>sequence</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <th>0</th>
      <td>M7MCX3</td>
      <td>[M, E, I, E, K, T, N, R, M, N, A, L, F, E, F, ...</td>
    </tr>
    <tr>
      <th>1</th>
      <td>K6PL84</td>
      <td>[M, E, I, E, K, N, Y, R, M, N, S, L, F, E, F, ...</td>
    </tr>
    <tr>
      <th>2</th>
      <td>R4W5V3</td>
      <td>[M, E, I, E, K, T, N, R, M, N, A, L, F, E, F, ...</td>
    </tr>
    <tr>
      <th>3</th>
      <td>T2A126</td>
      <td>[M, E, I, E, K, T, N, R, M, N, A, L, F, E, F, ...</td>
    </tr>
    <tr>
      <th>4</th>
      <td>L0SHD5</td>
      <td>[M, E, I, E, K, T, N, R, M, N, A, L, F, E, F, ...</td>
    </tr>
    <tr>
      <th>...</th>
      <td>...</td>
      <td>...</td>
    </tr>
    <tr>
      <th>2107</th>
      <td>A0A081R612</td>
      <td>[M, M, N, M, Q, N, M, M, R, Q, A, Q, K, L, Q, ...</td>
    </tr>
    <tr>
      <th>2108</th>
      <td>A0A081QQM2</td>
      <td>[M, M, N, M, Q, N, M, M, R, Q, A, Q, K, L, Q, ...</td>
    </tr>
    <tr>
      <th>2109</th>
      <td>J1A517</td>
      <td>[M, M, R, Q, A, Q, K, L, Q, K, Q, M, E, Q, S, ...</td>
    </tr>
    <tr>
      <th>2110</th>
      <td>F5U1T6</td>
      <td>[M, M, N, M, Q, S, M, M, K, Q, A, Q, K, L, Q, ...</td>
    </tr>
    <tr>
      <th>2111</th>
      <td>J3A2T7</td>
      <td>[M, M, N, M, Q, N, M, M, K, Q, A, Q, K, L, Q, ...</td>
    </tr>
  </tbody>
</table>
<p>2112 rows × 2 columns</p>
</div>




```python
%%time
# Compute SGT embeddings
sgt_ = SGT(kappa=1, 
           lengthsensitive=False, 
           mode='multiprocessing')
sgtembedding_df = sgt_.fit_transform(corpus)
```

    INFO: Pandarallel will run on 7 workers.
    INFO: Pandarallel will use standard multiprocessing data transfer (pipe) to transfer data between the main process and workers.
    CPU times: user 324 ms, sys: 68 ms, total: 392 ms
    Wall time: 9.02 s



```python
sgtembedding_df
```




<div>
<table border="1" class="dataframe">
  <thead>
    <tr style="text-align: right;">
      <th></th>
      <th>id</th>
      <th>(A, A)</th>
      <th>(A, C)</th>
      <th>(A, D)</th>
      <th>(A, E)</th>
      <th>(A, F)</th>
      <th>(A, G)</th>
      <th>(A, H)</th>
      <th>(A, I)</th>
      <th>(A, K)</th>
      <th>...</th>
      <th>(Y, M)</th>
      <th>(Y, N)</th>
      <th>(Y, P)</th>
      <th>(Y, Q)</th>
      <th>(Y, R)</th>
      <th>(Y, S)</th>
      <th>(Y, T)</th>
      <th>(Y, V)</th>
      <th>(Y, W)</th>
      <th>(Y, Y)</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <th>0</th>
      <td>M7MCX3</td>
      <td>0.020180</td>
      <td>0.0</td>
      <td>0.009635</td>
      <td>0.013529</td>
      <td>0.009360</td>
      <td>0.003205</td>
      <td>2.944887e-10</td>
      <td>0.002226</td>
      <td>0.000379</td>
      <td>...</td>
      <td>0.009196</td>
      <td>0.007964</td>
      <td>0.036788</td>
      <td>0.000195</td>
      <td>0.001513</td>
      <td>0.020665</td>
      <td>0.000542</td>
      <td>0.007479</td>
      <td>0.0</td>
      <td>0.010419</td>
    </tr>
    <tr>
      <th>1</th>
      <td>K6PL84</td>
      <td>0.001604</td>
      <td>0.0</td>
      <td>0.012637</td>
      <td>0.006323</td>
      <td>0.006224</td>
      <td>0.004819</td>
      <td>3.560677e-03</td>
      <td>0.001124</td>
      <td>0.012136</td>
      <td>...</td>
      <td>0.135335</td>
      <td>0.006568</td>
      <td>0.038901</td>
      <td>0.011298</td>
      <td>0.012578</td>
      <td>0.009913</td>
      <td>0.001079</td>
      <td>0.000023</td>
      <td>0.0</td>
      <td>0.007728</td>
    </tr>
    <tr>
      <th>2</th>
      <td>R4W5V3</td>
      <td>0.012448</td>
      <td>0.0</td>
      <td>0.008408</td>
      <td>0.016363</td>
      <td>0.027469</td>
      <td>0.003205</td>
      <td>2.944887e-10</td>
      <td>0.004249</td>
      <td>0.013013</td>
      <td>...</td>
      <td>0.008114</td>
      <td>0.007128</td>
      <td>0.000000</td>
      <td>0.000203</td>
      <td>0.001757</td>
      <td>0.022736</td>
      <td>0.000249</td>
      <td>0.012652</td>
      <td>0.0</td>
      <td>0.008533</td>
    </tr>
    <tr>
      <th>3</th>
      <td>T2A126</td>
      <td>0.010545</td>
      <td>0.0</td>
      <td>0.012560</td>
      <td>0.014212</td>
      <td>0.013728</td>
      <td>0.000000</td>
      <td>2.944887e-10</td>
      <td>0.007223</td>
      <td>0.000309</td>
      <td>...</td>
      <td>0.000325</td>
      <td>0.009669</td>
      <td>0.000000</td>
      <td>0.003182</td>
      <td>0.001904</td>
      <td>0.015607</td>
      <td>0.000577</td>
      <td>0.007479</td>
      <td>0.0</td>
      <td>0.008648</td>
    </tr>
    <tr>
      <th>4</th>
      <td>L0SHD5</td>
      <td>0.020180</td>
      <td>0.0</td>
      <td>0.008628</td>
      <td>0.015033</td>
      <td>0.009360</td>
      <td>0.003205</td>
      <td>2.944887e-10</td>
      <td>0.002226</td>
      <td>0.000379</td>
      <td>...</td>
      <td>0.009196</td>
      <td>0.007964</td>
      <td>0.036788</td>
      <td>0.000195</td>
      <td>0.001513</td>
      <td>0.020665</td>
      <td>0.000542</td>
      <td>0.007479</td>
      <td>0.0</td>
      <td>0.010419</td>
    </tr>
    <tr>
      <th>...</th>
      <td>...</td>
      <td>...</td>
      <td>...</td>
      <td>...</td>
      <td>...</td>
      <td>...</td>
      <td>...</td>
      <td>...</td>
      <td>...</td>
      <td>...</td>
      <td>...</td>
      <td>...</td>
      <td>...</td>
      <td>...</td>
      <td>...</td>
      <td>...</td>
      <td>...</td>
      <td>...</td>
      <td>...</td>
      <td>...</td>
      <td>...</td>
    </tr>
    <tr>
      <th>2107</th>
      <td>A0A081R612</td>
      <td>0.014805</td>
      <td>0.0</td>
      <td>0.004159</td>
      <td>0.017541</td>
      <td>0.012701</td>
      <td>0.013099</td>
      <td>0.000000e+00</td>
      <td>0.017043</td>
      <td>0.004732</td>
      <td>...</td>
      <td>0.000000</td>
      <td>0.000000</td>
      <td>0.000000</td>
      <td>0.000000</td>
      <td>0.000000</td>
      <td>0.000000</td>
      <td>0.000000</td>
      <td>0.000000</td>
      <td>0.0</td>
      <td>0.000000</td>
    </tr>
    <tr>
      <th>2108</th>
      <td>A0A081QQM2</td>
      <td>0.010774</td>
      <td>0.0</td>
      <td>0.004283</td>
      <td>0.014732</td>
      <td>0.014340</td>
      <td>0.014846</td>
      <td>0.000000e+00</td>
      <td>0.016806</td>
      <td>0.005406</td>
      <td>...</td>
      <td>0.000000</td>
      <td>0.000000</td>
      <td>0.000000</td>
      <td>0.000000</td>
      <td>0.000000</td>
      <td>0.000000</td>
      <td>0.000000</td>
      <td>0.000000</td>
      <td>0.0</td>
      <td>0.000000</td>
    </tr>
    <tr>
      <th>2109</th>
      <td>J1A517</td>
      <td>0.010774</td>
      <td>0.0</td>
      <td>0.004283</td>
      <td>0.014732</td>
      <td>0.014340</td>
      <td>0.014846</td>
      <td>0.000000e+00</td>
      <td>0.014500</td>
      <td>0.005406</td>
      <td>...</td>
      <td>0.000000</td>
      <td>0.000000</td>
      <td>0.000000</td>
      <td>0.000000</td>
      <td>0.000000</td>
      <td>0.000000</td>
      <td>0.000000</td>
      <td>0.000000</td>
      <td>0.0</td>
      <td>0.000000</td>
    </tr>
    <tr>
      <th>2110</th>
      <td>F5U1T6</td>
      <td>0.015209</td>
      <td>0.0</td>
      <td>0.005175</td>
      <td>0.023888</td>
      <td>0.011410</td>
      <td>0.011510</td>
      <td>0.000000e+00</td>
      <td>0.021145</td>
      <td>0.009280</td>
      <td>...</td>
      <td>0.000000</td>
      <td>0.000000</td>
      <td>0.000000</td>
      <td>0.000000</td>
      <td>0.000000</td>
      <td>0.000000</td>
      <td>0.000000</td>
      <td>0.000000</td>
      <td>0.0</td>
      <td>0.000000</td>
    </tr>
    <tr>
      <th>2111</th>
      <td>J3A2T7</td>
      <td>0.005240</td>
      <td>0.0</td>
      <td>0.012301</td>
      <td>0.013178</td>
      <td>0.014744</td>
      <td>0.014705</td>
      <td>0.000000e+00</td>
      <td>0.000981</td>
      <td>0.007957</td>
      <td>...</td>
      <td>0.000000</td>
      <td>0.000000</td>
      <td>0.000000</td>
      <td>0.000000</td>
      <td>0.000000</td>
      <td>0.000000</td>
      <td>0.000000</td>
      <td>0.000000</td>
      <td>0.0</td>
      <td>0.000000</td>
    </tr>
  </tbody>
</table>
<p>2112 rows × 401 columns</p>
</div>




```python
# Set the id column as the dataframe index
sgtembedding_df = sgtembedding_df.set_index('id')
sgtembedding_df
```




<div>
<table border="1" class="dataframe">
  <thead>
    <tr style="text-align: right;">
      <th></th>
      <th>(A, A)</th>
      <th>(A, C)</th>
      <th>(A, D)</th>
      <th>(A, E)</th>
      <th>(A, F)</th>
      <th>(A, G)</th>
      <th>(A, H)</th>
      <th>(A, I)</th>
      <th>(A, K)</th>
      <th>(A, L)</th>
      <th>...</th>
      <th>(Y, M)</th>
      <th>(Y, N)</th>
      <th>(Y, P)</th>
      <th>(Y, Q)</th>
      <th>(Y, R)</th>
      <th>(Y, S)</th>
      <th>(Y, T)</th>
      <th>(Y, V)</th>
      <th>(Y, W)</th>
      <th>(Y, Y)</th>
    </tr>
    <tr>
      <th>id</th>
      <th></th>
      <th></th>
      <th></th>
      <th></th>
      <th></th>
      <th></th>
      <th></th>
      <th></th>
      <th></th>
      <th></th>
      <th></th>
      <th></th>
      <th></th>
      <th></th>
      <th></th>
      <th></th>
      <th></th>
      <th></th>
      <th></th>
      <th></th>
      <th></th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <th>M7MCX3</th>
      <td>0.020180</td>
      <td>0.0</td>
      <td>0.009635</td>
      <td>0.013529</td>
      <td>0.009360</td>
      <td>0.003205</td>
      <td>2.944887e-10</td>
      <td>0.002226</td>
      <td>0.000379</td>
      <td>0.021703</td>
      <td>...</td>
      <td>0.009196</td>
      <td>0.007964</td>
      <td>0.036788</td>
      <td>0.000195</td>
      <td>0.001513</td>
      <td>0.020665</td>
      <td>0.000542</td>
      <td>0.007479</td>
      <td>0.0</td>
      <td>0.010419</td>
    </tr>
    <tr>
      <th>K6PL84</th>
      <td>0.001604</td>
      <td>0.0</td>
      <td>0.012637</td>
      <td>0.006323</td>
      <td>0.006224</td>
      <td>0.004819</td>
      <td>3.560677e-03</td>
      <td>0.001124</td>
      <td>0.012136</td>
      <td>0.018427</td>
      <td>...</td>
      <td>0.135335</td>
      <td>0.006568</td>
      <td>0.038901</td>
      <td>0.011298</td>
      <td>0.012578</td>
      <td>0.009913</td>
      <td>0.001079</td>
      <td>0.000023</td>
      <td>0.0</td>
      <td>0.007728</td>
    </tr>
    <tr>
      <th>R4W5V3</th>
      <td>0.012448</td>
      <td>0.0</td>
      <td>0.008408</td>
      <td>0.016363</td>
      <td>0.027469</td>
      <td>0.003205</td>
      <td>2.944887e-10</td>
      <td>0.004249</td>
      <td>0.013013</td>
      <td>0.031118</td>
      <td>...</td>
      <td>0.008114</td>
      <td>0.007128</td>
      <td>0.000000</td>
      <td>0.000203</td>
      <td>0.001757</td>
      <td>0.022736</td>
      <td>0.000249</td>
      <td>0.012652</td>
      <td>0.0</td>
      <td>0.008533</td>
    </tr>
    <tr>
      <th>T2A126</th>
      <td>0.010545</td>
      <td>0.0</td>
      <td>0.012560</td>
      <td>0.014212</td>
      <td>0.013728</td>
      <td>0.000000</td>
      <td>2.944887e-10</td>
      <td>0.007223</td>
      <td>0.000309</td>
      <td>0.028531</td>
      <td>...</td>
      <td>0.000325</td>
      <td>0.009669</td>
      <td>0.000000</td>
      <td>0.003182</td>
      <td>0.001904</td>
      <td>0.015607</td>
      <td>0.000577</td>
      <td>0.007479</td>
      <td>0.0</td>
      <td>0.008648</td>
    </tr>
    <tr>
      <th>L0SHD5</th>
      <td>0.020180</td>
      <td>0.0</td>
      <td>0.008628</td>
      <td>0.015033</td>
      <td>0.009360</td>
      <td>0.003205</td>
      <td>2.944887e-10</td>
      <td>0.002226</td>
      <td>0.000379</td>
      <td>0.021703</td>
      <td>...</td>
      <td>0.009196</td>
      <td>0.007964</td>
      <td>0.036788</td>
      <td>0.000195</td>
      <td>0.001513</td>
      <td>0.020665</td>
      <td>0.000542</td>
      <td>0.007479</td>
      <td>0.0</td>
      <td>0.010419</td>
    </tr>
    <tr>
      <th>...</th>
      <td>...</td>
      <td>...</td>
      <td>...</td>
      <td>...</td>
      <td>...</td>
      <td>...</td>
      <td>...</td>
      <td>...</td>
      <td>...</td>
      <td>...</td>
      <td>...</td>
      <td>...</td>
      <td>...</td>
      <td>...</td>
      <td>...</td>
      <td>...</td>
      <td>...</td>
      <td>...</td>
      <td>...</td>
      <td>...</td>
      <td>...</td>
    </tr>
    <tr>
      <th>A0A081R612</th>
      <td>0.014805</td>
      <td>0.0</td>
      <td>0.004159</td>
      <td>0.017541</td>
      <td>0.012701</td>
      <td>0.013099</td>
      <td>0.000000e+00</td>
      <td>0.017043</td>
      <td>0.004732</td>
      <td>0.014904</td>
      <td>...</td>
      <td>0.000000</td>
      <td>0.000000</td>
      <td>0.000000</td>
      <td>0.000000</td>
      <td>0.000000</td>
      <td>0.000000</td>
      <td>0.000000</td>
      <td>0.000000</td>
      <td>0.0</td>
      <td>0.000000</td>
    </tr>
    <tr>
      <th>A0A081QQM2</th>
      <td>0.010774</td>
      <td>0.0</td>
      <td>0.004283</td>
      <td>0.014732</td>
      <td>0.014340</td>
      <td>0.014846</td>
      <td>0.000000e+00</td>
      <td>0.016806</td>
      <td>0.005406</td>
      <td>0.014083</td>
      <td>...</td>
      <td>0.000000</td>
      <td>0.000000</td>
      <td>0.000000</td>
      <td>0.000000</td>
      <td>0.000000</td>
      <td>0.000000</td>
      <td>0.000000</td>
      <td>0.000000</td>
      <td>0.0</td>
      <td>0.000000</td>
    </tr>
    <tr>
      <th>J1A517</th>
      <td>0.010774</td>
      <td>0.0</td>
      <td>0.004283</td>
      <td>0.014732</td>
      <td>0.014340</td>
      <td>0.014846</td>
      <td>0.000000e+00</td>
      <td>0.014500</td>
      <td>0.005406</td>
      <td>0.014083</td>
      <td>...</td>
      <td>0.000000</td>
      <td>0.000000</td>
      <td>0.000000</td>
      <td>0.000000</td>
      <td>0.000000</td>
      <td>0.000000</td>
      <td>0.000000</td>
      <td>0.000000</td>
      <td>0.0</td>
      <td>0.000000</td>
    </tr>
    <tr>
      <th>F5U1T6</th>
      <td>0.015209</td>
      <td>0.0</td>
      <td>0.005175</td>
      <td>0.023888</td>
      <td>0.011410</td>
      <td>0.011510</td>
      <td>0.000000e+00</td>
      <td>0.021145</td>
      <td>0.009280</td>
      <td>0.017466</td>
      <td>...</td>
      <td>0.000000</td>
      <td>0.000000</td>
      <td>0.000000</td>
      <td>0.000000</td>
      <td>0.000000</td>
      <td>0.000000</td>
      <td>0.000000</td>
      <td>0.000000</td>
      <td>0.0</td>
      <td>0.000000</td>
    </tr>
    <tr>
      <th>J3A2T7</th>
      <td>0.005240</td>
      <td>0.0</td>
      <td>0.012301</td>
      <td>0.013178</td>
      <td>0.014744</td>
      <td>0.014705</td>
      <td>0.000000e+00</td>
      <td>0.000981</td>
      <td>0.007957</td>
      <td>0.017112</td>
      <td>...</td>
      <td>0.000000</td>
      <td>0.000000</td>
      <td>0.000000</td>
      <td>0.000000</td>
      <td>0.000000</td>
      <td>0.000000</td>
      <td>0.000000</td>
      <td>0.000000</td>
      <td>0.0</td>
      <td>0.000000</td>
    </tr>
  </tbody>
</table>
<p>2112 rows × 400 columns</p>
</div>



We perform PCA on the sequence embeddings and then do kmeans clustering.


```python
pca = PCA(n_components=2)
pca.fit(sgtembedding_df)

X=pca.transform(sgtembedding_df)

print(np.sum(pca.explained_variance_ratio_))
df = pd.DataFrame(data=X, columns=['x1', 'x2'])
df.head()
```

    0.6432744907364981





<div>

<table border="1" class="dataframe">
  <thead>
    <tr style="text-align: right;">
      <th></th>
      <th>x1</th>
      <th>x2</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <th>0</th>
      <td>0.384913</td>
      <td>-0.269873</td>
    </tr>
    <tr>
      <th>1</th>
      <td>0.022764</td>
      <td>0.135995</td>
    </tr>
    <tr>
      <th>2</th>
      <td>0.177792</td>
      <td>-0.172454</td>
    </tr>
    <tr>
      <th>3</th>
      <td>0.168074</td>
      <td>-0.147334</td>
    </tr>
    <tr>
      <th>4</th>
      <td>0.383616</td>
      <td>-0.271163</td>
    </tr>
  </tbody>
</table>
</div>




```python
kmeans = KMeans(n_clusters=3, max_iter =300)
kmeans.fit(df)

labels = kmeans.predict(df)
centroids = kmeans.cluster_centers_

fig = plt.figure(figsize=(5, 5))
colmap = {1: 'r', 2: 'g', 3: 'b'}
colors = list(map(lambda x: colmap[x+1], labels))
plt.scatter(df['x1'], df['x2'], color=colors, alpha=0.5, edgecolor=colors)
```




    <matplotlib.collections.PathCollection at 0x14c5a77f0>




![png](output_23_1.png)


## <a name="sequence-classification"></a> Sequence Classification using Deep Learning in TensorFlow

The protein data set used above is also labeled. The labels represent the protein functions. Similarly, there are other labeled sequence data sets. For example, DARPA shared an intrusion weblog data set. It contains weblog sequences with positive labels if the log represents a network intrusion.

In such problems supervised learning is employed. Classification is a supervised learning we will demonstrate here.

### Protein Sequence Classification

The data set is taken from https://www.uniprot.org . The protein sequences in the data set have one of the two functions,
- Binds to DNA and alters its conformation. May be involved in regulation of gene expression, nucleoid organization and DNA protection.
- Might take part in the signal recognition particle (SRP) pathway. This is inferred from the conservation of its genetic proximity to ftsY/ffh. May be a regulatory protein.

There are a total of 2113 samples. The sequence lengths vary between 100-700.


```python
# Loading data
data = pd.read_csv('data/protein_classification.csv')


# Data preprocessing
y = data['Function [CC]']
encoder = LabelEncoder()
encoder.fit(y)
encoded_y = encoder.transform(y)

corpus = data.loc[:,['Entry','Sequence']]
corpus.columns = ['id', 'sequence']
corpus['sequence'] = corpus['sequence'].map(list)
```

#### Sequence embeddings


```python
# Sequence embedding
sgt_ = SGT(kappa=1, 
           lengthsensitive=False, 
           mode='multiprocessing')
sgtembedding_df = sgt_.fit_transform(corpus)
X = sgtembedding_df.set_index('id')
```

    INFO: Pandarallel will run on 7 workers.
    INFO: Pandarallel will use standard multiprocessing data transfer (pipe) to transfer data between the main process and workers.


We will perform a 10-fold cross-validation to measure the performance of the classification model.


```python
kfold = 10
X = X
y = encoded_y

random_state = 1

test_F1 = np.zeros(kfold)
skf = KFold(n_splits = kfold, shuffle = True, random_state = random_state)
k = 0
epochs = 50
batch_size = 128

for train_index, test_index in skf.split(X, y):
    X_train, X_test = X.iloc[train_index], X.iloc[test_index]
    y_train, y_test = y[train_index], y[test_index]

    model = Sequential()
    model.add(Dense(64, input_shape = (X_train.shape[1],))) 
    model.add(Activation('relu'))
    model.add(Dropout(0.5))
    model.add(Dense(32))
    model.add(Activation('relu'))
    model.add(Dropout(0.5))
    model.add(Dense(1))
    model.add(Activation('sigmoid'))
    model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])

    model.fit(X_train, y_train ,batch_size=batch_size, epochs=epochs, verbose=0)

    y_pred = model.predict_proba(X_test).round().astype(int)
    y_train_pred = model.predict_proba(X_train).round().astype(int)

    test_F1[k] = sklearn.metrics.f1_score(y_test, y_pred)
    k+=1

print ('Average f1 score', np.mean(test_F1))
```

    Average f1 score 1.0


### Weblog Classification for Intrusion Detection

This data sample is taken from https://www.ll.mit.edu/r-d/datasets/1998-darpa-intrusion-detection-evaluation-dataset. 
This is a network intrusion data containing audit logs and any attack as a positive label. Since, network intrusion is a rare event, the data is unbalanced. Here we will,
- build a sequence classification model to predict a network intrusion.

Each sequence contains in the data is a series of activity, for example, {login, password}. The _alphabets_ in the input data sequences are already encoded into integers. The original sequences data file is also present in the `/data` directory.


```python
# Loading data
data = pd.read_csv('data/darpa_data.csv')
data.columns
```




    Index(['timeduration', 'seqlen', 'seq', 'class'], dtype='object')




```python
data['id'] = data.index
```


```python
# Data preprocessing
y = data['class']
encoder = LabelEncoder()
encoder.fit(y)
encoded_y = encoder.transform(y)

corpus = data.loc[:,['id','seq']]
corpus.columns = ['id', 'sequence']
corpus['sequence'] = corpus['sequence'].map(list)
```

#### Sequence embeddings
In this data, the sequence embeddings should be **length-sensitive**. 

The lengths are important here because sequences with similar patterns but different lengths can have different labels. Consider a simple example of two sessions: `{login, pswd, login, pswd,...}` and `{login, pswd,...(repeated several times)..., login, pswd}`. 

While the first session can be a regular user mistyping the password once, the other session is possibly an attack to guess the password. Thus, the sequence lengths are as important as the patterns.

Therefore, `lengthsensitive=True` is used here.


```python
# Sequence embedding
sgt_ = SGT(kappa=5, 
           lengthsensitive=True, 
           mode='multiprocessing')
sgtembedding_df = sgt_.fit_transform(corpus)
sgtembedding_df = sgtembedding_df.set_index('id')
sgtembedding_df
```

    INFO: Pandarallel will run on 7 workers.
    INFO: Pandarallel will use standard multiprocessing data transfer (pipe) to transfer data between the main process and workers.





<div>
<table border="1" class="dataframe">
  <thead>
    <tr style="text-align: right;">
      <th></th>
      <th>(0, 0)</th>
      <th>(0, 1)</th>
      <th>(0, 2)</th>
      <th>(0, 3)</th>
      <th>(0, 4)</th>
      <th>(0, 5)</th>
      <th>(0, 6)</th>
      <th>(0, 7)</th>
      <th>(0, 8)</th>
      <th>(0, 9)</th>
      <th>...</th>
      <th>(~, 1)</th>
      <th>(~, 2)</th>
      <th>(~, 3)</th>
      <th>(~, 4)</th>
      <th>(~, 5)</th>
      <th>(~, 6)</th>
      <th>(~, 7)</th>
      <th>(~, 8)</th>
      <th>(~, 9)</th>
      <th>(~, ~)</th>
    </tr>
    <tr>
      <th>id</th>
      <th></th>
      <th></th>
      <th></th>
      <th></th>
      <th></th>
      <th></th>
      <th></th>
      <th></th>
      <th></th>
      <th></th>
      <th></th>
      <th></th>
      <th></th>
      <th></th>
      <th></th>
      <th></th>
      <th></th>
      <th></th>
      <th></th>
      <th></th>
      <th></th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <th>0.0</th>
      <td>0.000000</td>
      <td>0.000000</td>
      <td>0.000000</td>
      <td>0.000000e+00</td>
      <td>0.000000e+00</td>
      <td>0.000000e+00</td>
      <td>0.000000</td>
      <td>0.000000e+00</td>
      <td>0.000000e+00</td>
      <td>0.000000e+00</td>
      <td>...</td>
      <td>0.485034</td>
      <td>0.486999</td>
      <td>0.485802</td>
      <td>0.483097</td>
      <td>0.483956</td>
      <td>0.000000</td>
      <td>0.000000</td>
      <td>0.000000</td>
      <td>0.000000</td>
      <td>0.178609</td>
    </tr>
    <tr>
      <th>1.0</th>
      <td>0.000000</td>
      <td>0.025622</td>
      <td>0.228156</td>
      <td>0.000000e+00</td>
      <td>0.000000e+00</td>
      <td>1.310714e-09</td>
      <td>0.000000</td>
      <td>0.000000e+00</td>
      <td>0.000000e+00</td>
      <td>0.000000e+00</td>
      <td>...</td>
      <td>0.447620</td>
      <td>0.452097</td>
      <td>0.464568</td>
      <td>0.367296</td>
      <td>0.525141</td>
      <td>0.455018</td>
      <td>0.374364</td>
      <td>0.414081</td>
      <td>0.549981</td>
      <td>0.172479</td>
    </tr>
    <tr>
      <th>2.0</th>
      <td>0.000000</td>
      <td>0.000000</td>
      <td>0.000000</td>
      <td>0.000000e+00</td>
      <td>0.000000e+00</td>
      <td>0.000000e+00</td>
      <td>0.000000</td>
      <td>0.000000e+00</td>
      <td>0.000000e+00</td>
      <td>0.000000e+00</td>
      <td>...</td>
      <td>0.525605</td>
      <td>0.000000</td>
      <td>0.000000</td>
      <td>0.000000</td>
      <td>0.000000</td>
      <td>0.000000</td>
      <td>0.000000</td>
      <td>0.000000</td>
      <td>0.193359</td>
      <td>0.071469</td>
    </tr>
    <tr>
      <th>3.0</th>
      <td>0.077999</td>
      <td>0.208974</td>
      <td>0.230338</td>
      <td>1.830519e-01</td>
      <td>1.200926e-17</td>
      <td>1.696880e-01</td>
      <td>0.093646</td>
      <td>7.985870e-02</td>
      <td>2.896813e-05</td>
      <td>3.701710e-05</td>
      <td>...</td>
      <td>0.474072</td>
      <td>0.468353</td>
      <td>0.463594</td>
      <td>0.177507</td>
      <td>0.551270</td>
      <td>0.418652</td>
      <td>0.309652</td>
      <td>0.384657</td>
      <td>0.378225</td>
      <td>0.170362</td>
    </tr>
    <tr>
      <th>4.0</th>
      <td>0.000000</td>
      <td>0.023695</td>
      <td>0.217819</td>
      <td>2.188276e-33</td>
      <td>0.000000e+00</td>
      <td>6.075992e-11</td>
      <td>0.000000</td>
      <td>0.000000e+00</td>
      <td>5.681668e-39</td>
      <td>0.000000e+00</td>
      <td>...</td>
      <td>0.464120</td>
      <td>0.468229</td>
      <td>0.452170</td>
      <td>0.000000</td>
      <td>0.501242</td>
      <td>0.000000</td>
      <td>0.300534</td>
      <td>0.161961</td>
      <td>0.000000</td>
      <td>0.167082</td>
    </tr>
    <tr>
      <th>...</th>
      <td>...</td>
      <td>...</td>
      <td>...</td>
      <td>...</td>
      <td>...</td>
      <td>...</td>
      <td>...</td>
      <td>...</td>
      <td>...</td>
      <td>...</td>
      <td>...</td>
      <td>...</td>
      <td>...</td>
      <td>...</td>
      <td>...</td>
      <td>...</td>
      <td>...</td>
      <td>...</td>
      <td>...</td>
      <td>...</td>
      <td>...</td>
    </tr>
    <tr>
      <th>106.0</th>
      <td>0.000000</td>
      <td>0.024495</td>
      <td>0.219929</td>
      <td>2.035190e-17</td>
      <td>1.073271e-18</td>
      <td>5.656994e-11</td>
      <td>0.000000</td>
      <td>0.000000e+00</td>
      <td>5.047380e-29</td>
      <td>0.000000e+00</td>
      <td>...</td>
      <td>0.502213</td>
      <td>0.544343</td>
      <td>0.477281</td>
      <td>0.175901</td>
      <td>0.461103</td>
      <td>0.000000</td>
      <td>0.000000</td>
      <td>0.162796</td>
      <td>0.000000</td>
      <td>0.167687</td>
    </tr>
    <tr>
      <th>107.0</th>
      <td>0.110422</td>
      <td>0.227478</td>
      <td>0.217549</td>
      <td>1.723963e-01</td>
      <td>1.033292e-14</td>
      <td>3.896725e-07</td>
      <td>0.083685</td>
      <td>2.940589e-08</td>
      <td>8.864072e-02</td>
      <td>4.813990e-29</td>
      <td>...</td>
      <td>0.490398</td>
      <td>0.522016</td>
      <td>0.466808</td>
      <td>0.470603</td>
      <td>0.479795</td>
      <td>0.480057</td>
      <td>0.194888</td>
      <td>0.172397</td>
      <td>0.164873</td>
      <td>0.172271</td>
    </tr>
    <tr>
      <th>108.0</th>
      <td>0.005646</td>
      <td>0.202424</td>
      <td>0.196786</td>
      <td>2.281242e-01</td>
      <td>1.133936e-01</td>
      <td>1.862098e-01</td>
      <td>0.000000</td>
      <td>1.212869e-01</td>
      <td>9.180520e-08</td>
      <td>0.000000e+00</td>
      <td>...</td>
      <td>0.432834</td>
      <td>0.434953</td>
      <td>0.439615</td>
      <td>0.390864</td>
      <td>0.481764</td>
      <td>0.600875</td>
      <td>0.166766</td>
      <td>0.165368</td>
      <td>0.000000</td>
      <td>0.171729</td>
    </tr>
    <tr>
      <th>109.0</th>
      <td>0.000000</td>
      <td>0.025616</td>
      <td>0.238176</td>
      <td>3.889176e-55</td>
      <td>1.332427e-60</td>
      <td>1.408003e-09</td>
      <td>0.000000</td>
      <td>9.845377e-60</td>
      <td>0.000000e+00</td>
      <td>0.000000e+00</td>
      <td>...</td>
      <td>0.421318</td>
      <td>0.439985</td>
      <td>0.467953</td>
      <td>0.440951</td>
      <td>0.527165</td>
      <td>0.864717</td>
      <td>0.407155</td>
      <td>0.399335</td>
      <td>0.251304</td>
      <td>0.171885</td>
    </tr>
    <tr>
      <th>110.0</th>
      <td>0.000000</td>
      <td>0.022868</td>
      <td>0.203513</td>
      <td>9.273472e-64</td>
      <td>0.000000e+00</td>
      <td>1.240870e-09</td>
      <td>0.000000</td>
      <td>0.000000e+00</td>
      <td>0.000000e+00</td>
      <td>0.000000e+00</td>
      <td>...</td>
      <td>0.478090</td>
      <td>0.454871</td>
      <td>0.459109</td>
      <td>0.000000</td>
      <td>0.490534</td>
      <td>0.370357</td>
      <td>0.000000</td>
      <td>0.162997</td>
      <td>0.000000</td>
      <td>0.162089</td>
    </tr>
  </tbody>
</table>
<p>111 rows × 121 columns</p>
</div>



#### Applying PCA on the embeddings
The embeddings are sparse and high-dimensional. PCA is, therefore, applied for dimension reduction.


```python
from sklearn.decomposition import PCA
pca = PCA(n_components=35)
pca.fit(sgtembedding_df)
X = pca.transform(sgtembedding_df)
print(np.sum(pca.explained_variance_ratio_))
```

    0.9962446146783123


#### Building a Multi-Layer Perceptron Classifier
The PCA transforms of the embeddings are used directly as inputs to an MLP classifier.


```python
kfold = 3
random_state = 11

X = X
y = encoded_y

test_F1 = np.zeros(kfold)
time_k = np.zeros(kfold)
skf = StratifiedKFold(n_splits=kfold, shuffle=True, random_state=random_state)
k = 0
epochs = 300
batch_size = 15

# class_weight = {0 : 1., 1: 1.,}  # The weights can be changed and made inversely proportional to the class size to improve the accuracy.
class_weight = {0 : 0.12, 1: 0.88,}

for train_index, test_index in skf.split(X, y):
    X_train, X_test = X[train_index], X[test_index]
    y_train, y_test = y[train_index], y[test_index]

    model = Sequential()
    model.add(Dense(128, input_shape=(X_train.shape[1],))) 
    model.add(Activation('relu'))
    model.add(Dropout(0.5))
    model.add(Dense(1))
    model.add(Activation('sigmoid'))
    model.summary()
    model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])

    start_time = time.time()
    model.fit(X_train, y_train ,batch_size=batch_size, epochs=epochs, verbose=1, class_weight=class_weight)
    end_time = time.time()
    time_k[k] = end_time-start_time

    y_pred = model.predict_proba(X_test).round().astype(int)
    y_train_pred = model.predict_proba(X_train).round().astype(int)
    test_F1[k] = sklearn.metrics.f1_score(y_test, y_pred)
    k += 1
```

    Model: "sequential_10"
    _________________________________________________________________
    Layer (type)                 Output Shape              Param #   
    =================================================================
    dense_30 (Dense)             (None, 128)               4608      
    _________________________________________________________________
    activation_30 (Activation)   (None, 128)               0         
    _________________________________________________________________
    dropout_20 (Dropout)         (None, 128)               0         
    _________________________________________________________________
    dense_31 (Dense)             (None, 1)                 129       
    _________________________________________________________________
    activation_31 (Activation)   (None, 1)                 0         
    =================================================================
    Total params: 4,737
    Trainable params: 4,737
    Non-trainable params: 0
    _________________________________________________________________
    WARNING:tensorflow:sample_weight modes were coerced from
      ...
        to  
      ['...']
    Train on 74 samples
    Epoch 1/300
    74/74 [==============================] - 0s 7ms/sample - loss: 0.1487 - accuracy: 0.5270
    Epoch 2/300
    74/74 [==============================] - 0s 120us/sample - loss: 0.1421 - accuracy: 0.5000
    ...
    74/74 [==============================] - 0s 118us/sample - loss: 0.0299 - accuracy: 0.8784
    Epoch 300/300
    74/74 [==============================] - 0s 133us/sample - loss: 0.0296 - accuracy: 0.8649



```python
print ('Average f1 score', np.mean(test_F1))
print ('Average Run time', np.mean(time_k))
```

    Average f1 score 0.6341880341880342
    Average Run time 3.880180994669596


#### Building an LSTM Classifier on the sequences for comparison
We built an LSTM Classifier on the sequences to compare the accuracy.


```python
X = data['seq']
encoded_X = np.ndarray(shape=(len(X),), dtype=list)
for i in range(0,len(X)):
    encoded_X[i]=X.iloc[i].split("~")
X
```




    0      1~2~3~3~3~3~3~3~1~4~5~1~2~3~3~3~3~3~3~1~4~5~1~...
    1      6~5~5~6~5~6~5~2~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~...
    2      19~19~19~19~19~19~19~19~19~19~19~19~19~19~19~1...
    3      6~5~5~6~5~6~5~2~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~...
    4      5~5~17~5~5~5~5~5~10~2~11~2~11~11~12~11~11~5~2~...
                                 ...                        
    106    10~2~11~2~11~11~12~11~11~5~2~11~5~2~5~2~3~14~3...
    107    5~5~2~5~17~6~5~6~5~5~2~6~17~3~2~2~3~5~2~3~5~6~...
    108    6~5~6~5~5~6~5~5~6~6~6~6~6~6~6~6~6~6~6~6~6~6~6~...
    109    6~5~5~6~5~6~5~2~38~2~3~5~22~39~5~5~5~5~5~5~5~5...
    110    5~6~5~5~10~2~11~2~11~11~12~11~5~2~11~11~12~11~...
    Name: seq, Length: 111, dtype: object




```python
max_seq_length = np.max(data['seqlen'])
encoded_X = tf.keras.preprocessing.sequence.pad_sequences(encoded_X, maxlen=max_seq_length)
```


```python
kfold = 3
random_state = 11

test_F1 = np.zeros(kfold)
time_k = np.zeros(kfold)

epochs = 50
batch_size = 15
skf = StratifiedKFold(n_splits=kfold, shuffle=True, random_state=random_state)
k = 0

for train_index, test_index in skf.split(encoded_X, y):
    X_train, X_test = encoded_X[train_index], encoded_X[test_index]
    y_train, y_test = y[train_index], y[test_index]

    embedding_vecor_length = 32
    top_words=50
    model = Sequential()
    model.add(Embedding(top_words, embedding_vecor_length, input_length=max_seq_length))
    model.add(LSTM(32))
    model.add(Dense(1))
    model.add(Activation('sigmoid'))
    model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])

    model.summary()

    start_time = time.time()
    model.fit(X_train, y_train, epochs=epochs, batch_size=batch_size, verbose=1)
    end_time=time.time()
    time_k[k]=end_time-start_time

    y_pred = model.predict_proba(X_test).round().astype(int)
    y_train_pred=model.predict_proba(X_train).round().astype(int)
    test_F1[k]=sklearn.metrics.f1_score(y_test, y_pred)
    k+=1
```

    Model: "sequential_13"
    _________________________________________________________________
    Layer (type)                 Output Shape              Param #   
    =================================================================
    embedding (Embedding)        (None, 1773, 32)          1600      
    _________________________________________________________________
    lstm (LSTM)                  (None, 32)                8320      
    _________________________________________________________________
    dense_36 (Dense)             (None, 1)                 33        
    _________________________________________________________________
    activation_36 (Activation)   (None, 1)                 0         
    =================================================================
    Total params: 9,953
    Trainable params: 9,953
    Non-trainable params: 0
    _________________________________________________________________
    Train on 74 samples
    Epoch 1/50
    74/74 [==============================] - 5s 72ms/sample - loss: 0.6894 - accuracy: 0.5676
    Epoch 2/50
    74/74 [==============================] - 4s 48ms/sample - loss: 0.6590 - accuracy: 0.8784
    ...
    Epoch 50/50
    74/74 [==============================] - 4s 51ms/sample - loss: 0.1596 - accuracy: 0.9324



```python
print ('Average f1 score', np.mean(test_F1))
print ('Average Run time', np.mean(time_k))
```

    Average f1 score 0.36111111111111116
    Average Run time 192.46954011917114


We find that the LSTM classifier gives a significantly lower F1 score. This may be improved by changing the model. However, we find that the SGT embedding could work with a small and unbalanced data without the need of a complicated classifier model.

LSTM models typically require more data for training and also has significantly more computation time. The LSTM model above took 425.6 secs while the MLP model took just 9.1 secs.

## <a name="sequence-search"></a> Sequence Search

Sequence data sets are generally large. For example, sequences of listening history in music streaming services, such as Pandora, for more than 70M users are huge. In protein data bases there could be even larger size. For instance, the Uniprot data repository has more than 177M sequences.

Searching for similar sequences in such large data bases is challenging. SGT embedding provides a simple solution. In the following it will be shown on a protein data set that SGT embedding can be used to compute similarity between a query sequence and the sequence corpus using a dot product. The sequences with the highest dot product are returned as the most similar sequence to the query.

### Protein Sequence Search

In the following, a sample of 10k protein sequences are used for illustration. The data is taken from https://www.uniprot.org .


```python
# Loading data
data = pd.read_csv('data/protein-uniprot-reviewed-Ano-10k.tab', sep='\t')

# Data preprocessing
corpus = data.loc[:,['Entry','Sequence']]
corpus.columns = ['id', 'sequence']
corpus['sequence'] = corpus['sequence'].map(list)
corpus.head(3)
```




<div>
<table border="1" class="dataframe">
  <thead>
    <tr style="text-align: right;">
      <th></th>
      <th>id</th>
      <th>sequence</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <th>0</th>
      <td>I2WKR6</td>
      <td>[M, V, H, K, S, D, S, D, E, L, A, A, L, R, A, ...</td>
    </tr>
    <tr>
      <th>1</th>
      <td>A0A2A6M8K9</td>
      <td>[M, Q, E, S, L, V, V, R, R, E, T, H, I, A, A, ...</td>
    </tr>
    <tr>
      <th>2</th>
      <td>A0A3G5KEC3</td>
      <td>[M, A, S, G, A, Y, S, K, Y, L, F, Q, I, I, G, ...</td>
    </tr>
  </tbody>
</table>
</div>




```python
# Protein sequence alphabets
alphabets = ['A', 'C', 'D', 'E', 'F', 'G', 'H', 'I', 'K', 
             'L', 'M', 'N', 'P', 'Q', 'R', 'S', 'T', 'V', 
             'W', 'X', 'Y', 'U', 'O']  # List of amino acids

# Alphabets are known and inputted 
# as arguments for faster computation
sgt_ = SGT(alphabets=alphabets, 
           lengthsensitive=True, 
           kappa=1, 
           flatten=True, 
           mode='multiprocessing')

sgtembedding_df = sgt_.fit_transform(corpus)
sgtembedding_df = sgtembedding_df.set_index('id')
```

    INFO: Pandarallel will run on 7 workers.
    INFO: Pandarallel will use standard multiprocessing data transfer (pipe) to transfer data between the main process and workers.



```python
'''
Search proteins similar to a query protein.
The approach is to find the SGT embedding of the
query protein and find its similarity with the
embeddings of the protein database.
'''

query_protein = 'MSHVFPIVIDDNFLSPQDLVSAARSGCSLRLHTGVVDKIDRAHRFVLEIAGAEALHYGINTGFGSLCTTHIDPADLSTLQHNLLKSHACGVGPTVSEEVSRVVTLIKLLTFRTGNSGVSLSTVNRIIDLWNHGVVGAIAQKGTVGASGDLAPLAHLFLPLIGLGQVWHRGVLRPSREVMDELKLAPLTLQPKDGLCLTNGVQYLNAWGALSTVRAKRLVALADLCAAMSMMGFSAARSFIEAQIHQTSLHPERGHVALHLRTLTHGSNHADLPHCNPAMEDPYSFRCAPQVHGAARQVVGYLETVIGNECNSVSDNPLVFPDTRQILTCGNLHGQSTAFALDFAAIGITDLSNISERRTYQLLSGQNGLPGFLVAKPGLNSGFMVVQYTSAALLNENKVLSNPASVDTIPTCHLQEDHVSMGGTSAYKLQTILDNCETILAIELMTACQAIDMNPGLQLSERGRAIYEAVREEIPFVKEDHLMAGLISKSRDLCQHSTVIAQQLAEMQAQ'

# Step 1. Compute sgt embedding for the query protein.
query_protein_sgt_embedding = sgt_.fit(list(query_protein))

# Step 2. Compute the dot product of query embedding 
# with the protein embedding database.
similarity = sgtembedding_df.dot(query_protein_sgt_embedding)

# Step 3. Return the top k protein names based on similarity.
similarity.sort_values(ascending=False)
```




    id
    K0ZGN5        2773.749663
    A0A0Y1CPH7    1617.451379
    A0A5R8LCJ1    1566.833152
    A0A290WY40    1448.772820
    A0A073K6N6    1392.267250
                     ...     
    A0A1S7UBK4     160.074989
    A0A2S7T1R9     156.580584
    A0A0E0UQV6     155.834932
    A0A1Y5Y0S0     148.862049
    B0NRP3         117.656497
    Length: 10063, dtype: float64



## <a name="sgt-spark"></a> SGT - Spark for Distributed Computing

As mentioned in the previous section, sequence data sets can be large. SGT complexity is linear with the number of sequences in a data set. Still if the data size is large the computation becomes high. For example, for a set of 1M protein sequences the default SGT mode takes over 24 hours.

Using distributed computing with Spark the runtime can be significantly reduced. For instance, SGT-Spark on the same 1M protein data set took less than 29 minutes.

In the following, Spark implementation for SGT is shown. First, it is applied on a smaller 10k data set for comparison. Then it is applied on 1M data set without any syntactical change.


```python
'''
Load the data and remove header.
'''
data = sc.textFile('data/protein-uniprot-reviewed-Ano-10k.tab')

header = data.first() #extract header
data = data.filter(lambda row: row != header)   #filter out header
data.take(1)  # See one sample

```

<div class="ansiout"><span class="ansired">Out[</span><span class="ansired">3</span><span class="ansired">]: </span>[&apos;I2WKR6\tI2WKR6_ECOLX\tunreviewed\tType III restriction enzyme, res subunit (EC 3.1.21.5)\tEC90111_4246\tEscherichia coli 9.0111\t786\tMVHKSDSDELAALRAENVRLVSLLEAHGIEWRRKPQSPVPRVSVLSTNEKVALFRRLFRGRDDVWALRWESKTSGKSGYSPACANEWQLGICGKPRIKCGDCAHRQLIPVSDLVIYHHLAGTHTAGMYPLLEDDSCYFLAVDFDEAEWQKDASAFMRSCDELGVPAALEISRSRQGAHVWIFFASRVSAREARRLGTAIISYTCSRTRQLRLGSYDRLFPNQDTMPKGGFGNLIALPLQKRPRELGGSVFVDMNLQPYPDQWAFLVSVIPMNVQDIEPTILRATGSIHPLDVNFINEEDLGTPWEEKKSSGNRLNIAVTEPLIITLANQIYFEKAQLPQALVNRLIRLAAFPNPEFYKAQAMRMSVWNKPRVIGCAENYPQHIALPRGCLDSALSFLRYNNIAAELIDKRFAGTECNAVFTGNLRAEQEEAVSALLRYDTGVLCAPTAFGKTVTAAAVIARRKVNTLILVHRTELLKQWQERLAVFLQVGDSIGIIGGGKHKPCGNIDIAVVQSISRHGEVEPLVRNYGQIIVDECHHIGAVSFSAILKETNARYLLGLTATPIRRDGLHPIIFMYCGAIRHTAARPKESLHNLEVLTRSRFTSGHLPSDARIQDIFREIALDHDRTVAIAEEAMKAFGQGRKVLVLTERTDHLDDIASVMNTLKLSPFVLHSRLSKKKRTMLISGLNALPPDSPRILLSTGRLIGEGFDHPPLDTLILAMPVSWKGTLQQYAGRLHREHTGKSDVRIIDFVDTAYPVLLRMWDKRQRGYKAMGYRIVADGEGLSF&apos;]</div>



```python
# Repartition for increasing the parallel processes
data = data.repartition(500)
```


```python
def preprocessing(line):
    '''
    Original data are lines where each line has \t
    separated values. We are interested in preserving
    the first value (entry id), tmp[0], and the last value
    (the sequence), tmp[-1].
    '''
    tmp = line.split('\t')
    id = tmp[0]
    sequence = list(tmp[-1])
    return (id, sequence)

processeddata = data.map(lambda line: preprocessing(line))
processeddata.take(1)  # See one sample

```


<div class="ansiout"><span class="ansired">Out[</span><span class="ansired">5</span><span class="ansired">]: </span>[(&apos;A0A2E9WIJ1&apos;,
  [&apos;M&apos;,
   &apos;Y&apos;,
   &apos;I&apos;,
   &apos;F&apos;,
   &apos;L&apos;,
   &apos;T&apos;,
   &apos;L&apos;,
	...   
   &apos;A&apos;,
   &apos;K&apos;,
   &apos;L&apos;,
   &apos;D&apos;,
   &apos;K&apos;,
   &apos;N&apos;,
   &apos;D&apos;])]</div>





```python
# Protein sequence alphabets
alphabets = ['A', 'C', 'D', 'E', 'F', 'G', 'H', 'I', 'K', 
             'L', 'M', 'N', 'P', 'Q', 'R', 'S', 'T', 'V', 
             'W', 'X', 'Y', 'U', 'O']  # List of amino acids
```


```python
'''
Spark approach.
In this approach the alphabets argument has to
be passed to the SGT class definition.
The SGT.fit() is then called in parallel.
'''
sgt_ = sgt.SGT(alphabets=alphabets, 
               kappa=1, 
               lengthsensitive=True, 
               flatten=True)
rdd = processeddata.map(lambda x: (x[0], list(sgt_.fit(x[1]))))
sgtembeddings = rdd.collect()
# Command took 29.66 seconds -- by cranjan@processminer.com at 4/22/2020, 12:31:23 PM on databricks
```

### Compare with the default SGT mode


```python
# Loading data
data = pd.read_csv('data/protein-uniprot-reviewed-Ano-10k.tab', sep='\t')

# Data preprocessing
corpus = data.loc[:,['Entry','Sequence']]
corpus.columns = ['id', 'sequence']
corpus['sequence'] = corpus['sequence'].map(list)

```


```python
sgt_ = sgt.SGT(alphabets=alphabets, 
               lengthsensitive=True, 
               kappa=1, 
               flatten=True, 
               mode='default')

sgtembedding_df = sgt_.fit_transform(corpus)
# Command took 13.08 minutes -- by cranjan@processminer.com at 4/22/2020, 1:48:02 PM on databricks
```

### 1M Protein Database

Protein 1M sequence data set is embedded here. The data set is available [here](https://mega.nz/file/1qAXhSAS#l7E60cLJzMGtFQzeHZL9PI8yX4tRQcAMFRN2xeHK81w).

```python
'''
Load the data and remove header.
'''
data = sc.textFile('data/protein-uniprot-reviewed-Ano-1M.tab')

header = data.first() #extract header
data = data.filter(lambda row: row != header)   #filter out header
data.take(1)  # See one sample
```



```python
# Repartition for increasing the parallel processes
data = data.repartition(10000)
```


```python
processeddata = data.map(lambda line: preprocessing(line))
processeddata.take(1)  # See one sample

# [('A0A2E9WIJ1',
#   ['M','Y','I','F','L','T','L','A','L','F','S',...,'F','S','I','F','A','K','L','D','K','N','D'])]
```


```python
# Protein sequence alphabets
alphabets = ['A', 'C', 'D', 'E', 'F', 'G', 'H', 'I', 'K', 
             'L', 'M', 'N', 'P', 'Q', 'R', 'S', 'T', 'V', 
             'W', 'X', 'Y', 'U', 'O']  # List of amino acids
```


```python
'''
Spark approach.
In this approach the alphabets argument has to
be passed to the SGT class definition.
The SGT.fit() is then called in parallel.
'''
sgt_ = sgt.SGT(alphabets=alphabets, 
               kappa=1, 
               lengthsensitive=True, 
               flatten=True)
rdd = processeddata.map(lambda x: (x[0], list(sgt_.fit(x[1]))))
sgtembeddings = rdd.collect()
# Command took 28.98 minutes -- by cranjan@processminer.com at 4/22/2020, 3:16:41 PM on databricks
```


```python
'''OPTIONAL.
Save the embeddings for future use or 
production deployment.'''
# Save for deployment
# pickle.dump(sgtembeddings, 
#             open("data/protein-sgt-embeddings-1M.pkl", "wb"))
# The pickle dump is shared at https://mega.nz/file/hiAxAAoI#SStAIn_FZjAHvXSpXfdy8VpISG6rusHRf9HlUSqwcsw
# sgtembeddings = pickle.load(open("data/protein-sgt-embeddings-1M.pkl", "rb"))
```

The pickle dump is shared [here](https://mega.nz/file/hiAxAAoI#SStAIn_FZjAHvXSpXfdy8VpISG6rusHRf9HlUSqwcsw).

### Sequence Search using SGT - Spark

Since `sgtembeddings` on the 1M data set is large it is recommended to use distributed computing to find similar proteins during a search.


```python
sgtembeddings_rdd = sc.parallelize(list(dict(sgtembeddings).items()))
sgtembeddings_rdd = sgtembeddings_rdd.repartition(10000)
```


```python
'''
Search proteins similar to a query protein.
The approach is to find the SGT embedding of the
query protein and find its similarity with the
embeddings of the protein database.
'''

query_protein = 'MSHVFPIVIDDNFLSPQDLVSAARSGCSLRLHTGVVDKIDRAHRFVLEIAGAEALHYGINTGFGSLCTTHIDPADLSTLQHNLLKSHACGVGPTVSEEVSRVVTLIKLLTFRTGNSGVSLSTVNRIIDLWNHGVVGAIAQKGTVGASGDLAPLAHLFLPLIGLGQVWHRGVLRPSREVMDELKLAPLTLQPKDGLCLTNGVQYLNAWGALSTVRAKRLVALADLCAAMSMMGFSAARSFIEAQIHQTSLHPERGHVALHLRTLTHGSNHADLPHCNPAMEDPYSFRCAPQVHGAARQVVGYLETVIGNECNSVSDNPLVFPDTRQILTCGNLHGQSTAFALDFAAIGITDLSNISERRTYQLLSGQNGLPGFLVAKPGLNSGFMVVQYTSAALLNENKVLSNPASVDTIPTCHLQEDHVSMGGTSAYKLQTILDNCETILAIELMTACQAIDMNPGLQLSERGRAIYEAVREEIPFVKEDHLMAGLISKSRDLCQHSTVIAQQLAEMQAQ'

# Step 1. Compute sgt embedding for the query protein.
query_protein_sgt_embedding = sgt_.fit(list(query_protein))

# Step 2. Broadcast the embedding to the cluster.
query_protein_sgt_embedding_broadcasted = sc.broadcast(list(query_protein_sgt_embedding))

# Step 3. Compute similarity between each sequence embedding and the query.
similarity = sgtembeddings_rdd.map(lambda x: (x[0], 
                                              np.dot(query_protein_sgt_embedding_broadcasted.value, 
                                                     x[1]))).collect()

# Step 4. Show the most similar sequences with the query.
pd.DataFrame(similarity).sort_values(by=1, ascending=False)
```


