Code Examples
=============

Head file FinalReport
---------------------
::

    [Header]
    GSGT Version	2.0.4
    Processing Date	1/16/2023 9:19 AM
    Content		GGP_HDv3_C.bpm
    Num SNPs	139376
    Total SNPs	140668
    Num Samples	25
    Total Samples	26
    File 	1 of 25
    [Data]
    SNP Name	Sample ID	Allele1 - Forward	Allele2 - Forward	Allele1 - Top	Allele2 - Top	Allele1 - AB	Allele2 - AB	GC Score	X	Y
    ARS-BFGL-BAC-10172	HO840M003135245650	G	G	G	G	B	B	0.9420	0.069	0.801
    ARS-BFGL-BAC-1020	HO840M003135245650	G	G	G	G	B	B	0.9489	0.033	0.700
    ARS-BFGL-BAC-10245	HO840M003135245650	C	C	G	G	B	B	0.7277	0.152	1.504
    ARS-BFGL-BAC-10345	HO840M003135245650	A	C	A	C	A	B	0.9411	0.598	0.572
    ARS-BFGL-BAC-10375	HO840M003135245650	A	G	A	G	A	B	0.9348	0.430	0.494
    ...

Processing the Finalreport.txt
------------------------------

.. code-block:: python

        from snplib.finalreport import FinalReport

        obj_report = FinalReport()
        obj_report.handle("path/to/finalreport.txt")

        header = obj_report.header
        data = obj_report.snp_data

Output Dataframe::

    SNP Name	Sample ID	Allele1 - Forward	Allele2 - Forward	Allele1 - Top	Allele2 - Top	Allele1 - AB	Allele2 - AB	GC Score	X	Y
    ARS-BFGL-BAC-10172	HO840M003135245650	G	G	G	G	B	B	0.9420	0.069	0.801
    ARS-BFGL-BAC-1020	HO840M003135245650	G	G	G	G	B	B	0.9489	0.033	0.700
    ARS-BFGL-BAC-10245	HO840M003135245650	C	C	G	G	B	B	0.7277	0.152	1.504
    ARS-BFGL-BAC-10345	HO840M003135245650	A	C	A	C	A	B	0.9411	0.598	0.572
    ARS-BFGL-BAC-10375	HO840M003135245650	A	G	A	G	A	B	0.9348	0.430	0.494


Select alleles - AB or Top or Forward

.. code-block:: python

        alleles_ab = FinalReport(allele="AB")
        alleles_ab.handle("path/to/finalreport.txt")
        data_ab = alleles_ab.snp_data

        alleles_top = FinalReport(allele="Top")
        alleles_top.handle("path/to/finalreport.txt")
        data_top = alleles_top.snp_data

        alleles_forward = FinalReport(allele="Forward")
        alleles_forward.handle("path/to/finalreport.txt")
        data_forward = alleles_forward.snp_data

Output::

        SNP Name    Sample ID    Allele1 - AB  Allele2 - AB     GC Score    X Y
        ARS-BFGL-BAC-10172 HO840M003135245650 B B 0.9420 0.069 0.801
        ARS-BFGL-BAC-1020 HO840M003135245650 B B 0.9489 0.033 0.700
        ARS-BFGL-BAC-10245 HO840M003135245650 B B 0.7277 0.152 1.504
        ARS-BFGL-BAC-10345 HO840M003135245650 A B 0.9411 0.598 0.572
        ARS-BFGL-BAC-10375 HO840M003135245650 A B 0.9348 0.430 0.494f

        ...

Preparation SNP files
---------------------

After processing the raw data, FinalReports.txt, for further analysis
several steps of SNP (Single Nucleotide Polymorphism) file preparation are
necessary.

Data formatting
---------------

The received data often requires formatting to bring it to a standardized form.
The proposed module includes data formatting for the programs blupf90 and
plink - GBLUP, ssGBLUP, GWAS.

blupf90 format
______________
The input data for obtaining the ``snp.txt`` file used for the genomic
blupf90 evaluation is the data file - processed file ``finalreport.txt``

**uga**

.. code-block:: python

    import pandas as pd
    from snplib.format import Snp

    data_finalreport = pd.read_csv("file.txt", sep="\t")

    obj = Snp(fmt="uga")
    obj_snp.process(data_finalreport)
    obj_snp.to_file("./snp.txt")

Data after snp processing in ``uga`` (blupf90) format - obj_snp.data::

      SAMPLE_ID                SNP
    0     14814  02011015010000500
    1     14815  01110152120222512

Default result::

                    SNP_NAME SAMPLE_ID SNP
    0               ABCA12     14814   0
    1   ARS-BFGL-BAC-13031     14814   2
    2   ARS-BFGL-BAC-13039     14814   0
    3   ARS-BFGL-BAC-13049     14814   1
                    ...
    17              ABCA12     14815   0
    18  ARS-BFGL-BAC-13031     14815   1
    19  ARS-BFGL-BAC-13039     14815   1
    20  ARS-BFGL-BAC-13049     14815   1
                    ...

plink format
____________

This page describes specialized PLINK input and output file formats which are
identifiable by file extension. https://www.cog-genomics.org/plink/1.9/formats
Распространненные фомраты для проведения GWAS анализа - ``ped``, ``map``, ``fam``, ``lgen``...

**map** - https://www.cog-genomics.org/plink/1.9/formats#map

.. code-block:: python

    import pandas as pd
    from snplib.format import make_map

    input_data = pd.read_csv(DIR_FILES / "./file_bovinesnp50.csv")
    data_map = make_map(input_data)

Output data view::

        Chr                Name  morgans  MapInfo
         0  BovineHD0100037694        0        0
         0  BovineHD0100037699        0        0
         0  BovineHD0100037703        0        0
         0  BovineHD0100037704        0        0

.. note::
    file_bovinesnp50.csv - The file that is taken on the Illumina website with full
    information about the chip
    https://support.illumina.com/downloads/bovinesnp50-v3-0-product-files.html


**ped** - https://www.cog-genomics.org/plink/1.9/formats#ped

.. code-block:: python

    import pandas as pd
    from snplib.format import make_ped

    input_data = pd.read_csv("file.txt")
    data_ped = make_ped(
        input_data, "SAMPLE_ID", "SNP", fid_col="SAMPLE_ID"
    )

    or

    data_ped = make_ped(
        input_data,
        "SAMPLE_ID",
        "SNP",
        fid_col="FAMILY_ID",
        father_col="father",
        mother_col="mother",
        sex_col="sex"
    )

Input data view::

   SAMPLE_ID          SNP
        1100  A A B B 0 0
        1101  A A B B B B
        1102  A A 0 0 B B
        1103  A A B B B B

    or

   SAMPLE_ID          SNP  FAMILY_ID  father  mother  sex
        1100  A A B B 0 0       1100       1       5    1
        1101  A A B B B B       1101       2       6    2
        1102  A A 0 0 B B       1102       3       7    1
        1103  A A B B B B       1103       4       8    0

Output data view::

    fid   sid father mother sex not_used          snp
   1100  1100      0      0   0        0  A A B B 0 0
   1101  1101      0      0   0        0  A A B B B B
   1102  1102      0      0   0        0  A A 0 0 B B
   1103  1103      0      0   0        0  A A B B B B

    or

    fid   sid father mother sex not_used          snp
   1100  1100      1      5   1        0  A A B B 0 0
   1101  1101      2      6   2        0  A A B B B B
   1102  1102      3      7   1        0  A A 0 0 B B
   1103  1103      4      8   0        0  A A B B B B


**fam** - https://www.cog-genomics.org/plink/1.9/formats#fam

.. code-block:: python

    import pandas as pd
    from snplib.format import make_fam

    input_data = pd.read_csv("file.txt", sep=" ")
    data_fam = make_fam(input_data, "SAMPLE_ID", "SAMPLE_ID")

    or

    make_fam(
        input_data,
        "SAMPLE_ID",
        "FAMILY_ID",
        father_col="father",
        mother_col="mother",
        sex_col="sex",
        pheno_col="pheno"
    )

Input data view::

   SAMPLE_ID  SNP
        1100  025
        1101  022
        1102  052
        1103  022

    or

   SAMPLE_ID  SNP  FAMILY_ID  father  mother  sex  pheno
       1100  025       1100       1       5    1     12
       1101  022       1101       2       6    2     13
       1102  052       1102       3       7    1     14
       1103  022       1103       4       8    0     15

Output data view::

     fid   sid father mother sex pheno
    1100  1100      0      0   0    -9
    1101  1101      0      0   0    -9
    1102  1102      0      0   0    -9
    1103  1103      0      0   0    -9

    or

     fid   sid father mother sex pheno
    1100  1100      1      5   1    12
    1101  1101      2      6   2    13
    1102  1102      3      7   1    14
    1103  1103      4      8   0    15


**lgen** - https://www.cog-genomics.org/plink/1.9/formats#lgen

.. code-block:: python

    import pandas as pd
    from snplib.format import make_lgen

    input_data = pd.read_csv("file.txt", sep=" ")
    data_lgen = make_lgen(
        input_data, "Sample ID", "SNP Name", ["Allele1 - AB", "Allele2 - AB"]
    )

Input data view::

     "SNP Name" "Sample ID" "Allele1 - AB" "Allele2 - AB" "GC Score" "GT Score"
                  ABCA12 107232207 A A 0.4048 0.8164
      ARS-BFGL-BAC-13031 107232207 B B 0.9083 0.8712
      ARS-BFGL-BAC-13039 107232207 A A 0.9005 0.9096
      ARS-BFGL-BAC-13049 107232207 A B 0.9295 0.8926
        ...
                   ABCA12 107237284 A A 0.4048 0.8164
       ARS-BFGL-BAC-13031 107237284 A B 0.9566 0.9257
       ARS-BFGL-BAC-13039 107237284 A B 0.3098 0.8555
       ARS-BFGL-BAC-13049 107237284 A B 0.8613 0.8319
        ...


Output data view::

    fid       sid            snp_name allele1 allele2
     1  107232207              ABCA12       A       A
     1  107232207  ARS-BFGL-BAC-13031       B       B
     1  107232207  ARS-BFGL-BAC-13039       A       A
     1  107232207  ARS-BFGL-BAC-13049       A       B
     1  107232207  ARS-BFGL-BAC-13059       A       B

     ...

     1  107237284              ABCA12       A       A
     1  107237284  ARS-BFGL-BAC-13031       A       B
     1  107237284  ARS-BFGL-BAC-13039       A       B
     1  107237284  ARS-BFGL-BAC-13049       A       B
     1  107237284  ARS-BFGL-BAC-13059       A       A
     ...



Statistics
----------

Poor quality or uninformative SNPs can be excluded from the analysis. This
helps to reduce noise and improve the accuracy of the results.


Call Rate
_________

The call rate for a given SNP is defined as the proportion of
individuals in the study for which the corresponding SNP information is
not missing. In the following example, we filter using a call rate of 95%,
meaning we retain SNPs for which there is less than 5% missing data.

**call rate marker**

Of the say, 54K markers in the chip, 50K have been genotyped for a
particular animal, the “call rate animal” is 50K/54K=93%

in_data::

        SNP_NAME SAMPLE_ID SNP
                ABCA12 1100 0
                 APAF1 1100 2
    ARS-BFGL-BAC-10172 1100 5
                ABCA12 1101 0
                 APAF1 1101 2
    ARS-BFGL-BAC-10172 1101 2
                ABCA12 1102 0
                 APAF1 1102 5
    ARS-BFGL-BAC-10172 1102 2
                ABCA12 1103 0
                 APAF1 1103 2
    ARS-BFGL-BAC-10172 1103 2
                ABCA12 1104 5
                 APAF1 1104 1
    ARS-BFGL-BAC-10172 1104 1
                ABCA12 1105 0
                 APAF1 1105 2
    ARS-BFGL-BAC-10172 1105 5
                ABCA12 1106 0
                 APAF1 1106 1
    ARS-BFGL-BAC-10172 1106 2
                ABCA12 1107 5
                 APAF1 1107 2
    ARS-BFGL-BAC-10172 1107 1
                ABCA12 1108 0
                 APAF1 1108 2
    ARS-BFGL-BAC-10172 1108 2
                ABCA12 1109 0
                 APAF1 1109 2
    ARS-BFGL-BAC-10172 1109 2
                ABCA12 1110 5
                 APAF1 1110 2
    ARS-BFGL-BAC-10172 1110 2

.. code-block:: python

    import pandas as pd
    from snplib.statistics import call_rate

    input_data = pd.read_csv("file.txt", sep=" ")
    result = call_rate(data=input_data, id_col="SNP_NAME", snp_col="SNP")

result::

                 SNP_NAME       SNP
                   ABCA12  0.727273
                    APAF1  0.909091
       ARS-BFGL-BAC-10172  0.818182

**call rate animal**

Of the say, 900 animals genotyped for marker CL635944_160.1, how many
have actually been successfully read? Assume that 600 have been read, then
the “call rate marker” is 600/900 = 67%

in_data::

                  SNP_NAME SAMPLE_ID SNP
                    ABCA12     14814   0
        ARS-BFGL-BAC-13031     14814   2
        ARS-BFGL-BAC-13039     14814   0
        ARS-BFGL-BAC-13049     14814   1
        ARS-BFGL-BAC-13059     14814   1
        ARS-BFGL-BAC-13086     14814   0
        ARS-BFGL-BAC-13093     14814   1
        ARS-BFGL-BAC-13110     14814   5
        ARS-BFGL-BAC-13111     14814   0
        ARS-BFGL-BAC-13113     14814   1
        ARS-BFGL-BAC-15633     14814   0
        ARS-BFGL-BAC-15634     14814   0
        ARS-BFGL-BAC-15637     14814   0
        ARS-BFGL-BAC-15659     14814   0
        ARS-BFGL-BAC-15668     14814   5
        ARS-BFGL-BAC-15708     14814   0
        ARS-BFGL-BAC-15718     14814   0
                    ABCA12     14815   0
        ARS-BFGL-BAC-13031     14815   1
        ARS-BFGL-BAC-13039     14815   1
        ARS-BFGL-BAC-13049     14815   1
        ARS-BFGL-BAC-13059     14815   0
        ARS-BFGL-BAC-13086     14815   1
        ARS-BFGL-BAC-13093     14815   5
        ARS-BFGL-BAC-13110     14815   2
        ARS-BFGL-BAC-13111     14815   1
        ARS-BFGL-BAC-13113     14815   2
        ARS-BFGL-BAC-15633     14815   0
        ARS-BFGL-BAC-15634     14815   2
        ARS-BFGL-BAC-15637     14815   2
        ARS-BFGL-BAC-15659     14815   2
        ARS-BFGL-BAC-15668     14815   5
        ARS-BFGL-BAC-15708     14815   1
        ARS-BFGL-BAC-15718     14815   2

.. code-block:: python

    import pandas as pd
    from snplib.statistics import call_rate

    input_data = pd.read_csv("file.txt", sep=" ")
    result = call_rate(data=data_df, id_col="SAMPLE_ID", snp_col="SNP")

result::

      SAMPLE_ID       SNP
          14814  0.882353
          14815  0.882353


Frequence Allele
________________

The allele frequency represents the incidence of a gene variant in a
population.

**allele freq**

in_data::

        SNP_NAME SAMPLE_ID SNP
                ABCA12 1100 0
                 APAF1 1100 2
    ARS-BFGL-BAC-10172 1100 5
                ABCA12 1101 0
                 APAF1 1101 2
    ARS-BFGL-BAC-10172 1101 2
                ABCA12 1102 0
                 APAF1 1102 5
    ARS-BFGL-BAC-10172 1102 2
                ABCA12 1103 0
                 APAF1 1103 2
    ARS-BFGL-BAC-10172 1103 2
                ABCA12 1104 5
                 APAF1 1104 1
    ARS-BFGL-BAC-10172 1104 1
                ABCA12 1105 0
                 APAF1 1105 2
    ARS-BFGL-BAC-10172 1105 5
                ABCA12 1106 0
                 APAF1 1106 1
    ARS-BFGL-BAC-10172 1106 2
                ABCA12 1107 5
                 APAF1 1107 2
    ARS-BFGL-BAC-10172 1107 1
                ABCA12 1108 0
                 APAF1 1108 2
    ARS-BFGL-BAC-10172 1108 2
                ABCA12 1109 0
                 APAF1 1109 2
    ARS-BFGL-BAC-10172 1109 2
                ABCA12 1110 5
                 APAF1 1110 2
    ARS-BFGL-BAC-10172 1110 2

.. code-block:: python

    import pandas as pd
    from snplib.statistics import allele_freq

    input_data = pd.read_csv("file.txt", sep=" ")
    result = allele_freq(data=input_data, id_col="SNP_NAME", seq_col="SNP")

result::

                 SNP_NAME    SNP
                   ABCA12  0.000
                    APAF1  0.900
       ARS-BFGL-BAC-10172  0.889

The minor allele frequency is therefore the frequency at which the
minor allele occurs within a population.

**maf**

.. code-block:: python

    from snplib.statistics import minor_allele_freq as maf

    result = maf(0.22)  # result == 0.22


HWE (Hardy-Weinberg equilibrium)
________________________________

The Hardy-Weinberg equilibrium is a principle stating that the genetic
variation in a population will remain constant from one generation to the
next in the absence of disturbing factors.
https://www.nature.com/scitable/definition/hardy-weinberg-equilibrium-122/

To test the hypothesis that the data are within the HWE, a statistic a chi2
distribution with 1 degree of freedom:

.. code-block:: python

    from snplib.statistics import hwe_test

    result = hwe_test(seq_snp=pd.Series(list(map(int, "2212120"))), freq=0.714)  # True
    result = hwe_test(seq_snp=pd.Series(list(map(int, "02011015010000500"))), freq=0.2)  # True
    result = hwe_test(seq_snp=pd.Series(list(map(int, "000000000102"))), freq=0.125)  # False


The p-value used here is:

.. code-block:: python

    from snplib.statistics import hwe

    hom1 = 10
    hets = 500
    hom2 = 5000

    result = hwe(hets, hom1, hom2)  # 0.6515718999145375 (p-value)

Once the data have been prepared, statistical analysis to identify associations,
patterns, or relationships between SNPs and the phenotypes or diseases of
interest (GWAS). phenotypes or diseases of interest (GWAS).

Parentage
---------
https://www.icar.org/Documents/GenoEx/ICAR%20Guidelines%20for%20Parentage%20Verification%20and%20Parentage%20Discovery%20based%20on%20SNP.pdf


.. note::
    A list of isag verification and discovery macerators can be found here.
    See Appendix list - https://www.icar.org/Guidelines/04-DNA-Technology.pdf

    List of SNP to be used for either parentage verification or parentage discovery (appendix 11):
    https://www.icar.org/Guidelines/04-DNA-Technology-App-11-SNP-list-for-parentage-verification-or-discovery.pdf


Verification
____________

Verification of paternity according to ICAR recommendations.

input data::

                       SNP_Name      ID41988163  ID10512586
    0                    ABCA12               0           0
    1                     APAF1               2           2
    2        ARS-BFGL-BAC-10172               2           2
    3         ARS-BFGL-BAC-1020               1           1
    4        ARS-BFGL-BAC-10245               1           1
    ..                      ...             ...         ...
    239  Hapmap55441-rs29010990               1           1
    240  Hapmap59876-rs29018046               1           0
    241  Hapmap60017-rs29023471               2           1
    242           UA-IFASA-5034               0           1
    243           UA-IFASA-6532               0           0


.. code-block:: python

    from snplib.parentage import Verification, isag_verif

    input_data = pd.read_csv("file.txt", sep=" ")

    obj_verification = Verification(isag_marks=isag_verif().markers)
    result = obj_verification.check_on(
        data=input_data,
        descendant="ID41988163",
        parent="ID10512586",
        snp_name_col="SNP_Name"
    )

    # Result
    print(obj_verification.num_conflicts)  # 31
    print(obj_verification.status)  # "Excluded"


Discovery
_________

Search for paternity according to ICAR recommendations.

input data::

                   SNP_Name      ID41988163  ID10512586
    0                ABCA12               0           0
    1                 APAF1               2           2
    2    ARS-BFGL-BAC-10172               2           2
    3     ARS-BFGL-BAC-1020               1           1
    4    ARS-BFGL-BAC-10245               1           1
    ..                  ...             ...         ...
    617       UA-IFASA-5034               0           1
    618       UA-IFASA-6154               2           0
    619       UA-IFASA-6532               0           0
    620       UA-IFASA-8658               1           0
    621       UA-IFASA-8833               0           0

.. code-block:: python

    from snplib.parentage import Discovery, isag_disc

    input_data = pd.read_csv("file.txt", sep=" ")

    obj_discovery = Discovery(isag_marks=isag_disc().markers)
    result = obj_discovery.search_parent(
        data=input_data,
        descendant="ID41988163",
        parents="ID10512586",
        snp_name_col="SNP_Name"
    )

    # Result
    print(obj_discovery.num_conflicts)  # 77
    print(obj_discovery.status)  # "Excluded"
    print(obj_discovery.perc_conflicts)  # 14.86 %
