Code Examples

Head file FinalReport

[Header]
GSGT Version        2.0.4
Processing Date     1/16/2023 9:19 AM
Content             GGP_HDv3_C.bpm
Num SNPs    139376
Total SNPs  140668
Num Samples 25
Total Samples       26
File        1 of 25
[Data]
SNP Name    Sample ID       Allele1 - Forward       Allele2 - Forward       Allele1 - Top   Allele2 - Top   Allele1 - AB    Allele2 - AB    GC Score        X       Y
ARS-BFGL-BAC-10172  HO840M003135245650      G       G       G       G       B       B       0.9420  0.069   0.801
ARS-BFGL-BAC-1020   HO840M003135245650      G       G       G       G       B       B       0.9489  0.033   0.700
ARS-BFGL-BAC-10245  HO840M003135245650      C       C       G       G       B       B       0.7277  0.152   1.504
ARS-BFGL-BAC-10345  HO840M003135245650      A       C       A       C       A       B       0.9411  0.598   0.572
ARS-BFGL-BAC-10375  HO840M003135245650      A       G       A       G       A       B       0.9348  0.430   0.494
...

Processing the Finalreport.txt

from snplib.finalreport import FinalReport

obj_report = FinalReport()
obj_report.handle("path/to/finalreport.txt")

header = obj_report.header
data = obj_report.snp_data

Output Dataframe:

SNP Name    Sample ID       Allele1 - Forward       Allele2 - Forward       Allele1 - Top   Allele2 - Top   Allele1 - AB    Allele2 - AB    GC Score        X       Y
ARS-BFGL-BAC-10172  HO840M003135245650      G       G       G       G       B       B       0.9420  0.069   0.801
ARS-BFGL-BAC-1020   HO840M003135245650      G       G       G       G       B       B       0.9489  0.033   0.700
ARS-BFGL-BAC-10245  HO840M003135245650      C       C       G       G       B       B       0.7277  0.152   1.504
ARS-BFGL-BAC-10345  HO840M003135245650      A       C       A       C       A       B       0.9411  0.598   0.572
ARS-BFGL-BAC-10375  HO840M003135245650      A       G       A       G       A       B       0.9348  0.430   0.494

Select alleles - AB or Top or Forward

alleles_ab = FinalReport(allele="AB")
alleles_ab.handle("path/to/finalreport.txt")
data_ab = alleles_ab.snp_data

alleles_top = FinalReport(allele="Top")
alleles_top.handle("path/to/finalreport.txt")
data_top = alleles_top.snp_data

alleles_forward = FinalReport(allele="Forward")
alleles_forward.handle("path/to/finalreport.txt")
data_forward = alleles_forward.snp_data

Output:

SNP Name    Sample ID    Allele1 - AB  Allele2 - AB     GC Score    X Y
ARS-BFGL-BAC-10172 HO840M003135245650 B B 0.9420 0.069 0.801
ARS-BFGL-BAC-1020 HO840M003135245650 B B 0.9489 0.033 0.700
ARS-BFGL-BAC-10245 HO840M003135245650 B B 0.7277 0.152 1.504
ARS-BFGL-BAC-10345 HO840M003135245650 A B 0.9411 0.598 0.572
ARS-BFGL-BAC-10375 HO840M003135245650 A B 0.9348 0.430 0.494f

...

Preparation SNP files

After processing the raw data, FinalReports.txt, for further analysis several steps of SNP (Single Nucleotide Polymorphism) file preparation are necessary.

Data formatting

The received data often requires formatting to bring it to a standardized form. The proposed module includes data formatting for the programs blupf90 and plink - GBLUP, ssGBLUP, GWAS.

blupf90 format

The input data for obtaining the snp.txt file used for the genomic blupf90 evaluation is the data file - processed file finalreport.txt

uga

import pandas as pd
from snplib.format import Snp

data_finalreport = pd.read_csv("file.txt", sep="\t")

obj = Snp(fmt="uga")
obj_snp.process(data_finalreport)
obj_snp.to_file("./snp.txt")

Data after snp processing in uga (blupf90) format - obj_snp.data:

  SAMPLE_ID                SNP
0     14814  02011015010000500
1     14815  01110152120222512

Default result:

                SNP_NAME SAMPLE_ID SNP
0               ABCA12     14814   0
1   ARS-BFGL-BAC-13031     14814   2
2   ARS-BFGL-BAC-13039     14814   0
3   ARS-BFGL-BAC-13049     14814   1
                ...
17              ABCA12     14815   0
18  ARS-BFGL-BAC-13031     14815   1
19  ARS-BFGL-BAC-13039     14815   1
20  ARS-BFGL-BAC-13049     14815   1
                ...

Statistics

Poor quality or uninformative SNPs can be excluded from the analysis. This helps to reduce noise and improve the accuracy of the results.

Call Rate

The call rate for a given SNP is defined as the proportion of individuals in the study for which the corresponding SNP information is not missing. In the following example, we filter using a call rate of 95%, meaning we retain SNPs for which there is less than 5% missing data.

call rate marker

Of the say, 54K markers in the chip, 50K have been genotyped for a particular animal, the “call rate animal” is 50K/54K=93%

in_data:

    SNP_NAME SAMPLE_ID SNP
            ABCA12 1100 0
             APAF1 1100 2
ARS-BFGL-BAC-10172 1100 5
            ABCA12 1101 0
             APAF1 1101 2
ARS-BFGL-BAC-10172 1101 2
            ABCA12 1102 0
             APAF1 1102 5
ARS-BFGL-BAC-10172 1102 2
            ABCA12 1103 0
             APAF1 1103 2
ARS-BFGL-BAC-10172 1103 2
            ABCA12 1104 5
             APAF1 1104 1
ARS-BFGL-BAC-10172 1104 1
            ABCA12 1105 0
             APAF1 1105 2
ARS-BFGL-BAC-10172 1105 5
            ABCA12 1106 0
             APAF1 1106 1
ARS-BFGL-BAC-10172 1106 2
            ABCA12 1107 5
             APAF1 1107 2
ARS-BFGL-BAC-10172 1107 1
            ABCA12 1108 0
             APAF1 1108 2
ARS-BFGL-BAC-10172 1108 2
            ABCA12 1109 0
             APAF1 1109 2
ARS-BFGL-BAC-10172 1109 2
            ABCA12 1110 5
             APAF1 1110 2
ARS-BFGL-BAC-10172 1110 2
import pandas as pd
from snplib.statistics import call_rate

input_data = pd.read_csv("file.txt", sep=" ")
result = call_rate(data=input_data, id_col="SNP_NAME", snp_col="SNP")

result:

          SNP_NAME       SNP
            ABCA12  0.727273
             APAF1  0.909091
ARS-BFGL-BAC-10172  0.818182

call rate animal

Of the say, 900 animals genotyped for marker CL635944_160.1, how many have actually been successfully read? Assume that 600 have been read, then the “call rate marker” is 600/900 = 67%

in_data:

          SNP_NAME SAMPLE_ID SNP
            ABCA12     14814   0
ARS-BFGL-BAC-13031     14814   2
ARS-BFGL-BAC-13039     14814   0
ARS-BFGL-BAC-13049     14814   1
ARS-BFGL-BAC-13059     14814   1
ARS-BFGL-BAC-13086     14814   0
ARS-BFGL-BAC-13093     14814   1
ARS-BFGL-BAC-13110     14814   5
ARS-BFGL-BAC-13111     14814   0
ARS-BFGL-BAC-13113     14814   1
ARS-BFGL-BAC-15633     14814   0
ARS-BFGL-BAC-15634     14814   0
ARS-BFGL-BAC-15637     14814   0
ARS-BFGL-BAC-15659     14814   0
ARS-BFGL-BAC-15668     14814   5
ARS-BFGL-BAC-15708     14814   0
ARS-BFGL-BAC-15718     14814   0
            ABCA12     14815   0
ARS-BFGL-BAC-13031     14815   1
ARS-BFGL-BAC-13039     14815   1
ARS-BFGL-BAC-13049     14815   1
ARS-BFGL-BAC-13059     14815   0
ARS-BFGL-BAC-13086     14815   1
ARS-BFGL-BAC-13093     14815   5
ARS-BFGL-BAC-13110     14815   2
ARS-BFGL-BAC-13111     14815   1
ARS-BFGL-BAC-13113     14815   2
ARS-BFGL-BAC-15633     14815   0
ARS-BFGL-BAC-15634     14815   2
ARS-BFGL-BAC-15637     14815   2
ARS-BFGL-BAC-15659     14815   2
ARS-BFGL-BAC-15668     14815   5
ARS-BFGL-BAC-15708     14815   1
ARS-BFGL-BAC-15718     14815   2
import pandas as pd
from snplib.statistics import call_rate

input_data = pd.read_csv("file.txt", sep=" ")
result = call_rate(data=data_df, id_col="SAMPLE_ID", snp_col="SNP")

result:

SAMPLE_ID       SNP
    14814  0.882353
    14815  0.882353

Frequence Allele

The allele frequency represents the incidence of a gene variant in a population.

allele freq

in_data:

    SNP_NAME SAMPLE_ID SNP
            ABCA12 1100 0
             APAF1 1100 2
ARS-BFGL-BAC-10172 1100 5
            ABCA12 1101 0
             APAF1 1101 2
ARS-BFGL-BAC-10172 1101 2
            ABCA12 1102 0
             APAF1 1102 5
ARS-BFGL-BAC-10172 1102 2
            ABCA12 1103 0
             APAF1 1103 2
ARS-BFGL-BAC-10172 1103 2
            ABCA12 1104 5
             APAF1 1104 1
ARS-BFGL-BAC-10172 1104 1
            ABCA12 1105 0
             APAF1 1105 2
ARS-BFGL-BAC-10172 1105 5
            ABCA12 1106 0
             APAF1 1106 1
ARS-BFGL-BAC-10172 1106 2
            ABCA12 1107 5
             APAF1 1107 2
ARS-BFGL-BAC-10172 1107 1
            ABCA12 1108 0
             APAF1 1108 2
ARS-BFGL-BAC-10172 1108 2
            ABCA12 1109 0
             APAF1 1109 2
ARS-BFGL-BAC-10172 1109 2
            ABCA12 1110 5
             APAF1 1110 2
ARS-BFGL-BAC-10172 1110 2
import pandas as pd
from snplib.statistics import allele_freq

input_data = pd.read_csv("file.txt", sep=" ")
result = allele_freq(data=input_data, id_col="SNP_NAME", seq_col="SNP")

result:

          SNP_NAME    SNP
            ABCA12  0.000
             APAF1  0.900
ARS-BFGL-BAC-10172  0.889

The minor allele frequency is therefore the frequency at which the minor allele occurs within a population.

maf

from snplib.statistics import minor_allele_freq as maf

result = maf(0.22)  # result == 0.22

HWE (Hardy-Weinberg equilibrium)

The Hardy-Weinberg equilibrium is a principle stating that the genetic variation in a population will remain constant from one generation to the next in the absence of disturbing factors. https://www.nature.com/scitable/definition/hardy-weinberg-equilibrium-122/

To test the hypothesis that the data are within the HWE, a statistic a chi2 distribution with 1 degree of freedom:

from snplib.statistics import hwe_test

result = hwe_test(seq_snp=pd.Series(list(map(int, "2212120"))), freq=0.714)  # True
result = hwe_test(seq_snp=pd.Series(list(map(int, "02011015010000500"))), freq=0.2)  # True
result = hwe_test(seq_snp=pd.Series(list(map(int, "000000000102"))), freq=0.125)  # False

The p-value used here is:

from snplib.statistics import hwe

hom1 = 10
hets = 500
hom2 = 5000

result = hwe(hets, hom1, hom2)  # 0.6515718999145375 (p-value)

Once the data have been prepared, statistical analysis to identify associations, patterns, or relationships between SNPs and the phenotypes or diseases of interest (GWAS). phenotypes or diseases of interest (GWAS).

Parentage

https://www.icar.org/Documents/GenoEx/ICAR%20Guidelines%20for%20Parentage%20Verification%20and%20Parentage%20Discovery%20based%20on%20SNP.pdf

Note

A list of isag verification and discovery macerators can be found here. See Appendix list - https://www.icar.org/Guidelines/04-DNA-Technology.pdf

List of SNP to be used for either parentage verification or parentage discovery (appendix 11): https://www.icar.org/Guidelines/04-DNA-Technology-App-11-SNP-list-for-parentage-verification-or-discovery.pdf

Verification

Verification of paternity according to ICAR recommendations.

input data:

                   SNP_Name      ID41988163  ID10512586
0                    ABCA12               0           0
1                     APAF1               2           2
2        ARS-BFGL-BAC-10172               2           2
3         ARS-BFGL-BAC-1020               1           1
4        ARS-BFGL-BAC-10245               1           1
..                      ...             ...         ...
239  Hapmap55441-rs29010990               1           1
240  Hapmap59876-rs29018046               1           0
241  Hapmap60017-rs29023471               2           1
242           UA-IFASA-5034               0           1
243           UA-IFASA-6532               0           0
from snplib.parentage import Verification, isag_verif

input_data = pd.read_csv("file.txt", sep=" ")

obj_verification = Verification(isag_marks=isag_verif().markers)
result = obj_verification.check_on(
    data=input_data,
    descendant="ID41988163",
    parent="ID10512586",
    snp_name_col="SNP_Name"
)

# Result
print(obj_verification.num_conflicts)  # 31
print(obj_verification.status)  # "Excluded"

Discovery

Search for paternity according to ICAR recommendations.

input data:

               SNP_Name      ID41988163  ID10512586
0                ABCA12               0           0
1                 APAF1               2           2
2    ARS-BFGL-BAC-10172               2           2
3     ARS-BFGL-BAC-1020               1           1
4    ARS-BFGL-BAC-10245               1           1
..                  ...             ...         ...
617       UA-IFASA-5034               0           1
618       UA-IFASA-6154               2           0
619       UA-IFASA-6532               0           0
620       UA-IFASA-8658               1           0
621       UA-IFASA-8833               0           0
from snplib.parentage import Discovery, isag_disc

input_data = pd.read_csv("file.txt", sep=" ")

obj_discovery = Discovery(isag_marks=isag_disc().markers)
result = obj_discovery.search_parent(
    data=input_data,
    descendant="ID41988163",
    parents="ID10512586",
    snp_name_col="SNP_Name"
)

# Result
print(obj_discovery.num_conflicts)  # 77
print(obj_discovery.status)  # "Excluded"
print(obj_discovery.perc_conflicts)  # 14.86 %