Code Examples
Head file FinalReport
[Header]
GSGT Version 2.0.4
Processing Date 1/16/2023 9:19 AM
Content GGP_HDv3_C.bpm
Num SNPs 139376
Total SNPs 140668
Num Samples 25
Total Samples 26
File 1 of 25
[Data]
SNP Name Sample ID Allele1 - Forward Allele2 - Forward Allele1 - Top Allele2 - Top Allele1 - AB Allele2 - AB GC Score X Y
ARS-BFGL-BAC-10172 HO840M003135245650 G G G G B B 0.9420 0.069 0.801
ARS-BFGL-BAC-1020 HO840M003135245650 G G G G B B 0.9489 0.033 0.700
ARS-BFGL-BAC-10245 HO840M003135245650 C C G G B B 0.7277 0.152 1.504
ARS-BFGL-BAC-10345 HO840M003135245650 A C A C A B 0.9411 0.598 0.572
ARS-BFGL-BAC-10375 HO840M003135245650 A G A G A B 0.9348 0.430 0.494
...
Processing the Finalreport.txt
from snplib.finalreport import FinalReport
obj_report = FinalReport()
obj_report.handle("path/to/finalreport.txt")
header = obj_report.header
data = obj_report.snp_data
Output Dataframe:
SNP Name Sample ID Allele1 - Forward Allele2 - Forward Allele1 - Top Allele2 - Top Allele1 - AB Allele2 - AB GC Score X Y
ARS-BFGL-BAC-10172 HO840M003135245650 G G G G B B 0.9420 0.069 0.801
ARS-BFGL-BAC-1020 HO840M003135245650 G G G G B B 0.9489 0.033 0.700
ARS-BFGL-BAC-10245 HO840M003135245650 C C G G B B 0.7277 0.152 1.504
ARS-BFGL-BAC-10345 HO840M003135245650 A C A C A B 0.9411 0.598 0.572
ARS-BFGL-BAC-10375 HO840M003135245650 A G A G A B 0.9348 0.430 0.494
Select alleles - AB or Top or Forward
alleles_ab = FinalReport(allele="AB")
alleles_ab.handle("path/to/finalreport.txt")
data_ab = alleles_ab.snp_data
alleles_top = FinalReport(allele="Top")
alleles_top.handle("path/to/finalreport.txt")
data_top = alleles_top.snp_data
alleles_forward = FinalReport(allele="Forward")
alleles_forward.handle("path/to/finalreport.txt")
data_forward = alleles_forward.snp_data
Output:
SNP Name Sample ID Allele1 - AB Allele2 - AB GC Score X Y
ARS-BFGL-BAC-10172 HO840M003135245650 B B 0.9420 0.069 0.801
ARS-BFGL-BAC-1020 HO840M003135245650 B B 0.9489 0.033 0.700
ARS-BFGL-BAC-10245 HO840M003135245650 B B 0.7277 0.152 1.504
ARS-BFGL-BAC-10345 HO840M003135245650 A B 0.9411 0.598 0.572
ARS-BFGL-BAC-10375 HO840M003135245650 A B 0.9348 0.430 0.494f
...
Preparation SNP files
After processing the raw data, FinalReports.txt, for further analysis several steps of SNP (Single Nucleotide Polymorphism) file preparation are necessary.
Data formatting
The received data often requires formatting to bring it to a standardized form. The proposed module includes data formatting for the programs blupf90 and plink - GBLUP, ssGBLUP, GWAS.
blupf90 format
The input data for obtaining the snp.txt file used for the genomic
blupf90 evaluation is the data file - processed file finalreport.txt
uga
import pandas as pd
from snplib.format import Snp
data_finalreport = pd.read_csv("file.txt", sep="\t")
obj = Snp(fmt="uga")
obj_snp.process(data_finalreport)
obj_snp.to_file("./snp.txt")
Data after snp processing in uga (blupf90) format - obj_snp.data:
SAMPLE_ID SNP
0 14814 02011015010000500
1 14815 01110152120222512
Default result:
SNP_NAME SAMPLE_ID SNP
0 ABCA12 14814 0
1 ARS-BFGL-BAC-13031 14814 2
2 ARS-BFGL-BAC-13039 14814 0
3 ARS-BFGL-BAC-13049 14814 1
...
17 ABCA12 14815 0
18 ARS-BFGL-BAC-13031 14815 1
19 ARS-BFGL-BAC-13039 14815 1
20 ARS-BFGL-BAC-13049 14815 1
...
plink format
This page describes specialized PLINK input and output file formats which are
identifiable by file extension. https://www.cog-genomics.org/plink/1.9/formats
Распространненные фомраты для проведения GWAS анализа - ped, map, fam, lgen…
map - https://www.cog-genomics.org/plink/1.9/formats#map
import pandas as pd
from snplib.format import make_map
input_data = pd.read_csv(DIR_FILES / "./file_bovinesnp50.csv")
data_map = make_map(input_data)
Output data view:
Chr Name morgans MapInfo
0 BovineHD0100037694 0 0
0 BovineHD0100037699 0 0
0 BovineHD0100037703 0 0
0 BovineHD0100037704 0 0
Note
file_bovinesnp50.csv - The file that is taken on the Illumina website with full information about the chip https://support.illumina.com/downloads/bovinesnp50-v3-0-product-files.html
ped - https://www.cog-genomics.org/plink/1.9/formats#ped
import pandas as pd
from snplib.format import make_ped
input_data = pd.read_csv("file.txt")
data_ped = make_ped(
input_data, "SAMPLE_ID", "SNP", fid_col="SAMPLE_ID"
)
or
data_ped = make_ped(
input_data,
"SAMPLE_ID",
"SNP",
fid_col="FAMILY_ID",
father_col="father",
mother_col="mother",
sex_col="sex"
)
Input data view:
SAMPLE_ID SNP
1100 A A B B 0 0
1101 A A B B B B
1102 A A 0 0 B B
1103 A A B B B B
or
SAMPLE_ID SNP FAMILY_ID father mother sex
1100 A A B B 0 0 1100 1 5 1
1101 A A B B B B 1101 2 6 2
1102 A A 0 0 B B 1102 3 7 1
1103 A A B B B B 1103 4 8 0
Output data view:
fid sid father mother sex not_used snp
1100 1100 0 0 0 0 A A B B 0 0
1101 1101 0 0 0 0 A A B B B B
1102 1102 0 0 0 0 A A 0 0 B B
1103 1103 0 0 0 0 A A B B B B
or
fid sid father mother sex not_used snp
1100 1100 1 5 1 0 A A B B 0 0
1101 1101 2 6 2 0 A A B B B B
1102 1102 3 7 1 0 A A 0 0 B B
1103 1103 4 8 0 0 A A B B B B
fam - https://www.cog-genomics.org/plink/1.9/formats#fam
import pandas as pd
from snplib.format import make_fam
input_data = pd.read_csv("file.txt", sep=" ")
data_fam = make_fam(input_data, "SAMPLE_ID", "SAMPLE_ID")
or
make_fam(
input_data,
"SAMPLE_ID",
"FAMILY_ID",
father_col="father",
mother_col="mother",
sex_col="sex",
pheno_col="pheno"
)
Input data view:
SAMPLE_ID SNP
1100 025
1101 022
1102 052
1103 022
or
SAMPLE_ID SNP FAMILY_ID father mother sex pheno
1100 025 1100 1 5 1 12
1101 022 1101 2 6 2 13
1102 052 1102 3 7 1 14
1103 022 1103 4 8 0 15
Output data view:
fid sid father mother sex pheno
1100 1100 0 0 0 -9
1101 1101 0 0 0 -9
1102 1102 0 0 0 -9
1103 1103 0 0 0 -9
or
fid sid father mother sex pheno
1100 1100 1 5 1 12
1101 1101 2 6 2 13
1102 1102 3 7 1 14
1103 1103 4 8 0 15
lgen - https://www.cog-genomics.org/plink/1.9/formats#lgen
import pandas as pd
from snplib.format import make_lgen
input_data = pd.read_csv("file.txt", sep=" ")
data_lgen = make_lgen(
input_data, "Sample ID", "SNP Name", ["Allele1 - AB", "Allele2 - AB"]
)
Input data view:
"SNP Name" "Sample ID" "Allele1 - AB" "Allele2 - AB" "GC Score" "GT Score"
ABCA12 107232207 A A 0.4048 0.8164
ARS-BFGL-BAC-13031 107232207 B B 0.9083 0.8712
ARS-BFGL-BAC-13039 107232207 A A 0.9005 0.9096
ARS-BFGL-BAC-13049 107232207 A B 0.9295 0.8926
...
ABCA12 107237284 A A 0.4048 0.8164
ARS-BFGL-BAC-13031 107237284 A B 0.9566 0.9257
ARS-BFGL-BAC-13039 107237284 A B 0.3098 0.8555
ARS-BFGL-BAC-13049 107237284 A B 0.8613 0.8319
...
Output data view:
fid sid snp_name allele1 allele2
1 107232207 ABCA12 A A
1 107232207 ARS-BFGL-BAC-13031 B B
1 107232207 ARS-BFGL-BAC-13039 A A
1 107232207 ARS-BFGL-BAC-13049 A B
1 107232207 ARS-BFGL-BAC-13059 A B
...
1 107237284 ABCA12 A A
1 107237284 ARS-BFGL-BAC-13031 A B
1 107237284 ARS-BFGL-BAC-13039 A B
1 107237284 ARS-BFGL-BAC-13049 A B
1 107237284 ARS-BFGL-BAC-13059 A A
...
Statistics
Poor quality or uninformative SNPs can be excluded from the analysis. This helps to reduce noise and improve the accuracy of the results.
Call Rate
The call rate for a given SNP is defined as the proportion of individuals in the study for which the corresponding SNP information is not missing. In the following example, we filter using a call rate of 95%, meaning we retain SNPs for which there is less than 5% missing data.
call rate marker
Of the say, 54K markers in the chip, 50K have been genotyped for a particular animal, the “call rate animal” is 50K/54K=93%
in_data:
SNP_NAME SAMPLE_ID SNP
ABCA12 1100 0
APAF1 1100 2
ARS-BFGL-BAC-10172 1100 5
ABCA12 1101 0
APAF1 1101 2
ARS-BFGL-BAC-10172 1101 2
ABCA12 1102 0
APAF1 1102 5
ARS-BFGL-BAC-10172 1102 2
ABCA12 1103 0
APAF1 1103 2
ARS-BFGL-BAC-10172 1103 2
ABCA12 1104 5
APAF1 1104 1
ARS-BFGL-BAC-10172 1104 1
ABCA12 1105 0
APAF1 1105 2
ARS-BFGL-BAC-10172 1105 5
ABCA12 1106 0
APAF1 1106 1
ARS-BFGL-BAC-10172 1106 2
ABCA12 1107 5
APAF1 1107 2
ARS-BFGL-BAC-10172 1107 1
ABCA12 1108 0
APAF1 1108 2
ARS-BFGL-BAC-10172 1108 2
ABCA12 1109 0
APAF1 1109 2
ARS-BFGL-BAC-10172 1109 2
ABCA12 1110 5
APAF1 1110 2
ARS-BFGL-BAC-10172 1110 2
import pandas as pd
from snplib.statistics import call_rate
input_data = pd.read_csv("file.txt", sep=" ")
result = call_rate(data=input_data, id_col="SNP_NAME", snp_col="SNP")
result:
SNP_NAME SNP
ABCA12 0.727273
APAF1 0.909091
ARS-BFGL-BAC-10172 0.818182
call rate animal
Of the say, 900 animals genotyped for marker CL635944_160.1, how many have actually been successfully read? Assume that 600 have been read, then the “call rate marker” is 600/900 = 67%
in_data:
SNP_NAME SAMPLE_ID SNP
ABCA12 14814 0
ARS-BFGL-BAC-13031 14814 2
ARS-BFGL-BAC-13039 14814 0
ARS-BFGL-BAC-13049 14814 1
ARS-BFGL-BAC-13059 14814 1
ARS-BFGL-BAC-13086 14814 0
ARS-BFGL-BAC-13093 14814 1
ARS-BFGL-BAC-13110 14814 5
ARS-BFGL-BAC-13111 14814 0
ARS-BFGL-BAC-13113 14814 1
ARS-BFGL-BAC-15633 14814 0
ARS-BFGL-BAC-15634 14814 0
ARS-BFGL-BAC-15637 14814 0
ARS-BFGL-BAC-15659 14814 0
ARS-BFGL-BAC-15668 14814 5
ARS-BFGL-BAC-15708 14814 0
ARS-BFGL-BAC-15718 14814 0
ABCA12 14815 0
ARS-BFGL-BAC-13031 14815 1
ARS-BFGL-BAC-13039 14815 1
ARS-BFGL-BAC-13049 14815 1
ARS-BFGL-BAC-13059 14815 0
ARS-BFGL-BAC-13086 14815 1
ARS-BFGL-BAC-13093 14815 5
ARS-BFGL-BAC-13110 14815 2
ARS-BFGL-BAC-13111 14815 1
ARS-BFGL-BAC-13113 14815 2
ARS-BFGL-BAC-15633 14815 0
ARS-BFGL-BAC-15634 14815 2
ARS-BFGL-BAC-15637 14815 2
ARS-BFGL-BAC-15659 14815 2
ARS-BFGL-BAC-15668 14815 5
ARS-BFGL-BAC-15708 14815 1
ARS-BFGL-BAC-15718 14815 2
import pandas as pd
from snplib.statistics import call_rate
input_data = pd.read_csv("file.txt", sep=" ")
result = call_rate(data=data_df, id_col="SAMPLE_ID", snp_col="SNP")
result:
SAMPLE_ID SNP
14814 0.882353
14815 0.882353
Frequence Allele
The allele frequency represents the incidence of a gene variant in a population.
allele freq
in_data:
SNP_NAME SAMPLE_ID SNP
ABCA12 1100 0
APAF1 1100 2
ARS-BFGL-BAC-10172 1100 5
ABCA12 1101 0
APAF1 1101 2
ARS-BFGL-BAC-10172 1101 2
ABCA12 1102 0
APAF1 1102 5
ARS-BFGL-BAC-10172 1102 2
ABCA12 1103 0
APAF1 1103 2
ARS-BFGL-BAC-10172 1103 2
ABCA12 1104 5
APAF1 1104 1
ARS-BFGL-BAC-10172 1104 1
ABCA12 1105 0
APAF1 1105 2
ARS-BFGL-BAC-10172 1105 5
ABCA12 1106 0
APAF1 1106 1
ARS-BFGL-BAC-10172 1106 2
ABCA12 1107 5
APAF1 1107 2
ARS-BFGL-BAC-10172 1107 1
ABCA12 1108 0
APAF1 1108 2
ARS-BFGL-BAC-10172 1108 2
ABCA12 1109 0
APAF1 1109 2
ARS-BFGL-BAC-10172 1109 2
ABCA12 1110 5
APAF1 1110 2
ARS-BFGL-BAC-10172 1110 2
import pandas as pd
from snplib.statistics import allele_freq
input_data = pd.read_csv("file.txt", sep=" ")
result = allele_freq(data=input_data, id_col="SNP_NAME", seq_col="SNP")
result:
SNP_NAME SNP
ABCA12 0.000
APAF1 0.900
ARS-BFGL-BAC-10172 0.889
The minor allele frequency is therefore the frequency at which the minor allele occurs within a population.
maf
from snplib.statistics import minor_allele_freq as maf
result = maf(0.22) # result == 0.22
HWE (Hardy-Weinberg equilibrium)
The Hardy-Weinberg equilibrium is a principle stating that the genetic variation in a population will remain constant from one generation to the next in the absence of disturbing factors. https://www.nature.com/scitable/definition/hardy-weinberg-equilibrium-122/
To test the hypothesis that the data are within the HWE, a statistic a chi2 distribution with 1 degree of freedom:
from snplib.statistics import hwe_test
result = hwe_test(seq_snp=pd.Series(list(map(int, "2212120"))), freq=0.714) # True
result = hwe_test(seq_snp=pd.Series(list(map(int, "02011015010000500"))), freq=0.2) # True
result = hwe_test(seq_snp=pd.Series(list(map(int, "000000000102"))), freq=0.125) # False
The p-value used here is:
from snplib.statistics import hwe
hom1 = 10
hets = 500
hom2 = 5000
result = hwe(hets, hom1, hom2) # 0.6515718999145375 (p-value)
Once the data have been prepared, statistical analysis to identify associations, patterns, or relationships between SNPs and the phenotypes or diseases of interest (GWAS). phenotypes or diseases of interest (GWAS).
Parentage
Note
A list of isag verification and discovery macerators can be found here. See Appendix list - https://www.icar.org/Guidelines/04-DNA-Technology.pdf
List of SNP to be used for either parentage verification or parentage discovery (appendix 11): https://www.icar.org/Guidelines/04-DNA-Technology-App-11-SNP-list-for-parentage-verification-or-discovery.pdf
Verification
Verification of paternity according to ICAR recommendations.
input data:
SNP_Name ID41988163 ID10512586
0 ABCA12 0 0
1 APAF1 2 2
2 ARS-BFGL-BAC-10172 2 2
3 ARS-BFGL-BAC-1020 1 1
4 ARS-BFGL-BAC-10245 1 1
.. ... ... ...
239 Hapmap55441-rs29010990 1 1
240 Hapmap59876-rs29018046 1 0
241 Hapmap60017-rs29023471 2 1
242 UA-IFASA-5034 0 1
243 UA-IFASA-6532 0 0
from snplib.parentage import Verification, isag_verif
input_data = pd.read_csv("file.txt", sep=" ")
obj_verification = Verification(isag_marks=isag_verif().markers)
result = obj_verification.check_on(
data=input_data,
descendant="ID41988163",
parent="ID10512586",
snp_name_col="SNP_Name"
)
# Result
print(obj_verification.num_conflicts) # 31
print(obj_verification.status) # "Excluded"
Discovery
Search for paternity according to ICAR recommendations.
input data:
SNP_Name ID41988163 ID10512586
0 ABCA12 0 0
1 APAF1 2 2
2 ARS-BFGL-BAC-10172 2 2
3 ARS-BFGL-BAC-1020 1 1
4 ARS-BFGL-BAC-10245 1 1
.. ... ... ...
617 UA-IFASA-5034 0 1
618 UA-IFASA-6154 2 0
619 UA-IFASA-6532 0 0
620 UA-IFASA-8658 1 0
621 UA-IFASA-8833 0 0
from snplib.parentage import Discovery, isag_disc
input_data = pd.read_csv("file.txt", sep=" ")
obj_discovery = Discovery(isag_marks=isag_disc().markers)
result = obj_discovery.search_parent(
data=input_data,
descendant="ID41988163",
parents="ID10512586",
snp_name_col="SNP_Name"
)
# Result
print(obj_discovery.num_conflicts) # 77
print(obj_discovery.status) # "Excluded"
print(obj_discovery.perc_conflicts) # 14.86 %