Format package

This module recodes SNP data into the encoding required for calculations with the BLUPF90 programs, and formats and prepares data files for PLINK (GWAS).

class snplib.format.Snp(fmt: str | None = 'uga')[source]

Bases: object

Converts genomic map data from a FinalReport.txt file obtained from Illumina. Allele data are recoded into quantitative data and saved in the format required for calculating GBLUP with blupf90.

Parameters:

fmt – Output format of the SNP data for use in PLINK and blupf90. Default value “uga”.

_ALLELE_CODE = {'--': 5, 'AA': 0, 'AB': 1, 'BA': 1, 'BB': 2}
_FIELDS = ['SNP_NAME', 'SAMPLE_ID', 'SNP']
_F_DTYPE = {'SAMPLE_ID': <class 'str'>, 'SNP': <class 'str'>, 'SNP_NAME': <class 'str'>}
static _add_space(value: str, max_len: int) str[source]

Pads a sample_id value with spaces up to the maximum length of the sample_id values in the data.

Parameters:
  • value – sample_id value

  • max_len – Maximum length of the sample_id values

Returns:

The value padded with spaces
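A minimal sketch of this padding step, assuming the spaces are appended on the right (the source does not say which side is padded). The function `add_space` below is a stand-in written for illustration, not the library's own code, and the sample IDs are made up.

```python
def add_space(value: str, max_len: int) -> str:
    """Pad value with trailing spaces up to max_len characters (assumed side)."""
    return value.ljust(max_len)


# Illustrative sample IDs of different lengths (not real data).
sample_ids = ["14814", "7", "903"]
max_len = max(len(sid) for sid in sample_ids)

# After padding, every ID occupies the same number of characters.
padded = [add_space(sid, max_len) for sid in sample_ids]
```

Equal-width IDs keep the genotype string that follows each ID starting in the same column on every line.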

static _format_uga(data: DataFrame) DataFrame[source]

Formats SNP data for use in PLINK and blupf90.

property data: DataFrame | None
process(data: DataFrame) None[source]

Processes and formats the data and calculates statistical information.

Parameters:

data – Data from FinalReport file. Example:

    SNP Name            Sample ID  Allele1 - AB  Allele2 - AB  GC Score  GT Score
    ABCA12              14814      A             A             0.4048    0.8164
    ARS-BFGL-BAC-13031  14814      B             B             0.9083    0.8712
    ARS-BFGL-BAC-13039  14814      A             A             0.9005    0.9096
    ARS-BFGL-BAC-13049  14814      A             B             0.9295    0.8926

Returns:

Returns true if the data was formatted successfully and the statistical information was calculated, false if an error occurred.
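The recoding described above can be sketched in plain Python. This is not the library's implementation: it assumes that the “uga” layout is one line per sample, made of the padded sample ID followed by a single string of genotype codes, and the helper `to_uga_lines` is hypothetical. The code table is copied from `_ALLELE_CODE`, and the rows are the four example rows from the `data` parameter.

```python
from collections import defaultdict

# Genotype-to-code table as listed in the _ALLELE_CODE attribute.
ALLELE_CODE = {"--": 5, "AA": 0, "AB": 1, "BA": 1, "BB": 2}

# (SNP Name, Sample ID, Allele1 - AB, Allele2 - AB) from the example above.
rows = [
    ("ABCA12", "14814", "A", "A"),
    ("ARS-BFGL-BAC-13031", "14814", "B", "B"),
    ("ARS-BFGL-BAC-13039", "14814", "A", "A"),
    ("ARS-BFGL-BAC-13049", "14814", "A", "B"),
]


def to_uga_lines(rows):
    """Hypothetical helper: one line per sample, ID then concatenated codes."""
    per_sample = defaultdict(list)
    for _snp_name, sample_id, allele1, allele2 in rows:
        # Join the two allele calls into a genotype and look up its code.
        per_sample[sample_id].append(str(ALLELE_CODE[allele1 + allele2]))
    # Pad IDs to a common width (right padding is an assumption).
    width = max(len(sid) for sid in per_sample)
    return [f"{sid.ljust(width)} {''.join(codes)}" for sid, codes in per_sample.items()]


lines = to_uga_lines(rows)
```

With the example rows, the genotypes AA, BB, AA and AB become the codes 0, 2, 0 and 1 for sample 14814.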

to_file(file_path: str | Path) None[source]

Saving data to a file.

Parameters:

file_path – Path to file

snplib.format.make_fam(data: DataFrame, sid_col: str, fid_col: str = None, father_col: str = None, mother_col: str = None, sex_col: str = None, sex_val: int = 0, pheno_col: str = None, pheno_val: int = -9) DataFrame | None[source]

PLINK sample information file https://www.cog-genomics.org/plink/1.9/formats#fam

A text file with no header line, and one line per sample with the following six fields:

  1. Family ID (‘FID’)

  2. Within-family ID (‘IID’; cannot be ‘0’)

  3. Within-family ID of father (‘0’ if father isn’t in dataset)

  4. Within-family ID of mother (‘0’ if mother isn’t in dataset)

  5. Sex code (‘1’ = male, ‘2’ = female, ‘0’ = unknown)

  6. Phenotype value (‘1’ = control, ‘2’ = case, ‘-9’/’0’/non-numeric = missing data if case/control)

Parameters:
  • data – Snp data that contain full or partial information on the animal

  • fid_col – Family ID, default value “1”. Must not contain an underscore (“_”)

  • sid_col – Within-family ID (‘IID’; cannot be ‘0’). Must not contain an underscore (“_”)

  • father_col – Within-family ID of father (‘0’ if father isn’t in dataset)

  • mother_col – Within-family ID of mother (‘0’ if mother isn’t in dataset)

  • sex_col – Sex column name in data

  • sex_val – Sex code (‘1’ = male, ‘2’ = female, ‘0’ = unknown)

  • pheno_col – Pheno column name in data

  • pheno_val – Phenotype value (‘1’ = control, ‘2’ = case, ‘-9’/’0’/non-numeric = missing data if case/control)

Returns:

Data in .fam format
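The six-field layout can be sketched without pandas as follows. The helper `fam_lines` is hypothetical and only shows what each line looks like when the parents are unknown and the documented defaults are used (family ID “1”, sex code 0, phenotype -9). It is not the body of `make_fam`, and the sample IDs are illustrative.

```python
def fam_lines(sample_ids, fid="1", sex_val=0, pheno_val=-9):
    """Hypothetical helper: one .fam line per sample, parents unknown ('0')."""
    return [f"{fid} {sid} 0 0 {sex_val} {pheno_val}" for sid in sample_ids]


# Field order: FID, IID, father, mother, sex, phenotype.
lines = fam_lines(["4304", "6925"])
```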

snplib.format.make_lgen(data: DataFrame, sid_col: str, snp_name: str, alleles: list[str], fid_col: str = None) DataFrame | None[source]

PLINK long-format genotype file https://www.cog-genomics.org/plink/1.9/formats#lgen

A text file with no header line, and one line per genotype call (or just not-homozygous-major calls if ‘lgen-ref’ was invoked) usually with the following five fields:

  1. Family ID

  2. Within-family ID

  3. Variant identifier

  4. Allele call 1 (‘0’ for missing)

  5. Allele call 2

There are several variations which are also handled by PLINK; see the original discussion for details.

Parameters:
  • data – Data obtained after parsing FinalReport.txt

  • sid_col

  • snp_name

  • fid_col – Family ID, default value “1”

  • alleles

Returns:

  • Data in .lgen format
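A plain-Python sketch of the five-field layout, assuming each input row carries the SNP name, the sample ID and the two allele calls, and assuming a missing call is marked with “-” and written as “0”. The helper `lgen_lines` is hypothetical and is not the body of `make_lgen`.

```python
def lgen_lines(rows, fid="1"):
    """Hypothetical helper: one .lgen line per genotype call."""
    lines = []
    for snp_name, sample_id, allele1, allele2 in rows:
        # Assumption: '-' marks a missing call; PLINK expects '0' for missing.
        a1 = "0" if allele1 == "-" else allele1
        a2 = "0" if allele2 == "-" else allele2
        lines.append(f"{fid} {sample_id} {snp_name} {a1} {a2}")
    return lines


# (SNP name, sample ID, allele 1, allele 2); the last row is a made-up missing call.
rows = [
    ("ABCA12", "14814", "A", "A"),
    ("ARS-BFGL-BAC-13049", "14814", "A", "B"),
    ("ARS-BFGL-BAC-13031", "14814", "-", "-"),
]
lines = lgen_lines(rows)
```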

snplib.format.make_map(manifest: DataFrame) DataFrame[source]

PLINK text fileset variant information file https://www.cog-genomics.org/plink/1.9/formats#map

A text file with no header line, and one line per variant with the following 3-4 fields:

  1. Chromosome code. PLINK 1.9 also permits contig names here, but most older programs do not.

  2. Variant identifier

  3. Position in morgans or centimorgans (optional; also safe to use dummy value of ‘0’)

  4. Base-pair coordinate

All lines must have the same number of columns (so either no lines contain the morgans/centimorgans column, or all of them do).

Parameters:

manifest – The manifest file downloaded from the Illumina website, containing full information about the chip: https://support.illumina.com/downloads/bovinesnp50-v3-0-product-files.html

Returns:

Data in .map format
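A sketch of the four-field layout in plain Python. The column names `Chr`, `Name` and `MapInfo` are assumed names for the manifest columns, the SNP names and positions are placeholders rather than real chip data, and the helper `map_lines` is hypothetical. The genetic position is written as the dummy value “0”.

```python
def map_lines(manifest_rows):
    """Hypothetical helper: chromosome, variant ID, dummy cM position, bp position."""
    return [f"{row['Chr']} {row['Name']} 0 {row['MapInfo']}" for row in manifest_rows]


# Placeholder manifest records (names and coordinates are made up).
manifest_rows = [
    {"Name": "SNP_A", "Chr": "1", "MapInfo": 1000},
    {"Name": "SNP_B", "Chr": "2", "MapInfo": 2500},
]
lines = map_lines(manifest_rows)
```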

snplib.format.make_ped(data: DataFrame, sid_col: str, snp_col: str, fid_col: str = None, father_col: str = None, mother_col: str = None, sex_col: str = None) DataFrame | None[source]

Original standard text format for sample pedigree information and genotype calls. Normally must be accompanied by a .map file. https://www.cog-genomics.org/plink/1.9/formats#ped

The PED file has 6 fixed columns at the beginning followed by the SNP information. The columns should be separated by a whitespace or a tab. The first six columns hold the following information:

  1. Family ID (if unknown use the same id as for the sample id in column two)

  2. Sample ID

  3. Paternal ID (if unknown use 0)

  4. Maternal ID (if unknown use 0)

  5. Sex (1=male; 2=female; 0=unknown)

  6. Affection (0=unknown; 1=unaffected; 2=affected)

  7. Genotypes (space or tab separated, 2 for each marker. 0/-9=missing)

Here is a brief example of a genotype PED file containing 5 samples with 10 homozygous SNPs:

    4304 4304 0 0 0 0 C C C C G G G G G G C C G G C C T T T T
    6925 6925 0 0 0 0 C C C C T T G G A A C C G G C C T T T T
    7319 7319 0 0 0 0 C C C C G G G G G G C C G G C C T T T T
    6963 6963 0 0 0 0 A A C C T T G G A A C C G G C C T T T T
    6968 6968 0 0 0 0 C C C C G G G G G G G G G G C C T T T T

Parameters:
  • data – Snp data that contain full or partial information on the animal

  • sid_col – Sample ID. Column name in data

  • snp_col – Snp column name in data

  • fid_col – Family ID column name in data (if unknown use the same id as for the sample id in column two)

  • father_col – Paternal ID column name in data (if unknown use 0)

  • mother_col – Maternal ID column name in data (if unknown use 0)

  • sex_col – Sex column name in data (if unknown use 0)

Returns:

Data in .ped format for use with the PLINK program
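How a single PED line is assembled can be sketched in plain Python. The helper `ped_line` is hypothetical and is not the body of `make_ped`. It follows the rules listed above: the family ID falls back to the sample ID, and unknown parents, sex and affection are written as 0. The genotypes are those of sample 4304 in the example.

```python
def ped_line(sample_id, genotypes, fid=None, father="0", mother="0", sex="0", pheno="0"):
    """Hypothetical helper: six fixed columns followed by two alleles per marker."""
    # If the family ID is unknown, reuse the sample ID.
    fid = sample_id if fid is None else fid
    alleles = " ".join(f"{a1} {a2}" for a1, a2 in genotypes)
    return f"{fid} {sample_id} {father} {mother} {sex} {pheno} {alleles}"


# Ten homozygous markers for sample 4304, as in the example PED file.
genotypes = [
    ("C", "C"), ("C", "C"), ("G", "G"), ("G", "G"), ("G", "G"),
    ("C", "C"), ("G", "G"), ("C", "C"), ("T", "T"), ("T", "T"),
]
line = ped_line("4304", genotypes)
```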