### Metadata	Value	Required	Example	Description
bioproject		yes	PRJNA366136	BioProject under which the genome will be listed
biosample		yes	SAMN38845068	BioSample of the genome
name		yes	Unclassified Caudoviricetes virus Fuxi-1	An identifier or name for the genome. INSDC requires every sample name from a single Submitter to be unique
organism		yes	Caudoviricetes sp. Fuxi-1	Species-rank taxonomic assignment followed by an identifier. If no species exists for this genome, use “<lowest fitting taxon> sp.”, where <lowest fitting taxon> is the formal ICTV name of the lowest-ranking taxon that can be confidently assigned
sra_reads			SRX2554840, SRX2554839	SRA identifiers for the original reads used in the genome assembly, can be a comma-separated list
lineage		yes	Viruses; Duplodnaviria; Heunggongvirae; Uroviricota; Caudoviricetes; Unclassified Caudoviricetes	Taxonomic classification based on the latest ICTV recommendations
topology		yes	circular	Is the genome linear or circular
gcode		yes	11	Predicted genetic code used
completedness		yes	partial	Should be partial until completeness has been experimentally verified
moltype		yes	genomic DNA	Type of nucleic acid (should be one of “genomic DNA” or “genomic RNA”, most likely)
metagenomic		yes	TRUE	Should be TRUE
metagenome_source		yes	metagenome	Describes the original source of a metagenome assembled genome (MAG). Examples: soil metagenome, gut metagenome
environmental_sample		yes	TRUE	Should be TRUE
host			Bifangarchaeales Bathyarchaeia	If known, please indicate the host taxon here
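
Before submitting, it can help to check that every field marked "yes" in the Required column has a value. The following is a minimal Python sketch, assuming the column order field, Value, Required, Example, Description used in this sheet; the function name `missing_required` and the inline example rows are hypothetical and not part of any submission tool.

```python
import csv
import io

def missing_required(tsv_text):
    """Return names of fields marked 'yes' in Required whose Value column is empty."""
    missing = []
    reader = csv.reader(io.StringIO(tsv_text), delimiter="\t", quoting=csv.QUOTE_NONE)
    for row in reader:
        # Skip blank lines, the '###' header row, and section marker rows.
        if not row or not row[0].strip():
            continue
        if row[0].startswith("###") or row[0] == "StructuredCommentPrefix":
            continue
        # Pad short rows so that indexing the first three columns is safe.
        row = row + [""] * (3 - len(row))
        field, value, required = row[0], row[1].strip(), row[2].strip().lower()
        if required == "yes" and not value:
            missing.append(field)
    return missing

# Hypothetical three-row excerpt: bioproject is filled in, biosample is not.
example = (
    "### Metadata\tValue\tRequired\tExample\tDescription\n"
    "bioproject\tPRJNA366136\tyes\tPRJNA366136\tBioProject\n"
    "biosample\t\tyes\tSAMN38845068\tBioSample\n"
    "host\t\t\tBifangarchaeales\tHost taxon\n"
)
print(missing_required(example))
```

Quoting is disabled in the reader because several descriptions in this sheet contain quote characters that are not meant as field delimiters.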


StructuredCommentPrefix	Assembly-Data
Genome Coverage		yes	24x	Average coverage of the corresponding genome in the original sample, can be extracted from the results of step05 if run
Sequencing Technology		yes	Illumina HiSeq 2500	e.g., ABI 3730; Illumina GAIIx; Nanopore (see https://www.ncbi.nlm.nih.gov/genbank/structuredcomment/)


StructuredCommentPrefix	MIUVIG:5.0-Data	For more information on these, please see https://genomicsstandardsconsortium.github.io/mixs/0010012/
investigation_type	miuvig	yes		Required for UViG submissions
assembly_qual		yes	High-quality draft genome	The assembly quality category is based on sets of criteria outlined for each assembly quality category. For MISAG/MIMAG: Finished: Single, validated, contiguous sequence per replicon without gaps or ambiguities with a consensus error rate equivalent to Q50 or better. High Quality Draft: Multiple fragments where gaps span repetitive regions. Presence of the 23S, 16S and 5S rRNA genes and at least 18 tRNAs. Medium Quality Draft: Many fragments with little to no review of assembly other than reporting of standard assembly statistics. Low Quality Draft: Many fragments with little to no review of assembly other than reporting of standard assembly statistics. Assembly statistics include, but are not limited to, total assembly size, number of contigs, contig N50/L50, and maximum contig length. For MIUVIG: Finished: Single, validated, contiguous sequence per replicon without gaps or ambiguities, with extensive manual review and editing to annotate putative gene functions and transcriptional units. High-quality draft genome: One or multiple fragments, totaling ≥ 90% of the expected genome or replicon sequence or predicted complete. Genome fragment(s): One or multiple fragments, totaling < 90% of the expected genome or replicon sequence, or for which no genome size could be estimated
assembly_software		yes	metaSPAdes;3.11.1	Tool(s) used for assembly, including version number and parameters
collection_date		yes	2012-10-12	The time of sampling, either as an instance (single point in time) or an interval. If no exact time is available, the date/time can be right-truncated, i.e. all of these are valid times: 2008-01-23T19:23:10+00:00; 2008-01-23T19:23:10; 2008-01-23; 2008-01; 2008. All of these except 2008-01 and 2008 are ISO 8601 compliant
detec_type		yes	independent sequence (UViG)	Type of UViG detection
env_broad_scale		yes	aquatic biome [ENVO:00002030]	Report the major environmental system the sample or specimen came from. The system(s) identified should have a coarse spatial grain, to provide the general environmental context of where the sampling was done (e.g. in the desert or a rainforest). We recommend using subclasses of EnvO's biome class: http://purl.obolibrary.org/obo/ENVO_00000428. EnvO documentation about how to use the field: https://github.com/EnvironmentOntology/envo/wiki/Using-ENVO-with-MIxS
env_local_scale		yes	hot spring [ENVO:00000051]	Report the entity or entities which are in the sample or specimen's local vicinity and which you believe have significant causal influences on your sample or specimen. We recommend using EnvO terms which are of smaller spatial grain than your entry for env_broad_scale. Terms, such as anatomical sites, from other OBO Library ontologies which interoperate with EnvO (e.g. UBERON) are accepted in this field. EnvO documentation about how to use the field: https://github.com/EnvironmentOntology/envo/wiki/Using-ENVO-with-MIxS
env_medium		yes	spring water [ENVO:03600065]	Report the environmental material(s) immediately surrounding the sample or specimen at the time of sampling. We recommend using subclasses of 'environmental material' (http://purl.obolibrary.org/obo/ENVO_00010483). EnvO documentation about how to use the field: https://github.com/EnvironmentOntology/envo/wiki/Using-ENVO-with-MIxS . Terms from other OBO ontologies are permissible as long as they reference mass/volume nouns (e.g. air, water, blood) and not discrete, countable entities (e.g. a tree, a leaf, a table top)
geo_loc_name		yes	USA: Yellowstone National Park, Wyoming	The geographical origin of the sample as defined by the country or sea name followed by specific region name. Country or sea names should be chosen from the INSDC country list (http://insdc.org/country.html), or the GAZ ontology (http://purl.bioontology.org/ontology/GAZ)
lat_lon		yes	44.611006 N 110.440182 W	The geographical origin of the sample as defined by latitude and longitude. The values should be reported in decimal degrees and in WGS84 system
number_contig		yes	1	Total number of contigs in the cleaned/submitted assembly that makes up a given genome, SAG, MAG, or UViG
pred_genome_struc		yes	non-segmented	Expected structure of the viral genome
pred_genome_type		yes	dsDNA	Type of genome predicted for the UViG
project_name		yes	Hot spring thermophilic microbial communities from Obsidian Pool, Yellowstone National Park, USA - OP-RAMG-01	Name of the project within which the sequencing was organized
samp_name		yes	OP-RAMG-01	A local identifier or name for the material sample used for extracting nucleic acids and subsequent sequencing. It can refer either to the original material collected or to any derived sub-samples. It can have any format, but we suggest that you make it concise, unique and consistent within your lab, and as informative as possible. INSDC requires every sample name from a single Submitter to be unique. Use of a globally unique identifier for the field source_mat_id is recommended in addition to samp_name
samp_taxon_id		yes	433727	NCBI taxon id of the sample. May be a single taxon or a mixed taxa sample. Use 'synthetic metagenome' for mock community/positive controls, or 'blank sample' for negative controls. See https://www.ncbi.nlm.nih.gov/Taxonomy/Browser/wwwtax.cgi?id=410657 for a list of potential values
seq_meth		yes	Illumina NovaSeq 6000	Sequencing machine used. Where possible the term should be taken from the OBI list of DNA sequencers (http://purl.obolibrary.org/obo/OBI_0400103)
source_uvig		yes	metagenome (not viral targeted)	Type of dataset from which the UViG was obtained
vir_ident_software		yes	geNomad	Tool(s) used for the identification of UViG as a viral genome, software or protocol name including version number, parameters, and cutoffs used
virus_enrich_appr		yes	none	List of approaches used to enrich the sample for viruses, if any
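
The formats described above for lat_lon and collection_date can be checked mechanically before submission. The following is a minimal Python sketch, assuming the "decimal degrees, hemisphere letter" layout shown in the lat_lon example and the right-truncated date forms listed for collection_date. The function names are hypothetical, the date pattern checks shape only (it does not reject impossible calendar dates), and intervals are not handled.

```python
import re

# Decimal degrees followed by a hemisphere letter, e.g. "44.611006 N 110.440182 W".
LAT_LON = re.compile(r"^(\d+(?:\.\d+)?) ([NS]) (\d+(?:\.\d+)?) ([EW])$")

# Right-truncated date/time: year, then optional month, day, time, and UTC offset.
DATE = re.compile(
    r"^\d{4}(-\d{2}(-\d{2}(T\d{2}:\d{2}:\d{2}(Z|[+-]\d{2}:\d{2})?)?)?)?$"
)

def valid_lat_lon(value):
    """True if value looks like '<lat> N|S <lon> E|W' with in-range degrees."""
    m = LAT_LON.match(value.strip())
    if not m:
        return False
    return float(m.group(1)) <= 90 and float(m.group(3)) <= 180

def valid_collection_date(value):
    """True if value has one of the right-truncated date/time shapes."""
    return DATE.match(value.strip()) is not None

print(valid_lat_lon("44.611006 N 110.440182 W"))
print(valid_collection_date("2012-10-12"))
```

Signed coordinates such as "44.6, -110.4" are rejected on purpose, since the example in this sheet uses hemisphere letters rather than signs.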


assembly_name			IMG/M 3300027863	Name/version of the assembly provided by the submitter that is used in the genome browsers and in the community
compl_appr			Direct terminal repeats	The approach used to determine the completeness of a given genomic assembly, which would typically make use of a set of conserved marker genes or a closely related reference genome. For UViG completeness, include reference genome or group used, and contig feature suggesting a complete genome
compl_score			high; 100%	Completeness score is typically based on either the fraction of markers found as compared to a database or the percent of a genome found as compared to a closely related reference genome. High Quality Draft: >90%, Medium Quality Draft: >50%, and Low Quality Draft: < 50% should have the indicated completeness scores
compl_software			CheckV 0.8.1	Tool(s) used to estimate completeness, e.g. CheckM, anvi'o, BUSCO
annot			geNomad	Tool used for annotation, or for cases where annotation was provided by a community jamboree or model organism database rather than by a specific submitter
feat_pred			Prodigal;2.6.3;default parameters	Method used to predict UViG features such as ORFs, integration site, etc
ref_db			Protein Data Bank (14_Apr_2022), Structural Classification of Proteins (2.07), Pfam-A_v35, uniprot_sprot_vir70 (3_Nov_2021) 	List of database(s) used for ORF annotation, along with version number and reference to website or publication
sim_search_meth			HHblits v3.3.0	Tool used to compare ORFs with database, along with version and cutoffs used
host_pred_appr			CRISPR spacer match	Tool or approach used for host prediction
host_pred_est_acc			CRISPR spacer match: 100% coverage and a maximum of one mismatch	For each tool or approach used for host prediction, estimated false discovery rates should be included, either computed de novo or from the literature
tax_class			vContact2 (references from NCBI RefSeq v207, genus rank classification, default parameters), with the addition of archaeal viruses from ICTV (VMR_MSL38_v1)	Method used for taxonomic classification, along with reference database used, classification rank, and thresholds used to classify new genomes
tax_ident			Whole genome; Major capsid proteins	The phylogenetic marker(s) used to assign an organism name to the SAG or MAG
sop			https://gitlab.com/ccoclet/mvp/	Standard operating procedures used in assembly and/or annotation of genomes, metagenomes or environmental sequences


depth				The vertical distance below local surface. For sediment or soil samples depth is measured from sediment or soil surface, respectively. Depth can be reported as an interval for subsurface samples
elev				Elevation of the sampling site is its height above a fixed reference point, most commonly the mean sea level. Elevation is mainly used when referring to points on the earth's surface, while altitude is used for points above the surface, such as an aircraft in flight or a spacecraft in orbit
alt				Heights of objects such as airplanes, space shuttles, rockets, atmospheric balloons and heights of places such as atmospheric layers and clouds. It is used to measure the height of an object which is above the earth's surface. In this context, the altitude measurement is the vertical distance between the earth's surface above sea level and the sampled position in the air
size_frac				Filtering pore size used in sample preparation
nucl_acid_amp				A link to a literature reference, electronic resource or a standard operating procedure (SOP), that describes the enzymatic amplification (PCR, TMA, NASBA) of specific nucleic acids
nucl_acid_ext				A link to a literature reference, electronic resource or a standard operating procedure (SOP), that describes the material separation to recover the nucleic acid fraction from a sample
associated resource			https://img.jgi.doe.gov/vr	A related resource that is referenced, cited, or otherwise associated with the sequence


adapters				Adapters provide priming sequences for both amplification and sequencing of the sample-library fragments. Both adapters should be reported; in uppercase letters
bin_param				The parameters that have been applied during the extraction of genomes from metagenomic datasets
bin_software				Tool(s) used for the extraction of genomes from metagenomic datasets, where possible include a product ID (PID) of the tool(s) used
mag_cov_software				Tool(s) used to determine the genome coverage if coverage is used as a binning parameter in the extraction of genomes from metagenomic datasets
reassembly_bin				Has an assembly been performed on a genome bin extracted from a metagenomic assembly?
estimated_size				The estimated size of the genome prior to sequencing. Of particular importance in the sequencing of (eukaryotic) genomes, which could remain in draft form for a long or unspecified period
lib_layout				Specify whether to expect single, paired, or other configuration of reads
lib_reads_seqd				Total number of clones sequenced from the library
lib_screen				Specific enrichment or screening methods applied before and/or after creating libraries
lib_size				Total number of clones in the library prepared for the project
lib_vector				Cloning vector type(s) used in construction of libraries
mid				Molecular barcodes, called Multiplex Identifiers (MIDs), that are used to specifically tag unique samples in a sequencing run. Sequence should be reported in uppercase letters
otu_class_appr			95% ANI;85% AF	Cutoffs and approach used when clustering species-level OTUs. Note that results from standard 95% ANI / 85% AF clustering should be provided alongside OTUs defined from another set of thresholds, even if the latter are the ones primarily used during the analysis
otu_db			NCBI Viral RefSeq v200	Reference database (i.e. sequences not generated as part of the current study) used to cluster new genomes in "species-level" OTUs, if any
otu_seq_comp_appr			blastn (v2.5.0+); -task megablast, -max_target_seqs 25000, and -perc_identity 90	Tool and thresholds used to compare sequences when computing "species-level" OTUs
specific_host				Report the host's taxonomic name and/or NCBI taxonomy ID
trna_ext_software			tRNAscan-SE v. 2.0	Tools used for tRNA identification
trnas			2	The total number of tRNAs identified from the SAG or MAG


biotic_relationship				Description of relationship(s) between the subject organism and other organism(s) it is associated with. E.g., parasite on species X; mutualist with species Y. The target organism is the subject of the relationship, and the other organism(s) is the object
experimental_factor				Variable aspects of an experiment design that can be used to describe an experiment, or set of experiments, in an increasingly detailed manner. This field accepts ontology terms from Experimental Factor Ontology (EFO) and/or Ontology for Biomedical Investigations (OBI)
host_disease_stat				List of diseases with which the host has been diagnosed; can include multiple diagnoses. The value of the field depends on host; for humans the terms should be chosen from the DO (Human Disease Ontology) at https://www.disease-ontology.org, non-human host diseases are free text
host_spec_range				The range and diversity of host species that an organism is capable of infecting, defined by NCBI taxonomy identifier
pathogenicity				To what is the entity pathogenic
pos_cont_type				The substance, mixture, product, or apparatus used to verify that a process which is part of an investigation delivers a true positive
neg_cont_type				The substance or equipment used as a negative control in an investigation
samp_collec_device				The device used to collect an environmental sample. This field accepts terms listed under environmental sampling device (http://purl.obolibrary.org/obo/ENVO). This field also accepts terms listed under specimen collection device (http://purl.obolibrary.org/obo/GENEPIO_0002094)
samp_collec_method				The method employed for collecting the sample
samp_mat_process				A brief description of any processing applied to the sample during or after retrieving the sample from environment, or a link to the relevant protocol(s) performed
samp_size				The total amount or size (volume (ml), mass (g) or area (m2)) of sample collected
samp_vol_we_dna_ext				Volume (ml) or mass (g) of total collected sample processed for DNA extraction. Note: total sample collected should be entered under the term Sample Size (MIXS:0000001)
sc_lysis_approach				Method used to free DNA from interior of the cell(s) or particle(s)
sc_lysis_method				Name of the kit or standard protocol used for cell(s) or particle(s) lysis
ref_biomaterial				Primary publication if isolated before genome publication; otherwise, primary genome report
sort_tech				Method used to sort/isolate cells or particles of interest
source_mat_id				A unique identifier assigned to a material sample (as defined by http://rs.tdwg.org/dwc/terms/materialSampleID, and as opposed to a particular digital record of a material sample) used for extracting nucleic acids, and subsequent sequencing. The identifier can refer either to the original material collected or to any derived sub-samples. The INSDC qualifiers /specimen_voucher, /bio_material, or /culture_collection may or may not share the same value as the source_mat_id field. For instance, the /specimen_voucher qualifier and source_mat_id may both contain 'UAM:Herps:14', referring to both the specimen voucher and sampled tissue with the same identifier. However, the /culture_collection qualifier may refer to a value from an initial culture (e.g. ATCC:11775) while source_mat_id would refer to an identifier from some derived culture from which the nucleic acids were extracted (e.g. xatc123 or ark:/2154/R2)
wga_amp_appr				Method used to amplify genomic DNA in preparation for sequencing
wga_amp_kit				Kit used to amplify genomic DNA in preparation for sequencing
