What is table2asn?

table2asn is a command-line program that creates sequence records for submission to GenBank.

In general, table2asn will recognize files with the same basename as the input sequence file, and will output an ASN.1 (Abstract Syntax Notation 1) text file with the same basename and a .sqn suffix. Various optional output files can be generated, depending on the arguments used. For example: validation files (.val suffix) and a summary of the .val files (.stats suffix), a discrepancy report (.dr suffix), or a GenBank flatfile (.gbf suffix).

table2asn replaces the older, now-obsolete tool tbl2asn and operates in a very similar way. A few argument values differ (discussed below), and table2asn has several additional functions compared to tbl2asn.

table2asn is available by anonymous FTP. Copy the version for your platform, uncompress the file, rename it to "table2asn" and set the permissions, as necessary for the platform. This page provides a brief description of the more common uses for table2asn, but more detail is provided in the table2asn_readme.txt file.

6 types of input data files

  1. Template file containing a text ASN.1 Submit-block object (suffix .sbt). [Required]
  2. Nucleotide sequence data in FASTA format (suffix .fsa). [Required]
  3. 5-column Feature Table (suffix .tbl). [Required only if including annotation in this format]
  4. Protein sequence (suffix .pep). [Optional; these are rarely needed.]
  5. Quality Scores (suffix .qvl). [Optional]
  6. Source Table (suffix .src). [Optional]

Generating the .sqn file for submission

  • The minimum requirements to generate an ASN .sqn file using table2asn are one .sbt file and one or more .fsa files.
  • The files are placed in a source directory and a series of command line arguments are used to generate the .sqn files.
  • table2asn will generate a .sqn file for every .fsa file in the directory, incorporating any of the corresponding optional files that are present. The other files must have the same file name prefix as their corresponding .fsa file (for example, helicase.fsa and helicase.tbl).
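
The basename-matching rule above can be checked before running table2asn. The following is a minimal sketch, not part of table2asn itself: the function name find_submission_sets and the choice of companion suffixes to scan are assumptions made for illustration, based on the list of input file types on this page.

```python
from pathlib import Path

# Optional companion suffixes taken from the list of input file types above.
COMPANION_SUFFIXES = (".tbl", ".pep", ".qvl", ".src")


def find_submission_sets(source_dir):
    """Map each .fsa basename to the companion files that share that basename.

    Hypothetical helper for illustration; table2asn does this matching itself.
    """
    source = Path(source_dir)
    sets = {}
    for fsa in sorted(source.glob("*.fsa")):
        companions = [
            fsa.with_suffix(suffix).name
            for suffix in COMPANION_SUFFIXES
            if fsa.with_suffix(suffix).is_file()
        ]
        sets[fsa.stem] = companions
    return sets
```

A companion file whose basename matches no .fsa file (for example, a stray.tbl with no stray.fsa) never appears in the result, which makes misnamed files easy to spot.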

Command Line Arguments

To get a summary of the command line arguments that table2asn has at its disposal, run: table2asn -help

Please be aware that the argument values of a few functions were changed in table2asn, compared to the older tbl2asn. You can see all the arguments by typing “table2asn -help”, but this table shows the ones that might have the most impact:

table2asn command line argument changes from tbl2asn
table2asn   tbl2asn   Function
-indir      -p        Path to the input files
-outdir     -r        Path for the resulting .sqn file(s) (if the -outdir argument is not used, the .sqn files will be saved in the source directory). When -outdir is used with -M n or -V v or -Z, the name of the output directory is the basename of the .stats and .dr files that are generated.
-r          -R        Enable remote data retrieval
-Z          -Z File   Discrepancy report
-help       -         Print usage and arguments

Here is a partial list of commonly used table2asn arguments:

-indir Path to the directory that contains the input files. If the files are in the current directory, "-indir ." (a single period) should be used. NOTE: this argument is a change from tbl2asn (had been -p)
-outdir Path for the resulting .sqn file(s) (if the -outdir argument is not used, the .sqn files will be saved in the source directory). When -outdir is used with -M n or -V v or -Z, the name of the output directory is the basename of the .stats and .dr files that are generated. NOTE: this argument is a change from tbl2asn (had been -r)
-t Specifies the template file (.sbt). If the .sbt file is in a different directory the full path must be specified.
-i Creates a single submission from the indicated .fsa file in a directory of multiple .fsa files.
-o Can be used with -i to override the default name of the output .sqn file. When -o is used with -M n or -V v or -Z, the basename of the output file set by -o is used as the basename of all the output files.
-a Specifies the File type.
    a :Any format, including a single FASTA or ASN.1 (default)
    s :Batch set of unrelated sequences, e.g., for a genome assembly
Sample command line: -a s
-j Allows the addition of source qualifiers that will be the same for every sequence in the input fasta files. Example: -j "[organism=Saccharomyces cerevisiae] [strain=S288C]".
-V Verification (combine any of the following letters):
    v :Validates the data records. The output is saved to files with a .val suffix. A summary file with the suffix .stats is also created with the number, severity and type of errors found in all the .val files.
    b :Generates GenBank flatfiles with a .gbf suffix.
Sample command line: -V vb

-c Cleanup (combine any of the following letters):
    f :Fix product names in specific categories of the Discrepancy Report. The output of changed product names is saved to files with a .fixedproducts suffix.
    x :Extend partial ends of features by one or two nucleotides to abut gaps or sequence ends.
    D :Correct Collection Dates (assume day first)
    d :Correct Collection Dates (assume month first)
Sample command line: -c fx
-y Adds a COMMENT to each submission. Example: -y "Contigs larger than 2kb have been annotated, representing approximately 87% of the total genome".
-Y Like -y, but adds a COMMENT from a file to each submission.
-f Uses specified 5-column feature table file as annotation input, so its basename need not match that of the .fsa file.
-Z Runs the sequence discrepancy report, which looks for subtle inconsistencies within a set of related records, and outputs a file with the .dr suffix. Recommended only for annotated genome and transcriptome submissions. See the Discrepancy Report page for information about its output. NOTE: this argument is changed from tbl2asn because it no longer requires (or accepts) an output file name.
-euk Asserts eukaryotic lineage for the discrepancy report tests.
-M Master Genome Flags:
    n :Normal. To be used for prokaryotic or eukaryotic genome submissions. Replaces -a s -V v -c f; invokes genome-specific discrepancy tests and FATAL calls when -Z is included. See the Genome Submission Guide.
    t :TSA. Combines flags for TSA submissions (replaces -a s -V v -c f; invokes TSA-specific validations). See the TSA Submission Guide.
Sample command line: -M n
-help Provides the full list of command line arguments.

Example Command Lines

  • Single non-genome submission: one particular .fsa file that contains only 1 sequence, with the source information in the definition line of the .fsa file:
    • table2asn -t template.sbt -i x.fsa -V v
  • Batch non-genome submission: a directory of .fsa files with multiple sequences per file, with the source information in the definition lines of the .fsa files:
    • table2asn -t template.sbt -indir path_to_files -a s -V v
  • Genome submission: a directory of multiple .fsa files for a single genome, with one or more sequences per file and the source information in the definition lines of the .fsa files:
    • table2asn -t template.sbt -indir path_to_files -M n -Z
  • Genome submission for the most common gapped situation (runs of 10 or more Ns represent a gap, there are no gaps of completely unknown size, and the evidence for linkage across the gaps is "paired-ends"), with the source information in the definition lines of the .fsa files:
    • table2asn -t template.sbt -indir path_to_files -M n -Z -gaps-min 10 -l paired-ends
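
When table2asn is driven from a script, it can help to assemble the argument list programmatically. The sketch below only builds the command and does not run it. The function build_table2asn_command and its parameter names are hypothetical; the flags themselves (-t, -indir, -outdir, -M n, -a s, -V v, -Z, -j) are the ones described on this page.

```python
import shlex


def build_table2asn_command(template, indir, outdir=None, genome=False,
                            discrepancy=False, source_qualifiers=None):
    """Return an argv list that uses only flags described on this page.

    Hypothetical helper: it assembles arguments but never executes table2asn.
    """
    argv = ["table2asn", "-t", template, "-indir", indir]
    if outdir:
        argv += ["-outdir", outdir]
    if genome:
        # -M n replaces -a s -V v -c f for genome submissions.
        argv += ["-M", "n"]
    else:
        # Batch non-genome submission with validation.
        argv += ["-a", "s", "-V", "v"]
    if discrepancy:
        argv.append("-Z")
    if source_qualifiers:
        # -j takes bracketed modifiers, e.g. "[organism=...] [strain=...]".
        joined = " ".join(f"[{k}={v}]" for k, v in source_qualifiers.items())
        argv += ["-j", joined]
    return argv


cmd = build_table2asn_command("template.sbt", "path_to_files",
                              genome=True, discrepancy=True)
print(shlex.join(cmd))
# table2asn -t template.sbt -indir path_to_files -M n -Z
```

Keeping the arguments as a list avoids shell-quoting problems when a -j value contains spaces; the list can be passed to subprocess.run once table2asn is installed.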

Before submitting your .sqn files to GenBank,

Creating the template file (.sbt)

Nucleotide sequence and FASTA defline formats (.fsa)

  • No size limit on nucleotide sequence, generally.
    • There is a technical length limit of 2,147,483,647 bp (= 2^31 - 1)
  • Each sequence in a FASTA file must have a definition line (defline) beginning with a '>'.
  • Minimum requirements for the FASTA defline are:
    • SeqID (sequence identifier) which is the text between the '>' and the first space. The SeqIDs limits are:
      • Must be <50 characters
      • Can only include letters, digits, hyphens (-), underscores (_), periods (.), colons (:), asterisks (*), and number signs (#).
    • Organism and related information (unless organism information is included with -j at the command line or in a .src file)
    • Optional defline information is in this list of source modifiers and includes:

Biological

  • strain [strain=S288C]
  • isolate [isolate=CWS1]
  • chromosome [chromosome=XVI]
  • plasmid [plasmid-name=pBR322]

Other elements

  • topology [topology=circular]
  • location [location=mitochondrion]
  • molecule [moltype=mRNA] (DNA is the default)
  • technique [tech=wgs]
  • genetic code [gcode=4]

Here is the list of source modifiers. See the Taxonomy pages for the genetic code values.

Example FASTA

>Sc_16 [organism=Saccharomyces cerevisiae]
tataggcgaatcgagtatattattttttctcaacatatgtat
atgaacatgagaatatatttataggaatgtataaaattgtga
cctctcctgctattttagttactgattttatgtatgtagggg
gaataggggctgcctttcttaatgcagttttaattttttctt
ttaattttttcttagtaaaattatttaaagtaaagattaatg
gaataaccattgcgcttttttttacagtttttggtttttcat
tttttggaaaaaatattttaaatattttacctttttatttag
ggggtattttatatagtatctatacttcaacagatttttctg
aacatatagttcctattgctttttcaagtgcattagcccctt
ttgtaagcagtgttgctttttatggagaaatatcctatgaaa
catcatatataaatgcaattttaattggtattttaattggtt
ttatagtggttcctttgtctaaaagtctttatgactttcatg
agggatatgatttatataatttaggttttacagcaggtt

Example batch FASTA

>Sc_16_1 [organism=Saccharomyces cerevisiae]
tataggcgaatcgagtatattattttttctcaacatatgtat
atgaacatgagaatatatttataggaatgtataaaattgtga
cctctcctgctattttagttactgattttatgtatgtagggg
gaataggggctgcctttcttaatgcagttttaattttttctt
ttaattttttcttagtaaaattatttaaagtaaagattaatg
aacatatagttcctattgctttttcaagtgcattagcccctt
ttgtaagcagtgttgctttttatggagaaatatcctatgaaa
>Sc_16_2 [organism=Saccharomyces cerevisiae]
catcatatataaatgcaattttaattggtattttaattggtt
ttatagtggttcctttgtctaaaagtctttatgactttcatg
agggatatgatttatataatttaggttttacagcaggtt
gaataaccattgcgcttttttttacagtttttggtttttcat
tttttggaaaaaatattttaaatattttacctttttatttag
ggggtattttatatagtatctatacttcaacagatttttctg
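
The defline rules above (SeqID length and character set, bracketed source modifiers) can be checked with a short script before submission. This is a minimal sketch under stated assumptions: the function parse_defline and both regular expressions are illustrative, "letters" is taken to mean ASCII letters, and table2asn's own parser may accept or reject cases this sketch does not cover.

```python
import re

# Allowed SeqID characters per the list above: letters, digits, - _ . : * #
# Assumption: "letters" means ASCII letters only.
SEQID_PATTERN = re.compile(r"[A-Za-z0-9\-_.:*#]+")
# Matches bracketed modifiers such as [organism=Saccharomyces cerevisiae].
MODIFIER_PATTERN = re.compile(r"\[([^=\[\]]+)=([^\[\]]*)\]")


def parse_defline(line):
    """Split a FASTA defline into (seqid, {modifier: value})."""
    if not line.startswith(">"):
        raise ValueError("defline must begin with '>'")
    # The SeqID is the text between '>' and the first space.
    seqid, _, rest = line[1:].strip().partition(" ")
    if len(seqid) >= 50 or not SEQID_PATTERN.fullmatch(seqid):
        raise ValueError(f"invalid SeqID: {seqid!r}")
    modifiers = {k.strip(): v.strip() for k, v in MODIFIER_PATTERN.findall(rest)}
    return seqid, modifiers
```

For example, parse_defline(">Sc_16 [organism=Saccharomyces cerevisiae]") returns the SeqID "Sc_16" and a dictionary with the single key "organism".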

Feature table format (.tbl)

table2asn reads features from a five-column tab-delimited table called a Feature table. The feature table specifies the location and type of each feature. table2asn will process the feature intervals and translate CDSs into proteins. The first line of the table should contain the following information:

>Feature SeqID

The SeqID must match the nucleotide sequence SeqID in the corresponding .fsa file.

Example Feature Table

>Feature Sc_16 
69      543    gene
                        gene       sde3p
69      543    CDS
                        product SDE3P
                        protein_id     WS1030
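
A feature table like the one above can also be generated from structured data. The sketch below assumes the usual column layout for this format: columns 1 to 3 hold start, stop and feature key, and qualifier lines leave those three columns empty and put the qualifier name and value in columns 4 and 5. The function name format_feature_table and its input structure are hypothetical.

```python
def format_feature_table(seqid, features):
    """Render features as 5-column, tab-delimited feature table text.

    features: list of (start, stop, feature_key, [(qualifier, value), ...]).
    Hypothetical helper for illustration.
    """
    lines = [f">Feature {seqid}"]
    for start, stop, key, qualifiers in features:
        # Columns 1-3: start, stop, feature key.
        lines.append(f"{start}\t{stop}\t{key}")
        for name, value in qualifiers:
            # Columns 4-5: qualifier name and value; columns 1-3 stay empty.
            lines.append(f"\t\t\t{name}\t{value}")
    return "\n".join(lines) + "\n"


table = format_feature_table("Sc_16", [
    (69, 543, "gene", [("gene", "sde3p")]),
    (69, 543, "CDS", [("product", "SDE3P"), ("protein_id", "WS1030")]),
])
```

Writing real tab characters rather than aligned spaces matters here, because the file is described as tab-delimited.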

Note that GenBank prokaryotic or eukaryotic genomes can use GFF3 files in a GenBank-specific format as annotation input, as described at Annotating Genomes with GFF3 or GTF files. In general, the qualifiers that can be included in a 5-column feature table (.tbl) file can be included in column 9 of the appropriate feature's row in a GFF3 file.

Protein sequence format (.pep)

  • This file is not usually needed because GenBank generally presents only the conceptual translation of the nucleotide sequence, which table2asn generates automatically.
  • When provided, the protein sequences in this file replace the automatically translated products of the CDS features, so it is only needed in unusual cases.
  • It is a FASTA file of the protein sequences, where each SeqID must match a protein_id in the .tbl file.

Example FASTA

>WS1030 [gene=sde3p] [protein=SDE3P]
MYKIVTSPAILVTDFMYVGGIGAAFLNAVLIFSFNFFL
VKLFKVKINGITIAAFFTVFGFSFFGKNILNILPFYLG
GILYSIYTSTDFSEHIVPIAFSSALAPFVSSVAFYGEI
SYETSYINAILIGILIGFIVVPLSKSLYDFHEGYDLYN
LGFTAG
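
Because each SeqID in the .pep file must match a protein_id in the .tbl file, a quick cross-check can catch mismatches early. The following is a rough sketch, assuming a tab-delimited .tbl file with qualifier names in column 4 and values in column 5; the function names are hypothetical and this is not how table2asn itself performs the match.

```python
def protein_ids_in_tbl(tbl_text):
    """Collect protein_id qualifier values from feature table text."""
    ids = set()
    for line in tbl_text.splitlines():
        fields = line.split("\t")
        # Qualifier lines: columns 1-3 empty, name in column 4, value in column 5.
        if len(fields) >= 5 and fields[3] == "protein_id":
            ids.add(fields[4].strip())
    return ids


def pep_seqids(pep_text):
    """Return the SeqID (first token after '>') of every defline."""
    seqids = []
    for line in pep_text.splitlines():
        tokens = line[1:].split() if line.startswith(">") else []
        if tokens:
            seqids.append(tokens[0])
    return seqids


def unmatched_pep_ids(tbl_text, pep_text):
    """List .pep SeqIDs that have no matching protein_id in the .tbl text."""
    known = protein_ids_in_tbl(tbl_text)
    return [seqid for seqid in pep_seqids(pep_text) if seqid not in known]
```

An empty list from unmatched_pep_ids means every protein sequence has a corresponding CDS protein_id.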

Quality scores table format (.qvl)

  • Provides Phrap/Consed quality scores.
  • Has a defline with the corresponding SeqID from the .fsa file.
  • Generates Seq-graph data that will be included with the nucleotide sequence of the .fsa file in the final .sqn file.
    >Sc_16
    51 63 70 82 82 82 90 90 90 90 86 86
    86 86 86 86 90 90 90 90 90 86 86 78...
    

Source table format (.src)

For sets of sequences, especially those with different sources, a tab-delimited source modifier table file can be created, with a name that has a .src extension. The first column in the file must be the SeqIDs of the sequences. The first row gives the names of the source qualifiers being added, separated by tabs. Any additional rows list the SeqID and source qualifiers for each sequence in the corresponding .fsa file.

SeqID     organism     strain     isolate
Sc_16     Zea mays     A69Y       JH90.6-2x12
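
A source table with this layout can be written from a dictionary of per-sequence qualifiers. This is a minimal sketch: the function name format_source_table and its input structure are assumptions made for illustration, and missing qualifier values are simply written as empty cells.

```python
def format_source_table(rows, qualifiers):
    """Render a tab-delimited source table; the first column is the SeqID.

    rows: {seqid: {qualifier: value}}; qualifiers: column order after SeqID.
    Hypothetical helper for illustration.
    """
    lines = ["\t".join(["SeqID", *qualifiers])]
    for seqid, values in rows.items():
        # Missing qualifiers become empty cells so columns stay aligned.
        cells = [values.get(name, "") for name in qualifiers]
        lines.append("\t".join([seqid, *cells]))
    return "\n".join(lines) + "\n"


src_text = format_source_table(
    {"Sc_16": {"organism": "Zea mays", "strain": "A69Y", "isolate": "JH90.6-2x12"}},
    ["organism", "strain", "isolate"],
)
```

The result should be saved with the same basename as the corresponding .fsa file and a .src extension.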

Last updated: 2024-04-23T18:38:34Z