About VGAS

VGAS (Viral Genome Annotation System) combines an ab initio method with a similarity-based method. It finds genes in viral genomes and annotates their functions using only the input sequence itself. We developed it on the basis of ZCURVE_V, a program we developed previously. VGAS is composed of the following modules:
  • Seeking seed ORFs
  • Training the Fisher coefficients involved in the model to describe the characteristics of coding and non-coding regions
  • Scoring all ORFs
  • Performing homology searches to discard some false ORFs, confirm some true genes, and obtain function annotations
  • Checking the remaining ORFs for overlapping
  • Relocating start sites of predicted genes
  • Determining the final genes
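
As an illustration of the first module only, here is a minimal sketch of scanning for candidate ORFs. It assumes the parameters shown in the example output on this page (start codon ATG, stop codons TAA, TAG and TGA, minimum gene length 90 bp) and scans only the forward strand. The function name `find_orfs` is hypothetical, and this is not the actual VGAS implementation.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}

def find_orfs(seq, min_len=90, start="ATG"):
    """Return (start, stop, length) tuples, 1-based inclusive, forward strand only.

    Hypothetical helper: each ORF runs from the first start codon after the
    previous in-frame stop up to and including the next in-frame stop codon.
    """
    seq = seq.upper().replace("U", "T")
    orfs = []
    for frame in range(3):
        orf_start = None  # index of the first start codon since the last stop
        for i in range(frame, len(seq) - 2, 3):
            codon = seq[i:i + 3]
            if orf_start is None and codon == start:
                orf_start = i
            elif codon in STOP_CODONS:
                if orf_start is not None:
                    length = i + 3 - orf_start
                    if length >= min_len:
                        orfs.append((orf_start + 1, i + 3, length))
                orf_start = None
    return sorted(orfs)
```

A double-stranded genome would also need the same scan on the reverse complement, which this sketch leaves out.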
The ZCURVE_V paper, cited by 47 papers in total, was published in BMC Bioinformatics in 2005. You can access the original version at http://tubic.tju.edu.cn/Zcurve_V/.

Features

Compared with ZCURVE_V, VGAS has been improved in three aspects:
  • Increasing the number of identifying variables from 33 to 45 to extract more information.
  • Adding a BLAST search to achieve better results and provide a reference for function annotation.
  • Applying Fisher discriminant analysis five times during the process to make the results more precise.
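
To show what a single Fisher discriminant step does, here is a minimal sketch on toy data. It assumes two hypothetical 2-D feature vectors per ORF instead of the 45 variables VGAS uses, fits one discriminant only (not the five-pass procedure), and places the threshold at the midpoint of the two projected class means, which is also an assumption. The name `fisher_2d` is illustrative.

```python
def fisher_2d(coding, noncoding):
    """Fit a two-class Fisher discriminant on 2-D points.

    Returns (weights, threshold); a point x is called "coding" when
    weights[0]*x[0] + weights[1]*x[1] exceeds the threshold.
    """
    def mean(pts):
        n = len(pts)
        return (sum(p[0] for p in pts) / n, sum(p[1] for p in pts) / n)

    def scatter(pts, m):
        # Entries of the symmetric 2x2 within-class scatter matrix.
        sxx = sum((p[0] - m[0]) ** 2 for p in pts)
        sxy = sum((p[0] - m[0]) * (p[1] - m[1]) for p in pts)
        syy = sum((p[1] - m[1]) ** 2 for p in pts)
        return sxx, sxy, syy

    mc, mn = mean(coding), mean(noncoding)
    a1, b1, c1 = scatter(coding, mc)
    a2, b2, c2 = scatter(noncoding, mn)
    sxx, sxy, syy = a1 + a2, b1 + b2, c1 + c2
    det = sxx * syy - sxy * sxy
    if det == 0:
        raise ValueError("within-class scatter matrix is singular")
    dx, dy = mc[0] - mn[0], mc[1] - mn[1]
    # weights = inverse(within-class scatter) * (difference of class means)
    w = ((syy * dx - sxy * dy) / det, (sxx * dy - sxy * dx) / det)
    # Assumed threshold: midpoint between the projected class means.
    threshold = 0.5 * (w[0] * (mc[0] + mn[0]) + w[1] * (mc[1] + mn[1]))
    return w, threshold
```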

Example of Input

  • You can submit a DNA (or RNA) sequence as the input file. It must contain no letters other than A, C, G, T (U), and N (for uncertain bases).
        ACCCAACAAGGGGAAATAAGTCAACAAGAGAATTAAAATCTATTAAGTTG
        AAATTAGGATCAAAGACTCCAATCTACTGGGATTGTCTTTGATCAATTGT
        CATAACAATTATGGAGTTGTTTGACATCGCAGACGGGTTCGCTGATCATC
        AGATCAACCTAAGATCTTCTAAGCAGGCAACTGGAAGTCTCAGTGCAATC
        AAGGATCAAATCCTTGTATTGATCCCTGGAACACAAGATTCTGACATTTT
        AAGCAACCTCCTAATTGCCCTATTATCCTTGATCTTCAATGCAGGCTGTC
        CAGAGCCTATCTGCGCTGGGGCTTTCCTCAGCCTTTTAGTCCTCTTCACG
        AATAATCCAACTGCAGCCTTAGGAACTCATGCAAAGGACTCGGATACAGT
        CATCAGTACCTACACCATCACTGAGTTTTCAGGAATGGGTCCAGTATTAA
        ...
    

  • Alternatively, you can submit a viral genome sequence in FASTA format containing only one header line (a single record).
    >gi|41387225|ref|NC_005084.2| Fer-de-lance virus, complete genome
    ACCCAACAAGGGGAAATAAGTCAACAAGAGAATTAAAATCTATTAAGTTG
    AAATTAGGATCAAAGACTCCAATCTACTGGGATTGTCTTTGATCAATTGT
    CATAACAATTATGGAGTTGTTTGACATCGCAGACGGGTTCGCTGATCATC
    AGATCAACCTAAGATCTTCTAAGCAGGCAACTGGAAGTCTCAGTGCAATC
    AAGGATCAAATCCTTGTATTGATCCCTGGAACACAAGATTCTGACATTTT
    AAGCAACCTCCTAATTGCCCTATTATCCTTGATCTTCAATGCAGGCTGTC
    CAGAGCCTATCTGCGCTGGGGCTTTCCTCAGCCTTTTAGTCCTCTTCACG
    AATAATCCAACTGCAGCCTTAGGAACTCATGCAAAGGACTCGGATACAGT
    CATCAGTACCTACACCATCACTGAGTTTTCAGGAATGGGTCCAGTATTAA
    ...
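
Both input forms can be checked locally before submission. The following is a minimal sketch, not part of VGAS: the helper name `read_single_sequence` is hypothetical, and it assumes that whitespace inside the sequence is ignored and that lowercase letters are acceptable, which this page does not state.

```python
ALLOWED = set("ACGTUN")

def read_single_sequence(text):
    """Return (header, sequence) from raw sequence text or single-record FASTA.

    header is None for raw input. Raises ValueError on more than one
    header line or on any letter outside A, C, G, T, U, N.
    """
    lines = [ln.strip() for ln in text.splitlines() if ln.strip()]
    headers = [ln for ln in lines if ln.startswith(">")]
    if len(headers) > 1:
        raise ValueError("expected at most one FASTA header line")
    if headers and not lines[0].startswith(">"):
        raise ValueError("FASTA header must be the first line")
    header = headers[0][1:] if headers else None
    body = lines[1:] if headers else lines
    # Drop internal whitespace and normalise case before checking the alphabet.
    seq = "".join("".join(ln.split()) for ln in body).upper()
    bad = sorted(set(seq) - ALLOWED)
    if bad:
        raise ValueError("unexpected characters: " + "".join(bad))
    return header, seq
```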

Example of Output

  • Here is the default output file.
    Gene prediction results by  VGAS
    Running parameters:
       Minimum gene length: 90 bp
       Maximum squared Euclid distance: 6.90
       Double-stranded virus
       Start codon: ATG
       Stop codons: TAA, TAG and TGA
       Training mode: self-training
    
    Predicted protein-coding genes
         No        Start        Stop         Strand      Length    VZ Score          Annotation
    
         1          111         1526            +         1416      0.44041
         (bit_score:964.14;e-value:0.0;identity:100.0%;gi|34482038|ref|NP_899654.1| nucleocapsid protein N [Fer-de-Lance paramyxovirus])
         (bit_score:231.491;e-value:3.99903e-68;identity:27.65%;gi|41057594|ref|NP_958048.1| nucleocapsid protein [Mossman virus])
         (bit_score:227.254;e-value:1.26575e-66;identity:25.04%;gi|387935516|ref|YP_006347582.1| nucleocapsid protein [Nariva virus])
    
         2         1623         2126            +          504      0.41132
         (bit_score:343.969;e-value:6.77538e-121;identity:100.0%;gi|34482039|ref|NP_899655.1| predicted protein U [Fer-de-Lance paramyxovirus])
    
         3         2246         2929            +          684      0.33987
         (bit_score:468.389;e-value:4.46024e-168;identity:100.0%;gi|34482040|ref|NP_899657.1| cysteine-rich protein V [Fer-de-Lance paramyxovirus])
         (bit_score:328.176;e-value:5.75383e-110;identity:38.69%;gi|34482041|ref|NP_899656.1| phosphoprotein P [Fer-de-Lance paramyxovirus])
    
         4         2949         3533            +          585      0.42081
         (bit_score:396.356;e-value:2.82946e-137;identity:45.22%;gi|34482041|ref|NP_899656.1| phosphoprotein P [Fer-de-Lance paramyxovirus])
    
         5         3678         4733            +         1056      0.45722
         (bit_score:724.161;e-value:0.0;identity:100.0%;gi|34482042|ref|NP_899658.1| matrix protein M [Fer-de-Lance paramyxovirus])
         (bit_score:272.322;e-value:2.29295e-87;identity:40.88%;gi|77124343|ref|YP_338080.1| matrix protein [J-virus])
         (bit_score:268.47;e-value:6.9555e-86;identity:39.41%;gi|89888076|ref|YP_512249.1| matrix protein [Beilong virus])
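
If you want to post-process this report, the gene rows and hit lines can be read with a short script. This is a minimal sketch that assumes the layout is exactly as in the example above (six whitespace-separated columns per gene row, one parenthesised hit per line); `parse_report` and the dictionary keys are hypothetical names, not part of VGAS.

```python
import re

# Pattern for one similarity hit line as it appears in the example report.
HIT_RE = re.compile(
    r"\(bit_score:([^;]+);e-value:([^;]+);identity:([^;%]+)%;(.+)\)$"
)

def parse_report(text):
    """Collect gene rows and their hit lines from report text."""
    genes = []
    for raw in text.splitlines():
        line = raw.strip()
        fields = line.split()
        hit = HIT_RE.match(line)
        if hit and genes:
            # A hit line belongs to the most recent gene row.
            bit, evalue, identity, subject = hit.groups()
            genes[-1]["hits"].append({
                "bit_score": float(bit),
                "e_value": float(evalue),
                "identity": float(identity),
                "subject": subject,
            })
        elif len(fields) == 6 and fields[0].isdigit() and fields[3] in ("+", "-"):
            genes.append({
                "no": int(fields[0]),
                "start": int(fields[1]),
                "stop": int(fields[2]),
                "strand": fields[3],
                "length": int(fields[4]),
                "score": float(fields[5]),
                "hits": [],
            })
    return genes
```

For the rows shown above, Length equals Stop - Start + 1 (for gene 1, 1526 - 111 + 1 = 1416), which makes a convenient sanity check on parsed rows.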
    

  • These are the predicted genes, shown if you select the "Output predicted genes in FASTA format" option.
    ----------------------------------------------------------------------
    PREDICTED GENES
    
    >Potential gene 1:111..1526, 1416 bp
    atggagttgtttgacatcgcagacgggttcgctgatcatcagatcaacctaagatcttctaagcaggcaa
    ctggaagtctcagtgcaatcaaggatcaaatccttgtattgatccctggaacacaagattctgacatttt
    ...
    >Potential gene 2:1623..2126, 504 bp
    atgatcagaacacgcatctacaaaccgacctacacaacaaccacaccacccacatgtcacacccccatca
    agatggaagaagacccgagagagaagatgcatccccaatcaatgtggagactggtgagactgagagcaca
    ...
    >Potential gene 3:2246..2929, 684 bp
    atggctaacttcaatggtttcgaagcaagcagccttattgatcaaggcttagacgacatagaggcaatcg
    gacagatgacctgcattagaccctctgaggagtcaccatacgtagagataccagacactggtatcgtacc
    ...
    

  • These are the protein sequences of the predicted genes, shown if you select the "Translate predicted genes into protein primary sequence" option.
    ----------------------------------------------------------------------
    PROTEIN TRANSLATIONS OF PREDICTED GENES 
    
    >Potential protein 1:111..1526, 471 aa
    MELFDIADGFADHQINLRSSKQATGSLSAIKDQILVLIPGTQDSDILSNLLIALLSLIFNAGCPEPICAG
    AFLSLLVLFTNNPTAALGTHAKDSDTVISTYTITEFSGMGPVLMNRDQVEEFMTNKLNDLIRVIKFPDLF
    ...
    >Potential protein 2:1623..2126, 167 aa
    MIRTRIYKPTYTTTTPPTCHTPIKMEEDPREKMHPQSMWRLVRLRAQRLLSYSESTDLSTREFLEDVSKS
    VVVLFNRDGMSSISQWRTEDCAARRLGNLSKFAWDAVTKGRMDPCRLAFKMVTELGNDVAIRAEILTVVW
    ...
    >Potential protein 3:2246..2929, 227 aa
    MANFNGFEASSLIDQGLDDIEAIGQMTCIRPSEESPYVEIPDTGIVPGIVGKAIGEIESKTNGDGHTSAP
    TPHNTIKGNADKVKKSGETIPDKAEEPQPVQQQDRSKVKESNITMNPDSSGFKQLFNRDTELKTNSWKNT
    ...
    >Potential protein 4:2949..3533, 194 aa
    MAVSDLGVIVRKVIMGNSERDENLTALMMKMQKQLAIQEGKLETLQSTVGKIYAKVDLIKDHVSKYMILT
    REGGKDSQEHEPRRLIQSYTGPGKPEAVINEHGQIRLKGTTRSGTSWNTTPHDLVDPTRLTMSRDESNAT
    ...
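
The relation between the gene and protein listings can be reproduced with a short translation sketch. It assumes the standard genetic code (which genetic code VGAS applies is not stated on this page), reports any codon containing N as X, and stops at the first stop codon; `translate` is a hypothetical helper name.

```python
BASES = "TCAG"
# Standard genetic code, with codons ordered T, C, A, G at each position.
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    a + b + c: AMINO[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}

def translate(gene):
    """Translate a nucleotide sequence up to (not including) the first stop codon."""
    gene = gene.upper().replace("U", "T")
    protein = []
    # A trailing incomplete codon is ignored.
    for i in range(0, len(gene) - 2, 3):
        aa = CODON_TABLE.get(gene[i:i + 3], "X")  # X for codons containing N
        if aa == "*":
            break
        protein.append(aa)
    return "".join(protein)
```

Translating the first line of Potential gene 1 this way gives MELFDIADGFADHQINLRSSKQA, which matches the start of Potential protein 1. The reported lengths are consistent as well: 1416 bp is 472 codons, and dropping the stop codon leaves 471 aa.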