About VGAS

VGAS (Viral Genome Annotation System) combines an ab initio method with a similarity-based method. It finds virus genes and annotates their functions using only the sequence itself. We built it on ZCURVE_V, a program we developed previously. VGAS consists of the following seven modules:

Seeking seed ORFs
Training the Fisher coefficients involved in the model to describe the characteristics of coding and non-coding regions
Scoring all ORFs
Performing a homology search to discard false-positive genes, confirm true genes, and obtain function annotations
Checking the remaining ORFs for overlapping
Relocating start sites of predicted genes
Determining the final genes
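
As a rough illustration of the first module, the sketch below enumerates open reading frames on the forward strand. It uses the running parameters that appear in the example output on this page (start codon ATG, stop codons TAA, TAG and TGA, minimum gene length 90 bp). It is not the VGAS implementation: the criteria VGAS uses to pick seed ORFs are not described here, and the name `find_orfs` is hypothetical.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}


def find_orfs(seq, min_len=90):
    """Return (start, stop, length) tuples for forward-strand ORFs.

    Coordinates are 1-based and inclusive, and the length counts the stop
    codon. For each stop codon only the longest ORF is kept, i.e. the one
    beginning at the first in-frame ATG after the previous stop.
    """
    seq = seq.upper().replace("U", "T")
    orfs = []
    for frame in range(3):
        start = None  # index of the first ATG seen since the last stop
        for i in range(frame, len(seq) - 2, 3):
            codon = seq[i:i + 3]
            if start is None and codon == "ATG":
                start = i
            elif codon in STOP_CODONS:
                if start is not None:
                    length = i + 3 - start
                    if length >= min_len:
                        orfs.append((start + 1, i + 3, length))
                start = None
    return sorted(orfs)
```

Scanning the reverse strand would mean running the same function on the reverse complement and mapping the coordinates back.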

The ZCURVE_V paper was published in BMC Bioinformatics in 2005 and has been cited by 47 papers. The original version is available at http://tubic.tju.edu.cn/Zcurve_V/.

Features

Compared with ZCURVE_V, VGAS has been improved in three aspects:

Increasing the number of identifying variables from 33 to 45 to extract more information.
Adding a BLAST search to achieve better results and provide reference function annotations.
Adding five rounds of Fisher discriminant analysis to make the results more precise.

Example of Input

The input file can be a plain DNA (or RNA) sequence. It must contain no letters other than A, C, G, T (or U), and N (for bases you are not sure of).

    ACCCAACAAGGGGAAATAAGTCAACAAGAGAATTAAAATCTATTAAGTTG
    AAATTAGGATCAAAGACTCCAATCTACTGGGATTGTCTTTGATCAATTGT
    CATAACAATTATGGAGTTGTTTGACATCGCAGACGGGTTCGCTGATCATC
    AGATCAACCTAAGATCTTCTAAGCAGGCAACTGGAAGTCTCAGTGCAATC
    AAGGATCAAATCCTTGTATTGATCCCTGGAACACAAGATTCTGACATTTT
    AAGCAACCTCCTAATTGCCCTATTATCCTTGATCTTCAATGCAGGCTGTC
    CAGAGCCTATCTGCGCTGGGGCTTTCCTCAGCCTTTTAGTCCTCTTCACG
    AATAATCCAACTGCAGCCTTAGGAACTCATGCAAAGGACTCGGATACAGT
    CATCAGTACCTACACCATCACTGAGTTTTCAGGAATGGGTCCAGTATTAA
    ...

Alternatively, you can submit a viral genome sequence in FASTA format with a single header (annotation) line.

    >gi|41387225|ref|NC_005084.2| Fer-de-lance virus, complete genome

    ACCCAACAAGGGGAAATAAGTCAACAAGAGAATTAAAATCTATTAAGTTG
    AAATTAGGATCAAAGACTCCAATCTACTGGGATTGTCTTTGATCAATTGT
    CATAACAATTATGGAGTTGTTTGACATCGCAGACGGGTTCGCTGATCATC
    AGATCAACCTAAGATCTTCTAAGCAGGCAACTGGAAGTCTCAGTGCAATC
    AAGGATCAAATCCTTGTATTGATCCCTGGAACACAAGATTCTGACATTTT
    AAGCAACCTCCTAATTGCCCTATTATCCTTGATCTTCAATGCAGGCTGTC
    CAGAGCCTATCTGCGCTGGGGCTTTCCTCAGCCTTTTAGTCCTCTTCACG
    AATAATCCAACTGCAGCCTTAGGAACTCATGCAAAGGACTCGGATACAGT
    CATCAGTACCTACACCATCACTGAGTTTTCAGGAATGGGTCCAGTATTAA
    ...
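
Before submitting, it can help to check a file against the two input rules above. The following is a minimal sketch of such a check. The helper name `read_input` is hypothetical, and accepting lowercase letters and ignoring blank lines are assumptions of this sketch, not documented VGAS behaviour.

```python
ALLOWED = set("ACGTUN")


def read_input(text):
    """Return (header, sequence) from raw or single-record FASTA text.

    header is None for raw sequence input. A ValueError is raised if a
    second header line appears or if the sequence contains any letter
    other than A, C, G, T, U or N.
    """
    header = None
    parts = []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue  # assumption: blank lines are harmless
        if line.startswith(">"):
            if header is not None or parts:
                raise ValueError("only one FASTA header line is expected")
            header = line[1:].strip()
            continue
        parts.append(line.upper())  # assumption: case does not matter
    seq = "".join(parts)
    bad = sorted(set(seq) - ALLOWED)
    if bad:
        raise ValueError("unexpected characters: " + "".join(bad))
    return header, seq
```

For the FASTA example above, `read_input` would return the text after `>` as the header and the concatenated sequence lines as one string.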

Example of Output

Here is a typical output file.

    
Gene prediction results by  VGAS
Running parameters:
   Minimum gene length: 90 bp
   Maximum squared Euclid distance: 6.90
   Double-stranded virus
   Start codon: ATG
   Stop codons: TAA, TAG and TGA
   Training mode: self-training

Predicted protein-coding genes
     No        Start        Stop         Strand      Length    VZ Score          Annotation

     1          111         1526            +         1416      0.44041
     (bit_score:964.14;e-value:0.0;identity:100.0%;gi|34482038|ref|NP_899654.1| nucleocapsid protein N [Fer-de-Lance paramyxovirus])
     (bit_score:231.491;e-value:3.99903e-68;identity:27.65%;gi|41057594|ref|NP_958048.1| nucleocapsid protein [Mossman virus])
     (bit_score:227.254;e-value:1.26575e-66;identity:25.04%;gi|387935516|ref|YP_006347582.1| nucleocapsid protein [Nariva virus])

     2         1623         2126            +          504      0.41132
     (bit_score:343.969;e-value:6.77538e-121;identity:100.0%;gi|34482039|ref|NP_899655.1| predicted protein U [Fer-de-Lance paramyxovirus])

     3         2246         2929            +          684      0.33987
     (bit_score:468.389;e-value:4.46024e-168;identity:100.0%;gi|34482040|ref|NP_899657.1| cysteine-rich protein V [Fer-de-Lance paramyxovirus])
     (bit_score:328.176;e-value:5.75383e-110;identity:38.69%;gi|34482041|ref|NP_899656.1| phosphoprotein P [Fer-de-Lance paramyxovirus])

     4         2949         3533            +          585      0.42081
     (bit_score:396.356;e-value:2.82946e-137;identity:45.22%;gi|34482041|ref|NP_899656.1| phosphoprotein P [Fer-de-Lance paramyxovirus])

     5         3678         4733            +         1056      0.45722
     (bit_score:724.161;e-value:0.0;identity:100.0%;gi|34482042|ref|NP_899658.1| matrix protein M [Fer-de-Lance paramyxovirus])
     (bit_score:272.322;e-value:2.29295e-87;identity:40.88%;gi|77124343|ref|YP_338080.1| matrix protein [J-virus])
     (bit_score:268.47;e-value:6.9555e-86;identity:39.41%;gi|89888076|ref|YP_512249.1| matrix protein [Beilong virus])
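
If you want to process this table in a script, the layout shown above can be read with two regular expressions, one for the gene rows and one for the parenthesised homology hits. The sketch below is based only on the sample output on this page. The function name `parse_vgas_output` and the dictionary keys are hypothetical, and the use of `-` for reverse-strand genes is an assumption, since the sample shows only `+`.

```python
import re

# Gene row: No, Start, Stop, Strand, Length, VZ Score
GENE_RE = re.compile(
    r"^\s*(\d+)\s+(\d+)\s+(\d+)\s+([+-])\s+(\d+)\s+(-?\d+(?:\.\d+)?)\s*$"
)
# Hit row: (bit_score:...;e-value:...;identity:...%;subject description)
HIT_RE = re.compile(
    r"^\s*\(bit_score:([^;]+);e-value:([^;]+);identity:([^;%]+)%;(.*)\)\s*$"
)


def parse_vgas_output(text):
    """Return a list of gene dicts, each with a list of its homology hits."""
    genes = []
    for line in text.splitlines():
        m = GENE_RE.match(line)
        if m:
            no, start, stop, strand, length, score = m.groups()
            genes.append({
                "no": int(no), "start": int(start), "stop": int(stop),
                "strand": strand, "length": int(length),
                "score": float(score), "hits": [],
            })
            continue
        m = HIT_RE.match(line)
        if m and genes:  # a hit belongs to the most recent gene row
            bit_score, evalue, identity, subject = m.groups()
            genes[-1]["hits"].append({
                "bit_score": float(bit_score), "evalue": float(evalue),
                "identity": float(identity), "subject": subject,
            })
    return genes
```

In the sample, each gene's length equals `stop - start + 1`, for example 1526 - 111 + 1 = 1416 for gene 1, which gives a quick sanity check on parsed rows.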

These are the predicted genes, produced if you select the "Output predicted genes in FASTA format" option.

----------------------------------------------------------------------
PREDICTED GENES

>Potential gene 1:111..1526, 1416 bp
atggagttgtttgacatcgcagacgggttcgctgatcatcagatcaacctaagatcttctaagcaggcaa
ctggaagtctcagtgcaatcaaggatcaaatccttgtattgatccctggaacacaagattctgacatttt
...
>Potential gene 2:1623..2126, 504 bp
atgatcagaacacgcatctacaaaccgacctacacaacaaccacaccacccacatgtcacacccccatca
agatggaagaagacccgagagagaagatgcatccccaatcaatgtggagactggtgagactgagagcaca
...
>Potential gene 3:2246..2929, 684 bp
atggctaacttcaatggtttcgaagcaagcagccttattgatcaaggcttagacgacatagaggcaatcg
gacagatgacctgcattagaccctctgaggagtcaccatacgtagagataccagacactggtatcgtacc
...
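
The header of each record carries 1-based, inclusive coordinates into the submitted genome, so a predicted gene can be cut back out of the input for checking. The sketch below parses a header and extracts the corresponding substring. The names `parse_gene_header` and `extract_gene` are hypothetical, and the reverse-complement handling for `-` strand genes is an assumption, since the sample contains only forward-strand genes.

```python
import re

HEADER_RE = re.compile(r"^>Potential gene (\d+):(\d+)\.\.(\d+), (\d+) bp$")
COMPLEMENT = str.maketrans("ACGTN", "TGCAN")


def parse_gene_header(line):
    """Return (number, start, stop) and check that the stated length fits."""
    m = HEADER_RE.match(line.strip())
    if m is None:
        raise ValueError("not a predicted-gene header: " + line)
    number, start, stop, length = map(int, m.groups())
    if stop - start + 1 != length:
        raise ValueError("coordinates and length disagree")
    return number, start, stop


def extract_gene(genome, start, stop, strand="+"):
    """Cut positions start..stop (1-based, inclusive) out of the genome."""
    sub = genome.upper().replace("U", "T")[start - 1:stop]
    if strand == "-":
        sub = sub.translate(COMPLEMENT)[::-1]  # assumed convention
    return sub.lower()


# First 150 nt of the example input shown earlier on this page.
genome = (
    "ACCCAACAAGGGGAAATAAGTCAACAAGAGAATTAAAATCTATTAAGTTG"
    "AAATTAGGATCAAAGACTCCAATCTACTGGGATTGTCTTTGATCAATTGT"
    "CATAACAATTATGGAGTTGTTTGACATCGCAGACGGGTTCGCTGATCATC"
)
number, start, stop = parse_gene_header(">Potential gene 1:111..1526, 1416 bp")
print(extract_gene(genome, start, start + 29))  # atggagttgtttgacatcgcagacgggttc
```

The printed fragment matches the first 30 letters of "Potential gene 1" in the listing above.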

These are the protein sequences of the predicted genes, produced if you select the "Translate predicted genes into protein primary sequence" option.

----------------------------------------------------------------------
PROTEIN TRANSLATIONS OF PREDICTED GENES 

>Potential protein 1:111..1526, 471 aa
MELFDIADGFADHQINLRSSKQATGSLSAIKDQILVLIPGTQDSDILSNLLIALLSLIFNAGCPEPICAG
AFLSLLVLFTNNPTAALGTHAKDSDTVISTYTITEFSGMGPVLMNRDQVEEFMTNKLNDLIRVIKFPDLF
...
>Potential protein 2:1623..2126, 167 aa
MIRTRIYKPTYTTTTPPTCHTPIKMEEDPREKMHPQSMWRLVRLRAQRLLSYSESTDLSTREFLEDVSKS
VVVLFNRDGMSSISQWRTEDCAARRLGNLSKFAWDAVTKGRMDPCRLAFKMVTELGNDVAIRAEILTVVW
...
>Potential protein 3:2246..2929, 227 aa
MANFNGFEASSLIDQGLDDIEAIGQMTCIRPSEESPYVEIPDTGIVPGIVGKAIGEIESKTNGDGHTSAP
TPHNTIKGNADKVKKSGETIPDKAEEPQPVQQQDRSKVKESNITMNPDSSGFKQLFNRDTELKTNSWKNT
...
>Potential protein 4:2949..3533, 194 aa
MAVSDLGVIVRKVIMGNSERDENLTALMMKMQKQLAIQEGKLETLQSTVGKIYAKVDLIKDHVSKYMILT
REGGKDSQEHEPRRLIQSYTGPGKPEAVINEHGQIRLKGTTRSGTSWNTTPHDLVDPTRLTMSRDESNAT
...