Shetti, a simple tool to parse, manipulate and search large datasets of sequences

Parsing and manipulating long and/or multiple protein or gene sequences can be challenging for experimental biologists and microbiologists who lack prior knowledge of bioinformatics and programming. Here we present a simple, user-friendly and versatile tool to parse, manipulate and search large datasets of long and multiple protein or gene sequences. Shetti can be used to search for a sequence, species, protein/gene or pattern/motif. It can also be used to construct a universal consensus or molecular signatures for proteins based on their physical characteristics. Shetti is fast and handles large sets of long sequences efficiently. It parses UniProt Knowledgebase and NCBI GenBank flat files and displays them as a table.

Single-letter amino acid codes and single-character physical-property symbols used to search for patterns/motifs.
The user can enter amino acid residues directly. Alternative amino acids are bracketed, as in column 2. Some symbols can also be used, as in column 3. The tool then searches for these residues in the sequences. Multiple patterns should be separated by ";". For details, please see the program's user guide.
Note: The symbols presented in this table are customized for searching within the tool only. In PROSITE, "<" and ">" denote N-terminal and C-terminal motifs. In Shetti, please use the check-boxes for terminal searches instead. These symbols or motifs might change based on new literature.
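
The pattern syntax described above (literal residues, bracketed alternatives, braced exclusions, x for any residue, a bracketed repeat count, and ";" between patterns) can be illustrated with a short Python sketch. This is not Shetti's implementation: the function names `motif_to_regex` and `search_motifs` are hypothetical, the translation to regular expressions is our own assumption, and the lower-case physical-property symbols and the terminal check-boxes are not covered.

```python
import re


def motif_to_regex(motif: str) -> str:
    """Translate one motif written in the syntax above into a Python regex.

    Assumed rules (illustrative only): [ED] = E or D, {ED} = anything but
    E or D, x = any residue, (3) = repeat the previous element 3 times.
    """
    motif = motif.replace(" ", "")  # treat "AC (3)T" and "AC(3)T" alike
    out = []
    i = 0
    while i < len(motif):
        ch = motif[i]
        if ch == "[":                      # alternatives, e.g. [ED]
            j = motif.index("]", i)
            out.append("[" + motif[i + 1:j] + "]")
            i = j + 1
        elif ch == "{":                    # exclusions, e.g. {ED}
            j = motif.index("}", i)
            out.append("[^" + motif[i + 1:j] + "]")
            i = j + 1
        elif ch == "(":                    # repeat count for the previous element
            j = motif.index(")", i)
            out.append("{" + str(int(motif[i + 1:j])) + "}")
            i = j + 1
        elif ch == "x":                    # any residue
            out.append(".")
            i += 1
        else:                              # a literal residue or base
            out.append(re.escape(ch))
            i += 1
    return "".join(out)


def search_motifs(patterns: str, sequence: str) -> dict:
    """Return {motif: [0-based start positions]} for ';'-separated motifs."""
    hits = {}
    for motif in filter(None, (p.strip() for p in patterns.split(";"))):
        # A lookahead is used so that overlapping occurrences are reported too.
        regex = re.compile("(?=(" + motif_to_regex(motif) + "))")
        hits[motif] = [m.start() for m in regex.finditer(sequence)]
    return hits
```

For example, `search_motifs("RGD;P[ED]", sequence)` would report the start positions of the RGD motif and of every P followed by E or D in `sequence`.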

Residues                                 Representative motif    Symbol    Physical properties
Arg-Gly-Asp residues                     RGD
Pro-Pro-any amino acid-Tyr               PPxY
P, and (E or D) residues ≈ PE or PD      P[ED]
P, but not (E or D) residues             P{ED}
T is repeated 3 times                    PGST(3)

If all the nucleic acid bases in a position are the same (conserved base), a representative symbol is saved in the consensus. If the bases are different (divergent bases), the representative IUPAC base is written to the consensus. Details can be found in the program documentation.
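
The nucleic acid consensus rule can be sketched in Python using the standard IUPAC ambiguity codes. This is a minimal illustration, not the tool's code: the function name `nucleotide_consensus` is hypothetical, and writing "N" for columns that contain gaps or unexpected characters, as well as treating U as T, are assumptions of ours.

```python
# Standard IUPAC nucleotide ambiguity codes, keyed by the set of bases they cover.
IUPAC = {
    frozenset("A"): "A", frozenset("C"): "C", frozenset("G"): "G", frozenset("T"): "T",
    frozenset("AG"): "R", frozenset("CT"): "Y", frozenset("CG"): "S",
    frozenset("AT"): "W", frozenset("GT"): "K", frozenset("AC"): "M",
    frozenset("CGT"): "B", frozenset("AGT"): "D", frozenset("ACT"): "H",
    frozenset("ACG"): "V", frozenset("ACGT"): "N",
}


def nucleotide_consensus(aligned):
    """Build a consensus from aligned nucleotide sequences of equal length."""
    if len({len(seq) for seq in aligned}) != 1:
        raise ValueError("sequences must be aligned (non-empty input, equal lengths)")
    consensus = []
    for column in zip(*aligned):
        # Assumption: U (RNA) is treated as T so both alphabets share one table.
        bases = frozenset(base.upper().replace("U", "T") for base in column)
        # Conserved column -> the base itself; divergent column -> IUPAC code.
        # Assumption: gaps or unknown characters fall back to "N".
        consensus.append(IUPAC.get(bases, "N"))
    return "".join(consensus)
```

With the aligned sequences ACGT, ACGA and ATGC, the second column holds C and T (IUPAC Y) and the fourth holds T, A and C (IUPAC H), so the sketch yields AYGH.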

Note:
The sequences should be in aligned fasta format. If all the residues in a position are the same (conserved), the single-letter amino acid code is saved in the consensus. If the residues differ (divergent), a representative single-letter symbol is written to the consensus; this symbol represents a physical characteristic shared by the residues in that position. Otherwise, x is written, which means residues that do not share common properties. If a residue in a motif is repeated, the repeat can be written with a count, e.g. AC(3)T, which denotes the motif ACCCT. Details and a use case can be found in the program documentation.
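
The protein consensus rule can be sketched as follows. This is a hedged illustration under stated assumptions: the function name `protein_consensus` and the `PROPERTY_SYMBOLS` mapping are hypothetical, only the symbols documented in this text are included (E and D share the symbol "-", a gap is written as "."), and we assume that any column containing a gap is reported as ".". The full symbol set is given in the program documentation and would be supplied through the mapping.

```python
# Property groups known from this document only; extend from the program docs.
PROPERTY_SYMBOLS = {
    frozenset("ED"): "-",  # negative / acidic residues
}


def protein_consensus(aligned, property_symbols=PROPERTY_SYMBOLS):
    """Build a property-based consensus from aligned protein sequences."""
    if len({len(seq) for seq in aligned}) != 1:
        raise ValueError("sequences must be aligned (non-empty input, equal lengths)")
    consensus = []
    for column in zip(*aligned):
        residues = {res.upper() for res in column}
        if "-" in residues:
            # Assumption: a gap anywhere in the column is written as ".".
            consensus.append(".")
        elif len(residues) == 1:
            # Conserved position: keep the single-letter amino acid code.
            consensus.append(next(iter(residues)))
        else:
            # Divergent position: use the smallest property group that
            # contains every residue in the column, or "x" if none does.
            candidates = [
                (len(group), symbol)
                for group, symbol in property_symbols.items()
                if residues <= group
            ]
            consensus.append(min(candidates)[1] if candidates else "x")
    return "".join(consensus)
```

Choosing the smallest matching group is a design choice of this sketch, so that the most specific shared property wins when groups overlap.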

Note:
The sequence should be in aligned fasta format.

Amino acids in a position    Representative symbol "lower-case letter"    Physical properties
Gap "-"                      . "dot sign"
E and D                      - "minus / hyphen"                           Negative - Acidic

The simplest form of fasta format is the following:
> The first sequence header and description here
The sequence contains protein amino acid residues or ACGTU nucleic acid bases.....
> The second sequence header and description here
The sequence contains protein amino acid residues or ACGTU nucleic acid bases.....
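
Reading this simple fasta layout can be sketched in a few lines of Python. The function name `parse_fasta` is hypothetical and the sketch does not reflect how Shetti itself loads files; it assumes that every record starts with a ">" header line and that the sequence may span several lines.

```python
def parse_fasta(text):
    """Return a list of (header, sequence) pairs from fasta-formatted text."""
    records = []
    header, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            # A new header closes the previous record, if there was one.
            if header is not None:
                records.append((header, "".join(chunks)))
            header, chunks = line[1:].strip(), []
        elif header is not None:
            # Sequence lines are concatenated until the next header.
            chunks.append(line)
    if header is not None:
        records.append((header, "".join(chunks)))
    return records
```

Each returned pair keeps the full header text, so a later step could split it into columns for a table view or show it unchanged in a list view.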

Notes:
Only the standard fasta formats mentioned in Fig. S5  The fasta-formatted headers are used only for the header visualization mode. The table and list modes are different ways to view the fasta headers; they do NOT influence how the data are processed. Whichever view mode is selected, the headers and sequences are loaded into memory and parsed in the same manner.
For the UniProt Knowledgebase and NCBI GenBank flat file formats, further details can be found at the following links: http://web.expasy.org/docs/userman.html and http://www.ncbi.nlm.nih.gov/Sitemap/samplerecord.html (last accessed August 2015).
For further details, please see the program's documentation and the sample files enclosed in the Shetti package compressed file.

The simplest tree format is "(A:0.1,B:0.2);". This raw format can then be parsed by a tree visualization tool to present the tree (Fig. S7). The tree branches can be labelled with accession numbers only (A), or with the name of the species and the accession numbers (B).
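
To show how a raw tree string such as "(A:0.1,B:0.2);" is structured, here is a minimal Python sketch of a parser for this parenthesized (Newick-style) format. It is an illustration only: the names `parse_newick` and `leaves` are hypothetical, quoted labels and comments are not supported, and a dedicated tree visualization tool should be preferred in practice.

```python
def parse_newick(text):
    """Parse a simple tree string into nested (label, length, children) tuples."""
    text = text.strip()
    if not text.endswith(";"):
        raise ValueError("the tree string must end with ';'")
    pos = 0

    def node():
        nonlocal pos
        children = []
        if text[pos] == "(":
            pos += 1                      # consume "("
            children.append(node())
            while text[pos] == ",":
                pos += 1                  # consume ","
                children.append(node())
            if text[pos] != ")":
                raise ValueError("expected ')' at position %d" % pos)
            pos += 1                      # consume ")"
        start = pos
        while text[pos] not in ",():;":
            pos += 1
        label = text[start:pos]           # may be empty for internal nodes
        length = None
        if text[pos] == ":":
            pos += 1
            start = pos
            while text[pos] not in ",();":
                pos += 1
            length = float(text[start:pos])
        return (label, length, children)

    tree = node()
    if text[pos] != ";":
        raise ValueError("unexpected text at position %d" % pos)
    return tree


def leaves(tree):
    """Return (label, branch length) for every tip of a parsed tree."""
    label, length, children = tree
    if not children:
        return [(label, length)]
    return [leaf for child in children for leaf in leaves(child)]
```

The leaf labels are returned exactly as written in the string, so they may hold accession numbers only or a species name together with the accession number.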