A home for Biosciences students: Software and Databases for Computational Biology on the Internet

This page is a supplement to the book Computational Methods in Molecular Biology, edited by Steven Salzberg, David Searls, and Simon Kasif. The publisher is Elsevier Science. Please contact Steven Salzberg (salzberg@cs.jhu.edu) if you wish to have your software referenced on this site, or if you wish to change the description of your software already listed here.

____________________________________________________________________

Glimmer is a system that uses Interpolated Markov Models (IMMs) to identify coding regions in microbial DNA. IMMs are a generalization of Markov models that allow great flexibility in the choice of the "context", i.e., how many previous bases to use in predicting the next base. Glimmer has been tested on the complete genomes of H. influenzae, E. coli, H. pylori, M. genitalium, and other organisms, and results to date have proven it to be highly accurate. Glimmer was the principal gene finder for the genomes of B. burgdorferi, T. pallidum, C. trachomatis, C. pneumoniae, D. radiodurans, T. maritima, and others. The complete system, including source code, is available from this site. A version of the system built for the malaria parasite, GlimmerM, is also available.
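The idea of interpolating across context lengths can be illustrated with a small sketch. This is not Glimmer's actual weighting scheme: the function names, the `threshold` parameter, and the rule "a context seen n times gets weight min(1, n / threshold)" are assumptions made only for illustration.

```python
from collections import defaultdict

def train_counts(sequences, max_order):
    """Count how often each base follows every context of length 0..max_order."""
    counts = defaultdict(lambda: defaultdict(int))
    for seq in sequences:
        for i, base in enumerate(seq):
            for k in range(max_order + 1):
                if i - k < 0:
                    break
                counts[seq[i - k:i]][base] += 1
    return counts

def imm_prob(counts, context, base, max_order, threshold=10):
    """Interpolated probability of `base` given the preceding `context`.

    Hypothetical weighting (not Glimmer's): a context observed n times gets
    weight min(1, n / threshold); the rest of the weight falls back to the
    estimate from the next shorter context.
    """
    prob = 0.25  # uniform fallback used before any counts are consulted
    for k in range(min(max_order, len(context)) + 1):
        ctx = context[len(context) - k:]  # last k bases; "" when k == 0
        seen = counts.get(ctx)
        total = sum(seen.values()) if seen else 0
        if total == 0:
            break  # longer contexts were never observed either
        weight = min(1.0, total / threshold)
        prob = weight * (seen.get(base, 0) / total) + (1.0 - weight) * prob
    return prob

counts = train_counts(["ACGTACGTACGT"], max_order=2)
p_g = imm_prob(counts, "AC", "G", max_order=2)
```

Rarely seen long contexts contribute little, so the estimate leans on shorter, better-sampled contexts, which is the flexibility the description above refers to.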
GENSCAN is a program designed to predict complete gene structures, including exons, introns, promoter and poly-adenylation signals, in genomic sequences. It differs from the majority of existing gene finding algorithms in that it allows for partial genes as well as complete genes and for the occurrence of multiple genes in a single sequence, on either or both DNA strands. Program versions suitable for vertebrate, nematode (experimental), maize and Arabidopsis sequences are currently available. The vertebrate version also works fairly well for Drosophila sequences. Sequences can be submitted on a web-based form at this site. The GENSCAN Web site is at Stanford University.
GeneSplicer is a fast, flexible system for detecting splice sites in the genomic DNA of various eukaryotes. The system has been trained and tested successfully on Plasmodium falciparum (malaria), Arabidopsis thaliana, human, Drosophila, and rice. It was compared to six programs representing the leading splice site detectors for Arabidopsis thaliana and human: NetPlantGene, NetGene, HSPL, NNSplice, GENIO and SpliceView. In each case GeneSplicer performed comparably to the best alternative, in terms of both accuracy and computational efficiency.
GlimmerHMM is an interpolated Markov model system for finding genes in many eukaryotes, including Plasmodium falciparum, Arabidopsis thaliana, rice, and others. GlimmerHMM additionally incorporates splice site models adapted from the GeneSplicer program and a decision tree adapted from GlimmerM. The program utilizes Interpolated Markov Models as well as the Maximal Dependence Decomposition technique for improving specificity in splice site identification. Its model includes states for exons, introns, and intergenic regions. The sources are freely available for download at this site.
TigrScan models DNA using a Generalized Hidden Markov Model (GHMM). Alternate parses of DNA (into zero or more gene models) are evaluated under this model. The procedure that evaluates an input string under an HMM or GHMM is referred to as a decoding algorithm. TigrScan implements a novel decoding algorithm for GHMMs that is both time-efficient and space-efficient, through the use of queues and propagators. The sources, training data and a users' guide are available here.
TransTerm is a program that finds rho-independent transcription terminators in bacterial genomes. Each terminator found by the program is assigned a confidence value that provides an estimate of its probability of being a true terminator.
PIRATE (Prediction Informatics Resources And Techniques) is a central repository of open-source bioinformatics prediction programs and reusable software components, documentation, training data, experimental results, tips and tricks, and external links.
MUMmer is a system for aligning whole genome sequences. Using a suffix tree, the system is able to rapidly align sequences containing millions of nucleotides. Use of the algorithm should facilitate analysis of syntenic chromosomal regions, strain-to-strain comparisons, evolutionary comparisons, and genomic duplications.
Bambus is the first publicly available scaffolding program. It orders and orients contigs into scaffolds based on various types of linking information. Additionally, Bambus allows users to build scaffolds in a hierarchical fashion by prioritizing the order in which links are used.
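Decoding in the sense described above can be illustrated with the standard Viterbi algorithm on an ordinary HMM. This is a simplified stand-in, not TigrScan's GHMM decoder with queues and propagators; the two states and all probabilities below are made up for the example.

```python
import math

def viterbi(seq, states, start_p, trans_p, emit_p):
    """Return the most probable state path for `seq` under a plain HMM (log space)."""
    # best[i][s] = log-probability of the best path that ends in state s at position i
    best = [{s: math.log(start_p[s]) + math.log(emit_p[s][seq[0]]) for s in states}]
    back = [{}]
    for i in range(1, len(seq)):
        best.append({})
        back.append({})
        for s in states:
            prev, score = max(
                ((p, best[i - 1][p] + math.log(trans_p[p][s])) for p in states),
                key=lambda item: item[1],
            )
            best[i][s] = score + math.log(emit_p[s][seq[i]])
            back[i][s] = prev
    # Trace back from the best final state
    last = max(states, key=lambda s: best[-1][s])
    path = [last]
    for i in range(len(seq) - 1, 0, -1):
        path.append(back[i][path[-1]])
    return path[::-1]

# Toy model with made-up numbers: GC-rich "coding" vs AT-rich "noncoding"
states = ["coding", "noncoding"]
start_p = {"coding": 0.5, "noncoding": 0.5}
trans_p = {"coding": {"coding": 0.9, "noncoding": 0.1},
           "noncoding": {"coding": 0.1, "noncoding": 0.9}}
emit_p = {"coding": {"A": 0.1, "C": 0.4, "G": 0.4, "T": 0.1},
          "noncoding": {"A": 0.4, "C": 0.1, "G": 0.1, "T": 0.4}}

path = viterbi("GGGGGGAAAAAA", states, start_p, trans_p, emit_p)
```

In a GHMM each state emits a whole segment (for example an exon) rather than a single base, which is what makes decoding more expensive and motivates specialized algorithms.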
Genie, a gene finder based on generalized hidden Markov models, is at the Lawrence Berkeley National Laboratory. It was developed in collaboration with the Computational Biology Group at the University of California, Santa Cruz. Genie uses a statistical model of genes called a Generalized Hidden Markov Model (GHMM) to find genes in vertebrate and human DNA. In a GHMM, probabilities are assigned to transitions between states and to the generation of each nucleotide base given a particular state. Machine learning techniques are applied to optimize these probabilities using a standardized gene data set, which is available on this site. The page has a link to the Genie Web server, to which sequences may be submitted.
GRAIL provides analysis and putative annotation of DNA sequences both interactively and through the use of automated computation. GRAIL is a tool for the identification of genes, exons, and various features in DNA sequences. This system is at the Oak Ridge National Laboratory in Tennessee.
The FGENE family of programs finds splice sites, genes, promoters, and poly-A recognition regions in eukaryotic sequence data. The underlying technology uses linear discriminant analysis. You can submit sequences to FGENE using a Web interface found at http://www.softberry.com/.
The GeneID server contains the GeneID system for finding genes in eukaryotes. GeneID is a hierarchical rule-based system, with scoring matrices to identify signals and rules to score coding regions. You can use this page to submit a genomic DNA sequence to the GeneID program. The GeneID site is at www1.imim.es/software/geneid/ in Spain, and is also available at this page at Boston University.
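Scoring a signal with a matrix can be sketched as a log-odds position weight matrix slid along a sequence. This is a generic illustration, not GeneID's actual matrices or rules; the example sites, the pseudocount, and the uniform background are assumptions, and the sequence is assumed to contain only A, C, G and T.

```python
import math

def build_pwm(sites, pseudocount=1.0, background=0.25):
    """Build a log-odds position weight matrix from equal-length aligned sites."""
    length = len(sites[0])
    pwm = []
    for pos in range(length):
        column = [site[pos] for site in sites]
        scores = {}
        for base in "ACGT":
            freq = (column.count(base) + pseudocount) / (len(sites) + 4 * pseudocount)
            scores[base] = math.log2(freq / background)
        pwm.append(scores)
    return pwm

def best_hit(pwm, seq):
    """Slide the matrix along `seq`; return (offset, score) of the best window."""
    width = len(pwm)
    hits = [
        (sum(pwm[j][seq[i + j]] for j in range(width)), i)
        for i in range(len(seq) - width + 1)
    ]
    score, offset = max(hits)
    return offset, score

# Made-up donor-like sites, used only to show the mechanics
pwm = build_pwm(["GTAAGT", "GTGAGT", "GTAAGA", "GTAAGT"])
offset, score = best_hit(pwm, "CCCCGTAAGTCCCC")
```

A rule-based system would then combine such signal scores with coding-region scores to assemble candidate genes.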
GeneMark is a system for finding genes in bacterial DNA sequences. The algorithm is based on non-homogeneous 5th-order Markov chains, and it was used to locate the genes in the complete genomes of H. influenzae, M. genitalium, and several other complete genomes. The site includes documentation and a Web interface to which sequences can be submitted. This system is at the Georgia Institute of Technology in Atlanta, GA.
MarFinder uses statistical patterns to deduce the presence of MARs (Matrix Association Regions) in DNA sequences. MARs constitute a significant functional block and have been shown to facilitate the processes of differential gene expression and DNA replication. This tool and Web site are at the Futuresoft Corporation.
NetPlantGene is at the Technical University of Denmark. The NetPlantGene Web server uses neural networks to predict splice sites in Arabidopsis thaliana DNA. This site also contains programs for other sequence analysis problems, such as the recognition of signal peptides. NetPlantGene is to be replaced with NetGene2.
MZEF and Pombe. This page contains software tools designed to predict putative internal protein coding exons in genomic DNA sequences. Human, mouse and Arabidopsis exons are predicted by a program called MZEF, and fission yeast exons are predicted by a program called Pombe. The site is located at the Cold Spring Harbor Laboratory.
Promoter Prediction by Neural Network (NNPP) is a method that finds eukaryotic and prokaryotic promoters in a DNA sequence. The basis of the NNPP program is a time-delay neural network. The time-delay network consists mainly of two feature layers, one for recognizing the TATA-box and one for recognizing the "Initiator", which is the region spanning the transcription start site. Both feature layers are combined into one output unit, which gives output scores between 0 and 1. This site is at the Lawrence Berkeley National Laboratory. Also available at this site is the splice site predictor used by the Genie system. The output of this neural network is a score between 0 and 1 indicating a potential splice site.
SplicePredictor is a program designed to predict donor and acceptor splice sites in maize and Arabidopsis sequences. Sequences can be submitted on a web-based form at this site. The system is at Stanford University.
Combiner is a program that predicts gene models using the output from other annotation software. It uses a statistical algorithm to identify patterns of evidence corresponding to gene models.
TESS (Transcription Element Search Software) is a set of software routines for locating and displaying transcription factor binding sites in DNA sequence. TESS uses the Transfac database as its store of transcription factors and their binding sites. This page is at the University of Pennsylvania's Computational Biology and Informatics Laboratory.
Genotator, a workbench for automated sequence annotation, provides a flexible, transparent system for automatically running a series of sequence analysis programs on genetic sequences. It also has a graphical display that allows users to view all of the automatically generated annotations and add their own. Genotator's display allows annotated sequences to be examined at multiple levels of detail, from an overview of the entire sequence down to individual bases. By displaying the aligned output of multiple types of sequence analysis, Genotator provides an intuitive way to identify the significant regions (for example, probable exons) in a sequence. Genotator was developed by Nomi Harris at Lawrence Berkeley National Laboratory.
WebGene (GenView, ORFGene, SpliceView) is a Web interface for several coding region recognition programs, including:
- GenView: a system for protein-coding gene prediction
- ORFGene: gene structure prediction using information on homologous protein sequences
- SpliceView: prediction of splicing signals
- HCpolya: a Hamming clustering method for poly-A prediction in eukaryotic genes
This page is at the Istituto di Tecnologie Biomediche Avanzate in Italy.
The Staden Package contains a wealth of useful programs for sequence assembly, DNA sequence comparison and analysis, protein sequence analysis, and sequencing project quality control. The site is mirrored in several locations around the world.
PatternHunter is a new general-purpose homology search tool, based on innovative and proprietary technologies. It provides all the tools necessary for fast and sensitive homology search in all flavors, including DNA-DNA, protein-protein, translated DNA-protein, and translated DNA-DNA searches.
GeneHacker is a system for gene structure prediction in microbial genomes using a hidden Markov model (HMM). The HMM adopted in GeneHacker describes the start codon and the downstream di-codon frequencies in protein coding regions, and can identify such regions in uncharacterized DNA sequences.
Geneid is a program to predict genes along a DNA sequence in a large set of organisms.
GeneSeqer is a gene identification tool based on spliced alignment or "spliced threading" of ESTs with a genomic query sequence.

Databases:
GenBank is the NIH genetic sequence database, an annotated collection of all publicly available DNA sequences. GenBank is part of the International Nucleotide Sequence Database Collaboration, which comprises the DNA DataBank of Japan (DDBJ), the European Molecular Biology Laboratory (EMBL), and GenBank at the National Center for Biotechnology Information.
Genome Assembly Archive: The NCBI Assembly Archive links the raw sequence information found in the Trace Archive with assembly information found in publicly available sequence repositories (GenBank/EMBL/DDBJ). The Assembly Viewer allows a user to see the multiple sequence alignments as well as the actual sequence chromatogram.
GeneCards is a database of human genes, their products and their involvement in diseases. It offers concise information about the functions of all human genes that have an approved symbol, as well as selected others. It is especially useful for those working in functional genomics and proteomics. The data is collected with knowledge discovery and data mining techniques and accessed by means of a proprietary Guidance System that suggests to the user where and how the information may be retrieved.
The EpoDB (Erythropoiesis Database) is a database of genes that relate to vertebrate red blood cells. A detailed description of EpoDB can be found in Chapter 5. The database includes DNA sequence, structural features and potential transcription factor binding sites. This Web site is at the University of Pennsylvania's CBIL.
The LENS (Linking ESTs and their associated Name Space) database links and resolves the names and identifiers of clones and ESTs generated in the I.M.A.G.E. Consortium/WashU/Merck EST project. The name space includes library and clone IDs and names from IMAGE Consortium, EST sequence IDs from Washington University, sequence entry accession numbers from dbEST/NCBI, and library and clone IDs from GDB. LENS allows for querying of IMAGE Consortium data via all the different IDs.
PDD, the NIMH-NCI Protein-Disease Database, is at the Laboratory of Experimental and Computational Biology at the National Cancer Institute. This server is part of the NIMH-NCI Protein-Disease Database project for correlating diseases with proteins observable in serum, CSF, urine and other common human body fluids, based on the biomedical literature.
The TRANSFAC Database is at the Gesellschaft für Biotechnologische Forschung mbH (Germany). TRANSFAC is a transcription factor database. It compiles data about gene regulatory DNA sequences and protein factors binding to them. On this basis, programs are developed that help to identify putative promoter or enhancer structures and to suggest their features.
SRS is a homogeneous interface to over 80 biological databases, developed at the European Bioinformatics Institute (EBI) at Hinxton, UK. The databases cover sequences and sequence-related data, metabolic pathways, transcription factors, application results (e.g. BLAST), protein 3D structure, genomes, mutations and locus-specific mutations.
Homstrad (HOMologous STRucture Alignment Database) is a web-accessible database at the University of Cambridge. Homstrad contains families of proteins of known structure that share sequence/structural similarity. The classifications proposed by various databases including SCOP, Pfam, PROSITE and SMART and the results from sequence similarity searches by PSI-BLAST and FUGUE are combined.

Motif Search:
PROSITE Search Form allows you to rapidly compare a protein sequence against all patterns stored in the PROSITE pattern database. It answers the question: Which patterns from the PROSITE database are found in my sequence? (EBI)
ScanProsite: The Protein against PROSITE form allows one to scan a protein sequence (either from SWISS-PROT or provided by the user) for occurrences of patterns stored in the PROSITE database. The Pattern against SWISS-PROT form scans the entire SWISS-PROT database (including weekly releases) for occurrences of a pattern that can originate from PROSITE or be provided by the user. (ExPASy)
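Scanning a sequence for a PROSITE-style pattern can be sketched by translating the pattern syntax into a regular expression. The helper names below are hypothetical and this is not the code behind either server; the translation covers the common syntax elements (x, [..], {..}, (n), (n,m), < and >) and does not handle an anchor placed inside square brackets.

```python
import re

def prosite_to_regex(pattern):
    """Translate a PROSITE-style pattern into a Python regular expression."""
    regex = pattern.rstrip(".").replace("-", "")
    regex = regex.replace("x", ".")                      # x: any amino acid
    regex = regex.replace("{", "[^").replace("}", "]")   # {..}: excluded residues
    regex = regex.replace("(", "{").replace(")", "}")    # (n) or (n,m): repetition
    regex = regex.replace("<", "^").replace(">", "$")    # N- and C-terminal anchors
    return regex

def find_pattern(pattern, protein):
    """Return (1-based start, matched text) for each non-overlapping match."""
    regex = re.compile(prosite_to_regex(pattern))
    return [(m.start() + 1, m.group()) for m in regex.finditer(protein)]

# The N-glycosylation site pattern, scanned against a made-up sequence
hits = find_pattern("N-{P}-[ST]-{P}.", "MKNGSAANPSL")
```

Because `finditer` reports non-overlapping matches only, overlapping occurrences of a pattern would need a lookahead-based scan instead.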
ELPH is a general-purpose Gibbs sampler for finding motifs in a set of DNA or protein sequences. The program takes as input a set containing anywhere from a few dozen to thousands of sequences, and searches through them for the most common motif, assuming that each sequence contains one copy of the motif. We have used ELPH to find patterns such as ribosome binding sites (RBSs) and exon splicing enhancers (ESEs).
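The sampling idea can be sketched with a bare-bones Gibbs site sampler for DNA that assumes exactly one motif copy per sequence. This is not ELPH's implementation: the function name, the pseudocount of 1, the fixed iteration count, and the omission of a background model are all simplifying assumptions.

```python
import random

def gibbs_motif_search(seqs, width, iterations=200, seed=0):
    """Sample one motif start per sequence; returns the final list of starts."""
    rng = random.Random(seed)
    # Begin with a random motif position in every sequence
    pos = [rng.randrange(len(s) - width + 1) for s in seqs]
    for _ in range(iterations):
        for i, seq in enumerate(seqs):
            # Build a profile from the current sites of all OTHER sequences
            others = [s[p:p + width]
                      for j, (s, p) in enumerate(zip(seqs, pos)) if j != i]
            profile = []
            for col in range(width):
                column = [site[col] for site in others]
                profile.append({b: (column.count(b) + 1) / (len(others) + 4)
                                for b in "ACGT"})
            # Weight every window of sequence i by its probability under the profile
            weights = []
            for start in range(len(seq) - width + 1):
                w = 1.0
                for col in range(width):
                    w *= profile[col][seq[start + col]]
                weights.append(w)
            # Resample the site for sequence i in proportion to those weights
            pos[i] = rng.choices(range(len(weights)), weights=weights)[0]
    return pos

starts = gibbs_motif_search(["ACGTTGACCA", "TTGACGGTAC", "GGTTGACATA"], width=5)
```

Because the procedure is stochastic, practical samplers usually run several restarts and keep the highest-scoring set of sites.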
Motifs in protein databases: This program determines whether a protein motif is present in a database of protein sequences. It allows the user to define a protein motif and then determine whether a DNA sequence might encode it or whether it is present in a protein database. The program does not search a library of predefined protein motifs; a motif is defined by entering the amino acids of interest at each position. (Alces)
MatInspector is a tool for the detection of transcription factor binding sites. It is able to locate matches in sequences of unlimited length and compare one, several or all sequences in a sequence file against all or selected subsets of matrices from a library of matrix descriptions of protein binding sites. (GSF)
MEME (Multiple EM for Motif Elicitation) allows one to discover motifs (highly conserved regions) in groups of related DNA or protein sequences, and to search sequence databases with those motifs using MAST. MAST works by calculating match scores for each sequence. The match scores are converted into various types of p-values, and these are used to determine the overall match of the sequence to the group of motifs. (SDSC)
BCM Search Launcher: The Baylor College of Medicine has a variety of biology-related search and analysis services, including general protein sequence/pattern searches and species-specific protein sequence searches. (HGSC)
Screening pattern or alignment against PROTEIN databank: This method of searching the PROTEIN databank for all occurrences of a pattern is almost the same as the PROSITE screening procedure. The one difference is that a match between a pattern letter and a fragment letter can be interpreted broadly, as similarity of letters according to a weight matrix selected by the user. (genebee)
PPSEARCH: PROSITE database searches (sequence against databases of motifs). Allows you to search sequences for motifs or functional patterns in the PROSITE database. (EBI)
COGnitor: Compare your sequence to the COG (Clusters of Orthologous Groups) database. Each COG consists of individual proteins or groups of paralogs from at least 3 lineages and thus corresponds to an ancient conserved domain. (NCBI)
HMMER (Sean Eddy): Profile hidden Markov models can be used to do sensitive database searching using statistical descriptions of a sequence family's consensus. The advantage of using HMMs is that they have a formal probabilistic basis and can be trained from unaligned sequences if a trusted alignment isn't yet known. They do, however, make poor models of RNAs because they cannot describe base pairs. HMMER is a freely distributable implementation of profile HMM software for protein sequence analysis. (Washington Univ.)
emotif is a research system that forms motifs for subsets of aligned sequences. emotif ranks the motifs that it finds by both their specificity and the number of supplied sequences that they cover. (Stanford Bioinformatics Group)
FunSiteP Promoter Recognition: Recognition and classification of eukaryotic promoters by searching for transcription factor binding sites using binding site consensus sequences. (GSF)
SMART (Simple Modular Architecture Research Tool): Allows rapid identification and annotation of signalling protein domain sequences. It is able to determine the modular architectures of single sequences or genomes. (EBI)
SAM: Sequence Alignment and Modeling System using HMMs (Hidden Markov Models). SAM is a collection of software tools for creating, refining and using linear HMMs for biological sequence analysis. Documentation for SAM can be found here. (Pasteur)
Mreps is flexible and efficient software for identifying serial repeats (usually called tandem repeats) in DNA sequences. It is developed at LORIA in the Adage group.
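What a tandem repeat looks like in practice can be shown with a much simpler technique than the one Mreps uses: a regular expression with a backreference that finds exact repeats only. The function name and the default unit lengths and copy count are assumptions for the example.

```python
import re

def tandem_repeats(seq, min_unit=2, max_unit=6, min_copies=3):
    """Find exact tandem repeats with a backreference regex.

    Returns (0-based start, repeat unit, number of copies) tuples.
    """
    # Group 2 is the repeat unit (shortest possible, because of the lazy
    # quantifier); \2{n,} then requires at least n further copies of it.
    pattern = re.compile(
        r"(([ACGT]{%d,%d}?)\2{%d,})" % (min_unit, max_unit, min_copies - 1)
    )
    results = []
    for m in pattern.finditer(seq):
        unit = m.group(2)
        results.append((m.start(), unit, len(m.group(1)) // len(unit)))
    return results

repeats = tandem_repeats("GGCACACACATT")
```

Approximate repeats with mismatches or indels, which dedicated tools are designed to handle, are outside what this exact-match sketch can detect.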

Secondary Structure Prediction:
THREADER 2 is a program for predicting protein tertiary structure by recognizing the correct fold from a library of alternatives. Of course, if a fold similar to the native fold of the protein being predicted is not in the library, then this approach will not succeed. Fortunately, certain folds crop up time and time again, and so fold recognition methods for predicting protein structure can be very effective. In the first prediction contest held at Asilomar, organized by John Moult and colleagues, THREADER correctly identified 8 out of 11 target structures which either globally or locally resembled a previously observed fold. Preliminary analysis of the results from the second competition (CASP2) shows that THREADER 2 has clearly improved in both fold recognition sensitivity and sequence-structure alignment accuracy. In CASP2, the new version of THREADER recognized 4 folds correctly out of 6 targets with recognizable structures (including the difficult task of assigning a jelly-roll fold rather than other beta-sandwich topologies for one target). THREADER 2 produced more correct fold predictions (i.e. correct folds ranked at No. 1) than any other method.
PredictProtein is a service for sequence analysis and structure prediction. Once you submit a protein sequence, PredictProtein retrieves similar sequences in the database and predicts aspects of protein structure, residue solvent accessibility and helical transmembrane regions. (EMBL)
RAPTOR and Prospect Pro are software tools for protein structure prediction. Based on a novel linear programming method, the software combines sequence-to-sequence and sequence-to-structure search methods with advanced analysis tools in one integrated solution.
NNPREDICT Protein Secondary Structure Prediction: A program that predicts the secondary structure type for each residue in an amino acid sequence. The basis of the prediction is a two-layer, feed-forward neural network. NNPREDICT takes as input a protein sequence and returns a secondary structure prediction for each position in the sequence. (UCSF)
PSA Protein Structure Prediction Server: Predicts probable secondary structures and folding classes for a given amino acid sequence. It performs three types of protein structure/sequence analysis:
1. Analysis of full length amino acid sequences that are assumed to be monomeric globular, water-soluble proteins consisting of a single domain
2. Analysis of either complete sequences, or sequence fragments with a minimal set of modelled structural assumptions
3. Analysis of potential WD-repeat protein family sequences
(BMERC at Boston University)
SSCP (Secondary Structural Content Prediction) computes predictions for the content of helix, strand, and coil for a given protein using the amino acid composition as the only input information. The method applies analytic vector decomposition to the composition vector of the query protein.
RNA secondary structure prediction: If the user supplies a multiple alignment, information on its conserved positions and on compensatory substitutions at some of them is used: stems that include such positions are more likely to be included in the resulting secondary structure. The algorithm works as follows. First, all possible ways of pairing different segments of the sequence are enumerated. Then locally optimal secondary structures are built from the helices found. Finally, the structure is assembled by optimizing the model energy of the system, which includes contributions from conserved and complementary pairs with corresponding coefficients. (Genebee)
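The composition vector that serves as the only input can be computed in a few lines. This sketch covers the input step only; the vector decomposition that produces the helix, strand and coil estimates is not reproduced here, and the function name and residue ordering are assumptions.

```python
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard residues, alphabetical

def composition_vector(protein):
    """Return the fraction of each standard amino acid, in AMINO_ACIDS order."""
    counts = Counter(protein.upper())
    total = sum(counts[aa] for aa in AMINO_ACIDS)
    if total == 0:
        raise ValueError("sequence contains no standard amino acids")
    # Non-standard symbols (e.g. X) are ignored rather than counted
    return [counts[aa] / total for aa in AMINO_ACIDS]

vector = composition_vector("MKTAYIAKQRQISFVKSHFSRQ")
```

Since the vector discards residue order entirely, any method built on it can estimate overall content but not where in the sequence each element lies.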
SoWhat: The SoWhat WWW server predicts distance constraints between amino acids in proteins from the amino acid sequence. It uses a neural network based method to predict contacts between C-alpha atoms from the amino acid sequence. (CBS Denmark)
Pasteur Institute:
- STRIDE: Protein secondary structure assignment from atomic coordinates
- DSSP: Definition of secondary structure of proteins given a set of 3D coordinates
- DSC: Discrimination of protein secondary structure class
- PREDATOR: Protein secondary structure prediction from a single sequence or a set of sequences
- environ: Calculation of accessible as well as buried surface area in a protein structure
- confmat: Side chain packing optimization on a given main chain template for a protein
- melting: Computation, for a nucleic acid duplex, of the enthalpy, the entropy and the melting temperature of the helix-coil transitions

Other Software and Information Sources:
The VSNS BioComputing Division offers educational services over the Internet in bioinformatics/biocomputing. They have offered award-winning online courses in sequence analysis. The site includes a hypertext coursebook, covering topics such as pairwise sequence alignments, networking, and multiple alignment. You can also find a collection of online exercises, called "Sequence Analysis with Distributed Resources", and the "Biocomputing For Everyone" and "Biocomputing For Schools" Web sites.
The Banbury Cross Site is a web page for benchmarking gene identification software. Banbury Cross is at the Centre National De La Recherche Scientifique. This benchmark site is intended to be a forum for scientists working in the field of gene identification and anonymous genomic sequence annotation, with the goal of improving current methods in the context of very large genomic sequences, vertebrate ones in particular.
CBIL bioWidgets, at the University of Pennsylvania, is a collection of software libraries used for rapid development of graphical molecular biological applications. It includes:
- bioWidgets for Java(tm), a toolkit of biology-specific user interface widgets useful for rapid application development in Java(tm)
- bioTK, a toolkit of biology-specific user interface widgets useful for rapid application development in Tcl/Tk
- RSVP, a PostScript tool which lets your printer do nucleic acid sequence analysis; it generates very nice color diagrams of the results.
Human Genome Project Information at Oak Ridge National Laboratory contains many interesting and useful items about the U.S. Human Genome Project. They also have a more technical Research site.
FAKtory: A software environment for DNA Sequencing is at the University of Arizona. It is a prototype software environment in support of DNA sequencing. The environment consists of:
1. their software library, FAK, for the core combinatorial problem of assembling fragments
2. a Tcl/Tk based interface
3. a software suite supporting a database of fragments and a processing pipeline that includes clipping, tagging, and vector removal modules.
A key feature of FAKtory is that it is highly customizable: the structure of the fragment database, the processing pipeline, and the operation of each phase of the pipeline may be specified by the user.

Sunday, September 28, 2008
