Sunday, September 28, 2008

The genome as a fossil, and the tree of life

Mutations, duplications, and evolutionary pressure


Adaptive evolution

Some genes evolve more quickly than others. For example, some histone genes are highly conserved, whereas the immunoglobulin loci are extremely polymorphic. Why the difference? The answer is evolutionary pressure. In the first case, evolutionary pressure preserves the adapted condition, a common feature of functionally important genes. In the case of the immunoglobulins, extensive variation in the genes indicates that the gene product benefits from amino acid changes. Hence, evolutionary pressure works in two ways: either to preserve or to change. In either case, evolutionary pressure leads to higher fitness of the organism under present environmental conditions.

The kind of evolutionary pressure acting on a gene can be detected by comparing the rates of silent and non-silent (amino acid changing) nucleotide substitutions in a phylogenetic tree:
• No selection (pressure): the rates of silent and non-silent substitutions are equal.
• Selection for preserving an adapted gene shows a higher frequency of silent mutations.
• Positive selection favors gene products that keep changing and shows a higher frequency of non-silent mutations.
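
The three cases above can be sketched as a comparison of the non-silent rate (dN) with the silent rate (dS). This is a minimal illustration in Python, not a substitute for a real phylogenetic analysis; the function name and the `tolerance` band around a ratio of 1 are assumptions chosen for the example.

```python
def selection_regime(dn, ds, tolerance=0.1):
    """Classify selection from non-silent (dn) and silent (ds) substitution rates.

    `tolerance` is a hypothetical band around dn/ds == 1 that is treated
    as "no detectable selection"; it is not a standard value.
    """
    if ds <= 0:
        raise ValueError("silent substitution rate must be positive")
    ratio = dn / ds
    if ratio > 1 + tolerance:
        return "positive selection"    # non-silent changes dominate
    if ratio < 1 - tolerance:
        return "purifying selection"   # silent changes dominate
    return "no selection"              # rates roughly equal


if __name__ == "__main__":
    # Made-up rates, only to show the three outcomes.
    for dn, ds in [(0.02, 0.20), (0.10, 0.10), (0.30, 0.10)]:
        print(dn, ds, selection_regime(dn, ds))
```

A conserved histone gene would be expected to fall in the "purifying selection" class, and a pathogen surface protein in the "positive selection" class.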

Good examples of positive selection are found in the surface proteins of pathogens. These organisms benefit from changing the three-dimensional structure of their surface proteins to avoid recognition by antibodies. In the influenza virus, two of the surface proteins have a very high frequency of non-silent mutations. This is also the case for the envelope protein of HIV-1.

Mutation rate

Mutations in a genome should not be seen as something bad happening to it. If a bacterium had a proofreading system so good that no errors ever occurred, evolution would stop. Reliable replication and repair are necessary to maintain a genome; however, diversity within the genome lineage improves the chance that some members of a population survive if the environment changes. It is the survival of a population that carries the genome forward in time.

Mutations are not evenly distributed within a genome. Additions and deletions of nucleotide bases in translated sequences are predicted to be strongly excluded; they presumably do occur, but the result is then often fatal. While additions and deletions are very rare, base substitutions are more common. Nor do mutations occur at random: some sites in a gene, called hotspots, are mutated more frequently than other locations within the gene. The mutation rate in “cold spots” is one to two orders of magnitude lower than in “hotspots”. Even at these so-called hotspots the changes are not random, since some changes occur more frequently than others. Throughout the genome, transitions, where a pyrimidine is substituted by a pyrimidine or a purine by a purine, are more common than transversions, where a pyrimidine is changed to a purine or vice versa.
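
To make the transition/transversion distinction concrete, here is a small Python sketch that classifies single-base substitutions and counts both kinds between two aligned DNA sequences. The function names are hypothetical, and the example assumes gap-free sequences of equal length.

```python
PURINES = {"A", "G"}
PYRIMIDINES = {"C", "T"}


def classify_substitution(ref, alt):
    """Return 'transition' or 'transversion' for a single-base change."""
    ref, alt = ref.upper(), alt.upper()
    valid = PURINES | PYRIMIDINES
    if ref not in valid or alt not in valid:
        raise ValueError("bases must be one of A, C, G, T")
    if ref == alt:
        raise ValueError("not a substitution")
    # Transition: both bases are purines, or both are pyrimidines.
    same_class = {ref, alt} <= PURINES or {ref, alt} <= PYRIMIDINES
    return "transition" if same_class else "transversion"


def count_ts_tv(seq_a, seq_b):
    """Count (transitions, transversions) between two aligned sequences."""
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must have the same length")
    ts = tv = 0
    for a, b in zip(seq_a.upper(), seq_b.upper()):
        if a == b:
            continue
        if classify_substitution(a, b) == "transition":
            ts += 1
        else:
            tv += 1
    return ts, tv


if __name__ == "__main__":
    print(classify_substitution("A", "G"))
    print(count_ts_tv("ACGT", "GCTT"))
```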

The mutation rate is fairly constant when looking at the whole genome of different organisms: across prokaryotes, eukaryotes and viruses, the rate expressed per cell division is roughly 0.01. However, the rate per sexual generation varies widely.

Gene Duplication

Duplication of genetic material is an important mechanism for the evolution of new genetic functions and expression patterns. Duplication can affect a single gene, a whole chromosome and sometimes the whole genome (polyploidization). The following three alternative fates have been suggested for a duplicate gene pair:
• Nonfunctionalization: one of the genes becomes silenced.
• Neofunctionalization: one copy acquires a new function.
• Subfunctionalization: both genes degenerate so that together they do the job of the original gene.

Given this, it is interesting to know how often genes are duplicated, and how often duplication leads to a loss or a change in gene function, respectively.

Some researchers have looked at the number of duplicate genes retained a long time after a polyploidization event. These studies have estimated that between 50% and 92% of the duplicates are lost. A more recent study by Lynch and Conery (2000) argues that these figures are likely to be underestimates. They studied three whole and three partial genomes, identifying duplicate genes and analysing the number of synonymous and nonsynonymous substitutions between the pairs. They estimated the half-life of duplicated genes to be 3 to 7 million years. Moreover, they concluded that gene duplication is very frequent: on the order of 10^-3 to 10^-2 duplications per gene per million years. This is of the same order as the mutation rate per nucleotide site.
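
As a rough illustration of what a half-life of 3 to 7 million years implies, the following Python sketch computes the fraction of duplicates expected to survive after a given time. It assumes simple exponential (first-order) loss, which is the usual reading of a half-life but a simplification here; the function name is hypothetical.

```python
def fraction_retained(elapsed_myr, half_life_myr):
    """Fraction of duplicate genes still present after `elapsed_myr` million years,
    assuming exponential loss with the given half-life (also in million years)."""
    if half_life_myr <= 0:
        raise ValueError("half-life must be positive")
    if elapsed_myr < 0:
        raise ValueError("elapsed time must not be negative")
    return 0.5 ** (elapsed_myr / half_life_myr)


if __name__ == "__main__":
    # Compare the two ends of the half-life range quoted in the text.
    for half_life in (3.0, 7.0):
        for elapsed in (5.0, 20.0, 50.0):
            kept = fraction_retained(elapsed, half_life)
            print(f"half-life {half_life} Myr, after {elapsed} Myr: {kept:.4f} retained")
```

Under this assumption, very few duplicates survive for tens of millions of years, which is consistent with the high loss estimates quoted above.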

Vienna RNA Package

RNA Secondary Structure Prediction and Comparison


General information

The Vienna RNA Package consists of a C code library and several stand-alone programs for the prediction and comparison of RNA secondary structures.

RNA secondary structure prediction through energy minimization is the most used function in the package. We provide three kinds of dynamic programming algorithms for structure prediction: the minimum free energy algorithm of Zuker & Stiegler (1981), which yields a single optimal structure; the partition function algorithm of McCaskill (1990), which calculates base pair probabilities in the thermodynamic ensemble; and the suboptimal folding algorithm of Wuchty et al. (1999), which generates all suboptimal structures within a given energy range of the optimal energy. For secondary structure comparison, the package contains several measures of distance (dissimilarities) using either string alignment or tree-editing (Shapiro & Zhang 1990). Finally, we provide an algorithm to design sequences with a predefined structure (inverse folding).
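
To give a feel for the dynamic programming involved, here is a minimal Python sketch of the much simpler Nussinov-style recursion, which maximizes the number of base pairs instead of minimizing free energy. It is not the Zuker & Stiegler algorithm and not code from the Vienna RNA Package; the function name and the minimum hairpin loop length of 3 are assumptions made for the example.

```python
def max_base_pairs(seq, min_loop=3):
    """Maximum number of nested base pairs (Nussinov-style DP).

    Allows Watson-Crick and G-U wobble pairs and requires at least
    `min_loop` unpaired bases inside every hairpin.
    """
    seq = seq.upper().replace("T", "U")
    allowed = {("A", "U"), ("U", "A"), ("G", "C"),
               ("C", "G"), ("G", "U"), ("U", "G")}
    n = len(seq)
    if n == 0:
        return 0
    # best[i][j] = maximum number of pairs within seq[i..j]
    best = [[0] * n for _ in range(n)]
    for span in range(min_loop + 1, n):
        for i in range(n - span):
            j = i + span
            score = best[i][j - 1]              # case 1: base j is unpaired
            for k in range(i, j - min_loop):    # case 2: base j pairs with k
                if (seq[k], seq[j]) in allowed:
                    left = best[i][k - 1] if k > i else 0
                    score = max(score, left + 1 + best[k + 1][j - 1])
            best[i][j] = score
    return best[0][n - 1]


if __name__ == "__main__":
    print(max_base_pairs("GGGAAAUCC"))
```

The energy-based algorithms in the package follow the same fill-a-table pattern, but score loops with thermodynamic parameters rather than counting pairs.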

Documentation

For detailed information, take a look at the HTML versions of the man pages for the programs and the manual of the library below.
  • RNAfold -- predict minimum energy secondary structures and pair probabilities
  • RNAeval -- evaluate energy of RNA secondary structures
  • RNAheat -- calculate the specific heat (melting curve) of an RNA sequence
  • RNAinverse -- inverse fold (design) sequences with predefined structure
  • RNAdistance -- compare secondary structures
  • RNApdist -- compare base pair probabilities
  • RNAsubopt -- complete suboptimal folding
  • RNAplot -- RNA structure drawings in PostScript, SVG, or GML
  • RNAcofold -- predict hybrid structure of two sequences
  • RNAduplex -- predict possible hybridization sites between two sequences
  • RNAalifold -- predict the consensus structure of several aligned sequences
  • RNALfold -- predict locally stable structure of long sequences
  • RNAplfold -- compute average pair probabilities for local base pairs in long sequences
  • RNApaln -- fast structural alignment of RNA sequences using string alignments
  • Several small but helpful Perl Utilities
If you want to include our code into your own programs, you should read the documentation for the RNAlib library.
When installing from source, see the installation instructions.

The package is free software and can be downloaded as C source code that should be easy to compile on almost any flavor of Unix and Linux. See the README file for details.

Web interfaces for online RNA folding and sequence design

  • Structure prediction for moderate size RNAs can be done interactively on our server. Try our RNAfold Web Interface.
  • Inverse folding of small RNAs is now also available as a web service. Design your own sequences on our Sequence Design server.
  • For accurate prediction of consensus secondary structures from an alignment of related RNA sequences, try our brand new alifold server.

To see what we do with our software, take a look at our preprint server. There you'll also find a preprint version of our paper (Hofacker et al. 1994) describing the first version of the package.
The fold server is described in Nucleic Acids Res. 31: 3429-3431 (2003).

Version 2.0 of the ALIDOT utilities: an add-on for detecting conserved secondary structure motifs.

RNA folding software from elsewhere

Ole Matzura has written a program for 32-bit Windows based on the RNA folding routines in the Vienna package with a nice graphical user interface; see the Rnadraw Homepage.
A lot of information on RNA folding can be found on Michael Zuker's RNA page, where you can also download his mfold program.
The RNAstructure program is a re-implementation of mfold for Windows including a GUI; it is available from the web site of the Turner group.
The ESSA program provides several methods for drawing and analyzing RNA secondary structures.
A good starting point for information on RNA structures is the RNA world in Jena.

Software and Databases for Computational Biology on the Internet

This page is a supplement to the book Computational Methods in Molecular Biology, edited by Steven Salzberg, David Searls, and Simon Kasif. The publisher is Elsevier Sciences. Please contact Steven Salzberg (salzberg@cs.jhu.edu) if you wish to have your software referenced on this site, or if you wish to change the description of your software already listed here.

____________________________________________________________________
  • Glimmer is a system that uses Interpolated Markov Models (IMMs) to identify coding regions in microbial DNA. IMMs are a generalization of Markov models that allow great flexibility in the choice of the "context"; i.e., how many previous bases to use in predicting the next base. Glimmer has been tested on the complete genomes of H. influenzae, E. coli, H. pylori, M. genitalium, and other genomes, and results to date have proven it to be highly accurate. Glimmer was the principal gene finder for the genomes of B. burgdorferi, T. pallidum, C. trachomatis, C. pneumoniae, D. radiodurans, T. maritima, and others. The complete system, including source code, is available from this site. A version of the system built for the malaria parasite, GlimmerM, is also available.
  • GENSCAN is a program designed to predict complete gene structures, including exons, introns, promoter and poly-adenylation signals, in genomic sequences. It differs from the majority of existing gene finding algorithms in that it allows for partial genes as well as complete genes and for the occurrence of multiple genes in a single sequence, on either or both DNA strands. Program versions suitable for vertebrate, nematode (experimental), maize and Arabidopsis sequences are currently available. The vertebrate version also works fairly well for Drosophila sequences. Sequences can be submitted on a web-based form at this site. The GENSCAN Web site is at Stanford University.
  • GeneSplicer is a fast, flexible system for detecting splice sites in the genomic DNA of various eukaryotes. The system has been trained and tested successfully on Plasmodium falciparum (malaria), Arabidopsis thaliana, human, Drosophila, and rice. It was compared to six programs representing the leading splice site detectors for Arabidopsis thaliana and human: NetPlantGene, NetGene, HSPL, NNSplice, GENIO and SpliceView. In each case GeneSplicer performed comparably to the best alternative, in terms of both accuracy and computational efficiency.
  • GlimmerHMM is an interpolated Markov model system for finding genes in many eukaryotes, including Plasmodium falciparum, Arabidopsis thaliana, rice, and others. GlimmerHMM additionally incorporates splice site models adapted from the GeneSplicer program and a decision tree adapted from GlimmerM. The program utilizes Interpolated Markov Models as well as the Maximal Dependence Decomposition technique for improving specificity in splice site identification. The model includes states for exons, introns, and intergenic regions. The sources are freely available for download at this site.
  • TigrScan models DNA using a Generalized Hidden Markov Model (GHMM). Alternate parses of DNA (into zero or more gene models) are evaluated under this model. The evaluation of an input string using an HMM/GHMM is referred to as a decoding algorithm. TigrScan implements a novel decoding algorithm for GHMMs that is both time-efficient and space-efficient, through the use of queues and propagators. The sources, training data and a users' guide are available here.
  • TransTerm is a program that finds rho-independent transcription terminators in bacterial genomes. Each terminator found by the program is assigned a confidence value that provides an estimate of its probability of being a true terminator.
  • PIRATE (Prediction Informatics Resources And Techniques) is a central repository of open-source bioinformatics prediction programs and reusable software components, documentation, training data, experimental results, tips and tricks, and external links.
  • MUMmer is a system for aligning whole genome sequences. Using a suffix tree, the system is able to rapidly align sequences containing millions of nucleotides. Use of the algorithm should facilitate analysis of syntenic chromosomal regions, strain-to-strain comparisons, evolutionary comparisons, and genomic duplications.
  • Bambus is the first publicly available scaffolding program. It orders and orients contigs into scaffolds based on various types of linking information. Additionally, Bambus allows users to build scaffolds in a hierarchical fashion by prioritizing the order in which links are used.
  • Genie, a gene finder based on generalized hidden Markov models, is at the Lawrence Berkeley National Laboratory. It was developed in collaboration with the Computational Biology Group at the University of California, Santa Cruz. Genie uses a statistical model of genes called a Generalized Hidden Markov Model (GHMM) to find genes in vertebrate and human DNA. In a GHMM, probabilities are assigned to transitions between states and to the generation of each nucleotide base given a particular state. Machine learning techniques are applied to optimize these probabilities using a standardized gene data set, which is available on this site. The page has a link to the Genie Web server, to which sequences may be submitted.
  • GRAIL provides analysis and putative annotation of DNA sequences both interactively and through the use of automated computation. GRAIL is a tool for the identification of genes, exons, and various features in DNA sequences. This system is at the Oak Ridge National Laboratory in Tennessee.
  • The FGENE family of programs finds splice sites, genes, promoters, and poly-A recognition regions in eukaryotic sequence data. The underlying technology uses linear discriminant analysis. You can submit sequences to FGENE using a Web interface found at http://www.softberry.com/.
  • The GeneID server contains the GeneID system for finding genes in eukaryotes. GeneID is a hierarchical rule-based system, with scoring matrices to identify signals and rules to score coding regions. You can use this page to submit a genomic DNA sequence to the GeneID program. The GeneID site is at www1.imim.es/software/geneid/ in Spain, and is also available at this page at Boston University.
  • GeneMark is a system for finding genes in bacterial DNA sequences. The algorithm is based on non-homogeneous 5th-order Markov chains, and it was used to locate the genes in the complete genomes of H. influenzae, M. genitalium, and several other complete genomes. The site includes documentation and a Web interface to which sequences can be submitted. This system is at the Georgia Institute of Technology in Atlanta, GA.
  • MarFinder uses statistical patterns to deduce the presence of MARs (Matrix Association Regions) in DNA sequences. MARs constitute a significant functional block and have been shown to facilitate the processes of differential gene expression and DNA replication. This tool and Web site are at the Futuresoft Corporation.
  • NetPlantGene is at the Technical University of Denmark. The NetPlantGene Web server uses neural networks to predict splice sites in Arabidopsis thaliana DNA. This site also contains programs for other sequence analysis problems, such as the recognition of signal peptides. NetPlantGene is to be replaced with NetGene2.
  • MZEF and Pombe. This page contains software tools designed to predict putative internal protein coding exons in genomic DNA sequences. Human, mouse and Arabidopsis exons are predicted by a program called MZEF, and fission yeast exons are predicted by a program called Pombe. The site is located at the Cold Spring Harbor Laboratory.
  • Promoter Prediction by Neural Network (NNPP) is a method that finds eukaryotic and prokaryotic promoters in a DNA sequence. The basis of the NNPP program is a time-delay neural network. The time-delay network consists mainly of two feature layers, one for recognizing the TATA-box and one for recognizing the "Initiator", which is the region spanning the transcription start site. Both feature layers are combined into one output unit, which gives output scores between 0 and 1. This site is at the Lawrence Berkeley National Laboratory. Also available at this site is the splice site predictor used by the Genie system. The output of this neural network is a score between 0 and 1 indicating a potential splice site.
  • SplicePredictor is a program designed to predict donor and acceptor splice sites in maize and Arabidopsis sequences. Sequences can be submitted on a web-based form at this site. The system is at Stanford University.
  • Combiner is a program that predicts gene models using the output from other annotation software. It uses a statistical algorithm to identify patterns of evidence corresponding to gene models.
  • TESS (Transcription Element Search Software) is a set of software routines for locating and displaying transcription factor binding sites in DNA sequence. TESS uses the Transfac database as its store of transcription factors and their binding sites. This page is at the University of Pennsylvania's Computational Biology and Informatics Laboratory.
  • Genotator, a workbench for automated sequence annotation, provides a flexible, transparent system for automatically running a series of sequence analysis programs on genetic sequences. It also has a graphical display that allows users to view all of the automatically generated annotations and add their own. Genotator's display allows annotated sequences to be examined at multiple levels of detail, from an overview of the entire sequence down to individual bases. By displaying the aligned output of multiple types of sequence analysis, Genotator provides an intuitive way to identify the significant regions (for example, probable exons) in a sequence. Genotator was developed by Nomi Harris at Lawrence Berkeley National Laboratory.
  • WebGene (GenView, ORFGene, SpliceView) is a Web interface for several coding region recognition programs, including:
    • GenView: a system for protein-coding gene prediction
    • ORFGene: gene structure prediction using information on homologous protein sequences
    • SpliceView: prediction of splicing signals
    • HCpolya: a Hamming clustering method for poly-A prediction in eukaryotic genes
    This page is at the Instituto Tecnologie Biomediche Avanzate in Italy.
  • The Staden Package contains a wealth of useful programs for sequence assembly, DNA sequence comparison and analysis, protein sequence analysis, and sequencing project quality control. The site is mirrored in several locations around the world.
  • PatternHunter is a new general-purpose homology search tool, based on innovative and proprietary technologies. It provides all the tools necessary for fast and sensitive homology search in all flavors including DNA-DNA, Protein-Protein, translated DNA-protein, and translated DNA-DNA searches.
  • GeneHacker is a system for gene structure prediction in microbial genomes using a hidden Markov model (HMM). The HMM adopted in GeneHacker describes the start codon and the downstream di-codon frequencies in protein coding regions, and can identify such regions in uncharacterized DNA sequences.
  • Geneid is a program to predict genes along a DNA sequence in a large set of organisms.
  • GeneSeqer is a gene identification tool based on spliced alignment or "spliced threading" of ESTs with a genomic query sequence.
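
Several of the gene finders listed above (Glimmer, GeneMark, GlimmerHMM) score DNA with Markov models, in which the probability of each base depends on the bases that precede it. The Python sketch below illustrates only that core idea with a fixed-order Markov chain; it is not an interpolated Markov model and not code from any of these programs. The function names, the pseudocount and the toy training data are assumptions made for the example.

```python
import math
from collections import defaultdict


def train_markov(sequences, order=2, pseudocount=1.0):
    """Estimate P(next base | previous `order` bases) from training sequences.

    Returns a function prob(context, base). Unseen contexts fall back to
    a uniform distribution through the pseudocount.
    """
    counts = defaultdict(lambda: defaultdict(float))
    for seq in sequences:
        seq = seq.upper()
        for i in range(order, len(seq)):
            counts[seq[i - order:i]][seq[i]] += 1.0

    def prob(context, base):
        seen = counts.get(context, {})
        total = sum(seen.values()) + 4 * pseudocount  # four possible bases
        return (seen.get(base, 0.0) + pseudocount) / total

    return prob


def log_likelihood(seq, prob, order=2):
    """Sum of log-probabilities of each base given its preceding context."""
    seq = seq.upper()
    return sum(math.log(prob(seq[i - order:i], seq[i]))
               for i in range(order, len(seq)))


if __name__ == "__main__":
    # Toy "coding" training set; real gene finders train on known genes.
    model = train_markov(["ATGATGATGATG"], order=2)
    for candidate in ("ATGATG", "CCCCCC"):
        print(candidate, round(log_likelihood(candidate, model), 3))
```

A sequence that resembles the training data receives a higher log-likelihood. A gene finder would compare such scores between a coding model and a non-coding model, and an IMM would additionally blend several context lengths.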

    Databases:

  • GenBank is the NIH genetic sequence database, an annotated collection of all publicly available DNA sequences. GenBank is part of the International Nucleotide Sequence Database Collaboration, which is comprised of the DNA DataBank of Japan (DDBJ), the European Molecular Biology Laboratory (EMBL), and GenBank at the National Center for Biotechnology Information.
  • Genome Assembly Archive: The NCBI Assembly Archive links the raw sequence information found in the Trace Archive with assembly information found in publicly available sequence repositories (GenBank/EMBL/DDBJ). The Assembly Viewer allows a user to see the multiple sequence alignments as well as the actual sequence chromatogram.
  • GeneCards is a database of human genes, their products and their involvement in diseases. It offers concise information about the functions of all human genes that have an approved symbol as well as selected others. It is especially useful for those working in functional genomics and proteomics who are searching for information. The data is collected with knowledge discovery and data mining techniques and accessed by means of a proprietary Guidance System that makes more or less intelligent suggestions to the user of where and how the information may be retrieved.
  • The EpoDB (Erythropoiesis Database) is a database of genes that relate to vertebrate red blood cells. A detailed description of EpoDB can be found in Chapter 5. The database includes DNA sequence, structural features and potential transcription factor binding sites. This Web site is at the University of Pennsylvania's CBIL.
  • The LENS (Linking ESTs and their associated Name Space) database links and resolves the names and identifiers of clones and ESTs generated in the I.M.A.G.E. Consortium/WashU/Merck EST project. The name space includes library and clone IDs and names from IMAGE Consortium, EST sequence IDs from Washington University, sequence entry accession numbers from dbEST/NCBI, and library and clone IDs from GDB. LENS allows for querying of IMAGE Consortium data via all the different IDs.
  • PDD, the NIMH-NCI Protein-Disease Database is at the Laboratory of Experimental and Computational Biology at the National Cancer Institute. This server is part of the NIMH-NCI Protein-Disease Database project for correlating diseases with proteins observable in serum, CSF, urine and other common human body fluids based on biomedical literature.
  • The TRANSFAC Database is at the Gesellschaft für Biotechnologische Forschung mbH (Germany). TRANSFAC is a transcription factor database. It compiles data about gene regulatory DNA sequences and protein factors binding to them. On this basis, programs are developed that help to identify putative promoter or enhancer structures and to suggest their features.
  • SRS is a homogeneous interface to over 80 biological databases, developed at the European Bioinformatics Institute (EBI) at Hinxton, UK. The databases cover sequences and sequence-related data, metabolic pathways, transcription factors, application results (e.g. BLAST), protein 3D structure, genomes, mutations and locus-specific mutations.
  • Homstrad (HOMologous STRucture Alignment Database) is a web accessible database at University of Cambridge. Homstrad contains families of proteins of known structure that share sequence/structural similarity. The classifications proposed by various databases including SCOP, Pfam, PROSITE and SMART and the results from sequence similarity searches by PSI-BLAST and FUGUE are combined.

    Motif Search:

  • PROSITE Search Form: Allows you to rapidly compare a protein sequence against all patterns stored in the PROSITE pattern database. It answers the question: Which patterns from the PROSITE database are found in my sequence? (EBI)
  • ScanProsite: The "Protein against PROSITE" form allows one to scan a protein sequence (either from SWISS-PROT or provided by the user) for occurrences of patterns stored in the PROSITE database. "Pattern against SWISS-PROT" scans all of the SWISS-PROT database (including weekly releases) for occurrences of a pattern that can originate from PROSITE or be provided by the user. (ExPASy)
  • ELPH is a general-purpose Gibbs sampler for finding motifs in a set of DNA or protein sequences. The program takes as input a set containing anywhere from a few dozen to thousands of sequences, and searches through them for the most common motif, assuming that each sequence contains one copy of the motif. We have used ELPH to find patterns such as ribosome binding sites (RBSs) and exon splicing enhancers (ESEs).
  • The Motifs in protein databases program determines if a protein motif is present in a database of protein sequences. This program allows the user to define a protein motif and then determine if a DNA sequence might encode it or if it is present in a protein database. The program does not search a library of predefined protein motifs. A motif is defined by entering the amino acids of interest at each position. (Alces)
  • MatInspector: A tool for the detection of transcription factor binding sites. It is able to locate matches in sequences of unlimited length and compare one, several or all sequences in a sequence file against all or selected subsets of matrices from a library of matrix descriptions of protein binding sites. (GSF)
  • MEME (Multiple EM for Motif Elicitation) allows one to discover motifs of highly conserved regions in groups of related DNA or protein sequences, and to search sequence databases with those motifs using MAST. MAST works by calculating match scores for each sequence. The match scores are converted into various types of p-values, and these are used to determine the overall match of the sequence to the group of motifs. (SDSC)
  • BCM Search Launcher: The Baylor College of Medicine has a variety of biology-related search and analysis services including general protein sequence/pattern searches and species-specific protein sequence searches. (HGSC)
  • Screening a pattern or alignment against the PROTEIN databank: This method of looking for all pattern entries in the PROTEIN databank is almost the same as the PROSITE screening procedure. The one difference is that a match between a pattern letter and a fragment letter can be interpreted in a broad sense, as similarity of the letters according to a weight matrix selected by the user. (genebee)
  • PPSEARCH: PROSITE database searches (sequence against databases of motifs). Allows you to search sequences for motifs or functional patterns in the PROSITE database. (EBI)
  • COGnitor: Compare your sequence to COG- Clusters of Orthologous Groups database. Each COG consists of individual proteins or groups of paralogs from at least 3 lineages and thus corresponds to an ancient conserved domain. (NCBI)
  • HMMER (Sean Eddy): Profile hidden Markov models can be used to do sensitive database searching using statistical descriptions of a sequence family's consensus. The advantage of using HMMs is that they have a formal probabilistic basis and can be trained from unaligned sequences if a trusted alignment isn't yet known. They do, however, make poor models of RNAs because they cannot describe base pairs. HMMER is a freely distributable implementation of profile HMM software for protein sequence analysis. (Washington Univ.)
  • emotif is a research system that forms motifs for subsets of aligned sequences. Emotif ranks the motifs that it finds by both their specificity and the number of supplied sequences that they cover. (Stanford Bioinformatics Group)
  • FunSiteP Promoter Recognition: Recognition and classification of eukaryotic promoters by searching transcription factor binding sites using transcription factor binding site consensi. (GSF)
  • SMART Simple Modular Architecture Research Tool: Allows rapid identification and annotation of signalling protein domain sequences. It is able to determine the modular architectures of single sequences or genomes. (EBI)
  • SAM: Sequence Alignment and Modeling System using HMMs (Hidden Markov Models). SAM is a collection of software tools for creating, refining and using linear HMMs for biological sequence analysis. Documentation for SAM can be found here. (Pasteur)
  • Mreps is flexible and efficient software for identifying serial repeats (usually called tandem repeats) in DNA sequences. It is developed at LORIA in the Adage group.
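
The PROSITE-based tools above scan a protein sequence for patterns written in the PROSITE pattern syntax. As a rough sketch of how such a scan works, the Python code below translates a pattern into a regular expression and reports where it matches. The helper names are hypothetical, only the common syntax elements are handled (`x`, `[...]`, `{...}`, repeat counts, `<` and `>` anchors), and overlapping matches are not reported. The example pattern N-{P}-[ST]-{P} is the PROSITE N-glycosylation site pattern; the test sequence is made up.

```python
import re


def prosite_to_regex(pattern):
    """Translate a PROSITE-style pattern into a Python regular expression."""
    parts = []
    for element in pattern.rstrip(".").split("-"):
        # {P} means "any residue except P"; do this before touching parentheses.
        element = element.replace("{", "[^").replace("}", "]")
        # x(2,4) means "two to four of any residue".
        element = element.replace("(", "{").replace(")", "}")
        # x is the wildcard; < and > anchor the sequence ends.
        element = element.replace("x", ".").replace("<", "^").replace(">", "$")
        parts.append(element)
    return "".join(parts)


def find_motif(pattern, sequence):
    """Return (start, matched_text) for each non-overlapping match."""
    regex = re.compile(prosite_to_regex(pattern))
    return [(m.start(), m.group()) for m in regex.finditer(sequence.upper())]


if __name__ == "__main__":
    print(prosite_to_regex("N-{P}-[ST]-{P}"))
    print(find_motif("N-{P}-[ST]-{P}", "MKNGSAANPSQ"))
```

Positions are zero-based, so a reported start of 2 refers to the third residue of the sequence.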

    Secondary Structure Prediction:

  • THREADER2 is a program for predicting protein tertiary structure by recognizing the correct fold from a library of alternatives. Of course, if a fold similar to the native fold of the protein being predicted is not in the library, then this approach will not succeed. Fortunately, certain folds crop up time and time again, and so fold recognition methods for predicting protein structure can be very effective. In the first prediction contest held at Asilomar, organized by John Moult and colleagues, THREADER correctly identified 8 out of 11 target structures which either globally or locally resembled a previously observed fold. Preliminary analysis of the results from the second competition (CASP2) shows that THREADER 2 has clearly improved in both fold recognition sensitivity and sequence-structure alignment accuracy. In CASP2, the new version of THREADER recognized 4 folds correctly out of 6 targets with recognizable structures (including the difficult task of assigning a jelly-roll fold rather than other beta-sandwich topologies for one target). THREADER 2 produced more correct fold predictions (i.e. correct folds ranked at No. 1) than any other method.
  • PredictProtein is a service for sequence analysis and structure prediction. Once you submit a protein sequence, PredictProtein retrieves similar sequences in the database and predicts aspects of protein structure, residue solvent accessibility and helical transmembrane regions. (EMBL)
  • RAPTOR, Prospect Pro: software tools for protein structure prediction. Based on a novel linear programming method, they combine sequence-to-sequence and sequence-to-structure search methods with advanced analysis tools in one integrated software solution.
  • NNPREDICT Protein Secondary Structure Prediction: A program that predicts the secondary structure type for each residue in an amino acid sequence. The basis of the prediction is a two-layer, feed-forward neural network. NNPREDICT takes as input a protein sequence and returns a secondary structure prediction for each position in the sequence. (UCSF)
  • PSA Protein Structure Prediction Server: Predicts probable secondary structures and folding classes for a given amino acid sequence. It performs three types of protein structure/sequence analysis:
    1. Analysis of full length amino acid sequences that are assumed to be monomeric globular, water-soluble proteins consisting of a single domain
    2. Analysis of either complete sequences, or sequence fragments with a minimal set of modelled structural assumptions
    3. Analysis of potential WD-repeat protein family sequences
    (BMERC at Boston University)
  • SSCP (Secondary Structural Content Prediction) computes predictions for the content of helix, strand, and coil for a given protein using the amino acid composition as the only input information. The method used by SSCP consists of applying analytic vector decomposition methods to the composition vector of the query protein.
  • RNA secondary structure prediction: If a multiple alignment is given by the user, the information on conserved positions in it and on compensatory changes in some of them will be used: stems that include such positions are given a better chance of being included in the resulting secondary structure. The algorithm is as follows: first, all possible ways of fitting together different pieces of the sequence are looked for. Then locally optimal secondary structures are built from the helices found. Lastly, the final structure is constructed by optimizing the model energy of the system (which includes contributions from conserved and complementary pairs with corresponding coefficients). (Genebee)
  • SoWhat: The SoWhat WWW server predicts distance constraints between amino acids in proteins from the amino acid sequence. It uses a neural network based method to predict contacts between C-alpha atoms from the amino acid sequence. (CBS Denmark)

  • Pasteur Institute:
    • STRIDE: Protein secondary structure assignment from atomic coordinates
    • DSSP: Definition of secondary structure of proteins given a set of 3D coordinates
    • DSC: Discrimination of protein secondary structure class
    • PREDATOR: Protein secondary structure prediction from a single sequence or a set of sequences
    • environ: calculate accessible as well as buried surface area in protein structure
    • confmat: Side chain packing optimization on a given main chain template for protein
    • melting computes, for a nucleic acid duplex, the enthalpy, the entropy and the melting temperature of the helix-coil transitions.

    Other Software and Information Sources:

  • The VSNS BioComputing Division offers educational services over the Internet in bioinformatics/biocomputing. They have offered award-winning online courses in sequence analysis. The site includes a hypertext coursebook covering topics such as pairwise sequence alignments, networking, and multiple alignment. You can also find a collection of online exercises called "Sequence Analysis with Distributed Resources", as well as the "Biocomputing For Everyone" and "Biocomputing For Schools" websites.
  • The Banbury Cross Site is a web page for benchmarking gene identification software. Banbury Cross is at the Centre National De La Recherche Scientifique. This benchmark site is intended to be a forum for scientists working in the field of gene identification and anonymous genomic sequence annotation, with the goal of improving current methods in the context of very large genomic sequences, vertebrate ones in particular.
  • CBIL bioWidgets, at the University of Pennsylvania, is a collection of software libraries used for rapid development of graphical molecular biological applications. It includes:
    • bioWidgets for Java(tm), a toolkit of biology-specific user interface widgets useful for rapid application development in Java(tm)
    • bioTK, a toolkit of biology-specific user interface widgets useful for rapid application development in Tcl/Tk
    • RSVP, a PostScript tool which lets your printer do nucleic acid sequence analysis; it generates very nice color diagrams of the results.
  • Human Genome Project Information at Oak Ridge National Laboratory contains many interesting and useful items about the U.S. Human Genome Project. They also have a more technical Research site.
  • FAKtory: A software environment for DNA Sequencing is at the University of Arizona. It is a prototype software environment in support of DNA sequencing. The environment consists of
    1. their software library, FAK, for the core combinatorial problem of assembling fragments
    2. a Tcl/Tk based interface
    3. a software suite supporting a database of fragments and a processing pipeline that includes clipping, tagging, and vector removal modules.
    A key feature of FAKtory is that it is highly customizable: the structure of the fragment database, the processing pipeline, and the operation of each phase of the pipeline may be specified by the user.

Data Analysis Software from the BRC

The following software for genetic data analysis was developed and made available by researchers at or affiliated with the Bioinformatics Research Center:

Alternative Splicing Gallery (ASG)
GDA
QTL Cartographer
Windows QTL Cartographer
Forensic DNA Mixtures
Hy-Phy
PowerMarker
Mixed Models and QTLs
Exact Tests
Cytonuclear Disequilibria

Alternative Splicing Gallery (ASG) is a web-based splicing graph database that integrates transcript information from Ensembl, RefSeq, STACK, TIGR gene index, and UniGene, in order to explore and visualize gene structure and alternative splicing and to provide an exhaustive transcript catalog. The program was developed by Jeremy Leipzig and Dr. Steffen Heber.

Genetic Data Analysis (GDA) is a software package for analyzing discrete population genetic data. There are versions of GDA suitable for all Microsoft Windows platforms. The Lewis Lab Software Site has links for downloading the program over the web and instructions for downloading the program using anonymous FTP.

QTL Cartographer is a suite of programs to map quantitative traits using a map of molecular markers.

Windows QTL Cartographer maps quantitative trait loci in cross populations from inbred lines. It incorporates many of the modules found in its command-line sibling, QTL Cartographer (see above). WinQTLCart includes powerful graphic tools for presenting mapping results and can import and export data in a variety of formats.

DNAMIX v.3 calculates likelihood ratios for mixed DNA samples encountered in forensic science. It is platform-independent and is applicable to complex mixtures as well as single-contributor stains.

Hypothesis Testing Using Phylogenies (Hy-Phy) is a software package for maximum likelihood analyses of genetic sequence data. It is equipped with tools to test various statistical hypotheses.

PowerMarker, written by Jack Liu, is a comprehensive set of statistical methods for discrete genetic data analysis, designed especially for SNP/SSR data analysis. It also includes a 2-D Viewer and CoreSet batch system.

Mixed Models and QTLs: Programs for mixed model approaches for quantitative genetic analysis by Jun Zhu.

Exact conditional tests for different combinations of allelic and genotypic disequilibria on haploid and diploid data, or their combination. Get readme file, source code in UNIX archive, source code in MS-DOS archive, solaris2.6 executable, sgi-irix6.5 executable, MS Windows executable.

Cytonuclear Disequilibrium: Christopher J. Basten has written two programs for calculating cytonuclear disequilibria. There are MS Windows and Macintosh binaries as well as a UNIX distribution that includes source code and a makefile. All three distributions include example files and documentation. These programs have online manual pages.

For diallelic cytonuclear systems, see CNDd. For multiallelic systems, see CNDm.

Bioinformatics and Functional Genomics - Glossary

A

Additive genetic effects
When the combined effects of alleles at different loci are equal to the sum of their individual effects. (ORNL)

Adenine (A)
A nitrogenous base, one member of the base pair AT (adenine-thymine).
See also: base pair (ORNL)

Algorithm
A fixed procedure embodied in a computer program. (NCBI)

Alignment
The process of lining up two or more sequences to achieve maximal levels of identity (and conservation, in the case of amino acid sequences) for the purpose of assessing the degree of similarity and the possibility of homology. (NCBI)

Alignment
Representation of a prediction of the amino acids in tertiary structures of homologues that overlay in three dimensions. Alignments held by SMART are mostly based on published observations (see domain annotations for details), but are updated and edited manually. (SMART)

All-alpha
A class that has the number of secondary structures in the domain or common core described as 3-, 4-, 5-, 6- or multi- helical. (SCOP)

All-beta
A class that includes two major fold groups: sandwiches and barrels. The sandwich folds are made of two beta-sheets, which are usually twisted and pack so that their strands are aligned. The barrel folds are made of a single beta-sheet that twists and coils upon itself so that, in most cases, the first strand in the beta-sheet hydrogen-bonds to the last strand. The strand directions in the two opposite sides of a barrel fold are roughly orthogonal. Orthogonal packing of sheets is also seen in a few special cases of sandwich folds. (SCOP)

Allele
Alternative form of a genetic locus; a single allele for each locus is inherited from each parent (e.g., at a locus for eye color the allele might result in blue or brown eyes). (ORNL)

Allele
One of the variant forms of a gene at a particular locus, or location, on a chromosome. Different alleles produce variation in inherited characteristics such as hair color or blood type. In an individual, one form of the allele (the dominant one) may be expressed more than another form (the recessive one). (NHGRI)

Allogeneic
Variation in alleles among members of the same species. (ORNL)

Alternative splicing
Different ways of combining a gene's exons to make variants of the complete protein (ORNL)

Amino acid
Any of a class of 20 molecules that are combined to form proteins in living things. The sequence of amino acids in a protein and hence protein function are determined by the genetic code. (ORNL)

Amplification
An increase in the number of copies of a specific DNA fragment; can be in vivo or in vitro.
See also: cloning (ORNL)

Animal model
See: model organisms (ORNL)

Annotation
Adding pertinent information such as gene coded for, amino acid sequence, or other commentary to the database entry of raw sequence of DNA bases.
See also: bioinformatics (ORNL)

Anticipation
Each generation of offspring has increased severity of a genetic disorder; e.g., a grandchild may have earlier onset and more severe symptoms than the parent, who had earlier onset than the grandparent.
See also: additive genetic effects, complex trait (ORNL)

Antisense
Nucleic acid that has a sequence exactly opposite to an mRNA molecule made by the body; binds to the mRNA molecule to prevent a protein from being made.
See also: transcription (ORNL)

Apoptosis
Programmed cell death, the body's normal method of disposing of damaged, unwanted, or unneeded cells.
See also: cell (ORNL)

Array (of hairpins)
An assembly of alpha-helices that cannot be described as a bundle or a folded leaf. (SCOP)

Arrayed library
Individual primary recombinant clones (hosted in phage, cosmid, YAC, or other vector) that are placed in two-dimensional arrays in microtiter dishes. Each primary clone can be identified by the identity of the plate and the clone location (row and column) on that plate. Arrayed libraries of clones can be used for many applications, including screening for a specific gene or genomic region of interest.
See also: library, genomic library, gene chip technology (ORNL)

Assembly
Putting sequenced fragments of DNA into their correct chromosomal positions. (ORNL)

Autoradiography
A technique that uses X-ray film to visualize radioactively labeled molecules or fragments of molecules; used in analyzing length and number of DNA fragments after they are separated by gel electrophoresis. (ORNL)

Autosomal dominant
A gene on one of the non-sex chromosomes that is always expressed, even if only one copy is present. The chance of passing the gene to offspring is 50% for each pregnancy.
See also: autosome, dominant, gene (ORNL)

Autosome
A chromosome not involved in sex determination. The diploid human genome consists of a total of 46 chromosomes: 22 pairs of autosomes, and 1 pair of sex chromosomes (the X and Y chromosomes).
See also: sex chromosome (ORNL)

B

Backcross
A cross between an animal that is heterozygous for alleles obtained from two parental strains and a second animal from one of those parental strains. Also used to describe the breeding protocol of an outcross followed by a backcross.
See also: model organisms (ORNL)

Bacterial artificial chromosome (BAC)
A vector used to clone DNA fragments (100- to 300-kb insert size; average, 150 kb) in Escherichia coli cells. Based on naturally occurring F-factor plasmid found in the bacterium E. coli.
See also: cloning vector (ORNL)

Bacterial artificial chromosome (BAC)
Large segments of DNA, 100,000 to 200,000 bases, from another species cloned into bacteria. Once the foreign DNA has been cloned into the host bacteria, many copies of it can be made. (NHGRI)

Bacteriophage
See: phage (ORNL)

Barrel
Structures are usually closed by main-chain hydrogen bonds between the first and last strands of the beta sheet. In this case the barrel is defined by two integers: the number of strands in the beta sheet, n, and the shear number, S, a measure of the extent to which the strands in the sheet are staggered. (SCOP)

Base
One of the molecules that form DNA and RNA molecules.
See also: nucleotide, base pair, base sequence (ORNL)

Base pair (bp)
Two nitrogenous bases (adenine and thymine or guanine and cytosine) held together by weak bonds. Two strands of DNA are held together in the shape of a double helix by the bonds between base pairs. (ORNL)

Base sequence
The order of nucleotide bases in a DNA molecule; determines structure of proteins encoded by that DNA. (ORNL)

Base sequence analysis
A method, sometimes automated, for determining the base sequence. (ORNL)

Behavioral genetics
The study of genes that may influence behavior. (ORNL)

Beta-sheet
Can be antiparallel (i.e., the strand directions in any two adjacent strands are antiparallel), parallel (all strands are parallel to each other), or mixed (there is at least one strand that is parallel to one of its two neighbours and antiparallel to the other). (SCOP)

Bioinformatics
The merger of biotechnology and information technology with the goal of revealing new insights and principles in biology. (NCBI)

Bioinformatics
The science of managing and analyzing biological data using advanced computing techniques. Especially important in analyzing genomic research data.
See also: informatics (ORNL)

Bioremediation
The use of biological organisms such as plants or microbes to aid in removing hazardous substances from an area. (ORNL)

Biotechnology
A set of biological techniques developed through basic research and now applied to research and product development. In particular, biotechnology refers to the use by industry of recombinant DNA, cell fusion, and new bioprocessing techniques. (ORNL)

Birth defect
Any harmful trait, physical or biochemical, present at birth, whether a result of a genetic mutation or some other nongenetic factor.
See also: congenital, gene, mutation, syndrome (ORNL)

Bit score
The value S' is derived from the raw alignment score S in which the statistical properties of the scoring system used have been taken into account. Because bit scores have been normalized with respect to the scoring system, they can be used to compare alignment scores from different searches. (NCBI)

Bits scores
Alignment scores are reported by HMMer and BLAST as bits scores. The likelihood that the query sequence is a bona fide homologue of the database sequence is compared to the likelihood that the sequence was instead generated by a "random" model. Taking the logarithm (to base 2) of this likelihood ratio gives the bits score. (SMART)
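
The two definitions above can be tied together with a short sketch. It assumes the standard Karlin-Altschul normalization, S' = (lambda * S - ln K) / ln 2, where lambda and K are the statistical parameters of the scoring system; the function names and the parameter values in the usage note are illustrative assumptions, not values taken from any particular BLAST or HMMer run.

```python
import math

def bit_score(raw_score, lam, k):
    """Normalize a raw alignment score S into a bit score S'.

    lam and k are the Karlin-Altschul parameters of the scoring system
    (assumed to be supplied by the caller).
    """
    return (lam * raw_score - math.log(k)) / math.log(2)

def bits_from_likelihoods(p_homolog_model, p_random_model):
    """Bits score as the base-2 logarithm of a likelihood ratio."""
    return math.log2(p_homolog_model / p_random_model)
```

For example, `bits_from_likelihoods(8.0, 1.0)` gives 3.0 bits: the sequence is eight times more likely under the homologue model than under the random model. Calling `bit_score(50, lam=0.267, k=0.041)` shows how a raw score is rescaled; those two parameter values are placeholders chosen for illustration.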

BLAST
Basic Local Alignment Search Tool. (Altschul et al.) A sequence comparison algorithm optimized for speed used to search sequence databases for optimal local alignments to a query. The initial search is done for a word of length "W" that scores at least "T" when compared to the query using a substitution matrix. Word hits are then extended in either direction in an attempt to generate an alignment with a score exceeding the threshold of "S". The "T" parameter dictates the speed and sensitivity of the search. For additional details, see one of the BLAST tutorials (Query or BLAST) or the narrative guide to BLAST. (NCBI)
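
The seed-and-extend idea in this definition can be sketched in a few lines. This is a deliberately simplified illustration, not the NCBI implementation: it assumes a plain match/mismatch score in place of a substitution matrix, compares words by brute force, and extends hits to the right only (extension to the left is symmetric). The names `find_seeds` and `extend_right` and the `x_drop` cutoff are hypothetical.

```python
def score(a, b, match=5, mismatch=-4):
    """Toy scoring scheme standing in for a substitution matrix."""
    return match if a == b else mismatch

def word_score(w1, w2):
    return sum(score(a, b) for a, b in zip(w1, w2))

def find_seeds(query, subject, w=3, t=15):
    """Return (query_pos, subject_pos) pairs whose length-w words score >= t."""
    seeds = []
    for i in range(len(query) - w + 1):
        for j in range(len(subject) - w + 1):
            if word_score(query[i:i + w], subject[j:j + w]) >= t:
                seeds.append((i, j))
    return seeds

def extend_right(query, subject, i, j, w=3, x_drop=10):
    """Extend a word hit to the right without gaps.

    Stops when the running score falls more than x_drop below the best
    score seen so far. Returns (best_score, length_of_best_extension).
    """
    current = best = word_score(query[i:i + w], subject[j:j + w])
    best_len = k = w
    while i + k < len(query) and j + k < len(subject):
        current += score(query[i + k], subject[j + k])
        k += 1
        if current > best:
            best, best_len = current, k
        elif best - current > x_drop:
            break
    return best, best_len
```

With this toy scoring, a threshold of `t=15` and `w=3` accepts only exact word matches; lowering T admits more seeds, which raises sensitivity at the cost of speed. An extended hit would then be reported only if its score exceeds the threshold S.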

BLAST
A computer program that identifies homologous (similar) genes in different organisms, such as human, fruit fly, or nematode. (ORNL)

BLOSUM
Blocks Substitution Matrix. A substitution matrix in which scores for each position are derived from observations of the frequencies of substitutions in blocks of local alignments in related proteins. Each matrix is tailored to a particular evolutionary distance. In the BLOSUM62 matrix, for example, the alignment from which scores were derived was created using sequences sharing no more than 62% identity. Sequences more identical than 62% are represented by a single sequence in the alignment so as to avoid over-weighting closely related family members. (Henikoff and Henikoff) (NCBI)
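
As a rough illustration of how a score is derived from substitution frequencies, the sketch below computes a log-odds score in half-bit units, which is the scaling commonly used for BLOSUM matrices. The frequencies in the usage note are made up for the example, and the function name is an assumption.

```python
import math

def log_odds_score(observed_freq, expected_freq, scale=2.0):
    """Log-odds substitution score, rounded to an integer.

    observed_freq: how often the residue pair is seen aligned in the blocks.
    expected_freq: how often the pair would be aligned by chance.
    scale=2.0 expresses the score in half-bits.
    """
    return round(scale * math.log2(observed_freq / expected_freq))
```

A pair observed four times more often than chance predicts, e.g. `log_odds_score(0.04, 0.01)`, scores +4; a pair observed half as often as chance, `log_odds_score(0.005, 0.01)`, scores -2.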

Bundle
An array of alpha-helices each oriented roughly along the same (bundle) axis. It may have twist, left-handed if each helix makes a positive angle to the bundle axis, or, right-handed if each helix makes a negative angle to the bundle axis. (SCOP)

C

Cancer
Diseases in which abnormal cells divide and grow unchecked. Cancer can spread from its original site to other parts of the body and can be fatal.
See also: hereditary cancer, sporadic cancer (ORNL)

Candidate gene
A gene located in a chromosome region suspected of being involved in a disease.
See also: positional cloning, protein (ORNL)

Capillary array
Gel-filled silica capillaries used to separate fragments for DNA sequencing. The small diameter of the capillaries permits the application of higher electric fields, providing high-speed, high-throughput separations that are significantly faster than traditional slab gels. (ORNL)

Carcinogen
Something which causes cancer to occur by causing changes in a cell's DNA.
See also: mutagen (ORNL)

Carrier
An individual who possesses an unexpressed, recessive trait. (ORNL)

cDNA library
A collection of DNA sequences that code for genes. The sequences are generated in the laboratory from mRNA sequences.
See also: messenger RNA (ORNL)

Cell
The basic unit of any living organism that carries on the biochemical processes of life.
See also: genome, nucleus (ORNL)

Centimorgan (cM)
A unit of measure of recombination frequency. One centimorgan is equal to a 1% chance that a marker at one genetic locus will be separated from a marker at a second locus due to crossing over in a single generation. In human beings, one centimorgan is equivalent, on average, to one million base pairs.
See also: megabase (ORNL)
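
The arithmetic in this definition can be written out as a small sketch. It assumes the rule of thumb quoted above (1 cM per million base pairs, on average, in humans) and treats recombination chance as linear in map distance, which is only a reasonable approximation for short distances; actual rates vary along the genome. The names are illustrative.

```python
BP_PER_CM_HUMAN = 1_000_000  # average figure quoted in the definition above

def recombination_chance(centimorgans):
    """Approximate per-generation chance of separation; small distances only."""
    return centimorgans / 100.0

def approx_physical_distance_bp(centimorgans):
    """Very rough physical distance implied by a genetic distance in humans."""
    return centimorgans * BP_PER_CM_HUMAN
```

For instance, two markers 2.5 cM apart have about a 2.5% chance of being separated by crossing over in one generation and lie, on average, roughly 2.5 million base pairs apart.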

Centromere
A specialized chromosome region to which spindle fibers attach during cell division. (ORNL)

Chimera (pl. chimaera)
An organism that contains cells or tissues with a different genotype. These can be mutated cells of the host organism or cells from a different organism or species. (ORNL)

Chloroplast chromosome
Circular DNA found in the photosynthesizing organelle (chloroplast) of plants instead of the cell nucleus where most genetic material is located. (ORNL)

Chromosomal deletion
The loss of part of a chromosome's DNA. (ORNL)

Chromosomal inversion
Chromosome segments that have been turned 180 degrees. The gene sequence for the segment is reversed with respect to the rest of the chromosome. (ORNL)

Chromosome
The self-replicating genetic structure of cells containing the cellular DNA that bears in its nucleotide sequence the linear array of genes. In prokaryotes, chromosomal DNA is circular, and the entire genome is carried on one chromosome. Eukaryotic genomes consist of a number of chromosomes whose DNA is associated with different kinds of proteins. (ORNL)

Chromosome painting
Attachment of certain fluorescent dyes to targeted parts of the chromosome. Used as a diagnostic for particular diseases, e.g. types of leukemia. (ORNL)

Chromosome region p
A designation for the short arm of a chromosome. (ORNL)

Chromosome region q
A designation for the long arm of a chromosome. (ORNL)

Clone
An exact copy made of biological material such as a DNA segment (e.g., a gene or other region), a whole cell, or a complete organism. (ORNL)

Clone bank
See: genomic library (ORNL)

Cloning
Using specialized DNA technology to produce multiple, exact copies of a single gene or other segment of DNA to obtain enough material for further study. This process, used by researchers in the Human Genome Project, is referred to as cloning DNA. The resulting cloned (copied) collections of DNA molecules are called clone libraries. A second type of cloning exploits the natural process of cell division to make many copies of an entire cell. The genetic makeup of these cloned cells, called a cell line, is identical to the original cell. A third type of cloning produces complete, genetically identical animals such as the famous Scottish sheep, Dolly.
See also: cloning vector (ORNL)

Cloning vector
DNA molecule originating from a virus, a plasmid, or the cell of a higher organism into which another DNA fragment of appropriate size can be integrated without loss of the vector's capacity for self-replication; vectors introduce foreign DNA into host cells, where the DNA can be reproduced in large quantities. Examples are plasmids, cosmids, and yeast artificial chromosomes; vectors are often recombinant molecules containing DNA sequences from several sources. (ORNL)

Closed, Partly Opened and Opened
For all-alpha structures, describes the extent to which the hydrophobic core is screened by the comprising alpha-helices. Opened means that there is space for at least one more helix to be easily attached to the core. (SCOP)

Code
See: genetic code (ORNL)

Codominance
Situation in which two different alleles for a genetic trait are both expressed.
See also: autosomal dominant, recessive gene (ORNL)

Codon
See: genetic code (ORNL)

Coisogenic or congenic
Nearly identical strains of an organism; they vary at only a single locus. (ORNL)

Comparative genomics
The study of human genetics by comparisons with model organisms such as mice, the fruit fly, and the bacterium E. coli. (ORNL)

Complementary DNA (cDNA)
DNA that is synthesized in the laboratory from a messenger RNA template. (ORNL)

Complementary sequence
Nucleic acid base sequence that can form a double-stranded structure with another DNA fragment by following base-pairing rules (A pairs with T and C with G). The complementary sequence to GTAC for example, is CATG. (ORNL)
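
The base-pairing rule can be applied mechanically, as in this minimal sketch (the function names are illustrative). Note that `complement` keeps the left-to-right order of the input, as in the GTAC/CATG example above; when both strands are written 5' to 3', the partner strand is the reverse complement.

```python
# A pairs with T, C pairs with G
_PAIRS = str.maketrans("ACGT", "TGCA")

def complement(seq):
    """Base-by-base complement, in the same order as the input."""
    return seq.upper().translate(_PAIRS)

def reverse_complement(seq):
    """Complementary strand read in its own 5'-to-3' direction."""
    return complement(seq)[::-1]
```

`complement("GTAC")` returns `"CATG"`, matching the definition. `reverse_complement("GTAC")` returns `"GTAC"` again, because this particular sequence is palindromic in the molecular-biology sense.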

Complex trait
Trait that has a genetic component that does not follow strict Mendelian inheritance. May involve the interaction of two or more genes or gene-environment interactions.
See also: Mendelian inheritance, additive genetic effects (ORNL)

Computational biology
See: bioinformatics (ORNL)

Confidentiality
In genetics, the expectation that genetic material and the information gained from testing that material will not be available without the donor's consent. (ORNL)

Congenital
Any trait present at birth, whether the result of a genetic or nongenetic factor.
See also: birth defect (ORNL)

Conservation
Changes at a specific position of an amino acid (or, less commonly, DNA) sequence that preserve the physico-chemical properties of the original residue. (NCBI)

Conserved sequence
A base sequence in a DNA molecule (or an amino acid sequence in a protein) that has remained essentially unchanged throughout evolution. (ORNL)

Contig
Group of cloned (copied) pieces of DNA representing overlapping regions of a particular chromosome. (ORNL)

Contig map
A map depicting the relative order of a linked library of overlapping clones representing a complete chromosomal segment. (ORNL)

Cosmid
Artificially constructed cloning vector containing the cos gene of phage lambda. Cosmids can be packaged in lambda phage particles for infection into E. coli; this permits cloning of larger DNA fragments (up to 45kb) than can be introduced into bacterial hosts in plasmid vectors. (ORNL)

Crossover
A connection that links secondary structures at the opposite ends of the structural core and goes across the surface of the domain. (SCOP)

Crossing over
The breaking during meiosis of one maternal and one paternal chromosome, the exchange of corresponding sections of DNA, and the rejoining of the chromosomes. This process can result in an exchange of alleles between chromosomes.
See also: recombination (ORNL)

Cytogenetics
The study of the physical appearance of chromosomes.
See also: karyotype (ORNL)

Cytological band
An area of the chromosome that stains differently from areas around it.
See also: cytological map (ORNL)

Cytological map
A type of chromosome map whereby genes are located on the basis of cytological findings obtained with the aid of chromosome mutations. (ORNL)

Cytoplasmic (uniparental) inheritance
See: cytoplasmic trait (ORNL)

Cytoplasmic trait
A genetic characteristic in which the genes are found outside the nucleus, in chloroplasts or mitochondria. Results in offspring inheriting genetic material from only one parent. (ORNL)

Cytosine (C)
A nitrogenous base, one member of the base pair GC (guanine and cytosine) in DNA.
See also: base pair, nucleotide (ORNL)

D

Data warehouse
A collection of databases, data tables, and mechanisms to access the data on a single subject. (ORNL)

Deletion
A loss of part of the DNA from a chromosome; can lead to a disease or abnormality.
See also: chromosome, mutation (ORNL)

Deletion map
A description of a specific chromosome that uses defined mutations --specific deleted areas in the genome-- as 'biochemical signposts,' or markers for specific areas. (ORNL)

Deoxyribonucleotide
See: nucleotide (ORNL)

Deoxyribose
A type of sugar that is one component of DNA (deoxyribonucleic acid). (ORNL)

Diploid
A full set of genetic material consisting of paired chromosomes, one from each parental set. Most animal cells except the gametes have a diploid set of chromosomes. The diploid human genome has 46 chromosomes.
See also: haploid (ORNL)

Directed evolution
A laboratory process used on isolated molecules or microbes to cause mutations and identify subsequent adaptations to novel environments. (ORNL)

Directed mutagenesis
Alteration of DNA at a specific site and its reinsertion into an organism to study any effects of the change. (ORNL)

Directed sequencing
Successively sequencing DNA from adjacent stretches of chromosome. (ORNL)

Disease-associated genes
Alleles carrying particular DNA sequences associated with the presence of disease. (ORNL)

DNA (deoxyribonucleic acid)
The molecule that encodes genetic information. DNA is a double-stranded molecule held together by weak bonds between base pairs of nucleotides. The four nucleotides in DNA contain the bases adenine (A), guanine (G), cytosine (C), and thymine (T). In nature, base pairs form only between A and T and between G and C; thus the base sequence of each single strand can be deduced from that of its partner. (ORNL)

DNA bank
A service that stores DNA extracted from blood samples or other human tissue. (ORNL)

DNA probe
See: probe (ORNL)

DNA repair genes
Genes encoding proteins that correct errors in the DNA sequence. (ORNL)

DNA replication
The use of existing DNA as a template for the synthesis of new DNA strands. In humans and other eukaryotes, replication occurs in the cell nucleus. (ORNL)

DNA sequence
The relative order of base pairs, whether in a DNA fragment, gene, chromosome, or an entire genome.
See also: base sequence analysis (ORNL)

Domain
A discrete portion of a protein assumed to fold independently of the rest of the protein and possessing its own function. (NCBI)

Domain
A discrete portion of a protein with its own function. The combination of domains in a single protein determines its overall function. (ORNL)

Domain
Conserved structural entities with distinctive secondary structure content and a hydrophobic core. In small disulphide-rich and Zn2+-binding or Ca2+-binding domains the hydrophobic core may be provided by cystines and metal ions, respectively.
Homologous domains with common functions usually show sequence similarities. (SMART)

Domain composition
Proteins with the same domain composition have at least one copy of each of the domains of the query. (SMART)

Domain organisation
Proteins having all the domains of the query in the same order (additional domains are allowed). (SMART)

Dominant
An allele that is almost always expressed, even if only one copy is present.
See also: gene, genome (ORNL)

Double helix
The twisted-ladder shape that two linear strands of DNA assume when complementary nucleotides on opposing strands bond together. (ORNL)

Draft sequence
The sequence generated by the HGP as of June 2000 that, while incomplete, offers a virtual road map to an estimated 95% of all human genes. Draft sequence data are mostly in the form of 10,000 base pair-sized fragments whose approximate chromosomal locations are known.
See also: sequencing, finished DNA sequence, working draft DNA sequence (ORNL)

DUST
A program for filtering low complexity regions from nucleic acid sequences. (NCBI)

E

E value
Expectation value. The number of different alignments with scores equivalent to or better than S that are expected to occur in a database search by chance. The lower the E value, the more significant the score. (NCBI)
This represents the number of sequences with a score greater than or equal to X that are expected purely by chance. The E-value connects the score ("X") of an alignment between a user-supplied sequence and a database sequence, generated by any algorithm, with the number of alignments with similar or greater scores that would be expected from a search of a random sequence database of equivalent size. Since version 2.0, E-values are calculated using Hidden Markov Models, leading to more accurate estimates than before. (SMART)
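
For BLAST-style searches, the relationship between score, search-space size, and E value is usually written E = K * m * n * exp(-lambda * S), or equivalently E = m * n * 2^(-S') for a bit score S'. The sketch below assumes that formula, with m the query length and n the database length; real programs use length-corrected "effective" values, which are ignored here, and the function names are illustrative.

```python
import math

def e_value_from_raw(raw_score, query_len, db_len, lam, k):
    """Expected number of chance alignments scoring at least raw_score."""
    return k * query_len * db_len * math.exp(-lam * raw_score)

def e_value_from_bits(bit_score, query_len, db_len):
    """Same quantity computed from a normalized bit score."""
    return query_len * db_len * 2.0 ** (-bit_score)
```

Each additional bit of score halves the E value, and doubling the database size doubles it, which is why the same alignment looks less significant against a larger database.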

Electrophoresis
A method of separating large molecules (such as DNA fragments or proteins) from a mixture of similar molecules. An electric current is passed through a medium containing the mixture, and each kind of molecule travels through the medium at a different rate, depending on its electrical charge and size. Agarose and acrylamide gels are the media commonly used for electrophoresis of proteins and nucleic acids. (ORNL)

Electroporation
A process using high-voltage current to make cell membranes permeable to allow the introduction of new DNA; commonly used in recombinant DNA technology.
See also: transfection (ORNL)

Embryonic stem (ES) cells
An embryonic cell that can replicate indefinitely, transform into other types of cells, and serve as a continuous source of new cells. (ORNL)

Endonuclease
See: restriction enzyme (ORNL)

Enzyme
A protein that acts as a catalyst, speeding the rate at which a biochemical reaction proceeds but not altering the direction or nature of the reaction. (ORNL)

Epistasis
One gene interferes with or prevents the expression of another gene located at a different locus. (ORNL)

Escherichia coli
Common bacterium that has been studied intensively by geneticists because of its small genome size, normal lack of pathogenicity, and ease of growth in the laboratory. (ORNL)

Eugenics
The study of improving a species by artificial selection; usually refers to the selective breeding of humans. (ORNL)

Eukaryote
Cell or organism with membrane-bound, structurally discrete nucleus and other well-developed subcellular compartments. Eukaryotes include all organisms except viruses, bacteria, and blue-green algae.
See also: prokaryote, chromosome. (ORNL)

Evolutionarily conserved
See: conserved sequence (ORNL)

Exogenous DNA
DNA originating outside an organism that has been introduced into the organism. (ORNL)

Exon
The protein-coding DNA sequence of a gene.
See also: intron (ORNL)

Exonuclease
An enzyme that cleaves nucleotides sequentially from free ends of a linear nucleic acid substrate. (ORNL)

Expressed gene
See: gene expression (ORNL)

Expressed sequence tag (EST)
A short strand of DNA that is a part of a cDNA molecule and can act as identifier of a gene. Used in locating and mapping genes.
See also: cDNA, sequence tagged site (ORNL)

F

FASTA
The first widely used algorithm for database similarity searching. The program looks for optimal local alignments by scanning the sequence for small matches called "words". Initially, the scores of segments in which there are multiple word hits are calculated ("init1"). Later the scores of several segments may be summed to generate an "initn" score. An optimized alignment that includes gaps is shown in the output as "opt". The sensitivity and speed of the search are inversely related and controlled by the "k-tup" variable which specifies the size of a "word". (Pearson and Lipman) (NCBI)
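
The word-scanning step can be sketched as a lookup table of k-tuples plus a count of hits per diagonal (subject position minus query position). This is a simplified illustration of the idea only: it counts exact word matches and does not compute the init1, initn, or opt scores. The function names are assumptions.

```python
from collections import Counter, defaultdict

def index_words(seq, ktup):
    """Map each word of length ktup to the positions where it occurs."""
    table = defaultdict(list)
    for pos in range(len(seq) - ktup + 1):
        table[seq[pos:pos + ktup]].append(pos)
    return table

def diagonal_hits(query, subject, ktup=2):
    """Count exact word matches on each diagonal (subject_pos - query_pos)."""
    table = index_words(query, ktup)
    diagonals = Counter()
    for j in range(len(subject) - ktup + 1):
        for i in table.get(subject[j:j + ktup], ()):
            diagonals[j - i] += 1
    return diagonals
```

A diagonal with many hits marks a segment worth scoring. A larger `ktup` yields fewer, more specific words, so the search is faster but less sensitive; a smaller `ktup` does the opposite.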

Filial generation (F1, F2)
Each generation of offspring in a breeding program, designated F1, F2, etc. (ORNL)

Filtering
Also known as Masking. The process of hiding regions of (nucleic acid or amino acid) sequence having characteristics that frequently lead to spurious high scores. See SEG and DUST. (NCBI)

Fingerprinting
In genetics, the identification of multiple specific alleles on a person's DNA to produce a unique identifier for that person.
See also: forensics (ORNL)

Finished DNA Sequence
High-quality, low error, gap-free DNA sequence of the human genome. Achieving this ultimate 2003 HGP goal requires additional sequencing to close gaps, reduce ambiguities, and allow for only a single error every 10,000 bases, the agreed-upon standard for HGP finished sequence.
See also: sequencing, draft sequence (ORNL)

Flow cytometry
Analysis of biological material by detection of the light-absorbing or fluorescing properties of cells or subcellular fractions (i.e., chromosomes) passing in a narrow stream through a laser beam. An absorbance or fluorescence profile of the sample is produced. Automated sorting devices, used to fractionate samples, sort successive droplets of the analyzed stream into different fractions depending on the fluorescence emitted by each droplet. (ORNL)

Flow karyotyping
Use of flow cytometry to analyze and separate chromosomes according to their DNA content. (ORNL)

Fluorescence in situ hybridization (FISH)
A physical mapping approach that uses fluorescein tags to detect hybridization of probes with metaphase chromosomes and with the less-condensed somatic interphase chromatin. (ORNL)

Folded leaf
A layer of alpha-helices wrapped around a single hydrophobic core but not with the simple geometry of a bundle. (SCOP)

Forensics
The use of DNA for identification. Examples include establishing paternity in child support cases, establishing the presence of a suspect at a crime scene, and identifying accident victims. (ORNL)

Fraternal twin
Siblings born at the same time as the result of fertilization of two ova by two sperm. They share the same genetic relationship to each other as any other siblings.
See also: identical twin (ORNL)

Full gene sequence
The complete order of bases in a gene. This order determines which protein a gene will produce. (ORNL)

Functional genomics
The study of genes, their resulting proteins, and the role played by the proteins in the body's biochemical processes. (ORNL)

G

Gamete
Mature male or female reproductive cell (sperm or ovum) with a haploid set of chromosomes (23 for humans). (ORNL)

Gap
A space introduced into an alignment to compensate for insertions and deletions in one sequence relative to another. To prevent the accumulation of too many gaps in an alignment, introduction of a gap causes the deduction of a fixed amount (the gap score) from the alignment score. Extension of the gap to encompass additional nucleotides or amino acids is also penalized in the scoring of an alignment. (NCBI)
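A minimal sketch of such a two-part (affine) gap penalty follows. The function name and the default costs are illustrative assumptions, and conventions differ between programs: some charge the extension cost for every gapped position, including the first.

```python
def gap_penalty(length, gap_open=11, gap_extend=1):
    """Penalty for a single gap spanning `length` positions.

    Assumed convention: the first gapped position costs `gap_open`,
    and each additional position costs `gap_extend`.
    """
    if length < 0:
        raise ValueError("gap length cannot be negative")
    if length == 0:
        return 0
    return gap_open + gap_extend * (length - 1)
```

With these assumed costs, one gap of three positions (13) is cheaper than three separate single-position gaps (33), which is how the scheme discourages excessively gapped alignments.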

Gap
A position in an alignment that represents a deletion within one sequence relative to another. Gap penalties are requirements for alignment algorithms in order to reduce excessively-gapped regions. Gaps in alignments represent insertions that usually occur in protruding loops or beta-bulges within protein structures. (SMART)

GC-rich area
Many DNA sequences carry long stretches of repeated G and C which often indicate a gene-rich region. (ORNL)

Gel electrophoresis
See: electrophoresis (ORNL)

Gene
The fundamental physical and functional unit of heredity. A gene is an ordered sequence of nucleotides located in a particular position on a particular chromosome that encodes a specific functional product (i.e., a protein or RNA molecule).
See also: gene expression (ORNL)

Gene amplification
Repeated copying of a piece of DNA; a characteristic of tumor cells.
See also: gene, oncogene (ORNL)

Gene chip technology
Development of cDNA microarrays from a large number of genes. Used to monitor and measure changes in gene expression for each gene represented on the chip. (ORNL)

Gene expression
The process by which a gene's coded information is converted into the structures present and operating in the cell. Expressed genes include those that are transcribed into mRNA and then translated into protein and those that are transcribed into RNA but not translated into protein (e.g., transfer and ribosomal RNAs). (ORNL)

Gene family
Group of closely related genes that make similar products. (ORNL)

Gene library
See: genomic library (ORNL)

Gene mapping
Determination of the relative positions of genes on a DNA molecule (chromosome or plasmid) and of the distance, in linkage units or physical units, between them. (ORNL)

Gene pool
All the variations of genes in a species.
See also: allele, gene, polymorphism (ORNL)

Gene prediction
Predictions of possible genes made by a computer program based on how well a stretch of DNA sequence matches known gene sequences. (ORNL)

Gene product
The biochemical material, either RNA or protein, resulting from expression of a gene. The amount of gene product is used to measure how active a gene is; abnormal amounts can be correlated with disease-causing alleles. (ORNL)

Gene testing
See: genetic testing, genetic screening (ORNL)

Gene therapy
An experimental procedure aimed at replacing, manipulating, or supplementing nonfunctional or misfunctioning genes with healthy genes.
See also: gene, inherit, somatic cell gene therapy, germ line gene therapy (ORNL)

Gene transfer
Incorporation of new DNA into an organism's cells, usually by a vector such as a modified virus. Used in gene therapy.
See also: mutation, gene therapy, vector (ORNL)

Genetic code
The sequence of nucleotides, coded in triplets (codons) along the mRNA, that determines the sequence of amino acids in protein synthesis. A gene's DNA sequence can be used to predict the mRNA sequence, and the genetic code can in turn be used to predict the amino acid sequence. (ORNL)
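The two prediction steps in this definition (DNA to mRNA, then mRNA to amino acids) can be sketched as below. The function names are illustrative assumptions; the table is the standard genetic code written in the usual U, C, A, G order, with `*` marking stop codons, and the input DNA is assumed to be the coding strand.

```python
BASES = "UCAG"
# Standard genetic code, one letter per codon in UCAG order; '*' is a stop codon.
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"

CODON_TABLE = {
    b1 + b2 + b3: AMINO_ACIDS[16 * i + 4 * j + k]
    for i, b1 in enumerate(BASES)
    for j, b2 in enumerate(BASES)
    for k, b3 in enumerate(BASES)
}


def transcribe(coding_dna):
    """Predict the mRNA from the coding (sense) DNA strand: T becomes U."""
    return coding_dna.upper().replace("T", "U")


def translate(mrna):
    """Read codons from position 0 and stop at the first stop codon."""
    protein = []
    for i in range(0, len(mrna) - 2, 3):
        amino_acid = CODON_TABLE[mrna[i:i + 3]]
        if amino_acid == "*":
            break
        protein.append(amino_acid)
    return "".join(protein)
```

For example, `translate(transcribe("ATGGCCTAA"))` reads AUG (M), GCC (A), then the stop codon UAA.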

Genetic counseling
Provides patients and their families with education and information about genetic-related conditions and helps them make informed decisions. (ORNL)

Genetic discrimination
Prejudice against those who have or are likely to develop an inherited disorder. (ORNL)

Genetic engineering
Altering the genetic material of cells or organisms to enable them to make new substances or perform new functions. (ORNL)

Genetic engineering technology
See: recombinant DNA technology (ORNL)

Genetic illness
Sickness, physical disability, or other disorder resulting from the inheritance of one or more deleterious alleles. (ORNL)

Genetic informatics
See: bioinformatics (ORNL)

Genetic map
See: linkage map (ORNL)

Genetic marker
A gene or other identifiable portion of DNA whose inheritance can be followed.
See also: chromosome, DNA, gene, inherit (ORNL)

Genetic material
See: genome (ORNL)

Genetic mosaic
An organism in which different cells contain different genetic sequences. This can be the result of a mutation during development or fusion of embryos at an early developmental stage. (ORNL)

Genetic polymorphism
Difference in DNA sequence among individuals, groups, or populations (e.g., genes for blue eyes versus brown eyes). (ORNL)

Genetic predisposition
Susceptibility to a genetic disease. May or may not result in actual development of the disease. (ORNL)

Genetic screening
Testing a group of people to identify individuals at high risk of having or passing on a specific genetic disorder. (ORNL)

Genetic testing
Analyzing an individual's genetic material to determine predisposition to a particular health condition or to confirm a diagnosis of genetic disease. (ORNL)

Genetics
The study of inheritance patterns of specific traits. (ORNL)

Genome
All the genetic material in the chromosomes of a particular organism; its size is generally given as its total number of base pairs. (ORNL)

Genome project
Research and technology-development effort aimed at mapping and sequencing the genome of human beings and certain model organisms.
See also: Human Genome Initiative (ORNL)

Genomic library
A collection of clones made from a set of randomly generated overlapping DNA fragments that represent the entire genome of an organism.
See also: library, arrayed library (ORNL)

Genomic sequence
See: DNA (ORNL)

Genomics
The study of genes and their function. (ORNL)

Genotype
The genetic constitution of an organism, as distinguished from its physical appearance (its phenotype). (ORNL)

Germ cell
Sperm and egg cells and their precursors. Germ cells are haploid and have only one set of chromosomes (23 in all), while all other cells have two copies (46 in all). (ORNL)

Germ line
The continuation of a set of genetic information from one generation to the next.
See also: inherit (ORNL)

Germ line gene therapy
An experimental process of inserting genes into germ cells or fertilized eggs to cause a genetic change that can be passed on to offspring. May be used to alleviate effects associated with a genetic disease.
See also: genomics, somatic cell gene therapy. (ORNL)

Germ line genetic mutation
See: mutation (ORNL)

Global Alignment
The alignment of two nucleic acid or protein sequences over their entire length. (NCBI)
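One common way to score a global alignment is dynamic programming in the style of Needleman and Wunsch, a technique this glossary does not name. The sketch below computes only the best score, not the alignment itself; the function name and the match, mismatch, and gap values are illustrative assumptions.

```python
def global_alignment_score(a, b, match=1, mismatch=-1, gap=-2):
    """Best score for aligning all of `a` against all of `b`.

    Uses a linear gap cost (every gapped position costs `gap`) and keeps
    only one previous row of the dynamic-programming table in memory.
    """
    # Row 0: aligning an empty prefix of `a` against prefixes of `b`.
    prev = [j * gap for j in range(len(b) + 1)]
    for i in range(1, len(a) + 1):
        curr = [i * gap]
        for j in range(1, len(b) + 1):
            pair = match if a[i - 1] == b[j - 1] else mismatch
            curr.append(max(prev[j - 1] + pair,   # align a[i-1] with b[j-1]
                            prev[j] + gap,        # gap in b
                            curr[j - 1] + gap))   # gap in a
        prev = curr
    return prev[-1]
```

Because both sequences must be consumed end to end, leading and trailing gaps are penalized here, which is what distinguishes a global alignment from a local one.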

Greek-key
A topology for a small number of beta sheet strands in which some interstrand connections go across the end of a barrel or, in a sandwich fold, between beta sheets. (SCOP)

Guanine (G)
A nitrogenous base, one member of the base pair GC (guanine and cytosine) in DNA.
See also: base pair, nucleotide (ORNL)

H

H
H is the relative entropy of the target and background residue frequencies (Karlin and Altschul, 1990). H can be thought of as a measure of the average information (in bits) available per position that distinguishes an alignment from chance. At high values of H, short alignments can be distinguished from chance, whereas at lower H values, a longer alignment may be necessary. (Altschul, 1991) (NCBI)

Haploid
A single set of chromosomes (half the full set of genetic material) present in the egg and sperm cells of animals and in the egg and pollen cells of plants. Human beings have 23 chromosomes in their reproductive cells.
See also: diploid (ORNL)

Haplotype
A way of denoting the collective genotype of a number of closely linked loci on a chromosome. (ORNL)

Hemizygous
Having only one copy of a particular gene. For example, in humans, males are hemizygous for genes found on the Y chromosome. (ORNL)

Hereditary cancer
Cancer that occurs due to the inheritance of an altered gene within a family.
See also: sporadic cancer (ORNL)

Heterozygosity
The presence of different alleles at one or more loci on homologous chromosomes. (ORNL)

Heterozygote
See: heterozygosity (ORNL)

Highly conserved sequence
DNA sequence that is very similar across several different types of organisms.
See also: gene, mutation (ORNL)

High-throughput sequencing
A fast method of determining the order of bases in DNA.
See also: sequencing (ORNL)

Homeobox
A short stretch of nucleotides whose base sequence is virtually identical in all the genes that contain it. Homeoboxes have been found in many organisms from fruit flies to human beings. In the fruit fly, a homeobox appears to determine when particular groups of genes are expressed during development. (ORNL)

Homolog
A member of a chromosome pair in diploid organisms or a gene that has the same origin and functions in two or more species. (ORNL)

Homologous chromosome
Chromosome containing the same linear gene sequences as another, each derived from one parent. (ORNL)

Homologous recombination
Swapping of DNA fragments between paired chromosomes. (ORNL)

Homology
Similarity attributed to descent from a common ancestor. (NCBI)

Homology
Similarity in DNA or protein sequences between individuals of the same species or among different species. (ORNL)

Homology
Evolutionary descent from a common ancestor due to gene duplication. (SMART)

Homozygote
An organism that has two identical alleles of a gene.
See also: heterozygote (ORNL)

Homozygous
See: homozygote (ORNL)

HSP
High-scoring segment pair. Local alignments with no gaps that achieve one of the top alignment scores in a given search. (NCBI)

Human gene therapy
See: gene therapy (ORNL)

Human Genome Initiative
Collective name for several projects begun in 1986 by DOE to create an ordered set of DNA segments from known chromosomal locations, develop new computational methods for analyzing genetic map and DNA sequence data, and develop new techniques and instruments for detecting and analyzing DNA. This DOE initiative is now known as the Human Genome Program. The joint national effort, led by DOE and NIH, is known as the Human Genome Project. (ORNL)

Human Genome Project (HGP)
Formerly titled Human Genome Initiative.
See also: Human Genome Initiative (ORNL)

Hybrid
The offspring of genetically different parents.
See also: heterozygote (ORNL)

Hybridization
The process of joining two complementary strands of DNA or one each of DNA and RNA to form a double-stranded molecule. (ORNL)

I

Identical twin
Twins produced by the division of a single zygote; both have identical genotypes.
See also: fraternal twin (ORNL)

Identity
The extent to which two (nucleotide or amino acid) sequences are invariant. (NCBI)
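Identity is often reported as a percentage over an alignment. The sketch below assumes two already-aligned sequences of equal length, with `-` marking gaps, and divides by the full alignment length; other denominators (for example, the shorter sequence) are also in use, so the choice here is an assumption.

```python
def percent_identity(aligned_a, aligned_b):
    """Percentage of alignment columns holding the same residue in both rows."""
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have the same length")
    if not aligned_a:
        return 0.0
    # A column counts as identical only if both rows hold the same non-gap residue.
    matches = sum(
        1 for x, y in zip(aligned_a, aligned_b) if x == y and x != "-"
    )
    return 100.0 * matches / len(aligned_a)
```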

Immunotherapy
Using the immune system to treat disease, for example, in the development of vaccines. May also refer to the therapy of diseases caused by the immune system.
See also: cancer (ORNL)

Imprinting
A phenomenon in which the disease phenotype depends on which parent passed on the disease gene. For instance, both Prader-Willi and Angelman syndromes are inherited when the same part of chromosome 15 is missing. When the father's complement of 15 is missing, the child has Prader-Willi, but when the mother's complement of 15 is missing, the child has Angelman syndrome. (ORNL)

In situ hybridization
Use of a DNA or RNA probe to detect the presence of the complementary DNA sequence in cloned bacterial or cultured eukaryotic cells. (ORNL)

In vitro
Studies performed outside a living organism such as in a laboratory. (ORNL)

In vivo
Studies carried out in living organisms. (ORNL)

Independent assortment
During meiosis each of the two copies of a gene is distributed to the germ cells independently of the distribution of other genes.
See also: linkage (ORNL)

Informatics
See: bioinformatics (ORNL)

Informed consent
An individual willingly agrees to participate in an activity after first being advised of the risks and benefits.
See also: privacy (ORNL)

Inherit
In genetics, to receive genetic material from parents through biological processes. (ORNL)

Inherited
See: inherit (ORNL)

Insertion
A chromosome abnormality in which a piece of DNA is incorporated into a gene and thereby disrupts the gene's normal function.
See also: chromosome, DNA, gene, mutation (ORNL)

Insertional mutation
See: insertion (ORNL)

Intellectual property rights
Patents, copyrights, and trademarks.
See also: patent (ORNL)

Interference
One crossover event inhibits the chances of another crossover event. Also known as positive interference. Negative interference increases the chance of a second crossover.
See also: crossing over (ORNL)

Interphase
The period in the cell cycle when DNA is replicated in the nucleus; followed by mitosis. (ORNL)

Intracellular Domains
Domain families that are most prevalent in proteins within the cytoplasm. (SMART)

Intron
DNA sequence that interrupts the protein-coding sequence of a gene; an intron is transcribed into RNA but is cut out of the message before it is translated into protein.
See also: exon (ORNL)

Isoenzyme
An enzyme performing the same function as another enzyme but having a different set of amino acids. The two enzymes may function at different speeds. (ORNL)

J

Jelly-roll
A variant of Greek key topology with both ends of a sandwich or a barrel fold being crossed by two interstrand connections. (SCOP)

Junk DNA
Stretches of DNA that do not code for genes; most of the genome consists of so-called junk DNA which may have regulatory and other functions. Also called non-coding DNA. (ORNL)

K

K
A statistical parameter used in calculating BLAST scores that can be thought of as a natural scale for search space size. The value K is used in converting a raw score (S) to a bit score (S'). (NCBI)

Karyotype
A photomicrograph of an individual's chromosomes arranged in a standard format showing the number, size, and shape of each chromosome type; used in low-resolution physical mapping to correlate gross chromosomal abnormalities with the characteristics of specific diseases. (ORNL)

Kilobase (kb)
Unit of length for DNA fragments equal to 1000 nucleotides. (ORNL)

Knockout
Deactivation of specific genes; used in laboratory organisms to study gene function.
See also: gene, locus, model organisms (ORNL)

L

Lambda
A statistical parameter used in calculating BLAST scores that can be thought of as a natural scale for the scoring system. The value lambda is used in converting a raw score (S) to a bit score (S'). (NCBI)
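As an illustration of how lambda and K enter the conversion from a raw score to a bit score, the sketch below uses the Karlin-Altschul relation S' = (lambda * S - ln K) / ln 2. The function name is an assumption, and real values of lambda and K depend on the scoring matrix and gap costs in use.

```python
import math


def bit_score(raw_score, lam, k):
    """Convert a raw alignment score S to a bit score S'.

    S' = (lambda * S - ln K) / ln 2
    """
    if lam <= 0 or k <= 0:
        raise ValueError("lambda and K must be positive")
    return (lam * raw_score - math.log(k)) / math.log(2)
```

Bit scores are normalized in this way so that scores obtained with different scoring systems can be compared directly.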

Library
An unordered collection of clones (i.e., cloned DNA from a particular organism) whose relationship to each other can be established by physical mapping.
See also: genomic library, arrayed library (ORNL)

Linkage
The proximity of two or more markers (e.g., genes, RFLP markers) on a chromosome; the closer the markers, the lower the probability that they will be separated during DNA repair or replication processes (binary fission in prokaryotes, mitosis or meiosis in eukaryotes), and hence the greater the probability that they will be inherited together. (ORNL)

Linkage disequilibrium
Where alleles occur together more often than can be accounted for by chance. Indicates that the two alleles are physically close on the DNA strand.
See also: Mendelian inheritance (ORNL)

Linkage map
A map of the relative positions of genetic loci on a chromosome, determined on the basis of how often the loci are inherited together. Distance is measured in centimorgans (cM). (ORNL)

Local Alignment
The alignment of some portion of two nucleic acid or protein sequences. (NCBI)

Localisation
Numbers of domains that are thought from SwissProt annotations to be present in different cellular compartments (cytoplasm, extracellular space, nucleus, and membrane-associated) are shown in annotation pages. (SMART)

Localize
Determination of the original position (locus) of a gene or other marker on a chromosome. (ORNL)

Locus (pl. loci)
The position on a chromosome of a gene or other chromosome marker; also, the DNA at that position. The use of locus is sometimes restricted to mean expressed DNA regions.
See also: gene expression (ORNL)

Long-Range Restriction Mapping
Restriction enzymes are proteins that cut DNA at precise locations. Restriction maps depict the chromosomal positions of restriction-enzyme cutting sites. These are used as biochemical "signposts," or markers of specific areas along the chromosomes. The map will detail the positions where the DNA molecule is cut by particular restriction enzymes. (ORNL)

Low Complexity Region (LCR)
Regions of biased composition including homopolymeric runs, short-period repeats, and more subtle overrepresentation of one or a few residues. The SEG program is used to mask or filter LCRs in amino acid queries. The DUST program is used to mask or filter LCRs in nucleic acid queries. (NCBI)
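The simplest kind of low complexity region, a homopolymeric run, can be masked with a few lines of code. This is only a toy illustration and not the SEG or DUST algorithm; the function name, the run-length threshold, and the use of `X` as the mask character are assumptions.

```python
import re


def mask_homopolymers(sequence, min_run=5, mask_char="X"):
    """Replace runs of one repeated residue (length >= min_run) with mask_char."""
    if min_run < 2:
        raise ValueError("min_run must be at least 2")
    # (.) captures one residue; \1{n,} requires at least n more copies of it.
    pattern = re.compile(rf"(.)\1{{{min_run - 1},}}")
    return pattern.sub(lambda m: mask_char * len(m.group()), sequence)
```

For example, `mask_homopolymers("ACGTTTTTTGA")` hides the run of six T residues and leaves the rest of the sequence unchanged.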

M

Macrorestriction map
Map depicting the order of and distance between sites at which restriction enzymes cleave chromosomes. (ORNL)

Mapping
See: gene mapping, linkage map, physical map (ORNL)

Mapping population
The group of related organisms used in constructing a genetic map. (ORNL)

Marker
See: genetic marker (ORNL)

Masking
Also known as Filtering. The removal of repeated or low complexity regions from a sequence in order to improve the sensitivity of sequence similarity searches performed with that sequence. (NCBI)

Mass spectrometry
A technique used to identify chemicals in a substance by their mass and charge. (ORNL)

Meander
A simple topology of a beta-sheet where any two consecutive strands are adjacent and antiparallel. (SCOP)

Megabase (Mb)
Unit of length for DNA fragments equal to 1 million nucleotides and roughly equal to 1 cM.
See also: centimorgan (ORNL)

Meiosis
The process of two consecutive cell divisions in the diploid progenitors of sex cells. Meiosis results in four rather than two daughter cells, each with a haploid set of chromosomes.
See also: mitosis (ORNL)

Mendelian inheritance
One method in which genetic traits are passed from parents to offspring. Named for Gregor Mendel, who first studied and recognized the existence of genes and this method of inheritance.
See also: autosomal dominant, recessive gene, sex-linked (ORNL)

Messenger RNA (mRNA)
RNA that serves as a template for protein synthesis.
See also: genetic code (ORNL)

Metaphase
A stage in mitosis or meiosis during which the chromosomes are aligned along the equatorial plane of the cell. (ORNL)

Microarray
Sets of miniaturized chemical reaction areas that may also be used to test DNA fragments, antibodies, or proteins. (ORNL)

Microbial genetics
The study of genes and gene function in bacteria, archaea, and other microorganisms. Often used in research in the fields of bioremediation, alternative energy, and disease prevention.
See also: model organisms, biotechnology, bioremediation (ORNL)

Microinjection
A technique for introducing a solution of DNA into a cell using a fine microcapillary pipet. (ORNL)

Mitochondrial DNA
The genetic material found in mitochondria, the organelles that generate energy for the cell. Not inherited in the same fashion as nuclear DNA.
See also: cell, DNA, genome, nucleus (ORNL)

Mitosis
The process of nuclear division in cells that produces daughter cells that are genetically identical to each other and to the parent cell.
See also: meiosis (ORNL)

Model organisms
A laboratory animal or other organism useful for research. (ORNL)

Modeling
The use of statistical analysis, computer analysis, or model organisms to predict outcomes of research. (ORNL)

Molecular biology
The study of the structure, function, and makeup of biologically important molecules. (ORNL)

Molecular farming
The development of transgenic animals to produce human proteins for medical use. (ORNL)

Molecular genetics
The study of macromolecules important in biological inheritance. (ORNL)

Molecular medicine
The treatment of injury or disease at the molecular level. Examples include the use of DNA-based diagnostic tests or medicine derived from DNA sequence information. (ORNL)

Monogenic disorder
A disorder caused by mutation of a single gene.
See also: mutation, polygenic disorder (ORNL)

Monogenic inheritance
See: monogenic disorder (ORNL)

Monosomy
Possessing only one copy of a particular chromosome instead of the normal two copies.
See also: cell, chromosome, gene expression, trisomy (ORNL)

Morbid map
A diagram showing the chromosomal location of genes associated with disease. (ORNL)

Motif
A short conserved region in a protein sequence. Motifs are frequently highly conserved parts of domains. (NCBI)

Motif
Sequence motifs are short conserved regions of polypeptides. Sets of sequence motifs need not necessarily represent homologues. (SMART)

Mouse model
See: model organisms (ORNL)

Multifactorial or Multigenic Disorder
See: polygenic disorder (ORNL)

Multiple Sequence Alignment
An alignment of three or more sequences with gaps inserted in the sequences such that residues with common structural positions and/or ancestral residues are aligned in the same column. Clustal W is one of the most widely used multiple sequence alignment programs. (NCBI)

Multiplexing
A laboratory approach that performs multiple sets of reactions in parallel (simultaneously), greatly increasing speed and throughput. (ORNL)

Murine
Organism in the genus Mus. A rat or mouse. (ORNL)

Mutagen
An agent that causes a permanent genetic change in a cell. Does not include changes occurring during normal genetic recombination. (ORNL)

Mutagenicity
The capacity of a chemical or physical agent to cause permanent genetic alterations.
See also: somatic cell genetic mutation (ORNL)

Mutation
Any heritable change in DNA sequence.
See also: polymorphism (ORNL)

N

Nitrogenous base
A nitrogen-containing molecule having the chemical properties of a base. DNA contains the nitrogenous bases adenine (A), guanine (G), cytosine (C), and thymine (T).
See also: DNA (ORNL)

Northern blot
A gel-based laboratory procedure that locates mRNA sequences on a gel that are complementary to a piece of DNA used as a probe.
See also: DNA, library (ORNL)

Nuclear transfer
A laboratory procedure in which a cell's nucleus is removed and placed into an oocyte with its own nucleus removed so the genetic information from the donor nucleus controls the resulting cell. Such cells can be induced to form embryos. This process was used to create the cloned sheep "Dolly".
See also: cloning (ORNL)

Nucleic acid
A large molecule composed of nucleotide subunits.
See also: DNA (ORNL)

Nucleolar organizing region
A part of the chromosome containing rRNA genes. (ORNL)

Nucleotide
A subunit of DNA or RNA consisting of a nitrogenous base (adenine, guanine, thymine, or cytosine in DNA; adenine, guanine, uracil, or cytosine in RNA), a phosphate molecule, and a sugar molecule (deoxyribose in DNA and ribose in RNA). Thousands of nucleotides are linked to form a DNA or RNA molecule.
See also: DNA, base pair, RNA (ORNL)

Nucleus
The cellular organelle in eukaryotes that contains most of the genetic material. (ORNL)

O

Oligo
See: oligonucleotide (ORNL)

Oligogenic
A phenotypic trait produced by two or more genes working together.
See also: polygenic disorder (ORNL)

Oligonucleotide
A molecule usually composed of 25 or fewer nucleotides; used as a DNA synthesis primer.
See also: nucleotide (ORNL)

Oncogene
A gene, one or more forms of which is associated with cancer. Many oncogenes are involved, directly or indirectly, in controlling the rate of cell growth. (ORNL)

Open reading frame (ORF)
The sequence of DNA or RNA located between the start-code sequence (initiation codon) and the stop-code sequence (termination codon). (ORNL)
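A minimal ORF scan over a DNA string might look like the sketch below. It is simplified on purpose: it searches only the forward strand, reports each ORF with its start and stop codons included, and uses the standard start codon ATG and stop codons TAA, TAG, and TGA. The function name and the returned format are assumptions.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}


def find_orfs(dna):
    """Return (start_index, orf_sequence) pairs for the three forward frames."""
    dna = dna.upper()
    orfs = []
    for frame in range(3):
        start = None
        # Step through the sequence one codon at a time in this frame.
        for i in range(frame, len(dna) - 2, 3):
            codon = dna[i:i + 3]
            if start is None and codon == "ATG":
                start = i                          # first start codon opens an ORF
            elif start is not None and codon in STOP_CODONS:
                orfs.append((start, dna[start:i + 3]))
                start = None                       # look for the next ORF
    return orfs
```

A fuller search would also scan the reverse complement and usually applies a minimum length, since short ORFs occur frequently by chance.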

Operon
A set of genes transcribed under the control of an operator gene. (ORNL)

Optimal Alignment
An alignment of two sequences with the highest possible score. (NCBI)

ORF
Open reading frame. (SMART)

Orthologous
Homologous sequences in different species that arose from a common ancestral gene during speciation; may or may not be responsible for a similar function. (NCBI)

Overlapping clones
See: genomic library (ORNL)


P

P value
The probability of an alignment occurring with the score in question or better. The P value is calculated by relating the observed alignment score, S, to the expected distribution of HSP scores from comparisons of random sequences of the same length and composition as the query to the database. The most highly significant P values will be those close to 0. P values and E values are different ways of representing the significance of the alignment. (NCBI)
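The link between the two measures can be sketched under the usual assumption that the number of chance hits follows a Poisson distribution, which gives P = 1 - e^(-E). The function name is illustrative.

```python
import math


def p_value_from_e_value(e_value):
    """Probability of at least one chance hit, given the expected number E.

    Assumes chance hits are Poisson distributed: P = 1 - exp(-E).
    """
    if e_value < 0:
        raise ValueError("E value cannot be negative")
    # expm1 keeps precision for very small E, where P is nearly equal to E.
    return -math.expm1(-e_value)
```

For small values the two numbers are almost the same (E = 0.001 gives P of about 0.0009995), while P levels off toward 1 as E grows.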

P1-derived artificial chromosome (PAC)
One type of vector used to clone DNA fragments (100- to 300-kb insert size; average, 150 kb) in Escherichia coli cells. Based on bacteriophage (a virus) P1 genome.
See also: cloning vector (ORNL)

PAM
Point Accepted Mutation. A unit introduced by Dayhoff et al. to quantify the amount of evolutionary change in a protein sequence. One PAM unit is the amount of evolution that will change, on average, 1% of the amino acids in a protein sequence. A PAM(x) substitution matrix is a look-up table in which scores for each amino acid substitution have been calculated based on the frequency of that substitution in closely related proteins that have experienced a certain amount (x) of evolutionary divergence. (NCBI)

Paralogous
Homologous sequences within a single species that arose by gene duplication. (NCBI)

Partly open barrel
Has the edge strands not properly hydrogen bonded because one of the strands is in two parts connected with a linker of more than one residue. These edge strands can be treated as a single but interrupted strand, allowing classification with the effective strand and shear numbers, n* and S*. In the few open barrels the beta sheets are connected by only a few side-chain hydrogen bonds between the edge strands. (SCOP)

Patent
In genetics, conferring the right or title to genes, gene variations, or identifiable portions of sequenced genetic material to an individual or organization.
See also: gene (ORNL)

Pedigree
A family tree diagram that shows how a particular genetic trait or disease has been inherited.
See also: inherit (ORNL)

Penetrance
The probability of a gene or genetic trait being expressed. "Complete" penetrance means the gene or genes for a trait are expressed in all the population who have the genes. "Incomplete" penetrance means the genetic trait is expressed in only part of the population. The percent penetrance also may change with the age range of the population. (ORNL)

Peptide
Two or more amino acids joined by a bond called a "peptide bond."
See also: polypeptide (ORNL)

Phage
A virus for which the natural host is a bacterial cell. (ORNL)

Pharmacogenomics
The study of the interaction of an individual's genetic makeup and response to a drug. (ORNL)

Phenocopy
A trait that is not caused by inheritance of a gene but appears to be identical to a genetic trait. (ORNL)

Phenotype
The physical characteristics of an organism or the presence of a disease that may or may not be genetic.
See also: genotype (ORNL)

Physical map
A map of the locations of identifiable landmarks on DNA (e.g., restriction-enzyme cutting sites, genes), regardless of inheritance. Distance is measured in base pairs. For the human genome, the lowest-resolution physical map is the banding patterns on the 24 different chromosomes; the highest-resolution map is the complete nucleotide sequence of the chromosomes. (ORNL)

Plasmid
Autonomously replicating extra-chromosomal circular DNA molecules, distinct from the normal bacterial genome and nonessential for cell survival under nonselective conditions. Some plasmids are capable of integrating into the host genome. A number of artificially constructed plasmids are used as cloning vectors. (ORNL)

Pleiotropy
One gene that causes many different physical traits such as multiple disease symptoms. (ORNL)

Pluripotency
The potential of a cell to develop into more than one type of mature cell, depending on environment. (ORNL)

Polygenic disorder
Genetic disorder resulting from the combined action of alleles of more than one gene (e.g., heart disease, diabetes, and some cancers). Although such disorders are inherited, they depend on the simultaneous presence of several alleles; thus the hereditary patterns usually are more complex than those of single-gene disorders.
See also: single-gene disorder (ORNL)

Polymerase chain reaction (PCR)
A method for amplifying a DNA base sequence using a heat-stable polymerase and two 20-base primers, one complementary to the (+) strand at one end of the sequence to be amplified and one complementary to the (-) strand at the other end. Because the newly synthesized DNA strands can subsequently serve as additional templates for the same primer sequences, successive rounds of primer annealing, strand elongation, and dissociation produce rapid and highly specific amplification of the desired sequence. PCR also can be used to detect the existence of the defined sequence in a DNA sample. (ORNL)
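Because each round can copy every existing template, the amount of target sequence grows geometrically. The sketch below is an idealized model, assuming a constant per-cycle efficiency between 0 and 1 (1.0 means perfect doubling); the function name and parameters are illustrative, and real reactions plateau as reagents run out.

```python
def pcr_copies(initial_copies, cycles, efficiency=1.0):
    """Idealized copy number after a given number of PCR cycles.

    copies = initial_copies * (1 + efficiency) ** cycles
    """
    if not 0.0 <= efficiency <= 1.0:
        raise ValueError("efficiency must be between 0 and 1")
    if cycles < 0:
        raise ValueError("cycles cannot be negative")
    return initial_copies * (1.0 + efficiency) ** cycles
```

Under perfect doubling, a single template yields 2**30 (about a billion) copies after 30 cycles, which is why PCR can detect a defined sequence in a very small DNA sample.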

Polymerase, DNA or RNA
Enzyme that catalyzes the synthesis of nucleic acids on preexisting nucleic acid templates, assembling RNA from ribonucleotides or DNA from deoxyribonucleotides. (ORNL)

Polymorphism
Difference in DNA sequence among individuals that may underlie differences in health. Genetic variations occurring in more than 1% of a population would be considered useful polymorphisms for genetic linkage analysis.
See also: mutation (ORNL)

Polypeptide
A protein or part of a protein made of a chain of amino acids joined by peptide bonds. (ORNL)

Population genetics
The study of variation in genes among a group of individuals. (ORNL)

Positional cloning
A technique used to identify genes, usually those that are associated with diseases, based on their location on a chromosome. (ORNL)

Primer
Short preexisting polynucleotide chain to which new deoxyribonucleotides can be added by DNA polymerase. (ORNL)

Privacy
In genetics, the right of people to restrict access to their genetic information. (ORNL)

Probe
Single-stranded DNA or RNA molecules of specific base sequence, labeled either radioactively or immunologically, that are used to detect the complementary base sequence by hybridization. (ORNL)

Profile
A table that lists the frequencies of each amino acid in each position of protein sequence. Frequencies are calculated from multiple alignments of sequences containing a domain of interest. See also PSSM. (NCBI)
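
A frequency table of this kind can be sketched in a few lines of Python. The function name and the toy alignment are hypothetical, and the sketch assumes an ungapped alignment in which all sequences have the same length.

```python
from collections import Counter


def frequency_profile(alignment):
    """Return one {amino acid: frequency} dict per alignment column."""
    length = len(alignment[0])
    if any(len(seq) != length for seq in alignment):
        raise ValueError("all aligned sequences must have the same length")
    columns = []
    for i in range(length):
        counts = Counter(seq[i] for seq in alignment)
        columns.append({aa: n / len(alignment) for aa, n in counts.items()})
    return columns


# Toy alignment of four three-residue sequences (made up for illustration).
for position, column in enumerate(frequency_profile(["ACD", "ACE", "GCD", "ACD"])):
    print(position, column)
```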

Profile
A profile is a table of position-specific scores and gap penalties, representing a homologous family, that may be used to search sequence databases. In CLUSTAL-W-derived profiles, those sequences that are more distantly related are assigned higher weights. (SMART)

Prokaryote
Cell or organism lacking a membrane-bound, structurally discrete nucleus and other subcellular compartments. Bacteria are examples of prokaryotes.
See also: chromosome, eukaryote (ORNL)

Promoter
A DNA site to which RNA polymerase will bind and initiate transcription. (ORNL)

Pronucleus
The nucleus of a sperm or egg prior to fertilization.
See also: nucleus, transgenic (ORNL)

Protein
A large molecule composed of one or more chains of amino acids in a specific order; the order is determined by the base sequence of nucleotides in the gene that codes for the protein. Proteins are required for the structure, function, and regulation of the body's cells, tissues, and organs; and each protein has unique functions. Examples are hormones, enzymes, and antibodies. (ORNL)

Proteome
Proteins expressed by a cell or organ at a particular time and under specific conditions. (ORNL)

Proteomics
Systematic analysis of protein expression of normal and diseased tissues that involves the separation, identification and characterization of all of the proteins in an organism. (NCBI)

Pseudogene
A sequence of DNA similar to a gene but nonfunctional; probably the remnant of a once-functional gene that accumulated mutations. (ORNL)

PSI-BLAST
Position-Specific Iterative BLAST. An iterative search using the BLAST algorithm. A profile is built after the initial search and is then used in subsequent searches. The process may be repeated, if desired, with new sequences found in each cycle used to refine the profile. Details can be found in this discussion of PSI-BLAST. (Altschul et al.) (NCBI)

PSSM
Position-specific scoring matrix; see profile. The PSSM gives the log-odds score for finding a particular matching amino acid in a target sequence. (NCBI)
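
To make the log-odds idea concrete, here is a minimal Python sketch that builds a PSSM from an ungapped alignment and scores a target sequence against it. The uniform background frequency (1/20), the pseudocount of 1, and the function names are assumptions made for illustration; production tools use empirically derived background frequencies and more elaborate weighting.

```python
import math
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def build_pssm(alignment, pseudocount=1.0):
    """Return one {amino acid: log2-odds score} dict per alignment column."""
    background = 1 / len(AMINO_ACIDS)  # assumed uniform background
    n = len(alignment)
    pssm = []
    for i in range(len(alignment[0])):
        counts = Counter(seq[i] for seq in alignment)
        column = {}
        for aa in AMINO_ACIDS:
            # The pseudocount keeps unseen residues from scoring minus infinity.
            freq = (counts[aa] + pseudocount) / (n + pseudocount * len(AMINO_ACIDS))
            column[aa] = math.log2(freq / background)
        pssm.append(column)
    return pssm


def pssm_score(pssm, target):
    """Sum the per-position log-odds scores for an equal-length target."""
    return sum(column[aa] for column, aa in zip(pssm, target))


pssm = build_pssm(["ACD", "ACE", "GCD", "ACD"])
print(pssm_score(pssm, "ACD"), pssm_score(pssm, "WWW"))
```

A positive score at a position means the residue is seen more often there than the background would predict; a negative score means it is seen less often.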

Purine
A nitrogen-containing, double-ring, basic compound that occurs in nucleic acids. The purines in DNA and RNA are adenine and guanine.
See also: base pair (ORNL)

Pyrimidine
A nitrogen-containing, single-ring, basic compound that occurs in nucleic acids. The pyrimidines in DNA are cytosine and thymine; in RNA, cytosine and uracil.
See also: base pair (ORNL)

Q

Query
The input sequence (or other type of search term) with which all of the entries in a database are to be compared. (NCBI)

R

Radiation hybrid
A hybrid cell containing small fragments of irradiated human chromosomes. Maps of irradiation sites on chromosomes for the human, rat, mouse, and other genomes provide important markers, allowing the construction of very precise STS maps indispensable to studying multifactorial diseases.
See also: sequence tagged site (ORNL)

Rare-cutter enzyme
See: restriction-enzyme cutting site (ORNL)

Raw Score
The score of an alignment, S, calculated as the sum of substitution and gap scores. Substitution scores are given by a look-up table (see PAM, BLOSUM). Gap scores are typically calculated from G, the gap opening penalty, and L, the gap extension penalty. For a gap of length n, the gap cost would be G+Ln. The choice of gap costs G and L is empirical, but it is customary to choose a high value for G (10-15) and a low value for L (1-2). (NCBI)
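
The arithmetic can be sketched as follows in Python. The toy match/mismatch scores (+4/-1) stand in for a real PAM or BLOSUM look-up table and are purely an assumption, as are the function names and the default penalties G=11 and L=1, which were simply chosen from within the customary ranges.

```python
def gap_cost(n: int, G: int = 11, L: int = 1) -> int:
    """Cost of a single gap of length n: G + L*n."""
    return G + L * n


def toy_substitution(x: str, y: str) -> int:
    """Stand-in for a PAM/BLOSUM look-up: +4 for a match, -1 otherwise."""
    return 4 if x == y else -1


def raw_score(a: str, b: str, G: int = 11, L: int = 1) -> int:
    """Raw score S of two aligned strings that use '-' for gap positions."""
    if len(a) != len(b):
        raise ValueError("aligned sequences must have the same length")
    score = 0
    gap_len = 0
    gap_side = None  # which sequence the currently open gap belongs to
    for x, y in zip(a, b):
        if x == "-" or y == "-":
            side = "a" if x == "-" else "b"
            if gap_len and side != gap_side:
                # A gap in the other sequence starts: close the previous one.
                score -= gap_cost(gap_len, G, L)
                gap_len = 0
            gap_side = side
            gap_len += 1
        else:
            if gap_len:
                score -= gap_cost(gap_len, G, L)
                gap_len = 0
            score += toy_substitution(x, y)
    if gap_len:
        score -= gap_cost(gap_len, G, L)
    return score


# Four matches (4 * 4 = 16) minus one gap of length 2 (11 + 1*2 = 13).
print(raw_score("ACGTTA", "AC--TA"))
```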

Recessive gene
A gene which will be expressed only if there are 2 identical copies or, for a male, if one copy is present on the X chromosome. (ORNL)

Reciprocal translocation
When a pair of chromosomes exchange exactly the same length and area of DNA. Results in a shuffling of genes. (ORNL)

Recombinant clone
Clone containing recombinant DNA molecules.
See also: recombinant DNA technology (ORNL)

Recombinant DNA molecules
A combination of DNA molecules of different origin that are joined using recombinant DNA technologies. (ORNL)

Recombinant DNA technology
Procedure used to join together DNA segments in a cell-free system (an environment outside a cell or organism). Under appropriate conditions, a recombinant DNA molecule can enter a cell and replicate there, either autonomously or after it has become integrated into a cellular chromosome. (ORNL)

Recombination
The process by which progeny derive a combination of genes different from that of either parent. In higher organisms, this can occur by crossing over.
See also: crossing over, mutation (ORNL)

Regulatory region or sequence
A DNA base sequence that controls gene expression. (ORNL)

Repetitive DNA
Sequences of varying lengths that occur in multiple copies in the genome; it represents much of the human genome. (ORNL)

Reporter gene
See: marker (ORNL)

Resolution
Degree of molecular detail on a physical map of DNA, ranging from low to high. (ORNL)

Restriction enzyme, endonuclease
A protein that recognizes specific, short nucleotide sequences and cuts DNA at those sites. Bacteria contain over 400 such enzymes that recognize and cut more than 100 different DNA sequences.
See also: restriction enzyme cutting site (ORNL)

Restriction fragment length polymorphism (RFLP)
Variation between individuals in DNA fragment sizes cut by specific restriction enzymes; polymorphic sequences that result in RFLPs are used as markers on both physical maps and genetic linkage maps. RFLPs usually are caused by mutation at a cutting site.
See also: marker, polymorphism (ORNL)

Restriction-enzyme cutting site
A specific nucleotide sequence of DNA at which a particular restriction enzyme cuts the DNA. Some sites occur frequently in DNA (e.g., every several hundred base pairs); others much less frequently (rare-cutter; e.g., every 10,000 base pairs). (ORNL)
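
As an illustration, the Python sketch below finds every occurrence of a recognition sequence in a linear DNA string and returns the resulting fragments. It uses the EcoRI site GAATTC, which is cut after the first G; the function names and the demo sequences are made up, and only one strand is modeled.

```python
def find_sites(dna: str, site: str):
    """Return the start index of every (possibly overlapping) occurrence of site."""
    positions = []
    i = dna.find(site)
    while i != -1:
        positions.append(i)
        i = dna.find(site, i + 1)
    return positions


def digest(dna: str, site: str, cut_offset: int):
    """Cut linear DNA at each site; cut_offset is the cut position within the site."""
    cuts = [p + cut_offset for p in find_sites(dna, site)]
    bounds = [0] + cuts + [len(dna)]
    return [dna[start:end] for start, end in zip(bounds, bounds[1:])]


normal = "AAGAATTCCCCGAATTCTT"
variant = "AAGAATTCCCCGAGTTCTT"  # one base changed inside the second site

print([len(f) for f in digest(normal, "GAATTC", 1)])
print([len(f) for f in digest(variant, "GAATTC", 1)])
```

The variant sequence loses one cutting site, so it yields fewer and longer fragments, which is the kind of difference detected as a restriction fragment length polymorphism.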

Retroviral infection
The presence of retroviral vectors, such as some viruses, which use their recombinant DNA to insert their genetic material into the chromosomes of the host's cells. The virus is then propagated by the host cell. (ORNL)

Reverse transcriptase
An enzyme used by retroviruses to form a complementary DNA sequence (cDNA) from their RNA. The resulting DNA is then inserted into the chromosome of the host cell. (ORNL)

Ribonucleotide
See: nucleotide (ORNL)

Ribose
The five-carbon sugar that serves as a component of RNA.
See also: RNA, deoxyribose (ORNL)

Ribosomal RNA (rRNA)
A class of RNA found in the ribosomes of cells. (ORNL)

Ribosomes
Small cellular components composed of specialized ribosomal RNA and protein; site of protein synthesis.
See also: RNA (ORNL)

Risk communication
In genetics, a process in which a genetic counselor or other medical professional interprets genetic test results and advises patients of the consequences for them and their offspring. (ORNL)

RNA (Ribonucleic acid)
A chemical found in the nucleus and cytoplasm of cells; it plays an important role in protein synthesis and other chemical activities of the cell. The structure of RNA is similar to that of DNA. There are several classes of RNA molecules, including messenger RNA, transfer RNA, ribosomal RNA, and other small RNAs, each serving a different purpose. (ORNL)

S

Sanger sequencing
A widely used method of determining the order of bases in DNA.
See also: sequencing, shotgun sequencing (ORNL)

Satellite
A chromosomal segment that branches off from the rest of the chromosome but is still connected by a thin filament or stalk. (ORNL)

Scaffold
In genomic mapping, a series of contigs that are in the right order but not necessarily connected in one continuous stretch of sequence. (ORNL)

Seed Alignment
Alignment that contains only one of each pair of homologues that are represented in a CLUSTALW-derived phylogenetic tree linked by a branch of length less than a distance of 0.2 (see the related article). (SMART)

SEG
A program for filtering low complexity regions in amino acid sequences. Residues that have been masked are represented as "X" in an alignment. SEG filtering is performed by default in the blastp subroutine of BLAST 2.0. (Wootton and Federhen) (NCBI)

Segregation
The normal biological process whereby the two pieces of a chromosome pair are separated during meiosis and randomly distributed to the germ cells. (ORNL)

Sequence
See: base sequence (ORNL)

Sequence assembly
A process whereby the order of multiple sequenced DNA fragments is determined. (ORNL)

Sequence tagged site (STS)
Short (200 to 500 base pairs) DNA sequence that has a single occurrence in the human genome and whose location and base sequence are known. Detectable by polymerase chain reaction, STSs are useful for localizing and orienting the mapping and sequence data reported from many different laboratories and serve as landmarks on the developing physical map of the human genome. Expressed sequence tags (ESTs) are STSs derived from cDNAs. (ORNL)

Sequencing
Determination of the order of nucleotides (base sequences) in a DNA or RNA molecule or the order of amino acids in a protein. (ORNL)

Sequencing technology
The instrumentation and procedures used to determine the order of nucleotides in DNA. (ORNL)

Sex chromosome
The X or Y chromosome in human beings that determines the sex of an individual. Females have two X chromosomes in diploid cells; males have an X and a Y chromosome. The sex chromosomes comprise the 23rd chromosome pair in a karyotype.
See also: autosome (ORNL)

Sex-linked
Traits or diseases associated with the X or Y chromosome; generally seen in males.
See also: gene, mutation, sex chromosome (ORNL)

Shotgun method
Sequencing method that involves randomly sequenced cloned pieces of the genome, with no foreknowledge of where the piece originally came from. This can be contrasted with "directed" strategies, in which pieces of DNA from known chromosomal locations are sequenced. Because there are advantages to both strategies, researchers use both random (or shotgun) and directed strategies in combination to sequence the human genome.
See also: library, genomic library (ORNL)

Similarity
The extent to which nucleotide or protein sequences are related. The extent of similarity between two sequences can be based on percent sequence identity and/or conservation. In BLAST similarity refers to a positive matrix score. (NCBI)
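
Percent sequence identity can be sketched in Python as below. The choice to ignore gapped columns in the denominator is an assumption made for this example; tools differ on this convention, and the function name is hypothetical.

```python
def percent_identity(a: str, b: str) -> float:
    """Percent of ungapped aligned positions at which a and b are identical."""
    if len(a) != len(b):
        raise ValueError("aligned sequences must have the same length")
    # Keep only columns where neither sequence has a gap character.
    aligned = [(x, y) for x, y in zip(a, b) if x != "-" and y != "-"]
    if not aligned:
        return 0.0
    matches = sum(x == y for x, y in aligned)
    return 100.0 * matches / len(aligned)


print(percent_identity("ACGT", "ACGA"))
```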

Single nucleotide polymorphism (SNP)
DNA sequence variations that occur when a single nucleotide (A, T, C, or G) in the genome sequence is altered.
See also: mutation, polymorphism, single-gene disorder (ORNL)

Single-gene disorder
Hereditary disorder caused by a mutant allele of a single gene (e.g., Duchenne muscular dystrophy, retinoblastoma, sickle cell disease).
See also: polygenic disorders (ORNL)

Somatic cell
Any cell in the body except gametes and their precursors.
See also: gamete (ORNL)

Somatic cell gene therapy
Incorporating new genetic material into cells for therapeutic purposes. The new genetic material cannot be passed to offspring.
See also: gene therapy (ORNL)

Somatic cell genetic mutation
A change in the genetic structure that is neither inherited nor passed to offspring. Also called acquired mutations.
See also: germ line genetic mutation (ORNL)

Southern blotting
Transfer by absorption of DNA fragments separated in electrophoretic gels to membrane filters for detection of specific base sequences by radio-labeled complementary probes. (ORNL)

Spectral karyotype (SKY)
A graphic of all an organism's chromosomes, each labeled with a different color. Useful for identifying chromosomal abnormalities.
See also: chromosome (ORNL)

Splice site
Location in the DNA sequence where RNA removes the noncoding areas to form a continuous gene transcript for translation into a protein. (ORNL)

Sporadic cancer
Cancer that occurs randomly and is not inherited from parents. Caused by DNA changes in one cell that grows and divides, spreading throughout the body.
See also: hereditary cancer (ORNL)

Stem cell
Undifferentiated, primitive cells in the bone marrow that have the ability both to multiply and to differentiate into specific blood cells. (ORNL)

Structural genomics
The effort to determine the 3D structures of large numbers of proteins using both experimental techniques and computer simulation. (ORNL)

Substitution
The presence of a non-identical amino acid at a given position in an alignment. If the aligned residues have similar physico-chemical properties the substitution is said to be "conservative". (NCBI)

Substitution
In genetics, a type of mutation due to replacement of one nucleotide in a DNA sequence by another nucleotide or replacement of one amino acid in a protein by another amino acid.
See also: mutation (ORNL)

Substitution Matrix
A substitution matrix contains values proportional to the probability that amino acid i mutates into amino acid j for all pairs of amino acids. Such matrices are constructed by assembling a large and diverse sample of verified pairwise alignments of amino acids. If the sample is large enough to be statistically significant, the resulting matrices should reflect the true probabilities of mutations occurring through a period of evolution. (NCBI)

Suppressor gene
A gene that can suppress the action of another gene. (ORNL)

Syndrome
The group or recognizable pattern of symptoms or abnormalities that indicate a particular trait or disease. (ORNL)

Syngeneic
Genetically identical members of the same species. (ORNL)

Synteny
Genes occurring in the same order on chromosomes of different species.
See also: linkage, conserved sequence (ORNL)

T

Tandem repeat sequences
Multiple copies of the same base sequence on a chromosome; used as markers in physical mapping.
See also: physical map (ORNL)

Targeted mutagenesis
Deliberate change in the genetic structure directed at a specific site on the chromosome. Used in research to determine the targeted region's function.
See also: mutation, polymorphism (ORNL)

Technology transfer
The process of transferring scientific findings from research laboratories to the commercial sector. (ORNL)

Telomerase
The enzyme that directs the replication of telomeres. (ORNL)

Telomere
The end of a chromosome. This specialized structure is involved in the replication and stability of linear DNA molecules.
See also: DNA replication (ORNL)

Teratogenic
Substances such as chemicals or radiation that cause abnormal development of an embryo.
See also: mutagen (ORNL)

Thymine (T)
A nitrogenous base, one member of the base pair AT (adenine-thymine).
See also: base pair, nucleotide (ORNL)

Toxicogenomics
The study of how genomes respond to environmental stressors or toxicants. Combines genome-wide mRNA expression profiling with protein expression patterns using bioinformatics to understand the role of gene-environment interactions in disease and dysfunction. (ORNL)

Transcription
The synthesis of an RNA copy from a sequence of DNA (a gene); the first step in gene expression.
See also: translation (ORNL)

Transcription factor
A protein that binds to regulatory regions and helps control gene expression. (ORNL)

Transcriptome
The full complement of activated genes, mRNAs, or transcripts in a particular tissue at a particular time. (ORNL)

Transfection
The introduction of foreign DNA into a host cell.
See also: cloning vector, gene therapy (ORNL)

Transfer RNA (tRNA)
A class of RNA having structures with triplet nucleotide sequences that are complementary to the triplet nucleotide coding sequences of mRNA. The role of tRNAs in protein synthesis is to bond with amino acids and transfer them to the ribosomes, where proteins are assembled according to the genetic code carried by mRNA. (ORNL)

Transformation
A process by which the genetic material carried by an individual cell is altered by incorporation of exogenous DNA into its genome. (ORNL)

Transgenic
An experimentally produced organism in which DNA has been artificially introduced and incorporated into the organism's germ line.
See also: cell, DNA, gene, nucleus, germ line (ORNL)

Translation
The process in which the genetic code carried by mRNA directs the synthesis of proteins from amino acids.
See also: transcription (ORNL)
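
The two steps of gene expression defined in this glossary, transcription and translation, can be sketched in Python as below. The sketch uses the standard genetic code, starts at the first AUG, and stops at the first in-frame stop codon; the function names are hypothetical, and real transcription and translation involve far more than string replacement and table look-up.

```python
from itertools import product

BASES = "UCAG"
# Standard genetic code, one letter per codon in UCAG order ('*' marks a stop).
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}


def transcribe(coding_strand: str) -> str:
    """Return the mRNA for a DNA coding strand (T replaced by U)."""
    return coding_strand.replace("T", "U")


def translate(mrna: str) -> str:
    """Translate from the first AUG to the first in-frame stop codon."""
    start = mrna.find("AUG")
    if start == -1:
        return ""
    protein = []
    for i in range(start, len(mrna) - 2, 3):
        amino_acid = CODON_TABLE[mrna[i:i + 3]]
        if amino_acid == "*":
            break
        protein.append(amino_acid)
    return "".join(protein)


mrna = transcribe("GGATGGCTTGGTAATT")
print(mrna, translate(mrna))
```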

Translocation
A mutation in which a large segment of one chromosome breaks off and attaches to another chromosome.
See also: mutation (ORNL)

Transposable element
A class of DNA sequences that can move from one chromosomal site to another. (ORNL)

Trisomy
Possessing three copies of a particular chromosome instead of the normal two copies.
See also: cell, gene, gene expression, chromosome (ORNL)

U

Unitary Matrix
Also known as Identity Matrix. A scoring system in which only identical characters receive a positive score. (NCBI)

Up-and-Down
The simplest topology for a helical bundle or folded leaf, in which consecutive helices are adjacent and antiparallel; it is approximately equivalent to the meander topology of a beta-sheet. (SCOP)

Uracil
A nitrogenous base normally found in RNA but not DNA; uracil is capable of forming a base pair with adenine.
See also: base pair, nucleotide (ORNL)

V

Vector
See: cloning vector (ORNL)

Virus
A noncellular biological entity that can reproduce only within a host cell. Viruses consist of nucleic acid covered by protein; some animal viruses are also surrounded by membrane. Inside the infected cell, the virus uses the synthetic capability of the host to produce progeny virus.
See also: cloning vector (ORNL)

W

Western blot
A technique used to identify and locate proteins based on their ability to bind to specific antibodies.
See also: DNA, Northern blot, protein, RNA, Southern blotting (ORNL)

Wild type
The form of an organism that occurs most frequently in nature. (ORNL)

Working Draft DNA Sequence
See: Draft DNA Sequence (ORNL)

X

X chromosome
One of the two sex chromosomes, X and Y.
See also: Y chromosome, sex chromosome (ORNL)

Xenograft
Tissue or organs from an individual of one species transplanted into or grafted onto an organism of another species, genus, or family. A common example is the use of pig heart valves in humans. (ORNL)

Y

Y chromosome
One of the two sex chromosomes, X and Y.
See also: X chromosome, sex chromosome (ORNL)

Yeast artificial chromosome (YAC)
Constructed from yeast DNA, it is a vector used to clone large DNA fragments.
See also: cloning vector, cosmid (ORNL)

Z

Zinc-finger protein
A secondary feature of some proteins containing a zinc atom; a DNA-binding protein. (ORNL)