Sequence Alignment: Experiments, Studies and Background Information

Sequence Alignment Experiments, Studies and Background Information For Science Labs, Lesson Plans & Science Fair Projects
For High School and College Students & Teachers

Experiments and Studies

An Improved Sequence-to-sequence Alignment Method Combined with Feature-based Image Registration Algorithm [View Experiment]
Computational Experiments with Multicriteria Sequence Alignment [View Experiment]
Locally Sensitive Backtranslation based on Multiple Sequence Alignment [View Experiment]
MicroRNA identification based on sequence and structure alignment [View Experiment]
Experiments with Algorithms for DNA Sequence Alignment [View Experiment]
Distributed Sequence Alignment Applications for the Public Computing Architecture [View Experiment]
Gene Sequence Alignment on a Public Computing Platform [View Experiment]
Multiple Sequence Alignment Using SAGA: Investigating the Effects of Operator Scheduling, Population Seeding, and Crossover Operators [View Experiment]
Dissertation: Protein Sequence Alignment Reliability: Prediction And Measurement [View Experiment]
Thesis: Multiple Sequence Alignment and Phylogenetic Reconstruction: Theory and Methods in Biological Data Analysis [View Experiment]
Thesis: Integrated Multiple Sequence Alignment [View Experiment]

Background Information

Definition

A sequence alignment is a way of arranging the sequences of DNA, RNA, or protein to identify regions of similarity that may be a consequence of functional, structural, or evolutionary relationships between the sequences. Aligned sequences of nucleotide or amino acid residues are typically represented as rows within a matrix.

Sequence alignment produced with the freely available program ClustalW between two zinc finger protein sequences publicly available in GenBank

Introduction

See also the following links:

Global and local alignments

Pairwise alignment

Multiple sequence alignment

Structural alignment

Computational phylogenetics

Sequence alignment software

If two sequences in an alignment share a common ancestor, mismatches can be interpreted as point mutations and gaps as indels (that is, insertion or deletion mutations) introduced in one or both lineages in the time since they diverged from one another. In protein sequence alignment, the degree of similarity between amino acids occupying a particular position in the sequence can be interpreted as a rough measure of how conserved a particular region or sequence motif is among lineages. The absence of substitutions, or the presence of only very conservative substitutions (that is, the substitution of amino acids whose side chains have similar biochemical properties) in a particular region of the sequence, suggest that this region has structural or functional importance. Although DNA and RNA nucleotide bases are more similar to each other than to amino acids, the conservation of base pairing can indicate a similar functional or structural role. Sequence alignment can be used for non-biological sequences, such as those present in natural language or in financial data.

Very short or very similar sequences can be aligned by hand; however, most interesting problems require the alignment of lengthy, highly variable or extremely numerous sequences that cannot be aligned solely by human effort. Instead, human knowledge is primarily applied in constructing algorithms to produce high-quality sequence alignments, and occasionally in adjusting the final results to reflect patterns that are difficult to represent algorithmically (especially in the case of nucleotide sequences). Computational approaches to sequence alignment generally fall into two categories: global alignments and local alignments. Calculating a global alignment is a form of global optimization that "forces" the alignment to span the entire length of all query sequences. By contrast, local alignments identify regions of similarity within long sequences that are often widely divergent overall. Local alignments are often preferable, but can be more difficult to calculate because of the additional challenge of identifying the regions of similarity. A variety of computational algorithms have been applied to the sequence alignment problem, including slow but formally optimizing methods like dynamic programming and efficient heuristic or probabilistic methods designed for large-scale database search.

Representations

Alignments are commonly represented both graphically and in text format. In almost all sequence alignment representations, sequences are written in rows arranged so that aligned residues appear in successive columns. In text formats, aligned columns containing identical or similar characters are indicated with a system of conservation symbols. As in the image above, an asterisk or pipe symbol is used to show identity between two columns; other less common symbols include a colon for conservative substitutions and a period for semiconservative substitutions. Many sequence visualization programs also use color to display information about the properties of the individual sequence elements; in DNA and RNA sequences, this equates to assigning each nucleotide its own color. In protein alignments, such as the one in the image above, color is often used to indicate amino acid properties to aid in judging the conservation of a given amino acid substitution. For multiple sequences the last row in each column is often the consensus sequence determined by the alignment; the consensus sequence is also often represented in graphical format with a sequence logo in which the size of each nucleotide or amino acid letter corresponds to its degree of conservation.

Sequence alignments can be stored in a wide variety of text-based file formats, many of which were originally developed in conjunction with a specific alignment program or implementation. Most web-based tools allow a limited number of input and output formats, such as FASTA format and GenBank format and the output is not easily editable. In the past the use of specific tools authored by individual research laboratories or exorbitantly expensive commercial tools can be complicated by limited file format compatibility. General conversion tools like Chromatogram Explorer and ABI 2 FASTA are available to help with conversion from chromatogram formats (ABI/SCF) to FASTA format. However, modern tools like DNA Baser (Windows and Mac) can handle any classic input format (like SCF, AB, ABI, FASTA, FST, SEQ, TXT...) without problems.

Assessment of significance

Sequence alignments are useful in bioinformatics for identifying sequence similarity, producing phylogenetic trees, and developing homology models of protein structures. However, the biological relevance of sequence alignments is not always clear. Alignments are often assumed to reflect a degree of evolutionary change between sequences descended from a common ancestor; however, it is formally possible that convergent evolution can occur to produce apparent similarity between proteins that are evolutionarily unrelated but perform similar functions and have similar structures.

In database searches such as BLAST, statistical methods can determine the likelihood of a particular alignment between sequences or sequence regions arising by chance given the size and composition of the database being searched. These values can vary significantly depending on the search space. In particular, the likelihood of finding a given alignment by chance increases if the database consists only of sequences from the same organism as the query sequence. Repetitive sequences in the database or query can also distort both the search results and the assessment of statistical significance; BLAST automatically filters such repetitive sequences in the query to avoid apparent hits that are statistical artifacts.

Scoring functions

The choice of a scoring function that reflects biological or statistical observations about known sequences is important to producing good alignments. Protein sequences are frequently aligned using substitution matrices that reflect the probabilities of given character-to-character substitutions. A series of matrices called PAM matrices (Point Accepted Mutation matrices, originally defined by Margaret Dayhoff and sometimes referred to as "Dayhoff matrices") explicitly encode evolutionary approximations regarding the rates and probabilities of particular amino acid mutations. Another common series of scoring matrices, known as BLOSUM (Blocks Substitution Matrix), encodes empirically derived substitution probabilities. Variants of both types of matrices are used to detect sequences with differing levels of divergence, thus allowing users of BLAST or FASTA to restrict searches to more closely related matches or expand to detect more divergent sequences. Gap penalties account for the introduction of a gap - on the evolutionary model, an insertion or deletion mutation - in both nucleotide and protein sequences, and therefore the penalty values should be proportional to the expected rate of such mutations. The quality of the alignments produced therefore depends on the quality of the scoring function.

It can be very useful and instructive to try the same alignment several times with different choices for scoring matrix and/or gap penalty values and compare the results. Regions where the solution is weak or non-unique can often be identified by observing which regions of the alignment are robust to variations in alignment parameters.

Non-biological uses

The methods used for biological sequence alignment have also found applications in other fields, most notably in natural language processing. Techniques that generate the set of elements from which words will be selected in natural-language generation algorithms have borrowed multiple sequence alignment techniques from bioinformatics to produce linguistic versions of computer-generated mathematical proofs. In the field of historical and comparative linguistics, sequence alignment has been used to partially automate the comparative method by which linguists traditionally reconstruct languages. Business and marketing research has also applied multiple sequence alignment techniques in analyzing series of purchases over time.

Software

Common software tools used for general sequence alignment tasks include DNA Baser, RNA Baser, ClustalW and T-coffee for alignment, and BLAST for database searching. A more complete list of available software categorized by algorithm and alignment type is available at sequence alignment software.

Alignment algorithms and software can be directly compared to one another using a standardized set of benchmark reference multiple sequence alignments known as BAliBASE. The data set consists of structural alignments, which can be considered a standard against which purely sequence-based methods are compared. The relative performance of many common alignment methods on frequently encountered alignment problems has been tabulated and selected results published online at BAliBASE. A comprehensive list of BAliBASE scores for many (currently 12) different alignment tools can be computed within the protein workbench STRAP.

For more information: https://en.wikipedia.org/wiki/Sequence_alignment

Source: Wikipedia (All text is available under the terms of the GNU Free Documentation License and Creative Commons Attribution-ShareAlike License.)

Useful Links

Bioinformatics Science Fair Projects and Experiments

General Science Fair Projects Resources

Biology / Biochemistry Science Fair Projects Books