Sequence Alignment



In bioinformatics, a sequence alignment is a way of arranging the primary sequences of DNA, RNA, or protein to identify regions of similarity that may be a consequence of functional, structural, or evolutionary relationships between the sequences. Aligned sequences of nucleotide or amino acid residues are typically represented as rows within a matrix. Gaps are inserted between the residues so that residues with identical or similar characters are aligned in successive columns.

Figure: A sequence alignment, produced by ClustalW, between two human zinc finger proteins identified by GenBank accession number.

If two sequences in an alignment share a common ancestor, mismatches can be interpreted as point mutations and gaps as indels (that is, insertion or deletion mutations) introduced in one or both lineages in the time since they diverged from one another. In protein sequence alignment, the degree of similarity between amino acids occupying a particular position in the sequence can be interpreted as a rough measure of how conserved a particular region or sequence motif is among lineages. The absence of substitutions, or the presence of only very conservative substitutions (that is, the substitution of amino acids whose side chains have similar biochemical properties) in a particular region of the sequence, suggests that this region has structural or functional importance. Although DNA and RNA nucleotide bases are more similar to each other than amino acids are, the conservation of base pairing can indicate a similar functional or structural role. Sequence alignment can also be used for non-biological sequences, such as those present in natural language or in financial data.

Very short or very similar sequences can be aligned by hand. However, most interesting problems require the alignment of lengthy, highly variable, or extremely numerous sequences that cannot be aligned solely by human effort. Instead, human knowledge is applied primarily in constructing algorithms that produce high-quality sequence alignments, and occasionally in adjusting the final results to reflect patterns that are difficult to represent algorithmically (especially in the case of nucleotide sequences). Computational approaches to sequence alignment generally fall into two categories: global alignments and local alignments. Calculating a global alignment is a form of global optimization that "forces" the alignment to span the entire length of all query sequences. By contrast, local alignments identify regions of similarity within long sequences that are often widely divergent overall. Local alignments are often preferable, but can be more difficult to calculate because of the additional challenge of identifying the regions of similarity. A variety of computational algorithms have been applied to the sequence alignment problem, including slow but formally correct methods such as dynamic programming, and efficient heuristic or probabilistic methods designed for large-scale database search that do not guarantee finding the best match.
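The difference between global and local alignment can be illustrated with a minimal dynamic programming sketch. It assumes a simple scoring scheme (match +1, mismatch -1, linear gap penalty -2) chosen only for illustration, and it follows the general pattern of the Needleman-Wunsch (global) and Smith-Waterman (local) recurrences, neither of which is named in the text above. The function names are hypothetical, and only scores are computed, not the alignments themselves.

```python
def global_score(a, b, match=1, mismatch=-1, gap=-2):
    """Best score of an alignment that must span both sequences end to end."""
    # prev[j] holds the best score for aligning a[:i-1] with b[:j].
    prev = [j * gap for j in range(len(b) + 1)]
    for i in range(1, len(a) + 1):
        curr = [i * gap]  # aligning a[:i] against an empty prefix costs i gaps
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            curr.append(max(diag, prev[j] + gap, curr[j - 1] + gap))
        prev = curr
    return prev[-1]


def local_score(a, b, match=1, mismatch=-1, gap=-2):
    """Best score over all pairs of substrings; cells are floored at zero."""
    best = 0
    prev = [0] * (len(b) + 1)
    for i in range(1, len(a) + 1):
        curr = [0]
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            cell = max(0, diag, prev[j] + gap, curr[j - 1] + gap)
            curr.append(cell)
            best = max(best, cell)
        prev = curr
    return best


if __name__ == "__main__":
    # Two sequences that share a core region but have divergent flanks.
    x = "TTTGATTACATTT"
    y = "CCCGATTACACCC"
    print("global:", global_score(x, y))
    print("local: ", local_score(x, y))
```

On sequences with a shared core and unrelated flanks, the local score reflects only the core, while the global score is pulled down by the mismatched ends. Both functions take time proportional to the product of the sequence lengths, which is why heuristic methods are preferred for database-scale searches.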

Representations

Alignments are commonly represented both graphically and in text format. In almost all sequence alignment representations, sequences are written in rows arranged so that aligned residues appear in successive columns. In text formats, aligned columns containing identical or similar characters are indicated with a system of conservation symbols. As in the image above, an asterisk or pipe symbol is used to show identity between the sequences in a column; other less common symbols include a colon for conservative substitutions and a period for semiconservative substitutions. Many sequence visualization programs also use color to display information about the properties of the individual sequence elements; in DNA and RNA sequences, this equates to assigning each nucleotide its own color. In protein alignments, such as the one in the image above, color is often used to indicate amino acid properties to aid in judging the conservation of a given amino acid substitution. For multiple sequences, the last row in each alignment block is often the consensus sequence determined by the alignment; the consensus sequence is also often represented in graphical format with a sequence logo in which the size of each nucleotide or amino acid letter corresponds to its degree of conservation.[1]
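As a rough illustration of the text representation described above, the following sketch computes an identity line and a simple majority consensus for rows of equal length. It is an assumption-laden simplification: only the asterisk symbol is produced (the colon and period symbols would need a table of residue groups, which is not reproduced here), gaps are written as "-", and the function names are hypothetical.

```python
from collections import Counter


def conservation_line(rows):
    """Return '*' under gap-free columns whose characters are all identical."""
    marks = []
    for column in zip(*rows):
        identical = len(set(column)) == 1 and column[0] != "-"
        marks.append("*" if identical else " ")
    return "".join(marks)


def consensus(rows):
    """Most common non-gap character per column (ties go to the first seen)."""
    out = []
    for column in zip(*rows):
        counts = Counter(ch for ch in column if ch != "-")
        out.append(counts.most_common(1)[0][0] if counts else "-")
    return "".join(out)


if __name__ == "__main__":
    rows = ["GATTACA-", "GAT-ACAT", "GCTTACAT"]
    for row in rows:
        print(row)
    print(conservation_line(rows))
    print(consensus(rows))
```

A sequence logo conveys more than this majority consensus, because it also shows how strongly each column is conserved.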

Sequence alignments can be stored in a wide variety of text-based file formats, many of which were originally developed in conjunction with a specific alignment program or implementation. Most web-based tools accept only a limited number of input and output formats, such as FASTA format and GenBank format, and their output is not easily editable. The use of specific tools authored by individual research laboratories, or of expensive commercial tools, can be complicated by limited file format compatibility. General conversion tools such as Chromatogram Explorer and ABI 2 FASTA are available to convert chromatogram formats (ABI/SCF) to FASTA format, and tools such as DNA Baser (Windows and Mac) read many classic input formats (SCF, AB, ABI, FASTA, FST, SEQ, TXT...).
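For readers unfamiliar with FASTA format, a record consists of a header line beginning with ">" followed by one or more lines of sequence. The minimal parser below is a sketch written for illustration, not the reader used by any of the tools named above; the function name and the example records are hypothetical.

```python
def parse_fasta(text):
    """Parse FASTA-formatted text into a dict mapping header -> sequence."""
    records = {}
    header = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            header = line[1:]  # drop the leading '>'
            records[header] = []
        elif header is not None:
            records[header].append(line)  # sequence may wrap over many lines
    return {name: "".join(parts) for name, parts in records.items()}


if __name__ == "__main__":
    example = """>seq1 hypothetical example record
GATTACAGATT
ACA
>seq2
GCATGCA
"""
    for name, seq in parse_fasta(example).items():
        print(name, len(seq), seq)
```

Aligned FASTA files use the same layout, with gap characters embedded in the sequence lines.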

Assessment of significance

Sequence alignments are useful in bioinformatics for identifying sequence similarity, producing phylogenetic trees, and developing homology models of protein structures. However, the biological relevance of sequence alignments is not always clear. Alignments are often assumed to reflect a degree of evolutionary change between sequences descended from a common ancestor; however, it is formally possible for convergent evolution to produce apparent similarity between proteins that are evolutionarily unrelated but perform similar functions and have similar structures.

In database searches such as BLAST, statistical methods can determine the likelihood of a particular alignment between sequences or sequence regions arising by chance given the size and composition of the database being searched. These values can vary significantly depending on the search space. In particular, the likelihood of finding a given alignment by chance increases if the database consists only of sequences from the same organism as the query sequence. Repetitive sequences in the database or query can also distort both the search results and the assessment of statistical significance; BLAST automatically filters such repetitive sequences in the query to avoid apparent hits that are statistical artifacts.
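One widely used model for this kind of chance estimate is the Karlin-Altschul statistic, in which the expected number of chance alignments scoring at least S is E = K * m * n * exp(-lambda * S), where m is the query length and n the total database length. The sketch below assumes that model; the values given for lambda and K are illustrative placeholders that in practice depend on the scoring system, and real search programs apply further corrections that are not shown.

```python
import math


def expected_chance_hits(raw_score, query_len, db_len, lam, k):
    """Karlin-Altschul estimate of chance alignments with score >= raw_score."""
    return k * query_len * db_len * math.exp(-lam * raw_score)


if __name__ == "__main__":
    # Placeholder statistical parameters; take real ones from the scoring system.
    lam, k = 0.318, 0.134
    for db_len in (1_000_000, 100_000_000):
        e = expected_chance_hits(raw_score=60, query_len=250,
                                 db_len=db_len, lam=lam, k=k)
        print(f"database length {db_len:>11,}: E = {e:.3g}")
```

Because E scales linearly with the size of the search space, the same raw score is less significant in a larger database, which is one reason the values vary with what is being searched.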

Scoring functions

The choice of a scoring function that reflects biological or statistical observations about known sequences is important for producing good alignments. Protein sequences are frequently aligned using substitution matrices that reflect the probabilities of given character-to-character substitutions. A series of matrices called PAM matrices (Point Accepted Mutation matrices, originally defined by Margaret Dayhoff and sometimes referred to as "Dayhoff matrices") explicitly encode evolutionary approximations regarding the rates and probabilities of particular amino acid mutations. Another common series of scoring matrices, known as BLOSUM (Blocks Substitution Matrix), encodes empirically derived substitution probabilities. Variants of both types of matrices are used to detect sequences with differing levels of divergence, thus allowing users of BLAST or FASTA to restrict searches to more closely related matches or expand them to detect more divergent sequences. Gap penalties account for the introduction of a gap (on the evolutionary model, an insertion or deletion mutation) in both nucleotide and protein sequences, and therefore the penalty values should be proportional to the expected rate of such mutations. The quality of the alignments produced therefore depends on the quality of the scoring function.
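To make the idea of a scoring function concrete, the sketch below scores an existing pairwise alignment using a substitution table plus an affine gap penalty, in which a gap of length L costs gap_open + (L - 1) * gap_extend. The affine convention, the penalty values, and the three-letter toy table are assumptions made for illustration; a real analysis would load a full PAM or BLOSUM matrix.

```python
# Toy symmetric substitution table over three residues (illustrative values only).
TOY_MATRIX = {
    ("A", "A"): 4, ("L", "L"): 4, ("I", "I"): 4,
    ("L", "I"): 2, ("A", "L"): -1, ("A", "I"): -1,
}


def pair_score(subst, x, y):
    """Look up a symmetric substitution score stored under either key order."""
    if (x, y) in subst:
        return subst[(x, y)]
    return subst[(y, x)]  # raises KeyError if the pair is not in the table


def score_alignment(row_a, row_b, subst, gap_open=-10, gap_extend=-1):
    """Sum substitution scores and affine gap penalties over aligned columns."""
    if len(row_a) != len(row_b):
        raise ValueError("aligned rows must have equal length")
    total = 0
    in_gap = None  # which row ('a' or 'b') has a gap currently open, if any
    for x, y in zip(row_a, row_b):
        if x == "-" and y == "-":
            continue  # a column of two gaps carries no information
        if x == "-" or y == "-":
            side = "a" if x == "-" else "b"
            total += gap_extend if in_gap == side else gap_open
            in_gap = side
        else:
            total += pair_score(subst, x, y)
            in_gap = None
    return total


if __name__ == "__main__":
    print(score_alignment("ALI-A", "AIILA", TOY_MATRIX))
    print(score_alignment("AL--A", "AIILA", TOY_MATRIX))
```

Under these values, the conservative L/I substitution still earns a positive score, while each newly opened gap is penalized much more heavily than its extension.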

It can be very useful and instructive to try the same alignment several times with different choices for scoring matrix and/or gap penalty values and compare the results. Regions where the solution is weak or non-unique can often be identified by observing which regions of the alignment are robust to variations in alignment parameters.
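The comparison suggested above can be sketched as follows: align the same pair of sequences under two gap penalties, then report which residue pairs are aligned under every setting. The aligner is a minimal global dynamic programming routine with a linear gap penalty, written for illustration only; the scoring values, function names, and example sequences are assumptions, and ties in the traceback are broken arbitrarily in favor of the diagonal.

```python
def align(a, b, match=1, mismatch=-1, gap=-2):
    """Global alignment with a linear gap penalty; returns (row_a, row_b, score)."""
    n, m = len(a), len(b)
    score = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        score[i][0] = i * gap
    for j in range(1, m + 1):
        score[0][j] = j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = match if a[i - 1] == b[j - 1] else mismatch
            score[i][j] = max(score[i - 1][j - 1] + sub,
                              score[i - 1][j] + gap,
                              score[i][j - 1] + gap)
    # Trace back from the bottom-right corner to recover one optimal alignment.
    row_a, row_b = [], []
    i, j = n, m
    while i > 0 or j > 0:
        sub = match if i > 0 and j > 0 and a[i - 1] == b[j - 1] else mismatch
        if i > 0 and j > 0 and score[i][j] == score[i - 1][j - 1] + sub:
            row_a.append(a[i - 1]); row_b.append(b[j - 1]); i -= 1; j -= 1
        elif i > 0 and score[i][j] == score[i - 1][j] + gap:
            row_a.append(a[i - 1]); row_b.append("-"); i -= 1
        else:
            row_a.append("-"); row_b.append(b[j - 1]); j -= 1
    return "".join(reversed(row_a)), "".join(reversed(row_b)), score[n][m]


def aligned_pairs(row_a, row_b):
    """Set of (index in a, index in b) for residues placed in the same column."""
    pairs, i, j = set(), 0, 0
    for x, y in zip(row_a, row_b):
        if x != "-" and y != "-":
            pairs.add((i, j))
        i += x != "-"
        j += y != "-"
    return pairs


if __name__ == "__main__":
    a, b = "GATTACA", "GCATGCA"
    runs = {gap: align(a, b, gap=gap) for gap in (-1, -3)}
    for gap, (ra, rb, s) in runs.items():
        print(f"gap={gap} score={s}\n  {ra}\n  {rb}")
    robust = set.intersection(*(aligned_pairs(ra, rb) for ra, rb, _ in runs.values()))
    print("pairs aligned under every setting:", sorted(robust))
```

Residue pairs that drop out of the intersection mark regions where the alignment depends on the parameter choice and should be interpreted with more caution.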

Non-biological uses

The methods used for biological sequence alignment have also found applications in other fields, most notably in natural language processing. Techniques that generate the set of elements from which words will be selected in natural-language generation algorithms have borrowed multiple sequence alignment techniques from bioinformatics to produce linguistic versions of computer-generated mathematical proofs.[19] In the field of historical and comparative linguistics, sequence alignment has been used to partially automate the comparative method by which linguists traditionally reconstruct languages.[20] Business and marketing research has also applied multiple sequence alignment techniques in analyzing series of purchases over time.[21]

Software

Common software tools used for general sequence alignment tasks include DNA Baser, RNA Baser, ClustalW, and T-Coffee for alignment, and BLAST for database searching. A more complete list of available software, categorized by algorithm and alignment type, is given in the list of sequence alignment software.

Alignment algorithms and software can be directly compared to one another using a standardized set of benchmark reference multiple sequence alignments known as BAliBASE.[22] The data set consists of structural alignments, which can be considered a standard against which purely sequence-based methods are compared. The relative performance of many common alignment methods on frequently encountered alignment problems has been tabulated and selected results published online at BAliBASE.[23] A comprehensive list of BAliBASE scores for many (currently 12) different alignment tools can be computed within the protein workbench STRAP.
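Comparison against a reference alignment is commonly summarized with a sum-of-pairs style score: the fraction of residue pairs aligned in the reference that are also aligned in the test alignment. The sketch below is a simplified version of that idea, assuming rows of equal length with "-" for gaps; it is not the official BAliBASE scoring program, and the function names and example alignments are hypothetical.

```python
from itertools import combinations


def residue_pairs(rows):
    """Set of ((seq, pos), (seq, pos)) residue pairs that share a column."""
    counters = [0] * len(rows)  # next residue index for each sequence
    pairs = set()
    for column in zip(*rows):
        present = []
        for s, ch in enumerate(column):
            if ch != "-":
                present.append((s, counters[s]))
                counters[s] += 1
        pairs.update(combinations(present, 2))
    return pairs


def sum_of_pairs_score(test_rows, reference_rows):
    """Fraction of reference residue pairs reproduced by the test alignment."""
    reference = residue_pairs(reference_rows)
    if not reference:
        raise ValueError("reference alignment contains no aligned residue pairs")
    return len(residue_pairs(test_rows) & reference) / len(reference)


if __name__ == "__main__":
    reference = ["GATTACA", "GA-TACA"]
    test = ["GATTACA", "GAT-ACA"]
    print(f"SP score: {sum_of_pairs_score(test, reference):.3f}")
```

A score of 1.0 means every residue pair aligned in the reference is reproduced; the test and reference alignments must contain the same sequences in the same order for the comparison to be meaningful.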

References

  1. Schneider TD, Stephens RM (1990). "Sequence logos: a new way to display consensus sequences". Nucleic Acids Res 18: 6097–6100. doi:10.1093/nar/18.20.6097. PMID 2172928.
  2. Brudno M, Malde S, Poliakov A, Do CB, Couronne O, Dubchak I, Batzoglou S (2003). "Glocal alignment: finding rearrangements during alignment". Bioinformatics 19 Suppl 1: i54–62. doi:10.1093/bioinformatics/btg1005. PMID 12855437.
  3. Mount DM (2004). Bioinformatics: Sequence and Genome Analysis, 2nd ed. Cold Spring Harbor Laboratory Press: Cold Spring Harbor, NY. ISBN 0-87969-608-7.
  4. Wang L, Jiang T (1994). "On the complexity of multiple sequence alignment". J Comput Biol 1: 337–48. PMID 8790475.
  5. Lipman DJ, Altschul SF, Kececioglu JD (1989). "A tool for multiple sequence alignment". Proc Natl Acad Sci U S A 86: 4412–5. doi:10.1073/pnas.86.12.4412. PMID 2734293.
  6. Higgins DG, Sharp PM (1988). "CLUSTAL: a package for performing multiple sequence alignment on a microcomputer". Gene 73 (1): 237–44. doi:10.1016/0378-1119(88)90330-7. PMID 3243435.
  7. Thompson JD, Higgins DG, Gibson TJ (1994). "CLUSTAL W: improving the sensitivity of progressive multiple sequence alignment through sequence weighting, position-specific gap penalties and weight matrix choice". Nucleic Acids Res 22: 4673–80. doi:10.1093/nar/22.22.4673. PMID 7984417.
  8. Chenna R, Sugawara H, Koike T, Lopez R, Gibson TJ, Higgins DG, Thompson JD (2003). "Multiple sequence alignment with the Clustal series of programs". Nucleic Acids Res 31: 3497–500. doi:10.1093/nar/gkg500. PMID 12824352.
  9. Notredame C, Higgins DG, Heringa J (2000). "T-Coffee: A novel method for fast and accurate multiple sequence alignment". J Mol Biol 302 (1): 205–17. doi:10.1006/jmbi.2000.4042. PMID 10964570.
  10. Hirosawa M, Totoki Y, Hoshida M, Ishikawa M (1995). "Comprehensive study on iterative algorithms of multiple sequence alignment". Comput Appl Biosci 11: 13–8. doi:10.1093/bioinformatics/11.1.13. PMID 7796270.
  11. Karplus K, Barrett C, Hughey R (1998). "Hidden Markov models for detecting remote protein homologies". Bioinformatics 14 (10): 846–856. doi:10.1093/bioinformatics/14.10.846. PMID 9927713.
  12. Chothia C, Lesk AM (1986). "The relation between the divergence of sequence and structure in proteins". EMBO J 5 (4): 823–6. PMID 3709526.
  13. Zhang Y, Skolnick J (2005). "The protein structure prediction problem could be solved using the current PDB library". Proc Natl Acad Sci U S A 102: 1029–34. doi:10.1073/pnas.0407152101. PMID 15653774.
  14. Holm L, Sander C (1996). "Mapping the protein universe". Science 273: 595–603. doi:10.1126/science.273.5275.595. PMID 8662544.
  15. Taylor WR, Flores TP, Orengo CA (1994). "Multiple protein structure alignment". Protein Sci 3: 1858–70. PMID 7849601.
  16. Orengo CA, Michie AD, Jones S, Jones DT, Swindells MB, Thornton JM (1997). "CATH--a hierarchic classification of protein domain structures". Structure 5: 1093–108. doi:10.1016/S0969-2126(97)00260-8. PMID 9309224.
  17. Shindyalov IN, Bourne PE (1998). "Protein structure alignment by incremental combinatorial extension (CE) of the optimal path". Protein Eng 11: 739–47. doi:10.1093/protein/11.9.739. PMID 9796821.
  18. Felsenstein J (2004). Inferring Phylogenies. Sinauer Associates: Sunderland, MA. ISBN 0-87893-177-5.
  19. Barzilay R, Lee L (2002). "Bootstrapping Lexical Choice via Multiple-Sequence Alignment". Proceedings of the Conference on Empirical Methods in Natural Language Processing (EMNLP) 10: 164–171. doi:10.3115/1118693.1118715.
  20. Kondrak G (2002). "Algorithms for Language Reconstruction". University of Toronto, Ontario. Retrieved on 2007-01-21.
  21. Prinzie A, Van den Poel D (2006). "Incorporating sequential information into traditional classification models by using an element/position-sensitive SAM". Decision Support Systems 42 (2): 508–526. doi:10.1016/j.dss.2005.02.004. See also Prinzie A, Van den Poel D (2007). "Predicting home-appliance acquisition sequences: Markov/Markov for Discrimination and survival analysis for modeling sequential information in NPTB models". Decision Support Systems 44 (1): 28–45. doi:10.1016/j.dss.2007.02.008.
  22. Thompson JD, Plewniak F, Poch O (1999). "BAliBASE: a benchmark alignment database for the evaluation of multiple alignment programs". Bioinformatics 15: 87–8. doi:10.1093/bioinformatics/15.1.87. PMID 10068696.
  23. Thompson JD, Plewniak F, Poch O (1999). "A comprehensive comparison of multiple sequence alignment programs". Nucleic Acids Res 27: 2682–90. doi:10.1093/nar/27.13.2682. PMID 10373585.
This article is licensed under the GNU Free Documentation License. It uses material from Wikipedia Encyclopedia article "Sequence Alignment"

Last updated: July 2008
Copyright © 2003-2008 Julian Rubin