Sequence Alignment
In bioinformatics, a sequence alignment is a way of arranging the primary sequences of DNA, RNA, or protein to identify regions of similarity that may be a consequence of functional, structural, or evolutionary relationships between the sequences. Aligned sequences of nucleotide or amino acid residues are typically represented as rows within a matrix.
Gaps are inserted between the residues so that residues with identical
or similar characters are aligned in successive columns.
Figure: A sequence alignment, produced by ClustalW, of two human zinc finger proteins identified by GenBank accession number.
If two sequences in an alignment share a common ancestor, mismatches can be interpreted as point mutations and gaps as indels
(that is, insertion or deletion mutations) introduced in one or both
lineages in the time since they diverged from one another. In protein
sequence alignment, the degree of similarity between amino acids occupying a particular position in the sequence can be interpreted as a rough measure of how conserved a particular region or sequence motif
is among lineages. The absence of substitutions, or the presence of
only very conservative substitutions (that is, the substitution of
amino acids whose side chains have similar biochemical properties)
in a particular region of the
sequence, suggests that this region has structural or functional
importance. Although DNA and RNA nucleotide bases are more similar to each other than are amino acids, the conservation of base pairing
can indicate a similar functional or structural role. Sequence
alignment can be used for non-biological sequences, such as those
present in natural language or in financial data.
Very short or very similar sequences can be aligned by hand;
however, most interesting problems require the alignment of lengthy,
highly variable or extremely numerous sequences that cannot be aligned
solely by human effort. Instead, human knowledge is primarily applied
in constructing algorithms to produce high-quality sequence alignments,
and occasionally in adjusting the final results to reflect patterns
that are difficult to represent algorithmically (especially in the case
of nucleotide sequences). Computational approaches to sequence
alignment generally fall into two categories: global alignments and local alignments. Calculating a global alignment is a form of global optimization
that "forces" the alignment to span the entire length of all query
sequences. By contrast, local alignments identify regions of similarity
within long sequences that are often widely divergent overall. Local
alignments are often preferable, but can be more difficult to calculate
because of the additional challenge of identifying the regions of
similarity. A variety of computational algorithms have been applied to
the sequence alignment problem, including slow but formally optimizing
methods like dynamic programming and efficient heuristic or probabilistic methods designed for large-scale database search.
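The difference between the two categories can be seen in a small dynamic programming sketch. The function below is an illustration only: it computes a global score with the Needleman-Wunsch recurrence and a local score with the Smith-Waterman recurrence, using a simple match/mismatch/linear-gap scheme whose values are assumptions chosen for readability rather than biologically motivated parameters.

```python
def alignment_scores(a, b, match=1, mismatch=-1, gap=-2):
    """Return (global_score, local_score) for sequences a and b.

    The scoring values are illustrative assumptions, not recommended settings.
    """
    # Global: the first row and column pay for leading gaps,
    # so the alignment is forced to span both sequences end to end.
    prev_g = [j * gap for j in range(len(b) + 1)]
    # Local: every cell is floored at zero, so an alignment may start anywhere.
    prev_l = [0] * (len(b) + 1)
    best_local = 0
    for i in range(1, len(a) + 1):
        cur_g = [i * gap] + [0] * len(b)
        cur_l = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            s = match if a[i - 1] == b[j - 1] else mismatch
            cur_g[j] = max(prev_g[j - 1] + s, prev_g[j] + gap, cur_g[j - 1] + gap)
            cur_l[j] = max(0, prev_l[j - 1] + s, prev_l[j] + gap, cur_l[j - 1] + gap)
            # The local score is the best cell anywhere in the table.
            best_local = max(best_local, cur_l[j])
        prev_g, prev_l = cur_g, cur_l
    # The global score is the bottom-right cell of the table.
    return prev_g[-1], best_local


if __name__ == "__main__":
    # Two divergent sequences sharing only the short region "ACG".
    print(alignment_scores("TTTTACGTTTTT", "GGGACGGGG"))
```

For sequences that share only a short region, the global score is dragged down by the forced end-to-end alignment, while the local score reflects the shared region alone.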
Representations
Alignments are commonly represented both graphically and in text
format. In almost all sequence alignment representations, sequences are
written in rows arranged so that aligned residues appear in successive
columns. In text formats, aligned columns containing identical or
similar characters are indicated with a system of conservation symbols.
As in the image above, an asterisk or pipe symbol is used to mark
a column in which the residues are identical; other less common symbols include a colon
for conservative substitutions and a period for semiconservative
substitutions. Many sequence visualization programs also use color to
display information about the properties of the individual sequence
elements; in DNA and RNA sequences, this equates to assigning each
nucleotide its own color. In protein alignments, such as the one in the
image above, color is often used to indicate amino acid properties to
aid in judging the conservation of a given amino acid substitution. For multiple sequences the last row in each column is often the consensus sequence determined by the alignment; the consensus sequence is also often represented in graphical format with a sequence logo in which the size of each nucleotide or amino acid letter corresponds to its degree of conservation.[1]
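As a rough illustration of the text representation described above, the following sketch builds a pipe-symbol identity line for two aligned rows and a simple majority consensus for several rows. The function names and the use of "-" as the gap character are assumptions made for this example, and the majority rule is only one of several ways a consensus can be defined.

```python
from collections import Counter


def match_line(row_a, row_b, gap="-"):
    """Mark columns holding identical residues with '|', all others with a space."""
    return "".join(
        "|" if x == y and x != gap else " " for x, y in zip(row_a, row_b)
    )


def consensus(rows, gap="-"):
    """Most common non-gap character per column (a gap if the column is all gaps).

    Ties are broken by whichever residue appears first in the column.
    """
    out = []
    for column in zip(*rows):
        counts = Counter(c for c in column if c != gap)
        out.append(counts.most_common(1)[0][0] if counts else gap)
    return "".join(out)


if __name__ == "__main__":
    a, b = "ACG-T", "ACGAT"
    print(a)
    print(match_line(a, b))
    print(b)
    print(consensus(["ACGT", "ACGA", "TCGA"]))
```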
Sequence alignments can be stored in a wide variety of text-based
file formats, many of which were originally developed in conjunction
with a specific alignment program or implementation. Most web-based
tools allow a limited number of input and output formats, such as FASTA format and GenBank
format, and the output is not easily editable. The use of
specific tools authored by individual research laboratories, or of
expensive commercial tools, can be complicated by limited
file format compatibility. General conversion tools like Chromatogram Explorer and ABI 2 FASTA are available to help with conversion from chromatogram formats (ABI/SCF) to FASTA format. Other tools, such as DNA Baser (Windows and Mac), accept many common input formats (such as SCF, AB, ABI, FASTA, FST, SEQ and TXT).
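Because FASTA is plain text, a basic reader is short. The sketch below assumes the common convention that each record starts with a ">" header line followed by one or more sequence lines; it is a minimal illustration and ignores the many variations found in real files. The function name parse_fasta is hypothetical.

```python
def parse_fasta(text):
    """Return a list of (header, sequence) tuples from FASTA-formatted text."""
    records = []
    header = None
    chunks = []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            # A new header closes the previous record, if there was one.
            if header is not None:
                records.append((header, "".join(chunks)))
            header = line[1:].strip()
            chunks = []
        elif header is None:
            raise ValueError("sequence data found before the first '>' header")
        else:
            chunks.append(line)
    if header is not None:
        records.append((header, "".join(chunks)))
    return records


if __name__ == "__main__":
    example = ">seq1 first example\nACGT\nACGT\n>seq2\nTTGA\n"
    for name, seq in parse_fasta(example):
        print(name, len(seq))
```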
Assessment of significance
Sequence alignments are useful in bioinformatics for identifying
sequence similarity, producing phylogenetic trees, and developing
homology models of protein structures. However, the biological
relevance of sequence alignments is not always clear. Alignments are
often assumed to reflect a degree of evolutionary change between
sequences descended from a common ancestor; however, it is formally
possible that convergent evolution
can occur to produce apparent similarity between proteins that are
evolutionarily unrelated but perform similar functions and have similar
structures.
In database searches such as BLAST, statistical methods can
determine the likelihood of a particular alignment between sequences or
sequence regions arising by chance given the size and composition of
the database being searched. These values can vary significantly
depending on the search space. In particular, the likelihood of finding
a given alignment by chance increases if the database consists only of
sequences from the same organism as the query sequence. Repetitive
sequences in the database or query can also distort both the search
results and the assessment of statistical significance; BLAST
automatically filters such repetitive sequences in the query to avoid
apparent hits that are statistical artifacts.
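The dependence on search space can be made concrete with the Karlin-Altschul form commonly used for BLAST-style statistics, in which the expected number of chance alignments with at least a given score grows in proportion to the product of the query length and the database length. The sketch below only evaluates that formula; the parameters lam and k depend on the scoring system and sequence composition, and the values used in the example are placeholders, not figures taken from any real search.

```python
import math


def expected_chance_hits(score, query_len, db_len, lam, k):
    """E = k * m * n * exp(-lam * S): expected chance hits scoring at least S.

    m is the query length and n the total database length; lam and k are
    statistical parameters that must be supplied for the scoring system in use.
    """
    return k * query_len * db_len * math.exp(-lam * score)


if __name__ == "__main__":
    # Placeholder parameter values, for illustration only.
    for db_len in (1_000_000, 100_000_000):
        e = expected_chance_hits(score=60, query_len=250, db_len=db_len,
                                 lam=0.3, k=0.1)
        print(db_len, e)
```

The same raw score is therefore less significant in a larger database, which is one reason reported significance values are not comparable across searches of different size.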
Scoring functions
The choice of a scoring function that reflects biological or
statistical observations about known sequences is important to
producing good alignments. Protein sequences are frequently aligned
using substitution matrices that reflect the probabilities of given character-to-character substitutions. A series of matrices called PAM matrices (Point Accepted Mutation matrices, originally defined by Margaret Dayhoff
and sometimes referred to as "Dayhoff matrices") explicitly encode
evolutionary approximations regarding the rates and probabilities of
particular amino acid mutations. Another common series of scoring
matrices, known as BLOSUM
(Blocks Substitution Matrix), encodes empirically derived substitution
probabilities. Variants of both types of matrices are used to detect
sequences with differing levels of divergence, thus allowing users of
BLAST or FASTA to restrict searches to more closely related matches or
expand to detect more divergent sequences. Gap penalties
account for the introduction of a gap (on the evolutionary model, an
insertion or deletion mutation) in both nucleotide and protein
sequences, and therefore the penalty values should be proportional to
the expected rate of such mutations. The quality of the alignments
produced therefore depends on the quality of the scoring function.
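A scoring function of this kind can be sketched as a sum over alignment columns: a substitution score for each pair of residues and a penalty for each gap. The example below is a simplified illustration. The TOY_MATRIX values are invented for the example and are not PAM or BLOSUM entries, and the affine gap scheme (a larger penalty to open a gap, a smaller one to extend it) is an assumption, since gap models vary between programs.

```python
# Hypothetical toy substitution scores; NOT values from PAM or BLOSUM.
TOY_MATRIX = {
    ("A", "A"): 4, ("L", "L"): 4, ("I", "I"): 4,
    ("L", "I"): 2,            # conservative substitution scores above zero
    ("A", "L"): -1, ("A", "I"): -1,
}


def score_alignment(row_a, row_b, subst, gap_open=-5, gap_extend=-1, gap="-"):
    """Sum substitution scores and affine gap penalties over aligned columns."""
    if len(row_a) != len(row_b):
        raise ValueError("aligned rows must have equal length")
    total = 0
    prev_gap_side = None  # which row held a gap in the previous column
    for x, y in zip(row_a, row_b):
        if x == gap or y == gap:
            side = "a" if x == gap else "b"
            # Extending a gap in the same row is cheaper than opening a new one.
            total += gap_extend if side == prev_gap_side else gap_open
            prev_gap_side = side
        else:
            # The matrix is symmetric, so try both key orders.
            total += subst[(x, y)] if (x, y) in subst else subst[(y, x)]
            prev_gap_side = None
    return total


if __name__ == "__main__":
    print(score_alignment("ALI", "AIL", TOY_MATRIX))
    print(score_alignment("A--I", "ALLI", TOY_MATRIX))
```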
It can be very useful and instructive to try the same alignment
several times with different choices for scoring matrix and/or gap
penalty values and compare the results. Regions where the solution is
weak or non-unique can often be identified by observing which regions
of the alignment change when the alignment parameters are varied.
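One simple way to carry out such a comparison is to align the same pair of sequences under several gap penalties and keep only the residue pairings that every run agrees on. The sketch below does this with a basic global aligner; the scoring values, the set of gap penalties, and the function names are all assumptions made for illustration, and only one optimal alignment is traced back per run.

```python
def global_align_pairs(a, b, match=1, mismatch=-1, gap=-2):
    """Global alignment; returns the set of (i, j) index pairs aligned together."""
    n, m = len(a), len(b)
    score = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        score[i][0] = i * gap
    for j in range(1, m + 1):
        score[0][j] = j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            s = match if a[i - 1] == b[j - 1] else mismatch
            score[i][j] = max(score[i - 1][j - 1] + s,
                              score[i - 1][j] + gap,
                              score[i][j - 1] + gap)
    # Trace back one optimal path, preferring the diagonal on ties.
    pairs = set()
    i, j = n, m
    while i > 0 and j > 0:
        s = match if a[i - 1] == b[j - 1] else mismatch
        if score[i][j] == score[i - 1][j - 1] + s:
            pairs.add((i - 1, j - 1))
            i, j = i - 1, j - 1
        elif score[i][j] == score[i - 1][j] + gap:
            i -= 1
        else:
            j -= 1
    return pairs


def robust_pairs(a, b, gap_penalties=(-1, -2, -4)):
    """Residue pairings that appear under every gap penalty tried."""
    runs = [global_align_pairs(a, b, gap=g) for g in gap_penalties]
    return set.intersection(*runs)


if __name__ == "__main__":
    print(sorted(robust_pairs("GATTACA", "GCATGCU")))
```

Pairings missing from the returned set are the ones that shifted between runs and may deserve closer inspection.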
Non-biological uses
The methods used for biological sequence alignment have also found applications in other fields, most notably in natural language processing.
Techniques that generate the set of elements from which words will be
selected in natural-language generation algorithms have borrowed
multiple sequence alignment techniques from bioinformatics to produce
linguistic versions of computer-generated mathematical proofs.[19] In the field of historical and comparative linguistics, sequence alignment has been used to partially automate the comparative method by which linguists traditionally reconstruct languages.[20]
Business and marketing research has also applied multiple sequence
alignment techniques in analyzing series of purchases over time.[21]
Software
Common software tools used for general sequence alignment tasks include DNA Baser, RNA Baser, ClustalW and T-Coffee for alignment, and BLAST for database searching. A more complete list of available software, categorized by algorithm and alignment type, is available in the list of sequence alignment software.
Alignment algorithms and software can be directly compared to one another using a standardized set of benchmark reference multiple sequence alignments known as BAliBASE.[22]
The data set consists of structural alignments, which can be considered
a standard against which purely sequence-based methods are compared.
The relative performance of many common alignment methods on frequently
encountered alignment problems has been tabulated and selected results
published online at BAliBASE.[23]
A comprehensive list of BAliBASE scores for many (currently 12)
different alignment tools can be computed within the protein workbench STRAP.
References
- Schneider TD, Stephens RM (1990). "Sequence logos: a new way to display consensus sequences". Nucleic Acids Res 18: 6097–6100. doi:10.1093/nar/18.20.6097. PMID 2172928.
- Brudno M, Malde S, Poliakov A, Do CB, Couronne O, Dubchak I, Batzoglou S (2003). "Glocal alignment: finding rearrangements during alignment". Bioinformatics 19 Suppl 1: i54–62. doi:10.1093/bioinformatics/btg1005. PMID 12855437.
- Mount DM (2004). Bioinformatics: Sequence and Genome Analysis 2nd ed. Cold Spring Harbor Laboratory Press: Cold Spring Harbor, NY. ISBN 0-87969-608-7.
- Wang L, Jiang T (1994). "On the complexity of multiple sequence alignment". J Comput Biol 1: 337–48. PMID 8790475.
- Lipman DJ, Altschul SF, Kececioglu JD (1989). "A tool for multiple sequence alignment". Proc Natl Acad Sci U S A 86: 4412–5. doi:10.1073/pnas.86.12.4412. PMID 2734293.
- Higgins DG, Sharp PM (1988). "CLUSTAL: a package for performing multiple sequence alignment on a microcomputer". Gene 73 (1): 237–44. doi:10.1016/0378-1119(88)90330-7. PMID 3243435.
- Thompson JD, Higgins DG, Gibson TJ (1994). "CLUSTAL W: improving the sensitivity of progressive multiple sequence alignment through sequence weighting, position-specific gap penalties and weight matrix choice". Nucleic Acids Res 22: 4673–80. doi:10.1093/nar/22.22.4673. PMID 7984417.
- Chenna R, Sugawara H, Koike T, Lopez R, Gibson TJ, Higgins DG, Thompson JD (2003). "Multiple sequence alignment with the Clustal series of programs". Nucleic Acids Res 31: 3497–500. doi:10.1093/nar/gkg500. PMID 12824352.
- Notredame C, Higgins DG, Heringa J (2000). "T-Coffee: A novel method for fast and accurate multiple sequence alignment". J Mol Biol 302 (1): 205–17. doi:10.1006/jmbi.2000.4042. PMID 10964570.
- Hirosawa M, Totoki Y, Hoshida M, Ishikawa M (1995). "Comprehensive study on iterative algorithms of multiple sequence alignment". Comput Appl Biosci 11: 13–8. doi:10.1093/bioinformatics/11.1.13. PMID 7796270.
- Karplus K, Barrett C, Hughey R (1998). "Hidden Markov models for detecting remote protein homologies". Bioinformatics 14 (10): 846–856. doi:10.1093/bioinformatics/14.10.846. PMID 9927713.
- Chothia C, Lesk AM (1986). "The relation between the divergence of sequence and structure in proteins". EMBO J 5 (4): 823–6. PMID 3709526.
- Zhang Y, Skolnick J (2005). "The protein structure prediction problem could be solved using the current PDB library". Proc Natl Acad Sci U S A 102: 1029–34. doi:10.1073/pnas.0407152101. PMID 15653774.
- Holm L, Sander C (1996). "Mapping the protein universe". Science 273: 595–603. doi:10.1126/science.273.5275.595. PMID 8662544.
- Taylor WR, Flores TP, Orengo CA (1994). "Multiple protein structure alignment". Protein Sci 3: 1858–70. PMID 7849601.
- Orengo CA, Michie AD, Jones S, Jones DT, Swindells MB, Thornton JM (1997). "CATH--a hierarchic classification of protein domain structures". Structure 5: 1093–108. doi:10.1016/S0969-2126(97)00260-8. PMID 9309224.
- Shindyalov IN, Bourne PE (1998). "Protein structure alignment by incremental combinatorial extension (CE) of the optimal path". Protein Eng 11: 739–47. doi:10.1093/protein/11.9.739. PMID 9796821.
- Felsenstein J (2004). Inferring Phylogenies. Sinauer Associates: Sunderland, MA. ISBN 0-87893-177-5.
- Barzilay R, Lee L (2002). "Bootstrapping Lexical Choice via Multiple-Sequence Alignment". Proceedings of the Conference on Empirical Methods in Natural Language Processing (EMNLP) 10: 164–171. doi:10.3115/1118693.1118715.
- Kondrak G (2002). "Algorithms for Language Reconstruction" (PDF). University of Toronto, Ontario. Retrieved on 2007-01-21.
- Prinzie A, Van den Poel D (2006). "Incorporating sequential information into traditional classification models by using an element/position-sensitive SAM". Decision Support Systems 42 (2): 508–526. doi:10.1016/j.dss.2005.02.004. See also Prinzie A (2007). "Predicting home-appliance acquisition sequences: Markov/Markov for Discrimination and survival analysis for modeling sequential information in NPTB models". Decision Support Systems 44 (1): 28–45. doi:10.1016/j.dss.2007.02.008.
- Thompson JD, Plewniak F, Poch O (1999). "BAliBASE: a benchmark alignment database for the evaluation of multiple alignment programs". Bioinformatics 15: 87–8. doi:10.1093/bioinformatics/15.1.87. PMID 10068696.
- Thompson JD, Plewniak F, Poch O (1999). "A comprehensive comparison of multiple sequence alignment programs". Nucleic Acids Res 27: 2682–90. doi:10.1093/nar/27.13.2682. PMID 10373585.
This article is licensed under the GNU Free Documentation License. It uses material from Wikipedia Encyclopedia article "Sequence Alignment"