MUSCLE (alignment software)
Original author(s) | Robert C. Edgar |
---|---|
Developer(s) | drive5 |
Initial release | 2004 |
Stable release | 5.3 / 10 November 2024 |
Repository | github |
Operating system | Linux, macOS, Windows |
Platform | IA-32, x86-64 |
Available in | English |
Type | Multiple sequence alignment |
License | Public domain |
Website | drive5 |
MUltiple Sequence Comparison by Log-Expectation (MUSCLE) is computer software for multiple sequence alignment of protein and nucleotide sequences. It is released into the public domain. The method was published by Robert C. Edgar in two papers in 2004. The first paper, published in Nucleic Acids Research, introduced the sequence alignment algorithm.[1] The second paper, published in BMC Bioinformatics, presented more technical details.[2]
History
Robert C. Edgar
Edgar graduated in 1982 from University College London, where he earned a BSc in physics and a PhD in particle physics.[3] He pursued software development after graduating and founded his own company, Parity Software, in 1988.[3] In 2001, he began working with coding algorithms after attending a seminar at the University of California, Berkeley.[4] Since 2001, Edgar has contributed to or been the sole creator of multiple software programs, including MUSCLE and USEARCH.[3] Since 2002, he has written 96 papers in the field of computational biology, the most recent being "Discovery and Validation of Alternatives to VSV-G for Pseudotyping of Lentiviral Vectors for In Vivo Delivery of Anti-Tumor Transgenes". As of April 2025, his work has been cited more than 143,126 times.[5] The two original MUSCLE papers have been cited more than 58,979 times combined: "MUSCLE: multiple sequence alignment with high accuracy and high throughput"[1] has received more than 49,052 citations,[5] while "MUSCLE: a multiple sequence alignment method with reduced time and space complexity"[2] has been cited more than 9,936 times.[5]
MUSCLE Version History
MUSCLE Version | Date Published | Summary | Reference |
---|---|---|---|
MUSCLE v1 | March 1, 2004 | The method was first published on March 1, 2004, by which time the software was already at v3.2.[1] Versions v1 through v3.1 were presumably never released to the public, as no record of them exists online. | MUSCLE: multiple sequence alignment with high accuracy and high throughput |
MUSCLE v3.3 & MUSCLE-fast | August 19, 2004 | In the paper "MUSCLE: a multiple sequence alignment method with reduced time and space complexity", published on August 19, 2004, a newer version of MUSCLE (v3.3) was released alongside MUSCLE-fast. As the name suggests, MUSCLE-fast was designed specifically for "high-throughput applications".[6] | MUSCLE: a multiple sequence alignment method with reduced time and space complexity |
MUSCLE v3.6 | September 2005 | The MUSCLE User Guide, published in September 2005, describes how to use v3.6, the latest version of the software at that time.[7] | MUSCLE User Guide |
MUSCLE v3.8.31 | September 15, 2010 | MUSCLE v3.8.31 was released on September 15, 2010 and was the latest version prior to the release of MUSCLE v5. | MUSCLE 3.8.31 download on Drive5 |
MUSCLE v5.0-5.3 | June 21, 2021 | MUSCLE v5 was published on June 21, 2021 in the paper MUSCLE v5 enables improved estimates of phylogenetic tree confidence by ensemble bootstrapping. Since then, v5.1, 5.2, and 5.3 have been released, with all versions accessible through Edgar's website https://drive5.com/software.html. | MUSCLE v5 enables improved estimates of phylogenetic tree confidence by ensemble bootstrapping |
Muscle5
Overview
In late 2021, Edgar released Muscle5 (also referred to as Muscle v5), an updated version of the MUSCLE software. It introduces several innovations aimed at improving alignment accuracy and reducing bias found in other MSA algorithms. Traditional tools such as Clustal Omega, MAFFT, and earlier versions of MUSCLE rely on progressive alignment strategies that produce a single alignment. Muscle5, in contrast, generates an ensemble of high-accuracy alignments by perturbing a hidden Markov model and permuting its guide tree. At its core, the algorithm is a parallelized reimplementation of ProbCons, and is designed to scale efficiently to large datasets. Muscle5 has demonstrated improved benchmark performance compared to leading MSA methods across several datasets, including BAliBASE, BRAliBASE, and PREFAB.[8]
Ensembles
A key innovation in Muscle5 is the use of alignment ensembles, which provide unbiased metrics of confidence in alignments. Each individual MSA (replicate) in the ensemble uses fixed but independently chosen parameters for the hidden Markov model and guide tree, allowing results to be averaged over a diverse set of replicates. This enables biologists to assess how sensitive their downstream analyses are to alignment uncertainty by comparing results across the ensemble.[8]
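As a rough illustration of how an ensemble can be turned into a per-column confidence value, the sketch below scores each column of one replicate by the fraction of replicates that reproduce it exactly. The function names and the toy two-sequence ensemble are assumptions made for this example, and the scoring rule is a simplification rather than Muscle5's own implementation.

```python
def columns(rows):
    """Yield each column as a tuple of residue indices (None marks a gap)."""
    counters = [0] * len(rows)
    for col in zip(*rows):
        key = []
        for i, ch in enumerate(col):
            if ch == "-":
                key.append(None)
            else:
                key.append(counters[i])
                counters[i] += 1
        yield tuple(key)


def column_confidence(reference, ensemble):
    """For each reference column, the fraction of replicates containing it."""
    replicate_columns = [set(columns(rows)) for rows in ensemble]
    return [
        sum(col in cols for cols in replicate_columns) / len(ensemble)
        for col in columns(reference)
    ]


# Toy ensemble: three replicate alignments of the same two sequences.
ensemble = [
    ["AC-GT", "ACG-T"],
    ["ACGT", "ACGT"],
    ["AC-GT", "ACG-T"],
]
confidence = column_confidence(ensemble[0], ensemble)
```

Columns that every replicate agrees on receive a value of 1.0, while columns that depend on the particular replicate receive lower values, which is the kind of sensitivity a downstream analysis can then take into account.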
Algorithm
The MUSCLE algorithm proceeds in three stages: the draft progressive, improved progressive, and refinement stages.
Stage 1: Draft Progressive
In this first stage, the algorithm produces a multiple alignment, emphasizing speed over accuracy. This step begins by computing the k-mer distance for every pair of input sequences to create a distance matrix. UPGMA clusters the distance matrix to produce a binary tree. From this tree a progressive alignment is constructed, beginning with the creation of profiles for each leaf of the tree. For every node in the tree, a pairwise alignment is constructed of the two child profiles, creating a new profile to be assigned to that node. This continues until there is a multiple sequence alignment of all input sequences at the root of the tree.
Given $N$ input sequences with average sequence length $L$, the time complexity of the draft progressive stage is
$O(N^2 L + N L^2)$.
Here, the pairwise $k$-mer distance calculation takes $O(N^2 L)$ and the progressive alignment steps take $O(N L^2)$, where $O$ denotes the asymptotic upper bound. The space complexity is $O(N^2 + NL + L^2)$, as the algorithm maintains profiles and alignments for each sequence across the tree.[1]
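The first two steps of this stage, the k-mer distance matrix and UPGMA clustering, can be sketched as follows. This is a minimal illustration under stated assumptions: the distance is taken as one minus the fraction of shared k-mers relative to the shorter sequence, the word length `k = 3` is arbitrary, sequences are assumed to be at least `k` long, and the function names and toy sequences are invented for the example. It does not reproduce MUSCLE's compressed alphabets or its optimized clustering code.

```python
from collections import Counter
from itertools import combinations


def kmer_distance(x: str, y: str, k: int = 3) -> float:
    """1 - (shared k-mers / k-mers in the shorter sequence)."""
    cx = Counter(x[i:i + k] for i in range(len(x) - k + 1))
    cy = Counter(y[i:i + k] for i in range(len(y) - k + 1))
    shared = sum((cx & cy).values())        # multiset intersection
    possible = min(len(x), len(y)) - k + 1  # assumes both sequences have length >= k
    return 1.0 - shared / possible


def upgma(names, dist):
    """Cluster with UPGMA and return the tree as nested tuples.

    dist maps frozenset({a, b}) -> distance for every pair of names.
    """
    clusters = {name: 1 for name in names}  # cluster -> number of leaves
    d = dict(dist)
    while len(clusters) > 1:
        a, b = min(combinations(clusters, 2), key=lambda p: d[frozenset(p)])
        merged = (a, b)
        size_a, size_b = clusters.pop(a), clusters.pop(b)
        for c in clusters:
            # Average linkage, weighted by the number of leaves in each cluster.
            d[frozenset((merged, c))] = (
                size_a * d[frozenset((a, c))] + size_b * d[frozenset((b, c))]
            ) / (size_a + size_b)
        clusters[merged] = size_a + size_b
    return next(iter(clusters))


seqs = {"A": "ACGTACGT", "B": "ACGTACGA", "C": "TTTTGGGG"}
dist = {
    frozenset((a, b)): kmer_distance(seqs[a], seqs[b])
    for a, b in combinations(seqs, 2)
}
tree = upgma(list(seqs), dist)
print(tree)
```

The nested tuples give the order in which a progressive aligner would merge profiles: the two most similar sequences are joined first, and the resulting profile is then aligned to the remaining sequence.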
Stage 2: Improved Progressive
This stage focuses on obtaining a more optimal tree by calculating the Kimura distance for each pair of input sequences using the multiple sequence alignment obtained in Stage one, creating a second distance matrix. UPGMA clusters this distance matrix to obtain a second binary tree. A progressive alignment is performed to obtain a multiple sequence alignment as in Stage one, but it is optimized by only computing alignments in subtrees whose branching orders have changed from the first binary tree, resulting in a more accurate alignment.
Refined alignments are made in the second stage by recalculating a more accurate tree via the Kimura distance. The analysis therefore involves the same subprocedures as Stage one, pairwise distance calculation and progressive alignment, but computation is saved by limiting re-alignment to only those subtrees with altered branching orders. The cost of the progressive step thus scales with the number of subtree realignments rather than with the number of input sequences. For $N$ input sequences of average length $L$, the space complexity is again $O(N^2 + NL + L^2)$, as profiles and alignments for the input sequences are stored for the progressive alignment process.[1]
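The distance computation in this stage can be illustrated with a short sketch. It assumes that "Kimura distance" refers to Kimura's correction for multiple substitutions, $d = -\ln(1 - D - D^2/5)$, where $D$ is the fraction of differing residues between two aligned rows. The function names, the gap handling, and the choice to return infinity for very divergent pairs are assumptions of this example.

```python
import math


def fractional_identity(a: str, b: str) -> float:
    """Fraction of identical residues over columns where neither row has a gap."""
    pairs = [(x, y) for x, y in zip(a, b) if x != "-" and y != "-"]
    if not pairs:
        raise ValueError("rows share no aligned residues")
    return sum(x == y for x, y in pairs) / len(pairs)


def kimura_distance(a: str, b: str) -> float:
    """Kimura's correction d = -ln(1 - D - D^2/5), with D = 1 - fractional identity."""
    d_obs = 1.0 - fractional_identity(a, b)
    arg = 1.0 - d_obs - d_obs * d_obs / 5.0
    if arg <= 0:
        return float("inf")  # too divergent for the correction (choice made in this sketch)
    return -math.log(arg)


# Two rows taken from a hypothetical Stage one alignment.
row_a = "ACGT-ACGTAC"
row_b = "ACGTTACGAAC"
distance = kimura_distance(row_a, row_b)
```

Because the correction grows faster than the raw fraction of differences, distant pairs are pushed further apart than they would be under the k-mer measure, which can change the branching order of the second tree.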
Stage 3: Refinement
In this final stage, an edge is chosen from the second tree, with edges being visited in decreasing distance from the root. The chosen edge is deleted, dividing the tree into two subtrees. The profile of the multiple alignment is then computed for each subtree. A new multiple sequence alignment is produced by re-aligning the subtree profiles. If the SP score is improved, the new alignment is kept; otherwise, it is discarded. The process of deleting an edge and re-aligning is repeated until convergence or until a user-defined limit is reached.
The time complexity of the refinement stage grows linearly with the number of edge deletions, because re-alignment of the subtree profiles remains the dominant cost of each iteration and that cost depends on the average sequence length. The space complexity remains the same as in Stages one and two, $O(N^2 + NL + L^2)$. Since the same refinement step is repeated at every iteration, the asymptotic complexity remains polynomial.
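The accept-or-discard logic of the refinement stage can be sketched as below. The scoring values are illustrative only, and `realign` is a hypothetical caller-supplied function standing in for the tree split and profile re-alignment, which are not implemented here.

```python
from itertools import combinations


def sp_score(rows, match=1, mismatch=-1, gap=-2):
    """Sum-of-pairs score with a toy scoring scheme (values are illustrative)."""
    total = 0
    for col in zip(*rows):
        for x, y in combinations(col, 2):
            if x == "-" and y == "-":
                continue  # gap-gap pairs contribute nothing
            if x == "-" or y == "-":
                total += gap
            else:
                total += match if x == y else mismatch
    return total


def refine(rows, edges, realign, max_passes=10):
    """Keep a re-alignment only when it raises the SP score.

    `realign(rows, edge)` is a hypothetical function that splits the tree at
    `edge`, re-aligns the two subtree profiles, and returns the new rows.
    """
    best = sp_score(rows)
    for _ in range(max_passes):          # user-defined limit on passes
        improved = False
        for edge in edges:
            candidate = realign(rows, edge)
            score = sp_score(candidate)
            if score > best:
                rows, best, improved = candidate, score, True
        if not improved:
            break                        # converged: no edge improved the score
    return rows, best


# Stand-in for profile re-alignment, used only to exercise the loop.
def toy_realign(rows, edge):
    return ["ACGT", "ACGT"] if edge == 0 else ["AC--GT", "ACG--T"]


final_rows, final_score = refine(["AC-GT", "ACG-T"], edges=[0, 1], realign=toy_realign)
```

In this toy run the first edge yields a higher-scoring alignment, which is kept, while the second yields no improvement and is discarded. The loop then stops after one further pass in which nothing improves.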
In comparison, the CLUSTALW algorithm includes an optimized iterative refinement step in which selective re-alignment of the tree maximizes alignment accuracy without repeating the entire process. This optimization does not change the asymptotic time or space complexity: the time complexity depends on the number of refinement steps and the average alignment length, and the space complexity again covers the alignment profiles and sequence data for all input sequences.[1][2]
Feature | CLUSTALW | MUSCLE v3 |
---|---|---|
Time Complexity | ||
Alignment Accuracy | Moderate | Higher, due to iterative refinement improving SP score |
Complexity and Comparison
In the first two stages of the algorithm, the time complexity is $O(N^2L + NL^2)$ and the space complexity is $O(N^2 + NL + L^2)$. The refinement stage adds another term, $O(N^3L)$, to the time complexity.[1] MUSCLE is often used as a replacement for Clustal, since it usually (but not always) gives better sequence alignments. Depending on the chosen options, MUSCLE can be significantly faster than Clustal, more so for larger alignments.[1][2]
Most modern multiple sequence alignment programs are generally accepted for presenting aligned sequences, but there are a few differences among them. The main difference is the method used to align the sequences. For instance, T-Coffee and Clustal use the progressive method, while MUSCLE and MAFFT use the iterative method.[9] The two methods differ in their ability to handle low-similarity sequences, with the iterative method providing more accurate results. They also differ in their computational requirements.
Originally, MUSCLE had middling CPU demands compared with other programs, though higher than those of the progressive methods.[1] Comparisons of modern versions of MSA programs reveal that many are quite similar in capability. Alignments were assessed by their sum-of-pairs (SP) score, which counts correctly matched pairs of nucleotides or amino acids across two sequences, and their total-column (TC) score, the number of matching columns divided by the total number of columns. In these comparisons, MUSCLE was average in its ability to maximize matching pairs and columns, scoring slightly worse than ProbCons, T-Coffee, Probalign and MAFFT.[10] Apart from the alignment scores, MUSCLE was less computationally demanding in both execution time and memory use.
Program | Average Alignment Time (sec) | Average Memory Usage (MB) | Short Sequence Average SP | Short Sequence Average TC | Long Sequence Average SP | Long Sequence Average TC |
---|---|---|---|---|---|---|
MUSCLE | 301.4 | 60.8 | 0.718 | 0.341 | 0.789 | 0.437 |
Probalign | 1410.2 | 162.7 | 0.774 | 0.219 | 0.800 | 0.455 |
ProbCons | 1781.7 | 192.5 | 0.763 | 0.220 | 0.831 | 0.524 |
MAFFT | 309.0 | 231.6 | 0.767 | 0.421 | 0.803 | 0.470 |
T-Coffee | 1964.9 | 372.2 | 0.760 | 0.407 | 0.830 | 0.520 |
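The SP and TC columns in the table compare a program's alignment against a reference alignment. A minimal sketch of both scores follows, assuming the usual benchmark convention that SP is the fraction of reference residue pairs also aligned in the test alignment and TC is the fraction of reference columns reproduced exactly. The function names and the toy alignments are assumptions of this example, and the cited study's exact scoring procedure may differ.

```python
from itertools import combinations


def indexed_columns(rows):
    """Each column as a tuple of per-sequence residue indices (None = gap)."""
    counters = [0] * len(rows)
    result = []
    for col in zip(*rows):
        key = []
        for i, ch in enumerate(col):
            if ch == "-":
                key.append(None)
            else:
                key.append(counters[i])
                counters[i] += 1
        result.append(tuple(key))
    return result


def aligned_pairs(cols):
    """All residue pairs (seq_i, pos_i, seq_j, pos_j) placed in the same column."""
    pairs = set()
    for col in cols:
        present = [(i, p) for i, p in enumerate(col) if p is not None]
        for (i, p), (j, q) in combinations(present, 2):
            pairs.add((i, p, j, q))
    return pairs


def sp_and_tc(test_rows, reference_rows):
    """Return (SP, TC) of a test alignment measured against a reference."""
    test_cols = indexed_columns(test_rows)
    ref_cols = indexed_columns(reference_rows)
    ref_pairs = aligned_pairs(ref_cols)
    sp = len(ref_pairs & aligned_pairs(test_cols)) / len(ref_pairs)
    test_set = set(test_cols)
    tc = sum(col in test_set for col in ref_cols) / len(ref_cols)
    return sp, tc


sp, tc = sp_and_tc(test_rows=["AC-GT", "ACG-T"], reference_rows=["ACGT", "ACGT"])
```

Note that this reference-based SP score is distinct from the sum-of-pairs objective used during refinement, which scores an alignment on its own without a reference.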
Integration
MUSCLE is widely supported across bioinformatics platforms. It is fully integrated into software such as DNASTAR's Lasergene, Geneious, and MacVector, and is also available as a plug-in for Sequencher, MEGA, UGENE, and AliView. Users can also access MUSCLE as a web service via the European Molecular Biology Laboratory's European Bioinformatics Institute (EMBL-EBI)[11] or T-Coffee. MUSCLE can also be downloaded from the official website for local use.
Platform | Integration Type | Access Method | Source |
---|---|---|---|
Geneious | Built-in | Available under Align/Assemble → Multiple Align → MUSCLE | [12] |
DNASTAR Lasergene | Built-in | Available in MegAlign Pro module under Align → Align using Muscle | [13] |
MacVector | Built-in | Found under Align | [14] |
Sequencher (Gene Codes Corp.) | Plug-in | Requires installation of MUSCLE plugin; accessed under Assemble → Align Using → Muscle | [15] |
MEGA | Plug-in | Found under Alignment Explorer → Align → MUSCLE | [16] |
UGENE | Plug-in | Accessible by right-clicking → Align → Align with Muscle | [17] |
AliView | Plug-in | Requires installation of MUSCLE plugin; accessed under Preferences → External Tools → Set path to MUSCLE | [18] |
EMBL-EBI | Web tool | Accessible via the MUSCLE Web Interface | [11] |
T-Coffee | Web tool | Accessible by opening the more options menu and selecting MUSCLE under the methods section | [19] |
References
1. Edgar RC (2004). "MUSCLE: multiple sequence alignment with high accuracy and high throughput". Nucleic Acids Research. 32 (5): 1792–97. doi:10.1093/nar/gkh340. PMC 390337. PMID 15034147.
2. Edgar RC (2004). "MUSCLE: a multiple sequence alignment method with reduced time and space complexity". BMC Bioinformatics. 5 (1): 113. doi:10.1186/1471-2105-5-113. PMC 517706. PMID 15318951.
3. "Curriculum Vitae". drive5.com. Retrieved 2025-04-24.
4. Edgar, Robert (September 3, 2014). "An Unemployed Gentleman Scholar". Retrieved April 24, 2025.
5. "Robert C. Edgar". scholar.google.com. Retrieved 2025-04-24.
6. Edgar, Robert C. (2004-08-19). "MUSCLE: a multiple sequence alignment method with reduced time and space complexity". BMC Bioinformatics. 5 (1): 113. doi:10.1186/1471-2105-5-113. ISSN 1471-2105. PMC 517706. PMID 15318951.
7. "Muscle user guide". scholar.google.com. Retrieved 2025-04-22.
8. Edgar, Robert C. (2022). "Muscle5: High-accuracy alignment ensembles enable unbiased assessments of sequence homology and phylogeny". Nature Communications. 13 (6968): 1–9. doi:10.1038/s41467-022-34630-w.
9. Zhang, Chenyue (November 29, 2024). "The Historical Evolution and Significance of Multiple Sequence Alignment in Molecular Structure and Function Prediction". Biomolecules. 14 (12): 1531. doi:10.3390/biom14121531. PMC 11673352. PMID 39766238.
10. Pais, Fabiano (6 March 2014). "Assessing the efficiency of multiple sequence alignment programs". Algorithms for Molecular Biology. 9 (4): 4. doi:10.1186/1748-7188-9-4. PMC 4015676. PMID 24602402.
11. "MUSCLE < Multiple Sequence Alignment < EMBL-EBI". Archived from the original on 18 January 2015. Retrieved 1 September 2014.
12. Geneious: Multiple Alignment using MUSCLE
13. DNASTAR: MUSCLE alignment options
14. Alignments in MacVector
15. Sequencher: MUSCLE MSA
16. MEGA Help: MUSCLE Alignment
17. UGENE: MSA with Muscle
18. AliView: About
19. T-Coffee: Tutorial