MUSCLE (alignment software)

From Wikipedia, the free encyclopedia
MUltiple Sequence Comparison by Log-Expectation
Original author(s): Robert C. Edgar
Developer(s): drive5
Initial release: 2004
Stable release: 5.3 / 10 November 2024
Repository: github.com/rcedgar/muscle/releases/tag/v5.1 (GitHub)
Operating system: Linux, macOS, Windows
Platform: IA-32, x86-64
Available in: English
Type: Multiple sequence alignment
License: Public domain
Website: drive5.com/muscle/

MUltiple Sequence Comparison by Log-Expectation (MUSCLE) is computer software for multiple sequence alignment of protein and nucleotide sequences. It is released into the public domain. The method was published by Robert C. Edgar in two papers in 2004. The first paper, published in Nucleic Acids Research, introduced the sequence alignment algorithm.[1] The second paper, published in BMC Bioinformatics, presented more technical details.[2]

History

Robert C. Edgar

Edgar graduated in 1982 from University College London (BSc in physics, PhD in particle physics).[3] He pursued software development after graduation and founded his own company, Parity Software, in 1988.[3] In 2001, he began working with coding algorithms after attending a seminar at the University of California, Berkeley.[4] Since 2001, Edgar has contributed to or been the sole creator of multiple software programs, including MUSCLE and USEARCH.[3] He has written 96 papers in the field of computational biology since 2002, the most recent being "Discovery and Validation of Alternatives to VSV-G for Pseudotyping of Lentiviral Vectors for In Vivo Delivery of Anti-Tumor Transgenes". As of April 2025, his work has been cited more than 143,126 times.[5] The two original MUSCLE papers have been cited more than 58,979 times combined: "MUSCLE: multiple sequence alignment with high accuracy and high throughput"[1] has received more than 49,052 citations,[5] while "MUSCLE: a multiple sequence alignment method with reduced time and space complexity"[2] has been cited more than 9,936 times.[5]

MUSCLE version history

| MUSCLE version | Date published | Summary | Reference |
| --- | --- | --- | --- |
| MUSCLE v1 | March 1, 2004 | The method was first published on March 1, 2004; however, the software was already at v3.2 by then.[1] MUSCLE v1 to v3.1 were presumably never released to the public before publication, as no record of earlier versions exists online. | MUSCLE: multiple sequence alignment with high accuracy and high throughput |
| MUSCLE v3.3 & MUSCLE-fast | August 19, 2004 | The paper "MUSCLE: a multiple sequence alignment method with reduced time and space complexity", published on August 19, 2004, accompanied a newer version of MUSCLE (v3.3), released alongside MUSCLE-fast. MUSCLE-fast, as the name suggests, was designed specifically for "high-throughput applications".[6] | MUSCLE: a multiple sequence alignment method with reduced time and space complexity |
| MUSCLE v3.6 | September 2005 | In September 2005, the MUSCLE User Guide was published, describing how to use v3.6, the latest version of the software at the time.[7] | MUSCLE User Guide |
| MUSCLE v3.8.31 | September 15, 2010 | MUSCLE v3.8.31 was released on September 15, 2010, and was the latest version before the release of MUSCLE v5. | MUSCLE 3.8.31 download on Drive5 |
| MUSCLE v5.0-5.3 | June 21, 2021 | MUSCLE v5 was published on June 21, 2021, in the paper "MUSCLE v5 enables improved estimates of phylogenetic tree confidence by ensemble bootstrapping". Since then, v5.1, v5.2, and v5.3 have been released; all versions are accessible through Edgar's website, https://drive5.com/software.html. | MUSCLE v5 enables improved estimates of phylogenetic tree confidence by ensemble bootstrapping |

Muscle5

Diagram showcasing how Muscle5 generates ensembles

Overview

In 2021, Edgar released Muscle5 (also referred to as MUSCLE v5), an updated version of the MUSCLE software. It introduces several innovations aimed at improving alignment accuracy and reducing the bias found in other MSA algorithms. Traditional tools such as Clustal Omega, MAFFT, and earlier versions of MUSCLE rely on progressive alignment strategies that produce a single alignment. Muscle5, in contrast, generates an ensemble of high-accuracy alignments by perturbing a hidden Markov model and permuting its guide tree. At its core, the algorithm is a parallelized reimplementation of ProbCons designed to scale efficiently to large datasets. Muscle5 has demonstrated improved benchmark performance compared to leading MSA methods across several datasets, including BAliBASE, BRAliBASE, and PREFAB.[8]

Ensembles

A key innovation in Muscle5 is the use of alignment ensembles, which provide unbiased metrics of confidence in alignments. Each individual MSA (replicate) in the ensemble uses fixed but independently chosen parameters for the hidden Markov model and guide tree, allowing results to be averaged over a diverse set of replicates. This enables biologists to assess how sensitive their downstream analyses are to alignment uncertainty by comparing results across the ensemble.[8]
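
One simple way to use such an ensemble is to ask how often each column of a chosen alignment reappears in the other replicates. The sketch below is illustrative only and is not taken from Muscle5: the function names are hypothetical, alignments are assumed to be lists of equal-length gapped strings with `-` as the gap character, and every replicate is assumed to list the same sequences in the same order.

```python
def column_signatures(alignment):
    """Describe each column by the residue index it holds in every row.

    A gap is recorded as None, so two columns from different alignments
    are equal exactly when they align the same residues.
    """
    counters = [0] * len(alignment)
    signatures = []
    for column in zip(*alignment):
        entry = []
        for row, char in enumerate(column):
            if char == "-":
                entry.append(None)
            else:
                entry.append(counters[row])
                counters[row] += 1
        signatures.append(tuple(entry))
    return signatures


def column_confidence(reference, replicates):
    """For each reference column, the fraction of replicates containing it."""
    replicate_sets = [set(column_signatures(r)) for r in replicates]
    return [
        sum(sig in s for s in replicate_sets) / len(replicate_sets)
        for sig in column_signatures(reference)
    ]


# Hypothetical three-replicate ensemble for two short sequences.
ensemble = [
    ["ACGT", "A-GT"],
    ["ACGT", "AG-T"],
    ["ACGT", "A-GT"],
]
confidence = column_confidence(ensemble[0], ensemble)
```

Columns that appear in every replicate receive a value of 1.0, while columns that depend on a particular parameter choice receive lower values, which flags regions where downstream conclusions may be sensitive to the alignment.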

Algorithm

The MUSCLE algorithm proceeds in three stages: draft progressive, improved progressive, and refinement.

Stage 1: Draft Progressive

In this first stage, the algorithm produces a multiple alignment, emphasizing speed over accuracy. This step begins by computing the k-mer distance for every pair of input sequences to create a distance matrix. UPGMA clusters the distance matrix to produce a binary tree. From this tree a progressive alignment is constructed, beginning with the creation of profiles for each leaf of the tree. For every node in the tree, a pairwise alignment is constructed of the two child profiles, creating a new profile to be assigned to that node. This continues until there is a multiple sequence alignment of all input sequences at the root of the tree.

Given $N$ input sequences with average sequence length $L$, the time complexity of the draft progressive stage is

$$O(N^2 L + N L^2).$$

Here, the pairwise $k$-mer distance calculation takes $O(N^2 L)$ and the progressive alignment steps take $O(N L^2)$, where $O$ denotes the asymptotic upper bound. The space complexity is $O(N^2 + NL + L^2)$, as the algorithm maintains profiles and alignments for each sequence across the tree.[1]
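
The distance and clustering steps of the draft stage can be sketched as follows. This is a minimal illustration under stated assumptions, not MUSCLE's implementation: the distance used here is simply one minus the fraction of shared k-mers (a stand-in for MUSCLE's own k-mer distance), the value `k=3` is arbitrary, and the function names are hypothetical.

```python
from itertools import combinations


def kmer_set(seq, k=3):
    """Return the set of all length-k substrings of seq."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def kmer_distance(a, b, k=3):
    """One minus the fraction of shared k-mers (simplified assumption)."""
    ka, kb = kmer_set(a, k), kmer_set(b, k)
    if not ka or not kb:
        return 1.0
    return 1.0 - len(ka & kb) / min(len(ka), len(kb))


def distance_matrix(seqs, k=3):
    """Map every unordered pair of sequence names to its k-mer distance."""
    return {
        frozenset((a, b)): kmer_distance(seqs[a], seqs[b], k)
        for a, b in combinations(seqs, 2)
    }


def upgma(names, dist):
    """Cluster with UPGMA and return the tree as nested 2-tuples."""
    clusters = {name: 1 for name in names}  # cluster -> number of leaves
    d = dict(dist)
    while len(clusters) > 1:
        # Join the two closest clusters.
        x, y = min(combinations(clusters, 2), key=lambda p: d[frozenset(p)])
        merged = (x, y)
        nx, ny = clusters.pop(x), clusters.pop(y)
        # Distance to the new cluster is the size-weighted average.
        for z in clusters:
            d[frozenset((merged, z))] = (
                nx * d[frozenset((x, z))] + ny * d[frozenset((y, z))]
            ) / (nx + ny)
        clusters[merged] = nx + ny
    return next(iter(clusters))


# Hypothetical input: two similar sequences and one unrelated sequence.
seqs = {"a": "ACDEFGHIK", "b": "ACDEFGHIR", "c": "WWYYPPQQN"}
tree = upgma(list(seqs), distance_matrix(seqs))
```

The resulting binary tree groups the two similar sequences first; a progressive aligner would then align profiles from the leaves toward the root in that order. Filling the matrix requires one comparison per pair of sequences, which is where the $N^2$ factor in the distance step comes from.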

Stage 2: Improved Progressive

This stage aims to obtain a more accurate tree. It calculates the Kimura distance for each pair of input sequences using the multiple sequence alignment obtained in Stage 1, producing a second distance matrix. UPGMA clusters this distance matrix to obtain a second binary tree. A progressive alignment is then performed as in Stage 1, but alignments are recomputed only in subtrees whose branching order differs from the first binary tree, resulting in a more accurate alignment at reduced cost.

The second stage thus repeats the subprocedures of the first, pairwise distance calculation and progressive alignment, using the more accurate Kimura-based tree, and saves computation by limiting re-alignment to subtrees with altered branching orders. Its cost therefore depends on the number of subtree realignments and stays within the bound for the first two stages,

$$O(N^2 L + N L^2).$$

Similarly, the space complexity is

$$O(N^2 + NL + L^2),$$

as profiles and alignments for the input sequences are stored for the progressive alignment process.[1]
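
The Kimura distance used in this stage can be illustrated with a short sketch. It assumes the standard Kimura correction for protein sequences, $d = -\ln(1 - D - D^2/5)$, where $D$ is the fraction of differing residues; measuring identity only over columns where neither row has a gap is a simplifying assumption, and the function names are hypothetical rather than MUSCLE's own.

```python
import math


def fractional_identity(row_a, row_b):
    """Fraction of identical residues over columns without a gap in either row."""
    pairs = [(x, y) for x, y in zip(row_a, row_b) if x != "-" and y != "-"]
    if not pairs:
        return 0.0
    return sum(x == y for x, y in pairs) / len(pairs)


def kimura_distance(row_a, row_b):
    """Kimura correction d = -ln(1 - D - D^2 / 5), with D = 1 - identity."""
    d_obs = 1.0 - fractional_identity(row_a, row_b)
    arg = 1.0 - d_obs - d_obs * d_obs / 5.0
    if arg <= 0.0:
        # The correction is undefined for very divergent pairs.
        return float("inf")
    return -math.log(arg)


# Two rows taken from a hypothetical Stage 1 alignment (90% identical).
distance = kimura_distance("ACDEFGHIKL", "ACDEFGHIKV")
```

Because the rows come from an existing multiple alignment, no new pairwise alignment is needed; each distance is a single pass over the aligned columns.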

Stage 3: Refinement

In this final stage, an edge is chosen from the second tree, with edges visited in order of decreasing distance from the root. The chosen edge is deleted, dividing the tree into two subtrees. The profile of the multiple alignment is then computed for each subtree, and a new multiple sequence alignment is produced by re-aligning the two subtree profiles. If the sum-of-pairs (SP) score improves, the new alignment is kept; otherwise, it is discarded. The process of deleting an edge and re-aligning is repeated until convergence or until a user-defined limit is reached.
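
The keep-or-discard rule of the refinement stage can be sketched as follows. The scoring values (match 1, mismatch 0, residue-gap pair -1) are illustrative assumptions rather than MUSCLE's scoring scheme, the function names are hypothetical, and the sketch covers only the acceptance test, not the tree bipartition or the profile re-alignment.

```python
from itertools import combinations


def sp_score(alignment, match=1, mismatch=0, gap=-1):
    """Sum-of-pairs score: score every pair of rows in every column."""
    total = 0
    for column in zip(*alignment):
        for x, y in combinations(column, 2):
            if x == "-" and y == "-":
                continue  # gap-gap pairs contribute nothing
            if x == "-" or y == "-":
                total += gap
            elif x == y:
                total += match
            else:
                total += mismatch
    return total


def accept_if_better(current, candidate):
    """Keep the candidate alignment only if it improves the SP score."""
    return candidate if sp_score(candidate) > sp_score(current) else current


# Hypothetical alignments of the same two sequences before and after
# re-aligning the two subtree profiles.
current = ["ACG-", "A-CG"]
candidate = ["ACG", "ACG"]
kept = accept_if_better(current, candidate)
```

Since a candidate is accepted only when the score strictly increases, the SP score never decreases from one iteration to the next.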

The time complexity of the refinement stage is $O(N^3 L)$,[1] where $N$ is the number of input sequences and $L$ is the average sequence length. The cost grows linearly with the number of edge deletions performed, and re-alignment of the subtree profiles remains the dominant cost per iteration, so the asymptotic complexity stays polynomial. The space complexity is the same as in Stages 1 and 2: $O(N^2 + NL + L^2)$.

In comparison, the CLUSTALW algorithm is described as including an optimized iterative refinement step in which parts of the tree are selectively re-aligned to maximize alignment accuracy without repeating the entire process. This optimization does not change its time or space complexity: the time complexity depends on the number of refinement steps and the average alignment length, while the space complexity is determined by the alignment profiles and sequence data stored for all input sequences.[1][2]

Comparison of Time Complexity and Alignment Accuracy

| Feature | CLUSTALW | MUSCLE v3 |
| --- | --- | --- |
| Time complexity | | |
| Alignment accuracy | Moderate | Higher, due to iterative refinement improving SP score |

Complexity and Comparison

In the first two stages of the algorithm, the time complexity is $O(N^2 L + N L^2)$ and the space complexity is $O(N^2 + NL + L^2)$. The refinement stage adds another term, $O(N^3 L)$, to the time complexity.[1] MUSCLE is often used as a replacement for Clustal, since it usually (but not always) gives better sequence alignments. Depending on the chosen options, MUSCLE is also significantly faster than Clustal, more so for larger alignments.[1][2]

Most modern multiple sequence alignment programs are generally accepted for presenting aligned sequences, and they differ in only a few respects. The main difference is the method used to align the sequences. For instance, T-COFFEE and Clustal use the progressive method, while MUSCLE and MAFFT use the iterative method of alignment.[9] The two methods differ in their ability to handle low-similarity sequences, with the iterative method providing more accurate results. Programs also differ in their computational requirements.

Originally, MUSCLE had middling CPU demands compared with other programs, though higher than those of the progressive methods.[1] Comparisons between modern versions of MSA programs show that many have similar capabilities. In one such comparison, alignments were assessed by their sum-of-pairs (SP) score, which counts correctly matched pairs of nucleotides or amino acids across two sequences, and their total-column (TC) score, the number of correctly matched columns divided by the total number of columns. By these measures, MUSCLE was average in its ability to maximize matching pairs and columns, scoring slightly worse than ProbCons, T-Coffee, Probalign, and MAFFT.[10] Apart from the alignment scores, MUSCLE was less computationally demanding in both execution time and memory use.
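
The two benchmark scores can be made concrete with a short sketch. It follows the definitions given above (SP as the fraction of reference residue pairs that a test alignment reproduces, TC as the fraction of reference columns it reproduces exactly); the function names are hypothetical, alignments are assumed to be lists of equal-length gapped strings with the same row order, and benchmark suites may restrict scoring to selected reference columns, which this sketch does not do.

```python
from itertools import combinations


def columns(alignment):
    """Describe each column as a tuple of residue indices (None for a gap)."""
    counters = [0] * len(alignment)
    result = []
    for column in zip(*alignment):
        entry = []
        for row, char in enumerate(column):
            if char == "-":
                entry.append(None)
            else:
                entry.append(counters[row])
                counters[row] += 1
        result.append(tuple(entry))
    return result


def aligned_pairs(alignment):
    """Set of residue pairs (row_i, pos_i, row_j, pos_j) sharing a column."""
    pairs = set()
    for col in columns(alignment):
        for (i, a), (j, b) in combinations(enumerate(col), 2):
            if a is not None and b is not None:
                pairs.add((i, a, j, b))
    return pairs


def sp_and_tc(test, reference):
    """Return (SP, TC) of a test alignment against a reference alignment.

    Assumes the reference aligns at least one pair of residues.
    """
    ref_pairs = aligned_pairs(reference)
    sp = len(ref_pairs & aligned_pairs(test)) / len(ref_pairs)
    ref_cols = columns(reference)
    test_cols = set(columns(test))
    tc = sum(col in test_cols for col in ref_cols) / len(ref_cols)
    return sp, tc


# Hypothetical reference and test alignments of the same two sequences.
reference = ["ACGT", "A-GT"]
test = ["ACGT", "AG-T"]
sp, tc = sp_and_tc(test, reference)
```

TC is the stricter measure, since a single misplaced residue invalidates its whole column, which is consistent with the TC values in the table below being lower than the SP values.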

MSA Computational Properties [10]

| Program | Average alignment time (s) | Average memory usage (MB) | Short sequence average SP | Short sequence average TC | Long sequence average SP | Long sequence average TC |
| --- | --- | --- | --- | --- | --- | --- |
| MUSCLE | 301.4 | 60.8 | 0.718 | 0.341 | 0.789 | 0.437 |
| Probalign | 1410.2 | 162.7 | 0.774 | 0.219 | 0.800 | 0.455 |
| ProbCons | 1781.7 | 192.5 | 0.763 | 0.220 | 0.831 | 0.524 |
| MAFFT | 309.0 | 231.6 | 0.767 | 0.421 | 0.803 | 0.470 |
| T-Coffee | 1964.9 | 372.2 | 0.760 | 0.407 | 0.830 | 0.520 |

Integration

MUSCLE is widely supported across bioinformatics platforms. It is fully integrated into software such as DNASTAR's Lasergene, Geneious, and MacVector, and is available as a plug-in for Sequencher, MEGA, UGENE, and AliView. Users can also access MUSCLE as a web service via the European Molecular Biology Laboratory's European Bioinformatics Institute (EMBL-EBI)[11] or T-Coffee, or download it to their own devices from the official website.

| Platform | Integration type | Access method | Source |
| --- | --- | --- | --- |
| Geneious | Built-in | Available under Align/Assemble → Multiple Align → MUSCLE | [12] |
| DNASTAR Lasergene | Built-in | Available in the MegAlign Pro module under Align → Align using Muscle | [13] |
| MacVector | Built-in | Found under Align | [14] |
| Sequencher (Gene Codes Corp.) | Plug-in | Requires installation of the MUSCLE plugin; accessed under Assemble → Align Using → Muscle | [15] |
| MEGA | Plug-in | Found under Alignment Explorer → Align → MUSCLE | [16] |
| UGENE | Plug-in | Accessible by right-clicking → Align → Align with Muscle | [17] |
| AliView | Plug-in | Requires installation of the MUSCLE plugin; accessible via Preferences → External Tools → Set path to MUSCLE | [18] |
| EMBL-EBI | Web tool | Accessible via the MUSCLE web interface | [11] |
| T-Coffee | Web tool | Accessible via the "more options" menu by selecting muscle in the methods section | [19] |

References

  1. Edgar RC (2004). "MUSCLE: multiple sequence alignment with high accuracy and high throughput". Nucleic Acids Research. 32 (5): 1792–97. doi:10.1093/nar/gkh340. PMC 390337. PMID 15034147.
  2. Edgar RC (2004). "MUSCLE: a multiple sequence alignment method with reduced time and space complexity". BMC Bioinformatics. 5 (1): 113. doi:10.1186/1471-2105-5-113. PMC 517706. PMID 15318951.
  3. "Curriculum Vitae". drive5.com. Retrieved 2025-04-24.
  4. Edgar, Robert (September 3, 2014). "An Unemployed Gentleman Scholar". Retrieved April 24, 2025.
  5. "Robert C. Edgar". scholar.google.com. Retrieved 2025-04-24.
  6. Edgar, Robert C. (2004-08-19). "MUSCLE: a multiple sequence alignment method with reduced time and space complexity". BMC Bioinformatics. 5 (1): 113. doi:10.1186/1471-2105-5-113. ISSN 1471-2105. PMC 517706. PMID 15318951.
  7. "Muscle user guide". scholar.google.com. Retrieved 2025-04-22.
  8. Edgar, Robert C. (2022). "Muscle5: High-accuracy alignment ensembles enable unbiased assessments of sequence homology and phylogeny". Nature Communications. 13 (6968): 1–9. doi:10.1038/s41467-022-34630-w.
  9. Zhang, Chenyue (November 29, 2024). "The Historical Evolution and Significance of Multiple Sequence Alignment in Molecular Structure and Function Prediction". Biomolecules. 14 (12): 1531. doi:10.3390/biom14121531. PMC 11673352. PMID 39766238.
  10. Pais, Fabiano (6 March 2014). "Assessing the efficiency of multiple sequence alignment programs". Algorithms for Molecular Biology. 9 (4): 4. doi:10.1186/1748-7188-9-4. PMC 4015676. PMID 24602402.
  11. "MUSCLE < Multiple Sequence Alignment < EMBL-EBI". Archived from the original on 18 January 2015. Retrieved 1 September 2014.
  12. Geneious: Multiple Alignment using MUSCLE
  13. DNASTAR: MUSCLE alignment options
  14. Alignments in MacVector
  15. Sequencher: MUSCLE MSA
  16. MEGA Help: MUSCLE Alignment
  17. UGENE: MSA with Muscle
  18. AliView: About
  19. T-Coffee: Tutorial