We often have to convert between sequence formats and do little tasks on them, and it’s not worth writing scripts for that. Seqmagick is a kickass little utility built in the spirit of ImageMagick to expose the file format conversion in Biopython in a convenient way. Instead of having a big mess of scripts, there is one that takes arguments:
seqmagick convert a.fasta b.phy # convert from fasta to phylip
seqmagick mogrify --ungap a.fasta # remove all gaps from a.fasta, in place
seqmagick info *.fasta # describe all FASTA files in the current directory
And more.
First, you’ll need to install BioPython. NumPy (which parts of BioPython depend on) is not required for seqmagick to function. Once done, install the latest release with:
pip install seqmagick
Or install the bleeding edge version:
pip install git+https://github.com/fhcrc/seqmagick.git@master#egg=seqmagick
Seqmagick can be used to query information about sequence files, convert between types, and modify sequence files. All functions are accessed through subcommands:
seqmagick <subcommand> [options] arguments
Convert and mogrify achieve similar goals. convert performs some operation on a file (from changing format to something more complicated) and writes to a new file. mogrify modifies a file in place, and would not normally be used to convert formats.
The two have similar signatures:
seqmagick convert [options] infile outfile
vs:
seqmagick mogrify [options] infile
Options are shared between convert and mogrify.
convert can be used to convert between any file types BioPython supports (which is many). For a full list of supported types, see the BioPython SeqIO wiki page.
By default, file type is inferred from file extension, so:
seqmagick convert a.fasta a.sto
converts an existing file a.fasta from FASTA to Stockholm format. Neat! But there’s more.
A wealth of options await you when you’re ready to do something slightly more complicated with your sequences.
Let’s say I just want a few of my sequences:
$ seqmagick convert --head 5 examples/test.fasta examples/test.head.fasta
$ seqmagick info examples/test*.fasta
name alignment min_len max_len avg_len num_seqs
examples/test.fasta FALSE 972 9719 1573.67 15
examples/test.head.fasta FALSE 978 990 984.00 5
Or I want to remove any gaps, reverse complement, select the last 5 sequences, and remove any duplicates from an alignment in place:
seqmagick mogrify --tail 5 --reverse-complement --ungap --deduplicate-sequences examples/test.fasta
You can even define your own functions in Python and use them via --apply-function.
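As a rough illustration of what such a function could look like, here is a minimal sketch. It assumes the function receives an iterable of Biopython SeqRecord-like objects and yields the records to keep, with an optional string parameter; check seqmagick convert -h for the exact interface. The file name my_filters.py, the function drop_ambiguous, and its max_n parameter are hypothetical, and the demo uses simple stand-in objects rather than real SeqRecords.

```python
# my_filters.py (hypothetical module name)
from types import SimpleNamespace


def drop_ambiguous(records, max_n="0"):
    """Yield only records with at most max_n 'N' characters in the sequence.

    Assumption: records is an iterable of objects exposing .id and .seq,
    as Biopython SeqRecord does. max_n is taken as a string because it is
    assumed to arrive from the command line.
    """
    limit = int(max_n)
    for record in records:
        if str(record.seq).upper().count("N") <= limit:
            yield record


if __name__ == "__main__":
    # Stand-in records for a quick check without Biopython installed.
    demo = [
        SimpleNamespace(id="a", seq="ACGT"),
        SimpleNamespace(id="b", seq="ACNT"),
    ]
    print([r.id for r in drop_ambiguous(demo)])
```

Assuming the path:function form is accepted, this might be invoked as seqmagick convert --apply-function my_filters.py:drop_ambiguous in.fasta out.fasta.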
Note
To maximize flexibility, most transformations passed as options to mogrify and convert are processed in order, so:
seqmagick convert --min-length 50 --cut 1:5 a.fasta b.fasta
will work fine, but:
seqmagick convert --cut 1:5 --min-length 50 a.fasta b.fasta
will never return records, since the cutting transformation happens before the minimum length predicate is applied: after the cut, every record is only a few residues long, well under the 50-residue minimum.
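The effect of ordering can be sketched in plain Python. This is not seqmagick's implementation; the helper names min_length and cut are hypothetical, sequences are plain strings, and the sketch assumes that a cut of 1:5 keeps the first five positions (1-based, inclusive).

```python
def min_length(seqs, n):
    """Keep only sequences with at least n residues."""
    return (s for s in seqs if len(s) >= n)


def cut(seqs, start, end):
    """Keep positions start..end (assumed 1-based, inclusive) of each sequence."""
    return (s[start - 1:end] for s in seqs)


seqs = ["A" * 60, "C" * 40]

# Filter first, then cut: the 60-residue sequence passes and is trimmed.
filter_then_cut = list(cut(min_length(seqs, 50), 1, 5))

# Cut first, then filter: every sequence is now 5 long, so nothing passes.
cut_then_filter = list(min_length(cut(seqs, 1, 5), 50))

print(filter_then_cut)
print(cut_then_filter)
```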
Given a protein alignment and unaligned nucleotides, align the nucleotides using the protein alignment. Protein and nucleotide sequence files must contain the same number of sequences, in the same order, with the same IDs.
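The idea can be sketched for a single sequence pair as follows. This is a minimal illustration, not seqmagick's code: the function name back_translate_alignment is hypothetical, and it assumes the nucleotide sequence is exactly three times the length of the ungapped protein, with "-" as the gap character.

```python
def back_translate_alignment(aligned_protein, nucleotides, gap="-"):
    """Thread codons from nucleotides onto the gapped protein sequence.

    Each residue consumes the next codon; each protein gap becomes
    three nucleotide gap characters.
    """
    ungapped_length = len(aligned_protein.replace(gap, ""))
    if len(nucleotides) != 3 * ungapped_length:
        raise ValueError(
            "expected %d nucleotides, got %d"
            % (3 * ungapped_length, len(nucleotides))
        )

    # Split the nucleotide sequence into consecutive codons.
    codons = iter(
        nucleotides[i:i + 3] for i in range(0, len(nucleotides), 3)
    )
    pieces = []
    for residue in aligned_protein:
        pieces.append(gap * 3 if residue == gap else next(codons))
    return "".join(pieces)


print(back_translate_alignment("M-K", "ATGAAA"))
```

For the aligned protein M-K and nucleotides ATGAAA, the result is ATG---AAA.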
seqmagick extract-ids is extremely simple: all the IDs from a sequence file are printed to stdout (by default) or to a file of your choosing.
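For FASTA input, the behaviour amounts to something like the sketch below. It is an illustration only, not how seqmagick is implemented (which goes through Biopython and handles other formats); the function name extract_ids is hypothetical, and the ID is assumed to be the first whitespace-delimited token of each header line.

```python
import io


def extract_ids(handle):
    """Yield the ID from each FASTA header line in an open text handle."""
    for line in handle:
        if line.startswith(">"):
            fields = line[1:].split()
            # An empty header line yields an empty ID.
            yield fields[0] if fields else ""


example = io.StringIO(">seq1 first sequence\nACGT\n>seq2\nGGCC\n")
for seq_id in extract_ids(example):
    print(seq_id)
```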
seqmagick info describes one or more sequence files:
seqmagick info examples/*.fasta
name alignment min_len max_len avg_len num_seqs
examples/aligned.fasta TRUE 9797 9797 9797.00 15
examples/dewrapped.fasta TRUE 240 240 240.00 148
examples/range.fasta TRUE 119 119 119.00 2
examples/test.fasta FALSE 972 9719 1573.67 15
examples/wrapped.fasta FALSE 120 237 178.50 2
Output can be in comma-separated, tab-separated, or aligned formats. See seqmagick info -h for details.
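To make the columns concrete, here is a small sketch of how such a summary could be computed for a FASTA file. It is not seqmagick's implementation; the function names are hypothetical, and it assumes the alignment column simply reports whether all sequences share one length, which is consistent with the table above.

```python
import io


def read_fasta_lengths(handle):
    """Return the length of each sequence in a FASTA-formatted text handle."""
    lengths = []
    current = None  # length of the record being read, None before the first header
    for line in handle:
        line = line.strip()
        if line.startswith(">"):
            if current is not None:
                lengths.append(current)
            current = 0
        elif current is not None:
            current += len(line)
    if current is not None:
        lengths.append(current)
    return lengths


def summarize(lengths):
    """Compute the summary columns from a non-empty list of sequence lengths."""
    if not lengths:
        raise ValueError("no sequences found")
    return {
        "alignment": len(set(lengths)) == 1,  # assumed meaning: all lengths equal
        "min_len": min(lengths),
        "max_len": max(lengths),
        "avg_len": round(sum(lengths) / len(lengths), 2),
        "num_seqs": len(lengths),
    }


example = io.StringIO(">a\nACGT\nAC\n>b\nACG\n")
print(summarize(read_fasta_lengths(example)))
```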
quality-filter truncates and removes sequences that don’t match a set of quality criteria. The subcommand takes a FASTA file and a quality score file, and writes the results to an output file. See seqmagick quality-filter -h for the available options.
primer-trim trims an alignment to a region defined by a set of forward and reverse primers. See seqmagick primer-trim -h for usage.
By default, seqmagick infers the file type from extension. Currently mapped extensions are:
Extension | Format |
---|---|
.afa | fasta |
.aln | clustal |
.fa | fasta |
.faa | fasta |
.fas | fasta |
.fasta | fasta |
.fastq | fastq |
.ffn | fasta |
.fna | fasta |
.fq | fastq |
.frn | fasta |
.gb | genbank |
.gbk | genbank |
.needle | emboss |
.nex | nexus |
.phy | phylip |
.phylip | phylip |
.phyx | phylip-relaxed |
.qual | qual |
.sff | sff-trim |
.sth | stockholm |
.sto | stockholm |
Note
NEXUS-format output requires the --alphabet flag.
When reading from stdin or writing to stdout, seqmagick defaults to fasta format. This behavior may be overridden with the --input-format and --output-format flags.
If an extension is not listed, you can either rename the file to a supported extension, or specify it manually via --input-format or --output-format.
Most commands support gzip (files ending in .gz) and bzip2 (files ending in .bz2 or .bz) compressed inputs and outputs. The file type is inferred from the extension that remains after the compression extension is stripped, so input.fasta.gz is inferred to be in FASTA format.
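The inference rule described above can be sketched as follows. This is an illustration under stated assumptions rather than seqmagick's actual code: the names EXTENSION_TO_FORMAT, COMPRESSION_EXTENSIONS, and infer_format are hypothetical, the mapping is copied from the table above, and matching is assumed to be case-sensitive.

```python
import os

# Extension-to-format mapping, copied from the table above.
EXTENSION_TO_FORMAT = {
    ".afa": "fasta", ".aln": "clustal", ".fa": "fasta", ".faa": "fasta",
    ".fas": "fasta", ".fasta": "fasta", ".fastq": "fastq", ".ffn": "fasta",
    ".fna": "fasta", ".fq": "fastq", ".frn": "fasta", ".gb": "genbank",
    ".gbk": "genbank", ".needle": "emboss", ".nex": "nexus",
    ".phy": "phylip", ".phylip": "phylip", ".phyx": "phylip-relaxed",
    ".qual": "qual", ".sff": "sff-trim", ".sth": "stockholm",
    ".sto": "stockholm",
}

COMPRESSION_EXTENSIONS = {".gz", ".bz2", ".bz"}


def infer_format(path):
    """Infer a format name from a file name, ignoring a compression suffix."""
    base, ext = os.path.splitext(path)
    if ext in COMPRESSION_EXTENSIONS:
        # Strip the compression suffix and look at the extension beneath it.
        base, ext = os.path.splitext(base)
    try:
        return EXTENSION_TO_FORMAT[ext]
    except KeyError:
        raise ValueError("unknown extension %r; specify the format explicitly" % ext)


print(infer_format("input.fasta.gz"))
print(infer_format("a.sto"))
```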
seqmagick is written and maintained by the Matsen Group at the Fred Hutchinson Cancer Research Center.
We welcome contributions! Simply fork the repository on GitHub and send a pull request.