SUPERFAMILY EC2 AWS cloud image
This page describes how to use the publicly available SUPERFAMILY image on Amazon Elastic Compute Cloud (EC2) to run our pipeline for assigning protein domains.
The pipeline searches a FASTA-format protein sequence file against the SUPERFAMILY HMM models using cloud computing. The results are parsed and written to two files: a computer-readable tab-delimited file and a human-readable HTML file.
This page is divided into three main sections:
1: Creating an AWS account
To access our publicly available image, you first need to create an account at Amazon Web Services (AWS).
A tutorial on setting up your computer to use EC2 is available here. It is aimed at beginners; more advanced users may already have this set up. Extensive documentation is also available here.
Once you have created your account, you will be able to search for and mount instances with the SUPERFAMILY image.
2: Use scripts to produce domain assignments
Before you start using the SUPERFAMILY image, you need to register for a SUPERFAMILY license (free for academic and commercial use).
Registration gives you access to a file on our FTP server that contains our image ID number at AWS.
2.1 Finding and mounting the SUPERFAMILY image:
Once you have registered, you can retrieve our image ID from the file on our FTP server.
With this ID you can mount as many instances as you like:
ec2-run-instances image_id -z us-east-1a -k key --group default -t c1.xlarge
For large sequence sets (20,000+ sequences), use -t c1.xlarge, since our pipeline is optimized for this instance type.
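To judge whether a sequence set is large in the sense above, the sequences can be counted locally before choosing an instance type. The following is a minimal sketch, assuming each sequence has exactly one header line starting with ">"; the function names are illustrative and not part of the SUPERFAMILY image.

```shell
# Count the sequences in a FASTA file by counting header lines
# (lines that start with ">").
count_sequences() {
    grep -c '^>' "$1"
}

# Succeed when the file reaches the 20,000-sequence mark mentioned above.
is_large_set() {
    [ "$(count_sequences "$1")" -ge 20000 ]
}

# Example use (genome_file.fa is the file you intend to upload):
#   if is_large_set genome_file.fa; then echo "use -t c1.xlarge"; fi
```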
2.2 Uploading Genome/FASTA and running pipeline:
Once you mount the image you will be able to upload your FASTA sequence file to the cloud instance.
Finding the instance address:
You should see output similar to the following:
RESERVATION r-9234a2fa 999988887777 default
INSTANCE i-b2e051db ami-3c98a365 ec2-85-121-157-145.compute-1.amazonaws.com domU-12-31-33-07-25-62.compute-1.internal running gsg-keypair 0 c1.xlarge 2010-03-17T13:17:41+0000 us-east-1a aki-a73df9be ari-a51cf9bc monitoring-disabled 22.214.171.124 10.210.42.144 instance-store
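The instance address is the field ending in .amazonaws.com in the listing above. As a rough sketch, assuming the listing has been saved to a file (instances.txt is a hypothetical name), the address can be pulled out with awk rather than copied by hand:

```shell
# Print every whitespace-separated field that ends in ".amazonaws.com",
# i.e. the public address of each instance in the listing on standard input.
instance_addresses() {
    awk '{ for (i = 1; i <= NF; i++) if ($i ~ /\.amazonaws\.com$/) print $i }'
}

# Example use (instances.txt is a hypothetical file holding the listing):
#   instance_addresses < instances.txt
```

The function scans all fields instead of relying on a fixed column, so it still works if the listing is wrapped differently.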
Use the instance address to upload your file to the directory genomes:
scp -i key genome_file.fa email@example.com:/root/genomes
Connect to the cloud instance via SSH:
ssh -i key firstname.lastname@example.org
Start the pipeline script:
Optionally, you can use nohup so that the run is not lost if you are disconnected from the cloud instance.
STDOUT and error messages are then written to the file /root/nohup.out. You can use this file to track the progress of the pipeline and to identify errors (e.g. wrong file extension, bad formatting).
nohup superfamily.pl genome.fa &
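While the pipeline runs, /root/nohup.out can be followed with tail -f /root/nohup.out. The sketch below additionally lists log lines that look like errors. It assumes such lines contain the word "error"; the actual wording of the pipeline's messages is not documented here, so adjust the pattern as needed.

```shell
# List log lines (with line numbers) that mention "error", ignoring case.
# The search word is an assumption about how the pipeline reports problems.
log_errors() {
    grep -i -n 'error' "${1:-/root/nohup.out}"
}

# Example use on the instance:
#   log_errors                  # checks /root/nohup.out
#   log_errors /path/to/log     # checks another log file
```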
2.3 SUPERFAMILY pipeline:
The superfamily.pl script is a wrapper that calls several other programs.
The programs are listed here with a description and the output of each.
Script: fasta_checker.pl
Description: Verifies the integrity of the FASTA file and replaces any non-standard characters (e.g. stop codons).
Output: /root/genomes/genome_torun.fa

Script: hmmscan.pl
Description: A wrapper that calls hmmscan. This script is also responsible for multi-threading hmmscan. Each thread has a temporary output that is stored in /root/post_processing/tmp.
Output: /root/genomes/genomes.res (HMMER raw output)

Script: ass3.pl
Description: Parses the hmmscan output and creates the assignment file.
Output: /root/genome.ass

Script: ass_to_html.pl
Description: Converts the genome.ass file into an HTML table.
Output: /root/genome.html
3: Domain assignment output formats
The main output is a tab-delimited file of domain assignments (.ass), one domain per line.
A sequence can have more than one domain, and some sequences may have no domain assignment.
An HTML file (.html) is also produced. It contains the same output as an HTML table.
The columns for superfamily and family assignments:
SCOP family evalue
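Because the .ass file holds one domain per tab-delimited line, it can be summarized with standard tools. This is a minimal sketch: the function names are illustrative, and the position of the sequence identifier column is not stated on this page, so it must be supplied by the caller.

```shell
# Number of domain assignments: one per non-empty line.
count_domains() {
    awk 'NF > 0 { n++ } END { print n + 0 }' "$1"
}

# Number of distinct values in one tab-delimited column.
# $1 = file, $2 = column number (which column holds the sequence
# identifier is an assumption the caller has to check against the file).
count_distinct() {
    cut -f "$2" "$1" | sort -u | grep -c .
}

# Example use:
#   count_domains /root/genome.ass
#   count_distinct /root/genome.ass 1   # assumes column 1 is the sequence ID
```

Comparing the distinct-sequence count with the number of sequences in the input FASTA file gives the number of sequences without any assignment.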
In our tests we were able to analyze 450 000 sequences per day per instance.
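The throughput figure above allows a rough run-time estimate. The sketch below rounds up to whole days and assumes the 450,000 sequences per day rate holds for your data; the even split across several instances is also an assumption, since dividing the input between instances is left to the user.

```shell
# Estimate the whole days needed for a number of sequences.
# $1 = number of sequences, $2 = number of instances (default 1).
estimate_days() {
    sequences=$1
    instances=${2:-1}
    per_day=$((450000 * instances))
    # Integer ceiling division.
    echo $(( (sequences + per_day - 1) / per_day ))
}

# Example use:
#   estimate_days 1000000      # one instance
#   estimate_days 1000000 2    # input split evenly over two instances
```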
If you have further questions, suggestions or comments, please contact us using the feedback form or via email.