SUPERFAMILY 1.75 HMM library and genome assignments server

SUPERFAMILY EC2 AWS cloud image

This page describes how to use the publicly available SUPERFAMILY image on Amazon Elastic Compute Cloud (EC2) to run our pipeline for assigning protein domains.

Introduction

The process involves running a FASTA format protein sequence file against the SUPERFAMILY HMM models using cloud computing. The results are parsed and two files are produced: a computer-readable tab-delimited file and a human-readable HTML file.

This page is divided into three main sections:

1: Creating an AWS account

To access our publicly available image, you first need to create an account at Amazon Web Services (AWS).

A tutorial on setting up your computer to use EC2 is available here. It is aimed at beginners, since more advanced users may already have this set up. Extensive documentation is also available here.

Once you have created your account, you will be able to search for and launch instances with the SUPERFAMILY image.

2: Use scripts to produce domain assignments

Before you start using the SUPERFAMILY image, you will need to register for a SUPERFAMILY license (free for academic and commercial use).
Registration gives you access to a file on our FTP server that contains our image ID at AWS.

2.1 Finding and launching the SUPERFAMILY image:

The image ID is listed in the file on our FTP server that becomes available once you register.
With this ID you can launch as many instances as you like:

 ec2-run-instances image_id -z us-east-1a -k key --group default -t c1.xlarge

For large sequence sets (20,000+ sequences), use -t c1.xlarge, since our pipeline is optimized for this instance type.
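
To see whether a file falls into the large category mentioned above, you can count its header lines before launching an instance. The Python sketch below is only an illustration: the function names count_sequences and is_large_set are hypothetical, and it assumes every sequence record starts with a line beginning with ">".

```python
from typing import Iterable

# The "20,000+" figure quoted in the text above.
LARGE_SET_THRESHOLD = 20_000


def count_sequences(lines: Iterable[str]) -> int:
    """Count FASTA records by counting header lines (those starting with '>')."""
    return sum(1 for line in lines if line.startswith(">"))


def is_large_set(lines: Iterable[str]) -> bool:
    """Return True when the file has at least LARGE_SET_THRESHOLD sequences."""
    return count_sequences(lines) >= LARGE_SET_THRESHOLD


# Example use with a local file (the file name is a placeholder):
#     with open("genome.fa") as handle:
#         print(is_large_set(handle))
```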

2.2 Uploading Genome/FASTA and running pipeline:

Once the instance is running, you can upload your FASTA sequence file to it.

Finding the instance address:

ec2-describe-instances

You should see output similar to this:

RESERVATION r-9234a2fa 999988887777 default
INSTANCE i-b2e051db ami-3c98a365 ec2-85-121-157-145.compute-1.amazonaws.com domU-12-31-33-07-25-62.compute-1.internal running gsg-keypair 0 c1.xlarge 2010-03-17T13:17:41+0000 us-east-1a aki-a73df9be ari-a51cf9bc monitoring-disabled 85.121.157.145 10.210.42.144 instance-store
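
If you script these steps, the public address can be extracted from the ec2-describe-instances text instead of being copied by hand. The Python sketch below is illustrative only. The function name public_dns_names is hypothetical, and it assumes that each running instance is reported on a line starting with INSTANCE that contains the word "running" and a name ending in .amazonaws.com, as in the sample output above.

```python
from typing import List


def public_dns_names(describe_output: str) -> List[str]:
    """Return the public DNS names of running instances found in the text.

    Assumption: one INSTANCE record per line, with whitespace- or
    tab-separated fields as in the sample output.
    """
    names = []
    for line in describe_output.splitlines():
        fields = line.split()
        if not fields or fields[0] != "INSTANCE":
            continue  # skip RESERVATION and any other record types
        if "running" not in fields:
            continue  # pending or stopped instances have no usable address
        names.extend(f for f in fields if f.endswith(".amazonaws.com"))
    return names
```

The input can be a saved copy of the command output, for example the contents of a file written with ec2-describe-instances > instances.txt.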

Use the instance's public address (the name ending in amazonaws.com) to upload your file to the genomes directory:

scp -i key genome.fa root@ec2-85-121-157-145.compute-1.amazonaws.com:/root/genomes

Connect to the instance via SSH:

ssh -i key root@ec2-85-121-157-145.compute-1.amazonaws.com

Start the pipeline script:

superfamily.pl genome.fa

Optionally, you can use nohup so that the pipeline keeps running if your connection to the instance is lost.
STDOUT and error messages are written to the file /root/nohup.out. You can use this file to track the progress of the pipeline and to identify any errors (e.g. wrong file extension, bad formatting).

nohup superfamily.pl genome.fa &

2.3 SUPERFAMILY pipeline:

The superfamily.pl script is a wrapper that calls several other programs.
The table below lists each program, what it does, and the output it produces.

Script            Description                                      Output

fasta_checker.pl  Verifies the integrity of the FASTA file and     /root/genomes/genome_torun.fa
                  replaces any non-standard characters
                  (e.g. stop codons).

hmmscan.pl        Wrapper that calls hmmscan and is responsible    /root/genomes/genomes.res (raw HMMER output)
                  for multi-threading it. Each thread writes a     /root/post_processing/tmp (temporary per-thread output)
                  temporary output file.

ass3.pl           Parses the hmmscan output and creates the        /root/genome.ass
                  assignment file.

ass_to_html.pl    Converts the genome.ass file into an HTML        /root/genome.html
                  table.
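
Because formatting problems only surface in nohup.out after upload, it can be useful to run a rough check locally first. The Python sketch below is not the logic of fasta_checker.pl, which is not reproduced on this page. It is a hypothetical check that reports characters outside the 20 standard amino acid letters, and the choice of that alphabet and the function name nonstandard_residues are assumptions.

```python
from typing import Dict, Iterable

# Assumption: only the 20 standard amino acid letters count as "standard".
STANDARD_AA = set("ACDEFGHIKLMNPQRSTVWY")


def nonstandard_residues(lines: Iterable[str]) -> Dict[str, str]:
    """Map each sequence ID to the sorted non-standard characters it contains.

    Sequences with no problems are left out of the result.
    """
    problems: Dict[str, str] = {}
    current = None
    for raw in lines:
        line = raw.strip()
        if not line:
            continue
        if line.startswith(">"):
            # The ID is taken as the first word after '>'.
            words = line[1:].split()
            current = words[0] if words else ""
            continue
        if current is None:
            raise ValueError("sequence data found before the first '>' header")
        bad = set(line.upper()) - STANDARD_AA
        if bad:
            seen = set(problems.get(current, ""))
            problems[current] = "".join(sorted(seen | bad))
    return problems
```

An empty result means that no unexpected characters were found under these assumptions. It does not guarantee that the pipeline will accept the file.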

3: Domain assignment output formats

Output is a tab-delimited file of domains (.ass), one domain per line.
There can be more than one domain per sequence, and there may be sequences for which there is no domain assignment.

An HTML file (.html) is also produced. It presents the same assignments as an HTML table.

The columns for superfamily and family assignments:

Sequence ID
Match region
E-value Score
SCOP superfamily
Family E-value
SCOP family evalue
Closest structure
Alignment
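
Since the .ass file has one domain per line, a few lines of code are enough to summarize it. The Python sketch below counts domains per sequence. It is illustrative only: the function name domains_per_sequence is hypothetical, and it assumes that the Sequence ID is the first tab-separated column, following the order of the list above.

```python
from collections import Counter
from typing import Iterable


def domains_per_sequence(lines: Iterable[str]) -> Counter:
    """Count how many domain lines each sequence ID has.

    Assumption: the sequence ID is the first tab-separated field.
    """
    counts: Counter = Counter()
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        seq_id = fields[0].strip()
        if seq_id:  # ignore blank lines
            counts[seq_id] += 1
    return counts


# Example use with the pipeline output (path taken from the table in 2.3):
#     with open("/root/genome.ass") as handle:
#         counts = domains_per_sequence(handle)
#     multi_domain = [s for s, n in counts.items() if n > 1]
```

Sequences with no domain assignment do not appear in the file, so they are also absent from the counts. Comparing the keys with the IDs in the FASTA file would identify them.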

In our tests we were able to analyze 450 000 sequences per day per instance.
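
The figure above gives a rough way to estimate running time. The Python sketch below is a back-of-the-envelope calculation only. It assumes that throughput scales linearly with the number of sequences and that, when several instances are used, you split the input evenly between them yourself. The function name estimated_hours is hypothetical.

```python
# Throughput quoted above: sequences per day per instance.
SEQUENCES_PER_DAY_PER_INSTANCE = 450_000


def estimated_hours(n_sequences: int, n_instances: int = 1) -> float:
    """Estimate wall-clock hours, assuming linear scaling and an even split."""
    if n_instances < 1:
        raise ValueError("n_instances must be at least 1")
    per_day = SEQUENCES_PER_DAY_PER_INSTANCE * n_instances
    return 24.0 * n_sequences / per_day
```

For example, estimated_hours(20000) gives a little over one hour on a single instance. Actual times will vary with sequence length and instance type.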

If you have further questions, suggestions or comments, please contact us using the feedback form or by email at superfamily@cs.bris.ac.uk.