Tuesday, March 10, 2009

6) Process hit sequences in BLAST output files into FASTA formatted sequences

 
LAPIS is problematic to use with large input files because of its limited memory.
We end up having to break up the file, process it (often having to restart LAPIS every time the memory runs out. The following code is able to process or convert all the hit sequences in Blast output files into Fasta formatted sequences.

Follow the following steps:

1. copy the following code into a text editor


#!/bin/bash
mkdir output;
for num in `seq 4`;
do for pii in `seq -w 2 11`;
do `cat DENV$num"o"$pii.txt | sed 's/>/33ab>/g' | sed 's/Sbjct:/33abSbjct:/g' | grep '33ab' | sed 's/33ab//g' | sed 's/Sbjct://g' | sed 's/[ ][0-9]/33ab/g' | sed 's/33ab/ /g' | sed 's/[ ][0-9]/33ab/g' | sed 's/33ab/ /g' | sed 's/[ ][0-9]/33ab/g' | sed 's/33ab/ /g' | sed 's/[ ][0-9]/33ab/g' | sed 's/33ab/ /g' | sed 's/ //g' > output/DENV$num$pii.fasta`;
done;
done


2. Delete the line breaks. i.e. make it such that everything is in one line with the different commands separated by semicolons ";". Remember to put a space after the semicolon.

3. Save the above code in the text editor as extract.sh (or whatever you want)

4. Remember to run it in bash enviroment as "bash extract.sh" using Linux Konsole

Note:

i. My BLAST results were in the form DENV$num"o"$pii.txt e.g DENV1o02.txt, so, I've got 2 nested "for loops" to change 2 variables $num and $pii; DENV1o02, DENV2o02, etc and to change DENV1o02, DENV1o03. For convenience use files named abcd1, abcd2 etc so that the for loops can be used to run the program for a number of files

ii. I use sed to add in a dummy identifier "33ab" to ">" and "Sbjct". Then I can use grep to extract these lines. I use sed to remove 33ab, then I use sed to remove "Sbjct:", then subject sequence numbers and the spaces. Unfortunately all 1 digit to 4 digit numbers are removed (including from the sequence title line) except the gi and the accession numbers

iii. The processed files are created in a a folder called output. Make sure that there are no other folders of that name in your folder. Alternatively the output folder can be renamed from the code

Code by: Murali
Blog Post by: Murali
Post Edited by: Asif M. Khan

1 comment:

  1. 1. Why do you need to globally substitute > to 33ab>, and Sbjct to 33abSbjct before grep'ping it? why can't you simply egrep for "^>|^Sbjct" ie. extended grep, all in a single command. If you cannot egrep it, substituting it and grepping for it makes no difference.
    2. multiple sed commands can be sucked into one since sed process for faster processing by using this idiom: sed -e 's/blah/blah/' -e 's/bloh/bloh/' -e ....
    one after another.
    3. your second set of sed commands are totally baffling to me. You can be sure you are doing something in a circuitous manner.

    ReplyDelete