Friday, April 3, 2009

12) UNIX Shell Script - Converting blast2seq results into FASTA

This is a modification of the shell script written by Murali (see Entry 11) to convert a blast2seq result (blastp) to FASTA format by extracting out all sbjct sequences. This is very useful especially for creating my dengue database (see http://benlogbook.blogspot.com for more details), as I used this script as a polyprotein separator to separate the sequences for each protein subtype. However, double hits will still have to be removed from your blast2seq file.

In the case for my dengue database, I have all my sample sequences for each protein subtype (C,prM,E,NS1,NS2a,NS2b,NS3,NS4a,NS4b,NS5) in FASTA, and paste all in the query textbox. Next, I paste my polyprotein which contain the full protein sequence in the sbjct textbox. The e-value threshold is set at 0.1 to prevent double hits from appearing. After obtaining the blast2seq results in text file, this script is executed to obtain each protein sequence in each subtype. Note that the results file would have to be placed in a folder called "input" before running the script, which have to be placed anywhere except inside the "input" folder itself.


Process Code

#!/usr/bin/bash
mkdir output
ls input | sed 's/.txt//g' > index.txt
# lists all the files in the folder called input, removes the extension and places them in a file called index.txt
#extension of BLAST output is assumed to be .txt. If your BLAST output is not a text file, change the code in order to comply with your file extension.
printf "The following lines in the following BLAST output files had gaps" > logfile.txt
# creates the logfile
for num in `cat index.txt`
# generates a for-loop where your variable $num takes the values of an input file name in each iteration
do cat input/$num.txt | awk '/>/ {printf "\n\n"$0"\n"} /Sbjct/ {printf $3}' | sed 's/-//g' > output/$num.fasta
# takes the sequence identifiers and the subject sequences, removes the gaps from subject sequences and funnels it to an output file which is of the same name, but is in an output folder
printf "\n" >> logfile.txt
# generates a new line and updates the logfile
printf $num >> logfile.txt
# generates a new entry for that input file
printf "\n" >> logfile.txt
# generates a new line
cat input/$num.txt | grep -n Sbjct | sed 's/Sbjct//g' | awk '/-/ {print $1," " ,$3}' >> logfile.txt
# takes an input file, numbers the lines, prints the line number of the subject sequence with gaps and the subject sequence and updates the logfile
done
# end of loop
printf "\n"
printf "Gaps have been removed in output" >> logfile.txt



Code by: Murali and BenBen
Blog Post by: BenBen

No comments:

Post a Comment