Saturday, March 21, 2009

11) Script converting BLAST output to FASTA format

The following script reads every BLAST output file in an input folder, extracts the sequence identifiers and subject sequences, removes the gaps and writes the results to an output folder. It also generates a logfile that records which lines in which input file contain gaps.
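
As a rough illustration of the extraction step, here is a minimal sketch run on a made-up BLAST fragment. The function name extract_subject and the sample lines are assumptions for this example only; they are not part of the script below.

```shell
#!/bin/bash
# extract_subject is a hypothetical helper name used only for this illustration.
# It keeps identifier lines (those containing ">") and prints the third field
# of subject lines with the gap characters ("-") removed.
extract_subject() {
  awk '/>/ {print $0} /Sbjct/ {gsub(/-/, "", $3); print $3}'
}

# Made-up fragment in the classic "Sbjct:" layout, for illustration only.
printf '%s\n' \
  '>gi|0000|example made-up protein' \
  'Query: 1   MKTLAYIAKGGQR 13' \
  'Sbjct: 1   MKT-AYIAK--QR 11' | extract_subject
```

For this fragment the sketch prints the identifier line unchanged, followed by MKTAYIAKQR. The Query line is dropped because it matches neither pattern.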


#!/bin/bash
mkdir -p output
# creates the output folder (no error if it already exists)
ls input | sed 's/\.txt$//' > index.txt
# lists all the files in the folder called input, removes the .txt extension and places the names in a file called index.txt
# the extension of the BLAST output is assumed to be .txt. If your BLAST output has a different extension, change the code to match.
printf "The following lines in the following BLAST output files had gaps" > logfile.txt
# creates the logfile
for num in `cat index.txt`
# generates a for-loop where your variable $num takes the values of an input file name in each iteration
do awk '/>/ {print $0} /Sbjct/ {gsub(/-/, "", $3); print $3}' "input/$num.txt" > "output/$num.fasta"
# takes the sequence identifiers and the subject sequences, removes the gaps from the subject sequences only (hyphens in identifier lines are left alone) and writes the result to a file of the same name in the output folder. Matching Sbjct without the colon covers both "Sbjct:" and "Sbjct" line prefixes
printf "\n" >> logfile.txt
# generates a new line and updates the logfile
printf '%s' "$num" >> logfile.txt
# generates a new entry for that input file
printf "\n" >> logfile.txt
# generates a new line
grep -n Sbjct "input/$num.txt" | sed 's/Sbjct:\{0,1\}//' | awk '/-/ {print $1," " ,$3}' >> logfile.txt
# numbers the lines of the input file, keeps the subject lines, strips the Sbjct label (with or without a colon) and appends the line number and sequence of every subject line that contains gaps to the logfile
done
# end of loop
printf "\n" >> logfile.txt
printf "Gaps have been removed in output\n" >> logfile.txt

To execute, copy the code into a text editor and save it as a .sh file. Create a directory called input and place all your BLAST output files in it, and nothing else, because the script treats every file in that folder as BLAST output. Run the script from the parent folder of input.
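
Since the script processes everything it finds in the input folder, a quick check before running can catch stray files. The following is only a sketch: check_input is a hypothetical helper and blast2fasta.sh is a placeholder for whatever name you saved the script under.

```shell
#!/bin/bash
# check_input is a hypothetical helper: it reports anything in the input
# folder that is not a .txt file and returns non-zero if it finds any.
check_input() {
  local dir="${1:-input}"
  local bad=0
  local f
  [ -d "$dir" ] || { echo "missing folder: $dir" >&2; return 1; }
  for f in "$dir"/*; do
    [ -e "$f" ] || continue          # empty folder: the glob stays literal
    case $f in
      *.txt) ;;                      # expected BLAST output
      *) echo "unexpected file: $f" >&2; bad=1 ;;
    esac
  done
  return "$bad"
}

# Typical use from the parent folder (blast2fasta.sh is a placeholder name):
# check_input input && bash blast2fasta.sh
```

If the check passes, the converted sequences end up in output and the gap report in logfile.txt.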

1 comment:

  1. Thank you for this script. It worked great once I removed the colons from the regular expression "Sbjct:".

    I guess NCBI has changed the BLAST output.
