Steps
1) extract {from line starting with > to line starting with sbjct} # (this is to get rid of double hits)
2) extract {from line containing "/15" to line containing sbjct}
3) extract {line starting with sbjct}
4) omit sbjct:
5) omit number
6) omit spaces
Output example corresponding the above codes:
a) Output of code:
extract {from line starting with > to line starting with sbjct}
>gi|190193559|dbj|BAG48486.1| gag protein [Human immunodeficiency virus 1]
Length = 154
Score = 50.7 bits (112), Expect = 5e-09
Identities = 15/15 (100%), Positives = 15/15 (100%)
Query: 1 CRQILGQLQPSLQTG 15
CRQILGQLQPSLQTG
Sbjct:57 CRQILGQLQPSLQTG 71
b) Output of code:
extract {from line containing "/15" to line containing sbjct}
Identities = 15/15 (100%), Positives = 15/15 (100%)
Query: 1 CRQILGQLQPSLQTG 15
CRQILGQLQPSLQTG
Sbjct:57 CRQILGQLQPSLQTG 71
c) Output of code:
extract {line starting with sbjct}
Sbjct: 57 CRQILGQLQPSLQTG 71
d) Output of codes:
i) omit sbjct: ii) omit number iii) omit spaces
CRQILGQLQPSLQTG
Code by: Benben & Asif M. Khan
Blog Post by: BenBen
Post Edited by: Asif M. Khan
1. Parsing BLAST OUTPUT is a favourite initiation to a perl coder. Everyone has solved this problem. The art probably comes from trying to do it in the shortest possible time with the least amount of coding effort.
ReplyDelete2. awk '/^>/,/^Sbjct/' filename
will generate all lines between those beginning with > and inclusive of that beginning with Sbjct. It will not be greedy, so only up to the first time Sbjct appears after the > will be outputted. This means all second or third blast alignments for the corresponding hit, will not be outputed.
The awk idiom is:
awk '/match1/,/match2/'
and everything between line containing match1 and line containing match2, inclusive, will be outputted.
3. Then you just need to grep for the lines containing Sbjct, and then print the sequence.
awk '/^>/,/^Sbjct/' filename | grep "Sbjct" | awk -F" " '{print $2}'
will give you the sequence.
If the sequence contains hyphens, it is ok, but if it contains a space or more, the second awk command will fail.
sed 's/Sbjct:.*[0-9][0-9]*[ ]\([A-Z ][A-Z ]*\)[0-9][0-9]*/\1/'
should work, preserving the spaces.
To remove the spaces, simply tr -d ' '
or
awk '/^>/,/^Sbjct/' filename | grep "Sbjct" |
sed -e 's/Sbjct:.*[0-9][0-9]*[ ]\([A-Z ][A-Z ]*\)[0-9][0-9]*/\1/' -e 's/ //g'