Wednesday, March 4, 2009

5) LAPIS: Extracting all sbjct sequences of a particular length from BLAST results

This lapis code is to extract all sbjct sequences with the length of 15 amino acids and at the same time getting rid of double blast hits. This code only works provided that each hit only has one line of sbjct sequence.

Steps

1) extract {from line starting with > to line starting with sbjct} # (this is to get rid of double hits)
2) extract {from line containing "/15" to line containing sbjct}
3) extract {line starting with sbjct}
4) omit sbjct:
5) omit number
6) omit spaces


Output example corresponding the above codes:


a) Output of code: extract {from line starting with > to line starting with sbjct}


>gi|190193559|dbj|BAG48486.1| gag protein [Human immunodeficiency virus 1]
Length = 154

Score = 50.7 bits (112), Expect = 5e-09
Identities = 15/15 (100%), Positives = 15/15 (100%)

Query: 1 CRQILGQLQPSLQTG 15
         CRQILGQLQPSLQTG
Sbjct:57 CRQILGQLQPSLQTG 71


b) Output of code: extract {from line containing "/15" to line containing sbjct}


Identities = 15/15 (100%), Positives = 15/15 (100%)

Query: 1 CRQILGQLQPSLQTG 15
CRQILGQLQPSLQTG
Sbjct:57 CRQILGQLQPSLQTG 71


c) Output of code: extract {line starting with sbjct}


Sbjct: 57 CRQILGQLQPSLQTG 71


d) Output of codes: i) omit sbjct: ii) omit number iii) omit spaces


CRQILGQLQPSLQTG


Code by: Benben & Asif M. Khan
Blog Post by: BenBen
Post Edited by: Asif M. Khan

1 comment:

  1. 1. Parsing BLAST OUTPUT is a favourite initiation to a perl coder. Everyone has solved this problem. The art probably comes from trying to do it in the shortest possible time with the least amount of coding effort.

    2. awk '/^>/,/^Sbjct/' filename
    will generate all lines between those beginning with > and inclusive of that beginning with Sbjct. It will not be greedy, so only up to the first time Sbjct appears after the > will be outputted. This means all second or third blast alignments for the corresponding hit, will not be outputed.
    The awk idiom is:
    awk '/match1/,/match2/'
    and everything between line containing match1 and line containing match2, inclusive, will be outputted.

    3. Then you just need to grep for the lines containing Sbjct, and then print the sequence.

    awk '/^>/,/^Sbjct/' filename | grep "Sbjct" | awk -F" " '{print $2}'

    will give you the sequence.

    If the sequence contains hyphens, it is ok, but if it contains a space or more, the second awk command will fail.

    sed 's/Sbjct:.*[0-9][0-9]*[ ]\([A-Z ][A-Z ]*\)[0-9][0-9]*/\1/'

    should work, preserving the spaces.

    To remove the spaces, simply tr -d ' '
    or

    awk '/^>/,/^Sbjct/' filename | grep "Sbjct" |
    sed -e 's/Sbjct:.*[0-9][0-9]*[ ]\([A-Z ][A-Z ]*\)[0-9][0-9]*/\1/' -e 's/ //g'

    ReplyDelete