Bioinformatics Codelets: 21) LAPIS: Extracting all hit sequences from BLAST results (for hits with one or more lines of sbjct sequence)

The LAPIS code below extracts all hit sequences from BLAST results, even when a hit has more than one line of sbjct sequence. This is in contrast to post (4), which only works when each BLAST hit has a single line of sbjct sequence. The code in this post also shows how to clean up the description line so that it contains only the GI number.

Steps


1. Select the lines containing the FASTA description together with the lines containing the subject sequences: “line containing > or line containing sbjct” --> Tools --> Extract

2. To get rid of the numbers in the line containing sbjct: “digits in line containing sbjct” --> Tools --> Omit

3. To get rid of sbjct: “sbjct:” --> Tools --> Omit

4. To get rid of dashes: type “-” --> Tools --> Omit

5. To get rid of the extra spaces in the lines containing sequences: “spaces not in line containing >” --> Tools --> Omit

6. In case you want to clean up the description line so that it contains only the GI number:
a. “from second | in line containing > to start of linebreak” --> Tools --> Omit
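
For readers who prefer a script, the steps above can be approximated in Python. This is only a rough sketch and not part of LAPIS: the function name extract_hits is made up for illustration, and it assumes that description lines start with > and that subject lines start with "Sbjct", as in standard BLAST text output. Like the LAPIS steps, it leaves each sbjct line on its own line rather than joining them.

```python
import re

def extract_hits(blast_text):
    """Return cleaned lines: '>gi|NUMBER' headers and bare sbjct sequences.

    Hypothetical helper that mirrors steps 1-6; not a LAPIS function.
    """
    cleaned = []
    for line in blast_text.splitlines():
        stripped = line.strip()
        if stripped.startswith(">"):
            # Step 6: keep only the text before the second '|'
            cleaned.append("|".join(stripped.split("|")[:2]))
        elif stripped.lower().startswith("sbjct"):
            # Step 3: drop the 'Sbjct:' label
            seq = re.sub(r"^sbjct:?", "", stripped, flags=re.IGNORECASE)
            # Steps 2, 4 and 5: drop digits, dashes and spaces
            seq = re.sub(r"[\d\s-]", "", seq)
            cleaned.append(seq)
        # Step 1: every other line (Length, Score, Query, ...) is skipped
    return cleaned

# Usage with the hit shown in the example of this post
example = """>gi|126385999|gb|CP000521.1| Acinetobacter baumannii ATCC 17978, complete genome
Length = 3976747

Query: 1 LNFKFNFISLMNIKALLLITSAIFISACSPYIVTANPNHSASKSDEKAEKIKNLFNEAHT 60
Sbjct: 1766322 LNFKFNFISLMNIKALLLITSAIFISACSPYIVTANPNHSASKSDEKAEKIKNLFNEAHT 1766143
"""
print("\n".join(extract_hits(example)))
```

If one sequence per hit is wanted instead, the sbjct lines that follow each description line could be concatenated before printing.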

Example output corresponding to the steps above:

Before

>gi|126385999|gb|CP000521.1| Acinetobacter baumannii ATCC 17978, complete genome
Length = 3976747

Score = 570 bits (1470), Expect = e-163
Identities = 284/284 (100%), Positives = 284/284 (100%)
Frame = -2

Query: 1 LNFKFNFISLMNIKALLLITSAIFISACSPYIVTANPNHSASKSDEKAEKIKNLFNEAHT 60
LNFKFNFISLMNIKALLLITSAIFISACSPYIVTANPNHSASKSDEKAEKIKNLFNEAHT
Sbjct: 1766322 LNFKFNFISLMNIKALLLITSAIFISACSPYIVTANPNHSASKSDEKAEKIKNLFNEAHT 1766143

After

>gi|126385999
LNFKFNFISLMNIKALLLITSAIFISACSPYIVTANPNHSASKSDEKAEKIKNLFNEAHT

Sunday, May 31, 2009
