Steps
1) extract {from line starting with > to line starting with sbjct} (this is to get rid of double hits)
2) extract {line either starting > or starting sbjct} ("or" must be after "either")
3) omit {number in line starting with sbjct}
4) omit {spaces in line starting with sbjct}
5) omit sbjct:
Output examples corresponding to the above code:
a) Output of code:
extract {from line starting with > to line starting with sbjct}
>gi|190193559|dbj|BAG48486.1| gag protein [Human immunodeficiency virus 1]
Length = 154
Score = 50.7 bits (112), Expect = 5e-09
Identities = 15/15 (100%), Positives = 15/15 (100%)
Query: 1 CRQILGQLQPSLQTG 15
CRQILGQLQPSLQTG
Sbjct: 57 CRQILGQLQPSLQTG 71
Everything from a line starting with ">" through the following line starting with "sbjct" is extracted.
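For readers without the extraction tool used here, step 1 can be approximated in Python. This is only a sketch, not the authors' code: it assumes the BLAST report is already available as a list of lines, that "sbjct" should match regardless of case, and the function name extract_header_to_sbjct is made up for illustration.

```python
def extract_header_to_sbjct(lines):
    """Keep each block from a '>' line through the first 'Sbjct' line after it."""
    kept = []
    inside = False  # True while we are between a '>' line and its Sbjct line
    for line in lines:
        if line.startswith(">"):
            inside = True
        if inside:
            kept.append(line)
            # Assumption: match 'Sbjct' case-insensitively, as the post writes 'sbjct'
            if line.lower().startswith("sbjct"):
                inside = False  # later alignments for the same hit are skipped
    return kept


report = [
    ">gi|190193559|dbj|BAG48486.1| gag protein [Human immunodeficiency virus 1]",
    "Length = 154",
    "Query: 1 CRQILGQLQPSLQTG 15",
    "Sbjct: 57 CRQILGQLQPSLQTG 71",
    "Query: 3 QILGQ 7",    # second alignment for the same hit (made-up data)
    "Sbjct: 90 QILGQ 94",
]
first_blocks = extract_header_to_sbjct(report)
```

Because the block closes at the first "Sbjct" line, any further alignment reported under the same ">" header is dropped, which is how this sketch handles the double hits mentioned in step 1.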
b) Output of code:
extract {line either starting > or starting sbjct}
>gi|190193559|dbj|BAG48486.1| gag protein [Human immunodeficiency virus 1]
Sbjct: 57 CRQILGQLQPSLQTG 71
Only lines starting with ">" or "sbjct" are kept.
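Step 2 is a plain line filter. A rough Python equivalent follows, again assuming a list of lines and case-insensitive matching of "sbjct"; keep_headers_and_sbjct is a hypothetical name, not something from the original code.

```python
def keep_headers_and_sbjct(lines):
    """Keep only lines that start with '>' or with 'Sbjct' (any letter case)."""
    return [
        line
        for line in lines
        if line.startswith(">") or line.lower().startswith("sbjct")
    ]


block = [
    ">gi|190193559|dbj|BAG48486.1| gag protein [Human immunodeficiency virus 1]",
    "Length = 154",
    "Score = 50.7 bits (112), Expect = 5e-09",
    "Query: 1 CRQILGQLQPSLQTG 15",
    "Sbjct: 57 CRQILGQLQPSLQTG 71",
]
filtered = keep_headers_and_sbjct(block)
```

The Length, Score and Query lines do not match either prefix, so only the header and the subject line remain.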
c) Output of code:
omit {number in line starting with sbjct}
>gi|190193559|dbj|BAG48486.1| gag protein [Human immunodeficiency virus 1]
Sbjct: CRQILGQLQPSLQTG
Numbers in lines starting with "sbjct" are removed.
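In Python, step 3 could be sketched with a regular expression that deletes digit runs, applied only to the "sbjct" lines so that the identifiers in the ">" header stay intact. The function name strip_sbjct_numbers is an assumption made for this example.

```python
import re


def strip_sbjct_numbers(lines):
    """Delete every run of digits, but only in lines starting with 'Sbjct'."""
    cleaned = []
    for line in lines:
        if line.lower().startswith("sbjct"):
            line = re.sub(r"\d+", "", line)  # removes the start/end coordinates
        cleaned.append(line)
    return cleaned


pair = [
    ">gi|190193559|dbj|BAG48486.1| gag protein [Human immunodeficiency virus 1]",
    "Sbjct: 57 CRQILGQLQPSLQTG 71",
]
no_numbers = strip_sbjct_numbers(pair)
```

This sketch leaves the spaces that surrounded the numbers in place; they are dealt with in the next step.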
d) Output of code:
omit {spaces in line starting with sbjct}
>gi|190193559|dbj|BAG48486.1| gag protein [Human immunodeficiency virus 1]
Sbjct:CRQILGQLQPSLQTG
Spaces in lines starting with "sbjct" are removed.
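Step 4 can be mirrored in Python with a simple string replacement restricted to the "sbjct" lines. As before, this is an illustrative sketch and strip_sbjct_spaces is a hypothetical name.

```python
def strip_sbjct_spaces(lines):
    """Remove all space characters, but only in lines starting with 'Sbjct'."""
    return [
        line.replace(" ", "") if line.lower().startswith("sbjct") else line
        for line in lines
    ]


pair = [
    ">gi|190193559|dbj|BAG48486.1| gag protein [Human immunodeficiency virus 1]",
    "Sbjct: CRQILGQLQPSLQTG",
]
no_spaces = strip_sbjct_spaces(pair)
```

The header keeps its spaces because it starts with ">" rather than "Sbjct".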
e) Output of code:
omit sbjct:
>gi|190193559|dbj|BAG48486.1| gag protein [Human immunodeficiency virus 1]
CRQILGQLQPSLQTG
The word "sbjct:" is removed, thus obtaining the FASTA sequences in the end.
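The last step, dropping the "sbjct:" label, might look like this in Python. The sketch assumes the label sits at the very start of the line and matches it case-insensitively; strip_sbjct_label is a name invented for the example.

```python
import re


def strip_sbjct_label(lines):
    """Remove a leading 'Sbjct:' label (any letter case) from each line."""
    return [re.sub(r"^sbjct:", "", line, flags=re.IGNORECASE) for line in lines]


pair = [
    ">gi|190193559|dbj|BAG48486.1| gag protein [Human immunodeficiency virus 1]",
    "Sbjct:CRQILGQLQPSLQTG",
]
fasta_lines = strip_sbjct_label(pair)
```

What remains is a ">" header line followed by the bare sequence, which is the two-line FASTA record layout.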
Code by: BenBen & Asif M. Khan
Blog Post by: BenBen
Post Edited by: BenBen
Thank you Prof Tan for your comments. I will try them.