Friday, March 20, 2099
Monday, October 26, 2009
24) Unix one-liner to split one-line sequences into FASTA formatted files
$ pr -n:3 -t -T inputfile.txt | sed 's/^/>/' | tr ":" "\n"
pr is print command.
-n:3 is to number the line with 3 digits or more depending on how many lines you have
and : is to separate the line number from the sequence
-t is to switch off header
-T is to switch off pagination
sed 's/^/>/' is to insert a > at the beginning of each line,
and
tr is to change the separator ":" with a new line character "\n"
And if you want to wrap the sequence into a width of 60 characters,
$ pr -n:3 -t -T input.txt | sed 's/^/>/' | tr ":" "\n" | fold -w 60
where the command
fold -w 60 means to wrap with a width of 60 characters.
By adjusting the -n:3 to 4 you can change the number of spaces for the digits
By adjusting -w 60, you can change the width of your sequences
By substituting s/^/>/ with s/^[ ]*/>/ you can remove spaces in front,
or you can add prefixes in front of the number e.g
s/^[ ]*/>NP/
And if you want to split each line into a separate FASTA formatted file with filename corresponding to the number
$ pr -n:3 -t -T input.txt | sed 's/^[ ]*/>NP/' | tr ":" "\n" | \
fold -w 60 | csplit -f NP - "/>/" "{*}"
where the command csplit is told to name the split files starting with a
prefix "-f NP"
and to take the input file as from the standard input "-"
and to split the files according to a matching regular pattern starting with "/>/"
and to do this repetitively until the end of the file "{*}"
Say for 107 lines of sequences, you will get 108 files listed NP00 to NP107 where the contents of NP01 are
>NP01
tataatatcagtatatctat
or whatever the sequence is folded to 60 characters per line.
NP00 is blank file. Ignore.
So, just one line of code. Five unix commands: pr, sed, tr, fold, csplit will do the job really fast
because it is pipelined. Therefore for large inputfiles, the first files will start to pop out as soon as they are finished, so you can actually read them almost immediately after you start the program.
Tuesday, October 20, 2009
23) Simple perl script to generate a header line for each sequence in a fasta file
Output: An output file, xxx_processed.fasta will be produced
Note: There are 2 conditions for the script to work
1. Sequences must be formatted in a way such that each line contains 1 sequence only
2. Sequences must be saved in *.fasta extension (of course the script can be modified to
take any other extensions)
Code (without comments):
#!/usr/bin/perl
print "\n NOTE: Output will be filename_processed.fasta";
print "\nPlease enter filename (without .fasta extension): ";
$input = <>;
chomp ($input);
$number=1;
open (INFILE, "$input.fasta") or die "Cannot open infile!";
open (OUT, ">"."$input"."_processed.fasta") or die "Cannot open outfile!";
while ($line=
{
print OUT ">$number\n";
print OUT $line;
$number++;
}
Code (with comments):
#!/usr/bin/perl
# Text to print at command line (\n means print a new line)
print "\n NOTE: Output will be filename_processed.fasta";
print "\nPlease enter filename (without .fasta extension): ";
# Processing file name at command line, file name will be pasesd to the variable $input -> $input = file name
$input = <>;
chomp ($input);
# Setting the value of variable, $number, to 1
$number=1;
# Opening the input file, generate error message "Cannot open infile!" if there's an error
# INFILE is the reference (filehandle) for your input file
open (INFILE, "$input.fasta") or die "Cannot open infile!";
# Creating the output file, the symbol > is to create and overwrite a new file
open (OUT, ">"."$input"."_processed.fasta") or die "Cannot open outfile!";
# Passing each line in the input file (INFILE) to the variable $line
while ($line=
{
# Print, in the output file (OUT), fasta header starting with the symbol >, followed by the variable $number, and a new line, can customize the header to print if you want to
print OUT ">$number\n";
# Print out current line from input file
print OUT $line;
# Change the value of $number so that $number=$number+1
$number++;
}
Tuesday, September 22, 2009
22) Perl version of AVANA: BENANA!!!!!
Past AVANA users will know that in the presence of any real gap/s in e.g. a nonamer window, AVANA will shift its sequence to cover the real gap/s. Furthermore, AVANA excludes any nonamers with padded gaps in its computations. However, BENANA has solved both issues...in that there is no shifting of amino acids to cover any gaps and it also includes padded gaps in its computations.
Things to take note:
1) After running BENANA, it will prompt you to enter your input file. Please note that BENANA only accepts .taln format as input files.
2) Next, define the window size that you are interested in. If you try to enter any thing besides numbers, the program will be terminated. If you did not enter any thing, the program will treat it as "9" by default.
3) Please ignore _output1.csv and _output2.csv as they are used for processing.
4) It is important to note that full gaps are not considered as a variant and is excluded from the variant list.
5) Output files: _intraserotype_position.csv, _intraserotype_variant_list_sorted.csv, BENANA_intraserotype_diversity_results.csv
What BENANA does:
For example, if your taln input file contains:
AAAACCCCDDDD
AAAANNNNMMMM
CCCCNNNNMMMM
It first generates two files output1.csv and output2.csv, which is used for processing (you may ignore them).
It then generates a peptide list at each position in the window that the user has defined (default: nonamer window). In the above example, 4 files are automatically generated:
1-9_intraserotype_position.csv:
1,AAAACCCCD,9
1,AAAANNNNM,9
1,CCCCNNNNM,9
2-10_intraserotype_position.csv:
2,AAACCCCDD,10
2,AAANNNNMM,10
2,CCCNNNNMM,10
3-11_intraserotype_position.csv:
3,AACCCCDDD,11
3,AANNNNMMM,11
3,CCNNNNMMM,11
4-12_intraserotype_position.csv:
4,ACCCCDDDD,12
4,ANNNNMMMM,12
4,CNNNNMMMM,12
From each peptide list at each position, BENANA also generates a variant list just like AVANA, with the predominant peptide displayed at the top and the count of each peptide in the alignment. In the case where there is no predominant peptide, the first peptide in the alignment will be displayed at the top:
1-9_intraserotype_variant_list_sorted.csv:
AAAACCCCD,1
AAAANNNNM,1
CCCCNNNNM,1
2-10_intraserotype_variant_list_sorted.csv:
AAACCCCDD,1
AAANNNNMM,1
CCCNNNNMM,1
3-11_intraserotype_variant_list_sorted.csv:
AACCCCDDD,1
AANNNNMMM,1
CCNNNNMMM,1
4-12_intraserotype_variant_list_sorted.csv:
ACCCCDDDD,1
ANNNNMMMM,1
CNNNNMMMM,1
Lastly, BENANA will generate the diversity file just like AVANA did. However, it does not compute entropy values at each position.
Diversity File:
StartPos,Predominant Peptide,EndPos,% Representation,No. of Variants,Total no. of Sequences with Valid Variants,Support %
1,CCCCNNNNM,9,33.33%,2,3,100.00%
2,AAACCCCDD,10,33.33%,2,3,100.00%
3,CCNNNNMMM,11,33.33%,2,3,100.00%
4,ANNNNMMMM,12,33.33%,2,3,100.00%
BENANA_open_source_v1.0.pl is downloadable from this link.
© 2009 ^BeNBeN^. All rights reserved.
Sunday, May 31, 2009
21) LAPIS: Extracting all hit sequences from BLAST results (for hits with one or more lines of sbjct sequence)
Steps
1. Select the line containing the fasta description together with the line containing the subject sequences: “line containing > or line containing sbjct” --> Tools --> Extract
2. To get rid of the numbers in the line containing sbjct: “digits in line containing sbjct” --> Tools --> Omit
3. To get rid of sbjct: “sbjct:” --> Extract --> Omit
4. To get rid of dashes: type “-“ --> Extract --> Omit
5. To get rid of the extra spaces in the lines containing sequences: “spaces not in line containing >” --> Tools --> Omit
6. In case you want to clean up the description line to only have the GI
a. From second | in line containing > to start of linebreak
Output example corresponding the above codes:
Before
>gi|126385999|gb|CP000521.1| Acinetobacter baumannii ATCC 17978, complete genome
Length = 3976747
Score = 570 bits (1470), Expect = e-163
Identities = 284/284 (100%), Positives = 284/284 (100%)
Frame = -2
Query: 1 LNFKFNFISLMNIKALLLITSAIFISACSPYIVTANPNHSASKSDEKAEKIKNLFNEAHT 60
LNFKFNFISLMNIKALLLITSAIFISACSPYIVTANPNHSASKSDEKAEKIKNLFNEAHT
Sbjct: 1766322 LNFKFNFISLMNIKALLLITSAIFISACSPYIVTANPNHSASKSDEKAEKIKNLFNEAHT 1766143
After
>gi|126385999
LNFKFNFISLMNIKALLLITSAIFISACSPYIVTANPNHSASKSDEKAEKIKNLFNEAHT
Tuesday, April 21, 2009
20) Plotting Entropy Plot of MSA
19) Calculating Entropy of MSA by variable windowsize
#!/bin/bashL=`head -1 $2 | wc -c | awk '{print $1}' | sed 's/.*/&-1/' | bc`x=`expr $L - $1 + 1`n=`cat $2 |wc -l `w=$1for i in `seq $x`
doj=`expr $i + $w - 1`
cat $2 | cut -c$i-$j| sort | uniq -cecho -n "Position=$i "cat $2 | cut -c$i-$j| sort | uniq -c | awk -v n="$n" '{ h=( -1 / n * log(2)) * ( log($1) - log(n) ); H=h+H } END {print "Entropy=" H }'echo -------done
#!/bin/bash# Usage: prog # this program computes the Entropy of window size w for a# multiple sequence alignment with gaps marked with "-" only.# Date: 22 April 2009# ToDo:# does not trap missing $1# does not handle blank or missing data or missing line# works for windows = 1 onwards. w cannot be = or less than 0.# need to check the equation for any errors# assumes that the data input is well formated according to msa output# does not trap data input errors and may still generate errorneous output# does not cater for description tag of the clustal MSA output# only handles one catenated sequence per line
input=$2# Assumes the first line is the longest# L = number of column positions to be computedL=`head -1 $2 | wc -c | awk '{print $1}' | sed 's/.*/&-1/' | bc`
# x - number of windowsx=`expr $L - $1 + 1`echo Window Size: $1 sizeecho No of Windows: $x windows# number of sequences (=rows, since only one cateneted sequence per line allowed)n=`cat $input |wc -l `echo No of Lines: $n linesw=$1
# i = current residue position beginfor i in `seq $x`do
# i = current residue position endj=`expr $i + $w - 1`
# print aligned sequences within window and count themcat $input | cut -c$i-$j| sort | uniq -c
# Report position numberecho -n "Position=$i "# Compute Entropy in bits of these sequencescat $input | cut -c$i-$j| sort | uniq -c | awk -v n="$n" '{ h=( -1 / n * log(2)) * ( log($1) - log(n) ); H=h+H } END {print "Entropy=" H }'echo -------done
Sunday, April 19, 2009
18) Doing maths in Unix
17) Basic shell command to delimit protein/DNA sequence
16) Basic shell command to count number of fields in a file
Tuesday, April 7, 2009
15) Perl script to randomly shuffle sequences in a fasta file
14) Perl script for separating a fasta file into individual sequence fasta files
13) Perl script for splitting a fasta file into several small files
Friday, April 3, 2009
12) UNIX Shell Script - Converting blast2seq results into FASTA
In the case for my dengue database, I have all my sample sequences for each protein subtype (C,prM,E,NS1,NS2a,NS2b,NS3,NS4a,NS4b,NS5) in FASTA, and paste all in the query textbox. Next, I paste my polyprotein which contain the full protein sequence in the sbjct textbox. The e-value threshold is set at 0.1 to prevent double hits from appearing. After obtaining the blast2seq results in text file, this script is executed to obtain each protein sequence in each subtype. Note that the results file would have to be placed in a folder called "input" before running the script, which have to be placed anywhere except inside the "input" folder itself.
Process Code
#!/usr/bin/bash
mkdir output
ls input | sed 's/.txt//g' > index.txt
# lists all the files in the folder called input, removes the extension and places them in a file called index.txt
#extension of BLAST output is assumed to be .txt. If your BLAST output is not a text file, change the code in order to comply with your file extension.
printf "The following lines in the following BLAST output files had gaps" > logfile.txt
# creates the logfile
for num in `cat index.txt`
# generates a for-loop where your variable $num takes the values of an input file name in each iteration
do cat input/$num.txt | awk '/>/ {printf "\n\n"$0"\n"} /Sbjct/ {printf $3}' | sed 's/-//g' > output/$num.fasta
# takes the sequence identifiers and the subject sequences, removes the gaps from subject sequences and funnels it to an output file which is of the same name, but is in an output folder
printf "\n" >> logfile.txt
# generates a new line and updates the logfile
printf $num >> logfile.txt
# generates a new entry for that input file
printf "\n" >> logfile.txt
# generates a new line
cat input/$num.txt | grep -n Sbjct | sed 's/Sbjct//g' | awk '/-/ {print $1," " ,$3}' >> logfile.txt
# takes an input file, numbers the lines, prints the line number of the subject sequence with gaps and the subject sequence and updates the logfile
done
# end of loop
printf "\n"
printf "Gaps have been removed in output" >> logfile.txt
Code by: Murali and BenBen
Blog Post by: BenBen
Saturday, March 21, 2009
11) Script converting BLAST output to FASTA format
The following script takes all your BLAST output from an input folder, processes the files, removes gaps and generates the ouput files in an output folder. It will also generate a logfile which indicates which lines in which input file contain gaps.
#!/bin/bash
mkdir output
ls input | sed 's/.txt//g' > index.txt
# lists all the files in the folder called input, removes the extension and places them in a file called index.txt
#extension of BLAST output is assumed to be .txt. If your BLAST output is not a text file, change the code in order to comply with your file extension.
printf "The following lines in the following BLAST output files had gaps" > logfile.txt
# creates the logfile
for num in `cat index.txt`
# generates a for-loop where your variable $num takes the values of an input file name in each iteration
do cat input/$num.txt | awk '/>/ {print $0} /Sbjct:/ {print $3}' | sed 's/-//g' > output/$num.fasta
# takes the sequence identifiers and the subject sequences, removes the gaps from subject sequences and funnels it to an output file which is of the same name, but is in an output folder
printf "\n" >> logfile.txt
# generates a new line and updates the logfile
printf $num >> logfile.txt
# generates a new entry for that input file
printf "\n" >> logfile.txt
# generates a new line
cat input/$num.txt | grep -n Sbjct | sed 's/Sbjct://g' | awk '/-/ {print $1," " ,$3}' >> logfile.txt
# takes an input file, numbers the lines, prints the line number of the subject sequence with gaps and the subject sequence and updates the logfile
done
# end of loop
printf "\n"
printf "Gaps have been removed in output" >> logfile.txt
To execute, copy the code to a text editor and save it as a .sh file. Create a directory called input and place all your BLAST output into the folder, nothing else. Run the script in the parent folder.
10) Script converting ACII files from DOS to UNIX format
Shell scripts written in a notepad, notepad2, notepad++ etc encounter errors when executed in shell. For users who use unix environments on their Windows system, this would be a problem as scripting is often done using text editors from windows. This occurs in scripts with multiple lines of code. In DOS formatting, a new line includes two characters, a line feed and a carriage return. Unix formats use only the line feed character for a new line. In order to make your file executable, run the following command in shell:
>tr -d '\15' < infile > OUTFILE
The above code removes all carriage returns (ASCII code \15) from each line. Make sure that your outfile and infile have different names, and remember to inlude your extensions. Your outfile can now be executed in unix shell once you have made it executable.
Wednesday, March 11, 2009
9) UNIX Shell Script - automating the benbo script for more than one input file
Process Code
for i in xx*
do
./benbo.sh 1 $i > $i\_out.txt
done
Usage Notes:
1) To run the script assuming the code is saved in a file called benbo2.sh: ./benbo2.sh.
2) This script will run benbo.sh for more than one input file, however all the input files would have to include "xx" in their filenames.
3)The user may switch to the perl version of benbo by substituting ./benbo.sh to ./benbo.pl in the above code.
4) This script assumes that the index protein/nucleotide sequence is on the first line, the user may change "1" to another number in the above code if their index sequence is not located on the first line of their input file.
5) The output files will be generated on the same folder and named as "input file name"_out.txt
Code by: BenBen
Blog Post by: BenBen
8) Perl Script for converting matched amino acid residues to dots
The following is a code to process an alignment input to an output whereby the amino acid residues or the nucleotides of all protein or DNA sequences variant to an index protein or DNA sequence are represented as dots, leaving the variable residues or nucleotides intact. This is similar to the previous method, however, this is a perl script which is suitable for matching longer index protein sequences.
Input
ASIFTANTINWEE
ASAFTANTINWEE
ASEETENTANWAA
Process Code
#!/usr/bin/perl
# Usage benbo.pl
$L=$ARGV[0]-1;
# Open file for reading
open (FILE, "$ARGV[1]") or die "cannot open file: $!";
# Store everything in the file into array @raw_data
@raw_data=;
# print the Target Line
print $raw_data[$L];
# trim off newline from the Target Line
chomp($raw_data[$L]);
@linearray = split("", $raw_data[$L] );
# test each line of the input file, as read into the rawdata array
foreach $line (@raw_data)
{
chomp($line);
# take line in question and split into an array of single chr
@linetest = split("", $line );
# initialise the character count
$nn=0;
# analyse each character in the current Test line
foreach $linechr (@linetest) {
# check if Test character matches Target character
if ($linechr =~ m/$linearray[$nn]/) {
# if yes, print dot
print ".";
} else {
# if no, print the non-matching chr in the Test sequence
print "$linechr";
} # if
# increment the character count of the Target line
$nn=$nn+1;
} #foreach linechr
# once each line is completed, print new line, and standby to reset $nn
print "\n";
# run through every line of the sequence input file
} # foreach
close(FILE);
Usage Notes:
1) To run the script assuming the code is saved in a file called benbo.pl: ./benbo.pl 1 xx1.txt, where "1" is the line number of the index sequence and "xx1.txt" is the input file. All the characters of the index sequence are shown in the output.
Output
ASIFTANTINWEE
..A..........
..EE.E..A..AA
Code by: Dr. Tan Tin Wee
Blog Post by: BenBen
7) UNIX Shell Script for converting matched amino acid residues to dots
The following is a code to process an alignment input to an output whereby the amino acid residues or the nucleotides of all protein or DNA sequences variant to an index protein or DNA sequence are represented as dots, leaving the variable residues or nucleotides intact.
Input
ASIFTANTINWEE
ASAFTANTINWEE
ASEETENTANWAA
Process Code
#!/bin/bash
# grab the sequence in linenumber $1 in sequence file $2
seq=`head -n $1 $2 |tail -1`
# take sequence, break it down to each residue and iterate
(
for i in `echo $seq | fold -1`
do
# Generate the sed script
echo "s_^\($dot\)${i}_\1._"
# Set up the dot for the last sed command
dot=".$dot"
done
# after finishing generating the sed script for each column
# construct the substitution for the sequence at linenumber $1
echo "$1s+$dot+$seq+" | sed -e 's/\./\\./g'
) > /tmp/y
# temp file created
# use the dynamically created temp file as program for sed to
# apply on the sequence file $2
sed -f /tmp/y $2
# remove temp file created
rm /tmp/y
exit
# Example of the temporary file /tmp/y template created dynamically
# slashdot=`echo $dot | sed -e 's/\./\\\\./g'`
n=11
n=`expr $n + 1`
s/^\(\)V/./
s/^\(.\)D/\1./
s_^\(..\)R_\1._
s_^\(...\)F_\1._
s_^\(....\)Y_\1._
s_^\(.....\)K_\1._
s_^\(......\)T_\1._
s_^\(.......\)L_\1._
s_^\(........\)R_\1._
s_^\(.........\)A_\1._
s_^\(..........\)E_\1._
s_^\(...........\)Q_\1._
s_^\(............\)A_\1._
s_^\(.............\)S_\1._
s_^\(..............\)Q_\1._
1s+\.\.\.\.\.\.\.\.\.\.\.\.\.\.\.+VDRFYKTLRAEQASQ+
Usage Notes:
1) To run the script assuming the code is saved in a file called benbo.sh: ./benbo.sh 1 xx1.txt, where "1" is the line number of the index sequence and "xx1.txt" is the input file. All the characters of the index sequence are shown in the output.
Output
ASIFTANTINWEE
..A..........
..EE.E..A..AA
Code by: Dr. Tan Tin Wee
Blog Post by: BenBen
Post Edited by: Asif M. Khan
Tuesday, March 10, 2009
6) Process hit sequences in BLAST output files into FASTA formatted sequences
LAPIS is problematic to use with large input files because of its limited memory.
We end up having to break up the file, process it (often having to restart LAPIS every time the memory runs out. The following code is able to process or convert all the hit sequences in Blast output files into Fasta formatted sequences.
Follow the following steps:
1. copy the following code into a text editor
#!/bin/bash
mkdir output;
for num in `seq 4`;
do for pii in `seq -w 2 11`;
do `cat DENV$num"o"$pii.txt | sed 's/>/33ab>/g' | sed 's/Sbjct:/33abSbjct:/g' | grep '33ab' | sed 's/33ab//g' | sed 's/Sbjct://g' | sed 's/[ ][0-9]/33ab/g' | sed 's/33ab/ /g' | sed 's/[ ][0-9]/33ab/g' | sed 's/33ab/ /g' | sed 's/[ ][0-9]/33ab/g' | sed 's/33ab/ /g' | sed 's/[ ][0-9]/33ab/g' | sed 's/33ab/ /g' | sed 's/ //g' > output/DENV$num$pii.fasta`;
done;
done
2. Delete the line breaks. i.e. make it such that everything is in one line with the different commands separated by semicolons ";". Remember to put a space after the semicolon.
3. Save the above code in the text editor as extract.sh (or whatever you want)
4. Remember to run it in bash enviroment as "bash extract.sh" using Linux Konsole
Note:
i. My BLAST results were in the form DENV$num"o"$pii.txt e.g DENV1o02.txt, so, I've got 2 nested "for loops" to change 2 variables $num and $pii; DENV1o02, DENV2o02, etc and to change DENV1o02, DENV1o03. For convenience use files named abcd1, abcd2 etc so that the for loops can be used to run the program for a number of files
ii. I use sed to add in a dummy identifier "33ab" to ">" and "Sbjct". Then I can use grep to extract these lines. I use sed to remove 33ab, then I use sed to remove "Sbjct:", then subject sequence numbers and the spaces. Unfortunately all 1 digit to 4 digit numbers are removed (including from the sequence title line) except the gi and the accession numbers
iii. The processed files are created in a a folder called output. Make sure that there are no other folders of that name in your folder. Alternatively the output folder can be renamed from the code
Code by: Murali
Blog Post by: Murali
Post Edited by: Asif M. Khan
Wednesday, March 4, 2009
5) LAPIS: Extracting all sbjct sequences of a particular length from BLAST results
Steps
1) extract {from line starting with > to line starting with sbjct} # (this is to get rid of double hits)
2) extract {from line containing "/15" to line containing sbjct}
3) extract {line starting with sbjct}
4) omit sbjct:
5) omit number
6) omit spaces
Output example corresponding the above codes:
a) Output of code:
extract {from line starting with > to line starting with sbjct}
>gi|190193559|dbj|BAG48486.1| gag protein [Human immunodeficiency virus 1]
Length = 154
Score = 50.7 bits (112), Expect = 5e-09
Identities = 15/15 (100%), Positives = 15/15 (100%)
Query: 1 CRQILGQLQPSLQTG 15
CRQILGQLQPSLQTG
Sbjct:57 CRQILGQLQPSLQTG 71
b) Output of code:
extract {from line containing "/15" to line containing sbjct}
Identities = 15/15 (100%), Positives = 15/15 (100%)
Query: 1 CRQILGQLQPSLQTG 15
CRQILGQLQPSLQTG
Sbjct:57 CRQILGQLQPSLQTG 71
c) Output of code:
extract {line starting with sbjct}
Sbjct: 57 CRQILGQLQPSLQTG 71
d) Output of codes:
i) omit sbjct: ii) omit number iii) omit spaces
CRQILGQLQPSLQTG
Code by: Benben & Asif M. Khan
Blog Post by: BenBen
Post Edited by: Asif M. Khan
4) LAPIS: Extracting all hit sequences from BLAST results (for hits with only one line of sbjct sequence)
Steps
1) extract {from line starting with > to line starting with sbjct} (this is to get rid of double hits)
2) extract {line either starting > or starting sbjct} ("or" must be after "either")
3) omit {number in line starting with sbjct}
4) omit {spaces in line starting with sbjct}
5) omit sbjct:
Output example corresponding the above codes:
a) Output of code:
extract {from line starting with > to line starting with sbjct}
>gi|190193559|dbj|BAG48486.1| gag protein [Human immunodeficiency virus 1]
Length = 154
Score = 50.7 bits (112), Expect = 5e-09
Identities = 15/15 (100%), Positives = 15/15 (100%)
Query: 1 CRQILGQLQPSLQTG 15
CRQILGQLQPSLQTG
Sbjct:57 CRQILGQLQPSLQTG 71
Lines starting with ">" to lines starting with "sbjct" are extracted.
b) Output of code:
extract {line either starting > or starting sbjct}
>gi|190193559|dbj|BAG48486.1| gag protein [Human immunodeficiency virus 1]
Sbjct: 57 CRQILGQLQPSLQTG 71
Lines starting with ">" and "sbjct" are extracted.
c) Output of code:
omit {number in line starting with sbjct}
>gi|190193559|dbj|BAG48486.1| gag protein [Human immunodeficiency virus 1]
Sbjct: CRQILGQLQPSLQTG
Numbers in lines starting with "sbjct" are removed.
d) Output of code:
omit {spaces in line starting with sbjct}
>gi|190193559|dbj|BAG48486.1| gag protein [Human immunodeficiency virus 1]
Sbjct:CRQILGQLQPSLQTG
Spaces in lines starting with "sbjct" are removed.
e) Output of code:
omit {spaces in line starting with sbjct}
>gi|190193559|dbj|BAG48486.1| gag protein [Human immunodeficiency virus 1]
CRQILGQLQPSLQTG
The word "sbjct:" is removed, thus obtaining the FASTA sequences in the end.
Code by: Benben & Asif M. Khan
Blog Post by: BenBen
Post Edited by: BenBen
3) UNIX Shell Script for HIV Standalone Blast
HIVstandaloneblast.sh
Process Code
for i in xx*
do
blastall -p blastp -d HIV_db1.fasta -i $i -F F -e 200000 -I T -v 0 -b 20000 -f 11 -W 2 -T F -C F -M PAM30 -o $i.out
done
Code by: BenBen & Asif M. Khan
Blog Post by: BenBen
Post Edited by: BenBen
2) UNIX Shell Script for WNV Netblast
WNVNetblast.sh
Process Code
for i in xx*
do
\./blastcl3 -p blastp -d nr -i $i -F F -e 200000 -I T -v 0 -b 20000 -f 11 -W 2 -T F -u "Root[ORGN] NOT txid11082[Organism:exp] NOT txid81077[ORGN]" -C F -M PAM30 -o $i.out
done
Code by: BenBen & Asif M. Khan
Blog Post by: BenBen
Post Edited by: BenBen
1) UNIX Shell Script for Mass Alignments using Muscle
Muscle.sh
Process Code
for i in xx*
do
muscle -in $i -out $i"_align".fasta
done
Usage Notes:
1) All your fasta files that are required for alignments must have xx1,xx2,etc included in the file name.
2) Your output files would have "_align" included in the file name.
Code by: BenBen
Blog Post by: BenBen
Post Edited by: BenBen
0) Understanding Unix Environment
Unix File System
The Unix File System is a hierarchical file system.
/
is root of the system /var
is a directory in root /var/www
is the full path of the sub-directory in "var" /var/www/htdocs
is the full path of the sub-sub-directory in "/var/www" and so on...Some examples
1. To know in which directory you are currently in:
pwd
# It shows the path of the present working directory.
2. Navigate:
cd
# If you want to get into a certain directory, for example:
cd /var/www/htdocs/db
3. To list the content of a directory:
ls
# To get long directory listing:
ls -l
4. Getting the source code:
lynx -source
# Example
lynx -source http://aps.unmc.edu/AP/database/query_output.php?ID=00286
# lynx is a text-based browser. -source tells lynx NOT to display the HTML code, but to return the raw HTML code.
5. Unix command history
# Just press the ''Arrow up'' key to see your previous command
# Whatever you do, double check
6. Get the internet page to your folder
:Example: to put the internet page to the dir where you are at now, key in
lynx -source internet_address.html > your_pagename.html
7. Change mode
: ''chmod'' takes several arguments
: -x is to make executable
: ''chmod -x programfile'' will make program executable rather than just a plain text file.
Questions
1. Why put /var/www/htdocs/db/myprogram in front of myprogram?
: This is because whenever you execute a program, Unix shell needs to find the path of that program.
: In this case, we spell out the full directory path so that Unix shell knows where that program is.
2. Why then can we use the ''lynx'' command and don't have to put the full path?
: This is because lynx command is already in the Shell's list of known paths.
: Later on, we will show you what is the current known paths, e.g. echo $PATH
Command Line Arguments
1. Command Line command without arguments
: For example ''pwd'' doesn't need to take arguments
2. Command line command with one argument
: For example ''ls -l'' takes one argument, the flag ''-l'' to give a ''l''ong listing.
3. Command line command with two arguments
: For example ''lynx -source http://www.google.com''
: lynx the command
: -source the first argument
: http://www.google.com as the second argument
Loop Control
How do we make the numbers autoincrement without us to put in the numbers ourselves?
1. Make 6 into a program to create multiple files
:(1)pseudo code for ''my_program''
for (i=0;i<100;i++)> your_pagename[i].html}
:(2)Change the file to executable file
:(3)Key in [path]/my_program
:(4)chmod +x my_program (This is to make the file executable.)
Learning Unix Commands
'''Environment Control'''
# refers to comments
1) cd d # Change to directory d
2) mkdir d #Create new directory d
3) rmdir d #Remove directory d
4) rm -r existingdirectories # Will delete the existing directory named 'existingdirectories' and all directories and files below it.
5) rm -rf existingdirectories # Will delete the existing directory named 'existingdirectories' and all directories and files below it. Even if it runs into an exception that it would usually prompt for user interaction.
the r is recursive and the f is force.
Be careful with the "rm -r" command. Don't run it as root. And stay away from the root directory if you plan on running it.
6) mv f1 [f2...] d #Move file f to directory d
7) mv d1 d2 #Rename directory d1 as d2
8) passwd #Change password
9) alias name1 name2 #Create command alias (csh/tcsh)
10) alias name1="name2" #Create command alias (ksh/bash)
11) unalias name1[na2...] #Remove command alias na
12) ssh nd #Login securely to remote node
13) exit #End terminal session
14) setenv name v #Set env var to value v (csh/tcsh)
export name="v" #set environment variable to value v (ksh/bash)
Miscellaneous commands
1)
ls# is to list the directory
2)
cp# copy
3)
wc# word count
4)
ls -al | grep root |awk '{print "file: $9, S8" }'# to extract the 8th and the 9th column; details of all the files/directories called root is ls
5)
wc -l# count number of lines
6)
># is to funnel to a new file
7)
ls -al | grep root |awk '{print "file: $9, S8" }' > x.file# transfer the output to a file name, in this case called x.file
8)
pico x.file# a text editor that helps you read the x file; to get out from pico, type ctrl x; pico is very sophisticated
9)
vi x.file# a text editor that helps you read the x file
10)
:q# helps exit vi, which is very obscure.
11)
:q!# helps exit vi, which is very obscure.
12)
pwd# print the current working directory
13)
cd# is to change the directory
14)
tab#use it to automatically fill in the blanks; auto completion
15)
key up and down# help you look at commands typed earlier
16)
echo "ls -al /root/Desktop/ |awk '{print $5 "b", S9 }' " >tinwee# creates an executable file
17)
./tinwee# executing tinwee, but will not work because you have no permission
18)
chmod 755 tinwee# give full access to run tinwee
19)
pico tinwee# to make the program more generic by modifying the code
20)
ls -al $1 |awk '{print $5 "bytes", $9 }#change the code to more generic instead of hardwiring it
21)
tinwee /# takes / as $1, which in this case is root; note that this $1 is different from the #awk $1
22)
ln tinwee t# creating a shortname or short cut (soft link) for tinwee (alias); make sure it is unique alias
23)
echo $PATH | sed 's/:/\n/g'# this is to show the path of all the directories and files line by #line
24)
echo $PATH | sed 's/:/\n/g' |more# show little by little
25)
echo $PATH | sed 's/:/\n/g' |less# show less
26)
echo $PATH | sed 's/:/\n/g' |cat# to show all
27)
echo $PATH | sed 's/:/\n/g' |tac# to show in reverse order
28)
echo $PATH | sed 's/:/\n/g' |less |rev# show less but in reverse order
29)
q# to quit while looking at the results of these
30)
ctrl + C#to quit while looking at these
31) Create a translation tool, simple one
echo "ATGCTTA"
" |rev |tac |tr "atcg" "tagc"
# rev reverses the sequence, tac show it in reverse order and tr does the translation
32)
alias#does not link a file but substitutes for the name
33)
>>#will append to the bottom of a file that already exists
34)
>#append to a new file
35) Creata database of your own
lynx http://aps.unmc.edu/AP/database/query_output.php?ID=00286 > AP00286.html#saves the web page to AP00286.html but it is not saving the source code
lynx -source http://aps.unmc.edu/AP/database/query_output.php?ID=00286 > AP00286.html
#saves the web page to AP00286.html and this time it is saving the source code
echo "lynx -source http://aps.unmc.edu/AP/database/query_output.php?ID=00286 > AP00286.html" >getapd
pico getapd# read the getapd file
change
lynx -source http://aps.unmc.edu/AP/database/query_output.php?ID=00286 > AP00286.htmlto
lynx -source http://aps.unmc.edu/AP/database/query_output.php?ID=0$1 > AP0$1.html
# this is making the code generic
getapd AP00287# type on the command prompt to download the record 00287 without typing the whole command
pico getapd# open the getapd file to make the command more generic
for i in 'Seq -w 1137`
do
lynx -source http://aps.unmc.edu/AP/database/query_output.php?ID=0$i > AP0$i.html
sleep 3
done
./getapd#run getapd for automatic download
36.
" "# double quotes are used whe you want to evaluate the variable. For example "$PATH" will evaluate the dollar sign and give you relevant matches
37.
' '# are used when one wants to print the special characters without evaluating them. For example, '$PATH' will return $PATH