Bioinformatics Codelets

Friday, March 20, 2099

Monday, October 26, 2009

24) Unix one-liner to split one-line sequences into FASTA formatted files

In Unix, one line of code is enough to carry out the process of transforming a text file containing x sequences, one sequence per line, into a fasta formatted file of x sequences with >n where n is the number of the sequence line.

$ pr -n:3 -t -T inputfile.txt | sed 's/^/>/' | tr ":" "\n"

pr is print command.
-n:3 is to number the line with 3 digits or more depending on how many lines you have
and : is to separate the line number from the sequence
-t is to switch off header
-T is to switch off pagination

sed 's/^/>/' is to insert a > at the beginning of each line,
and
tr is to change the separator ":" with a new line character "\n"

And if you want to wrap the sequence into a width of 60 characters,

$ pr -n:3 -t -T input.txt | sed 's/^/>/' | tr ":" "\n" | fold -w 60

where the command
fold -w 60 means to wrap with a width of 60 characters.

By adjusting the -n:3 to 4 you can change the number of spaces for the digits
By adjusting -w 60, you can change the width of your sequences
By substituting s/^/>/ with s/^[ ]*/>/ you can remove spaces in front,
or you can add prefixes in front of the number e.g

s/^[ ]*/>NP/

And if you want to split each line into a separate FASTA formatted file with filename corresponding to the number

$ pr -n:3 -t -T input.txt | sed 's/^[ ]*/>NP/' | tr ":" "\n" | \
fold -w 60 | csplit -f NP - "/>/" "{*}"

where the command csplit is told to name the split files starting with a
prefix "-f NP"
and to take the input file as from the standard input "-"
and to split the files according to a matching regular pattern starting with "/>/"
and to do this repetitively until the end of the file "{*}"

Say for 107 lines of sequences, you will get 108 files listed NP00 to NP107 where the contents of NP01 are

>NP01
tataatatcagtatatctat

or whatever the sequence is folded to 60 characters per line.
NP00 is blank file. Ignore.

So, just one line of code. Five unix commands: pr, sed, tr, fold, csplit will do the job really fast
because it is pipelined. Therefore for large inputfiles, the first files will start to pop out as soon as they are finished, so you can actually read them almost immediately after you start the program.

Tuesday, October 20, 2009

23) Simple perl script to generate a header line for each sequence in a fasta file

A simple script to generate a header line (starting with >) for sequences without headers in a faster file. I have included plenty of comments in the script, for people who are interested to learn Perl basics.

Output: An output file, xxx_processed.fasta will be produced

Note: There are 2 conditions for the script to work

1. Sequences must be formatted in a way such that each line contains 1 sequence only
2. Sequences must be saved in *.fasta extension (of course the script can be modified to
take any other extensions)

Code (without comments):

#!/usr/bin/perl

print "\n NOTE: Output will be filename_processed.fasta";
print "\nPlease enter filename (without .fasta extension): ";

$input = <>;
chomp ($input);

$number=1;

open (INFILE, "$input.fasta") or die "Cannot open infile!";
open (OUT, ">"."$input"."_processed.fasta") or die "Cannot open outfile!";

while ($line=)
{

print OUT ">$number\n";
print OUT $line;

$number++;

}

Code (with comments):

#!/usr/bin/perl

# Text to print at command line (\n means print a new line)
print "\n NOTE: Output will be filename_processed.fasta";
print "\nPlease enter filename (without .fasta extension): ";

# Processing file name at command line, file name will be pasesd to the variable $input -> $input = file name
$input = <>;
chomp ($input);

# Setting the value of variable, $number, to 1
$number=1;

# Opening the input file, generate error message "Cannot open infile!" if there's an error
# INFILE is the reference (filehandle) for your input file
open (INFILE, "$input.fasta") or die "Cannot open infile!";

# Creating the output file, the symbol > is to create and overwrite a new file
open (OUT, ">"."$input"."_processed.fasta") or die "Cannot open outfile!";

# Passing each line in the input file (INFILE) to the variable $line
while ($line=)
{

# Print, in the output file (OUT), fasta header starting with the symbol >, followed by the variable $number, and a new line, can customize the header to print if you want to
print OUT ">$number\n";

# Print out current line from input file
print OUT $line;

# Change the value of $number so that $number=$number+1
$number++;
}

Tuesday, September 22, 2009

22) Perl version of AVANA: BENANA!!!!!

The life changing program is here.....Ben's version of AVANA!!!

Past AVANA users will know that in the presence of any real gap/s in e.g. a nonamer window, AVANA will shift its sequence to cover the real gap/s. Furthermore, AVANA excludes any nonamers with padded gaps in its computations. However, BENANA has solved both issues...in that there is no shifting of amino acids to cover any gaps and it also includes padded gaps in its computations.

Things to take note:

1) After running BENANA, it will prompt you to enter your input file. Please note that BENANA only accepts .taln format as input files.

2) Next, define the window size that you are interested in. If you try to enter any thing besides numbers, the program will be terminated. If you did not enter any thing, the program will treat it as "9" by default.

3) Please ignore _output1.csv and _output2.csv as they are used for processing.

4) It is important to note that full gaps are not considered as a variant and is excluded from the variant list.

5) Output files: _intraserotype_position.csv, _intraserotype_variant_list_sorted.csv, BENANA_intraserotype_diversity_results.csv

What BENANA does:

For example, if your taln input file contains:

AAAACCCCDDDD
AAAANNNNMMMM
CCCCNNNNMMMM

It first generates two files output1.csv and output2.csv, which is used for processing (you may ignore them).

It then generates a peptide list at each position in the window that the user has defined (default: nonamer window). In the above example, 4 files are automatically generated:

1-9_intraserotype_position.csv:

1,AAAACCCCD,9
1,AAAANNNNM,9
1,CCCCNNNNM,9

2-10_intraserotype_position.csv:

2,AAACCCCDD,10
2,AAANNNNMM,10
2,CCCNNNNMM,10

3-11_intraserotype_position.csv:

3,AACCCCDDD,11
3,AANNNNMMM,11
3,CCNNNNMMM,11

4-12_intraserotype_position.csv:

4,ACCCCDDDD,12
4,ANNNNMMMM,12
4,CNNNNMMMM,12

From each peptide list at each position, BENANA also generates a variant list just like AVANA, with the predominant peptide displayed at the top and the count of each peptide in the alignment. In the case where there is no predominant peptide, the first peptide in the alignment will be displayed at the top:

1-9_intraserotype_variant_list_sorted.csv:

AAAACCCCD,1
AAAANNNNM,1
CCCCNNNNM,1

2-10_intraserotype_variant_list_sorted.csv:

AAACCCCDD,1
AAANNNNMM,1
CCCNNNNMM,1

3-11_intraserotype_variant_list_sorted.csv:

AACCCCDDD,1
AANNNNMMM,1
CCNNNNMMM,1

4-12_intraserotype_variant_list_sorted.csv:

ACCCCDDDD,1
ANNNNMMMM,1
CNNNNMMMM,1

Lastly, BENANA will generate the diversity file just like AVANA did. However, it does not compute entropy values at each position.

Diversity File:

StartPos,Predominant Peptide,EndPos,% Representation,No. of Variants,Total no. of Sequences with Valid Variants,Support %
1,CCCCNNNNM,9,33.33%,2,3,100.00%
2,AAACCCCDD,10,33.33%,2,3,100.00%
3,CCNNNNMMM,11,33.33%,2,3,100.00%
4,ANNNNMMMM,12,33.33%,2,3,100.00%

BENANA_open_source_v1.0.pl is downloadable from this link.

© 2009 ^BeNBeN^. All rights reserved.

Sunday, May 31, 2009

21) LAPIS: Extracting all hit sequences from BLAST results (for hits with one or more lines of sbjct sequence)

The LAPIS code below allows extraction of all hit sequences from BLAST results, even when there are more than one line of sbjct sequences. This is in contrast to post (4) which only works if there is only one line of sbjct sequence for each BLAST hit. The code in this post also teaches how to clean up the description line to only contain the GI number.

Steps


1. Select the line containing the fasta description together with the line containing the subject sequences: “line containing > or line containing sbjct” --> Tools --> Extract

2. To get rid of the numbers in the line containing sbjct: “digits in line containing sbjct” --> Tools --> Omit

3. To get rid of sbjct: “sbjct:” --> Extract --> Omit

4. To get rid of dashes: type “-“ --> Extract --> Omit

5. To get rid of the extra spaces in the lines containing sequences: “spaces not in line containing >” --> Tools --> Omit

6. In case you want to clean up the description line to only have the GI
a. From second | in line containing > to start of linebreak

Output example corresponding the above codes:

Before

>gi|126385999|gb|CP000521.1| Acinetobacter baumannii ATCC 17978, complete genome
Length = 3976747

Score = 570 bits (1470), Expect = e-163
Identities = 284/284 (100%), Positives = 284/284 (100%)
Frame = -2

Query: 1 LNFKFNFISLMNIKALLLITSAIFISACSPYIVTANPNHSASKSDEKAEKIKNLFNEAHT 60
LNFKFNFISLMNIKALLLITSAIFISACSPYIVTANPNHSASKSDEKAEKIKNLFNEAHT
Sbjct: 1766322 LNFKFNFISLMNIKALLLITSAIFISACSPYIVTANPNHSASKSDEKAEKIKNLFNEAHT 1766143

After

>gi|126385999
LNFKFNFISLMNIKALLLITSAIFISACSPYIVTANPNHSASKSDEKAEKIKNLFNEAHT

Tuesday, April 21, 2009

20) Plotting Entropy Plot of MSA

Say Windowsize 3

$ ent2 3 inputfile | grep Position | sed 's/.*=$.*$ .*=$.*$/\1 \2 /'

$ ent2 3 inputfile | grep Position | sed -e 's/Position=//' -e 's/Entropy=//'

$ ent 2 3 inputfile | grep Position | awk -F"=| " '{print $2, $4}'

(this awk script demonstrates how we use the field separator command -F

to define = or " " blank space as a delimiter, and as such

$1 is Position

delimiter =

$2 is the position number

$3 is Entropy

$4 is the entropy value)

1 0.954885

2 0.460826

3 0.228055

4 0

5 0

6 0

7 0

$ ent2 3 inputfile | grep Position | sed -e 's/Position=//' -e 's/Entropy=//' > ent.dat

$ echo "

set term jpeg

set output 'x.jpg'

plot 'ent.dat' with line " | gnuplot

$ konqueror x.jpg

will display the x.jpg file.

19) Calculating Entropy of MSA by variable windowsize

The short version

Save file as "entropy2"

#!/bin/bash
L=`head -1 $2 |  wc -c | awk '{print $1}' | sed 's/.*/&-1/' | bc`
x=`expr $L - $1 + 1`
n=`cat $2 |wc -l `
w=$1
for i in `seq $x`
do
j=`expr $i + $w - 1`
cat $2 | cut -c$i-$j| sort | uniq -c
echo -n "Position=$i "
cat $2 | cut -c$i-$j| sort | uniq -c | awk  -v n="$n" '{ h=( -1 /  n * log(2))  * ( log($1) - log(n) ); H=h+H } END {print "Entropy=" H }'
echo -------
done

Example input file: input.seq

ASIFKHANT

BSIFKHANT

CTIFKHANT

DSIFKHANT

BSIFKHANT

DSAFKHANT

$ chmod +x entropy2

$ entropy2 2 input.seq

1 AS

2 BS

1 CT

2 DS

Position=1 Entropy=0.667818

-------

1 SA

4 SI

1 TI

Position=2 Entropy=0.460826

-------

1 AF

5 IF

Position=3 Entropy=0.228055

The long version

#!/bin/bash
# Usage:  prog 
# this program computes the Entropy of window size w for a
# multiple sequence alignment with gaps marked with "-" only.
# Date: 22 April 2009
# ToDo:
# does not trap missing $1
# does not handle blank or missing data or missing line
# works for windows = 1 onwards. w cannot be = or less than 0.
# need to check the equation for any errors
# assumes that the data input is well formated according to msa output
# does not trap data input errors and may still generate errorneous output
# does not cater for description tag of the clustal MSA output
# only handles one catenated sequence per line

input=$2
# Assumes the first line is the longest
# L = number of column positions to be computed
L=`head -1 $2 |  wc -c | awk '{print $1}' | sed 's/.*/&-1/' | bc`

# x - number of windows
x=`expr $L - $1 + 1`
echo Window Size: $1 size
echo No of Windows: $x windows
# number of sequences (=rows, since only one cateneted sequence per line allowed)
n=`cat $input |wc -l `
echo No of Lines: $n lines
w=$1

# i = current residue position begin
for i in `seq $x`
do


# i = current residue position end
j=`expr $i + $w - 1`

# print aligned sequences within window and count them
cat $input | cut -c$i-$j| sort | uniq -c

# Report position number
echo -n "Position=$i "
# Compute Entropy in bits of these sequences
cat $input | cut -c$i-$j| sort | uniq -c | awk  -v n="$n" '{ h=( -1 /  n * log(2))  * ( log($1) - log(n) ); H=h+H } END {print "Entropy=" H }'
echo -------
done

Sunday, April 19, 2009

18) Doing maths in Unix

To perform calculations in Unix, one can key in "bc" to access the program. Once in the program, just key in the formula (e.g. 3+2).

Alternatively, to perform calculations straight from the command line, key in:

echo "3+2" | bc

17) Basic shell command to delimit protein/DNA sequence

The 'sed' shell command delimits protein/DNA sequences such that each base/residue is seperated by a comma or a specified symbol:

sed 's/./&,/g' input.txt>output.txt

Note: '.' matches any character in the file and '&' replaces every match with itself. Take note that this is a basic command and can be improved in many ways to perform more sophisticated tasks.

16) Basic shell command to count number of fields in a file

Use the NF variable in awk

awk '{print NF}'

Tuesday, April 7, 2009

15) Perl script to randomly shuffle sequences in a fasta file

This perl script randomly shuffles the order of sequences in a fasta file. Upon execution, specify your input file (without .fasta extension) and total no. of sequences.

Output: Output file xxx_shuffled.fasta will be produced

Note: Again, this script is cut out from a sub-routine and there may be bugs in terms of variable names and definitions.

#! /usr/bin/perl

print "Please enter filename (without extension): ";

$input = <>;

chomp ($input);

print "Please enter total no. of sequence in fasta file: ";

$orig_size= <>*2-1;

chomp ($orig_size);

open (INFILE, "$input.fasta") or die "Error opening input file for shuffling!";

open (SHUFFLED, ">"."$input"."_shuffled.fasta") or die "Error creating shuffled output file!";

@array = (0); # Need to initialise 1st element in array1&2 for the shift function

@array2 = (0);

$i = 1;

$index = 0;

$index2 = 0;

while (my @line=<INFILE>){

while ($i<=$orig_size) {

$array[$i] = $line[$index];

$array[$i]=~ s/(.)\s/$1/seg;

$index++;

$array2[$i] = $line[$index];

$array2[$i]=~ s/(.)\s/$1/seg;

$i++;

$index++;

}

my $array = shift (@array);

my $array2 = shift (@array2);

for ($i = $header_size; --$i; ) {

my $j = int rand ($i+1);

next if $i == $j;

@array[$i,$j] = @array[$j,$i];

@array2[$i,$j] = @array2[$j,$i];

}

while ($index2<=$header_size) {

print SHUFFLED "$array[$index2]\n";

print SHUFFLED "$array2[$index2]\n";

$index2++;

}

close(INFILE);

close(SHUFFLED);

}

14) Perl script for separating a fasta file into individual sequence fasta files

This Perl script separates a fasta file into individual sequence fasta files. Upon execution, users have to specify the input file (without .fasta extension) and the total number of sequences in their input file. Useful for leave-one-out testing.

Output: Individual sequences will be cut out from the original fasta file and saved under file names query1.fasta, query2.fasta etc. In addition, the remaining sequences will be saved in a separate fasta file (file names out1.fasta, out2.fasta etc.)

Note: This script is cut out from a sub-routine and there may be bugs e.g. undefined variables. Please check and debug before using.

#! /usr/bin/perl

print "Please enter filename (without extension): ";

$input = <>;

chomp ($input);

print "Please enter total no. of sequence in fasta file: ";

$training_seq = <>;

chomp ($training_seq);

my $no = 0;

my $query = 0;

my $filenumber = 1;

while ($filenumber<=$training_seq) {

print $training_seq;

open (INFILE, "$input.fasta") or die "Error opening input file!";

open (QUERY, ">query$filenumber.fasta") or die "Can't open query($filenumber)!";

open (OUT, ">out$filenumber.fasta") or die "Can't open out($filenumber)!";

while (@line=<INFILE>) {

# Print query sequence to query file

print QUERY $line[$query++];

print QUERY $line[$query];

# Delete query sequence from database file

delete $line[$no++];

delete $line[$no];

# Print sequences to database file

print OUT @line;

close INFILE;

close QUERY;

close OUT;

}

$filenumber++;

$no++;

$query++;

}

13) Perl script for splitting a fasta file into several small files

This script splits a fasta file into several small fasta files, according to the specified number of sequences in each file. Upon execution, users have to input the original fasta file name (without .fasta extension) and to specify the intended number of sequences in each file.

Output: A chomped fasta file (xxx_chomped.fasta) will be produced and the small fasta files will be named as (xxx_1.fasta, xxx_2.fasta etc)

Note: Each subroutine can be run independently as a perl script by cutting and paste into another .pl file and removing the lines sub xxx { and }. If users do not want to run all of the sub-routines, just comment out those that you don't want to run, e.g. #split_fasta();

#! /usr/bin/perl

print "Please enter filename (without extension): ";

$input = <>;

chomp ($input);

print "Please enter no. of sequences you want in each file ";

$upper_limit = <>+1;

chomp ($upper_limit);

chomp_fasta();

split_fasta();

#--------------------------------------------------------------------------------

# chomp_fasta: Merges all the sequence lines in a fasta file into one line, so

# each sequence in the fasta file will have one header line and

# one sequence line only

#--------------------------------------------------------------------------------

sub chomp_fasta {

open (INFILE, "$input.fasta") or die "Cannot open infile!";

open (OUT, ">"."$input"."_chomped.fasta") or die "Cannot open outfile!";

while ($line=<INFILE>) { # Please remove the spaces

if ($line=~/>/) {

print OUT "\n$line";

}

else {

chomp ($line);

print OUT "$line";

}

close OUT;

}

#--------------------------------------------------------------------------------

# split_fasta: Splits a fasta file into several small files according to the

# specified no. of sequences in each file

#--------------------------------------------------------------------------------

sub split_fasta {

$count = 0;

$number = 1;

open (INFILE, "$input"."_chomped.fasta") or die "Cannot open infile!";

open (OUT, ">"."$input"."_"."$number".".fasta") or die "Cannot open outfile!";

while ($line=<INFILE>) {

if ($line=~/>/) {

$count++;

if ($count==1) {

print OUT "$line";

}

elsif ($count<$upper_limit) { #change this value to change upper limit

print OUT "\n$line";

}

else {

chomp ($line);

print OUT "$line";

}

#--------------------------------------------------------------------------------

# Note: The header >xxx when $count=3 has already been evaluated in the previous

# regex loop but is not printed out. So this if loop must print out the

# header >xxx in a new file manually and reset the count back to 1 instead

# of 0

#--------------------------------------------------------------------------------

if ($count==$upper_limit) { #change this value to change upper limit

close OUT;

$number++;

open (OUT, ">"."$input"."_"."$number".".fasta") or die "Cannot open outfile!";

print OUT "$line";

$count = 1;

}

Friday, April 3, 2009

12) UNIX Shell Script - Converting blast2seq results into FASTA

This is a modification of the shell script written by Murali (see Entry 11) to convert a blast2seq result (blastp) to FASTA format by extracting out all sbjct sequences. This is very useful especially for creating my dengue database (see http://benlogbook.blogspot.com for more details), as I used this script as a polyprotein separator to separate the sequences for each protein subtype. However, double hits will still have to be removed from your blast2seq file.

In the case for my dengue database, I have all my sample sequences for each protein subtype (C,prM,E,NS1,NS2a,NS2b,NS3,NS4a,NS4b,NS5) in FASTA, and paste all in the query textbox. Next, I paste my polyprotein which contain the full protein sequence in the sbjct textbox. The e-value threshold is set at 0.1 to prevent double hits from appearing. After obtaining the blast2seq results in text file, this script is executed to obtain each protein sequence in each subtype. Note that the results file would have to be placed in a folder called "input" before running the script, which have to be placed anywhere except inside the "input" folder itself.

Process Code


#!/usr/bin/bash
mkdir output
ls input | sed 's/.txt//g' > index.txt
# lists all the files in the folder called input, removes the extension and places them in a file called index.txt
#extension of BLAST output is assumed to be .txt. If your BLAST output is not a text file, change the code in order to comply with your file extension.
printf "The following lines in the following BLAST output files had gaps" > logfile.txt
# creates the logfile
for num in `cat index.txt`
# generates a for-loop where your variable $num takes the values of an input file name in each iteration
do cat input/$num.txt | awk '/>/ {printf "\n\n"$0"\n"} /Sbjct/ {printf $3}' | sed 's/-//g' > output/$num.fasta
# takes the sequence identifiers and the subject sequences, removes the gaps from subject sequences and funnels it to an output file which is of the same name, but is in an output folder
printf "\n" >> logfile.txt
# generates a new line and updates the logfile
printf $num >> logfile.txt
# generates a new entry for that input file
printf "\n" >> logfile.txt
# generates a new line
cat input/$num.txt | grep -n Sbjct | sed 's/Sbjct//g' | awk '/-/ {print $1," " ,$3}' >> logfile.txt
# takes an input file, numbers the lines, prints the line number of the subject sequence with gaps and the subject sequence and updates the logfile
done
# end of loop
printf "\n"
printf "Gaps have been removed in output" >> logfile.txt

Code by: Murali and BenBen
Blog Post by: BenBen

Saturday, March 21, 2009

11) Script converting BLAST output to FASTA format

The following script takes all your BLAST output from an input folder, processes the files, removes gaps and generates the ouput files in an output folder. It will also generate a logfile which indicates which lines in which input file contain gaps.

#!/bin/bash mkdir output ls input | sed 's/.txt//g' > index.txt # lists all the files in the folder called input, removes the extension and places them in a file called index.txt #extension of BLAST output is assumed to be .txt. If your BLAST output is not a text file, change the code in order to comply with your file extension. printf "The following lines in the following BLAST output files had gaps" > logfile.txt # creates the logfile for num in `cat index.txt` # generates a for-loop where your variable $num takes the values of an input file name in each iteration do cat input/$num.txt | awk '/>/ {print $0} /Sbjct:/ {print $3}' | sed 's/-//g' > output/$num.fasta # takes the sequence identifiers and the subject sequences, removes the gaps from subject sequences and funnels it to an output file which is of the same name, but is in an output folder printf "\n" >> logfile.txt # generates a new line and updates the logfile printf $num >> logfile.txt # generates a new entry for that input file printf "\n" >> logfile.txt # generates a new line cat input/$num.txt | grep -n Sbjct | sed 's/Sbjct://g' | awk '/-/ {print $1," " ,$3}' >> logfile.txt # takes an input file, numbers the lines, prints the line number of the subject sequence with gaps and the subject sequence and updates the logfile done # end of loop printf "\n" printf "Gaps have been removed in output" >> logfile.txt

To execute, copy the code to a text editor and save it as a .sh file. Create a directory called input and place all your BLAST output into the folder, nothing else. Run the script in the parent folder.

10) Script converting ACII files from DOS to UNIX format

Shell scripts written in a notepad, notepad2, notepad++ etc encounter errors when executed in shell. For users who use unix environments on their Windows system, this would be a problem as scripting is often done using text editors from windows. This occurs in scripts with multiple lines of code. In DOS formatting, a new line includes two characters, a line feed and a carriage return. Unix formats use only the line feed character for a new line. In order to make your file executable, run the following command in shell:

>tr -d '\15' < infile > OUTFILE

The above code removes all carriage returns (ASCII code \15) from each line. Make sure that your outfile and infile have different names, and remember to inlude your extensions. Your outfile can now be executed in unix shell once you have made it executable.

Wednesday, March 11, 2009

9) UNIX Shell Script - automating the benbo script for more than one input file

This is a simple shell script used for running benbo for more than one input file.

Process Code


for i in xx*
do
./benbo.sh 1 $i > $i\_out.txt
done

Usage Notes:

1) To run the script assuming the code is saved in a file called benbo2.sh: ./benbo2.sh.

2) This script will run benbo.sh for more than one input file, however all the input files would have to include "xx" in their filenames.

3)The user may switch to the perl version of benbo by substituting ./benbo.sh to ./benbo.pl in the above code.

4) This script assumes that the index protein/nucleotide sequence is on the first line, the user may change "1" to another number in the above code if their index sequence is not located on the first line of their input file.

5) The output files will be generated on the same folder and named as "input file name"_out.txt

Code by: BenBen
Blog Post by: BenBen

8) Perl Script for converting matched amino acid residues to dots

The following is a code to process an alignment input to an output whereby the amino acid residues or the nucleotides of all protein or DNA sequences variant to an index protein or DNA sequence are represented as dots, leaving the variable residues or nucleotides intact. This is similar to the previous method, however, this is a perl script which is suitable for matching longer index protein sequences.

Input

ASIFTANTINWEE
ASAFTANTINWEE
ASEETENTANWAA

Process Code


#!/usr/bin/perl 
# Usage benbo.pl  
$L=$ARGV[0]-1;
 
# Open file for reading
open (FILE, "$ARGV[1]") or die "cannot open file: $!";

# Store everything in the file into array @raw_data
@raw_data=;

# print the Target Line
print $raw_data[$L];
 
# trim off newline from the Target Line
chomp($raw_data[$L]);
@linearray = split("", $raw_data[$L] );
 
# test each line of the input file, as read into the rawdata array
foreach $line (@raw_data)
{
chomp($line);

# take line in question and split into an array of single chr
@linetest = split("", $line );

# initialise the character count
$nn=0;

# analyse each character in the current Test line
foreach $linechr (@linetest) {
 
# check if  Test character matches Target character
if ($linechr =~ m/$linearray[$nn]/) {

# if yes, print dot
print "."; 
} else  {

# if no, print the non-matching chr in the Test sequence
print "$linechr"; 
} # if

# increment the character count of the Target line
$nn=$nn+1;
} #foreach linechr

# once each line is completed, print new line, and standby to reset $nn
print "\n";
# run through every line of the sequence input file
} # foreach
 
close(FILE);

Usage Notes:

1) To run the script assuming the code is saved in a file called benbo.pl: ./benbo.pl 1 xx1.txt, where "1" is the line number of the index sequence and "xx1.txt" is the input file. All the characters of the index sequence are shown in the output.

Output

ASIFTANTINWEE
..A..........
..EE.E..A..AA

Code by: Dr. Tan Tin Wee
Blog Post by: BenBen

7) UNIX Shell Script for converting matched amino acid residues to dots

The following is a code to process an alignment input to an output whereby the amino acid residues or the nucleotides of all protein or DNA sequences variant to an index protein or DNA sequence are represented as dots, leaving the variable residues or nucleotides intact.

Input

ASIFTANTINWEE
ASAFTANTINWEE
ASEETENTANWAA

Process Code


#!/bin/bash
# grab the sequence in linenumber $1 in sequence file $2
seq=`head -n $1 $2 |tail -1`

# take sequence, break it down to each residue and iterate
(
for i in `echo $seq | fold -1`
do

#  Generate the sed script
echo "s_^\($dot\)${i}_\1._"

# Set up the dot for the last sed command
dot=".$dot"
done

# after finishing generating the sed script for each column
# construct the substitution for the sequence at linenumber $1
echo "$1s+$dot+$seq+" | sed -e 's/\./\\./g'
) > /tmp/y
# temp file created

# use the dynamically created temp file as program for sed to
# apply on the sequence file $2
sed -f /tmp/y $2
# remove temp file created
rm /tmp/y

exit
# Example of the temporary file /tmp/y template created dynamically
# slashdot=`echo $dot | sed -e 's/\./\\\\./g'`
n=11
n=`expr $n + 1`
s/^\(\)V/./
s/^\(.\)D/\1./
s_^\(..\)R_\1._
s_^\(...\)F_\1._
s_^\(....\)Y_\1._
s_^\(.....\)K_\1._
s_^\(......\)T_\1._
s_^\(.......\)L_\1._
s_^\(........\)R_\1._
s_^\(.........\)A_\1._
s_^\(..........\)E_\1._
s_^\(...........\)Q_\1._
s_^\(............\)A_\1._
s_^\(.............\)S_\1._
s_^\(..............\)Q_\1._
1s+\.\.\.\.\.\.\.\.\.\.\.\.\.\.\.+VDRFYKTLRAEQASQ+

Usage Notes:

1) To run the script assuming the code is saved in a file called benbo.sh: ./benbo.sh 1 xx1.txt, where "1" is the line number of the index sequence and "xx1.txt" is the input file. All the characters of the index sequence are shown in the output.

Output

ASIFTANTINWEE
..A..........
..EE.E..A..AA

Code by: Dr. Tan Tin Wee
Blog Post by: BenBen
Post Edited by: Asif M. Khan

Tuesday, March 10, 2009

6) Process hit sequences in BLAST output files into FASTA formatted sequences

LAPIS is problematic to use with large input files because of its limited memory.
We end up having to break up the file, process it (often having to restart LAPIS every time the memory runs out. The following code is able to process or convert all the hit sequences in Blast output files into Fasta formatted sequences.

Follow the following steps:

1. copy the following code into a text editor


#!/bin/bash
mkdir output;
for num in `seq 4`;
do for pii in `seq -w 2 11`;
do `cat DENV$num"o"$pii.txt | sed 's/>/33ab>/g' | sed 's/Sbjct:/33abSbjct:/g' | grep '33ab' | sed 's/33ab//g' | sed 's/Sbjct://g' | sed 's/[ ][0-9]/33ab/g' | sed 's/33ab/ /g' | sed 's/[ ][0-9]/33ab/g' | sed 's/33ab/ /g' | sed 's/[ ][0-9]/33ab/g' | sed 's/33ab/ /g' | sed 's/[ ][0-9]/33ab/g' | sed 's/33ab/ /g' | sed 's/ //g' > output/DENV$num$pii.fasta`;
done;
done

2. Delete the line breaks. i.e. make it such that everything is in one line with the different commands separated by semicolons ";". Remember to put a space after the semicolon.

3. Save the above code in the text editor as extract.sh (or whatever you want)

4. Remember to run it in bash enviroment as "bash extract.sh" using Linux Konsole

Note:

i. My BLAST results were in the form DENV$num"o"$pii.txt e.g DENV1o02.txt, so, I've got 2 nested "for loops" to change 2 variables $num and $pii; DENV1o02, DENV2o02, etc and to change DENV1o02, DENV1o03. For convenience use files named abcd1, abcd2 etc so that the for loops can be used to run the program for a number of files

ii. I use sed to add in a dummy identifier "33ab" to ">" and "Sbjct". Then I can use grep to extract these lines. I use sed to remove 33ab, then I use sed to remove "Sbjct:", then subject sequence numbers and the spaces. Unfortunately all 1 digit to 4 digit numbers are removed (including from the sequence title line) except the gi and the accession numbers

iii. The processed files are created in a a folder called output. Make sure that there are no other folders of that name in your folder. Alternatively the output folder can be renamed from the code

Code by: Murali
Blog Post by: Murali
Post Edited by: Asif M. Khan

Wednesday, March 4, 2009

5) LAPIS: Extracting all sbjct sequences of a particular length from BLAST results

This lapis code is to extract all sbjct sequences with the length of 15 amino acids and at the same time getting rid of double blast hits. This code only works provided that each hit only has one line of sbjct sequence.

Steps


1) extract {from line starting with > to line starting with sbjct} # (this is to get rid of double hits)
2) extract {from line containing "/15" to line containing sbjct}
3) extract {line starting with sbjct}
4) omit sbjct:
5) omit number
6) omit spaces

4) LAPIS: Extracting all hit sequences from BLAST results (for hits with only one line of sbjct sequence)

This lapis code converts the blast results into FASTA format (and getting rid of double hits). This code only works provided that each hit only has one line of sbjct sequence.

Steps


1) extract {from line starting with > to line starting with sbjct} (this is to get rid of double hits)
2) extract {line either starting > or starting sbjct} ("or" must be after "either")
3) omit {number in line starting with sbjct}
4) omit {spaces in line starting with sbjct}
5) omit sbjct:

Output example corresponding the above codes:

a) Output of code: extract {from line starting with > to line starting with sbjct}

>gi|190193559|dbj|BAG48486.1| gag protein [Human immunodeficiency virus 1]
Length = 154

Score = 50.7 bits (112), Expect = 5e-09
Identities = 15/15 (100%), Positives = 15/15 (100%)

Query: 1 CRQILGQLQPSLQTG 15
CRQILGQLQPSLQTG
Sbjct:57 CRQILGQLQPSLQTG 71

Lines starting with ">" to lines starting with "sbjct" are extracted.

b) Output of code: extract {line either starting > or starting sbjct}

>gi|190193559|dbj|BAG48486.1| gag protein [Human immunodeficiency virus 1]
Sbjct: 57 CRQILGQLQPSLQTG 71

Lines starting with ">" and "sbjct" are extracted.

c) Output of code: omit {number in line starting with sbjct}

>gi|190193559|dbj|BAG48486.1| gag protein [Human immunodeficiency virus 1]
Sbjct: CRQILGQLQPSLQTG

Numbers in lines starting with "sbjct" are removed.

d) Output of code: omit {spaces in line starting with sbjct}

>gi|190193559|dbj|BAG48486.1| gag protein [Human immunodeficiency virus 1]
Sbjct:CRQILGQLQPSLQTG

Spaces in lines starting with "sbjct" are removed.

e) Output of code: omit {spaces in line starting with sbjct}

>gi|190193559|dbj|BAG48486.1| gag protein [Human immunodeficiency virus 1]
CRQILGQLQPSLQTG

The word "sbjct:" is removed, thus obtaining the FASTA sequences in the end.

Code by: Benben & Asif M. Khan
Blog Post by: BenBen
Post Edited by: BenBen

3) UNIX Shell Script for HIV Standalone Blast

This is a simple script for using standalone blast against the HIV database to generate blast results for each HIV T-cell epitope.

HIVstandaloneblast.sh

Process Code


for i in xx*
do
blastall -p blastp -d HIV_db1.fasta -i $i -F F -e 200000 -I T -v 0 -b 20000 -f 11 -W 2 -T F -C F -M PAM30 -o $i.out
done

Code by: BenBen & Asif M. Khan
Blog Post by: BenBen
Post Edited by: BenBen

2) UNIX Shell Script for WNV Netblast

This is a simple shell script for using netblast on all organisms except WNV and artificial sequences to generate blast results for each WNV T-cell epitope.

WNVNetblast.sh

Process Code


for i in xx*
do
\./blastcl3 -p blastp -d nr -i $i -F F -e 200000 -I T -v 0 -b 20000 -f 11 -W 2 -T F -u "Root[ORGN] NOT txid11082[Organism:exp] NOT txid81077[ORGN]" -C F -M PAM30 -o $i.out
done

Code by: BenBen & Asif M. Khan
Blog Post by: BenBen
Post Edited by: BenBen

1) UNIX Shell Script for Mass Alignments using Muscle

This is a simple shell script for running muscle with one or more alignment automatedly.

Muscle.sh

Process Code


for i in xx*
do
muscle -in $i -out $i"_align".fasta
done

Usage Notes:

1) All your fasta files that are required for alignments must have xx1,xx2,etc included in the file name.

2) Your output files would have "_align" included in the file name.

Code by: BenBen
Blog Post by: BenBen
Post Edited by: BenBen

0) Understanding Unix Environment

{This is still work in progress..}

Unix File System

The Unix File System is a hierarchical file system.

/ is root of the system
/var is a directory in root
/var/www is the full path of the sub-directory in "var"
/var/www/htdocs is the full path of the sub-sub-directory in "/var/www" and so on...

Some examples

1. To know in which directory you are currently in: pwd

# It shows the path of the present working directory.

2. Navigate: cd
# If you want to get into a certain directory, for example:
cd /var/www/htdocs/db

3. To list the content of a directory: ls
# To get long directory listing: ls -l

4. Getting the source code: lynx -source
# Example
lynx -source http://aps.unmc.edu/AP/database/query_output.php?ID=00286
# lynx is a text-based browser. -source tells lynx NOT to display the HTML code, but to return the raw HTML code.

5. Unix command history
# Just press the ''Arrow up'' key to see your previous command
# Whatever you do, double check

6. Get the internet page to your folder
:Example: to put the internet page to the dir where you are at now, key in
lynx -source internet_address.html > your_pagename.html

7. Change mode
: ''chmod'' takes several arguments
: -x is to make executable
: ''chmod -x programfile'' will make program executable rather than just a plain text file.

Questions

1. Why put /var/www/htdocs/db/myprogram in front of myprogram?
: This is because whenever you execute a program, Unix shell needs to find the path of that program.
: In this case, we spell out the full directory path so that Unix shell knows where that program is.

2. Why then can we use the ''lynx'' command and don't have to put the full path?
: This is because lynx command is already in the Shell's list of known paths.
: Later on, we will show you what is the current known paths, e.g. echo $PATH

Command Line Arguments

1. Command Line command without arguments
: For example ''pwd'' doesn't need to take arguments

2. Command line command with one argument
: For example ''ls -l'' takes one argument, the flag ''-l'' to give a ''l''ong listing.

3. Command line command with two arguments
: For example ''lynx -source http://www.google.com''
: lynx the command
: -source the first argument
: http://www.google.com as the second argument

Loop Control

How do we make the numbers autoincrement without us to put in the numbers ourselves?

1. Make 6 into a program to create multiple files
:(1)pseudo code for ''my_program''
for (i=0;i<100;i++)> your_pagename[i].html}

:(2)Change the file to executable file
:(3)Key in [path]/my_program
:(4)chmod +x my_program (This is to make the file executable.)

Learning Unix Commands

'''Environment Control'''

# refers to comments

1) cd d # Change to directory d

2) mkdir d #Create new directory d

3) rmdir d #Remove directory d

4) rm -r existingdirectories # Will delete the existing directory named 'existingdirectories' and all directories and files below it.

5) rm -rf existingdirectories # Will delete the existing directory named 'existingdirectories' and all directories and files below it. Even if it runs into an exception that it would usually prompt for user interaction.
the r is recursive and the f is force.

Be careful with the "rm -r" command. Don't run it as root. And stay away from the root directory if you plan on running it.

6) mv f1 [f2...] d #Move file f to directory d

7) mv d1 d2 #Rename directory d1 as d2

8) passwd #Change password

9) alias name1 name2 #Create command alias (csh/tcsh)

10) alias name1="name2" #Create command alias (ksh/bash)

11) unalias name1[na2...] #Remove command alias na

12) ssh nd #Login securely to remote node

13) exit #End terminal session

14) setenv name v #Set env var to value v (csh/tcsh)

export name="v" #set environment variable to value v (ksh/bash)

Miscellaneous commands

1)

ls

# is to list the directory

2)

cp

# copy

3)

wc

# word count

4)

ls -al | grep root |awk '{print "file: $9, S8" }'

# to extract the 8th and the 9th column; details of all the files/directories called root is ls

5)

wc -l

# count number of lines

6)

# is to funnel to a new file

7)

 ls -al | grep root |awk '{print "file: $9, S8" }' > x.file

# transfer the output to a file name, in this case called x.file

8)

 pico x.file

# a text editor that helps you read the x file; to get out from pico, type ctrl x; pico is very sophisticated

9)

 vi x.file

# a text editor that helps you read the x file

10)

:q

# helps exit vi, which is very obscure.

11)

:q!

# helps exit vi, which is very obscure.

12)

pwd

# print the current working directory

13)

cd

# is to change the directory

14)

tab

#use it to automatically fill in the blanks; auto completion

15)

 key up and down

# help you look at commands typed earlier

16)

 echo "ls -al /root/Desktop/ |awk '{print $5 "b", S9 }' " >tinwee

# creates an executable file

17)

 ./tinwee

# executing tinwee, but will not work because you have no permission

18)

 chmod 755 tinwee

# give full access to run tinwee

19)

 pico tinwee

# to make the program more generic by modifying the code

20)

 ls -al $1 |awk '{print $5 "bytes", $9 }

#change the code to more generic instead of hardwiring it

21)

 tinwee /

# takes / as $1, which in this case is root; note that this $1 is different from the #awk $1

22)

 ln tinwee t

# creating a shortname or short cut (soft link) for tinwee (alias); make sure it is unique alias

23)

 echo $PATH | sed 's/:/\n/g'

# this is to show the path of all the directories and files line by #line

24)

 echo $PATH | sed 's/:/\n/g' |more

# show little by little

25)

 echo $PATH | sed 's/:/\n/g' |less

# show less

26)

 echo $PATH | sed 's/:/\n/g' |cat

# to show all

27)

 echo $PATH | sed 's/:/\n/g' |tac

# to show in reverse order

28)

 echo $PATH | sed 's/:/\n/g' |less |rev

# show less but in reverse order

29)

# to quit while looking at the results of these

30)

 ctrl + C

#to quit while looking at these

31) Create a translation tool, simple one


echo "ATGCTTA"
" |rev |tac |tr "atcg" "tagc"

# rev reverses the sequence, tac show it in reverse order and tr does the translation

32)

alias

#does not link a file but substitutes for the name

33)

>>

#will append to the bottom of a file that already exists

34)

#append to a new file

35) Creata database of your own

lynx http://aps.unmc.edu/AP/database/query_output.php?ID=00286 > AP00286.html

#saves the web page to AP00286.html but it is not saving the source code

lynx -source http://aps.unmc.edu/AP/database/query_output.php?ID=00286 > AP00286.html

#saves the web page to AP00286.html and this time it is saving the source code
echo "lynx -source http://aps.unmc.edu/AP/database/query_output.php?ID=00286 > AP00286.html" >getapd

pico getapd

# read the getapd file
change

lynx -source http://aps.unmc.edu/AP/database/query_output.php?ID=00286 > AP00286.html

lynx -source http://aps.unmc.edu/AP/database/query_output.php?ID=0$1 > AP0$1.html

# this is making the code generic

getapd AP00287

# type on the command prompt to download the record 00287 without typing the whole command

pico getapd

# open the getapd file to make the command more generic


for i in 'Seq -w 1137`
do
lynx -source http://aps.unmc.edu/AP/database/query_output.php?ID=0$i > AP0$i.html
sleep 3
done

./getapd

#run getapd for automatic download

36.

" "

# double quotes are used whe you want to evaluate the variable. For example "$PATH" will evaluate the dollar sign and give you relevant matches
37.

' '

# are used when one wants to print the special characters without evaluating them. For example, '$PATH' will return $PATH

Bioinformatics Codelets

Friday, March 20, 2099

Monday, October 26, 2009

Tuesday, October 20, 2009

Tuesday, September 22, 2009

Sunday, May 31, 2009

Tuesday, April 21, 2009

Sunday, April 19, 2009

Tuesday, April 7, 2009

Friday, April 3, 2009

Saturday, March 21, 2009

Wednesday, March 11, 2009

Tuesday, March 10, 2009

Wednesday, March 4, 2009

Subscribe To Codelets

Related Sites

Contributors