The following is a code to process an alignment input to an output whereby the amino acid residues or the nucleotides of all protein or DNA sequences variant to an index protein or DNA sequence are represented as dots, leaving the variable residues or nucleotides intact. This is similar to the previous method, however, this is a perl script which is suitable for matching longer index protein sequences.
Input
ASIFTANTINWEE
ASAFTANTINWEE
ASEETENTANWAA
Process Code
#!/usr/bin/perl
# Usage benbo.pl
$L=$ARGV[0]-1;
# Open file for reading
open (FILE, "$ARGV[1]") or die "cannot open file: $!";
# Store everything in the file into array @raw_data
@raw_data=;
# print the Target Line
print $raw_data[$L];
# trim off newline from the Target Line
chomp($raw_data[$L]);
@linearray = split("", $raw_data[$L] );
# test each line of the input file, as read into the rawdata array
foreach $line (@raw_data)
{
chomp($line);
# take line in question and split into an array of single chr
@linetest = split("", $line );
# initialise the character count
$nn=0;
# analyse each character in the current Test line
foreach $linechr (@linetest) {
# check if Test character matches Target character
if ($linechr =~ m/$linearray[$nn]/) {
# if yes, print dot
print ".";
} else {
# if no, print the non-matching chr in the Test sequence
print "$linechr";
} # if
# increment the character count of the Target line
$nn=$nn+1;
} #foreach linechr
# once each line is completed, print new line, and standby to reset $nn
print "\n";
# run through every line of the sequence input file
} # foreach
close(FILE);
Usage Notes:
1) To run the script assuming the code is saved in a file called benbo.pl: ./benbo.pl 1 xx1.txt, where "1" is the line number of the index sequence and "xx1.txt" is the input file. All the characters of the index sequence are shown in the output.
Output
ASIFTANTINWEE
..A..........
..EE.E..A..AA
Code by: Dr. Tan Tin Wee
Blog Post by: BenBen
No comments:
Post a Comment