Tuesday, April 7, 2009

13) Perl script for splitting a fasta file into several small files

This script splits a fasta file into several small fasta files, according to the specified number of sequences in each file. Upon execution, users have to input the original fasta file name (without .fasta extension) and to specify the intended number of sequences in each file.

Output: A chomped fasta file (xxx_chomped.fasta) will be produced and the small fasta files will be named as (xxx_1.fasta, xxx_2.fasta etc)

Note: Each subroutine can be run independently as a perl script by cutting and paste into another .pl file and removing the lines sub xxx { and }. If users do not want to run all of the sub-routines, just comment out those that you don't want to run, e.g. #split_fasta();

#! /usr/bin/perl

print "Please enter filename (without extension): ";
$input = <>;
chomp ($input);

print "Please enter no. of sequences you want in each file ";
$upper_limit = <>+1;
chomp ($upper_limit);

chomp_fasta();
split_fasta();

#--------------------------------------------------------------------------------
# chomp_fasta: Merges all the sequence lines in a fasta file into one line, so
#              each sequence in the fasta file will have one header line and 
#              one sequence line only
#--------------------------------------------------------------------------------

sub chomp_fasta {

open (INFILE, "$input.fasta") or die "Cannot open infile!";
open (OUT, ">"."$input"."_chomped.fasta") or die "Cannot open outfile!";

while ($line=<INFILE>) { # Please remove the spaces

if ($line=~/>/) {
print OUT "\n$line";
}

else {
chomp ($line);
print OUT "$line";
}

}
close OUT;
}

#--------------------------------------------------------------------------------
# split_fasta: Splits a fasta file into several small files according to the 
#              specified no. of sequences in each file
#--------------------------------------------------------------------------------

sub split_fasta {

$count = 0;
$number = 1;

open (INFILE, "$input"."_chomped.fasta") or die "Cannot open infile!";
open (OUT, ">"."$input"."_"."$number".".fasta") or die "Cannot open outfile!";

while ($line=<INFILE>) {

if ($line=~/>/) {

$count++;

if ($count==1) {
print OUT "$line";
}

elsif ($count<$upper_limit) { #change this value to change upper limit
print OUT "\n$line";
}
}

else {
chomp ($line);
print OUT "$line";
}

#--------------------------------------------------------------------------------
# Note: The header >xxx when $count=3 has already been evaluated in the previous
#       regex loop but is not printed out. So this if loop must print out the
#       header >xxx in a new file manually and reset the count back to 1 instead
#       of 0 
#--------------------------------------------------------------------------------


if ($count==$upper_limit) { #change this value to change upper limit

close OUT;
$number++;

open (OUT, ">"."$input"."_"."$number".".fasta") or die "Cannot open outfile!";
print OUT "$line";
$count = 1;
}
}
}

6 comments:

  1. Its not working

    ReplyDelete
    Replies
    1. #! /usr/bin/perl

      print "Please enter filename (without extension): ";
      $input = <>;
      chomp ($input);

      print "Please enter no. of sequences you want in each file ";
      $upper_limit = <>+1;
      chomp ($upper_limit);

      chomp_fasta();
      split_fasta();

      #--------------------------------------------------------------------------------
      # chomp_fasta: Merges all the sequence lines in a fasta file into one line, so
      # each sequence in the fasta file will have one header line and
      # one sequence line only
      #--------------------------------------------------------------------------------

      sub chomp_fasta {

      open (INFILE, "$input") or die "Cannot open infile!";
      open (OUT, ">"."$input"."_chomped.fasta") or die "Cannot open outfile!";

      while ($line=) { # Please remove the spaces

      if ($line=~/>/) {
      print OUT "\n$line";
      }

      else {
      chomp ($line);
      print OUT "$line";
      }

      }
      close OUT;
      }

      #--------------------------------------------------------------------------------
      # split_fasta: Splits a fasta file into several small files according to the
      # specified no. of sequences in each file
      #--------------------------------------------------------------------------------

      sub split_fasta {

      $count = 0;
      $number = 1;

      open (INFILE, "$input"."_chomped.fasta") or die "Cannot open infile!";
      open (OUT, ">"."$input"."_"."$number".".fasta") or die "Cannot open outfile!";

      while ($line=) {

      if ($line=~/>/) {

      $count++;

      if ($count==1) {
      print OUT "$line";
      }

      elsif ($count<$upper_limit) { #change this value to change upper limit
      print OUT "\n$line";
      }
      }

      else {
      chomp ($line);
      print OUT "$line";
      }

      #--------------------------------------------------------------------------------
      # Note: The header >xxx when $count=3 has already been evaluated in the previous
      # regex loop but is not printed out. So this if loop must print out the
      # header >xxx in a new file manually and reset the count back to 1 instead
      # of 0
      #--------------------------------------------------------------------------------


      if ($count==$upper_limit) { #change this value to change upper limit

      close OUT;
      $number++;

      open (OUT, ">"."$input"."_"."$number".".fasta") or die "Cannot open outfile!";
      print OUT "$line";
      $count = 1;
      }
      }
      }

      Delete
    2. That is a lot of lines of perl code. Splitting a file into several smaller files according to number of lines can be achieved with the Unix split command.
      $ split -l 10 x.fasta "seq." # will split x.fasta into files of 10 lines each named seq.aa, seq.ab, seq.ac etc.
      So if we arrange for a multi-sequence fasta file to be in two lines per sequence (i.e. one line header, one line containing the entire sequence), we can simply split according to the number of sequences per file, say 5 per file, using split -l 10, since each sequence contains exactly two lines.
      To convert a fasta format into two lines, we need to catenate the multi-line sequence into one single long line. As indicated in this tutlet, perl can do this in so many lines. However, awk can do this in one short line of awk script:
      $ awk 'BEGIN{ORS=""} {if (/^>/) {sub(/^.*$/,"\n&\n");print} else {print}} END{print "\n"}' multiseq.fasta | grep -v "^$"| split -l 20 - "seq."

      This will generate seq.aa, seq.ab, seq.ac etc each of 10 sequences each (20 divided by 2 since each fasta sequence after the awk script would have been catenated into two lines each).

      Explanation:
      Perl descended from awk. Awk can do most of the things perl routinely does, including simple text manipulations like turning a multisequence fasta file into many files of a fixed number of sequences.

      ORS is the Output Record Separator parameter of awk. If set to "". Every line outputed will be joined together in one great big line with no newlines. Default ORS="\n". So before awk starts doing anything, we set ORS to null.
      At the END, because we know the end of the last line when processed will also be missing a newline character, we need to put that back. Hence END {print "\n"}.
      So what is in between the BEGIN and the END of the awk script, is how we keep the fasta header lines but catenate all the sequence lines.
      Awk reads files one record at a time, with each record defined by RS variable, default "\n". So in this case, awk reads the file multiseq.fasta, one line a time by default. Because ORS is set to null, instead of the default - newline \n, each record will be outputed as one long line, unless we specify where we want to put the newline characters. So this is what we do after BEGIN, which I explain next.
      If condition inside ( ) is true, i.e. /^>/ which means match beginning with >, then this is the header line, which we want to substitute at the beginning and at the end, with newlines \n. Which is what sub(/^.*$/,"\n&\n") does. /^.*$/ matches the entire line, and the .* is kept in the variable denoted by & (just like in sed). So the replacement \n&\n means that everything in the header line, should be preceded by a \n newline, and followed by a newline. Otherwise, all other lines, i.e. sequence containing lines, substitute the end with nothing, and print. This is what the
      else {print}
      does.
      Because the first line being a header line is substituted with \n>something\n, the output will be preceded by a blankline. We remove that with grep -v, and send the whole output consisting of two lines per sequence, and no blank lines anywhere, straight to split.
      If we want 10 sequences per file, i.e. 20 lines, we split -l 20 , 15 sequences, split -l 30. Just multiply the number of sequences we want by two. The - in split -l 30 - is to accept input from stdin. The "seq." is to prefix the output files with seq. to create seq.aa, seq.ab ac ad ... You can change it to whatever you want.

      So in one line, we have achieved the same thing which perl does in so many lines.
      (Of course, we can skip the grep by making awk check NR=1 matching >header line, and prevent it from putting a \n>header\n and put in a header\n instead. But I didn't want to confuse the already complicated awk script).

      Delete
  2. Hi,

    This script was posted some time ago.
    Please try the updated script here instead: http://everest.bic.nus.edu.sg/mediawiki/index.php/FASTA_Manipulation
    (Subsection: Split fasta file into individual sequence files)

    If it's not working, do let me know the exact error message.

    Thanks.

    ReplyDelete
  3. Hi there,

    The script was posted some time ago.
    The updated script can be found at http://everest.bic.nus.edu.sg/mediawiki/index.php/FASTA_Manipulation
    (Section 1.2).

    Please try the updated script. If you still encounter any error, please post the error message.

    Thanks!

    ReplyDelete
  4. Thank you Sir for your expertise. With your program I m able to split my file. I would like to seek your guidance in future also

    ReplyDelete