Monday, October 26, 2009

24) Unix one-liner to split one-line sequences into FASTA formatted files

In Unix, a single line of code is enough to turn a text file containing x sequences, one sequence per line, into a FASTA formatted file of x sequences, each headed by >n, where n is the line number of that sequence.

$ pr -n:3 -t -T input.txt | sed 's/^/>/' | tr ":" "\n"

pr is the command that formats text for printing.
-n:3 numbers each line, with the number right-aligned in a field 3 characters wide (pick a wider field if you have more than 999 lines),
and : is the separator between the line number and the sequence
-t switches off the page headers and trailers
-T does the same and also switches off pagination by form feeds in the input

sed 's/^/>/' inserts a > at the beginning of each line,
and
tr replaces the separator ":" with a newline character "\n"
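
To try this on something small first, here is a minimal sketch, assuming GNU coreutils. The scratch directory, the file name sample.txt and the two sequences are made up for illustration. Note the padding spaces that pr leaves between the > and the number.

```shell
# Hypothetical sample data: two sequences, one per line, in a scratch directory.
workdir=$(mktemp -d)
printf 'ACGTACGT\nTTGACCA\n' > "$workdir/sample.txt"

# The same three-stage pipeline as above, captured in a variable for inspection.
fasta=$(pr -n:3 -t -T "$workdir/sample.txt" | sed 's/^/>/' | tr ":" "\n")
printf '%s\n' "$fasta"
# Expected output (the number is right-aligned in a 3-character field):
# >  1
# ACGTACGT
# >  2
# TTGACCA
```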

And if you want to wrap the sequence into a width of 60 characters,

$ pr -n:3 -t -T input.txt | sed 's/^/>/' | tr ":" "\n" | fold -w 60

where fold -w 60 wraps each line at a width of 60 characters.
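
To see what fold does on its own, here is a small sketch with a made-up 10-character sequence and a width of 4 instead of 60, so that the wrapping is easy to spot:

```shell
# Hypothetical 10-character sequence, wrapped at 4 characters per line.
wrapped=$(printf 'ACGTACGTAC\n' | fold -w 4)
printf '%s\n' "$wrapped"
# Expected output:
# ACGT
# ACGT
# AC
```

The header lines pass through fold too, but as long as they are shorter than the chosen width they come out unchanged.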

By changing -n:3 to, say, -n:4 you change the width of the field that holds the number.
By changing -w 60 you change the line width of your sequences.
By replacing s/^/>/ with s/^[ ]*/>/ you remove the padding spaces in front of the number,
and you can also add a prefix in front of the number, e.g.

s/^[ ]*/>NP/
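
The difference between the two sed expressions can be sketched on a single line. The line below is what pr -n:3 would emit for the first sequence; the sequence itself is a made-up example.

```shell
# A numbered line as pr -n:3 would produce it (hypothetical sequence).
numbered='  1:ACGTACGT'

# Plain header: the padding spaces stay between > and the number.
plain=$(printf '%s\n' "$numbered" | sed 's/^/>/' | tr ":" "\n")

# Padding stripped and the prefix NP added in front of the number.
prefixed=$(printf '%s\n' "$numbered" | sed 's/^[ ]*/>NP/' | tr ":" "\n")

printf '%s\n---\n%s\n' "$plain" "$prefixed"
# Expected output:
# >  1
# ACGTACGT
# ---
# >NP1
# ACGTACGT
```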

And if you want to split each sequence into a separate FASTA formatted file, with a filename that carries the sequence number:

$ pr -n:3 -t -T input.txt | sed 's/^[ ]*/>NP/' | tr ":" "\n" | \
fold -w 60 | csplit -f NP - "/>/" "{*}"

where csplit is told
to name the split files with the prefix NP ("-f NP"),
to read its input from the standard input ("-"),
to start a new file at every line that matches the regular expression "/>/",
and to repeat this until the end of the input ("{*}").

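Say for 107 lines of sequences, you will get 108 files named NP00 to NP107, where the contents of NP01 are

>NP1
tataatatcagtatatctat

or whatever the sequence is, folded to 60 characters per line. The header reads >NP1 rather than >NP01 because pr pads the number with spaces, which the sed command strips, while csplit numbers the files with at least two digits.
NP00 is an empty file, since there is nothing before the first > line. Ignore it.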
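
The whole split can be tried safely in a scratch directory. This is a minimal sketch assuming GNU coreutils; the three sequences are made up, and the -s flag (silence csplit's byte counts) and the removal of the empty NP00 are additions, not part of the one-liner above.

```shell
# Work in a scratch directory so the split files do not clutter the current one.
workdir=$(mktemp -d)
cd "$workdir" || exit 1

# Hypothetical sample data: three short sequences, one per line.
printf 'ACGTACGT\nTTGACCA\nGGGCCCAAATTT\n' > input.txt

# Same pipeline as above; -s only suppresses the byte counts csplit prints.
pr -n:3 -t -T input.txt | sed 's/^[ ]*/>NP/' | tr ":" "\n" | \
fold -w 60 | csplit -s -f NP - "/>/" "{*}"

# NP00 is the empty piece before the first header; remove it if it is empty.
[ -s NP00 ] || rm -f NP00

ls NP*      # lists NP01 NP02 NP03
cat NP02    # prints the header >NP2 followed by TTGACCA
```

The header numbers are not zero-padded, so file NP02 holds the record headed >NP2.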

So, just one line of code. Five Unix commands (pr, sed, tr, fold, csplit) do the job really fast because they are pipelined: each command starts working as soon as the previous one produces output. For large input files, the first output files therefore start to pop out as soon as they are finished, so you can read them almost immediately after you start the pipeline.
