Thursday, June 28, 2012

awk - 10 examples to split a file into multiple files



 In this article of the awk series, we will see the different scenarios in which we need to split a file into multiple files using awk. The files can be split into multiple files either based on a condition, or based on a pattern or because the file is big and hence needs to split into smaller files.

Sample File1:
Let us consider a sample file with the following contents:
$ cat file1
Item1,200
Item2,500
Item3,900
Item2,800
Item1,600
1. Split the file into 3 different files, one for each item. i.e, All records pertaining to Item1 into a file, records of Item2 into another, etc.
$ awk -F, '{print > $1}' file1
       The files generated by the above command are as  below:
$ cat Item1
Item1,200
Item1,600
$ cat Item3
Item3,900
$ cat Item2
Item2,500
Item2,800
   This looks so simple, right? print prints the entire line, and the line is printed to a file whose name is $1, which is the first field. This means, the first record will get written to a file named 'Item1', and the second record to 'Item2', third to 'Item3',  4th goes to 'Item2', and so on.

2. Split the files by having an extension of .txt to the new file names.
$ awk -F, '{print > $1".txt"}' file1
    The only change here from the above is concatenating the string ".txt" to the $1 which is the first field. As a result, we get the extension to the file names. The files created are below:
$ ls *.txt
Item2.txt  Item1.txt  Item3.txt
3. Split the files by having only the value(the second field) in the individual files, i.e, only 2nd field in the new files without the 1st field:
$ awk -F, '{print $2 > $1".txt"}' file1
     The print command prints the entire record. Since we want only the second field to go to the output files, we do: print $2.
$ cat Item1.txt
200
600
4. Split the files so that all the items whose value is greater than 500 are in the file "500G.txt", and the rest in the  file "500L.txt".
$ awk -F, '{if($2<=500)print > "500L.txt";else print > "500G.txt"}' file1
        The output files created will be as below:
$ cat 500L.txt
Item1,200
Item2,500
$ cat 500G.txt
Item3,900
Item2,800
Item1,600
     Check the second field($2). If it is lesser or equal to 500, the record goes to "500L.txt", else to "500G.txt".
  Other way to achieve the same thing is using the ternary operator in awk:
$ awk -F, '{x=($2<=500)?"500L.txt":"500G.txt"; print > x}' file1
      The condition for greater or lesser than 500 is checked and the appropriate file name is assigned to variable x. The record is then written to the file present in the variable x.

Sample File2:
Let us consider another file with a different set of contents. This file has a pattern 'START' at frequent intervals.
$ cat file2
START
Unix
Linux
START
Solaris
Aix
SCO
5. Split the file into multiple files at every occurrence of the pattern START .
$ awk '/START/{x="F"++i;}{print > x;}' file2
     This command contains 2 sets of curly braces: The control goes to the first set of braces only on encountering a line containing the pattern START. The second set will be encountered by every line since there is no condition and hence always true.
   On encountering the pattern START, a new file name is created and stored. When the first START comes, x will contain "F1" and the control goes to the next set of braces and the record is written to F1, and the subsequent records go the file "F1" till the next START comes. On encountering next START, x will contain "F2" and the subsequent lines goes to "F2" till the next START, and it continues.
$ cat F1
START
Unix
Linux
Solaris
$ cat F2
START
Aix
SCO
6. Split the file into multiple files at every occurrence of the pattern START. But the line containing the pattern should not be in the new files.
$ awk '/START/{x="F"++i;next}{print > x;}' file2
       The only difference in this from the above is the inclusion of the next command. Due to the next command, the lines containing the START enters the first curly braces and then starts reading the next line immediately due to the next command. As a result, the START lines does not get to the second curly braces and hence the START does not appear in the split files.
$ cat F1
Unix
Linux
$ cat F2
Solaris
Aix
SCO
7. Split the file by inserting a header record in every new file.
$ awk '/START/{x="F"++i;print "ANY HEADER" > x;next}{print > x;}' file2
       The change here from the earlier one is this: Before the next command, we write the header record into the file. This is the right place to write the header record since this is where the file is created first.
$ cat F1
ANY HEADER
Unix
Linux
$ cat F2
ANY HEADER
Solaris
Aix
SCO
Sample File3:
Let us consider a file with the sample contents:
$ cat file3
Unix
Linux
Solaris
AIX
SCO
8. Split the file into multiple files at every 3rd line . i.e, First 3 lines into F1, next 3 lines into F2 and so on.
$ awk 'NR%3==1{x="F"++i;}{print > x}'  file3
      In other words, this is nothing but splitting the file into equal parts. The condition does the trick here: NR%3==1 : NR is the line number of the current record. NR%3 will be equal to 1 for every 3rd line such as 1st, 4th, 7th and so on. And at every 3rd line, the file name is changed in the variable x, and hence the records are written to the appropriate files.
$ cat F1
Unix
Linux
Solaris
$ cat F2
Aix
SCO 
Sample File4:
Let us update the above file with a header and trailer:
$ cat file4
HEADER
Unix
Linux
Solaris
AIX
SCO
TRAILER
9. Split the file at every 3rd line without the header and trailer in the new files.
sed '1d;$d;' file4 | awk 'NR%3==1{x="F"++i;}{print > x}' 
         The earlier command does the work for us, only thing is to pass to the above command without the header and trailer. sed does it for us.  '1d' is to delete the 1st line, '$d' to delete the last line.
$ cat F1
Unix
Linux
Solaris
$ cat F2
AIX
SCO
10.  Split the file at every 3rd line, retaining the header and trailer in every file.
$ awk 'BEGIN{getline f;}NR%3==2{x="F"++i;a[i]=x;print f>x;}{print > x}END{for(j=1;j<i;j++)print> a[j];}' file4
     This one is little tricky. Before the file is processed, the first line is read using getline into the variable f. NR%3 is checked with 2 instead of 1 as in the earlier case because since the first line is a header, we need to split the files at 2nd, 5th, 8th lines, and so on. All the file names are stored in the array "a" for later processing.
    Without the END label, all the files will have the header record, but only the last file will have the trailer record. So, the END label is to precisely write the trailer record to all the files other than the last file.
$ cat F1
HEADER
Unix
Linux
Solaris
TRAILER
$ cat F2
HEADER
Aix
SCO
TRAILER
Related Posts Plugin for WordPress, Blogger...

No comments:

Post a Comment

Author's Note:

This UNIX school is a learning place for all the unix/linux lovers to share and learn!!!Subscribe and enjoy our article updates!!! Know more about the author here.

Copyright Protected


MyFreeCopyright.com Registered & Protected