Splitting a file according to context
We have already met split as a method of splitting
a file into smaller units, and have indicated its use when mailing
large text files. Another reason for splitting a file is if you
know that the file contains separate identifiable portions. For
instance, suppose you had a simple text file consisting of
paragraphs of English separated by blank lines, and you wanted the
paragraphs to be in separate files. The blank lines would identify
where to break the file, and you can specify a blank line by means
of the BRE ^$.
The command csplit splits a file into sections
where the sections either contain specified numbers of lines or are
delimited by text that can be described by a basic regular
expression. To start with a simple example,
$ csplit data 10
will take file data and create two new files
called xx00 and xx01. File
xx00 will contain lines 1 through 9 of file
data, and xx01 will contain line 10 up to
the end of file data. File data remains
unaltered. When each new file is created, csplit will
print the size (in bytes) of that file on the standard output
stream (this can be suppressed with option -s).
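To see where the break falls, the following sketch builds a small test file and splits it. It assumes that the seq utility is available to generate the numbers 1 to 20, and it works in a scratch directory created by mktemp so that no existing xx files are disturbed; the names workdir, first and second are chosen only for this illustration.

```shell
# Work in a scratch directory so that existing xx files are left alone.
workdir=$(mktemp -d)
cd "$workdir"

# Hypothetical 20-line input: the numbers 1 to 20, one per line.
seq 1 20 > data

# Split at line 10; -s suppresses the byte counts.
csplit -s data 10

first=$(wc -l < xx00)     # xx00 holds lines 1 through 9
second=$(wc -l < xx01)    # xx01 holds lines 10 through 20
echo "xx00: $first lines, xx01: $second lines"

# The first line of xx01 is line 10 of data.
head -n 1 xx01
```

Because each line of data contains its own line number, looking at the start of xx01 confirms that line 10 was sent to the second file.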
The first argument to csplit is the name of a file
(or - (hyphen) for standard input), and the following
one or more arguments indicate where the file is to be split. An
argument that is a line number instructs a break to be made at the
beginning of that line (hence in the previous example line 10 is
sent to the second file). The new, smaller files are named
xxnn, where nn starts at
00 and counts upwards. If option -f is given,
followed by a string, that string will be used as the
prefix instead of xx. Any number of
arguments can follow the filename.
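As a small illustration of these rules, the sketch below reads from standard input (the hyphen), gives two line-number arguments, and uses -f to change the prefix. The prefix part and the variable names are hypothetical, and seq is assumed to be available.

```shell
# Work in a scratch directory.
workdir=$(mktemp -d)
cd "$workdir"

# Read 20 numbered lines from standard input ("-") and break at
# lines 5 and 15, naming the pieces part00, part01 and part02.
seq 1 20 | csplit -s -f part - 5 15

p0=$(wc -l < part00)    # lines 1 through 4
p1=$(wc -l < part01)    # lines 5 through 14
p2=$(wc -l < part02)    # lines 15 through 20
echo "part00: $p0  part01: $p1  part02: $p2"
```

Two line-number arguments give three files, since the text after the last break always goes into a file of its own.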
Worked example
Split /usr/dict/words into three files called
words00, words01 and
words02, the first two containing 10000 lines, the
final one containing the rest of
/usr/dict/words.
Solution: Use csplit to split the
file with option -f to specify that the prefix is
words. The next argument is
/usr/dict/words, and this is to be split at lines
10001 and 20001.
$ csplit -f words /usr/dict/words 10001 20001
Now use wc to check that the files you have created
are of the specified length:
$ wc words??
The three files you have created are fairly big, so don't forget
to delete them.
If an argument to csplit is a number n and
is followed immediately by an argument of the form
{count}, then the file will be split
at line n and then repeatedly every n
lines, up to a further count times. Delete any previous
xx files you have created, and try the following:
$ csplit -s /usr/dict/words 1000 {2}
$ wc -l xx??
You will see that /usr/dict/words has been split
into four files. The first split is at the start of line 1000, so
the first file is 999 lines long; the next two files
are each 1000 lines long. The final file xx03
contains the rest of /usr/dict/words.
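If /usr/dict/words is not present on your system, the same arithmetic can be checked on a generated file. This sketch assumes seq is available and uses a hypothetical 3500-line file called numbers.

```shell
# Work in a scratch directory.
workdir=$(mktemp -d)
cd "$workdir"

# Hypothetical input of 3500 numbered lines.
seq 1 3500 > numbers

# Break at line 1000, then twice more, 1000 lines further on each time.
csplit -s numbers 1000 {2}

n0=$(wc -l < xx00)    # lines 1 through 999
n1=$(wc -l < xx01)    # lines 1000 through 1999
n2=$(wc -l < xx02)    # lines 2000 through 2999
n3=$(wc -l < xx03)    # lines 3000 through 3500
echo "$n0 $n1 $n2 $n3"
```

The breaks fall at lines 1000, 2000 and 3000, so only the first file is one line short of 1000.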
If you specify that the file be split at too many places, no
split files will be created and an error message will be generated.
For instance, to try to split /usr/dict/words into 50
files of 10000 lines each (which we clearly cannot do):
$ csplit /usr/dict/words 10000 {50}
82985
82982
csplit: {50} - out of range
Here 82985 is the number of bytes in xx00,
and so on.
The reason for this behaviour is to encourage you to be aware of
how you are splitting your files; csplit errs on
the side of caution. Mistakes in the arguments to
csplit could otherwise generate large
volumes of unwanted split files, wasting
valuable storage space. There are some instances, however, when
this behaviour is undesirable, especially when the length of a file
is not known in advance. If you give csplit option
-k it will still warn you if you try to split the input file
too many times, but it will keep the xx files it has
created. So, to split /usr/dict/words into as many
files as possible each containing (roughly) 5000 lines:
$ csplit -k /usr/dict/words 5000 {10000}
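The effect of -k can be tried safely on a generated file. The sketch below assumes seq is available and uses a hypothetical 12000-line file called numbers; the variable status records the exit status of csplit, which is expected to be non-zero because the repeat count cannot be satisfied.

```shell
# Work in a scratch directory.
workdir=$(mktemp -d)
cd "$workdir"

# Hypothetical input of 12000 numbered lines.
seq 1 12000 > numbers

# Ask for far more repetitions than the file allows. csplit writes an
# error message and fails, but -k makes it keep what it has written.
status=0
csplit -s -k numbers 5000 {10000} || status=$?

first=$(wc -l < xx00)         # lines 1 through 4999
pieces=$(ls xx?? | wc -l)     # number of files that were kept
total=$(cat xx?? | wc -l)     # lines in all the pieces together
echo "exit status $status, $pieces files, $total lines, $first in xx00"
```

Comparing the total against the length of the input is a quick check that nothing was lost when the split stopped early.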
Worked example
Split /usr/dict/words into three files called
w0, w1 and w2, each
containing a roughly equal number of lines.
Solution: Use wc to count the lines
in /usr/dict/words, then arithmetic expansion to
calculate one-third and two-thirds of that number. Option -n 1
tells csplit to use a single digit in the file names,
giving w0 rather than w00.
$ LINES=$( wc -l < /usr/dict/words)
$ ONETHIRD=$(( $LINES / 3))
$ TWOTHIRDS=$(( $ONETHIRD + $ONETHIRD))
$ csplit -f w -n 1 /usr/dict/words $ONETHIRD $TWOTHIRDS
When performing wc we redirected the standard input
from the file /usr/dict/words; by doing that,
wc does not include the filename in its output. Had we
used wc -l /usr/dict/words it would have been
necessary to pipe the output to cut in order to
isolate the first field, as the output from wc would
have included the filename /usr/dict/words.
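Since a break at line n sends line n to the next file, splitting at exactly one-third of the length leaves the first file one line short. The following sketch is a variation on the solution above rather than a replacement for it: it adds 1 to each break point so that, for a file whose length is a multiple of three, the pieces come out exactly equal. It assumes seq is available; the file numbers and the lowercase variable names are hypothetical, and lowercase is used so as not to clash with the LINES variable that some shells maintain themselves.

```shell
# Work in a scratch directory.
workdir=$(mktemp -d)
cd "$workdir"

# Hypothetical input of 90 numbered lines.
seq 1 90 > numbers

# Redirecting standard input keeps the filename out of wc's output.
total=$(wc -l < numbers)
onethird=$(( total / 3 ))
twothirds=$(( 2 * onethird ))

# Break at the line AFTER each third, so each piece gets a full share.
csplit -s -f w -n 1 numbers $(( onethird + 1 )) $(( twothirds + 1 ))

a=$(wc -l < w0)    # lines 1 through 30
b=$(wc -l < w1)    # lines 31 through 60
c=$(wc -l < w2)    # lines 61 through 90
echo "w0: $a  w1: $b  w2: $c"
```

When the length is not a multiple of three, the last file simply takes the one or two lines left over.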
An argument to csplit can be a basic regular
expression enclosed between two / (slash) symbols, in
which case the file being split will be broken at the start of the
next line matching that expression.
Worked example
Split /usr/dict/words into two files, the first
containing all words commencing with characters up to and including
m, the second containing words commencing
n through z.
Solution: Use csplit with argument
'/^[Nn]/' indicating that /usr/dict/words
should be split at the start of the first line commencing either
N or n.
$ csplit /usr/dict/words '/^[Nn]/'
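The same split can be tried on a short list. The six words below are made up for the illustration and stand in for a sorted dictionary; wordlist is a hypothetical file name.

```shell
# Work in a scratch directory.
workdir=$(mktemp -d)
cd "$workdir"

# A tiny stand-in for a dictionary, one word per line.
printf '%s\n' apple Mango monkey Nectar night zebra > wordlist

# Break at the start of the first line commencing N or n.
csplit -s wordlist '/^[Nn]/'

before=$(wc -l < xx00)    # apple, Mango, monkey
after=$(wc -l < xx01)     # Nectar, night, zebra
echo "xx00: $before words, xx01: $after words"

# The matching line itself begins the second file.
head -n 1 xx01
```

The quotes around the expression stop the shell from treating the square brackets as a filename pattern.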
Consider the problem posed at the start of the section, namely
splitting a text file into paragraphs. The BRE that denotes a blank
line is ^$ and so if we have some such text in file X,
we might try:
$ csplit X '/^$/'
This will not work; it will split the file at the first blank
line only. Just as with number arguments, we can follow a BRE
argument to csplit by a number in braces to indicate
that the split should occur multiple times. If we don't know how
big X is, we must use option -k as
above:
$ csplit -k X '/^$/' {10000}
Create a small file containing a few paragraphs of text and try
this command.
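One way to set up that experiment is sketched below. The three paragraphs are invented for the purpose, and status is a hypothetical variable that records the exit status of csplit, which is expected to be non-zero once the blank lines run out.

```shell
# Work in a scratch directory.
workdir=$(mktemp -d)
cd "$workdir"

# Three short paragraphs separated by blank lines.
printf '%s\n' 'First paragraph.' '' \
    'Second paragraph,' 'two lines long.' '' \
    'Third paragraph.' > X

# Split at every blank line; -k keeps the files when the repeat
# count turns out to be larger than the number of blank lines.
status=0
csplit -s -k X '/^$/' {10000} || status=$?

pieces=$(ls xx?? | wc -l)    # number of files kept
first=$(wc -l < xx00)        # the first paragraph only
total=$(cat xx?? | wc -l)    # should equal the length of X
echo "$pieces files, $total lines in all, $first in xx00"
```

Each file after the first starts with the blank line that triggered the split, because the matching line always begins the next piece. GNU csplit also accepts {*} in place of a large count to mean "as many times as possible"; that form is an extension and may not be available elsewhere.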
Worked example
File book contains the text for a book, with each
of 10 chapters commencing with a line starting Chapter
... so:
Title: ...
...
Chapter 1: Introduction
...
Chapter 2: Getting started
...
Split this file into several files, called
chapter00, etc., one for each chapter.
Solution: Use csplit with option
-f (to denote the names of the split files), and split
at the start of each line commencing Chapter. The
split will need to be repeated an extra 9 times. Note that
chapter00 will hold the title material that precedes
the first chapter heading, so the chapters themselves are in
chapter01 through chapter10:
$ csplit -f chapter book '/^Chapter/' {9}
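A scaled-down version of this example can be run as follows. The book here is a made-up eight-line file with three chapters, so the split is repeated an extra 2 times rather than 9; the title, the third chapter heading and the body text are all invented for the illustration.

```shell
# Work in a scratch directory.
workdir=$(mktemp -d)
cd "$workdir"

# A hypothetical three-chapter book.
printf '%s\n' 'Title: A Short Book' 'Preface text' \
    'Chapter 1: Introduction' 'text of chapter 1' \
    'Chapter 2: Getting started' 'text of chapter 2' \
    'Chapter 3: Going further' 'text of chapter 3' > book

# One split per chapter heading: the pattern once, then twice more.
csplit -s -f chapter book '/^Chapter/' {2}

pieces=$(ls chapter?? | wc -l)
echo "$pieces files created"
head -n 1 chapter00    # the title material comes first
head -n 1 chapter01    # then each chapter begins with its heading
```

Because the repeat count matches the number of chapter headings exactly, option -k is not needed here.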