Introducing UNIX and Linux



Splitting a file according to context

We have already met split as a method of splitting a file into smaller units, and have indicated its use when mailing large text files. Another reason for splitting a file is if you know that the file contains separate identifiable portions. For instance, suppose you had a simple text file consisting of paragraphs of English separated by blank lines, and you wanted the paragraphs to be in separate files. The blank lines would identify where to break the file, and you can specify a blank line by means of the BRE, ^$.

The command csplit splits a file into sections where the sections either contain specified numbers of lines or are delimited by text that can be described by a basic regular expression. To start with a simple example,

csplit data 10

will take file data, and create two new files called xx00 and xx01. File xx00 will contain lines 1 through 9 of file data, and xx01 will contain line 10 up to the end of file data. File data remains unaltered. When each new file is created, csplit will print the size (in bytes) of that file on the standard output stream (this can be suppressed with option -s).

The first argument to csplit is the name of a file (or - (hyphen) for standard input), and the one or more arguments that follow indicate where the file is to be split. An argument that is a line number instructs a break to be made at the beginning of that line (hence in the previous example line 10 is sent to the second file). The new, smaller, files are named xxnn, where nn starts at 00 and counts upwards. If option -f is given followed by a string, that string will be used as the prefix instead of xx. Any number of arguments can follow the filename.
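The naming and line-number rules just described can be tried on a small generated file rather than a real data file. The following is only a sketch: the scratch directory, the 25-line file built with seq, and the prefix part are illustrative choices, and it assumes seq and mktemp are available on your system.

```shell
# Work in a scratch directory so the split files are easy to discard.
workdir=$(mktemp -d)
cd "$workdir" || exit 1

# Build a 25-line sample file in which line N holds the number N.
seq 1 25 > data

# Break the file at the start of line 10; -s suppresses the byte counts.
csplit -s data 10

# The same split again, using the prefix "part" instead of the default "xx".
csplit -s -f part data 10

# $(( ... )) strips the padding that some versions of wc add to the count.
xx00_lines=$(( $(wc -l < xx00) ))
xx01_lines=$(( $(wc -l < xx01) ))
xx01_first=$(head -n 1 xx01)
echo "xx00 has $xx00_lines lines; xx01 has $xx01_lines lines and starts with $xx01_first"
ls
```

Because each line of data holds its own line number, the first line of xx01 shows directly where the break was made.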

Worked example

Split /usr/dict/words into three files called words00, words01 and words02, the first two containing 10000 lines, the final one containing the rest of /usr/dict/words.
Solution: Use csplit to split the file with option -f to specify that the prefix is words. The next argument is /usr/dict/words, and this is to be split at lines 10001 and 20001.

csplit -f words /usr/dict/words 10001 20001

Now use wc to check that the files you have created are of the specified length:

wc words??

The three files you have created are fairly big, so don't forget to delete them.

If an argument to csplit is a number n and is followed immediately by an argument of the form {count}, then the file will be split at line n and then again every n lines, a further count times. Delete any previous xx files you have created, and try the following:

csplit -s /usr/dict/words 1000 {2}
wc -l xx??

You will see that /usr/dict/words has been split into four files. The first split is at the start of line 1000, so the first file is 999 lines long; the subsequent two splits each produce a file 1000 lines long. The final file xx03 contains the rest of /usr/dict/words.
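If /usr/dict/words is not installed on your system, the same pattern of file lengths can be seen on a small scale. This sketch assumes a 35-line file generated with seq (an illustrative stand-in) and splits it every 10 lines.

```shell
workdir=$(mktemp -d)
cd "$workdir" || exit 1
seq 1 35 > data

# Split at line 10, then repeat the split twice more (at lines 20 and 30).
# The braces are quoted so that no shell tries to interpret them.
csplit -s data 10 '{2}'

# Report the length of every piece.
for piece in xx??; do
    n=$(wc -l < "$piece")
    echo "$piece: $((n)) lines"
done
piece_count=$(( $(ls xx?? | wc -l) ))
echo "$piece_count files created"
```

As with the larger example, the first piece is one line short of the interval, the middle pieces are exactly the interval, and the last piece holds whatever remains.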

If you specify that the file be split at too many places, no split files will be created and an error message will be generated. For instance, to try to split /usr/dict/words into 50 files of 10000 lines each (which we clearly cannot do):

csplit /usr/dict/words 10000 {50}
82985
82982
csplit: {50} - out of range

82985 is the number of bytes in xx00, 82982 the number in xx01, and so on.

The reason for this behaviour is to encourage you to be aware of how you are splitting your files; csplit errs on the side of caution. Mistakes in the arguments to csplit could otherwise generate large numbers of unwanted split files, wasting valuable storage space. There are some instances, however, when this behaviour is undesirable, especially when the length of a file is not known in advance. If you give csplit option -k it will still warn you if you try to split the input file too many times, but it will keep the xx files it has created. So, to split /usr/dict/words into as many files as possible each containing (roughly) 5000 lines:

csplit -k /usr/dict/words 5000 {10000}
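The difference that -k makes can be checked by counting the files left behind after an over-ambitious split. This is a sketch under the assumption of a 35-line generated file and a repeat count of 100; the behaviour shown is what the text above describes, and the exact wording of the error message (discarded here) varies between systems.

```shell
workdir=$(mktemp -d)
cd "$workdir" || exit 1
seq 1 35 > data

# Ask for far more splits than a 35-line file allows, without -k.
# csplit is expected to fail and to leave no xx files behind.
if csplit -s data 10 '{100}' 2>/dev/null; then
    echo "unexpected: csplit reported success"
fi
without_k=$(( $(find . -name 'xx??' | wc -l) ))

# The same request with -k: the error is still reported (discarded here),
# but the pieces written before the error are kept.
csplit -s -k data 10 '{100}' 2>/dev/null || true
with_k=$(( $(find . -name 'xx??' | wc -l) ))

echo "files kept without -k: $without_k, with -k: $with_k"
```

Note that csplit may still return a failure status when -k is used, which is why the second command is followed by || true.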

Worked example

Split /usr/dict/words into three files called w0, w1 and w2, each containing a roughly equal number of lines.
Solution: Use wc to count the lines in /usr/dict/words, then arithmetic expansion to calculate one-third and two-thirds of that number.

LINES=$( wc -l < /usr/dict/words)
ONETHIRD=$(( $LINES / 3))
TWOTHIRDS=$(( $ONETHIRD + $ONETHIRD))
csplit -f w -n 1 /usr/dict/words $ONETHIRD $TWOTHIRDS

When performing wc we redirected the standard input from the file /usr/dict/words; by doing that, wc does not include the filename on its output. Had we used wc -l /usr/dict/words it would have been necessary to pipe the output to cut in order to isolate the first field, as the output from wc would have included the filename /usr/dict/words.
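The two ways of obtaining the line count can be compared on a small file. In this sketch the 30-line file and the lowercase variable names are illustrative, and awk is used in place of cut to isolate the first field, since some versions of wc pad the count with leading spaces that would confuse cut.

```shell
workdir=$(mktemp -d)
cd "$workdir" || exit 1
seq 1 30 > data

# Redirecting standard input means wc prints only the count.
total=$(( $(wc -l < data) ))

# Naming the file makes wc append the filename, so the count must be
# isolated; awk ignores any leading padding before the first field.
total_from_named=$(wc -l data | awk '{ print $1 }')

one_third=$(( total / 3 ))
two_thirds=$(( one_third + one_third ))

# -n 1 asks for single-digit suffixes, giving w0, w1 and w2.
csplit -s -f w -n 1 data "$one_third" "$two_thirds"
for piece in w0 w1 w2; do
    n=$(wc -l < "$piece")
    echo "$piece: $((n)) lines"
done
```

The three pieces differ in length by a line or two because each break is made at the start of the numbered line, which is why the split is only roughly equal.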

An argument to csplit can be a basic regular expression enclosed between two / (slash) symbols, in which case the file being split will be broken at the start of the next line matching that expression.

Worked example

Split /usr/dict/words into two files, the first containing all words commencing with characters up to and including m, the second containing words commencing n through z.
Solution: Use csplit with argument '/^[Nn]/' indicating that /usr/dict/words should be split at the start of the first line commencing either N or n.

csplit /usr/dict/words '/^[Nn]/'
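To see exactly where the break falls, the same command can be run on a short list. The six words below are an illustrative stand-in for /usr/dict/words.

```shell
workdir=$(mktemp -d)
cd "$workdir" || exit 1

# A tiny sorted word list standing in for /usr/dict/words.
printf '%s\n' apple Banana mango Nectarine nut zebra > words

# Break at the start of the first line that begins with N or n.
csplit -s words '/^[Nn]/'

echo "xx00:"
cat xx00
echo "xx01:"
cat xx01
```

The matching line itself becomes the first line of the second file, so xx00 ends with the last word before the match.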

Consider the problem posed at the start of the section, namely splitting a text file into paragraphs. The BRE that denotes a blank line is ^$, so if file X contains such text, we might try:

csplit X '/^$/'

This will not work; it will split the file at the first blank line only. Just as with number arguments we can follow a BRE argument to csplit by a number in braces, to indicate that the split should occur multiple times. If we don't know how big X is, we must use option -k as above:

csplit -k X '/^$/' {10000}

Create a small file containing a few paragraphs of text and try this command.
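One possible way to carry out this experiment is sketched below. The three paragraphs are invented for illustration, and the error message that csplit produces when it runs out of blank lines is discarded.

```shell
workdir=$(mktemp -d)
cd "$workdir" || exit 1

# Three short paragraphs separated by blank lines.
printf '%s\n' 'First paragraph, line one.' 'First paragraph, line two.' \
    '' 'Second paragraph.' '' 'Third paragraph.' > X

# Split at every blank line. The repeat count is deliberately too large, so
# csplit reports an error (discarded here) and -k keeps the pieces anyway.
csplit -s -k X '/^$/' '{10000}' 2>/dev/null || true

for piece in xx??; do
    echo "--- $piece"
    cat "$piece"
done
paragraph_files=$(( $(ls xx?? | wc -l) ))
echo "$paragraph_files files created"
```

Since each break is made at the start of the matching line, every file after the first begins with the blank line that preceded its paragraph.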

Worked example

File book contains the text for a book, with each of 10 chapters commencing with a line starting Chapter ... so:

Title: ...
 ...
Chapter 1: Introduction
 ...
Chapter 2: Getting started
 ...

Split this file into several files, called chapter00, etc., one for each chapter.
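Solution: Use csplit with option -f (to denote the names of the split files), and split at the start of each line commencing Chapter. The split will need to be repeated an extra 9 times. Note that chapter00 will hold whatever precedes the first Chapter line (here, the title material), so the ten chapters end up in chapter01 to chapter10: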

csplit -f chapter book '/^Chapter/' {9}
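The same technique can be tried on a miniature book. This sketch assumes an invented three-chapter file, so the split is repeated an extra 2 times rather than 9; the title and chapter text are placeholders.

```shell
workdir=$(mktemp -d)
cd "$workdir" || exit 1

# A miniature book: a title line followed by three short chapters.
printf '%s\n' 'Title: A Short Book' \
    'Chapter 1: Introduction' 'Some text.' \
    'Chapter 2: Getting started' 'More text.' \
    'Chapter 3: Going further' 'Final text.' > book

# Split at the first Chapter line, then at the next two.
csplit -s -f chapter book '/^Chapter/' '{2}'

# Show the first line of each piece to confirm where the breaks fell.
for piece in chapter??; do
    echo "$piece: $(head -n 1 "$piece")"
done
chapter_files=$(( $(ls chapter?? | wc -l) ))
```

Four files result from three chapters, because chapter00 receives the lines that come before the first line commencing Chapter.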


Copyright © 2002 Mike Joy, Stephen Jarvis and Michael Luck