Filtering files
Text files - especially ones containing 'raw data' - often
contain repeated lines. It is sometimes useful to know either how
often this occurs, or to filter out the repeated occurrences. The
command uniq ('unique') is provided for this purpose.
For instance, supposing file A contains
aaa
bbb
bbb
bbb
bbb
bbb
ccc
ccc
aaa
ddd
then the following dialogue might take place:
$ uniq A
aaa
bbb
ccc
aaa
ddd
$ uniq -c A
1 aaa
5 bbb
2 ccc
1 aaa
1 ddd
With no options, uniq simply filters out
consecutive repeated lines; option -c ('count')
prefixes each line of output with a count of the number of times
that line occurred. Option -d ('duplicate') causes
uniq to write out only those lines that are repeated, and
-u ('unique') to write out only those lines that are not
repeated consecutively. Thus:
$ uniq -d A
bbb
ccc
$ uniq -u A
aaa
aaa
ddd
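On many systems (though not all) these options can be combined; for
instance, -c together with -d counts only the repeated lines:
$ uniq -cd A
5 bbb
2 ccc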
Another common situation arises when you have two or more files,
containing what can be thought of as columns in a table. You
require corresponding lines from the files to be concatenated so as
to actually produce a table. Using the command paste
will achieve this - corresponding lines of its arguments are joined
together separated by a single TAB character. For example,
suppose file A contains
hello
Chris
and file B contains
there
how are you?
then the following dialogue can take place:
$ paste A B
hello there
Chris how are you?
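If a TAB is not a suitable separator, paste accepts option
-d followed by a different delimiter character; for example,
using a comma:
$ paste -d',' A B
hello,there
Chris,how are you?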
Both paste and uniq are of use only in limited
situations, but when they do apply they save a great deal of time
that would otherwise be spent editing files.
Sometimes, when dealing with files that are presented in a rigid
format, you may wish to select character columns from such a file.
The utility cut provides a very simple method for extracting
columns. Suppose we have a file myfile containing the
following data (dates of birth and names):
17.04.61 Smith Fred
22.01.63 Jones Susan
03.11.62 Bloggs Zach
We can choose the years from each line by selecting character
columns 7 to 8, thus:
$ cut -c7-8 myfile
61
63
62
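The range given to -c may also be left open-ended;
cut -c10- selects from column 10 through to the end of each
line, giving the names alone:
$ cut -c10- myfile
Smith Fred
Jones Susan
Bloggs Zach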
This command can also work in terms of
fields (where a line is thought of as divided into
fields separated by a known delimiter). To select the family names
from myfile (Smith, Jones and Bloggs), we could use
cut -f2 -d' ' myfile, which selects
field number 2, the delimiter (option
-d) being the space character:
$ cut -f2 -d' ' myfile
Smith
Jones
Bloggs
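Several fields can be selected at once by giving -f a
comma-separated list of field numbers:
$ cut -f2,3 -d' ' myfile
Smith Fred
Jones Susan
Bloggs Zach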
Related to cut is fold;
cut assumes that you want the same number of lines
in the output as in the input, but wish to select part of each
input line. On the other hand, fold assumes that you
want all of your input, but that your output needs to fit within
lines of some maximum width - for example, if you had a file with
some very long lines in it that you needed to print on a fairly
narrow printer. The action performed by fold
is to copy its standard input, or any files named as arguments, to
standard output, except that whenever a line longer than a certain
number of characters (80 by default) is met, Newline characters are
inserted so that no output line exceeds that length. With option
-w ('width') followed by a number, that number is
taken to be the maximum length of output lines rather than 80. Try
the following:
$ fold -w 15 <<END
Let's start
with three
short lines
and finish with an extremely long one with lots of words
END
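Since fold breaks lines at exactly the given width, paying no
attention to word boundaries, you should see output like this:
Let's start
with three
short lines
and finish with
 an extremely l
ong one with lo
ts of words
(Many versions of fold also accept option -s, which moves
each break back to the last space where possible.)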
For more sophisticated processing of files divided into records
and fields we can use Awk (see the chapter on Awk later).
Another exceptionally useful command is sort, which
sorts its input into alphabetical order line-by-line. It has many
options, and can sort on a specified field of the input rather than
the first, or numerically (option -n, 'numerical')
rather than alphabetically. So using file
A above, we could have:
$ sort A
aaa
aaa
bbb
bbb
bbb
bbb
bbb
ccc
ccc
ddd
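To sort on a field other than the first, most modern versions of
sort take option -k followed by the field number. For
instance, to sort myfile (from the discussion of cut
above) by family name, which is field 2:
$ sort -k2 myfile
03.11.62 Bloggs Zach
22.01.63 Jones Susan
17.04.61 Smith Fred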
A feature of uniq is that it will only filter out
repeated lines if they are consecutive; if we wish each line that
occurs in a file to be displayed once and only once, we can first
of all sort the file into order and then use uniq:
$ sort A | uniq
aaa
bbb
ccc
ddd
This has the same effect as using sort with option
-u, which we have already mentioned.
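You can check the equivalence directly:
$ sort -u A
aaa
bbb
ccc
ddd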
Worked example
Find out how many separate inodes are represented by the files
(excluding 'dot' files) in the current directory.
Solution: Using ls -i1 we can list
the files, one per line, each preceded by its inode number. Piping the
output into cut we can isolate the first six character
columns, which contain the inode number, and then sort with
option -u, which will sort these into order and remove
all duplicates. Finally, the number of lines of output, and hence
the number of distinct inodes, can be counted using wc -l:
$ ls -i1 | cut -c1-6 | sort -u | wc -l
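The output is a single number. Note that this solution assumes the
inode numbers occupy no more than the first six character columns of
the ls output; on a system with longer inode numbers, the range given
to cut would need to be widened accordingly.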