Grep
We have defined regular expressions; in order to use them, we
begin with a utility called Grep. The function of Grep is to select
lines from its input (either standard input or named files given as
arguments) that match a BRE, normally given as the first argument to
grep. The BRE is known as a script.
Those lines of input that match the BRE are then copied to standard
output. For instance, to print out all words ending in
ise or ize from
/usr/dict/words, you could use:
$ grep 'i[sz]e$' /usr/dict/words
Note that single quotes are needed here because $ appears in
the BRE.
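The effect of the $ anchor can be checked on a small word list. The following is only a sketch: the temporary file and the five words in it are made up for illustration and stand in for /usr/dict/words, which may not exist on every system.

```shell
# Hypothetical sample data standing in for /usr/dict/words.
words=$(mktemp)
printf '%s\n' advise prize risen size wisely > "$words"

# $ anchors the match to the end of the line, so only words
# ending in ise or ize are selected; risen and wisely contain
# "ise" but not at the end, so they are skipped.
matches=$(grep 'i[sz]e$' "$words")
printf '%s\n' "$matches"

rm -f "$words"
```

With this sample input, only advise, prize and size should be printed.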
With option -E, grep will use EREs
instead of BREs. With option -F, grep
uses only fixed strings: there are no regular
expressions, and the string given as argument to grep is
matched against the input exactly as it appears. With option
-c, a count of the number of matched lines is displayed
instead of the matched lines being copied to standard
output.
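The difference between these options is easiest to see by running them over the same input. The sketch below assumes a made-up four-line file; the patterns are chosen only to show how + and . are interpreted in each mode.

```shell
# Hypothetical sample input, made up for illustration.
sample=$(mktemp)
printf '%s\n' 'a+b' 'aab' 'ab' 'a.b' > "$sample"

# BRE (default): + is an ordinary character, so this matches
# only the line containing the literal text a+b.
bre=$(grep 'a+b' "$sample")

# ERE (-E): a+ means "one or more a", so aab and ab match.
ere=$(grep -E '^a+b$' "$sample")

# Fixed string (-F): . is matched literally, not as "any character".
fixed=$(grep -F 'a.b' "$sample")

# -c prints the number of matching lines rather than the lines;
# as a BRE, a.b matches a+b, aab and a.b.
count=$(grep -c 'a.b' "$sample")

printf 'BRE: %s\nERE: %s\nFixed: %s\nCount: %s\n' \
    "$bre" "$(echo $ere)" "$fixed" "$count"

rm -f "$sample"
```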
Worked example
How many words in /usr/dict/words begin with a
vowel?
Solution: Use grep with option
-c to select and then count lines beginning with
upper-case or lower-case vowels. The BRE contains a list of all
such vowels, preceded by a ^ to indicate that the vowel must be
at the start of each word:
$ grep -c '^[AEIOUaeiou]' /usr/dict/words
On some systems separate commands egrep ('Extended
GREP') and fgrep ('Fixed GREP') are used instead of
grep -E and grep -F.
Option -i ('insensitive') causes grep
to ignore the case of letters when checking for matches, and
overrides any explicit specification regarding upper-case and
lower-case letters in the regular expression. Thus a solution to
the previous worked example could be:
$ grep -ci '^[aeiou]' /usr/dict/words
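That the two forms of the command give the same count can be checked on a short list. This is a sketch only; the six words are invented sample data rather than the contents of /usr/dict/words.

```shell
# Hypothetical word list, made up for illustration.
words=$(mktemp)
printf '%s\n' Apple banana Egg igloo under zebra > "$words"

# Explicit list of upper-case and lower-case vowels.
explicit=$(grep -c '^[AEIOUaeiou]' "$words")

# Lower-case vowels only, with -i making the match case-insensitive.
insensitive=$(grep -ci '^[aeiou]' "$words")

echo "explicit: $explicit, with -i: $insensitive"

rm -f "$words"
```

Both counts should be 4 for this input (Apple, Egg, igloo and under).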
With option -f ('file') followed by a filename, the
regular expressions contained in that file (one per line) are used
instead of one given as an argument to grep. If the file
contains more than one regular expression, then Grep selects lines
that match any of the REs in the file. This is the preferred method
when lines are to be selected by any one of several alternative
patterns.
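As a sketch of how a pattern file behaves, the following uses two temporary files whose names and contents are made up for illustration: one holds two BREs, one per line, and the other holds the input to be searched.

```shell
# Hypothetical pattern file: one BRE per line.
patterns=$(mktemp)
printf '%s\n' '^a' 'z$' > "$patterns"

# Hypothetical input file.
input=$(mktemp)
printf '%s\n' apple buzz cherry jazz avocado > "$input"

# A line is selected if it matches ANY pattern in the file:
# here, lines starting with a or ending with z.
selected=$(grep -f "$patterns" "$input")
printf '%s\n' "$selected"

rm -f "$patterns" "$input"
```

Lines are printed in the order they occur in the input, so cherry is the only word omitted here.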
The 'reverse behaviour', namely displaying those lines not
matching the RE specified, can be enabled with option
-v ('inVert'). This is often simpler than constructing
a new regular expression. A FORTRAN programmer, for example, might
find this useful. In the computer language
FORTRAN, any line starting with a C is treated as a
comment; if you were examining such a program, and wished to search
for lines of code containing some identifier, and were not
interested in the lines of comments, you might wish to use
grep -v '^C'
to strip out the comments to begin with.
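Used as a filter, the command can feed a second grep that searches what is left. The sketch below assumes a four-line program fragment and an identifier TOTAL, both invented for illustration.

```shell
# Hypothetical FORTRAN fragment; the identifier TOTAL is made up.
prog=$(mktemp)
cat > "$prog" <<'EOF'
C     Compute the sum
      TOTAL = 0
C     TOTAL is printed below
      PRINT *, TOTAL
EOF

# Count every line mentioning TOTAL, comments included.
all_lines=$(grep -c 'TOTAL' "$prog")

# Strip comment lines first with -v, then count what remains.
code_lines=$(grep -v '^C' "$prog" | grep -c 'TOTAL')

echo "all: $all_lines, code only: $code_lines"

rm -f "$prog"
```

Here one of the three lines mentioning TOTAL is a comment, so the second count is smaller by one.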
If grep is given several files as arguments, option
-l ('list') displays a list of those files containing
a matching line, rather than those lines themselves.
Worked example
Suppose you have saved many mail messages in files in the
current directory, and you want to check which file or files
contain messages whose subject is something to do with
'examinations'. Each mail message contains a line beginning with
the string Subject: followed by the subject of the
message (if any).
Solution: We require grep -l followed
by a BRE followed by * to list the filenames. The following lines
might occur as the 'subject' lines of the messages:
Subject: Examinations
Subject: examinations
Subject: NEXT MONTH'S EXAMS
Subject: Exams
These all have a common string, namely exam, in
upper-case or lower-case (or a mixture of cases). So, to match
these lines, a BRE is required to recognise Subject:
at the start of the line, followed by some characters (possibly
none), followed by exam in any mixture of cases. The
Subject: at the start of the line is matched by
^Subject: and .* matches the characters
between that and exam. In order to ensure that the
cases of the letters in exam do not matter, you can
either explicitly match them with [Ee][Xx][Aa][Mm], or
you can instruct grep to be 'case-insensitive' with
option -i. The following two solutions would be
acceptable:
grep -l '^Subject: .*[Ee][Xx][Aa][Mm]' *
grep -li '^Subject: .*exam' *
Note that this is not an infallible solution. It will also
select files with subjects related to counterexamples
and hexameters, and will not find a file whose subject
misspells the word, such as exminations. When using UNIX tools to process data
from electronic mail or other documents containing English text,
you must be conscious of human fallibility. Some solutions will of
necessity be approximate.
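The behaviour of the second solution, including its limitations, can be tried out on a few invented messages. Everything in this sketch is hypothetical: the temporary directory, the file names msg1 to msg4 and the message contents.

```shell
# Hypothetical mail messages in a temporary directory.
maildir=$(mktemp -d)
printf 'Subject: Exams\n\nSee the timetable.\n' > "$maildir/msg1"
printf 'Subject: Counterexamples in topology\n\nBook list.\n' > "$maildir/msg2"
printf 'Subject: Exminations next week\n\nMisspelt subject.\n' > "$maildir/msg3"
printf 'Subject: Lunch\n\nThe exam hall is booked.\n' > "$maildir/msg4"

# -l lists the names of files containing a matching line;
# -i ignores case. The search runs inside the directory so that
# * expands to the bare file names.
listed=$(cd "$maildir" && grep -li '^Subject: .*exam' *)
printf '%s\n' "$listed"

rm -rf "$maildir"
```

For these files, msg1 is found as intended, msg2 is an unwanted match because counterexamples contains exam, msg3 is missed because of the misspelling, and msg4 is correctly ignored since exam appears only in the body.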