4.18. Introducing Characters: What's the Word? Part II




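We first tried writing a regular expression for grep to search for the word “book”, and ended up with the following expression (note the leading and trailing space):

 book.* 

This expression is fairly simple: it matches a space, followed by the string “book”, followed by any number of characters, followed by a space. However, it does not match every possible occurrence, and it does match some unwanted words.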

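The test file below contains numerous occurrences of “book”. We have added a marker, which is not part of the file, to show whether each input line should be a “hit” (>) and included in the output, or a “miss” (<). We have tried to include as many different examples as possible.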

$ cat bookwords


> This file tests for book in various places, such as
> book at the beginning of a line or
> at the end of a line book
> as well as the plural books and
< handbooks. Here are some
< phrases that use the word in different ways:
> "book of the year award"
> to look for a line with the word "book"
> A GREAT book!
> A great book? No.
> told them about (the books) until it
> Here are the books that you requested
> Yes, it is a good book for children
> amazing that it was called a "harmful book" when
> once you get to the end of the book, you can’t believe
< A well-written regular expression should
< avoid matching unrelated words,
< such as booky (is that a word?)
< and bookish and
< bookworm and so on.

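A search for the word “book” should match 13 lines and should not match the other 7. First, let's run the previous regular expression on the sample file and check the results: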

$ grep ' book.* ' bookwords
This file tests for book in various places, such as
as well as the plural books and
A great book? No.
told them about (the books) until it
Here are the books that you requested
Yes, it is a good book for children
amazing that it was called a "harmful book" when
once you get to the end of the book, you can't believe
such as booky (is that a word?)
and bookish and

It prints only 8 of the 13 lines that we want to match, and it prints 2 of the lines that we don't want to match.
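The hit and miss markers make this kind of check easy to automate. The following is a minimal sketch, not part of the original text: the function name `score_pattern` is hypothetical, and it assumes a marked copy of the input on standard input, where each line starts with `> ` (should match) or `< ` (should not match).

```shell
#!/bin/sh
# Hypothetical helper: score a grep pattern against marked input on stdin.
# Each input line starts with "> " (a hit) or "< " (a miss).
# Usage: score_pattern PATTERN < marked_file
score_pattern() {
    pattern=$1
    hits=0
    false_hits=0
    while IFS= read -r line; do
        mark=${line%"${line#?}"}   # first character: ">" or "<"
        text=${line#??}            # drop the marker and the space after it
        if printf '%s\n' "$text" | grep -q -- "$pattern"; then
            case $mark in
                '>') hits=$((hits + 1)) ;;
                '<') false_hits=$((false_hits + 1)) ;;
            esac
        fi
    done
    echo "hits=$hits false_hits=$false_hits"
}

# Three marked lines taken from the test file above.
printf '%s\n' \
    '> A great book? No.' \
    '> A GREAT book!' \
    '< and bookish and' |
    score_pattern ' book.* '
```

On this three-line sample the pattern finds one of the two hits and wrongly matches the one miss, which is the same kind of failure seen in the full run.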

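The expression matches the lines containing the words “booky” and “bookish”. It ignores “book” at the beginning or end of a line. It also ignores “book” when certain punctuation marks are involved. Besides a space, a word may be followed by any of these punctuation marks:

? . , ! ; : '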

In addition, quotation marks, parentheses (), braces {}, and square brackets [] may surround a word, or open or close with one.

" () {} []


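You therefore end up with two different character classes: one for before the word and one for after it. Remember that all we need to do is list the members of the class inside square brackets []. Before the word, we now have:

["[{(]

and after the word:

[]})"?!.,;:'s]

Note that placing the closing square bracket ] as the first character in the class makes it a member of the class instead of closing the set. The trailing class also includes s, to allow for plural and possessive forms. Putting the two classes together, with a space on either side, we get this expression:

 ["[{(]*book[]})"?!.,;:'s]* 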
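The rule about a leading ] can be checked in isolation. This is a small sketch for illustration only; the function name `first_bracket_demo` and the sample lines are made up and do not come from the test file.

```shell
#!/bin/sh
# A "]" placed first in a bracket expression is an ordinary member of the
# class, so '[]}]' matches either "]" or "}".
# first_bracket_demo is a hypothetical name used only for this illustration.
first_bracket_demo() {
    grep '[]}]'
}

printf '%s\n' 'ends with ]' 'ends with }' 'no closer here' | first_bracket_demo
```

Only the first two sample lines are printed, because the third contains neither ] nor }.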



$ grep " [\"[{(]*book[]})\"?\!.,;:'s]* " bookwords
This file tests for book in various places, such as
as well as the plural books and
A great book? No.
told them about (the books) until it
Here are the books that you requested
Yes, it is a good book for children
amazing that it was called a "harmful book" when
once you get to the end of the book, you can't believe

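Note: the ! is escaped with a backslash here. In an interactive shell with history expansion enabled, such as zsh or bash, an unescaped ! inside double quotes triggers history expansion. No escape is needed in a script or inside single quotes.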


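The unwanted lines are gone, but five of the 13 hit lines are still missing:

book at the beginning of a line or
at the end of a line book
"book of the year award"
to look for a line with the word "book"
A GREAT book!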

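All of these problems are caused by the string appearing at the beginning or end of a line. Because there is no space at the beginning or end of a line, the pattern is not matched. We can use the positional metacharacters ^ and $. Because we want to match either a space or the beginning or end of a line, we can use egrep (equivalent to grep -E) and specify the “or” metacharacter, |, together with parentheses () for grouping. For example, to match either the beginning of a line or a space, you could write the expression: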

(^| )

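(Because the vertical bar | and parentheses () are part of the extended set of metacharacters, if you were using sed you would have to write different expressions to handle each case.)

Here is the revised regular expression:

(^| )["[{(]*book[]})"?!.,;:'s]*( |$)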


$ egrep "(^| )[\"[{(]*book[]})\"?\!.,;:'s]*( |$)" bookwords
This file tests for book in various places, such as
book at the beginning of a line or
at the end of a line book
as well as the plural books and
"book of the year award"
to look for a line with the word "book"
A GREAT book!
A great book? No.
told them about (the books) until it
Here are the books that you requested
Yes, it is a good book for children
amazing that it was called a "harmful book" when
once you get to the end of the book, you can't believe


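You could also create a simple shell script that replaces “book” with a command-line argument. The only problem might be that the plural of some words is not simply “s”. As a workaround, you could handle the “es” plural by adding “e” to the character class that follows the word; this would work in many cases.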
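A script along those lines might look like the following minimal sketch. The function name `match_word` is hypothetical, and the sketch assumes that the word passed in contains no regular-expression metacharacters. It uses grep -E, which is equivalent to egrep.

```shell
#!/bin/sh
# match_word WORD [FILE...]
# Print the lines in which WORD appears as a whole word, optionally wrapped
# in quotes or brackets and followed by punctuation or a plural ending.
# Hypothetical helper: assumes WORD contains no regex metacharacters.
match_word() {
    word=$1
    shift
    # Same shape as the egrep expression above, with "e" added to the
    # trailing class so that simple "es" plurals also match.
    grep -E "(^| )[\"[{(]*${word}[]})\"?!.,;:'es]*( |\$)" "$@"
}

# With no FILE argument, grep reads standard input.
printf '%s\n' 'two boxes here' 'a boxer' 'the box.' | match_word box
```

On these made-up sample lines, “boxes” and “box.” are matched while “boxer” is not. With the test file, the call would be `match_word book bookwords`. No backslash is needed before the ! here, because history expansion is not active in a script.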

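As a further note, the ex and vi text editors have a special metacharacter, \<, for matching a string at the beginning of a word, and another, \>, for matching a string at the end of a word. Used as a pair, they match a string only when it is a complete word. (For these operators, a word is a string of non-whitespace characters with whitespace on both sides, or at the beginning or end of a line.) Matching a whole word is such a common case that these metacharacters would certainly be widely used if they were available in all regular expressions.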

GNU programs, such as the GNU versions of awk, sed, and grep, all support \< and \>.
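As a quick illustration, the word-boundary operators replace both character classes and the (^| ) and ( |$) groups. This sketch assumes GNU grep, since \< and \> are not part of POSIX regular expressions; the function name `whole_word_book` is hypothetical, and the sample lines are taken from the test file.

```shell
#!/bin/sh
# Print the lines of standard input that contain "book" or "books" as a
# complete word. Assumes GNU grep, which supports \< and \>.
whole_word_book() {
    grep '\<books*\>'
}

printf '%s\n' \
    'A GREAT book!' \
    'handbooks. Here are some' \
    'such as booky (is that a word?)' \
    'told them about (the books) until it' |
    whole_word_book
```

Only the first and last sample lines are printed. Punctuation such as ! and ) ends a word for GNU grep, so no explicit punctuation class is needed.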

In vi, press / to enter search mode, then type the pattern, for example:

\<book\>


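This finds one match at a time; press n to move to the next match. To display all matching lines at once, one option is the ex global command, for example:

:g/\<book\>/p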



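For more on search matching, see my blog post “vi 中一次性显示出所有搜索的匹配” (showing all search matches at once in vi) on the 开发者工具论坛 (developer tools forum).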

In addition, when using gsed (GNU sed), \b can be used in place of \< and \>, which is more convenient to type.
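A minimal sketch of the \b form follows. It assumes GNU sed, which is installed as gsed on some systems and as plain sed on most Linux distributions; the function name `word_boundary_book` is hypothetical.

```shell
#!/bin/sh
# Print only the lines that contain "book" as a complete word.
# Assumes GNU sed: \b (word boundary) is a GNU extension.
# Replace "sed" with "gsed" where GNU sed is installed under that name.
word_boundary_book() {
    sed -n '/\bbook\b/p'
}

printf '%s\n' 'a good book for children' 'and bookish and' | word_boundary_book
```

Only the first sample line is printed, because “bookish” has no word boundary after “book”.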


This article was first published on LearnKu.com.

