Regular expression operations in Python 2.7
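Python supports regular expressions through the re module, which ships with Python as part of the standard library.
Note, however, that re is not a full PCRE implementation: it supports only the core, Perl-style regular expression syntax (which is enough for everyday use).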
Introduction
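Unlike the official documentation of the re module, this article does not walk through the whole library; instead it approaches regular expressions from the point of view of a Python developer using them. You can of course read the documentation as well, at the following address: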
https://docs.python.org/2/library/re.html
Main text
Raw Python strings
When writing regular expressions in Python, it is recommended that you use raw strings instead of regular Python strings. Raw strings begin with a special prefix (r) and signal Python not to interpret backslashes as escape sequences in the string, allowing you to pass them through directly to the regular expression engine.
This means that a pattern like "\n\w" will not be interpreted by Python and can be written as r"\n\w" instead of "\\n\\w" as in other languages, which is much easier to read.
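The difference can be checked directly. The following is a minimal sketch; the variable names and the sample input are made up for illustration.

```python
import re

# In a regular string, "\n" is a single newline character.
normal = "\n"
# In a raw string, r"\n" is two characters: a backslash and the letter n.
raw = r"\n"

# Without a raw string, every backslash meant for the regex must be doubled.
escaped = "\\n\\w"
raw_pattern = r"\n\w"
same_pattern = (escaped == raw_pattern)  # both spell the same four characters

# The regex engine itself understands \n, so the raw pattern still
# matches a newline followed by a word character.
match = re.search(raw_pattern, "line one\nline two")
```

Here same_pattern is True, and match covers the newline plus the first letter of the second line.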
Matching a string
The re module has a number of top-level functions. To test whether a regular expression matches a specific string in Python, you can use re.search(). This function returns None if the pattern doesn't match, or a re.MatchObject with additional information about which part of the string matched.
Note that this function stops after the first match, so it is better suited to testing a regular expression than to extracting data.
matchObject = re.search(pattern, input_str, flags=0)
Example:
import re
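A minimal sketch of re.search() in use. The input string and the date pattern are hypothetical and chosen only for illustration.

```python
import re

input_str = "Order 66 shipped on 2015-03-14"   # hypothetical sample text
pattern = r"(\d{4})-(\d{2})-(\d{2})"            # hypothetical date pattern

match = re.search(pattern, input_str)
if match is not None:
    whole = match.group(0)      # the full matched text
    year = match.group(1)       # the first capture group
    start, end = match.span()   # where the match sits in input_str
else:
    whole = year = None

# When nothing matches, re.search() returns None rather than raising.
no_match = re.search(pattern, "no dates here")
```

Checking the result against None before calling group() avoids an AttributeError on inputs that do not match.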
Capturing groups
Unlike re.search() above, re.findall() performs a global search over the whole input string. If the pattern contains a single capture group, it returns a list of the captured text; with several capture groups, it returns a list of tuples, one per match. Without capture groups it returns a list of the matches themselves. In every case the list is empty if no matches are found.
If you need additional context for each match, you can use re.finditer(), which instead returns an iterator of re.MatchObjects to walk through. Both functions take the same parameters.
As follows:
matchList = re.findall(pattern, input_str, flags=0)
and:
matchIterator = re.finditer(pattern, input_str, flags=0)
Example:
import re
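A minimal sketch of how the return value of re.findall() changes with the number of capture groups, and of what re.finditer() adds. The addresses are invented sample data.

```python
import re

input_str = "alice@old.example, bob@old.example, carol@new.example"

# No capture groups: a list of the full matches.
addresses = re.findall(r"\w+@\w+\.example", input_str)

# One capture group: a list of the captured text only.
users = re.findall(r"(\w+)@\w+\.example", input_str)

# Several capture groups: a list of tuples, one tuple per match.
pairs = re.findall(r"(\w+)@(\w+)\.example", input_str)

# finditer() yields match objects, so positions are available too.
starts = [m.start() for m in re.finditer(r"\w+@", input_str)]

# No digits in the input, so the result is an empty list.
nothing = re.findall(r"\d+", input_str)
```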
Finding and replacing strings
Another common task is to find and replace part of a string using regular expressions, for example to replace all instances of an old email domain, or to swap the order of some text. You can do this in Python with the re.sub() function.
The optional count argument is the maximum number of replacements to make in the input string; if it is omitted or zero, every match in the string is replaced.
replacedString = re.sub(pattern, replacement_pattern, input_str, count=0, flags=0)
Example:
import re
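A minimal sketch of the two tasks mentioned above, replacing an old email domain and swapping the order of some text. The domains and strings are made up for illustration.

```python
import re

input_str = "alice@old.example, bob@old.example"

# Replace every occurrence of the old domain (count defaults to 0, meaning all).
all_replaced = re.sub(r"@old\.example", "@new.example", input_str)

# Replace only the first occurrence.
first_only = re.sub(r"@old\.example", "@new.example", input_str, count=1)

# Swap two words by referring back to the capture groups as \1 and \2.
swapped = re.sub(r"(\w+) (\w+)", r"\2 \1", "hello world")
```

The replacement string is written as a raw string as well, so that \1 and \2 reach re.sub() as group references.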
re Flags
In the Python regular expression functions above, you will notice that each of them also takes an optional flags argument. Most of the available flags are a convenience and can be written directly into the regular expression itself (for example, (?i) for re.IGNORECASE), but some can be useful in certain cases.
- re.IGNORECASE makes the pattern case insensitive so that it matches strings of different capitalizations
- re.MULTILINE makes the start and end metacharacters (^ and $ respectively) match at the beginning and end of each line instead of only at the beginning and end of the whole input string; it matters when your input string contains newline characters (\n)
- re.DOTALL allows the dot (.) metacharacter to match all characters, including the newline character (\n)
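The effect of each flag can be seen on a small two-line input. This is a minimal sketch with an invented sample string.

```python
import re

text = "first line\nSecond line"

# IGNORECASE: the lowercase pattern matches "Second".
ci = re.search(r"second", text, flags=re.IGNORECASE)

# Without MULTILINE, ^ only matches at the very start of the string.
starts_default = re.findall(r"^\w+", text)
# With MULTILINE, ^ also matches right after each newline.
starts_multiline = re.findall(r"^\w+", text, flags=re.MULTILINE)

# Without DOTALL, the dot cannot cross the newline.
dot_default = re.search(r"first.*line", text).group(0)
# With DOTALL, the dot matches the newline too.
dot_all = re.search(r"first.*line", text, flags=re.DOTALL).group(0)

# The same case-insensitive flag written inline in the pattern.
inline = re.search(r"(?i)second", text)
```

Several flags can be combined with the bitwise OR operator, for example re.IGNORECASE | re.MULTILINE.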
Compiling a pattern for performance
In Python, building a regular expression from a pattern string on every call adds overhead when matching many strings, so it is recommended that you compile the pattern once if you will be testing or extracting information from many input strings using the same expression. re.compile() returns a re.RegexObject.
regexObject = re.compile(pattern, flags=0)
The returned object has exactly the same methods as above, except that they take the input string and no longer require the pattern or flags for each call.
import re
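A minimal sketch of compiling a pattern once and reusing it across many inputs. The date pattern and the list of lines are hypothetical.

```python
import re

# Compile once; any flags would be passed here rather than on each call.
date_re = re.compile(r"(\d{4})-(\d{2})-(\d{2})")

lines = ["created 2015-03-14", "no date", "updated 2016-01-02"]

# Methods on the compiled object take the input string only.
years = []
for line in lines:
    m = date_re.search(line)
    if m:
        years.append(m.group(1))

all_dates = date_re.findall("2015-03-14 and 2016-01-02")
masked = date_re.sub("<date>", "due 2015-03-14")
```

Keeping the compiled object in a module-level name also gives the pattern a readable label at each call site.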
A handy tool for testing your expressions online:
https://regex101.com/#python
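Closing
This article has covered, with a fair number of explanations and demonstrations, how to use regular expressions in Python 2.7.
If the introduction above is not enough, further references are given below (the official documentation is also worth consulting):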