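Java already ships with a mature package for regular expressions; you only need to import it.
The official Oracle Java Tutorials have a dedicated chapter on the topic, and it is very detailed.
Personally I find it too long-winded, so this post covers only the main (core) parts of java.util.regex.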
Introduction
The java.util.regex package mainly involves the following classes:
- Pattern
- Matcher
- PatternSyntaxException
It is mainly used for two things:
- Match String
- Capture Group
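Note that java.util.regex is part of the standard library. Its syntax is close to that of Perl and PCRE, but it is not fully compatible with them, so a pattern borrowed from another engine should be tested before use.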
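Of the three classes listed above, PatternSyntaxException is the one this post does not come back to, so here is a minimal sketch of where it shows up. The class name `SyntaxErrorDemo` and the helper `isValidRegex` are made up for illustration; only `Pattern.compile` and the exception's getters come from the standard library.

```java
import java.util.regex.Pattern;
import java.util.regex.PatternSyntaxException;

// Hypothetical demo class: shows that a malformed pattern fails at compile time.
final class SyntaxErrorDemo {

    // Returns true if the regex compiles, false if Pattern.compile rejects it.
    static boolean isValidRegex(String regex) {
        try {
            Pattern.compile(regex);
            return true;
        } catch (PatternSyntaxException e) {
            // The exception carries the offending pattern, a description, and an index.
            System.out.println("Bad pattern \"" + e.getPattern() + "\": "
                    + e.getDescription() + " (index " + e.getIndex() + ")");
            return false;
        }
    }

    public static void main(String[] args) {
        System.out.println(isValidRegex("a*b"));       // a well-formed pattern
        System.out.println(isValidRegex("(unclosed")); // missing closing parenthesis
    }
}
```

PatternSyntaxException is unchecked, so the compiler will not force you to catch it; catching it is mostly useful when the pattern comes from outside the program.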
Main text
For the details of the pattern syntax, see:
regular expression string patterns
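Put simply, the one peculiarity is that a regular expression written in Java source code must have its backslashes escaped.
The reason is that regex constructs such as \w begin with a backslash, but the backslash is also the escape character of Java string literals. It therefore has to be escaped itself, turning it into an ordinary character, before the symbol that follows can reach the regex engine as regex syntax.
For example, the regex \w has to be written as "\\w" in a Java string literal (effectively escaping the backslash with itself).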
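This is only necessary when hard-coding patterns into Java code: strings that are read from user input or from files arrive character by character, and escape sequences in them are not interpreted. (In other words, "\\" is needed only in string literals; a pattern read from a file does not need it.)
A common way around the problem is to keep the patterns in a resource file, where they are easier to read and understand. One caveat: a .properties file loaded through java.util.Properties applies its own backslash escaping, so backslashes still have to be doubled there; a plain text resource does not have this issue.
Other languages such as C# and Python support raw strings, but Java has yet to add this feature to the core language. Text blocks (standard since Java 15) still process escape sequences, so they do not remove the need for the double backslash.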
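To make the escaping rule concrete, here is a small sketch. The class and method names (`EscapeDemo`, `isWord`, `matchesRuntimePattern`) are invented for this example; the point is only that the doubled backslash exists in the source text, not in the string the regex engine receives.

```java
import java.util.regex.Pattern;

// Hypothetical demo class for the backslash-escaping rule.
final class EscapeDemo {

    // In source code the regex \w+ must be written with a doubled backslash.
    static boolean isWord(String input) {
        return Pattern.matches("\\w+", input);
    }

    // A pattern that arrives at runtime (user input, a text file) is used as-is;
    // whoever typed it wrote a single backslash.
    static boolean matchesRuntimePattern(String regexFromOutside, String input) {
        return Pattern.matches(regexFromOutside, input);
    }

    public static void main(String[] args) {
        // The literal "\\w+" holds three characters: backslash, 'w', '+'.
        System.out.println("\\w+".length());
        System.out.println(isWord("hello_42"));    // only word characters
        System.out.println(isWord("hello world")); // contains a space
    }
}
```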
Matching a string
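This mainly relies on the java.util.regex.Pattern class. Typical usage looks like this:
    Pattern p = Pattern.compile("a*b");
    Matcher m = p.matcher("aaaaab");  // m.matches() tells whether the whole input matches
or
    boolean isMatch = Pattern.matches(regex, inputStr);  // static boolean matches(String regex, CharSequence input)
The first form, which goes through a Matcher, gives access to more information (for example, which part of the input text matched), whereas the static Pattern.matches call only tells you whether the input matches.
See the following example: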
    // Let's use a regular expression to match a date string.
(The matcher refers to the full text of the current match.)
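The date example from the original listing did not survive beyond its first comment, so the following is a stand-in sketch rather than the original code. It assumes an ISO-style date (yyyy-MM-dd); the class name `MatchDemo` and its methods are hypothetical.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical demo class: matching a date string with Pattern and Matcher.
final class MatchDemo {

    // Assumed date format for illustration: four digits, two digits, two digits.
    private static final Pattern DATE = Pattern.compile("\\d{4}-\\d{2}-\\d{2}");

    // matches() succeeds only if the entire input is a date.
    static boolean isDate(String input) {
        return DATE.matcher(input).matches();
    }

    // find() scans for the next match anywhere in the text;
    // group() then returns the full text of that match.
    static String firstDate(String text) {
        Matcher m = DATE.matcher(text);
        return m.find() ? m.group() : null;
    }

    public static void main(String[] args) {
        System.out.println(isDate("2024-01-31"));
        System.out.println(isDate("on 2024-01-31"));
        System.out.println(firstDate("released 2024-01-31, patched 2024-02-02"));
        // The static shortcut compiles the pattern on every call.
        System.out.println(Pattern.matches("a*b", "aaaaab"));
    }
}
```

Compiling the pattern once into a constant and reusing it is usually preferable to calling the static Pattern.matches repeatedly, since the latter recompiles the regex each time.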
Capturing groups
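The previous section already showed how to pull matched groups out with the Matcher class: you can simply iterate through the extracted groups in the returned Matcher.
The next example captures two groups from a string (group 0 is the whole match; group 1 is the first subgroup).
    // capture date
Note: matcher.start(1) and matcher.end(1) return the start and end offsets of the first subgroup.
Called without arguments, matcher.start() and matcher.end() return the offsets of the whole current match, not of a subgroup.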
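As a rough illustration of the group numbering and the start/end offsets described above, here is a sketch that assumes dates written as dd/MM/yyyy. `GroupDemo`, `describe`, and the date format are all choices made for this example, not taken from the original listing.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical demo class: capturing groups and their offsets.
final class GroupDemo {

    // Three capturing groups: (1) two digits, (2) two digits, (3) four digits.
    private static final Pattern DATE = Pattern.compile("(\\d{2})/(\\d{2})/(\\d{4})");

    // Describes the first match: the whole match with its offsets,
    // then the first subgroup with its offsets.
    static String describe(String text) {
        Matcher m = DATE.matcher(text);
        if (!m.find()) {
            return "no match";
        }
        return m.group(0) + " [" + m.start() + "," + m.end() + ") first="
                + m.group(1) + " [" + m.start(1) + "," + m.end(1) + ")";
    }

    public static void main(String[] args) {
        Matcher m = DATE.matcher("Due 25/01/2024 sharp");
        if (m.find()) {
            // groupCount() excludes group 0, so the loop runs to groupCount() inclusive.
            for (int i = 0; i <= m.groupCount(); i++) {
                System.out.println("group " + i + ": " + m.group(i));
            }
        }
        System.out.println(describe("Due 25/01/2024 sharp"));
    }
}
```

The end offsets are exclusive, in the same way as the second argument of String.substring.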
Finding and replacing strings
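Another common requirement is to check whether the given text contains a certain string (substring) and replace it with something else
(that is, to replace a part of a string using regular expressions).
Two methods do most of the work:
- Matcher.replaceAll()
- Matcher.replaceFirst()
    String replacedString = matcher.replaceAll(replacement);  // replacement is the replacement String, not the input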
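In the replacement string, use placeholders: $ followed by a number refers to the capture group with that number, and "$0" stands for the full matched text.
For example, to swap the first and second parts of a string, you can write:
    // Let's try and reverse the order of the day and month in a few date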
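Since only the first comment of the original listing survives, here is a stand-in sketch of the same idea: swapping day and month with `$1` and `$2`. The dd/MM/yyyy format, the class name `ReplaceDemo`, and the method names are assumptions for illustration.

```java
import java.util.regex.Pattern;

// Hypothetical demo class: find-and-replace with group placeholders.
final class ReplaceDemo {

    // Groups: (1) day, (2) month, (3) year, in an assumed dd/MM/yyyy layout.
    private static final Pattern DATE = Pattern.compile("(\\d{2})/(\\d{2})/(\\d{4})");

    // Swaps day and month in every date found in the text.
    static String swapAll(String text) {
        return DATE.matcher(text).replaceAll("$2/$1/$3");
    }

    // Swaps day and month in the first date only.
    static String swapFirst(String text) {
        return DATE.matcher(text).replaceFirst("$2/$1/$3");
    }

    // $0 is the full matched text, here wrapped in brackets.
    static String bracketAll(String text) {
        return DATE.matcher(text).replaceAll("[$0]");
    }

    public static void main(String[] args) {
        String text = "25/01/2024 and 31/12/2023";
        System.out.println(swapAll(text));
        System.out.println(swapFirst(text));
        System.out.println(bracketAll(text));
    }
}
```

If the replacement text may itself contain `$` or backslashes that should be taken literally, Matcher.quoteReplacement can be used to escape it first.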
Pattern Flags
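As with the regex API in POSIX C, a flags argument can be passed when the pattern is compiled, changing how the compiled expression behaves
(case sensitivity, handling of line breaks, and so on). Unlike the POSIX API, the Java flags do not select a syntax flavor such as Perl, POSIX basic, or POSIX extended.
    // Compiles the given regular expression into a pattern with the given flags.
    public static Pattern compile(String regex, int flags)
Four flags are covered here (the documentation lists nine, but most are rarely needed in practice):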
- Pattern.CASE_INSENSITIVE makes the pattern case insensitive so that it matches strings of different capitalizations
- Pattern.MULTILINE makes the start and end metacharacters (^ and $ respectively) match at the beginning and end of each line instead of only at the beginning and end of the whole input string; it matters when the input contains newline characters (\n)
- Pattern.DOTALL allows the dot metacharacter (.) to match new line characters as well
- Pattern.LITERAL makes the pattern literal, in the sense that escaped characters are matched as-is. For example, the pattern "\d" will match a backslash followed by a 'd' character as opposed to a digit character
(Of these, the second one, MULTILINE, is the most practical: it makes the anchors aware of line breaks, line starts, and line ends.)
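The four flags can be compared side by side with a short sketch. The class `FlagsDemo` and the helper `countMatches` are invented for this example; the flag constants and the two-argument `Pattern.compile` are standard.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical demo class: the effect of common Pattern flags.
final class FlagsDemo {

    // Counts how many times the pattern is found in the input.
    static int countMatches(Pattern pattern, String input) {
        Matcher m = pattern.matcher(input);
        int count = 0;
        while (m.find()) {
            count++;
        }
        return count;
    }

    public static void main(String[] args) {
        String lines = "one\ntwo\nthree";

        // CASE_INSENSITIVE: "java" also matches "JaVa".
        System.out.println(Pattern.compile("java", Pattern.CASE_INSENSITIVE)
                .matcher("JaVa").matches());

        // MULTILINE: ^ and $ apply to every line rather than to the whole input.
        System.out.println(countMatches(Pattern.compile("^\\w+$"), lines));
        System.out.println(countMatches(Pattern.compile("^\\w+$", Pattern.MULTILINE), lines));

        // DOTALL: the dot also matches a line terminator.
        System.out.println(Pattern.compile("a.b").matcher("a\nb").matches());
        System.out.println(Pattern.compile("a.b", Pattern.DOTALL).matcher("a\nb").matches());

        // LITERAL: the pattern \d matches a backslash followed by 'd', not a digit.
        System.out.println(Pattern.compile("\\d", Pattern.LITERAL).matcher("\\d").matches());
        System.out.println(Pattern.compile("\\d", Pattern.LITERAL).matcher("7").matches());

        // Flags are bit masks and can be combined with |.
        Pattern combined = Pattern.compile("^two$",
                Pattern.CASE_INSENSITIVE | Pattern.MULTILINE);
        System.out.println(countMatches(combined, "one\nTWO\nthree"));
    }
}
```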
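Wrap-up
The sections above covered the most common usage:
- Pattern.compile()
- pattern.matcher()
- the Matcher replace methods (replaceAll, replaceFirst)
For anything beyond that, consult the documentation and source code of the Pattern and Matcher classes.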