2016-05-12

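Tech: Regular Expression Support in Java

Java already ships with a mature package for regular expressions; you just import it and use it.

Oracle's official Java Tutorials have a dedicated lesson on this topic, and it is very detailed (link).

Personally I find it too long-winded, so this post covers only the main (core) parts of java.util.regex.

Introduction

The java.util.regex package mainly involves the following classes: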

Pattern
Matcher
PatternSyntaxException

It is mainly used for two things:

Match String
Capture Group

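Note that java.util.regex is part of the standard library. Its syntax is close to Perl/PCRE, but it is not fully compatible with them (the Pattern documentation lists the differences from Perl 5).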

Main content

For the details, you can refer to:

regular expression string patterns

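Put simply, the one slightly unusual point is that when you write a regular expression in Java source code, you have to escape the backslash.

The reason is that many regex constructs start with \, but the backslash is also the escape character of Java string literals. So you first escape the "\" itself to turn it into an ordinary character in the string; only then does the regex engine see a backslash and treat what follows it as a regex construct.

For example, where the regex itself is \w (as you would write it in a raw string or in a text file), in Java source it has to be written as "\\w".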
(effectively escaping the backslash with itself)

This is only necessary when hard-coding patterns into Java code: strings read from user input or from files are read character by character, and escape sequences are not interpreted. (Only hard-coded patterns need the doubled "\\"; a pattern read from a file does not.)

A common way around this problem is to put the patterns in a resource file, which also makes them easier to read and understand. (Note that a .properties file loaded with java.util.Properties applies its own backslash escaping, so backslashes still have to be doubled there; a plain text file avoids this.)
Other languages like C# or Python support the notion of raw strings, but Java has yet to add this useful feature into the core language.
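
To make the escaping rule concrete, here is a minimal sketch. The class name BackslashDemo and the helper readPattern are made up for this illustration, and the "file" is simulated with a StringReader rather than a real file. It shows that the hard-coded literal "\\w+" and a pattern read at runtime end up as the same three characters.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.regex.Pattern;

public class BackslashDemo {
    // Hard-coded pattern: the compiler turns "\\w+" into the three
    // characters \ w + before the regex engine ever sees it.
    public static final String HARD_CODED = "\\w+";

    // Hypothetical helper that stands in for reading a pattern from a text
    // file: the reader returns the characters as-is, with no escape handling.
    public static String readPattern(String fileContents) {
        try (BufferedReader reader = new BufferedReader(new StringReader(fileContents))) {
            return reader.readLine();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        // 92 is the character code of the backslash; building the text this
        // way avoids using any string escape in the simulated file contents.
        String fileContents = new String(new char[] {(char) 92, 'w', '+'});
        String fromFile = readPattern(fileContents);

        System.out.println(HARD_CODED.length());                  // 3
        System.out.println(HARD_CODED.equals(fromFile));          // true
        System.out.println(Pattern.matches(fromFile, "abc_123")); // true
    }
}
```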

Matching a string

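This mainly relies on the java.util.regex.Pattern class (the snippets below assume Pattern and Matcher are imported from java.util.regex). Typical usage looks like this: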

Pattern p = Pattern.compile("a*b");
Matcher m = p.matcher("aaaaab");
boolean b = m.matches(); //true

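Or:

boolean isMatch = Pattern.matches("a*b", "aaaaab"); // true

Using a Matcher, as in the first form, gives you more information (such as which part of the input text matched), whereas the static Pattern.matches() can only tell you whether the input matches.
The following example shows this: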

// Let's use a regular expression to match a date string.
Pattern ptrn = Pattern.compile("([a-zA-Z]+) (\\d+)");
Matcher matcher = ptrn.matcher("June 24");

if (matcher.matches()) {
    // Indeed, the expression "([a-zA-Z]+) (\d+)" matches the date string.

    // This prints "Match at index [0, 7)": matches() only succeeds when the
    // pattern covers the entire input.
    System.out.println("Match at index [" + matcher.start() +
        ", " + matcher.end() + ")");

    // To get the fully matched text, call group() on the Matcher.
    System.out.println("Match: " + matcher.group()); // prints "Match: June 24"
}

(With no arguments, start(), end() and group() refer to the whole of the current match.)
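
Since matches() only succeeds on the entire input, it can help to compare it with the other two Matcher tests, lookingAt() (match at the start of the input) and find() (match anywhere). The following is a small sketch; the class and method names are invented for this illustration.

```java
import java.util.regex.Pattern;

public class MatchModesDemo {
    private static final Pattern DATE = Pattern.compile("([a-zA-Z]+) (\\d+)");

    // True only if the pattern covers the whole input.
    public static boolean wholeInputMatches(String input) {
        return DATE.matcher(input).matches();
    }

    // True if the pattern matches starting at index 0 (the rest may be left over).
    public static boolean prefixMatches(String input) {
        return DATE.matcher(input).lookingAt();
    }

    // True if the pattern matches anywhere in the input.
    public static boolean containsMatch(String input) {
        return DATE.matcher(input).find();
    }

    public static void main(String[] args) {
        System.out.println(wholeInputMatches("June 24"));       // true
        System.out.println(wholeInputMatches("June 24, 2016")); // false
        System.out.println(prefixMatches("June 24, 2016"));     // true
        System.out.println(prefixMatches("on June 24"));        // false
        System.out.println(containsMatch("on June 24"));        // true
    }
}
```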

Capturing groups

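The example above already showed how to read the matched group from a Matcher; you can simply iterate through the extracted groups in the returned Matcher.
The example below captures two groups in each match (group 0 is the whole match; group 1 is the first captured substring).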

// capture date
String pattern = "([a-zA-Z]+) (\\d+)";
Pattern ptrn = Pattern.compile(pattern);
Matcher matcher = ptrn.matcher("June 24, August 9, Dec 12");

// This will print each of the matches and the index in the input string
// where the match was found:
//   Match: June 24 at index [0, 7)
//   Match: August 9 at index [9, 17)
//   Match: Dec 12 at index [19, 25)
while (matcher.find()) {
    System.out.println(String.format("Match: %s at index [%d, %d)",
        matcher.group(), matcher.start(), matcher.end()));
}

// If we are iterating over the groups in the match again, first reset the
// matcher to start at the beginning of the input string.
matcher.reset();

// For each match, we can extract the captured information by reading the 
// captured groups.
while (matcher.find()) {
    // groupCount() returns the number of capturing groups in the pattern
    // (group 0 is not counted), so this prints "2 groups captured".
    System.out.println(String.format("%d groups captured",
        matcher.groupCount()));

    // This will print the month and day of each match.  Remember that
    // group 0 is always the whole matched text, so the month is group 1.
    System.out.println("Month: " + matcher.group(1) + ", Day: " +
        matcher.group(2));

    // Each group in the match also has a start and end index, which is the
    // index in the input string where the group was found.
    System.out.println(String.format("Month found at [%d, %d)",
        matcher.start(1), matcher.end(1)));

}

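Note: matcher.start(1) and matcher.end(1) give the start and end of the first captured group.
Without arguments, matcher.start() and matcher.end() give the start and end of the whole current match, not of a group.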

Finding and replacing strings

Another common need: when the given text contains the string (substring) you are looking for, replace it with something else.
(replace a part of a string using regular expressions)

Two methods do most of the work:

Matcher.replaceAll()
Matcher.replaceFirst()

String replacedAll = matcher.replaceAll(replacement);     // replacement is a String; replaces every match
String replacedFirst = matcher.replaceFirst(replacement); // replaces only the first match

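In the replacement string you can use placeholders: $ followed by a number refers to that captured group, and "$0" is the full matched text.

For example, to swap the first and second parts of a string, you can write: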

// Let's try and reverse the order of the day and month in a few date
// strings. Notice that the replacement string also has special characters
// (the $1 and $2 back references to the captured groups); a literal $ or \
// in a replacement would have to be escaped.
Pattern ptrn = Pattern.compile("([a-zA-Z]+) (\\d+)");
Matcher matcher = ptrn.matcher("June 24, August 9, Dec 12");

// This builds a new, reordered string and prints:
//   24 of June, 9 of August, 12 of Dec
// Remember that group 0 is always the full matched text, so the
// month and day indices start from 1 instead of zero.
String replacedString = matcher.replaceAll("$2 of $1");
System.out.println(replacedString);
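
As a complement, here is a short sketch of the other replacement variants mentioned above: replaceFirst(), the "$0" placeholder, and Matcher.quoteReplacement() for when the replacement text should be taken literally. The class and method names are made up for this example.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class ReplaceDemo {
    private static final Pattern DATE = Pattern.compile("([a-zA-Z]+) (\\d+)");

    // Swaps month and day in the first match only.
    public static String swapFirst(String input) {
        return DATE.matcher(input).replaceFirst("$2 of $1");
    }

    // Wraps every match in brackets; $0 stands for the full matched text.
    public static String bracketAll(String input) {
        return DATE.matcher(input).replaceAll("[$0]");
    }

    // Replaces every match with the given text taken literally, so that
    // "$" and "\" in it are not treated as group references or escapes.
    public static String replaceAllLiterally(String input, String literal) {
        return DATE.matcher(input).replaceAll(Matcher.quoteReplacement(literal));
    }

    public static void main(String[] args) {
        String input = "June 24, August 9, Dec 12";
        System.out.println(swapFirst(input));                 // 24 of June, August 9, Dec 12
        System.out.println(bracketAll(input));                // [June 24], [August 9], [Dec 12]
        System.out.println(replaceAllLiterally(input, "$1")); // $1, $1, $1
    }
}
```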

Pattern Flags

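As with regex in POSIX C, you can pass flags when compiling, which changes how the pattern is interpreted
(case sensitivity, how line breaks are treated, and so on; unlike POSIX regcomp, Java's flags do not switch between basic and extended syntax flavors).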

// Compiles the given regular expression into a pattern with the given flags.
public static Pattern compile(String regex, int flags)

Four flags are covered here (the documentation lists nine, but in practice you rarely need them all):

Pattern.CASE_INSENSITIVE makes the pattern case insensitive so that it matches strings of different capitalizations
Pattern.MULTILINE is useful when your input string contains newline characters (\n): it allows the start and end metacharacters (^ and $ respectively) to match at the beginning and end of each line instead of only at the beginning and end of the whole input string
Pattern.DOTALL allows the dot metacharacter (.) to match new line characters as well
Pattern.LITERAL makes the pattern literal, in the sense that the escaped characters are matched as-is. For example, the pattern “\d” will match a backslash followed by a ‘d’ character as opposed to a digit character

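(The second one is the most practical: it makes the pattern aware of line breaks, so that ^ and $ match at the start and end of each line.)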
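
The effect of the four flags can be checked with a small sketch like the one below. The class FlagsDemo and its two helpers are hypothetical names used only for this illustration; the flag constants themselves come from java.util.regex.Pattern.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class FlagsDemo {
    // Compiles regex with the given flags and tests it against the whole input.
    public static boolean matchesWith(String regex, int flags, String input) {
        return Pattern.compile(regex, flags).matcher(input).matches();
    }

    // Counts how many times regex is found in input under the given flags.
    public static int countMatches(String regex, int flags, String input) {
        Matcher m = Pattern.compile(regex, flags).matcher(input);
        int count = 0;
        while (m.find()) {
            count++;
        }
        return count;
    }

    public static void main(String[] args) {
        // CASE_INSENSITIVE: "june" also matches "JUNE".
        System.out.println(matchesWith("june", 0, "JUNE"));                        // false
        System.out.println(matchesWith("june", Pattern.CASE_INSENSITIVE, "JUNE")); // true

        // MULTILINE: ^ matches at the start of every line, not just the input.
        String lines = "one\ntwo\nthree";
        System.out.println(countMatches("^\\w+", 0, lines));                 // 1
        System.out.println(countMatches("^\\w+", Pattern.MULTILINE, lines)); // 3

        // DOTALL: the dot also matches a newline.
        System.out.println(matchesWith("a.b", 0, "a\nb"));              // false
        System.out.println(matchesWith("a.b", Pattern.DOTALL, "a\nb")); // true

        // LITERAL: "\d" means a backslash followed by 'd', not a digit.
        System.out.println(matchesWith("\\d", Pattern.LITERAL, "7"));   // false
        System.out.println(matchesWith("\\d", Pattern.LITERAL, "\\d")); // true
    }
}
```

Flags can also be combined with the bitwise OR operator, for example Pattern.CASE_INSENSITIVE | Pattern.MULTILINE.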

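Wrap-up

The sections above covered the most common usage:

Pattern.compile();
pattern.matcher();
matcher.replaceAll() / matcher.replaceFirst()

If you need more, see the documentation and source code of the Pattern and Matcher classes.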

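Title: Tech: Regular Expression Support in Java

Author: Merlin

Published: 2016-05-12, 15:23:17

Last updated: 2018-04-12, 14:55:17

Original link: http://www.merlinblog.site/posts/c8bce253/

License: "Attribution-NonCommercial-ShareAlike 4.0". Please keep the original link and author when reposting.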