esProc provides regex() function to match a string or a string field value in a record sequence using a pattern defined by the regular expression. The aim is to analyze and examine the string in order to find the pattern in it and replace it. Here we look at the uses for regular expression in esProc.
1. The basics
A regular expression is a string specifying a pattern. Its most basic use is to match a string s with the regular expression rs through s.regex(rs) function. For example:
A | B | C | |
1 | =”a12b”.regex(“(a[0-9])”) | =”a12b”.regex(“(a[0-9]*)”) | =”a12b”.regex(“[0-9]b”) |
2 | =”a12b”.regex(“\\S*([0-9][a-z])”) | ‘\S*([0-9]a) | =”a12b”.regex(B2) |
Here’re the results of A1, B1and C1:
The regular expression used in A1 is “(a[0-9])”. a is the literal character a, [0-9] matches any single character in the range 0-9, the parentheses () specify that a string with the pattern “letter a plus a one digit” will be retrieved and returned. The result is a sequence comprising one member a1. In B1the * following [0-9] matches the preceding character consecutively and repeatedly, which here means matching any number of characters between 0 and 9 appearing in a row. So B1 returns a result of a12. C1 searches for a string starting with a number followed by letter b. But a12b doesn’t start with a number, and they don’t match. Thus the returned result is a null.
The regular expression A2 uses is “\\S*([0-9][a-z])”, in which [a-z] stands for any character of letters from a to z, i.e. any lower-case letter, and \S* stands for any number of non-whitespace characters appearing in a row. Since the \ also means the escape character in a string, it needs to be escaped to be a literal – that is the \\S*. The parentheses specify that the returned substring is the match of [0-9][a-z] and the match of \S*be discarded. B2 is the string constant representing the regular expression in A2, without needing to use an escape character. C2 thus gets the same result as A2 does. Below are their values:
esProc uses () to define the substring to be returned when matching a string with the regular expression. The returned result is a sequence of the members in the parentheses. Without the parentheses, the string itself will be returned if the matching is successful.
By using @c option, the regex() function becomes case-insensitive in string matching according to regular expression. For example:
A | B | C | |
1 | =”a12b”.regex@c(“(A[0-9])”) | =”a12b”.regex@c(“([A-Z][0-9])”) | =”a12b”.regex(“([A-Z][0-9])”) |
Below are the results of A1, B1 and C1:
A1’s regular expression includes the uppercase letter A, and B1’s matches any uppercase letter, both return a result of a1 because regex() function adds the @c option. But, C1 returns a null because the function works alone without the @c option and can’t find a match for the regular expression.
In short, a sequence consisting of the matching result will be returned when regular expression rs finds its match in string s; and null will be returned when it can’t find a match.
esProc allows using the universal unicode symbols to display non-English characters in regular expression so that the character set setting can’t intervene. To do that, remember adding @u option to the regex() function. For example:
A | B | |
1 | =”Gerente de Fábrica”.regex(“.* (.*á.*)”) | =”Gerente de Fábrica”.regex@u(“.* (.*\\u00e1.*)”) |
The regular expression .* (.*á.*) in A1 includes a dot . and the .*. The . represents any single character except carriage return and new line; .* represents any number of random characters. A1’s regular expression finds the last word substring containing the character á. So does B1’s. But the regex() function in B1 adds @u option to display á as \u00e1. Usually the regex@u() is used to parse a string on the outside, keeping the regular expression from being affected by the character set setting. Both A1 and B1 get the same result:
2. The metacharacters
A regular expression comprises literal characters and metacharacters that have special meanings. In the regular expression (a[0-9]) in the previous section’s first example, a is a literal character that matches itself while both [0-9] and () are metacharacters that respectively match one digit and return one matching result.
Table of common esProc metacharacters:
There are also other metacharacters, such as \t (that matches the tab character), \W (that matches a non-word character) and \s (that matches a whitespace character). But they are no so often used in esProc.
These metacharacters are used in regular expressions for string matching. For example:
A | B | C | |
1 | ‘(\+?[1-9]\d*)$ | =”15432″.regex(A1) | =”-132″.regex(A1) |
2 | =”3.14″.regex(A1) | =”0032″.regex(A1) | =”123b”.regex(B2) |
In the regular expression in A1, \+ means a literal plus sign, and followed by a question mark ? the plus sign is allowed to be missing; [1-9] denotes a non-zero digit; \d* indicates any number of digits appearing in row; and the $ at the end asserts the end of a line, meaning there are no extra characters after finishing the matching. The regular expression matches a string of positive integer and returns it through (). So, only B1has a matching result while C1, A2, B2 and C2 which respectively contain a negative number, a decimal, a string beginning with a zero and a string with a non-digit character can’t find a match and thus return nulls. B1’s result is as follows:
Here’s another example:
A | |
1 | =”Tom and Jerry”.regex(“.*(\\w{5}).*”) |
In A1’s regular expression .*(\\w{5}), .* represents any number of characters following one after another, and \w{5} finds and returns a match of 5-letter word which using (). A1 returns the first 5-letter word in the string as follows:
To make the matching more precisely, the regular expression should be written as “.*(\\b\\w{5}\\b).*”, where the word boundaries are matched to strictly specify a 5-character word.
3. Generating a sequence or a table sequence using the regular expression
Apart from using a regular expression in string matching to get one member, esProc also uses one to generate a sequence of multiple members from a string, which are returned separately through (). For example:
A | |
1 | =”2016-1-30″.regex(“(\\d{4})-(\\d{1,2})-(\\d{1,2})”) |
2 | =”Jerry”.regex(“(.)(.)(.)(.)(.)”) |
3 | =”Tom and Jerrry”.regex(“(\\w*) (\\w*) (\\w*)”) |
A1 extracts the year, the month and the day from a date string. A2 splits a word to form a sequence composed of 5 characters. A3 extracts three words from a string. Below are their results:
If the target string has its own format, as with the case of A1, we can split it conveniently according to the regular expression. If we don’t know how many members there will be in the resulting sequence when dividing a string or a word, we can use a loop statement. For example:
A | B | C | |
1 | Do one thing at a time, and do well. | [] | =A1.words() |
2 | for len(A1)>0 | =A1.regex(“(\\w*)[ ,.]+(.*)”) | >B1=B1|B2(1) |
3 | >A1=B2(2) |
In B2, the regular expression (\\w*)[ ,.]+(.*) divides A1’s string into two parts – one word and the rest without the comma and whitespaces – over each loop. A2 runs the loop until the splitting process finishes, and C2 adds each extracted word to B1. Finally B1 gets a result as follows:
An alternative esProc method of extracting words from a string is using the words() function. So we can see the same result obtained by C1.
In addition to parsing a single string, esProc also allows analyzing a sequence of strings with the regular expression. For example:
A | |
1 | [Rebecca Moore,Ashley Wilson,Rachel Johnson,Ryan Williams,Richard] |
2 | =A1.regex(“R.* .*”) |
3 | =A1.regex(“(R.*) .*”) |
A2 uses A.regex() function to perform the regular expression matching on each member of A1. The regular expression R.* .* specifies a matching string that begins with R and contains a whitespace, but it doesn’t use the () to define the returned string. In this case, A2 will return all matching name strings in A1 that begin with R and contain a whitespace. The operation is similar to A.select(), which returns the members satisfying the condition specified by a regular expression. A3 has a seemingly similar regular expression where the () specifies that the first word be returned. Therefore, a new table sequence made up of the selected strings will be created. Below are the target sequence, A2’s result and A3’s result separately:
The regex() function can be also used to parse the string field values of a table sequence. For example:
A | B | |
1 | [Rebecca Moore,Ashley Wilson,Rachel Johnson,Ryan Williams,Richard] | =A1.new(#:ID,~:Name) |
2 | =B1.regex(“R.* .*”,Name) | |
3 | =B1.regex(“(R.*) .*”,Name) | =B1.regex(“(R.*) (.*)”,Name;Firstname,Surname) |
B1 generates a table sequence as follows:
The absence of () in A2’s regular expression enables an operation that is equivalent to the select operation. A record sequence consisting of the records in which the Name field values match the regular expression will be returned. A3 retrieves the first word of each name to return a new table sequence. Here’re results of A2 and A3:
The regular expression in B3 divides the name into two separate words to return them together. When calling the regex() function, two other parameters can be used after the semicolon to specify names for the newly-generated fields. Here’s the result:
4. Using regular expressions with cursor
Similar to the table sequences, the regex() function can handle string field values stored in a cursor according to a regular expression. For example:
A | B | C | |
1 | =file(“Cities.txt”) | ||
2 | =A1.cursor@t(CID,NAME) | =A2.regex(“.*n$”,NAME) | =B2.fetch() |
3 | =A1.cursor@t(CID,NAME) | =A3.regex(“(.*in)”,NAME) | =B3.fetch() |
B2 matches the NAME field values with the regular expression specifying that a city name end with n, but without extracted item. The effect is like filtering the cursor data. C2 gets a result as follows:
The regular expression in B3 specifies extracting the letters in together with the preceding part when matching with NAME field values. The result will be a new table sequence comprising the extracted strings. Here’s C3’s result: