Regular expressions
Functions
RegEx* | regex_create (...) |
void | regex_destroy (...) |
bool_t | regex_match (...) |
Regular expressions define a text pattern that can be used to find or compare strings.
- Use regex_create to create a regular expression.
- Use regex_match to check if a string matches the pattern.
RegEx *regex = regex_create(".*.txt");
const char_t *str[] = {
    "file01.txt",
    "image01.png",
    "sun01.jpg",
    "films.txt",
    "document.pdf"};
uint32_t i, n = sizeof(str) / sizeof(char_t*);
for (i = 0; i < n; ++i)
{
    if (regex_match(regex, str[i]) == TRUE)
        bstd_printf("YES: %s\n", str[i]);
    else
        bstd_printf("NO: %s\n", str[i]);
}
regex_destroy(&regex);
YES: file01.txt
NO: image01.png
NO: sun01.jpg
YES: films.txt
NO: document.pdf
1. Define patterns
We can build a regular expression from a text string, following these simple rules:
- A plain string pattern matches only that same string.
- A period '.' is equivalent to "any character".
- A dash, as in 'A-Z', sets a range of characters, delimited by the ASCII/Unicode codes of both ends.
"hello" --> {"hello"}

"h.llo" --> {"hello", "htllo", "hällo", "h5llo", ...}

"A-Zello" --> {"Aello", "Bello", "Cello", ..., "Zello"}
'A-Z': (65-90) (ABCDEFGHIJKLMNOPQRSTUVWXYZ)
'0-9': (48-57) (0123456789)
'á-ú': (225-250) (áâãäåæçèéêëìíîïðñòóôõö÷øùú)
Like String objects, patterns are expressed in UTF-8, therefore the entire Unicode set can be used to create regular expressions.
- The brackets '[áéíóú]' allow any one of several characters to match.
- The asterisk '*' allows the last character to appear zero or more times.
- The parentheses '(he*llo)' group a regular expression, so that it behaves as a single character.
- For '.', '-', '[]', '*', '()' to be interpreted as characters, precede them with the backslash '\'.
"h[áéíóú]llo" --> {"hállo", "héllo", "híllo", "hóllo", "húllo"}

"he*llo" --> {"hllo", "hello", "heello", "heeello", "heeeello", ...}
"h.*llo" --> {"hllo", "hello", "hallo", "hillo", "hasello", ...}
"hA-Z*llo" --> {"hllo", "hAllo", "hABllo", "hVFFRREASllo" }
           --> {"hAQWEDllo", "hAAABBRSllo", ...}
"FILE_0-9*.PNG" --> {"FILE_.PNG", "FILE_0.PNG", "FILE_01.PNG" }
                --> {"FILE_456.PNG", "FILE_112230.PNG",...}

"[(hello)(bye)]" --> {"hello", "bye" }
"[(red)(blue)(1*)]" --> {"red", "blue", "", "1", "11", "111", ... }
"(hello)*" --> {"", "hello", "hellohello", "hellohellohello", ... }
"(he*llo)ZZ" --> {"hlloZZ", "helloZZ", "heelloZZ", "heeelloZZ", ... }

"\(he\*\-llo\)" --> {"(he*-llo)"}
Remember that for expressions inserted as constants in C code, each backslash must itself be escaped with a second backslash: "\\(he\\*\\-llo\\)".
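The doubling only exists in the source file; the compiler stores a single backslash for each pair. A small sketch to check this, where `escaped_pattern` and `count_backslashes` are hypothetical helpers written for illustration and not part of the library:

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: the pattern \(he\*\-llo\) written as a C literal.
   Every backslash the regex engine must see is doubled in the source. */
static const char *escaped_pattern(void)
{
    return "\\(he\\*\\-llo\\)";
}

/* Hypothetical helper: counts the backslashes actually stored in a string. */
static size_t count_backslashes(const char *s)
{
    size_t n = 0;
    assert(s != NULL);
    for (; *s != '\0'; ++s)
        if (*s == '\\')
            ++n;
    return n;
}
```

Here `count_backslashes(escaped_pattern())` yields 4, not 8: the string held in memory is the 13-character pattern from the example above, which is what a function such as regex_create would receive.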
2. Regular languages and automata
Regular languages are those that are defined recursively using three basic operations on the set of characters (or symbols) available. They can be described using the regular expressions discussed above.
- Each character 'a' is a regular language 'A'.
- The union of two regular languages is a regular language A∪B.
- The concatenation of two regular languages is a regular language A·B.
- The closure of a regular language is a regular language A*. This is where recursion comes in.
In this context the symbols are all Unicode characters. But you can define languages based on other alphabets, including the binary {0, 1}.
To recognize whether or not a string belongs to a certain regular language, it is necessary to build a finite automaton based on the rules reflected in (Figure 1).
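To make the idea concrete, here is a hand-written deterministic automaton for the single pattern "he*llo". It is only an illustrative sketch: the function name and the state numbering are assumptions, and it says nothing about how regex_create builds its automaton internally.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical hand-coded automaton for "he*llo".
   States: 0 = start, 1 = after 'h' and any number of 'e',
           2 = after first 'l', 3 = after second 'l',
           4 = after 'o' (the only accepting state). */
static int dfa_match_he_star_llo(const char *str)
{
    int state = 0;
    size_t i;
    assert(str != NULL);
    for (i = 0; str[i] != '\0'; ++i)
    {
        char c = str[i];
        switch (state)
        {
        case 0: state = (c == 'h') ? 1 : -1; break;
        case 1: state = (c == 'e') ? 1 : (c == 'l') ? 2 : -1; break;
        case 2: state = (c == 'l') ? 3 : -1; break;
        case 3: state = (c == 'o') ? 4 : -1; break;
        default: state = -1; break; /* nothing may follow the final 'o' */
        }
        if (state < 0)
            return 0; /* no transition available: reject */
    }
    return state == 4; /* accept only if input ends in the final state */
}
```

The loop on state 1 is the closure 'e*', and the chain of states is the concatenation of the remaining characters. The strings "hllo", "hello" and "heeello" are accepted, while "helo" and "helloo" are rejected.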
regex_create ()
Create a regular expression from a pattern.
RegEx* regex_create(const char_t *pattern);
pattern | Search pattern. |
Return
Regular expression (automaton).
Remarks
See Define patterns.
regex_destroy ()
Destroy a regular expression.
void regex_destroy(RegEx **regex);
regex | Regular expression. Will be set to |
regex_match ()
Check if a string matches the search pattern.
bool_t regex_match(const RegEx *regex, const char_t *str);
regex | Regular expression. |
str | String to evaluate. |
Return
TRUE if the string is accepted by the regular expression.