Regular expressions

Functions

RegEx*	regex_create (...)
void	regex_destroy (...)
bool_t	regex_match (...)

1. Define patterns
2. Regular languages and automata

Regular expressions define a text pattern that can be used to find or compare strings.

Use regex_create to create a regular expression.
Use regex_match to check if a string matches the pattern.

Listing 1: Using regular expressions.

/* The period before "txt" is escaped so that it matches a literal '.' */
RegEx *regex = regex_create(".*\\.txt");

const char_t *str[] = {
    "file01.txt",
    "image01.png",
    "sun01.jpg",
    "films.txt",
    "document.pdf"};

uint32_t i, n = sizeof(str) / sizeof(char_t*);

for (i = 0; i < n; ++i)
{
    if (regex_match(regex, str[i]) == TRUE)
        bstd_printf("YES: %s\n", str[i]);
    else
        bstd_printf("NO:  %s\n", str[i]);
}

regex_destroy(&regex);

Result of (Listing 1).

YES: file01.txt
NO:  image01.png
NO:  sun01.jpg
YES: films.txt
NO:  document.pdf

1. Define patterns

We can build a regular expression from a text string, following these simple rules:

A pattern made only of plain characters matches that exact string and nothing else.

"hello" --> {"hello"}

A period '.' is equivalent to "any character".

"h.llo" --> {"hello", "htllo", "hällo", "h5llo", ...}

A dash, as in 'A-Z', defines a range of characters bounded by the ASCII/Unicode codes of both ends.

"A-Zello" --> {"Aello", "Bello", "Cello", ..., "Zello"}

'A-Z': (65-90) (ABCDEFGHIJKLMNOPQRSTUVWXYZ)
'0-9': (48-57) (0123456789)
'á-ú': (225-250) (áâãäåæçèéêëìíîïðñòóôõö÷øùú)

Like String objects, patterns are expressed in UTF-8, therefore the entire Unicode set can be used to create regular expressions.

Brackets, as in '[áéíóú]', match any one of the enclosed characters.

"h[áéíóú]llo" --> {"hállo", "héllo", "híllo", "hóllo", "húllo"}

The asterisk '*' allows the preceding character to appear zero or more times.

"he*llo" --> {"hllo", "hello", "heello", "heeello", "heeeello", ...}
"h.*llo" --> {"hllo", "hello", "hallo", "hillo", "hasello", ...}
"hA-Z*llo" --> {"hllo", "hAllo", "hABllo", "hVFFRREASllo" }
           --> {"hAQWEDllo", "hAAABBRSllo", ...}
"FILE_0-9*.PNG" --> {"FILE_.PNG", "FILE_0.PNG", "FILE_01.PNG" }
                --> {"FILE_456.PNG", "FILE_112230.PNG",...}
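The behaviour of '.' and '*' can be reproduced with a very small recursive matcher. The sketch below is only an illustration in standard C, not how regex_match works internally: mini_match is a hypothetical function, it handles single bytes only (no ranges, brackets, groups or UTF-8), and it uses backtracking instead of an automaton.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: returns 1 if the whole of 'text' matches 'pattern'.
   Supported syntax: literal bytes, '.' (any byte), 'x*' (zero or more x). */
static int mini_match(const char *pattern, const char *text)
{
    assert(pattern != NULL && text != NULL);

    /* An exhausted pattern only matches an exhausted text */
    if (pattern[0] == '\0')
        return text[0] == '\0';

    if (pattern[1] == '*')
    {
        /* Try zero repetitions first, then consume one matching byte at a time */
        for (;;)
        {
            if (mini_match(pattern + 2, text))
                return 1;

            if (text[0] == '\0' || (pattern[0] != '.' && pattern[0] != text[0]))
                return 0;

            text += 1;
        }
    }

    /* Single position: '.' accepts any byte, otherwise bytes must be equal */
    if (text[0] != '\0' && (pattern[0] == '.' || pattern[0] == text[0]))
        return mini_match(pattern + 1, text + 1);

    return 0;
}
```

For instance, mini_match("he*llo", "heeello") and mini_match("h.*llo", "hasello") both succeed, while mini_match("he*llo", "hallo") fails because 'a' is neither an 'e' nor the first 'l'.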

The parentheses '(he*llo)' allow grouping a regular expression, so that it behaves as a single character.

"[(hello)(bye)]" --> {"hello", "bye" }
"[(red)(blue)(1*)]" --> {"red", "blue", "", "1", "11", "111", ... }
"(hello)*" --> {"", "hello", "hellohello", "hellohellohello", ... }
"(he*llo)ZZ" --> {"hlloZZ", "helloZZ", "heelloZZ", "heeelloZZ", ... }

For '.', '-', '[]', '*', '()' to be interpreted as literal characters, precede them with a backslash '\'.

"\(he\*\-llo\)" --> {"(he*-llo)"}

Remember that in expressions written as string constants in C code, the backslash must itself be escaped, so each one is typed as a double backslash: "\\(he\\*\\-llo\\)".
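The effect of escaping in C source can be checked with plain standard C. In this small sketch, escaped_pattern and count_backslashes are hypothetical helpers written only for the demonstration: the literal is typed with eight backslashes, but the string that would reach a regular expression parser holds four.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: the pattern \(he\*\-llo\) as it is typed in C source.
   Each "\\" in the literal produces a single backslash in memory. */
static const char *escaped_pattern(void)
{
    return "\\(he\\*\\-llo\\)";
}

/* Hypothetical helper: count the backslashes actually stored in 'str' */
static size_t count_backslashes(const char *str)
{
    size_t n = 0;
    assert(str != NULL);

    for (; *str != '\0'; ++str)
    {
        if (*str == '\\')
            n += 1;
    }

    return n;
}
```

Here strlen(escaped_pattern()) is 13, the length of \(he\*\-llo\), and count_backslashes(escaped_pattern()) is 4.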

2. Regular languages and automata

Regular languages are those that are defined recursively using three basic operations on the set of characters (or symbols) available. They can be described using the regular expressions discussed above.

Each single character 'a' defines a regular language A = {a}.
The union of two regular languages is a regular language A∪B.
The concatenation of two regular languages is a regular language A·B.
The closure of a regular language is a regular language A*. This is where recursion comes in.

In this context the symbols are all Unicode characters. But you can define languages based on other alphabets, including the binary {0, 1}.

To recognize whether or not a string belongs to a certain regular language, a finite automaton is built following the rules shown in (Figure 1).

Figure 1: Construction of finite automata to filter regular expressions (concatenation, union and closure).
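To give an idea of what such an automaton looks like once built, the following sketch hard-codes a deterministic automaton for the single pattern "he*llo" in standard C. The states and the functions next_state and accepts_hello are hypothetical names chosen for the example. A real implementation would derive the states from the pattern instead of writing them by hand.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical states of a hand-written automaton for "he*llo" */
enum
{
    ST_START,  /* nothing read yet */
    ST_H,      /* read 'h' followed by zero or more 'e' */
    ST_L1,     /* read the first 'l' */
    ST_L2,     /* read the second 'l' */
    ST_ACCEPT, /* read the final 'o' */
    ST_DEAD    /* no match is possible any more */
};

/* Transition function: one state change per input character */
static int next_state(int state, char c)
{
    switch (state)
    {
    case ST_START:
        return c == 'h' ? ST_H : ST_DEAD;
    case ST_H:
        /* The closure e* is a loop on the same state */
        return c == 'e' ? ST_H : (c == 'l' ? ST_L1 : ST_DEAD);
    case ST_L1:
        return c == 'l' ? ST_L2 : ST_DEAD;
    case ST_L2:
        return c == 'o' ? ST_ACCEPT : ST_DEAD;
    default:
        /* Any character after acceptance, or in the dead state, fails */
        return ST_DEAD;
    }
}

/* The string is accepted if the automaton ends in the accepting state */
static int accepts_hello(const char *str)
{
    int state = ST_START;
    assert(str != NULL);

    for (; *str != '\0'; ++str)
        state = next_state(state, *str);

    return state == ST_ACCEPT;
}
```

Each character is examined exactly once, so the running time grows linearly with the length of the string. The concatenation h·e*·l·l·o appears as a chain of states and the closure as the loop on ST_H.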

regex_create ()

Create a regular expression from a pattern.

RegEx*
regex_create(const char_t *pattern);

pattern

Search pattern.

Return

Regular expression (automaton).

Remarks

See Define patterns.

regex_destroy ()

Destroy a regular expression.

void
regex_destroy(RegEx **regex);

regex

Regular expression. Will be set to NULL after destruction.
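The double pointer in the signature is what allows the caller's variable to be reset. The sketch below shows the general C idiom with a hypothetical Object type and hypothetical object_create and object_destroy functions. It makes no claim about how RegEx is stored or released internally.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical opaque-style object, used only to illustrate the idiom */
typedef struct
{
    int value;
} Object;

/* Hypothetical constructor: returns a zero-initialized object */
static Object *object_create(void)
{
    return (Object *)calloc(1, sizeof(Object));
}

/* Hypothetical destructor: receives the address of the caller's pointer,
   releases the object and clears that pointer to avoid dangling uses */
static void object_destroy(Object **obj)
{
    assert(obj != NULL);
    free(*obj);
    *obj = NULL;
}
```

After object_destroy(&obj) the caller's obj is NULL, so an accidental later use can be detected instead of touching freed memory.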

regex_match ()

Check if a string matches the search pattern.

bool_t
regex_match(const RegEx *regex,
            const char_t *str);

regex	Regular expression.
str	String to evaluate.