Cross-platform C SDK

Regular expressions


Functions

RegEx* regex_create (...)
void regex_destroy (...)
bool_t regex_match (...)

Regular expressions define a text pattern that can be used to find or compare strings.

  • Use regex_create to create a regular expression.
  • Use regex_match to check if a string matches the pattern.
    Listing 1: Using regular expressions.

    RegEx *regex = regex_create(".*.txt");
    
    const char_t *str[] = {
        "file01.txt",
        "image01.png",
        "sun01.jpg",
        "films.txt",
        "document.pdf"};
    
    uint32_t i, n = sizeof(str) / sizeof(char_t*);
    
    for (i = 0; i < n; ++i)
    {
        if (regex_match(regex, str[i]) == TRUE)
            bstd_printf("YES: %s\n", str[i]);
        else
            bstd_printf("NO:  %s\n", str[i]);
    }
    
    regex_destroy(&regex);
    
    Result of Listing 1:

    YES: file01.txt
    NO:  image01.png
    NO:  sun01.jpg
    YES: films.txt
    NO:  document.pdf
    

1. Define patterns

We can build a regular expression from a text string, following these simple rules:

  • A plain string pattern matches only that same string.
    Listing 2:

    "hello" --> {"hello"}
    
  • A period '.' matches any single character.
    Listing 3:

    "h.llo" --> {"hello", "htllo", "hällo", "h5llo", ...}
    
  • A dash, as in 'A-Z', defines a range of characters bounded by the ASCII/Unicode codes of both ends.
    Listing 4:

    "A-Zello" --> {"Aello", "Bello", "Cello", ..., "Zello"}
    
    'A-Z': (65-90) (ABCDEFGHIJKLMNOPQRSTUVWXYZ)
    '0-9': (48-57) (0123456789)
    'á-ú': (225-250) (áâãäåæçèéêëìíîïðñòóôõö÷øùú)
    
Like String objects, patterns are expressed in UTF-8, so the entire Unicode set can be used to create regular expressions.
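The range rule can be pictured as a comparison between code points. The following is only a minimal sketch in standard C, not the SDK's implementation: `utf8_decode` and `in_range` are hypothetical helper names introduced for illustration, and the decoder assumes well-formed input (it does not reject overlong encodings).

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper (not part of the SDK): decodes the UTF-8 sequence
 * that starts at 's', stores its code point in '*cp' and returns the
 * number of bytes consumed, or 0 if the sequence is malformed. */
static size_t utf8_decode(const char *s, uint32_t *cp)
{
    const unsigned char *p = (const unsigned char *)s;

    if (p[0] < 0x80) {                                   /* 1 byte (ASCII) */
        *cp = p[0];
        return 1;
    }
    if ((p[0] & 0xE0) == 0xC0 && (p[1] & 0xC0) == 0x80) { /* 2 bytes */
        *cp = ((uint32_t)(p[0] & 0x1F) << 6) | (uint32_t)(p[1] & 0x3F);
        return 2;
    }
    if ((p[0] & 0xF0) == 0xE0 && (p[1] & 0xC0) == 0x80 &&
        (p[2] & 0xC0) == 0x80) {                          /* 3 bytes */
        *cp = ((uint32_t)(p[0] & 0x0F) << 12) |
              ((uint32_t)(p[1] & 0x3F) << 6) | (uint32_t)(p[2] & 0x3F);
        return 3;
    }
    if ((p[0] & 0xF8) == 0xF0 && (p[1] & 0xC0) == 0x80 &&
        (p[2] & 0xC0) == 0x80 && (p[3] & 0xC0) == 0x80) { /* 4 bytes */
        *cp = ((uint32_t)(p[0] & 0x07) << 18) |
              ((uint32_t)(p[1] & 0x3F) << 12) |
              ((uint32_t)(p[2] & 0x3F) << 6) | (uint32_t)(p[3] & 0x3F);
        return 4;
    }
    return 0;
}

/* Hypothetical helper: returns 1 if the first character of 'ch' lies in
 * the range whose ends are the first characters of 'from' and 'to'. */
static int in_range(const char *ch, const char *from, const char *to)
{
    uint32_t c, lo, hi;
    if (!utf8_decode(ch, &c) || !utf8_decode(from, &lo) || !utf8_decode(to, &hi))
        return 0;
    return lo <= c && c <= hi;
}
```

With these helpers, `in_range("M", "A", "Z")` is true because 77 lies between 65 and 90, and `in_range("\xC3\xB1", "\xC3\xA1", "\xC3\xBA")` is true because 'ñ' (241) lies between 'á' (225) and 'ú' (250).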
  • Brackets, as in '[áéíóú]', match any one of the enclosed characters.
    Listing 5:

    "h[áéíóú]llo" --> {"hállo", "héllo", "híllo", "hóllo", "húllo"}
    
  • The asterisk '*' allows the preceding character to appear zero or more times.
    Listing 6:

    "he*llo" --> {"hllo", "hello", "heello", "heeello", "heeeello", ...}
    "h.*llo" --> {"hllo", "hello", "hallo", "hillo", "hasello", ...}
    "hA-Z*llo" --> {"hllo", "hAllo", "hABllo", "hVFFRREASllo" }
               --> {"hAQWEDllo", "hAAABBRSllo", ...}
    "FILE_0-9*.PNG" --> {"FILE_.PNG", "FILE_0.PNG", "FILE_01.PNG" }
                    --> {"FILE_456.PNG", "FILE_112230.PNG",...}
    
  • Parentheses, as in '(he*llo)', group a regular expression so that it behaves as a single character.
    Listing 7:

    "[(hello)(bye)]" --> {"hello", "bye" }
    "[(red)(blue)(1*)]" --> {"red", "blue", "", "1", "11", "111", ... }
    "(hello)*" --> {"", "hello", "hellohello", "hellohellohello", ... }
    "(he*llo)ZZ" --> {"hlloZZ", "helloZZ", "heelloZZ", "heeelloZZ", ... }
    
  • To have '.', '-', '[]', '*', '()' interpreted as literal characters, precede them with a backslash '\'.
    Listing 8:

    "\(he\*\-llo\)" --> {"(he*-llo)"}
    
Remember that in expressions written as string constants in C code, the backslash itself must be escaped, so it is written as a double backslash: "\\(he\\*\\-llo\\)".
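When the text to match comes from a variable rather than a literal, the escaping can be done programmatically. The sketch below is an assumption for illustration only: `pattern_escape` is a hypothetical helper, not an SDK function, and it handles just the special characters listed above ('.', '-', '[', ']', '*', '(', ')').

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical helper (not part of the SDK): copies 'src' into 'dst',
 * placing a backslash before every pattern metacharacter.
 * Returns 1 on success, or 0 if 'dst' (of 'size' bytes) is too small. */
static int pattern_escape(const char *src, char *dst, size_t size)
{
    size_t n = 0;

    if (size == 0)
        return 0;

    for (; *src != '\0'; ++src) {
        int special = strchr(".-[]*()", *src) != NULL;
        size_t need = special ? 2 : 1;

        /* Keep one byte free for the terminating null character. */
        if (n + need + 1 > size)
            return 0;

        if (special)
            dst[n++] = '\\';
        dst[n++] = *src;
    }

    dst[n] = '\0';
    return 1;
}
```

Escaping the text (he*-llo) this way yields the 13-character pattern from Listing 8, which is the same string the C literal "\\(he\\*\\-llo\\)" holds in memory.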

2. Regular languages and automata

Regular languages are those that are defined recursively using three basic operations on the set of available characters (or symbols). They can be described using the regular expressions discussed above.

  • Each character 'a' is, by itself, a regular language A.
  • The union of two regular languages is a regular language A∪B.
  • The concatenation of two regular languages is a regular language A·B.
  • The closure of a regular language is a regular language A*. This is where recursion comes in.
In this context, the symbols are all Unicode characters, but languages can also be defined over other alphabets, including the binary alphabet {0, 1}.

To recognize whether or not a string belongs to a certain regular language, a finite automaton must be built following the rules shown in Figure 1.

Concatenation, Union and Closure of finite automata.
Figure 1: Construction of finite automata to filter regular expressions.
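To make the idea of an automaton concrete, here is a minimal sketch of a hand-written deterministic automaton that accepts the language of the pattern "he*llo". It is an illustration only and does not reproduce how the SDK builds its automata from the rules in Figure 1; the state names and the function `accepts_he_star_llo` are hypothetical.

```c
/* States of a hand-built automaton for the pattern "he*llo". */
enum {
    S_START,  /* nothing read yet                 */
    S_H,      /* read 'h' and any number of 'e'   */
    S_L1,     /* read the first 'l'               */
    S_L2,     /* read the second 'l'              */
    S_ACCEPT, /* read the final 'o'               */
    S_DEAD    /* the input can no longer match    */
};

/* Transition function: next state for 'state' on input character 'c'. */
static int next_state(int state, char c)
{
    switch (state) {
    case S_START:
        return c == 'h' ? S_H : S_DEAD;
    case S_H:
        if (c == 'e')
            return S_H; /* closure: loop on 'e' zero or more times */
        return c == 'l' ? S_L1 : S_DEAD;
    case S_L1:
        return c == 'l' ? S_L2 : S_DEAD;
    case S_L2:
        return c == 'o' ? S_ACCEPT : S_DEAD;
    default:
        return S_DEAD;  /* no character may follow the final 'o' */
    }
}

/* Returns 1 if the whole string 'str' is accepted, 0 otherwise. */
static int accepts_he_star_llo(const char *str)
{
    int state = S_START;
    for (; *str != '\0' && state != S_DEAD; ++str)
        state = next_state(state, *str);
    return state == S_ACCEPT;
}
```

The chain of states models concatenation, and the loop on S_H models the closure of 'e'. The function accepts "hllo", "hello" and "heeello", and rejects "helo" and "hellox".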

regex_create ()

Create a regular expression from a pattern.

RegEx*
regex_create(const char_t *pattern);
pattern

Search pattern.

Return

Regular expression (automaton).

Remarks

See Define patterns.


regex_destroy ()

Destroy a regular expression.

void
regex_destroy(RegEx **regex);
regex

Regular expression. Will be set to NULL after destruction.


regex_match ()

Check if a string matches the search pattern.

bool_t
regex_match(const RegEx *regex,
            const char_t *str);
regex

Regular expression.

str

String to evaluate.

Return

TRUE if the string is accepted by the regular expression.
