
Cross-platform C SDK

Lexical scanner


The lexical analyzer helps us identify and label the different tokens that make up a text.


Functions

LexScn* lexscn_create(void)
void lexscn_destroy(...)
void lexscn_spaces(...)
void lexscn_newlines(...)
void lexscn_escapes(...)
void lexscn_comments(...)
void lexscn_start(...)
void lexscn_jump_bom(...)
lextoken_t lexscn_token(...)
uint32_t lexscn_row(...)
uint32_t lexscn_col(...)
const char_t* lexscn_lexeme(...)
const char_t* lexscn_string(...)
void lexscn_jump(...)
uint32_t lexscn_read_u32(...)
real32_t lexscn_read_r32(...)
real64_t lexscn_read_r64(...)

The lexical scanner helps us interpret the content of text files. While Text streams can read characters, lines or delimited words, they are not enough if our task is to process more complicated grammars, such as the source code of a programming language. Here we enter a translation process whose first phase is the scanner itself (Figure 1). It is the only phase in contact with the text stream, reading characters one by one and grouping them into intermediate symbols called tokens.

Figure 1: Lexical analysis is the first step when interpreting or translating texts.

NAppGUI incorporates a lexical analyzer, very simple to use, integrated into the Core library. It recognizes the tokens of the C language, which are common to countless languages and grammars (Listing 1). It is implemented as a finite state machine and allows some configuration by activating different options.

Listing 1: Reading the tokens of a C file.
LexScn *lex = lexscn_create();
Stream *stm = stm_from_file("source.c", NULL);
lextoken_t token;
lexscn_start(lex, stm);
while ((token = lexscn_token(lex)) != ekTEOF)
{
    switch (token)
    {
        ...
    }
}
stm_close(&stm);
lexscn_destroy(&lex);

1. Tokens

A token is a group of one or more characters labeled as a grammatical symbol. Later stages of grammar processing no longer work with characters, but directly with these symbols. For example, imagine that we have a small language for processing additions (Listing 2):

Listing 2: Small program of sums that we must interpret.
a = 45 + 12
b = 18 + 97
result = a + b

First of all, we need to identify, within this set of characters, the elements relevant to our purpose. A lexical analysis of this file is shown in (Listing 3) (Figure 2). As we can see, it has identified the text "45" as a number, "result" as a variable and "=" as the equal symbol. In all cases, the text string associated with the token is called the lexeme.

Listing 3: Lexical analysis of (Listing 2).
Token      Lexeme
-----      ------
var     -   'a'
equal   -   '='
number  -   '45'
plus    -   '+'
number  -   '12'
var     -   'b'
equal   -   '='
number  -   '18'
plus    -   '+'
number  -   '97'
var     -   'result'
equal   -   '='
var     -   'a'
plus    -   '+'
var     -   'b'
Figure 2: The lexical analyzer fragments a text into labeled blocks.

1.1. Identifiers

An identifier is an alphanumeric "word" that must begin with a letter or '_' and may contain letters, numbers or '_'. Identifiers are used to name variables, functions, reserved words, etc. They may not contain spaces or symbols (Listing 4) (Figure 3).

Listing 4: Correct and incorrect identifiers.
OK: while cos _reSult a56B _06_t aG h9 _12AcVb
NO: 045 ?er "_5G _tg(
Figure 3: Finite automaton that recognizes an identifier in C.

1.2. Keywords

Keywords are identifiers that have been reserved by the grammar and cannot be used for other purposes, such as naming variables or functions. Since the scanner is general purpose, it does not recognize the reserved words of any particular programming language or file format. You must label them explicitly after reading the token (Listing 5).

Listing 5: Recognizing the keyword while.
while ((token = lexscn_token(lex)) != ekTEOF)
{
    if (token == ekTIDENT && str_equ_c(lexscn_lexeme(lex, NULL), "while"))
        token = ekTRESERVED;
    ...
}

1.3. Strings

A text string is a series of Unicode characters enclosed in quotes (") (Figure 4). LexScn recognizes C escape sequences, used to represent unprintable codes or characters unavailable on the keyboard (Listing 6).

Use lexscn_escapes to make escape sequences effective when reading strings.

Listing 6: Escape sequences accepted in ekTSTRING.
\a      07  Alert (Beep, Bell) (added in C89)
\b      08  Backspace
\f      0C  Form Feed (Page Break)
\n      0A  Newline (Line Feed)
\r      0D  Carriage Return
\t      09  Horizontal Tab
\v      0B  Vertical Tab
\\      5C  Backslash
\'      27  Single quotation mark
\"      22  Double quotation mark
\?      3F  Question mark (used to avoid trigraphs)
\nnn        The byte whose numerical value is given by nnn, interpreted as an octal number
\xhh        The byte whose numerical value is given by hh, interpreted as a hexadecimal number
\Uhhhhhhhh  Unicode code point, where h is a hexadecimal digit
\uhhhh      Unicode code point below hexadecimal 10000

Figure 4: Finite automaton that recognizes a text string.

1.4. Numbers

In the case of numerical tokens, things get a bit more complicated due to the different numerical bases and the exponential representation of real numbers (Figure 5). We summarize the rules briefly; they are common to many programming languages (C included).

  • If the number starts with 0 it is considered octal (base 8); the following digits are therefore limited to 0-7, e.g. 043, 001, 0777.
  • If the number starts with 0x it is considered hexadecimal (base 16), with digits 0-9 a-f A-F, e.g. 0x4F, 0XAA5, 0x01EAC.
  • As soon as a decimal point '.' appears, the number is considered real. A leading point is valid, e.g. .56.
  • An integer or real number allows exponential notation with the character 'e' ('E'), e.g. 12.4e2, .56e3, 1e4.

Figure 5: Finite automaton that recognizes numbers in different bases.

1.5. Symbols

Symbols are single-character tokens representing almost all US-ASCII punctuation marks. They are often used as operators, separators or delimiters within grammars (Listing 7) (Figure 6).

Listing 7: Symbols recognized as tokens by LexScn.
< > , . ; : ( ) [ ] { } + - * = $ % # & ' " ^ ! ? | / \ @ 
Figure 6: Finite automaton that recognizes the plus, minus, asterisk and equal symbols.

lexscn_create ()

Create a lexical analyzer.

LexScn*
lexscn_create(void);

Return

The newly created lexical analyzer.


lexscn_destroy ()

Destroy a lexical analyzer.

void
lexscn_destroy(LexScn **lex);
lex

Lexical Analyzer. Will be set to NULL after destruction.


lexscn_spaces ()

Blank spaces option.

void
lexscn_spaces(LexScn *lex,
              const bool_t activate);
lex

Lexical Analyzer.

activate

If TRUE, the analyzer returns the ekTSPACE token when it finds sequences of blank spaces. If FALSE, blank spaces are ignored. Default: FALSE.


lexscn_newlines ()

New Line Option.

void
lexscn_newlines(LexScn *lex,
                const bool_t activate);
lex

Lexical Analyzer.

activate

If TRUE, the analyzer returns the ekTEOL token when it finds new line characters. If FALSE, new lines are ignored. Default: FALSE.

Remarks

It only has an effect if lexscn_spaces is FALSE.


lexscn_escapes ()

Escape Sequence Option.

void
lexscn_escapes(LexScn *lex,
               const bool_t activate);
lex

Lexical Analyzer.

activate

If TRUE, escape sequences are processed when reading ekTSTRING tokens. For example, the sequence "\n" is transformed into the character 0x0A (10). If FALSE, escape sequences are ignored and text strings are read literally. Default: FALSE.


lexscn_comments ()

Comments option.

void
lexscn_comments(LexScn *lex,
                const bool_t activate);
lex

Lexical Analyzer.

activate

If TRUE, the analyzer returns the ekTMLCOM token each time it finds a C comment /* Comment */ and ekTSLCOM for C++ comments // Comment. If FALSE, comments are ignored. Default: FALSE.
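The four options above can be combined before scanning. A sketch of a hypothetical setup, using only the functions documented on this page, that also reports comments and line breaks:

```c
LexScn *lex = lexscn_create();
lexscn_comments(lex, TRUE);   /* report ekTMLCOM / ekTSLCOM tokens */
lexscn_newlines(lex, TRUE);   /* report ekTEOL tokens */
lexscn_escapes(lex, TRUE);    /* decode \n, \t, ... inside ekTSTRING */
/* lexscn_spaces stays FALSE (its default), so the ekTEOL option
   above remains effective (see the remark in lexscn_newlines) */
```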


lexscn_start ()

Start the analyzer. Zero the line and column counter.

void
lexscn_start(LexScn *lex,
             Stream *stm);
lex

Lexical Analyzer.

stm

Text stream to read from, containing the data source.


lexscn_jump_bom ()

Skip the possible Byte Order Mark (BOM) sequence found at the beginning of some UTF8 files/streams.

void
lexscn_jump_bom(LexScn *lex);
lex

Lexical analyzer.

Remarks

This function will have no effect if there is no such sequence at the beginning of the stream. The BOM is common in streams coming from some web servers.


lexscn_token ()

Get the next token from the Text stream assigned in lexscn_start.

lextoken_t
lexscn_token(LexScn *lex);
lex

Lexical Analyzer.

Return

The token type.


lexscn_row ()

Get the row number of the last token read.

uint32_t
lexscn_row(const LexScn *lex);
lex

Lexical Analyzer.

Return

Row number.


lexscn_col ()

Get the column number of the first character of the last token read.

uint32_t
lexscn_col(const LexScn *lex);
lex

Lexical Analyzer.

Return

Column number.


lexscn_lexeme ()

Get the lexeme of the last token read. The lexeme is the text string associated with the token.

const char_t*
lexscn_lexeme(const LexScn *lex,
              uint32_t *size);
lex

Lexical Analyzer.

size

Size in bytes of the lexeme, not counting the null character '\0'. Equivalent to str_len_c of the lexeme.

Return

The lexeme. It is stored in a temporary buffer and will be lost when the next token is read. If you need to keep it, make a copy with str_c.


lexscn_string ()

Get a descriptive text for a token type. Useful for debugging tasks.

const char_t*
lexscn_string(const lextoken_t token);
const char_t *desc = lexscn_string(ekTEOL);
// desc = "newline";
token

Token.

Return

Descriptive text.


lexscn_jump ()

Skip the next token in the stream. If the token does not match the expected one, the stream will be marked as corrupt.

void
lexscn_jump(LexScn *lex,
            const lextoken_t token);
void lexscn_jump(LexScn *lex, const lextoken_t token)
{
    lextoken_t tok = lexscn_token(lex);
    if (tok != token)
        stm_corrupt(lex->stm);
}
lex

Lexical analyzer.

token

Expected token.


lexscn_read_u32 ()

Read the next token and transform it to uint32_t. If the token is not numeric, the stream will be marked as corrupt.

uint32_t
lexscn_read_u32(LexScn *lex);
lex

Lexical analyzer.

Return

The value read.


lexscn_read_r32 ()

Read the next token and transform it to real32_t. If the token is not numeric, the stream will be marked as corrupt.

real32_t
lexscn_read_r32(LexScn *lex);
lex

Lexical analyzer.

Return

The value read.


lexscn_read_r64 ()

Read the next token and transform it to real64_t. If the token is not numeric, the stream will be marked as corrupt.

real64_t
lexscn_read_r64(LexScn *lex);
lex

Lexical analyzer.

Return

The value read.
