Lexical scanner
The lexical scanner helps us identify and label the different fields that make up a text.
Functions
LexScn* | lexscn_create (void) |
void | lexscn_destroy (...) |
void | lexscn_spaces (...) |
void | lexscn_newlines (...) |
void | lexscn_escapes (...) |
void | lexscn_comments (...) |
void | lexscn_start (...) |
void | lexscn_jump_bom (...) |
lextoken_t | lexscn_token (...) |
uint32_t | lexscn_row (...) |
uint32_t | lexscn_col (...) |
const char_t* | lexscn_lexeme (...) |
const char_t* | lexscn_string (...) |
void | lexscn_jump (...) |
uint32_t | lexscn_read_u32 (...) |
real32_t | lexscn_read_r32 (...) |
real64_t | lexscn_read_r64 (...) |
The lexical scanner will help us interpret the content of text files. While a Text stream can read characters, lines or delimited words, this is not enough if our task is to process more complicated grammars, such as the source code of a programming language. Here we enter a translation process whose first phase is the scanner itself (Figure 1). It is the only stage in contact with the text stream, reading characters one by one and grouping them into intermediate symbols called tokens.

NAppGUI incorporates a lexical scanner, very simple to use, integrated into the Core library. It recognizes the tokens of the C language, which are common to countless languages and grammars (Listing 1). It is implemented as a finite state machine and allows some configuration by activating different options.
LexScn *lex = lexscn_create();
Stream *stm = stm_from_file("source.c", NULL);
lextoken_t token;
lexscn_start(lex, stm);
while ((token = lexscn_token(lex)) != ekTEOF)
{
    switch (token)
    {
        ...
    }
}
- Use lexscn_create to create the scanner.
- Use lexscn_start to initialize the scanner.
- Use lexscn_token to read the next token of a text stream.
1. Tokens
A token is a group of one or more characters labeled as a grammatical symbol. Later stages of grammar processing will no longer work with characters, but will do so directly with these symbols. For example, imagine that we have a small language to process additions (Listing 2):
a = 45 + 12
b = 18 + 97
result = a + b
First of all, we need to identify, within this set of characters, the elements relevant to our purpose. The lexical analysis of this file is shown in (Listing 3) (Figure 2). As we can see, it has identified the text "45" as a number, "result" as a variable and "=" as the equals symbol. In all cases, the text string associated with a token is called its lexeme.
Token    Lexeme
-----    ------
var      'a'
equal    '='
number   '45'
plus     '+'
number   '12'
var      'b'
equal    '='
number   '18'
plus     '+'
number   '97'
var      'result'
equal    '='
var      'a'
plus     '+'
var      'b'
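As a worked example, the following sketch feeds the source of Listing 2 to the scanner and prints one token/lexeme pair per line, roughly reproducing Listing 3. Only the lexscn_* calls are taken from this page; stm_from_block, stm_close, strlen and printf are assumed helpers from the Core stream API and the C standard library.

/* Sketch: dump token/lexeme pairs for the Listing 2 source (assumed stream helpers). */
const char_t *src = "a = 45 + 12\nb = 18 + 97\nresult = a + b\n";
Stream *stm = stm_from_block((const byte_t*)src, (uint32_t)strlen(src));
LexScn *lex = lexscn_create();
lextoken_t token;
lexscn_start(lex, stm);
while ((token = lexscn_token(lex)) != ekTEOF)
    printf("%s - '%s'\n", lexscn_string(token), lexscn_lexeme(lex, NULL));
lexscn_destroy(&lex);
stm_close(&stm);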

1.1. Identifiers
An identifier is an alphanumeric "word" that must begin with a letter or '_' and may contain letters, numbers or '_'. It is used to name variables, functions, reserved words, etc. Identifiers cannot contain spaces or other symbols (Listing 4) (Figure 3).
OK: while cos _reSult a56B _06_t aG h9 _12AcVb
NO: 045 ?er "_5G _tg(

1.2. Keywords
Keywords are identifiers that have been reserved by the grammar and cannot be used for other purposes, such as naming variables or functions. Since the scanner is general purpose, it does not recognize the reserved words of any particular programming language or file format. You have to label them explicitly after reading the token (Listing 5).
while ((token = lexscn_token(lex)) != ekTEOF)
{
    if (token == ekTIDENT && str_equ_c(lexscn_lexeme(lex, NULL), "while"))
        token = ekTRESERVED;
    ...
}
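If the grammar has several reserved words, the same check generalizes to a small table. This is a hedged sketch built only from the calls already used in Listing 5; the kRESERVED array is hypothetical.

/* Sketch: promote identifiers found in a keyword table to ekTRESERVED. */
static const char_t *kRESERVED[] = { "while", "if", "else", "return" };
if (token == ekTIDENT)
{
    const char_t *lexeme = lexscn_lexeme(lex, NULL);
    uint32_t i;
    for (i = 0; i < 4; ++i)
    {
        if (str_equ_c(lexeme, kRESERVED[i]) == TRUE)
        {
            token = ekTRESERVED;
            break;
        }
    }
}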
1.3. Strings
A text string is a series of Unicode characters enclosed in double quotes (") (Figure 4). LexScn recognizes C escape sequences to represent unprintable codes or characters not available on the keyboard (Listing 6).
- Use lexscn_escapes to make escape sequences effective when reading strings.
Strings are labeled with the ekTSTRING token.
\a           07    Alert (Beep, Bell) (added in C89)
\b           08    Backspace
\f           0C    Formfeed (Page Break)
\n           0A    Newline (Line Feed)
\r           0D    Carriage Return
\t           09    Horizontal Tab
\v           0B    Vertical Tab
\\           5C    Backslash
\'           27    Single quotation mark
\"           22    Double quotation mark
\?           3F    Question mark (used to avoid trigraphs)
\nnn               The byte whose numerical value is given by nnn interpreted as an octal number
\xhh               The byte whose numerical value is given by hh interpreted as a hexadecimal number
\Uhhhhhhhh         Unicode code point where h is a hexadecimal digit
\uhhhh             Unicode code point below 10000 hexadecimal
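A small usage sketch follows. It assumes that, with lexscn_escapes activated, sequences such as \n are converted inside the lexeme of the resulting ekTSTRING token; stm_from_block and stm_close are assumed stream helpers.

/* Sketch: read a quoted string with escape processing enabled (assumed semantics). */
const char_t *src = "\"Hello\\nWorld\"";
Stream *stm = stm_from_block((const byte_t*)src, (uint32_t)strlen(src));
LexScn *lex = lexscn_create();
lexscn_escapes(lex, TRUE);
lexscn_start(lex, stm);
if (lexscn_token(lex) == ekTSTRING)
    printf("%s\n", lexscn_lexeme(lex, NULL));   /* the lexeme should contain a real line break (assumption) */
lexscn_destroy(&lex);
stm_close(&stm);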

1.4. Numbers
Numeric tokens are a little more involved, due to the different numeric bases and the exponential notation of real numbers (Figure 5). We summarize the rules briefly; they are common to many programming languages (C included).
- If the number starts with 0 it will be considered octal (base 8); the following digits are therefore limited to 0-7, e.g.: 043, 001, 0777.
- If the number starts with 0x it will be considered hexadecimal (base 16), with digits 0-9 a-f A-F, e.g.: 0x4F, 0XAA5, 0x01EAC.
- As soon as a decimal point '.' appears, the number will be considered real. A leading point is valid, e.g.: .56.
- An integer or real number allows exponential notation with the character 'e' ('E'), e.g.: 12.4e2, .56e3, 1e4.
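When the numeric values themselves are needed, the lexscn_read_* functions documented below can be used. A minimal sketch, again assuming the stm_from_block/stm_close stream helpers:

/* Sketch: read three numeric tokens directly as values. */
const char_t *src = "255 1024 12.4e2";
Stream *stm = stm_from_block((const byte_t*)src, (uint32_t)strlen(src));
LexScn *lex = lexscn_create();
uint32_t a, b;
real64_t c;
lexscn_start(lex, stm);
a = lexscn_read_u32(lex);   /* 255 */
b = lexscn_read_u32(lex);   /* 1024 */
c = lexscn_read_r64(lex);   /* 1240.0 */
lexscn_destroy(&lex);
stm_close(&stm);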

1.5. Symbols
Symbols are single-character tokens that represent almost all US-ASCII punctuation marks; they are often used as operators, separators or delimiters within grammars (Listing 7) (Figure 6).
The symbols recognized by LexScn are listed below.
< > , . ; : ( ) [ ] { } + - * = $ % # & ' " ^ ! ? | / \ @

lexscn_create ()
Create a lexical analyzer.
LexScn* lexscn_create(void);
Return
The newly created lexical analyzer.
lexscn_destroy ()
Destroy a lexical analyzer.
void lexscn_destroy(LexScn **lex);
lex | Lexical analyzer. Will be set to NULL after destruction. |
lexscn_spaces ()
Blank spaces option.
void lexscn_spaces(LexScn *lex, const bool_t activate);
lex | Lexical Analyzer. |
activate | Enables or disables the option. |
lexscn_newlines ()
New line option.
void lexscn_newlines(LexScn *lex, const bool_t activate);
lex | Lexical Analyzer. |
activate | Enables or disables the option. |
Remarks
It only has an effect if lexscn_spaces is FALSE.
lexscn_escapes ()
Escape sequences option.
void lexscn_escapes(LexScn *lex, const bool_t activate);
lex | Lexical Analyzer. |
activate | Enables or disables the option. |
lexscn_comments ()
Comments option.
void lexscn_comments(LexScn *lex, const bool_t activate);
lex | Lexical Analyzer. |
activate | Enables or disables the option. |
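As an illustration, a possible configuration sequence is sketched below. The exact meaning of TRUE/FALSE for each option is not detailed on this page, and the call order shown (configure after creation, before lexscn_start) is also an assumption.

/* Sketch: configure the scanner options before starting it (assumed call order). */
Stream *stm = stm_from_file("source.c", NULL);
LexScn *lex = lexscn_create();
lexscn_spaces(lex, FALSE);
lexscn_newlines(lex, TRUE);   /* only has effect when lexscn_spaces is FALSE (see Remarks above) */
lexscn_escapes(lex, TRUE);
lexscn_comments(lex, FALSE);
lexscn_start(lex, stm);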
lexscn_start ()
Start the analyzer. Resets the row and column counters.
void lexscn_start(LexScn *lex, Stream *stm);
lex | Lexical Analyzer. |
stm | Text stream with the data source to read from. |
lexscn_jump_bom ()
Skip the possible Byte Order Mark (BOM) sequence found at the beginning of some UTF8 files/streams.
void lexscn_jump_bom(LexScn *lex);
lex | Lexical analyzer. |
Remarks
This function will have no effect if there is no such sequence at the beginning of the stream. The BOM is common in streams coming from some web servers.
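A short sketch of a plausible call order, assuming lexscn_jump_bom is invoked right after lexscn_start; the file name is only an example.

/* Sketch: skip a possible UTF8 BOM before tokenizing a file. */
Stream *stm = stm_from_file("data.txt", NULL);
LexScn *lex = lexscn_create();
lexscn_start(lex, stm);
lexscn_jump_bom(lex);   /* harmless if the stream has no BOM */
while (lexscn_token(lex) != ekTEOF)
{
    /* process tokens */
}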
lexscn_token ()
Get the next token from the text stream assigned in lexscn_start.
lextoken_t lexscn_token(LexScn *lex);
lex | Lexical Analyzer. |
Return
The token type.
lexscn_row ()
Get the row number of the last token read.
uint32_t lexscn_row(const LexScn *lex);
lex | Lexical Analyzer. |
Return
Row number.
lexscn_col ()
Get the column number of the first character of the last token read.
uint32_t lexscn_col(const LexScn *lex);
lex | Lexical Analyzer. |
Return
Column number.
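For diagnostics, the two counters are typically combined with the lexeme; a minimal sketch using standard printf:

/* Sketch: report the position of an unexpected token. */
lextoken_t token = lexscn_token(lex);
if (token != ekTIDENT)
{
    printf("error at %u:%u, unexpected '%s'\n",
           (unsigned)lexscn_row(lex), (unsigned)lexscn_col(lex),
           lexscn_lexeme(lex, NULL));
}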
lexscn_lexeme ()
Get the lexeme of the last token read. The lexeme is the text string associated with the token.
const char_t* lexscn_lexeme(const LexScn *lex, uint32_t *size);
lex | Lexical Analyzer. |
size | Size in bytes of the lexeme, not counting the null terminator. |
Return
The lexeme. It is stored in a temporary buffer and will be lost when reading the next token. If you need it, make a copy with str_c.
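Since the buffer is temporary, the copy must be taken before the next call to lexscn_token. A minimal sketch, assuming the String type and str_destroy from the Core strings API:

/* Sketch: keep the lexeme beyond the next lexscn_token() call. */
uint32_t size;
String *name = str_c(lexscn_lexeme(lex, &size));
/* ... keep reading tokens, 'name' remains valid ... */
str_destroy(&name);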
lexscn_string ()
Get descriptive text from a token type. It is useful for debugging tasks.
const char_t* lexscn_string(const lextoken_t token);
const char_t *desc = lexscn_string(ekTEOL);
// desc = "newline"
token | Token. |
Return
Descriptive text.
lexscn_jump ()
Skip the next token in the stream. If the token does not correspond to the one indicated, the stream will be marked as corrupt.
void lexscn_jump(LexScn *lex, const lextoken_t token);
void lexscn_jump(LexScn *lex, const lextoken_t token)
{
    lextoken_t tok = lexscn_token(lex);
    if (tok != token)
        stm_corrupt(lex->stm);
}
lex | Lexical analyzer. |
token | Expected token. |
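For example, a parser that expects each statement to end with a line break could use it as sketched below; ekTEOL is the newline token shown in the lexscn_string example, and whether newline tokens are emitted depends on the lexscn_newlines option.

/* Sketch: require a newline token at this point of the grammar. */
lexscn_jump(lex, ekTEOL);   /* marks the stream as corrupt if the next token is not a newline */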
lexscn_read_u32 ()
Read the next token and transform it to uint32_t. If the token is not numeric, the stream will be marked as corrupt.
uint32_t lexscn_read_u32(LexScn *lex);
lex | Lexical analyzer. |
Return
The value read.
lexscn_read_r32 ()
Read the next token and transform it to real32_t. If the token is not numeric, the stream will be marked as corrupt.
real32_t lexscn_read_r32(LexScn *lex);
lex | Lexical analyzer. |
Return
The value read.
lexscn_read_r64 ()
Read the next token and transform it to real64_t. If the token is not numeric, the stream will be marked as corrupt.
real64_t lexscn_read_r64(LexScn *lex);
lex | Lexical analyzer. |
Return
The value read.