tokenizer.h

This section contains reference documentation for working with protocol buffer classes in C++.

`#include <google/protobuf/io/tokenizer.h>`
`namespace google::protobuf::io`

Class for parsing tokenized text from a ZeroCopyInputStream.

Classes in this file
`ErrorCollector` Abstract interface for an object which collects the errors that occur during parsing.
`Tokenizer` This class converts a stream of raw text into a stream of tokens for the protocol definition parser to parse.
`Tokenizer::Token` Structure representing a token read from the token stream.

File Members
These definitions are not part of any class.
`typedef`	`int ColumnNumber` By "column number", the proto compiler refers to a count of the number of bytes before a given byte, except that a tab character advances to the next multiple of 8 bytes.

`typedef io::ColumnNumber`

By "column number", the proto compiler refers to a count of the number of bytes before a given byte, except that a tab character advances to the next multiple of 8 bytes.

Note in particular that column numbers are zero-based, while many user interfaces use one-based column numbers.
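The column rule above can be sketched as a small standalone function. `ColumnOfByte` and `ToOneBased` are hypothetical helpers written for this illustration; they are not part of tokenizer.h.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Zero-based column of the byte at `byte_index` within `line`: every byte
// before it advances the column by one, except that a tab advances the
// column to the next multiple of 8.
// Hypothetical helper for illustration; not part of tokenizer.h.
int ColumnOfByte(const std::string& line, std::size_t byte_index) {
  assert(byte_index <= line.size());
  constexpr int kTabWidth = 8;
  int column = 0;
  for (std::size_t i = 0; i < byte_index; ++i) {
    if (line[i] == '\t') {
      column += kTabWidth - column % kTabWidth;  // jump to next multiple of 8
    } else {
      ++column;
    }
  }
  return column;
}

// Converts a zero-based column to the one-based form many editors display.
int ToOneBased(int zero_based) { return zero_based + 1; }
```

For the line "ab\tx", the byte `x` has column 8, which an editor with one-based columns would show as 9.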

class ErrorCollector

`#include <google/protobuf/io/tokenizer.h>`
`namespace google::protobuf::io`

Abstract interface for an object which collects the errors that occur during parsing.

A typical implementation might simply print the errors to stdout.

Members
	`ErrorCollector()`
`virtual`	`~ErrorCollector()`
`virtual void`	`AddError(int line, ColumnNumber column, const std::string & message) = 0` Indicates that there was an error in the input at the given line and column numbers.
`virtual void`	`AddWarning(int , ColumnNumber , const std::string & )` Indicates that there was a warning in the input at the given line and column numbers.

`virtual void ErrorCollector::AddError( int line, ColumnNumber column, const std::string & message) = 0`

Indicates that there was an error in the input at the given line and column numbers.

The numbers are zero-based, so you may want to add 1 to each before printing them.

`virtual void ErrorCollector::AddWarning( int , ColumnNumber , const std::string & )`

Indicates that there was a warning in the input at the given line and column numbers.

The numbers are zero-based, so you may want to add 1 to each before printing them.
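A collector that records messages with one-based positions might look like the sketch below. To stay compilable without the protobuf headers, it declares a local stand-in, `ErrorCollectorLike`, with the same two methods documented above; in real code you would derive from `google::protobuf::io::ErrorCollector` instead. `ListErrorCollector` and its message format are assumptions made for this example.

```cpp
#include <cassert>
#include <string>
#include <utility>
#include <vector>

using ColumnNumber = int;

// Stand-in with the same shape as the interface documented above.
// Assumption for illustration only; not the library class.
class ErrorCollectorLike {
 public:
  virtual ~ErrorCollectorLike() = default;
  virtual void AddError(int line, ColumnNumber column,
                        const std::string& message) = 0;
  virtual void AddWarning(int /*line*/, ColumnNumber /*column*/,
                          const std::string& /*message*/) {}
};

// Hypothetical collector that stores "file:line:column: kind: message",
// adding 1 to the zero-based line and column before formatting.
class ListErrorCollector : public ErrorCollectorLike {
 public:
  explicit ListErrorCollector(std::string filename)
      : filename_(std::move(filename)) {}

  void AddError(int line, ColumnNumber column,
                const std::string& message) override {
    errors_.push_back(Format("error", line, column, message));
  }

  void AddWarning(int line, ColumnNumber column,
                  const std::string& message) override {
    warnings_.push_back(Format("warning", line, column, message));
  }

  const std::vector<std::string>& errors() const { return errors_; }
  const std::vector<std::string>& warnings() const { return warnings_; }

 private:
  std::string Format(const std::string& kind, int line, ColumnNumber column,
                     const std::string& message) const {
    return filename_ + ":" + std::to_string(line + 1) + ":" +
           std::to_string(column + 1) + ": " + kind + ": " + message;
  }

  std::string filename_;
  std::vector<std::string> errors_;
  std::vector<std::string> warnings_;
};
```

Storing the messages rather than printing them keeps the collector easy to inspect in tests; printing to stdout, as mentioned above, is an equally valid choice.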

class Tokenizer

`#include <google/protobuf/io/tokenizer.h>`
`namespace google::protobuf::io`

This class converts a stream of raw text into a stream of tokens for the protocol definition parser to parse.

The tokens recognized are similar to those that make up the C language; see the TokenType enum for precise descriptions. Whitespace and comments are skipped. By default, C- and C++-style comments are recognized, but other styles can be used by calling set_comment_style().

Members
`enum`	`TokenType`
	`Tokenizer(ZeroCopyInputStream * input, ErrorCollector * error_collector)` Construct a Tokenizer that reads and tokenizes text from the given input stream and writes errors to the given error_collector.
	`~Tokenizer()`
`const Token &`	`current()` Get the current token.
`const Token &`	`previous()` Return the previous token, i.e. what current() returned before the previous call to Next().
`bool`	`Next()` Advance to the next token.
`bool`	`NextWithComments(std::string * prev_trailing_comments, std::vector< std::string > * detached_comments, std::string * next_leading_comments)` Like Next(), but also collects comments which appear between the previous and next tokens.
Options
`enum`	`CommentStyle` Valid values for set_comment_style().
`void`	`set_allow_f_after_float(bool value)` Set true to allow floats to be suffixed with the letter 'f'.
`void`	`set_comment_style(CommentStyle style)` Sets the comment style.
`void`	`set_require_space_after_number(bool require)` Whether to require whitespace between a number and a field name.
`void`	`set_allow_multiline_strings(bool allow)` Whether to allow string literals to span multiple lines.
`static bool`	`IsIdentifier(const std::string & text)` External helper: validate an identifier.
Parse helpers
`static double`	`ParseFloat(const std::string & text)` Parses a TYPE_FLOAT token.
`static void`	`ParseString(const std::string & text, std::string * output)` Parses a TYPE_STRING token.
`static void`	`ParseStringAppend(const std::string & text, std::string * output)` Identical to ParseString, but appends to output.
`static bool`	`ParseInteger(const std::string & text, uint64 max_value, uint64 * output)` Parses a TYPE_INTEGER token.

`enum Tokenizer::TokenType { TYPE_START, TYPE_END, TYPE_IDENTIFIER, TYPE_INTEGER, TYPE_FLOAT, TYPE_STRING, TYPE_SYMBOL }`

TYPE_START	Next() has not yet been called.
TYPE_END	End of input reached. "text" is empty.
TYPE_IDENTIFIER	A sequence of letters, digits, and underscores, not starting with a digit. It is an error for a number to be followed by an identifier with no space in between.
TYPE_INTEGER	A sequence of digits representing an integer. Normally the digits are decimal, but a prefix of "0x" indicates a hex number and a leading zero indicates octal, just like with C numeric literals. A leading negative sign is NOT included in the token; it's up to the parser to interpret the unary minus operator on its own.
TYPE_FLOAT	A floating point literal, with a fractional part and/or an exponent. Always in decimal. Again, never negative.
TYPE_STRING	A quoted sequence of escaped characters. Either single or double quotes can be used, but they must match. A string literal cannot cross a line break.
TYPE_SYMBOL	Any other printable character, like '!' or '+'. Symbols are always a single character, so "!+$%" is four tokens.
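The TYPE_IDENTIFIER rule (letters, digits, and underscores, not starting with a digit) can be checked with a few lines of standalone code. `LooksLikeIdentifier` is a hypothetical helper written for this illustration and assumes ASCII letters; it is a sketch of the documented rule, not the library's `IsIdentifier` implementation.

```cpp
#include <cassert>
#include <string>

// Returns true if `text` is a non-empty sequence of ASCII letters, digits,
// and underscores that does not start with a digit.
// Hypothetical helper for illustration; not part of tokenizer.h.
bool LooksLikeIdentifier(const std::string& text) {
  auto is_letter = [](char c) {
    return (c >= 'a' && c <= 'z') || (c >= 'A' && c <= 'Z') || c == '_';
  };
  auto is_digit = [](char c) { return c >= '0' && c <= '9'; };

  if (text.empty() || !is_letter(text[0])) return false;
  for (char c : text) {
    if (!is_letter(c) && !is_digit(c)) return false;
  }
  return true;
}
```

Under this rule "foo_1" and "_x" qualify, while "1foo", "a-b", and the empty string do not.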

`Tokenizer::Tokenizer( ZeroCopyInputStream * input, ErrorCollector * error_collector)`

Construct a Tokenizer that reads and tokenizes text from the given input stream and writes errors to the given error_collector.

The caller keeps ownership of input and error_collector.

`const Token & Tokenizer::current()`

Get the current token.

This is updated when Next() is called. Before the first call to Next(), current() has type TYPE_START and no contents.

`const Token & Tokenizer::previous()`

Return the previous token, i.e. what current() returned before the previous call to Next().

`bool Tokenizer::Next()`

Advance to the next token.

Returns false if the end of the input is reached.

`bool Tokenizer::NextWithComments( std::string * prev_trailing_comments, std::vector< std::string > * detached_comments, std::string * next_leading_comments)`

Like Next(), but also collects comments which appear between the previous and next tokens.

Comments which appear to be attached to the previous token are stored in *prev_trailing_comments. Comments which appear to be attached to the next token are stored in *next_leading_comments. Comments appearing in between which do not appear to be attached to either will be added to detached_comments. Any of these parameters can be NULL to simply discard the comments.

A series of line comments appearing on consecutive lines, with no other tokens appearing on those lines, will be treated as a single comment.

Only the comment content is returned; comment markers (e.g. //) are stripped out. For block comments, leading whitespace and an asterisk will be stripped from the beginning of each line other than the first. Newlines are included in the output.

Examples:

    optional int32 foo = 1;  // Comment attached to foo.
    // Comment attached to bar.
    optional int32 bar = 2;

    optional string baz = 3;
    // Comment attached to baz.
    // Another line attached to baz.

    // Comment attached to qux.
    //
    // Another line attached to qux.
    optional double qux = 4;

    // Detached comment. This is not attached to qux or corge
    // because there are blank lines separating it from both.

    optional string corge = 5;
    /* Block comment attached
     * to corge. Leading asterisks
     * will be removed. */
    /* Block comment attached to
     * grault. */
    optional int32 grault = 6;
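The block-comment stripping rule described above (markers removed; leading whitespace and one asterisk removed from every line after the first; newlines kept) can be sketched as follows. `StripBlockComment` is a hypothetical helper for illustration; it follows the rule as stated here and is not the tokenizer's own code, which may differ in edge cases.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Takes a whole block comment, including the "/*" and "*/" markers, and
// returns its content with the markers removed. On every line other than
// the first, leading spaces/tabs and at most one '*' are dropped.
// Hypothetical helper for illustration; not part of tokenizer.h.
std::string StripBlockComment(const std::string& comment) {
  assert(comment.size() >= 4 && comment.compare(0, 2, "/*") == 0 &&
         comment.compare(comment.size() - 2, 2, "*/") == 0);
  const std::string body = comment.substr(2, comment.size() - 4);

  std::string out;
  std::size_t pos = 0;
  bool first_line = true;
  while (pos < body.size()) {
    // `end` is one past the newline, so the newline stays in the output.
    const std::size_t eol = body.find('\n', pos);
    const std::size_t end = (eol == std::string::npos) ? body.size() : eol + 1;
    std::size_t start = pos;
    if (!first_line) {
      while (start < end && (body[start] == ' ' || body[start] == '\t')) {
        ++start;
      }
      if (start < end && body[start] == '*') ++start;
    }
    out.append(body, start, end - start);
    first_line = false;
    pos = end;
  }
  return out;
}
```

Applied to the comment attached to corge in the example, this yields three lines of text with no asterisks, each still beginning with the space that followed the marker or asterisk.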

`enum Tokenizer::CommentStyle { CPP_COMMENT_STYLE, SH_COMMENT_STYLE }`

Valid values for set_comment_style().

CPP_COMMENT_STYLE	Line comments begin with "//", block comments are delimited by "/*" and "*/".
SH_COMMENT_STYLE	Line comments begin with "#". No way to write block comments.

`void Tokenizer::set_allow_f_after_float( bool value)`

Set true to allow floats to be suffixed with the letter 'f'.

Tokens which would otherwise be integers but which have the 'f' suffix will be forced to be interpreted as floats. For all other purposes, the 'f' is ignored.

`void Tokenizer::set_require_space_after_number( bool require)`

Whether to require whitespace between a number and a field name.

Default is true. Do not use this; for Google-internal cleanup only.

`void Tokenizer::set_allow_multiline_strings( bool allow)`

Whether to allow string literals to span multiple lines.

Default is false. Do not use this; for Google-internal cleanup only.

`static double Tokenizer::ParseFloat( const std::string & text)`

Parses a TYPE_FLOAT token.

This never fails, so long as the text actually comes from a TYPE_FLOAT token parsed by Tokenizer. If it doesn't, the result is undefined (possibly an assert failure).

`static void Tokenizer::ParseString( const std::string & text, std::string * output)`

Parses a TYPE_STRING token.

This never fails, so long as the text actually comes from a TYPE_STRING token parsed by Tokenizer. If it doesn't, the result is undefined (possibly an assert failure).

`static bool Tokenizer::ParseInteger( const std::string & text, uint64 max_value, uint64 * output)`

Parses a TYPE_INTEGER token.

Returns false if the result would be greater than max_value. Otherwise, returns true and sets *output to the result. If the text is not from a Token of type TYPE_INTEGER originally parsed by a Tokenizer, the result is undefined (possibly an assert failure).
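The documented behavior (decimal by default, "0x" for hex, a leading zero for octal, and a false return when the value exceeds max_value) can be sketched without the library. `ParseIntegerSketch` and `DigitValue` are hypothetical names for this illustration, not the library's implementation. As an assumption of the sketch, malformed text returns false rather than being undefined, and an uppercase "0X" prefix is also accepted.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <string>

// Value of a hex digit character, or -1 if `c` is not one.
int DigitValue(char c) {
  if (c >= '0' && c <= '9') return c - '0';
  if (c >= 'a' && c <= 'f') return c - 'a' + 10;
  if (c >= 'A' && c <= 'F') return c - 'A' + 10;
  return -1;
}

// Hypothetical sketch of the documented ParseInteger contract.
bool ParseIntegerSketch(const std::string& text, std::uint64_t max_value,
                        std::uint64_t* output) {
  std::uint64_t base = 10;
  std::size_t i = 0;
  if (text.size() >= 2 && text[0] == '0' &&
      (text[1] == 'x' || text[1] == 'X')) {
    base = 16;
    i = 2;  // skip the "0x" prefix
  } else if (!text.empty() && text[0] == '0') {
    base = 8;
  }

  std::uint64_t result = 0;
  for (; i < text.size(); ++i) {
    const int value = DigitValue(text[i]);
    if (value < 0 || static_cast<std::uint64_t>(value) >= base) return false;
    const std::uint64_t digit = static_cast<std::uint64_t>(value);
    // Overflow-safe test for: result * base + digit > max_value.
    if (digit > max_value || result > (max_value - digit) / base) return false;
    result = result * base + digit;
  }
  *output = result;
  return true;
}
```

Because a leading minus sign is never part of a TYPE_INTEGER token, a caller parsing a signed 32-bit field would typically pass a max_value of 2147483647 for a positive literal and 2147483648 when the parser has already consumed a unary minus.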

struct Tokenizer::Token

`#include <google/protobuf/io/tokenizer.h>`
`namespace google::protobuf::io`

Structure representing a token read from the token stream.

Members
`TokenType`	`type`
`std::string`	`text` The exact text of the token as it appeared in the input.
`int`	`line` "line" and "column" specify the position of the first character of the token within the input stream.
`ColumnNumber`	`column`
`ColumnNumber`	`end_column`

`std::string Token::text`

The exact text of the token as it appeared in the input.

e.g. tokens of TYPE_STRING will still be escaped and in quotes.

`int Token::line`

"line" and "column" specify the position of the first character of the token within the input stream.

They are zero-based.
