Difference between revisions of "cpp/regex/regex token iterator"

Latest revision as of 22:58, 8 April 2024

Defined in header `<regex>`
template<
    class BidirIt,
    class CharT = typename std::iterator_traits<BidirIt>::value_type,
    class Traits = std::regex_traits<CharT>
> class regex_token_iterator;		(since C++11)

std::regex_token_iterator is a read-only LegacyForwardIterator that accesses the individual sub-matches of every match of a regular expression within the underlying character sequence. It can also be used to access the parts of the sequence that were not matched by the given regular expression (e.g. as a tokenizer).

On construction, it constructs a std::regex_iterator. On every increment it steps through the requested sub-matches of the current std::match_results, and it increments the underlying std::regex_iterator when incrementing away from the last requested submatch.

The default-constructed std::regex_token_iterator is the end-of-sequence iterator. When a valid std::regex_token_iterator is incremented after reaching the last submatch of the last match, it becomes equal to the end-of-sequence iterator. Dereferencing or incrementing it further invokes undefined behavior.

Just before becoming the end-of-sequence iterator, a std::regex_token_iterator may become a suffix iterator, if the index -1 (non-matched fragment) appears in the list of the requested submatch indices. Such an iterator, if dereferenced, returns a std::sub_match corresponding to the sequence of characters between the last match and the end of the sequence.

A typical implementation of std::regex_token_iterator holds the underlying std::regex_iterator, a container (e.g. std::vector<int>) of the requested submatch indices, an internal counter identifying which of the requested indices is current, a pointer to std::sub_match pointing at the current submatch of the current match, and a std::sub_match object containing the last non-matched character sequence (used in tokenizer mode).

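Type requirements

BidirIt must meet the requirements of LegacyBidirectionalIterator.

Specializations

Several specializations for common character sequence types are defined:

Defined in header `<regex>`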
Type	Definition
`std::cregex_token_iterator`	std::regex_token_iterator<const char*>
`std::wcregex_token_iterator`	std::regex_token_iterator<const wchar_t*>
`std::sregex_token_iterator`	std::regex_token_iterator<std::string::const_iterator>
`std::wsregex_token_iterator`	std::regex_token_iterator<std::wstring::const_iterator>

Member types

Member type	Definition
`value_type`	std::sub_match<BidirIt>
`difference_type`	std::ptrdiff_t
`pointer`	const value_type*
`reference`	const value_type&
`iterator_category`	std::forward_iterator_tag
`iterator_concept` (C++20)	std::input_iterator_tag
`regex_type`	std::basic_regex<CharT, Traits>

Member functions

(constructor)	constructs a new `regex_token_iterator` (public member function)
(destructor) (implicitly declared)	destructs a `regex_token_iterator`, including the cached value (public member function)
operator=	assigns contents (public member function)
operator==, operator!= (removed in C++20)	compares two `regex_token_iterator`s (public member function)
operator*, operator->	accesses current submatch (public member function)
operator++, operator++(int)	advances the iterator to the next submatch (public member function)

Notes

It is the programmer's responsibility to ensure that the std::basic_regex object passed to the iterator's constructor outlives the iterator. Because the iterator stores a std::regex_iterator which stores a pointer to the regex, incrementing the iterator after the regex was destroyed results in undefined behavior.

Example

#include <algorithm>
#include <iostream>
#include <iterator>
#include <regex>
#include <string>
 
int main()
{
    // Tokenization (non-matched fragments)
    // Note that regex is matched only two times; when the third value is obtained
    // the iterator is a suffix iterator.
    const std::string text = "Quick brown fox.";
    const std::regex ws_re("\\s+"); // whitespace
    std::copy(std::sregex_token_iterator(text.begin(), text.end(), ws_re, -1),
              std::sregex_token_iterator(),
              std::ostream_iterator<std::string>(std::cout, "\n"));
 
    std::cout << '\n';
 
    // Iterating the first submatches
    const std::string html = R"(<p><a href="http://google.com">google</a> )"
                             R"(< a HREF ="http://cppreference.com">cppreference</a>\n</p>)";
    const std::regex url_re(R"!!(<\s*A\s+[^>]*href\s*=\s*"([^"]*)")!!", std::regex::icase);
    std::copy(std::sregex_token_iterator(html.begin(), html.end(), url_re, 1),
              std::sregex_token_iterator(),
              std::ostream_iterator<std::string>(std::cout, "\n"));
}

Output:

Quick
brown
fox.
 
http://google.com
http://cppreference.com

Defect reports

The following behavior-changing defect reports were applied retroactively to previously published C++ standards.

DR	Applied to	Behavior as published	Correct behavior
LWG 3698 (P2770R0)	C++20	`regex_token_iterator` was a `forward_iterator` while being a stashing iterator	made `input_iterator` [1]

[1] `iterator_category` was unchanged by the resolution, because changing it to std::input_iterator_tag might break too much existing code.

@@ Line 1: / Line 1: @@
 {{cpp/title|regex_token_iterator}}
 {{cpp/regex/regex_token_iterator/navbar}}
-{{ddcl | header=regex | notes={{mark since c++11}} | 1=
+{{ddcl|header=regex|since=c++11|1=
 template<
      class BidirIt,
@@ Line 9: / Line 9: @@
 }}
-{{tt|std::regex_token_iterator}} is a read-only {{concept|ForwardIterator}} that accesses the individual sub-matches of every match of a regular expression within the underlying character sequence. It can also be used to access the parts of the sequence that were not matched by the given regular expression (e.g. as a tokenizer).
+{{tt|std::regex_token_iterator}} is a read-only {{named req|ForwardIterator}} that accesses the individual sub-matches of every match of a regular expression within the underlying character sequence. It can also be used to access the parts of the sequence that were not matched by the given regular expression (e.g. as a tokenizer).
-On construction, it constructs an {{lc|std::regex_iterator}} and on every increment it steps through the requested sub-matches from the current match_results, incrementing the underlying {{tt|regex_iterator}} when incrementing away from the last submatch.
+On construction, it constructs an {{lc|std::regex_iterator}} and on every increment it steps through the requested sub-matches from the current match_results, incrementing the underlying {{lc|std::regex_iterator}} when incrementing away from the last submatch.
 The default-constructed {{tt|std::regex_token_iterator}} is the end-of-sequence iterator. When a valid {{tt|std::regex_token_iterator}} is incremented after reaching the last submatch of the last match, it becomes equal to the end-of-sequence iterator. Dereferencing or incrementing it further invokes undefined behavior.
-Just before becoming the end-of-sequence iterator, a {{lc|std::regex_token_iterator}} may become a ''suffix iterator'', if the index {{c|-1}} (non-matched fragment) appears in the list of the requested submatch indexes. Such iterator, if dereferenced, returns a match_results corresponding to the sequence of characters between the last match and the end of sequence.
+Just before becoming the end-of-sequence iterator, a {{tt|std::regex_token_iterator}} may become a ''suffix iterator'', if the index {{c|-1}} (non-matched fragment) appears in the list of the requested submatch indices. Such iterator, if dereferenced, returns a match_results corresponding to the sequence of characters between the last match and the end of sequence.
-A typical implementation of {{tt|std::regex_token_iterator}} holds the underlying {{lc|std::regex_iterator}}, a container (e.g. {{c|std::vector<int>}}) of the requested submatch indexes, the internal counter equal to the index of the submatch, a pointer to {{lc|std::match_results}}, pointing at the current submatch of the current match, and a {{lc|std::match_results}} object containing the last non-matched character sequence (used in tokenizer mode).
+A typical implementation of {{tt|std::regex_token_iterator}} holds the underlying {{lc|std::regex_iterator}}, a container (e.g. {{c/core|std::vector<int>}}) of the requested submatch indices, the internal counter equal to the index of the submatch, a pointer to {{lc|std::sub_match}}, pointing at the current submatch of the current match, and a {{lc|std::match_results}} object containing the last non-matched character sequence (used in tokenizer mode).
 ===Type requirements===
 {{par begin}}
-{{par req concept | BidirIt | BidirectionalIterator}}
+{{par req named|BidirIt|BidirectionalIterator}}
 {{par end}}
 ===Specializations===
 Several specializations for common character sequence types are defined:
 {{dsc begin}}
-{{dsc header | regex}}
+{{dsc header|regex}}
-{{dsc hitem | Type | Definition}}
+{{dsc hitem|Type|Definition}}
-{{dsc | {{tt|cregex_token_iterator}} | {{c|regex_token_iterator<const char*>}}}}
+{{dsc|{{ttb|std::cregex_token_iterator}}|{{c/core|std::regex_token_iterator<const char*>}}}}
-{{dsc | {{tt|wcregex_token_iterator}} | {{c|regex_token_iterator<const wchar_t*>}}}}
+{{dsc|{{ttb|std::wcregex_token_iterator}}|{{c/core|std::regex_token_iterator<const wchar_t*>}}}}
-{{dsc | {{tt|sregex_token_iterator}} | {{c|regex_token_iterator<std::string::const_iterator>}}}}
+{{dsc|{{ttb|std::sregex_token_iterator}}|{{c/core|std::regex_token_iterator<std::string}}{{c/core|::const_iterator>}}}}
-{{dsc | {{tt|wsregex_token_iterator}} | {{c|regex_token_iterator<std::wstring::const_iterator>}}}}
+{{dsc|{{ttb|std::wsregex_token_iterator}}|{{c/core|std::regex_token_iterator<std::wstring}}{{c/core|::const_iterator>}}}}
 {{dsc end}}
 ===Member types===
 {{dsc begin}}
-{{dsc hitem | Member type | Definition}}
+{{dsc hitem|Member type|Definition}}
-{{dsc | {{tt|value_type}} | {{c|std::sub_match<BidirIt>}} }}
+{{dsc|{{tt|value_type}}|{{c/core|std::sub_match<BidirIt>}}}}
-{{dsc | {{tt|difference_type}} | {{lc|std::ptrdiff_t}} }}
+{{dsc|{{tt|difference_type}}|{{lc|std::ptrdiff_t}}}}
-{{dsc | {{tt|pointer}} | {{c|const value_type*}} }}
+{{dsc|{{tt|pointer}}|{{c/core|const value_type*}}}}
-{{dsc | {{tt|reference}} | {{c|const value_type&}} }}
+{{dsc|{{tt|reference}}|{{c/core|const value_type&}}}}
-{{dsc | {{tt|iterator_category}} | {{lc|std::forward_iterator_tag}} }}
+{{dsc|{{tt|iterator_category}}|{{lc|std::forward_iterator_tag}}}}
-{{dsc | {{tt|regex_type}} | {{c|basic_regex<CharT, Traits>}} }}
+{{dsc|{{tt|iterator_concept}} {{mark c++20}}|{{lc|std::input_iterator_tag}}}}
+{{dsc|{{tt|regex_type}}|{{c/core|std::basic_regex<CharT, Traits>}}}}
 {{dsc end}}
 ===Member functions===
 {{dsc begin}}
-{{dsc inc | cpp/regex/regex_token_iterator/dsc constructor}}
+{{dsc inc|cpp/regex/regex_token_iterator/dsc constructor}}
-{{dsc inc | cpp/regex/regex_token_iterator/dsc destructor}}
+{{dsc inc|cpp/regex/regex_token_iterator/dsc destructor}}
-{{dsc inc | cpp/regex/regex_token_iterator/dsc operator{{=}}}}
+{{dsc inc|cpp/regex/regex_token_iterator/dsc operator{{=}}}}
-{{dsc inc | cpp/regex/regex_token_iterator/dsc operator_cmp}}
+{{dsc inc|cpp/regex/regex_token_iterator/dsc operator cmp}}
-{{dsc inc | cpp/regex/regex_token_iterator/dsc operator*}}
+{{dsc inc|cpp/regex/regex_token_iterator/dsc operator*}}
-{{dsc inc | cpp/regex/regex_token_iterator/dsc operator_arith}}
+{{dsc inc|cpp/regex/regex_token_iterator/dsc operator arith}}
 {{dsc end}}
@@ Line 62: / Line 62: @@
 ===Example===
 {{example
- |
+|code=
- | code=
+#include <algorithm>
 #include <fstream>
 #include <iostream>
-#include <algorithm>
 #include <iterator>
 #include <regex>
@@ Line 73: / Line 71: @@
 int main()
 {
-   std::string text = "Quick brown fox.";
+    // Tokenization (non-matched fragments)
-   // tokenization (non-matched fragments)
+    // Note that regex is matched only two times; when the third value is obtained
-   // Note that regex is matched only two times: when the third value is obtained
+    // the iterator is a suffix iterator.
-   // the iterator is a suffix iterator.
+    const std::string text = "Quick brown fox.";
-   std::regex ws_re("\\s+"); // whitespace
+    const std::regex ws_re("\\s+"); // whitespace
-   std::copy( std::sregex_token_iterator(text.begin(), text.end(), ws_re, -1),
+    std::copy(std::sregex_token_iterator(text.begin(), text.end(), ws_re, -1),
                std::sregex_token_iterator(),
                std::ostream_iterator<std::string>(std::cout, "\n"));
-   // iterating the first submatches
+    std::cout << '\n';
-   std::string html = "<p><a href=\"http://google.com\">google</a> "
-                      "< a HREF =\"http://cppreference.com\">cppreference</a>\n</p>";
+    // Iterating the first submatches
-   std::regex url_re("<\\s*A\\s+[^>]*href\\s*=\\s*\"([^\"]*)\"", std::regex::icase);
+    const std::string html = R"(<p><a href="http://google.com">google</a> )"
-   std::copy( std::sregex_token_iterator(html.begin(), html.end(), url_re, 1),
+                             R"(< a HREF ="http://cppreference.com">cppreference</a>\n</p>)";
+    const std::regex url_re(R"!!(<\s*A\s+[^>]*href\s*=\s*"([^"]*)")!!", std::regex::icase);
+    std::copy(std::sregex_token_iterator(html.begin(), html.end(), url_re, 1),
                std::sregex_token_iterator(),
                std::ostream_iterator<std::string>(std::cout, "\n"));
@@ Line 94: / Line 94: @@
 brown
 fox.
 http://google.com
 http://cppreference.com
 }}
-[[de:cpp/regex/regex token iterator]]
+===Defect reports===
-[[es:cpp/regex/regex token iterator]]
+{{dr list begin}}
-[[fr:cpp/regex/regex token iterator]]
+{{dr list item|wg=lwg|dr=3698|paper=P2770R0|std=C++20|before={{tt|regex_token_iterator}} was a {{lconcept|forward_iterator}}<br>while being a stashing iterator|after=made {{lconcept|input_iterator}}<ref>{{tt|iterator_category}} was unchanged by the resolution, because changing it to {{lc|std::input_iterator_tag}} might break too much existing code.</ref>}}
-[[it:cpp/regex/regex token iterator]]
+{{dr list end}}
-[[ja:cpp/regex/regex token iterator]]
+<references/>
-[[pt:cpp/regex/regex token iterator]]
-[[ru:cpp/regex/regex token iterator]]
+{{langlinks|de|es|fr|it|ja|pt|ru|zh}}
-[[zh:cpp/regex/regex token iterator]]
