Difference between revisions of "cpp/regex/regex token iterator"
(+type reqs) |
m ("indexes" → "indices".) |
||
(21 intermediate revisions by 10 users not shown) | |||
Line 1: | Line 1: | ||
{{cpp/title|regex_token_iterator}} | {{cpp/title|regex_token_iterator}} | ||
− | {{cpp/regex/navbar}} | + | {{cpp/regex/regex_token_iterator/navbar}} |
− | {{ddcl | | + | {{ddcl|header=regex|since=c++11|1= |
template< | template< | ||
class BidirIt, | class BidirIt, | ||
Line 9: | Line 9: | ||
}} | }} | ||
− | {{tt|std::regex_token_iterator}} is a read-only {{ | + | {{tt|std::regex_token_iterator}} is a read-only {{named req|ForwardIterator}} that accesses the individual sub-matches of every match of a regular expression within the underlying character sequence. It can also be used to access the parts of the sequence that were not matched by the given regular expression (e.g. as a tokenizer). |
− | On construction, it constructs an {{ | + | On construction, it constructs an {{lc|std::regex_iterator}} and on every increment it steps through the requested sub-matches from the current match_results, incrementing the underlying {{lc|std::regex_iterator}} when incrementing away from the last submatch. |
The default-constructed {{tt|std::regex_token_iterator}} is the end-of-sequence iterator. When a valid {{tt|std::regex_token_iterator}} is incremented after reaching the last submatch of the last match, it becomes equal to the end-of-sequence iterator. Dereferencing or incrementing it further invokes undefined behavior. | The default-constructed {{tt|std::regex_token_iterator}} is the end-of-sequence iterator. When a valid {{tt|std::regex_token_iterator}} is incremented after reaching the last submatch of the last match, it becomes equal to the end-of-sequence iterator. Dereferencing or incrementing it further invokes undefined behavior. | ||
− | Just before becoming the end-of-sequence iterator, a {{ | + | Just before becoming the end-of-sequence iterator, a {{tt|std::regex_token_iterator}} may become a ''suffix iterator'', if the index {{c|-1}} (non-matched fragment) appears in the list of the requested submatch indices. Such iterator, if dereferenced, returns a match_results corresponding to the sequence of characters between the last match and the end of sequence. |
− | A typical implementation of {{tt|std::regex_token_iterator}} holds the underlying {{ | + | A typical implementation of {{tt|std::regex_token_iterator}} holds the underlying {{lc|std::regex_iterator}}, a container (e.g. {{c/core|std::vector<int>}}) of the requested submatch indices, the internal counter equal to the index of the submatch, a pointer to {{lc|std::sub_match}}, pointing at the current submatch of the current match, and a {{lc|std::match_results}} object containing the last non-matched character sequence (used in tokenizer mode). |
===Type requirements=== | ===Type requirements=== | ||
− | {{ | + | {{par begin}} |
− | {{ | + | {{par req named|BidirIt|BidirectionalIterator}} |
− | {{ | + | {{par end}} |
===Specializations=== | ===Specializations=== | ||
Several specializations for common character sequence types are defined: | Several specializations for common character sequence types are defined: | ||
− | + | {{dsc begin}} | |
− | {{ | + | {{dsc header|regex}} |
− | {{ | + | {{dsc hitem|Type|Definition}} |
− | {{ | + | {{dsc|{{ttb|std::cregex_token_iterator}}|{{c/core|std::regex_token_iterator<const char*>}}}} |
− | {{ | + | {{dsc|{{ttb|std::wcregex_token_iterator}}|{{c/core|std::regex_token_iterator<const wchar_t*>}}}} |
− | {{ | + | {{dsc|{{ttb|std::sregex_token_iterator}}|{{c/core|std::regex_token_iterator<std::string}}{{c/core|::const_iterator>}}}} |
− | {{ | + | {{dsc|{{ttb|std::wsregex_token_iterator}}|{{c/core|std::regex_token_iterator<std::wstring}}{{c/core|::const_iterator>}}}} |
− | {{ | + | {{dsc end}} |
− | {{ | + | |
===Member types=== | ===Member types=== | ||
− | {{ | + | {{dsc begin}} |
− | {{ | + | {{dsc hitem|Member type|Definition}} |
− | {{ | + | {{dsc|{{tt|value_type}}|{{c/core|std::sub_match<BidirIt>}}}} |
− | {{ | + | {{dsc|{{tt|difference_type}}|{{lc|std::ptrdiff_t}}}} |
− | {{ | + | {{dsc|{{tt|pointer}}|{{c/core|const value_type*}}}} |
− | {{ | + | {{dsc|{{tt|reference}}|{{c/core|const value_type&}}}} |
− | {{ | + | {{dsc|{{tt|iterator_category}}|{{lc|std::forward_iterator_tag}}}} |
− | {{ | + | {{dsc|{{tt|iterator_concept}} {{mark c++20}}|{{lc|std::input_iterator_tag}}}} |
− | {{ | + | {{dsc|{{tt|regex_type}}|{{c/core|std::basic_regex<CharT, Traits>}}}} |
+ | {{dsc end}} | ||
===Member functions=== | ===Member functions=== | ||
− | {{ | + | {{dsc begin}} |
− | {{ | + | {{dsc inc|cpp/regex/regex_token_iterator/dsc constructor}} |
− | {{ | + | {{dsc inc|cpp/regex/regex_token_iterator/dsc destructor}} |
− | {{ | + | {{dsc inc|cpp/regex/regex_token_iterator/dsc operator{{=}}}} |
− | {{ | + | {{dsc inc|cpp/regex/regex_token_iterator/dsc operator cmp}} |
− | {{ | + | {{dsc inc|cpp/regex/regex_token_iterator/dsc operator*}} |
− | {{ | + | {{dsc inc|cpp/regex/regex_token_iterator/dsc operator arith}} |
− | {{ | + | {{dsc end}} |
===Notes=== | ===Notes=== | ||
− | It is the programmer's responsibility to ensure that the {{ | + | It is the programmer's responsibility to ensure that the {{lc|std::basic_regex}} object passed to the iterator's constructor outlives the iterator. Because the iterator stores a {{lc|std::regex_iterator}} which stores a pointer to the regex, incrementing the iterator after the regex was destroyed results in undefined behavior. |
===Example=== | ===Example=== | ||
{{example | {{example | ||
− | + | |code= | |
− | + | #include <algorithm> | |
− | + | ||
#include <fstream> | #include <fstream> | ||
#include <iostream> | #include <iostream> | ||
− | |||
#include <iterator> | #include <iterator> | ||
#include <regex> | #include <regex> | ||
+ | |||
int main() | int main() | ||
{ | { | ||
− | + | // Tokenization (non-matched fragments) | |
− | + | // Note that regex is matched only two times; when the third value is obtained | |
− | + | // the iterator is a suffix iterator. | |
− | + | const std::string text = "Quick brown fox."; | |
− | + | const std::regex ws_re("\\s+"); // whitespace | |
− | + | std::copy(std::sregex_token_iterator(text.begin(), text.end(), ws_re, -1), | |
std::sregex_token_iterator(), | std::sregex_token_iterator(), | ||
std::ostream_iterator<std::string>(std::cout, "\n")); | std::ostream_iterator<std::string>(std::cout, "\n")); | ||
− | + | ||
− | + | std::cout << '\n'; | |
− | + | ||
− | + | // Iterating the first submatches | |
− | + | const std::string html = R"(<p><a href="http://google.com">google</a> )" | |
− | + | R"(< a HREF ="http://cppreference.com">cppreference</a>\n</p>)"; | |
+ | const std::regex url_re(R"!!(<\s*A\s+[^>]*href\s*=\s*"([^"]*)")!!", std::regex::icase); | ||
+ | std::copy(std::sregex_token_iterator(html.begin(), html.end(), url_re, 1), | ||
std::sregex_token_iterator(), | std::sregex_token_iterator(), | ||
std::ostream_iterator<std::string>(std::cout, "\n")); | std::ostream_iterator<std::string>(std::cout, "\n")); | ||
Line 93: | Line 94: | ||
brown | brown | ||
fox. | fox. | ||
+ | |||
http://google.com | http://google.com | ||
http://cppreference.com | http://cppreference.com | ||
}} | }} | ||
+ | |||
+ | ===Defect reports=== | ||
+ | {{dr list begin}} | ||
+ | {{dr list item|wg=lwg|dr=3698|paper=P2770R0|std=C++20|before={{tt|regex_token_iterator}} was a {{lconcept|forward_iterator}}<br>while being a stashing iterator|after=made {{lconcept|input_iterator}}<ref>{{tt|iterator_category}} was unchanged by the resolution, because changing it to {{lc|std::input_iterator_tag}} might break too much existing code.</ref>}} | ||
+ | {{dr list end}} | ||
+ | <references/> | ||
+ | |||
+ | {{langlinks|de|es|fr|it|ja|pt|ru|zh}} |
Latest revision as of 22:58, 8 April 2024
Defined in header <regex>
template< class BidirIt, (since C++11)
std::regex_token_iterator is a read-only LegacyForwardIterator that accesses the individual sub-matches of every match of a regular expression within the underlying character sequence. It can also be used to access the parts of the sequence that were not matched by the given regular expression (e.g. as a tokenizer).
On construction, it constructs a std::regex_iterator, and on every increment it steps through the requested sub-matches from the current match_results, incrementing the underlying std::regex_iterator when incrementing away from the last submatch.
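The stepping order described above can be sketched as follows. This is a minimal illustration, not part of the original page: the helper name `key_value_tokens`, the pattern, and the `key=value` input format are assumptions chosen for the example. It requests sub-matches 1 and 2, so the iterator yields both groups of one match before the underlying std::regex_iterator advances to the next match.

```cpp
#include <regex>
#include <string>
#include <vector>

// Hypothetical helper (not from the page): returns the key and the value of
// every "key=value" pair in `text`, in the order the iterator produces them.
std::vector<std::string> key_value_tokens(const std::string& text)
{
    // static, so the regex outlives every iterator created from it
    static const std::regex pair_re(R"((\w+)=(\d+))");

    // Request sub-matches 1 and 2 of every match.
    std::sregex_token_iterator first(text.begin(), text.end(), pair_re, {1, 2});
    const std::sregex_token_iterator last; // end-of-sequence iterator

    std::vector<std::string> out;
    for (; first != last; ++first)
        out.push_back(first->str()); // *first is a std::sub_match
    return out;
}
```

For the input `"a=1, bb=22"` the helper yields `a`, `1`, `bb`, `22`: both requested sub-matches of the first match come before any sub-match of the second.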
The default-constructed std::regex_token_iterator is the end-of-sequence iterator. When a valid std::regex_token_iterator is incremented after reaching the last submatch of the last match, it becomes equal to the end-of-sequence iterator. Dereferencing or incrementing it further invokes undefined behavior.
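As a small sketch of the end-of-sequence convention, the hypothetical helper below (`count_numbers` is an illustrative name, not part of the library) measures the distance between a freshly constructed iterator and a default-constructed one. When the regex does not match at all and -1 is not requested, the constructed iterator already compares equal to the end-of-sequence iterator.

```cpp
#include <cstddef>
#include <iterator>
#include <regex>
#include <string>

// Hypothetical helper (not from the page): counts the runs of digits in `text`.
std::size_t count_numbers(const std::string& text)
{
    static const std::regex digits_re(R"(\d+)");

    // The submatch index defaults to 0, i.e. the whole match.
    const std::sregex_token_iterator first(text.begin(), text.end(), digits_re);
    const std::sregex_token_iterator last; // default-constructed: end-of-sequence

    // std::distance works on copies, so `first` itself is never advanced.
    return static_cast<std::size_t>(std::distance(first, last));
}
```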
Just before becoming the end-of-sequence iterator, a std::regex_token_iterator may become a suffix iterator if the index -1 (non-matched fragment) appears in the list of the requested submatch indices. Such an iterator, if dereferenced, returns a std::sub_match corresponding to the sequence of characters between the last match and the end of the sequence.
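To make the suffix iterator visible, the following sketch requests the indices -1 and 0 together, so the non-matched fragments are interleaved with the matches themselves. The helper name `split_keep_operators` and the operator pattern are assumptions made for this illustration only.

```cpp
#include <regex>
#include <string>
#include <vector>

// Hypothetical helper (not from the page): splits `expr` at '+' and '-' and
// keeps the operators as tokens of their own.
std::vector<std::string> split_keep_operators(const std::string& expr)
{
    static const std::regex op_re("[+-]");

    // -1 selects the text before each match, 0 selects the match itself.
    std::sregex_token_iterator first(expr.begin(), expr.end(), op_re, {-1, 0});
    const std::sregex_token_iterator last;

    std::vector<std::string> out;
    for (; first != last; ++first)
        out.push_back(first->str());
    return out;
}
```

For `"a+b-c"` the regex matches only twice, yet five tokens are produced: `a`, `+`, `b`, `-`, and finally `c`, which comes from the suffix iterator because nothing after the last operator was matched.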
A typical implementation of std::regex_token_iterator holds the underlying std::regex_iterator, a container (e.g. std::vector<int>) of the requested submatch indices, the internal counter equal to the index of the submatch, a pointer to std::sub_match, pointing at the current submatch of the current match, and a std::match_results object containing the last non-matched character sequence (used in tokenizer mode).
Type requirements
- BidirIt must meet the requirements of LegacyBidirectionalIterator.
Specializations
Several specializations for common character sequence types are defined:
Defined in header <regex>
Type | Definition |
---|---|
std::cregex_token_iterator | std::regex_token_iterator<const char*> |
std::wcregex_token_iterator | std::regex_token_iterator<const wchar_t*> |
std::sregex_token_iterator | std::regex_token_iterator<std::string::const_iterator> |
std::wsregex_token_iterator | std::regex_token_iterator<std::wstring::const_iterator> |
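As an illustration of picking a specialization by the kind of character sequence, the sketch below tokenizes a null-terminated C string with std::cregex_token_iterator, so no std::string copy of the input is required. The helper name `split_c_string` and the comma delimiter are assumptions for the example.

```cpp
#include <cstring>
#include <regex>
#include <string>
#include <type_traits>
#include <vector>

static_assert(std::is_same<std::cregex_token_iterator,
                           std::regex_token_iterator<const char*>>::value,
              "cregex_token_iterator iterates over const char*");

// Hypothetical helper (not from the page): splits a null-terminated string
// at commas.
std::vector<std::string> split_c_string(const char* s)
{
    static const std::regex comma_re(",");

    // The iterator range is a pair of plain pointers into the C string.
    std::cregex_token_iterator first(s, s + std::strlen(s), comma_re, -1);
    const std::cregex_token_iterator last;

    std::vector<std::string> out;
    for (; first != last; ++first)
        out.push_back(first->str()); // std::csub_match::str() returns std::string
    return out;
}
```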
Member types
Member type | Definition |
---|---|
value_type | std::sub_match<BidirIt> |
difference_type | std::ptrdiff_t |
pointer | const value_type* |
reference | const value_type& |
iterator_category | std::forward_iterator_tag |
iterator_concept (C++20) | std::input_iterator_tag |
regex_type | std::basic_regex<CharT, Traits> |
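The member types in the table can be checked at compile time. The sketch below does so for std::sregex_token_iterator; the alias `TokenIt` and the constant `traits_agree` are names introduced for this illustration. `iterator_concept` is left out on purpose, since it only exists in C++20 library implementations that have applied the corresponding defect report.

```cpp
#include <cstddef>
#include <iterator>
#include <regex>
#include <string>
#include <type_traits>

// Illustrative alias: regex_token_iterator<std::string::const_iterator>
using TokenIt = std::sregex_token_iterator;

static_assert(std::is_same<TokenIt::value_type,
                           std::sub_match<std::string::const_iterator>>::value,
              "value_type is std::sub_match<BidirIt>");
static_assert(std::is_same<TokenIt::difference_type, std::ptrdiff_t>::value,
              "difference_type is std::ptrdiff_t");
static_assert(std::is_same<TokenIt::pointer, const TokenIt::value_type*>::value,
              "pointer is const value_type*");
static_assert(std::is_same<TokenIt::reference, const TokenIt::value_type&>::value,
              "reference is const value_type&");
static_assert(std::is_same<TokenIt::iterator_category,
                           std::forward_iterator_tag>::value,
              "iterator_category is std::forward_iterator_tag");
static_assert(std::is_same<TokenIt::regex_type, std::regex>::value,
              "regex_type is std::basic_regex<char>");

// Generic code usually reads the same types through std::iterator_traits.
constexpr bool traits_agree =
    std::is_same<std::iterator_traits<TokenIt>::value_type,
                 TokenIt::value_type>::value;
```

Because `reference` is a reference to const, the tokens are read-only: a std::sub_match obtained through the iterator cannot be modified.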
Member functions
Member function | Description |
---|---|
(constructor) | constructs a new regex_token_iterator (public member function) |
(destructor) (implicitly declared) | destructs a regex_token_iterator, including the cached value (public member function) |
operator= | assigns contents (public member function) |
operator==, operator!= (removed in C++20) | compares two regex_token_iterators (public member function) |
operator*, operator-> | accesses current submatch (public member function) |
operator++, operator++(int) | advances the iterator to the next submatch (public member function) |
Notes
It is the programmer's responsibility to ensure that the std::basic_regex object passed to the iterator's constructor outlives the iterator. Because the iterator stores a std::regex_iterator which stores a pointer to the regex, incrementing the iterator after the regex was destroyed results in undefined behavior.
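One way to respect this lifetime rule is to consume the iterator inside the scope that owns the regex and to return the collected tokens rather than the iterator. The sketch below assumes a hypothetical helper named `extract_words`; it is one possible pattern, not the only safe one.

```cpp
#include <regex>
#include <string>
#include <vector>

// Hypothetical helper (not from the page): returns every run of word
// characters in `text`.
std::vector<std::string> extract_words(const std::string& text)
{
    // A named regex object: it stays alive until the function returns, which
    // is longer than any iterator created below.
    const std::regex word_re(R"(\w+)");

    std::vector<std::string> words;
    for (std::sregex_token_iterator it(text.begin(), text.end(), word_re), end;
         it != end; ++it)
    {
        words.push_back(it->str()); // copy the characters out of the submatch
    }

    // Returning `words` is safe. Returning `it` instead would hand the caller
    // an iterator that refers to the destroyed `word_re`.
    return words;
}
```

The returned strings are independent copies, so they remain valid after both the regex and the iterator are gone. As far as I am aware, the standard library also declares the constructor overloads that take a temporary regex as deleted, so passing `std::regex("...")` directly should fail to compile rather than dangle; that is worth verifying against the implementation in use.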
Example
```cpp
#include <algorithm>
#include <fstream>
#include <iostream>
#include <iterator>
#include <regex>
#include <string>

int main()
{
    // Tokenization (non-matched fragments)
    // Note that regex is matched only two times; when the third value is obtained
    // the iterator is a suffix iterator.
    const std::string text = "Quick brown fox.";
    const std::regex ws_re("\\s+"); // whitespace
    std::copy(std::sregex_token_iterator(text.begin(), text.end(), ws_re, -1),
              std::sregex_token_iterator(),
              std::ostream_iterator<std::string>(std::cout, "\n"));

    std::cout << '\n';

    // Iterating the first submatches
    const std::string html = R"(<p><a href="http://google.com">google</a> )"
                             R"(< a HREF ="http://cppreference.com">cppreference</a>\n</p>)";
    const std::regex url_re(R"!!(<\s*A\s+[^>]*href\s*=\s*"([^"]*)")!!", std::regex::icase);
    std::copy(std::sregex_token_iterator(html.begin(), html.end(), url_re, 1),
              std::sregex_token_iterator(),
              std::ostream_iterator<std::string>(std::cout, "\n"));
}
```
Output:
```
Quick
brown
fox.

http://google.com
http://cppreference.com
```
Defect reports
The following behavior-changing defect reports were applied retroactively to previously published C++ standards.
DR | Applied to | Behavior as published | Correct behavior |
---|---|---|---|
LWG 3698 (P2770R0) | C++20 | regex_token_iterator was a forward_iterator while being a stashing iterator | made input_iterator [1] |

[1] iterator_category was unchanged by the resolution, because changing it to std::input_iterator_tag might break too much existing code.