Namespaces
Variants
Views
Actions

Difference between revisions of "cpp/regex/regex token iterator"

From cppreference.com
< cpp‎ | regex
m (Text replace - "{{mark c++0x feature}}" to "{{mark c++11 feature}}")
m ("indexes" → "indices".)
 
(32 intermediate revisions by 10 users not shown)
Line 1: Line 1:
{{cpp/title|regex_iterator}}
+
{{cpp/title|regex_token_iterator}}
{{cpp/regex/sidebar}}
+
{{cpp/regex/regex_token_iterator/navbar}}
{{ddcl | notes={{mark c++11 feature}} | 1=
+
{{ddcl|header=regex|since=c++11|1=
 
template<
 
template<
     class BidirectionalIterator,
+
     class BidirIt,
     class CharT = typename std::iterator_traits<BidirectionalIterator>::value_type,
+
     class CharT = typename std::iterator_traits<BidirIt>::value_type,
 
     class Traits = std::regex_traits<CharT>
 
     class Traits = std::regex_traits<CharT>
 
> class regex_token_iterator
 
> class regex_token_iterator
 
}}
 
}}
  
{{todo}}
+
{{tt|std::regex_token_iterator}} is a read-only {{named req|ForwardIterator}} that accesses the individual sub-matches of every match of a regular expression within the underlying character sequence. It can also be used to access the parts of the sequence that were not matched by the given regular expression (e.g. as a tokenizer).
  
 +
On construction, it constructs an {{lc|std::regex_iterator}} and on every increment it steps through the requested sub-matches from the current match_results, incrementing the underlying {{lc|std::regex_iterator}} when incrementing away from the last submatch.
 +
 +
The default-constructed {{tt|std::regex_token_iterator}} is the end-of-sequence iterator. When a valid {{tt|std::regex_token_iterator}} is incremented after reaching the last submatch of the last match, it becomes equal to the end-of-sequence iterator. Dereferencing or incrementing it further invokes undefined behavior.
 +
 +
Just before becoming the end-of-sequence iterator, a {{tt|std::regex_token_iterator}} may become a ''suffix iterator'', if the index {{c|-1}} (non-matched fragment) appears in the list of the requested submatch indices. Such iterator, if dereferenced, returns a match_results corresponding to the sequence of characters between the last match and the end of sequence.
 +
 +
A typical implementation of {{tt|std::regex_token_iterator}} holds the underlying {{lc|std::regex_iterator}}, a container (e.g. {{c/core|std::vector<int>}}) of the requested submatch indices, the internal counter equal to the index of the submatch, a pointer to {{lc|std::sub_match}}, pointing at the current submatch of the current match, and a {{lc|std::match_results}} object containing the last non-matched character sequence (used in tokenizer mode).
 +
 +
===Type requirements===
 +
{{par begin}}
 +
{{par req named|BidirIt|BidirectionalIterator}}
 +
{{par end}}
 +
 +
===Specializations===
 
Several specializations for common character sequence types are defined:
 
Several specializations for common character sequence types are defined:
 +
{{dsc begin}}
 +
{{dsc header|regex}}
 +
{{dsc hitem|Type|Definition}}
 +
{{dsc|{{ttb|std::cregex_token_iterator}}|{{c/core|std::regex_token_iterator<const char*>}}}}
 +
{{dsc|{{ttb|std::wcregex_token_iterator}}|{{c/core|std::regex_token_iterator<const wchar_t*>}}}}
 +
{{dsc|{{ttb|std::sregex_token_iterator}}|{{c/core|std::regex_token_iterator<std::string}}{{c/core|::const_iterator>}}}}
 +
{{dsc|{{ttb|std::wsregex_token_iterator}}|{{c/core|std::regex_token_iterator<std::wstring}}{{c/core|::const_iterator>}}}}
 +
{{dsc end}}
 +
 +
===Member types===
 +
{{dsc begin}}
 +
{{dsc hitem|Member type|Definition}}
 +
{{dsc|{{tt|value_type}}|{{c/core|std::sub_match<BidirIt>}}}}
 +
{{dsc|{{tt|difference_type}}|{{lc|std::ptrdiff_t}}}}
 +
{{dsc|{{tt|pointer}}|{{c/core|const value_type*}}}}
 +
{{dsc|{{tt|reference}}|{{c/core|const value_type&}}}}
 +
{{dsc|{{tt|iterator_category}}|{{lc|std::forward_iterator_tag}}}}
 +
{{dsc|{{tt|iterator_concept}} {{mark c++20}}|{{lc|std::input_iterator_tag}}}}
 +
{{dsc|{{tt|regex_type}}|{{c/core|std::basic_regex<CharT, Traits>}}}}
 +
{{dsc end}}
 +
 +
===Member functions===
 +
{{dsc begin}}
 +
{{dsc inc|cpp/regex/regex_token_iterator/dsc constructor}}
 +
{{dsc inc|cpp/regex/regex_token_iterator/dsc destructor}}
 +
{{dsc inc|cpp/regex/regex_token_iterator/dsc operator{{=}}}}
 +
{{dsc inc|cpp/regex/regex_token_iterator/dsc operator cmp}}
 +
{{dsc inc|cpp/regex/regex_token_iterator/dsc operator*}}
 +
{{dsc inc|cpp/regex/regex_token_iterator/dsc operator arith}}
 +
{{dsc end}}
 +
 +
===Notes===
 +
It is the programmer's responsibility to ensure that the {{lc|std::basic_regex}} object passed to the iterator's constructor outlives the iterator. Because the iterator stores a {{lc|std::regex_iterator}} which stores a pointer to the regex, incrementing the iterator after the regex was destroyed results in undefined behavior.
 +
 +
===Example===
 +
{{example
 +
|code=
 +
#include <algorithm>
 +
#include <fstream>
 +
#include <iostream>
 +
#include <iterator>
 +
#include <regex>
 +
 +
int main()
 +
{
 +
    // Tokenization (non-matched fragments)
 +
    // Note that regex is matched only two times; when the third value is obtained
 +
    // the iterator is a suffix iterator.
 +
    const std::string text = "Quick brown fox.";
 +
    const std::regex ws_re("\\s+"); // whitespace
 +
    std::copy(std::sregex_token_iterator(text.begin(), text.end(), ws_re, -1),
 +
              std::sregex_token_iterator(),
 +
              std::ostream_iterator<std::string>(std::cout, "\n"));
 +
   
 +
    std::cout << '\n';
 +
   
 +
    // Iterating the first submatches
 +
    const std::string html = R"(<p><a href="http://google.com">google</a> )"
 +
                            R"(< a HREF ="http://cppreference.com">cppreference</a>\n</p>)";
 +
    const std::regex url_re(R"!!(<\s*A\s+[^>]*href\s*=\s*"([^"]*)")!!", std::regex::icase);
 +
    std::copy(std::sregex_token_iterator(html.begin(), html.end(), url_re, 1),
 +
              std::sregex_token_iterator(),
 +
              std::ostream_iterator<std::string>(std::cout, "\n"));
 +
}
 +
|output=
 +
Quick
 +
brown
 +
fox.
 +
 +
http://google.com
 +
http://cppreference.com
 +
}}
  
{{tdcl list begin}}
+
===Defect reports===
{{tdcl list header | regex}}
+
{{dr list begin}}
{{tdcl list hitem | Type | Definition}}
+
{{dr list item|wg=lwg|dr=3698|paper=P2770R0|std=C++20|before={{tt|regex_token_iterator}} was a {{lconcept|forward_iterator}}<br>while being a stashing iterator|after=made {{lconcept|input_iterator}}<ref>{{tt|iterator_category}} was unchanged by the resolution, because changing it to {{lc|std::input_iterator_tag}} might break too much existing code.</ref>}}
{{tdcl list item | {{tt|cregex_token_iterator}} | {{cpp|regex_token_iterator<const char*>}}}}
+
{{dr list end}}
{{tdcl list item | {{tt|wcregex_token_iterator}} | {{cpp|regex_token_iterator<const wchar_t*>}}}}
+
<references/>
{{tdcl list item | {{tt|sregex_token_iterator}} | {{cpp|regex_token_iterator<std::string::const_iterator>}}}}
+
{{tdcl list item | {{tt|wsregex_token_iterator}} | {{cpp|regex_token_iterator<std::wstring::const_iterator>}}}}
+
{{tdcl list end}}
+
  
{{todo}}
+
{{langlinks|de|es|fr|it|ja|pt|ru|zh}}

Latest revision as of 22:58, 8 April 2024

Defined in header <regex>
template<

    class BidirIt,
    class CharT = typename std::iterator_traits<BidirIt>::value_type,
    class Traits = std::regex_traits<CharT>

> class regex_token_iterator
(since C++11)

std::regex_token_iterator is a read-only LegacyForwardIterator that accesses the individual sub-matches of every match of a regular expression within the underlying character sequence. It can also be used to access the parts of the sequence that were not matched by the given regular expression (e.g. as a tokenizer).

On construction, it constructs an std::regex_iterator and on every increment it steps through the requested sub-matches from the current match_results, incrementing the underlying std::regex_iterator when incrementing away from the last submatch.

The default-constructed std::regex_token_iterator is the end-of-sequence iterator. When a valid std::regex_token_iterator is incremented after reaching the last submatch of the last match, it becomes equal to the end-of-sequence iterator. Dereferencing or incrementing it further invokes undefined behavior.

Just before becoming the end-of-sequence iterator, a std::regex_token_iterator may become a suffix iterator, if the index -1 (non-matched fragment) appears in the list of the requested submatch indices. Such iterator, if dereferenced, returns a match_results corresponding to the sequence of characters between the last match and the end of sequence.

A typical implementation of std::regex_token_iterator holds the underlying std::regex_iterator, a container (e.g. std::vector<int>) of the requested submatch indices, the internal counter equal to the index of the submatch, a pointer to std::sub_match, pointing at the current submatch of the current match, and a std::match_results object containing the last non-matched character sequence (used in tokenizer mode).

Contents

[edit] Type requirements

-
BidirIt must meet the requirements of LegacyBidirectionalIterator.

[edit] Specializations

Several specializations for common character sequence types are defined:

Defined in header <regex>
Type Definition
std::cregex_token_iterator std::regex_token_iterator<const char*>
std::wcregex_token_iterator std::regex_token_iterator<const wchar_t*>
std::sregex_token_iterator std::regex_token_iterator<std::string::const_iterator>
std::wsregex_token_iterator std::regex_token_iterator<std::wstring::const_iterator>

[edit] Member types

Member type Definition
value_type std::sub_match<BidirIt>
difference_type std::ptrdiff_t
pointer const value_type*
reference const value_type&
iterator_category std::forward_iterator_tag
iterator_concept (C++20) std::input_iterator_tag
regex_type std::basic_regex<CharT, Traits>

[edit] Member functions

constructs a new regex_token_iterator
(public member function) [edit]
(destructor)
(implicitly declared)
destructs a regex_token_iterator, including the cached value
(public member function) [edit]
assigns contents
(public member function) [edit]
(removed in C++20)
compares two regex_token_iterators
(public member function) [edit]
accesses current submatch
(public member function) [edit]
advances the iterator to the next submatch
(public member function) [edit]

[edit] Notes

It is the programmer's responsibility to ensure that the std::basic_regex object passed to the iterator's constructor outlives the iterator. Because the iterator stores a std::regex_iterator which stores a pointer to the regex, incrementing the iterator after the regex was destroyed results in undefined behavior.

[edit] Example

#include <algorithm>
#include <fstream>
#include <iostream>
#include <iterator>
#include <regex>
 
int main()
{
    // Tokenization (non-matched fragments)
    // Note that regex is matched only two times; when the third value is obtained
    // the iterator is a suffix iterator.
    const std::string text = "Quick brown fox.";
    const std::regex ws_re("\\s+"); // whitespace
    std::copy(std::sregex_token_iterator(text.begin(), text.end(), ws_re, -1),
              std::sregex_token_iterator(),
              std::ostream_iterator<std::string>(std::cout, "\n"));
 
    std::cout << '\n';
 
    // Iterating the first submatches
    const std::string html = R"(<p><a href="http://google.com">google</a> )"
                             R"(< a HREF ="http://cppreference.com">cppreference</a>\n</p>)";
    const std::regex url_re(R"!!(<\s*A\s+[^>]*href\s*=\s*"([^"]*)")!!", std::regex::icase);
    std::copy(std::sregex_token_iterator(html.begin(), html.end(), url_re, 1),
              std::sregex_token_iterator(),
              std::ostream_iterator<std::string>(std::cout, "\n"));
}

Output:

Quick
brown
fox.
 
http://google.com
http://cppreference.com

[edit] Defect reports

The following behavior-changing defect reports were applied retroactively to previously published C++ standards.

DR Applied to Behavior as published Correct behavior
LWG 3698
(P2770R0)
C++20 regex_token_iterator was a forward_iterator
while being a stashing iterator
made input_iterator[1]
  1. iterator_category was unchanged by the resolution, because changing it to std::input_iterator_tag might break too much existing code.