This library provides two kinds of operations on bidirectional ranges: conversion (e.g. converting a range in UTF-8 to a range in UTF-32) and segmentation (i.e. demarcating sections of a range, like code points, grapheme clusters, words, etc.).
Conversions can be applied in a variety of ways, all generated from the Pipe concept, which performs one step of the conversion:

- Eager algorithms, which apply the Pipe repeatedly until the whole input range has been treated.
- Lazy range adapters, which apply the Pipe on the fly as the range is iterated, built on boost::pipe_iterator.
- Output iterator adapters, which apply the Pipe to each element before forwarding the results to the wrapped output iterator, built on boost::pipe_output_iterator.
The naming scheme of the utilities within the library reflects this; here, for example, is what is provided to convert UTF-32 to UTF-8:

- boost::unicode::u8_encoder is a model of the OneManyPipe concept.
- boost::unicode::u8_encode is an eager encoding algorithm.
- boost::unicode::u8_encoded returns a range adapter that does on-the-fly encoding.
- boost::unicode::u8_encoded_out returns an output iterator adapter that encodes its elements before forwarding them to the wrapped output iterator.
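To make the "one code point in, several code units out" shape of a OneManyPipe concrete, here is a standalone sketch of the UTF-32 to UTF-8 encoding step itself. This is plain C++ implementing the well-known UTF-8 bit layout, not the library's actual u8_encoder interface.

```cpp
#include <cstdint>
#include <vector>

// Encode one code point into 1-4 UTF-8 code units: the single step a
// OneManyPipe such as u8_encoder performs. No validation of surrogates
// or out-of-range values is done in this sketch.
std::vector<std::uint8_t> encode_u8(char32_t cp) {
    std::vector<std::uint8_t> out;
    if (cp < 0x80) {
        out.push_back(static_cast<std::uint8_t>(cp));
    } else if (cp < 0x800) {
        out.push_back(static_cast<std::uint8_t>(0xC0 | (cp >> 6)));
        out.push_back(static_cast<std::uint8_t>(0x80 | (cp & 0x3F)));
    } else if (cp < 0x10000) {
        out.push_back(static_cast<std::uint8_t>(0xE0 | (cp >> 12)));
        out.push_back(static_cast<std::uint8_t>(0x80 | ((cp >> 6) & 0x3F)));
        out.push_back(static_cast<std::uint8_t>(0x80 | (cp & 0x3F)));
    } else {
        out.push_back(static_cast<std::uint8_t>(0xF0 | (cp >> 18)));
        out.push_back(static_cast<std::uint8_t>(0x80 | ((cp >> 12) & 0x3F)));
        out.push_back(static_cast<std::uint8_t>(0x80 | ((cp >> 6) & 0x3F)));
        out.push_back(static_cast<std::uint8_t>(0x80 | (cp & 0x3F)));
    }
    return out;
}
```

An eager algorithm like u8_encode amounts to running such a step over every element of the input range; the lazy and output-iterator variants run the same step on demand instead.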
> **Note:** The library considers a conversion from UTF-32 an "encoding", while a conversion to UTF-32 is called a "decoding". This is because code points are what the library mainly deals with, and UTF-32 is a sequence of code points.
Segmentations are expressed in terms of the Consumer concept, which is inherently very similar to the Pipe concept except that it doesn't perform any kind of transformation; it just reads part of the input. As a matter of fact, a Pipe can be converted to a Consumer using boost::pipe_consumer.

Segmentation may be done either by using the appropriate Consumer directly, or by using the boost::consumer_iterator template to adapt the range into a read-only range of subranges.

Additionally, the BoundaryChecker concept may prove useful to tell whether a segment starts at a given position; a Consumer may also be defined in terms of one using boost::boundary_consumer.
The naming scheme is as follows:

- boost::unicode::u8_boundary is a BoundaryChecker that tells whether a position is the start of a code point in a range of UTF-8 code units.
- boost::unicode::grapheme_boundary is a BoundaryChecker that tells whether a position is the start of a grapheme cluster in a range of code points.
- boost::unicode::u8_bounded adapts its UTF-8 input range into a range of ranges of code units, each subrange being a code point.
- boost::unicode::grapheme_bounded adapts its UTF-32 input range into a range of ranges of code points, each subrange being a grapheme cluster.
- boost::unicode::u8_grapheme_bounded adapts its UTF-8 input range into a range of ranges of code units, each subrange being a grapheme cluster.
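The code point case makes the BoundaryChecker idea easy to see: in UTF-8, a position starts a code point exactly when the code unit there is not a continuation byte (10xxxxxx). The sketch below is standalone C++ illustrating that rule and the "range of ranges" shape u8_bounded is described as producing; it is not the library's own u8_boundary.

```cpp
#include <cstdint>
#include <string>
#include <vector>

// The UTF-8 code point boundary test: continuation bytes are 0b10xxxxxx.
bool starts_code_point(std::uint8_t unit) {
    return (unit & 0xC0) != 0x80;
}

// Segment a UTF-8 string into one subrange per code point, the shape
// u8_bounded is described to yield (here materialized as strings).
std::vector<std::string> code_points(const std::string& u8) {
    std::vector<std::string> out;
    for (char c : u8) {
        if (out.empty() || starts_code_point(static_cast<std::uint8_t>(c)))
            out.emplace_back();  // a non-continuation byte opens a new segment
        out.back() += c;
    }
    return out;
}
```

For example, the three-code-point string `"a\xC3\xA9z"` splits into `"a"`, `"\xC3\xA9"`, and `"z"`. Grapheme cluster boundaries need UCD property data and are not expressible by such a one-byte test.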
Every time there are two versions of a function or class, one for UTF-8 and the other for UTF-16, and deducing which type of UTF encoding to use is possible, additional ones are provided that automatically forward to the right version.

The naming scheme is as follows:

- boost::unicode::utf_decode behaves like either boost::unicode::u8_decode or boost::unicode::u16_decode, depending on the value_type of its input range.
- boost::unicode::utf_boundary behaves like either boost::unicode::u8_boundary or boost::unicode::u16_boundary, depending on the value_type of the input ranges passed to ltr and rtl.
> **Tip:** UTF type deduction recognizes not only UTF-8 and UTF-16, but UTF-32 as well.
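The deduction described above can be sketched as a compile-time dispatch on the size of the range's value_type. The names below (utf_kind, deduce_utf) are hypothetical illustrations, not part of the library:

```cpp
#include <string>

enum class utf_kind { utf8, utf16, utf32 };

// Hypothetical sketch: pick the UTF encoding from the code unit width of a
// range, the way the utf_* forwarding utilities are described to.
template <typename Range>
constexpr utf_kind deduce_utf() {
    switch (sizeof(typename Range::value_type)) {
        case 1:  return utf_kind::utf8;
        case 2:  return utf_kind::utf16;
        default: return utf_kind::utf32;  // UTF-32 is deduced as well (see Tip)
    }
}
```

Under this scheme a `std::string` selects the UTF-8 overload, a `std::u16string` the UTF-16 one, and a `std::u32string` the code point (UTF-32) one.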
Normalized forms are defined in terms of certain decompositions applied recursively, followed by certain compositions also applied recursively, and finally canonical ordering of combining character sequences. A decomposition is the conversion of a single code point into several, and a composition is the opposite conversion, with exceptions.
The Unicode Character Database associates certain decompositions with code points, which can be obtained with boost::unicode::ucd::get_decomposition, but it does not include Hangul syllable decompositions, since those can easily be generated procedurally, allowing space to be saved. The library provides boost::unicode::hangul_decomposer, a OneManyPipe that decomposes Hangul syllables.
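The procedural generation in question is the arithmetic Hangul decomposition defined in the Unicode standard ("Conjoining Jamo Behavior"): a precomposed syllable maps to a leading consonant, a vowel, and optionally a trailing consonant by pure index arithmetic. Here is a standalone sketch of that arithmetic; hangul_decomposer is described as wrapping the same computation as a OneManyPipe.

```cpp
#include <vector>

// Arithmetic Hangul decomposition per the Unicode standard. Constants are
// the standard's SBase/LBase/VBase/TBase and jamo counts.
std::vector<char32_t> decompose_hangul(char32_t s) {
    const char32_t SBase = 0xAC00, LBase = 0x1100, VBase = 0x1161, TBase = 0x11A7;
    const int VCount = 21, TCount = 28, NCount = VCount * TCount, SCount = 11172;
    if (s < SBase || s >= SBase + SCount)
        return {s};  // not a precomposed Hangul syllable: pass through
    int index = static_cast<int>(s - SBase);
    std::vector<char32_t> out = {
        LBase + static_cast<char32_t>(index / NCount),            // leading jamo
        VBase + static_cast<char32_t>((index % NCount) / TCount)  // vowel jamo
    };
    if (index % TCount != 0)
        out.push_back(TBase + static_cast<char32_t>(index % TCount)); // trailing jamo
    return out;
}
```

For instance U+AC00 (가) decomposes to U+1100 U+1161, and U+AC01 (각) to U+1100 U+1161 U+11A8, with no table lookup at all.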
There are several types of decompositions, which are exposed by boost::unicode::ucd::get_decomposition_type. Most importantly, the canonical decomposition is obtained by applying both the Hangul decompositions and the canonical decompositions from the UCD, while the compatibility decomposition is obtained by applying the Hangul decompositions and all decompositions from the UCD.

boost::unicode::decomposer, a model of Pipe, can perform any decomposition that matches a certain mask, recursively, including the Hangul ones (which are treated as canonical decompositions), and canonically orders combining sequences as well.
Likewise, Hangul syllable compositions are not provided by the UCD and are implemented by boost::unicode::hangul_composer instead.

Some distinct code points may have the same decomposition, so certain decomposed forms are preferred; that is why an exclusion table is also provided by the UCD.

The library uses a pre-generated prefix tree (or, in the current implementation, a lexicographically sorted array) of all canonical compositions, keyed by their fully decomposed and canonically ordered form, to identify composable sequences and apply the compositions. boost::unicode::composer is a Pipe that uses that tree as well as the Hangul compositions.
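As with decomposition, the Hangul side of composition is pure arithmetic, which is why hangul_composer can exist without UCD data. A standalone sketch of the inverse computation, assuming the inputs are already valid conjoining jamo (a real composer must check the ranges first):

```cpp
// Arithmetic Hangul composition per the Unicode standard: combine a leading
// jamo, a vowel jamo, and an optional trailing jamo (0 = none) into one
// precomposed syllable. No input validation in this sketch.
char32_t compose_hangul(char32_t l, char32_t v, char32_t t = 0) {
    const char32_t SBase = 0xAC00, LBase = 0x1100, VBase = 0x1161, TBase = 0x11A7;
    const int VCount = 21, TCount = 28;
    int li = static_cast<int>(l - LBase);
    int vi = static_cast<int>(v - VBase);
    int ti = t ? static_cast<int>(t - TBase) : 0;
    return SBase + static_cast<char32_t>((li * VCount + vi) * TCount + ti);
}
```

This is exactly the inverse of the decomposition arithmetic: U+1100 + U+1161 compose back to U+AC00.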
Normalization can be performed by applying decomposition followed by composition, which is what the current version of boost::unicode::normalizer does. The Unicode standard also provides quick-check properties to avoid that operation when possible, but the current version of the library does not support that scheme yet.

Concatenating strings in a given normalization form does not guarantee that the result is in that same normalization form if the right operand starts with a combining code point. The library therefore provides functionality to identify the boundaries where re-normalization needs to occur, as well as eager and lazy versions of concatenation that maintain the input normalization. Note that concatenation with Normalization Form D is slightly more efficient, as it only requires canonical sorting of the combining character sequence at the intersection.
See:

- boost::unicode::cat_limits to partition into the different subranges.
- boost::unicode::composed_concat, the eager version, with input in Normalization Form C.
- boost::unicode::composed_concated, the lazy version, with input in Normalization Form C.
- boost::unicode::decomposed_concat, the eager version, with input in Normalization Form D.
- boost::unicode::decomposed_concated, the lazy version, with input in Normalization Form D.
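A toy illustration of the NFC case (not the library's API): each operand below is individually in NFC, yet the right one starts with a combining mark, so the seam must be recomposed. The one-entry composition table is a stand-in for the real canonical-composition data.

```cpp
#include <map>
#include <utility>
#include <vector>

// Toy sketch of concatenation that preserves NFC: append the right operand
// while recomposing pairs across the seam. The table below holds a single
// illustrative entry (e + combining acute -> e-acute); real data comes from
// the UCD minus the exclusion table.
std::vector<char32_t> composed_concat_sketch(std::vector<char32_t> lhs,
                                             const std::vector<char32_t>& rhs) {
    static const std::map<std::pair<char32_t, char32_t>, char32_t> compose = {
        {{U'e', 0x0301}, 0x00E9}  // U+0065 + U+0301 -> U+00E9
    };
    for (char32_t cp : rhs) {
        if (!lhs.empty()) {
            auto it = compose.find({lhs.back(), cp});
            if (it != compose.end()) { lhs.back() = it->second; continue; }
        }
        lhs.push_back(cp);
    }
    return lhs;
}
```

Naive concatenation of `{U+0065}` and `{U+0301}` would yield a two-code-point sequence that is not in NFC; the seam-aware version yields `{U+00E9}`. (This toy ignores canonical ordering and starter detection, which cat_limits is described as handling via boundaries.)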
The library provides mechanisms to perform searches at the code unit, code point, or grapheme level, and will in the future provide word- and sentence-level searches as well. Different approaches are possible:

- Adapt both ranges into ranges of segments with boost::consumer_iterator and search at the segment level (the subranges are EqualityComparable).
- Build a Finder for Boost.StringAlgo with boost::algorithm::boundary_finder and the boundary you are interested in testing, for example boost::unicode::utf_grapheme_boundary.
> **Important:** You will have to normalize the input before the search if you want canonically equivalent strings to compare equal.
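The first approach above can be sketched in plain C++ at the code point level: segment both strings into per-code-point subranges, then run an ordinary sequence search over the segments, so a match can never start or end inside a multi-byte code point. This is standalone illustration, not the library's search interface.

```cpp
#include <algorithm>
#include <cstdint>
#include <string>
#include <vector>

// Split UTF-8 into one subrange per code point (non-continuation byte
// starts a new segment), mirroring what consumer_iterator adapts lazily.
static std::vector<std::string> split_code_points(const std::string& u8) {
    std::vector<std::string> out;
    for (char c : u8) {
        if (out.empty() || (static_cast<std::uint8_t>(c) & 0xC0) != 0x80)
            out.emplace_back();
        out.back() += c;
    }
    return out;
}

// Search needle in hay at the code point level: the segments are
// EqualityComparable, so std::search works on the range of subranges.
bool search_code_points(const std::string& hay, const std::string& needle) {
    auto h = split_code_points(hay);
    auto n = split_code_points(needle);
    return std::search(h.begin(), h.end(), n.begin(), n.end()) != h.end();
}
```

Searching `"caf\xC3\xA9"` for the full code point `"\xC3\xA9"` succeeds, while searching for the lone continuation byte `"\xA9"` fails, which a raw byte-level search would wrongly report as a match. A grapheme-level search would segment with grapheme boundaries instead.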