Boost C++ Libraries


Overview

Range operations
Composition and Normalization
String searching algorithms

This library provides two kinds of operations on bidirectional ranges: conversion (e.g. converting a range in UTF-8 to a range in UTF-32) and segmentation (i.e. demarcating sections of a range, like code points, grapheme clusters, words, etc.).

Conversion

Conversions can be applied in a variety of ways, all built on the Pipe concept, which performs one step of the conversion:

  • Eager evaluation, which simply loops the Pipe until the whole input range has been treated.
  • Lazy evaluation, where a new range is returned that wraps the input range and converts step-by-step as the range is advanced. The resulting range is however read-only. It is implemented in terms of boost::pipe_iterator.
  • Lazy output evaluation, where an output iterator is returned that wraps the output and converts every pushed element with a OneManyPipe. It is implemented in terms of boost::pipe_output_iterator.

The naming scheme of the utilities within the library reflects this; here is, for example, what is provided to convert UTF-32 to UTF-8:

[Note] Note

The library considers a conversion from UTF-32 an "encoding", while a conversion to UTF-32 is called a "decoding". This is because code points are what the library mainly deals with, and UTF-32 is a sequence of code points.

Segmentation

Segmentations are expressed in terms of the Consumer concept, which is inherently very similar to the Pipe concept except that it doesn't perform any kind of transformation; it just reads part of the input. As a matter of fact, a Pipe can be converted to a Consumer using boost::pipe_consumer.

Segmentation may be done either by using the appropriate Consumer directly, or by using the boost::consumer_iterator template to adapt the range into a read-only range of subranges.

Additionally, the BoundaryChecker concept may prove useful to tell whether a segment starts at a given position; a Consumer may also be defined in terms of it using boost::boundary_consumer.

The naming scheme is as follows:

UTF type deduction with SFINAE

Whenever there are two versions of a function or class, one for UTF-8 and the other for UTF-16, and the type of UTF encoding to use can be deduced, additional versions are added that automatically forward to the appropriate one.

The naming scheme is as follows:

[Tip] Tip

Not only are UTF-8 and UTF-16 recognized by UTF type deduction; UTF-32 is as well.

Normalized forms are defined in terms of certain decompositions applied recursively, followed by certain compositions also applied recursively, and finally canonical ordering of combining character sequences.

A decomposition is the conversion of a single code point into several, and a composition is the opposite conversion, with certain exceptions.

Decomposition

The Unicode Character Database associates certain decompositions with code points; these can be obtained with boost::unicode::ucd::get_decomposition. It does not include Hangul syllable decompositions, however, since those can easily be generated procedurally, allowing space to be saved.

The library provides boost::unicode::hangul_decomposer, a OneManyPipe to decompose Hangul syllables.

There are several types of decompositions, which are exposed by boost::unicode::ucd::get_decomposition_type. Most importantly, the canonical decomposition is obtained by applying both the Hangul decompositions and the canonical decompositions from the UCD, while the compatibility decomposition is obtained by applying the Hangul decompositions and all decompositions from the UCD.

boost::unicode::decomposer, a model of Pipe, performs any decomposition that matches a certain mask, recursively, including the Hangul ones (which are treated as canonical decompositions), and canonically orders combining sequences as well.

Composition

Likewise, Hangul syllable compositions are not provided by the UCD and are implemented by boost::unicode::hangul_composer instead.

Some distinct code points may have the same decomposition, so certain decomposed forms are preferred. That is why an exclusion table is also provided by the UCD.

The library uses a pre-generated prefix tree (or, in the current implementation, a lexicographically sorted array) of all canonical compositions from their fully decomposed and canonically ordered form to identify composable sequences and apply the compositions.

boost::unicode::composer is a Pipe that uses that tree as well as the Hangul compositions.

Normalization

Normalization can be performed by applying decomposition followed by composition, which is what the current version of boost::unicode::normalizer does.

The Unicode standard, however, also provides quick-check properties to avoid that operation when possible; the current version of the library does not support that scheme.

Concatenation

Concatenating strings in a given normalization form does not guarantee the result is in that same normalization form if the right operand starts with a combining code point.

Therefore the library provides functionality to identify the boundaries where re-normalization needs to occur, as well as eager and lazy versions of concatenation that maintain the input normalization.

Note that concatenation with Normalization Form D is slightly more efficient, as it only requires canonical sorting of the combining character sequence at the intersection.

See:

The library provides mechanisms to perform searches at the code unit, code point, or grapheme level, and in the future will provide word and sentence level as well.

Different approaches to do that are possible:

  • Pipe- or Consumer-based: you may simply run classic search algorithms, such as the ones from Boost.StringAlgo, on ranges of the appropriate elements -- those elements may themselves be ranges (subranges returned by boost::consumer_iterator are EqualityComparable).
  • BoundaryChecker-based: the classic algorithms are run, then false positives that don't lie on the right boundaries are discarded. This has the advantage of reducing conversion and iteration overhead in certain situations. The most practical way to achieve this is to adapt a Finder from Boost.StringAlgo with boost::algorithm::boundary_finder and the boundary you are interested in testing, for example boost::unicode::utf_grapheme_boundary.
[Important] Important

You will have to normalize the input before searching if you want canonically equivalent strings to compare equal.

