String, Char & Unicode
String
cilia::String (AKA UTF8String) with basic/standard Unicode support.
Based on UTF-8, as that IMHO is (among all the Unicode formats)
- the most widespread nowadays,
- the most compatible (as it is ASCII based),
- the most efficient, at least for “western” use (and you are free to use UTF16- or UTF32String otherwise).
Iteration over a String or StringView by:
- Graphemes/Grapheme Clusters
  - represented by StringView.
  - This is the default form of iteration over a String or StringView.
  - A single grapheme will often consist of multiple code units and may even consist of multiple code points (then it is called a grapheme cluster).
  - for grapheme in "abc 🥸👮🏻"
    - “a”, “b”, “c”, “ ”, “🥸”, “👮🏻”
    - “\x61”, “\x62”, “\x63”, “\x20”, “\xf0\x9f\xa5\xb8”, “\xf0\x9f\x91\xae\xf0\x9f\x8f\xbb”
  - A bit slow, as it has to find grapheme (and cluster) boundaries.
  - It is recommended to mostly use the standard functions for string manipulation anyway. But if you need to iterate manually over a Unicode string, then grapheme-cluster-based iteration is the safe/right way.
  - Additional/alternative names?
    - for graphemeCluster in text.asGraphemeClusters()?
- Code Points
  - represented by UInt32,
    - independent of the encoding (i.e. the same for UTF-8, UTF-16, and UTF-32 strings).
    - Called “auto decoding” in D.
  - for codePoint in "abc 🥸👮🏻".asCodePoints()
    - 0x00000061, 0x00000062, 0x00000063, 0x00000020, 0x0001F978, 0x0001F46E, 0x0001F3FB
  - Note: Not even with UTF-32 do all grapheme clusters fit into a single code point, so not:
    - Emoji ZWJ Sequences (Zero Width Joiner),
    - emoji with modifier characters like skin tone or variation selector,
    - diacritical characters (äöü…, depending on the normal form chosen),
    - surely some more …
  - A bit faster than iteration over grapheme clusters, but still slow, as it has to find code point boundaries in UTF-8/16 strings.
  - Fast with UTF-32 strings, but UTF-32 strings in general are often slower than UTF-8, simply due to their size (cache, memory bandwidth).
- Code Units
  - represented by
    - Char for String
      - it is Char==Char8==UInt8 and String==UTF8String
    - Char16 for UTF16String
    - Char32 for UTF32String
  - for aChar8 in "abc 🥸👮🏻".asArray()
    - 0x61, 0x62, 0x63, 0x20, 0xf0, 0x9f, 0xa5, 0xb8, 0xf0, 0x9f, 0x91, 0xae, 0xf0, 0x9f, 0x8f, 0xbb
    - same for
      - for aChar8 in u8"abc 🥸👮🏻".asArray()
      - for aChar8 in UTF8String("abc 🥸👮🏻").asArray()
  - for aChar16 in u"abc 🥸👮🏻".asArray()
    - 0x0061, 0x0062, 0x0063, 0x0020, 0xD83E, 0xDD78, 0xD83D, 0xDC6E, 0xD83C, 0xDFFB
    - same for
      - for aChar16 in UTF16String("abc 🥸👮🏻").asArray()
  - for aChar32 in U"abc 🥸👮🏻".asArray()
    - 0x00000061, 0x00000062, 0x00000063, 0x00000020, 0x0001F978, 0x0001F46E, 0x0001F3FB
    - same for
      - for aChar32 in UTF32String("abc 🥸👮🏻").asArray()
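The code point and code unit values listed above can be reproduced with a short Python sketch. This is not Cilia code. The helper names `code_points` and `code_units` are made up for this illustration, and Python's built-in codecs stand in for asCodePoints() and asArray().

```python
# Illustration only: Python's codecs stand in for Cilia's string types.
# The emoji are written as escapes: U+1F978, then U+1F46E followed by U+1F3FB.
TEXT = "abc \U0001F978\U0001F46E\U0001F3FB"


def code_points(text: str) -> list[int]:
    """Code points are independent of the encoding."""
    return [ord(ch) for ch in text]


def code_units(text: str, encoding: str, width: int) -> list[int]:
    """Encode the text and split the bytes into code units of `width` bytes.

    Pass a little-endian codec without BOM ("utf-8", "utf-16-le", "utf-32-le").
    """
    data = text.encode(encoding)
    return [
        int.from_bytes(data[i:i + width], "little")
        for i in range(0, len(data), width)
    ]


if __name__ == "__main__":
    print([hex(cp) for cp in code_points(TEXT)])
    print([hex(u) for u in code_units(TEXT, "utf-8", 1)])      # 16 code units
    print([hex(u) for u in code_units(TEXT, "utf-16-le", 2)])  # 10 code units
    print([hex(u) for u in code_units(TEXT, "utf-32-le", 4)])  # 7 code units
```

Only in UTF-32 does the number of code units equal the number of code points. Even there, the last two code points still form a single grapheme cluster.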
Convert Upper/Lower Case
- string.toUpper(), string.toLower()
- toUpper(String) -> String, toLower(String) -> String
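As a rough illustration of locale-independent case conversion, the following Python sketch wraps `str.upper()` and `str.lower()`, which apply the default Unicode case mappings. The function names `to_upper` and `to_lower` only mirror the signatures above. How Cilia would implement them is not specified here.

```python
# Illustration only: Python's default (locale-independent) Unicode case mapping.

def to_upper(text: str) -> str:
    return text.upper()


def to_lower(text: str) -> str:
    return text.lower()


if __name__ == "__main__":
    # Case mapping is not always one code point to one code point:
    # the German sharp s has a two-letter uppercase form.
    print(to_upper("straße"))  # STRASSE
    print(to_lower("ÄÖÜ"))     # äöü
    # Without a locale, "i" maps to "I" (Turkish would expect a dotted capital I).
    print(to_upper("i"))       # I
```

The result can therefore have a different length than the input, which is one reason these functions return a new string.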
Sorting
- stringArray.sort()
- sort(Container<String>) -> Container<String>
- compare(stringA, stringB) -> Int
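The text does not say which order the locale-independent sort uses. The sketch below assumes plain code point order, which is what Python's string comparison does. `compare` and `sort_strings` are illustrative names modeled on the signatures above.

```python
# Illustration only, assuming the default order is code point order.
from functools import cmp_to_key


def compare(a: str, b: str) -> int:
    """Three-way comparison by code point: -1, 0, or 1."""
    return (a > b) - (a < b)


def sort_strings(strings: list[str]) -> list[str]:
    """Return a new list sorted with compare()."""
    return sorted(strings, key=cmp_to_key(compare))


# "\u00c4rger" is "Ärger" with a precomposed A-umlaut (U+00C4).
words = ["zebra", "Zebra", "apple", "\u00c4rger"]

by_code_point = sort_strings(words)
# Byte-wise comparison of the UTF-8 encoding yields the same order.
by_utf8_bytes = sorted(words, key=lambda s: s.encode("utf-8"))

if __name__ == "__main__":
    print(by_code_point)  # ['Zebra', 'apple', 'zebra', 'Ärger']
```

Uppercase letters sort before lowercase ones and "Ärger" ends up last, so this order is rarely what a user expects. Language-sensitive ordering needs a locale.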
ByteString
ByteString represents strings with a single-byte encoding (i.e. the classical strings consisting of one-byte characters), like:
- ASCII,
- Latin-1,
- ANSI (mostly identical to Latin-1),
- almost every one of the “code pages”.
The encoding is not defined; the user has to take care of it, or use a subclass with a known encoding:
- ASCIIString, a string containing only ASCII characters.
  - Iteration over an ASCIIString or ASCIIStringView by Char==Char8
    - for aChar in a"abc"
      - 0x61, 0x62, 0x63
      - ‘a’, ‘b’, ‘c’
      - Compilation error, if string literal contains non-ASCII characters.
    - same for
      - for aChar in ASCIIString("abc")
        - but Exception thrown, if string contains non-ASCII characters.
  - Implicitly convertible to String==UTF8String.
    - Very fast conversion, as all characters have the same binary representation.
- Latin1String, a string containing only Latin-1 (ISO 8859-1) characters.
  - Iteration over a Latin1String or Latin1StringView by Char==Char8
    - for aChar in l"äßç"
      - 0xe4, 0xdf, 0xe7
      - ‘ä’, ‘ß’, ‘ç’
      - Compilation error, if string literal contains non-Latin-1 characters.
    - same for
      - for aChar in Latin1String("abc")
        - but Exception thrown, if string contains non-Latin-1 characters.
  - Explicitly convertible to String==UTF8String.
    - Not as fast a conversion as ASCIIString to String, because typically some characters need to be translated into two UTF-8 code units.
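The difference between the two conversions can be sketched in Python, with `bytes` standing in for ByteString. The function names are hypothetical. The point is only that ASCII bytes are already valid UTF-8, whereas Latin-1 bytes at or above 0x80 have to be re-encoded.

```python
# Illustration only: bytes stand in for ASCIIString / Latin1String.

def ascii_to_utf8(data: bytes) -> bytes:
    """ASCII is a subset of UTF-8, so the bytes can be reused unchanged."""
    data.decode("ascii")  # validation only; raises UnicodeDecodeError on bytes >= 0x80
    return data


def latin1_to_utf8(data: bytes) -> bytes:
    """Every Latin-1 byte >= 0x80 becomes two UTF-8 code units."""
    return data.decode("latin-1").encode("utf-8")


def to_ascii(text: str) -> bytes:
    """Raises UnicodeEncodeError if text contains non-ASCII characters."""
    return text.encode("ascii")


def to_latin1(text: str) -> bytes:
    """Raises UnicodeEncodeError if text contains non-Latin-1 characters."""
    return text.encode("latin-1")


if __name__ == "__main__":
    print(ascii_to_utf8(to_ascii("abc")).hex(" "))   # 61 62 63
    print(to_latin1("äßç").hex(" "))                 # e4 df e7
    print(latin1_to_utf8(to_latin1("äßç")).hex(" ")) # c3 a4 c3 9f c3 a7
```

This mirrors why the ASCII conversion can be implicit and cheap, while the Latin-1 conversion is explicit and has to allocate a longer buffer.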
Char
Char8, Char16, Char32 are considered as different types for parameter overloading,
but otherwise are like UInt8, UInt16, UInt32.
ICU
International Components for Unicode (“ICU”) for advanced Unicode support.
The ICU libraries provide support for:
- The latest version of the Unicode standard
- Character set conversions with support for over 220 codepages
- Locale data for more than 300 locales
- Language sensitive text collation (sorting) and searching based on the Unicode Collation Algorithm (=ISO 14651)
- Regular expression matching and Unicode sets
- Transformations for normalization, upper/lowercase, script transliterations (50+ pairs)
- Resource bundles for storing and accessing localized information
- Date/Number/Message formatting and parsing of culture-specific input/output formats
- Calendar specific date and time manipulation
- Text boundary analysis for finding characters, word and sentence boundaries
import icu adds extension methods for cilia::String
- Allows iteration over:
  - words (important/difficult for Chinese, Japanese, Thai or Khmer, needs list of words)
    - for word in text.asWords()
  - lines
    - for line in text.asLines()
  - sentences (needs list of abbreviations, like “e.g.”, “i.e.”, “o.ä.”)
    - for sentence in text.asSentences()
- Depending on locale
  - string.toUpper(locale), string.toLower(locale)
    - toUpper(String, locale) -> String, toLower(String, locale) -> String
  - stringArray.sort(locale)
    - sort(Container<String>, locale) -> Container<String>
  - compare(stringA, stringB, locale) -> Int
- Even iterating through graphemes (or grapheme clusters) is complicated for some rare or historic scripts.
  - Basic support covers Latin, combining marks, ZWJ, flags, variation selectors, CJK (Han, Hiragana, Katakana, Hangul).
    - So almost everything is covered.
  - Give more complex cases to ICU (Arabic, Devanagari, Thai).
    - Maybe via weak linking.
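To give a feel for what "basic" grapheme cluster detection could look like, here is a deliberately simplified Python sketch. It is not the Unicode segmentation algorithm (UAX #29) and not Cilia's implementation. It glues combining marks, variation selectors, skin tone modifiers, ZWJ sequences and regional-indicator pairs (flags) onto the preceding code point, and treats everything else as its own cluster. All names are made up for this example.

```python
# Simplified sketch, NOT a full UAX #29 implementation.
import unicodedata

ZWJ = 0x200D  # ZERO WIDTH JOINER


def is_extender(cp: int) -> bool:
    """Code points that attach to the preceding cluster in this sketch."""
    return (
        unicodedata.category(chr(cp)).startswith("M")  # combining marks
        or 0xFE00 <= cp <= 0xFE0F                      # variation selectors
        or 0x1F3FB <= cp <= 0x1F3FF                    # emoji skin tone modifiers
        or cp == ZWJ
    )


def is_regional_indicator(cp: int) -> bool:
    return 0x1F1E6 <= cp <= 0x1F1FF


def grapheme_clusters(text: str) -> list[str]:
    clusters: list[str] = []
    for ch in text:
        cp = ord(ch)
        if clusters:
            prev = ord(clusters[-1][-1])
            # A flag is exactly two regional indicators: only join the second
            # one onto a cluster that so far is a single regional indicator.
            completes_flag = (
                is_regional_indicator(cp)
                and is_regional_indicator(prev)
                and len(clusters[-1]) == 1
            )
            # Simplification: anything following a ZWJ is joined.
            if is_extender(cp) or prev == ZWJ or completes_flag:
                clusters[-1] += ch
                continue
        clusters.append(ch)
    return clusters


if __name__ == "__main__":
    text = "abc \U0001F978\U0001F46E\U0001F3FB"
    print(len(grapheme_clusters(text)))  # 6
```

Scripts such as Devanagari or Thai need additional rules and data tables that this sketch ignores, which is the kind of case the text proposes handing over to ICU.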