This section describes terminology frequently used in character conversion, such as the BMP, and introduces the Charconv character converters.
Textual data in electronic devices is stored in terms of a character set. A character set is a group of characters, each of which is encoded as a different number. The appearance of each character is not a property of the character set, but rather of the font. So a character may be rendered using many different glyphs, but will always have the same numeric value within its character set. Other properties which can also be included in a character set’s definition are the direction of writing, and the way in which sets of characters are combined.
Character sets, and the ways of encoding them, have proliferated with the increasing acceptance of computers and communicators throughout the world. This has led to an international standard character set, which encompasses all commonly used character sets, including Eastern ideograms, in a single character set, Unicode, defined by the Unicode Consortium (http://www.unicode.org).
UCS stands for Universal Character Set. Unicode characters are generally encoded using one 16-bit value and so are written to files as two bytes. This is referred to as the UCS-2 encoding format. There are also other Unicode encoding formats, such as UTF-16 and UTF-8, which serve different purposes. For the full definition of these formats, see The Unicode Standard published by the Unicode Consortium.
Unicode code points in the range U+0000 to U+FFFF form the Basic Multilingual Plane (BMP). The BMP covers almost all characters in common use across different languages. Code points outside the BMP, also known as supplementary characters, must be encoded using a "surrogate pair", which consists of two 16-bit values. The Symbian platform currently does not support scripts with characters mapped to code points above U+FFFF.
UTF-16 is one of the Unicode encoding formats. It supports characters both within and outside the BMP, encoding each character as either one or two 16-bit code units.
In the text-processing subsystem, the Symbian platform uses the UTF-16 encoding format. This means that any input to the text-processing subsystem must be in UTF-16. Different character converters can be used to convert text from other encoding formats to UTF-16.
The UCS-2 format of the Unicode character set encodes each character as 2 bytes (16 bits total). However it does not specify which of the bytes is most significant. The byte order, or endian-ness, is left up to the discretion of a particular operating system.
While this does not matter within a single system, it does mean that text encoded as UCS-2 cannot easily be shared between systems using a different endian-ness. To overcome this problem the Unicode Consortium has defined transformation formats for sharing Unicode text. A transformation format defines a fixed byte sequence for each character, so it cannot be misinterpreted by a computer using a different byte order.
The two transformation formats, UTF-7 and UTF-8, are described below. For the full definition of these formats, see The Unicode Standard published by the Unicode Consortium.
UTF-7
UTF-7 allows Unicode characters to be encoded and transmitted as 8-bit bytes, of which only 7 bits are used. UTF-7 divides the set of Unicode characters into three subsets, which are encoded and transmitted differently.
Set D is the set of characters which are encoded as a single byte. It includes lower and upper case A to Z, the numeric digits, and nine other characters.
Set O includes the characters ! " # $ % & * ; < = > @ [ ] ^ _ { | }. These characters can be encoded as a single byte, or with the modified base 64 encoding used for set B characters. When encoded as a single byte, set O characters can be misinterpreted by some applications; encoding them as modified base 64 overcomes this problem.
Set B comprises the remaining characters, which are encoded as an escape byte followed by 2 or 3 bytes. The encoding format is a modified form of base 64 encoding.
UTF-8
UTF-8 encodes and transmits Unicode characters as a string of 8-bit bytes. All the ASCII characters 0 to 127 are encoded without change; the most significant bit being set to zero is a signal that they have not been changed. Unicode characters U+0080 to U+07FF are encoded in two bytes, and the remaining BMP characters, except for the surrogates, are encoded in three bytes. The Unicode surrogate characters are supported by the Character Conversion API, but are not currently supported by all Symbian platform components.
A variant of UTF-8 used internally by Java differs from standard UTF-8 in two ways. First, the NULL character (0x0000) is encoded in the two-byte format rather than as a single zero byte, and second, only the one-, two- and three-byte formats are used, not the four-byte format normally used for Unicode surrogate pairs. An argument to ConvertFromUnicodeToUtf8 controls whether the UTF-8 generated is the Java variant. Support for this variant was removed in v6.0.