TEXT, NATIVE LANGUAGE SUPPORT, AND TIME MEDIA - Representing Unicode characters

Representing Unicode characters

The CommonPoint application system provides a basic data type, UniChar, that represents a single Unicode character. Always store and manipulate individual Unicode characters using the UniChar data type.

TIP Some interfaces currently accept char* arguments rather than an appropriate Unicode type, such as TText or one of its subclasses. These char* interfaces will eventually be removed. Do not rely on them; use TText objects instead.

The TUnicode class encapsulates a character along with its associated semantic information. TUnicode member functions give you access to this semantic information. Always use these functions to obtain the properties of a specific character.

Unicode character naming

The CommonPoint application system provides a name, through a set of enumerations, for every character in the Unicode set, with the exception of most of the Han ideographic characters.

NOTE The official name for the Han ideograph at a given code point U+XXXX is CJK UNIFIED IDEOGRAPH XXXX, so enumerated names provide no advantage. However, TUnicode does provide names for some particularly significant ideographs, such as digits and the 214 KangXi radicals.

Refer to a specific Unicode value using its character name rather than using the code point. For example, refer to TGeneralPunctuation::kQuestionMark rather than the value U+003F.
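As a minimal sketch of why a character name reads better than a raw code point, the following self-contained C++ uses stand-in declarations. The UniChar typedef, the SketchPunctuation scope, and the EndsWithQuestion helper are assumptions made for illustration and are not the CommonPoint declarations; only the name kQuestionMark and the value U+003F come from the text above.

```cpp
#include <cstddef>

// Stand-in for the CommonPoint UniChar type (assumed here to be a
// 16-bit unsigned code unit; the real declaration may differ).
typedef unsigned short UniChar;

// Hypothetical scope class mirroring the idea of TGeneralPunctuation:
// it exists only to hold enumerated character names.
class SketchPunctuation {
public:
    enum {
        kQuestionMark = 0x003F  // QUESTION MARK, U+003F
    };
    SketchPunctuation() = delete;  // used for its names only, never instantiated
};

// Returns true when the last character in a UniChar buffer is a question mark.
bool EndsWithQuestion(const UniChar* text, std::size_t length) {
    if (text == nullptr || length == 0) {
        return false;
    }
    // The named constant documents intent; a bare 0x003F would not.
    return text[length - 1] == SketchPunctuation::kQuestionMark;
}
```

A caller passes a buffer of UniChar values and its length; the comparison never mentions the numeric code point.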

Because of the large number of characters, the names are scoped into a set of classes based on script or function: TLatin, TGreek, TASCII, TDingbats, TGeneralPunctuation, TMathematicalOperators, and so on. These classes are provided only for referencing the enumerated names they contain; do not use them for any other reason. For a complete list of classes used to enumerate character names, see the online class and member function documentation or the following header files:

File name	Names included
UnicodeGeneral.h	Characters for the Roman script and general utility characters such as punctuation and control codes
UnicodeEastAsia.h	Characters for East Asian scripts such as Hangul and Kana
UnicodeEastEurope.h	Characters for Eastern European scripts such as Cyrillic
UnicodeMidEast.h	Characters for Middle Eastern scripts such as Arabic and Hebrew
UnicodeSouthAsia.h	Characters for South and Southeast Asian scripts such as Bengali and Thai
UnicodeSymbols.h	Symbol characters such as dingbats or mathematical operators
UnicodeCompatibility.h	Additional characters provided for compatibility with existing character sets, such as Roman numerals (the CommonPoint application system provides these codes for compatibility only; it is recommended that you do not use them)

Querying Unicode character properties

TUnicode provides static member functions that return an enumerated value describing a character's script or type--GetScript and GetType. TUnicode also provides static member functions that check a UniChar for a certain property--for example, querying whether a character is an uppercase character, or a digit, or one of the space characters. These functions let you easily check a character for a specific property without needing to know all of the possibilities. For example, you can test for a space character with the TUnicode::IsASpace function without needing to know the full set of Unicode characters used to represent a space.
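The benefit of a property query can be sketched in self-contained C++. The UniChar typedef and both functions below are hypothetical stand-ins, not the CommonPoint implementation: SketchIsASpace plays the role that the text assigns to TUnicode::IsASpace, and its list of space characters is deliberately partial.

```cpp
#include <cstddef>

// Stand-in for the CommonPoint UniChar type (assumed 16-bit unsigned).
typedef unsigned short UniChar;

// Hypothetical stand-in for a property query such as TUnicode::IsASpace.
// It covers only a few well-known Unicode space characters; a real
// implementation would consult the full character property data.
bool SketchIsASpace(UniChar c) {
    return c == 0x0020                     // SPACE
        || c == 0x00A0                     // NO-BREAK SPACE
        || (c >= 0x2000 && c <= 0x200A)    // EN QUAD through HAIR SPACE
        || c == 0x3000;                    // IDEOGRAPHIC SPACE
}

// Counts space characters in a buffer. The caller relies on the
// property query and never lists individual code points itself.
std::size_t CountSpaces(const UniChar* text, std::size_t length) {
    std::size_t count = 0;
    for (std::size_t i = 0; i < length; ++i) {
        if (SketchIsASpace(text[i])) {
            ++count;
        }
    }
    return count;
}
```

Because CountSpaces depends only on the query, extending the set of recognized spaces requires no change to calling code.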

The following figure shows some of the TUnicode static member functions. See TUnicode in the online class and member function documentation for a complete list of character property functions and descriptions.

Use the TCharacterPropertyIterator class to scan the set of Unicode characters for characters with a specific set of properties. For example, you might use this class to return a list of punctuation characters for a particular script.
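The interface of TCharacterPropertyIterator is not described here, so the following self-contained C++ only sketches the general idea of scanning a range of characters for a property. Every name in it (the UniChar typedef, SketchIsAsciiPunctuation, CollectMatching) is hypothetical, and the predicate covers just the ASCII punctuation and symbol characters.

```cpp
#include <vector>

// Stand-in for the CommonPoint UniChar type (assumed 16-bit unsigned).
typedef unsigned short UniChar;

// Hypothetical predicate: true for ASCII punctuation and symbol characters.
bool SketchIsAsciiPunctuation(UniChar c) {
    return (c >= 0x0021 && c <= 0x002F)    // ! through /
        || (c >= 0x003A && c <= 0x0040)    // : through @
        || (c >= 0x005B && c <= 0x0060)    // [ through `
        || (c >= 0x007B && c <= 0x007E);   // { through ~
}

// Scans the inclusive range [first, last] and collects every character
// for which the predicate returns true.
std::vector<UniChar> CollectMatching(UniChar first, UniChar last,
                                     bool (*predicate)(UniChar)) {
    std::vector<UniChar> matches;
    // A wider loop counter keeps the loop from wrapping when last is
    // the largest UniChar value.
    for (unsigned long c = first; c <= last; ++c) {
        UniChar candidate = static_cast<UniChar>(c);
        if (predicate(candidate)) {
            matches.push_back(candidate);
        }
    }
    return matches;
}
```

For example, CollectMatching(0x0000, 0x007F, SketchIsAsciiPunctuation) returns the matching characters in ascending code point order.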

