Simple transcoding

To transcode text data using one of the CommonPoint transcoders, you:

  1. Instantiate the TTranscoder subclass for the external character set.
    Note that for some character encoding sets--for example, the set used on the Macintosh--transcoders must be created for the specific script being used.
  2. Call the correct member function--either AppendToText, ExtractFromText, or CreateStringFromText--to perform the conversion.
    These functions default to the kNoRoundTrip transcoding scope, which means that any character without a one-to-one mapping in the target set is converted to a standard substitution character. Specify a different scope to override this default. See "Handling exception characters" for more information.

When you convert text from Unicode, the transcoder converts text data from a TText instance to a string of characters (char*). You manage the storage buffer for the converted string.

When you convert text to Unicode, the transcoder converts text data from a character string (char*) and appends it to a TText instance. Again, you manage the storage for the target text object. The transcoder always appends: if the text object already contains text, the converted text is placed after the existing text.

The abstract class TTranscoder provides the basic protocol for transcoding characters both to and from Unicode. Call the TTranscoder member functions AppendToText, ExtractFromText, and CreateStringFromText to perform conversions.

[Figure: how the transcoding functions process character data]


Converting character data to Unicode

This example illustrates how to convert a string of ASCII characters to Unicode using TASCIITranscoder. The converted data is appended to the end of the text instance unicodeText.

      unsigned char myASCIIString[] = "ASCII character string";
      unsigned long stringLength = strlen((const char*) myASCIIString);
      TStandardText unicodeText;
      
      TASCIITranscoder transcoder;
      TTextCount numCharsConverted = transcoder.AppendToText(myASCIIString,
                   stringLength, unicodeText);
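
The append behavior described above can be sketched without the CommonPoint classes. This is only an illustration under stated assumptions: std::u16string stands in for the text object, AppendAsciiToText is a hypothetical helper rather than a CommonPoint function, and bytes outside ASCII are replaced with U+FFFD. Separator mapping is left out to keep the sketch short.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Hypothetical stand-in for a transcoder's append operation. std::u16string
// plays the role of the text object; this is not CommonPoint code.
std::size_t AppendAsciiToText(const unsigned char* source, std::size_t length,
                              std::u16string& text)
{
    assert(source != nullptr || length == 0);
    const char16_t kReplacement = 0xFFFD;  // Unicode REPLACEMENT CHARACTER
    for (std::size_t i = 0; i < length; ++i) {
        // Bytes above 0x7F are not ASCII, so they are substituted.
        text.push_back(source[i] <= 0x7F ? static_cast<char16_t>(source[i])
                                         : kReplacement);
    }
    return length;  // one output character is appended per input byte
}
```

Calling the helper on a text object that already holds "Existing " with the four bytes of "text" leaves "Existing text" in the object and returns 4. The existing content is never overwritten.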

Converting character data from Unicode

When you convert text data from Unicode to another character set, you must manage the storage for the converted text.

The TTranscoder::CreateStringFromText member function provides the simplest way to convert text from Unicode. This function returns a null-terminated string that you should delete when you are finished.

This example shows how to use TASCIITranscoder to convert the Unicode text in the text instance unicodeText to ASCII character data, which is placed in the character string myASCIIString.

      unsigned char* myASCIIString = NIL;
      
      TASCIITranscoder transcoder;
      TTextCount stringLength = transcoder.CreateStringFromText(myASCIIString,
                                   unicodeText);
      
      // Process myASCIIString...
      
      delete myASCIIString;

You can also directly manage a storage buffer for the output text, and use the TTranscoder::ExtractFromText function to perform the conversion. The output string created by this function is not null-terminated; therefore, the length of the string is also returned.

This example calculates the required buffer size and uses the ExtractFromText function to convert the text in unicodeText to ASCII character data. The ASCII character data is returned in asciiCharBuffer. The length of the output string is returned in charBufferSize.

      TASCIITranscoder transcoder;
      
      // Calculate required buffer size and create the buffer.
      TTextCount charBufferSize = unicodeText.GetLength() *
                   transcoder.GetMaximumBytesPerCharacter();
      unsigned char* asciiCharBuffer = new unsigned char[charBufferSize];
      
      // Perform the text conversion.
      transcoder.ExtractFromText(unicodeText, TTextRange(0,unicodeText.GetLength()),
                      asciiCharBuffer, charBufferSize);
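
The same pattern (size the buffer for the worst case, then trust only the returned length) can be sketched in standard C++. The names ExtractAscii and ToAsciiString are hypothetical, std::u16string stands in for the text object, and the substitution byte 0x1A is an assumption taken from the sample code later in this section.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical stand-in for an extract-style conversion: writes at most
// `capacity` bytes into `buffer`, adds no terminator, and returns the
// number of bytes written.
std::size_t ExtractAscii(const std::u16string& text, unsigned char* buffer,
                         std::size_t capacity)
{
    assert(buffer != nullptr || capacity == 0);
    const unsigned char kSubstitute = 0x1A;  // assumed substitution byte
    std::size_t written = 0;
    for (char16_t c : text) {
        if (written == capacity) break;  // never write past the buffer
        buffer[written++] = (c <= 0x7F) ? static_cast<unsigned char>(c)
                                        : kSubstitute;
    }
    return written;
}

std::string ToAsciiString(const std::u16string& text)
{
    const std::size_t kMaxBytesPerCharacter = 1;  // ASCII uses one byte each
    std::vector<unsigned char> buffer(text.size() * kMaxBytesPerCharacter);
    std::size_t length = ExtractAscii(text, buffer.data(), buffer.size());
    // No terminator was written, so the string is built from the length.
    return std::string(buffer.begin(), buffer.begin() + length);
}
```

Because the output is not null-terminated, the caller should never pass the raw buffer to functions such as strlen. The returned length is the only reliable measure of how much of the buffer is valid.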

Converting between ASCII and Unicode without transcoders

Because the first 128 characters in the Unicode system correspond to the ASCII character set, you can convert text data between Unicode and ASCII without using a transcoder.

The following are samples of functions you can use to convert text between ASCII and Unicode. This code, like the CommonPoint ASCII transcoder, converts the ASCII new line and carriage return characters to the Unicode characters TGeneralPunctuation::kLineSeparator and TGeneralPunctuation::kParagraphSeparator, respectively.

      const char kASCIINewLine = 0x0A;
      const char kASCIICarriageReturn = 0x0D;
      const char kASCIISubstitute = 0x1A;
      const char kEndOfASCII = 0x7F;
      
      TTextCount
      ASCIIToUnicode(char* ascii, UniChar* unicode)
      {
          unsigned char aChar;   // unsigned, so bytes above 0x7F fail the range test
          UniChar* unicodePtr = unicode;
          while (aChar = *ascii++)
          {
              switch (aChar) {
                  case kASCIINewLine:
                      *unicodePtr++ = TGeneralPunctuation::kLineSeparator;
                      break;
                  case kASCIICarriageReturn:
                      *unicodePtr++ = TGeneralPunctuation::kParagraphSeparator;
                      break;
                  default:
                  {
                      if (aChar <= kEndOfASCII)
                          *unicodePtr++ = (UniChar) aChar;
                      else
                          *unicodePtr++ = TUnicodeSpecial::kReplacementCharacter;
                      break;
                  }
              }
          }
          return unicodePtr-unicode;
      }
      
      TTextCount
      UnicodeToASCII(UniChar* unicode, char* ascii)
      {
          UniChar aChar;
          char* asciiPtr = ascii;
          while (aChar = *unicode++)
          {
              switch (aChar)
              {
                  case TGeneralPunctuation::kLineSeparator:
                      *asciiPtr++ = kASCIINewLine;
                      break;
                  case TGeneralPunctuation::kParagraphSeparator:
                      *asciiPtr++ = kASCIICarriageReturn;
                      break;
                  default:
                  {
                      if (aChar <= kEndOfASCII)
                          *asciiPtr++ = (char) aChar;
                      else
                          *asciiPtr++ = kASCIISubstitute;
                      break;
                  }
              }
          }
          return asciiPtr-ascii;
      }
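
The mapping used by the sample functions can also be checked one character at a time. The sketch below is an illustration, not CommonPoint code: the function names are hypothetical, and it assumes that kLineSeparator, kParagraphSeparator, and kReplacementCharacter correspond to the Unicode code points U+2028, U+2029, and U+FFFD.

```cpp
#include <cassert>

// Assumed code point values for the CommonPoint constants.
const char16_t kLineSeparator = 0x2028;
const char16_t kParagraphSeparator = 0x2029;
const char16_t kReplacementCharacter = 0xFFFD;
const unsigned char kAsciiSubstitute = 0x1A;

// ASCII byte to Unicode, following the same rules as the sample code.
char16_t AsciiCharToUnicode(unsigned char c)
{
    if (c == 0x0A) return kLineSeparator;
    if (c == 0x0D) return kParagraphSeparator;
    return (c <= 0x7F) ? static_cast<char16_t>(c) : kReplacementCharacter;
}

// Unicode character to ASCII byte, following the same rules.
unsigned char UnicodeCharToAscii(char16_t c)
{
    if (c == kLineSeparator) return 0x0A;
    if (c == kParagraphSeparator) return 0x0D;
    return (c <= 0x7F) ? static_cast<unsigned char>(c) : kAsciiSubstitute;
}

// True when every 7-bit value survives ASCII -> Unicode -> ASCII unchanged.
bool AsciiRoundTripIsLossless()
{
    for (unsigned int c = 0; c <= 0x7F; ++c) {
        unsigned char original = static_cast<unsigned char>(c);
        if (UnicodeCharToAscii(AsciiCharToUnicode(original)) != original)
            return false;
    }
    return true;
}
```

Under these rules every ASCII value survives a trip through Unicode and back, but the opposite direction loses information: any Unicode character above U+007F other than the two separators becomes the substitute byte 0x1A, and U+000A comes back as the line separator rather than as itself.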

Handling separator characters

As shown in the example in "Converting between ASCII and Unicode without transcoders" above, Unicode provides two distinct characters used as line and paragraph separators. When transcoding, you need to make sure that these characters are transcoded meaningfully to and from the foreign encoding.

This table compares the use of the Unicode characters TGeneralPunctuation::kLineSeparator and TGeneralPunctuation::kParagraphSeparator with the use of the line feed (LF) and carriage return (CR) characters in the character encoding systems used by some other systems.

CommonPoint application system             UNIX   Macintosh System 7   DOS
TGeneralPunctuation::kLineSeparator        LF     LF                   LF
TGeneralPunctuation::kParagraphSeparator   LF     CR                   CR LF (character sequence)
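
The table above can be turned into a small conversion routine. This is a minimal sketch under stated assumptions: the enum, the function names, and the use of std::u16string are all hypothetical, the separators are assumed to be U+2028 and U+2029, and characters outside ASCII are replaced with the byte 0x1A.

```cpp
#include <cassert>
#include <string>

// Hypothetical names for the three target systems in the table.
enum class LineEndingStyle { kUnix, kMacintosh, kDos };

// Paragraph separator column of the table: LF, CR, or the CR LF sequence.
const char* ParagraphEnding(LineEndingStyle style)
{
    switch (style) {
        case LineEndingStyle::kUnix:      return "\n";
        case LineEndingStyle::kMacintosh: return "\r";
        case LineEndingStyle::kDos:       return "\r\n";
    }
    assert(false && "unknown LineEndingStyle");
    return "\n";
}

std::string SeparatorsToForeign(const std::u16string& text,
                                LineEndingStyle style)
{
    const char16_t kLineSeparator = 0x2028;       // assumed code point
    const char16_t kParagraphSeparator = 0x2029;  // assumed code point
    std::string out;
    for (char16_t c : text) {
        if (c == kLineSeparator) {
            out += '\n';  // LF on all three systems in the table
        } else if (c == kParagraphSeparator) {
            out += ParagraphEnding(style);
        } else {
            out += (c <= 0x7F) ? static_cast<char>(c) : '\x1A';
        }
    }
    return out;
}
```

Note that the DOS paragraph ending is two bytes long, so an output buffer sized at one byte per character would be too small for that target. Growing a std::string sidesteps the problem in this sketch.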


Copyright © 1995 Taligent, Inc. All rights reserved.
