Character Set Conversion: convert strings between different character sets using iconv().
glib.lib
#include <glib.h>
gchar* g_convert (const gchar *str, gssize len, const gchar *to_codeset, const gchar *from_codeset, gsize *bytes_read, gsize *bytes_written, GError **error);
gchar* g_convert_with_fallback (const gchar *str, gssize len, const gchar *to_codeset, const gchar *from_codeset, gchar *fallback, gsize *bytes_read, gsize *bytes_written, GError **error);
GIConv;
gchar* g_convert_with_iconv (const gchar *str, gssize len, GIConv converter, gsize *bytes_read, gsize *bytes_written, GError **error);
#define G_CONVERT_ERROR
GIConv g_iconv_open (const gchar *to_codeset, const gchar *from_codeset);
size_t g_iconv (GIConv converter, gchar **inbuf, gsize *inbytes_left, gchar **outbuf, gsize *outbytes_left);
gint g_iconv_close (GIConv converter);
gchar* g_locale_to_utf8 (const gchar *opsysstring, gssize len, gsize *bytes_read, gsize *bytes_written, GError **error);
gchar* g_filename_to_utf8 (const gchar *opsysstring, gssize len, gsize *bytes_read, gsize *bytes_written, GError **error);
gchar* g_filename_from_utf8 (const gchar *utf8string, gssize len, gsize *bytes_read, gsize *bytes_written, GError **error);
gchar* g_filename_from_uri (const gchar *uri, gchar **hostname, GError **error);
gchar* g_filename_to_uri (const gchar *filename, const gchar *hostname, GError **error);
gboolean g_get_filename_charsets (G_CONST_RETURN gchar ***charsets);
gchar* g_filename_display_name (const gchar *filename);
gchar* g_filename_display_basename (const gchar *filename);
gchar** g_uri_list_extract_uris (const gchar *uri_list);
gchar* g_locale_from_utf8 (const gchar *utf8string, gssize len, gsize *bytes_read, gsize *bytes_written, GError **error);
enum GConvertError;
gboolean g_get_charset (G_CONST_RETURN char **charset);
Historically, Unix has not had a defined encoding for file names: a file
name is valid as long as it does not have path separators in it ("/"). However,
displaying file names may require conversion: from the character set in which
they were created, to the character set in which the application operates.
Consider the Spanish file name "Presentación.sxi". If the application which
created it uses ISO-8859-1 for its encoding, then the actual file name on disk
would look like this:
Character: P r e s e n t a c i ó n . s x i
Hex code: 50 72 65 73 65 6e 74 61 63 69 f3 6e 2e 73 78 69
However, if the application uses UTF-8, the actual file name on disk would look like this:
Character: P r e s e n t a c i ó n . s x i
Hex code: 50 72 65 73 65 6e 74 61 63 69 c3 b3 6e 2e 73 78 69
GLib uses UTF-8 for its strings, and GUI toolkits like GTK+ that use GLib do
the same thing. If a file name is obtained from the file system, for example, from
readdir(3) or from g_dir_read_name(), then the file name must be converted into
UTF-8 before displaying it to the user. The opposite case is when the user types
the name of a file and wishes to save it: the toolkit will give a string in UTF-8
encoding, which needs to be converted to the character set used for file names
before the file can be created with open(2) or fopen(3).
By default, GLib assumes that file names on disk are in UTF-8 encoding. This
is a valid assumption for file systems which were created relatively recently:
most applications use UTF-8 encoding for their strings, and that is also what
they use for the file names they create. However, older file systems may still
contain file names created in "older" encodings, such as ISO-8859-1. In this
case, for compatibility reasons, instruct GLib to use that
particular encoding for file names rather than UTF-8. This can be done by
specifying the encoding for file names in the G_FILENAME_ENCODING
environment variable. For example, if the installation uses ISO-8859-1 for file
names, put this in ~/.profile:
export G_FILENAME_ENCODING=ISO-8859-1
GLib provides the functions g_filename_to_utf8() and g_filename_from_utf8()
to perform the necessary conversions. These functions convert file names from
the encoding specified in G_FILENAME_ENCODING to UTF-8 and vice versa. Figure 1,
“Conversion between File Name Encodings” illustrates how these functions are
used to convert between UTF-8 and the encoding for file names in the file
system.
This section is a practical summary of the detailed description above. Use it as a checklist to make sure the application processes file name encodings correctly.
If a file name is obtained from the file system, from a function such as
readdir(3) or gtk_file_chooser_get_filename(), there is no need to perform any
conversion to pass that file name to functions like open(2), rename(2), or
fopen(3): those are "raw" file names which the file system understands.
If there is a need to display a file name, convert it to UTF-8 first by using
g_filename_to_utf8(). If the conversion fails, display a string like
"Unknown file name". Do not convert this string back into
the encoding used for file names if it has to be passed to the file system;
use the original file name instead. For example, the document window of a
word processor could display "Unknown file name" in its title bar but still
let the user save the file, as it would keep the raw file name internally.
A conversion failure can happen if the user has not set the
G_FILENAME_ENCODING environment variable even though the user has files
whose names are not encoded in UTF-8.
If the user interface lets the user type a file name for saving or
renaming, convert it to the encoding used for file names in the file system
by using g_filename_from_utf8(). Pass the converted file name to functions like
fopen(3). If conversion fails, ask the user to enter a different file name. This can
happen if the user types Japanese characters when G_FILENAME_ENCODING
is set to ISO-8859-1, for example.
gchar* g_convert (const gchar *str, gssize len, const gchar *to_codeset, const gchar *from_codeset, gsize *bytes_read, gsize *bytes_written, GError **error);
Converts a string from one character set to another.
Use g_iconv()
for streaming conversions[2].
str : |
the string to convert |
len : |
the length of the string, or -1 if the string is nul-terminated[1]. |
to_codeset : |
name of character set into which to convert str
|
from_codeset : |
character set of str .
|
bytes_read : |
location to store the number of bytes in the input string that were
successfully converted, or NULL . Even if
the conversion was successful, this may be less than len if there were partial characters at the
end of the input. If the error
G_CONVERT_ERROR_ILLEGAL_SEQUENCE
occurs, the value stored will be the byte offset after the last valid input
sequence.
|
bytes_written : |
the number of bytes stored in the output buffer (not including the terminating nul). |
error : |
location to store the error occurring, or NULL to ignore errors. Any of the errors in GConvertError may occur.
|
Returns : | If the conversion was successful, a newly allocated nul-terminated
string, which must be freed with g_free() .
Otherwise NULL and
error
will be set.
|
gchar* g_convert_with_fallback (const gchar *str, gssize len, const gchar *to_codeset, const gchar *from_codeset, gchar *fallback, gsize *bytes_read, gsize *bytes_written, GError **error);
Converts a string from one character set to another, possibly including fallback
sequences for characters not representable in the output. Note that it is not
guaranteed that the specification for the fallback sequences in
fallback
will be honored. Some systems may do an approximate conversion from
from_codeset
to to_codeset
in their iconv()
functions, in which case GLib will simply
return that approximate conversion.
Use g_iconv()
for streaming conversions[2].
str : |
the string to convert |
len : |
the length of the string, or -1 if the string is nul-terminated[1]. |
to_codeset : |
name of character set into which to convert str
|
from_codeset : |
character set of str .
|
fallback : |
UTF-8 string to use in place of character not present in the target
encoding. (The string must be representable in the target encoding). If
NULL , characters not in the target encoding will be represented
as Unicode escapes \uxxxx or \Uxxxxyyyy.
|
bytes_read : |
location to store the number of bytes in the input string that were
successfully converted, or NULL . Even if
the conversion was successful, this may be less than len if there were partial characters at the
end of the input.
|
bytes_written : |
the number of bytes stored in the output buffer (not including the terminating nul). |
error : |
location to store the error occurring, or NULL to ignore errors. Any of the errors in GConvertError may occur.
|
Returns : | If the conversion was successful, a newly allocated nul-terminated
string, which must be freed with g_free() .
Otherwise NULL and
error
will be set.
|
typedef struct _GIConv GIConv;
The GIConv struct wraps an iconv() conversion descriptor. It contains private
data and should only be accessed using the following functions.
gchar* g_convert_with_iconv (const gchar *str, gssize len, GIConv converter, gsize *bytes_read, gsize *bytes_written, GError **error);
Converts a string from one character set to another.
Use g_iconv()
for streaming conversions[2].
str : |
the string to convert |
len : |
the length of the string, or -1 if the string is nul-terminated[1]. |
converter : |
conversion descriptor from g_iconv_open()
|
bytes_read : |
location to store the number of bytes in the input string that were
successfully converted, or NULL . Even if
the conversion was successful, this may be less than len if there were partial characters at the
end of the input. If the error
G_CONVERT_ERROR_ILLEGAL_SEQUENCE
occurs, the value stored will be the byte offset after the last valid input
sequence.
|
bytes_written : |
the number of bytes stored in the output buffer (not including the terminating nul). |
error : |
location to store the error occurring, or NULL to ignore errors. Any of the errors in GConvertError may occur.
|
Returns : | If the conversion was successful, a newly allocated nul-terminated
string, which must be freed with g_free() .
Otherwise NULL and
error
will be set.
|
#define G_CONVERT_ERROR g_convert_error_quark()
Error domain for character set conversions. Errors in this domain will be from the GConvertError enumeration. See GError for information on error domains.
GIConv g_iconv_open (const gchar *to_codeset, const gchar *from_codeset);
Same as the standard UNIX routine iconv_open()
,
but may be implemented via libiconv on UNIX flavors that lack a native
implementation.
GLib provides g_convert()
and g_locale_to_utf8()
which are likely more convenient
than the raw iconv wrappers.
to_codeset : |
destination codeset |
from_codeset : |
source codeset |
Returns : | a "conversion descriptor", or (GIConv)-1 if opening the converter failed. |
size_t g_iconv (GIConv converter, gchar **inbuf, gsize *inbytes_left, gchar **outbuf, gsize *outbytes_left);
Same as the standard UNIX routine iconv()
, but may
be implemented via libiconv on UNIX flavors that lack a native implementation.
GLib provides g_convert()
and g_locale_to_utf8()
which are likely more convenient
than the raw iconv wrappers.
converter : |
conversion descriptor from g_iconv_open()
|
inbuf : |
bytes to convert |
inbytes_left : |
inout parameter, bytes remaining to convert in inbuf
|
outbuf : |
converted output bytes |
outbytes_left : |
inout parameter, bytes available to fill in outbuf
|
Returns : | count of non-reversible conversions, or -1 on error |
gint g_iconv_close (GIConv converter);
Same as the standard UNIX routine iconv_close()
,
but may be implemented via libiconv on UNIX flavors that lack a native
implementation. Should be called to clean up the conversion descriptor from g_iconv_open()
when conversion is completed.
GLib provides g_convert()
and g_locale_to_utf8()
which are likely more convenient
than the raw iconv wrappers.
converter : |
a conversion descriptor from g_iconv_open()
|
Returns : | -1 on error, 0 on success |
gchar* g_locale_to_utf8 (const gchar *opsysstring, gssize len, gsize *bytes_read, gsize *bytes_written, GError **error);
Converts a string which is in the encoding used for strings by the C runtime (usually the same as that used by the operating system) in the current locale into a UTF-8 string.
opsysstring : |
a string in the encoding of the current locale. On Windows this means the system codepage. |
len : |
the length of the string, or -1 if the string is nul-terminated[1]. |
bytes_read : |
location to store the number of bytes in the input string that were
successfully converted, or NULL . Even if
the conversion was successful, this may be less than len if there were partial characters at the
end of the input. If the error
G_CONVERT_ERROR_ILLEGAL_SEQUENCE
occurs, the value stored will be the byte offset after the last valid input
sequence.
|
bytes_written : |
the number of bytes stored in the output buffer (not including the terminating nul). |
error : |
location to store the error occurring, or NULL to ignore errors. Any of the errors in GConvertError may occur.
|
Returns : | The converted string, or NULL
on an error.
|
gchar* g_filename_to_utf8 (const gchar *opsysstring, gssize len, gsize *bytes_read, gsize *bytes_written, GError **error);
Converts a string which is in the encoding used by GLib for filenames into a UTF-8 string. Note that on Windows GLib uses UTF-8 for filenames.
opsysstring : |
a string in the encoding for filenames |
len : |
the length of the string, or -1 if the string is nul-terminated[1]. |
bytes_read : |
location to store the number of bytes in the input string that were
successfully converted, or NULL . Even if
the conversion was successful, this may be less than len if there were partial characters at the
end of the input. If the error
G_CONVERT_ERROR_ILLEGAL_SEQUENCE
occurs, the value stored will be the byte offset after the last valid input
sequence.
|
bytes_written : |
the number of bytes stored in the output buffer (not including the terminating nul). |
error : |
location to store the error occurring, or NULL to ignore errors. Any of the errors in GConvertError may occur.
|
Returns : | The converted string, or NULL
on an error.
|
gchar* g_filename_from_utf8 (const gchar *utf8string, gssize len, gsize *bytes_read, gsize *bytes_written, GError **error);
Converts a string from UTF-8 to the encoding GLib uses for filenames. Note that on Windows GLib uses UTF-8 for filenames.
utf8string : |
a UTF-8 encoded string. |
len : |
the length of the string, or -1 if the string is nul-terminated[1]. |
bytes_read : |
location to store the number of bytes in the input string that were
successfully converted, or NULL . Even if
the conversion was successful, this may be less than len if there were partial characters at the
end of the input. If the error
G_CONVERT_ERROR_ILLEGAL_SEQUENCE
occurs, the value stored will be the byte offset after the last valid input
sequence.
|
bytes_written : |
the number of bytes stored in the output buffer (not including the terminating nul). |
error : |
location to store the error occurring, or NULL to ignore errors. Any of the errors in GConvertError may occur.
|
Returns : | The converted string, or NULL
on an error.
|
gchar* g_filename_from_uri (const gchar *uri, gchar **hostname, GError **error);
Converts an escaped ASCII-encoded URI to a local filename in the encoding used for filenames.
uri : |
a uri describing a filename (escaped, encoded in ASCII). |
hostname : |
Location to store hostname for the URI, or NULL . If there is no hostname in the URI, NULL will be stored in this location.
|
error : |
location to store the error occurring, or NULL to ignore errors. Any of the errors in GConvertError may occur.
|
Returns : | a newly-allocated string holding the resulting filename, or NULL on an error.
|
gchar* g_filename_to_uri (const gchar *filename, const gchar *hostname, GError **error);
Converts an absolute filename to an escaped ASCII-encoded URI.
filename : |
an absolute filename specified in the GLib file name encoding, which is the on-disk file name bytes on Unix, and UTF-8 on Windows |
hostname : |
A UTF-8 encoded hostname, or NULL
for none.
|
error : |
location to store the error occurring, or NULL to ignore errors. Any of the errors in GConvertError may occur.
|
Returns : | a newly-allocated string holding the resulting URI, or NULL on an error.
|
gboolean g_get_filename_charsets (G_CONST_RETURN gchar ***charsets);
Determines the preferred character sets used for filenames. The first character
set from charsets is the filename encoding; the subsequent character sets are
used when trying to generate a displayable representation of a filename, see
g_filename_display_name().
On Unix, the character sets are determined by consulting the environment
variables G_FILENAME_ENCODING
and
G_BROKEN_FILENAMES
. On Windows, the character set
used in the GLib API is always UTF-8 and said environment variables have no
effect.
G_FILENAME_ENCODING
may be set to a comma-separated
list of character set names. The special token "locale
" is taken to mean the character set
for the current locale. If G_FILENAME_ENCODING
is not set, but G_BROKEN_FILENAMES
is, the character
set of the current locale is taken as the filename encoding. If neither
environment variable is set, UTF-8 is taken as the filename encoding, but the
character set of the current locale is also put in the list of encodings.
The returned charsets
belong to GLib and
must not be freed.
Note that on Unix, regardless of the locale character set or
G_FILENAME_ENCODING
value, the actual file names
present on a system might be in any random encoding or just gibberish.
charsets : |
return location for the NULL -terminated
list of encoding names
|
Returns : | TRUE if the filename encoding is UTF-8.
|
gchar* g_filename_display_name (const gchar *filename);
Converts a filename into a valid UTF-8 string. The conversion is not necessarily
reversible, so keep the original around and use the return value of
this function only for display purposes. Unlike g_filename_to_utf8()
, the result is guaranteed to
be non-NULL even if the filename actually isn't in the GLib file name encoding.
If the whole pathname of the file is known, use
g_filename_display_basename()
, since it allows
for location-based translation of filenames.
filename : |
a pathname hopefully in the GLib file name encoding |
Returns : | a newly allocated string containing a rendition of the filename in valid UTF-8 |
gchar* g_filename_display_basename (const gchar *filename);
Returns the display basename for the particular filename, guaranteed to be valid UTF-8. The display name might not be identical to the filename; for instance, there might be problems converting it to UTF-8, and some files can be translated in the display.
Pass the whole absolute pathname to this function so that translation of well-known locations can be done.
This function is preferred over g_filename_display_name()
if the whole
path is known, as it allows translation.
filename : |
an absolute pathname in the GLib file name encoding |
Returns : | a newly allocated string containing a rendition of the basename of the filename in valid UTF-8 |
gchar** g_uri_list_extract_uris (const gchar *uri_list);
Splits a URI list conforming to the text/uri-list mime type defined in RFC 2483 into individual URIs, discarding any comments. The URIs are not validated.
uri_list : |
a URI list |
Returns : | a newly allocated NULL -terminated list
of strings holding the individual URIs. The array should be freed with g_strfreev() .
|
gchar* g_locale_from_utf8 (const gchar *utf8string, gssize len, gsize *bytes_read, gsize *bytes_written, GError **error);
Converts a string from UTF-8 to the encoding used for strings by the C runtime (usually the same as that used by the operating system) in the current locale.
utf8string : |
a UTF-8 encoded string |
len : |
the length of the string, or -1 if the string is nul-terminated[1]. |
bytes_read : |
location to store the number of bytes in the input string that were
successfully converted, or NULL . Even if
the conversion was successful, this may be less than len if there were partial characters at the
end of the input. If the error
G_CONVERT_ERROR_ILLEGAL_SEQUENCE
occurs, the value stored will be the byte offset after the last valid input
sequence.
|
bytes_written : |
the number of bytes stored in the output buffer (not including the terminating nul). |
error : |
location to store the error occurring, or NULL to ignore errors. Any of the errors in GConvertError may occur.
|
Returns : | The converted string, or NULL
on an error.
|
typedef enum
{
G_CONVERT_ERROR_NO_CONVERSION,
G_CONVERT_ERROR_ILLEGAL_SEQUENCE,
G_CONVERT_ERROR_FAILED,
G_CONVERT_ERROR_PARTIAL_INPUT,
G_CONVERT_ERROR_BAD_URI,
G_CONVERT_ERROR_NOT_ABSOLUTE_PATH
} GConvertError;
Error codes returned by character set conversion routines.
G_CONVERT_ERROR_NO_CONVERSION |
Conversion between the requested character sets is not supported. |
G_CONVERT_ERROR_ILLEGAL_SEQUENCE |
Invalid byte sequence in conversion input. |
G_CONVERT_ERROR_FAILED |
Conversion failed for some reason. |
G_CONVERT_ERROR_PARTIAL_INPUT |
Partial character sequence at end of input. |
G_CONVERT_ERROR_BAD_URI |
URI is invalid. |
G_CONVERT_ERROR_NOT_ABSOLUTE_PATH |
Pathname is not an absolute path. |
gboolean g_get_charset (G_CONST_RETURN char **charset);
Obtains the character set for the current locale; this character
set might be used as an argument to g_convert()
, to convert from the current locale's
encoding to some other encoding. (Frequently
g_locale_to_utf8()
and g_locale_from_utf8()
are nice shortcuts, though.)
The return value is TRUE if the locale's encoding is UTF-8; in that case,
calling g_convert() can be avoided.
The string returned in charset
is not allocated, and should not be freed.
charset : |
return location for character set name |
Returns : | TRUE if the returned charset is UTF-8
|
[1] Note that
some encodings may allow null bytes to occur inside strings. In that case,
using -1 for the len
parameter is unsafe.
[2]
Despite the fact that
bytes_read
can return information about partial characters, the g_convert_...
functions are not generally suitable for
streaming. If the underlying converter being used maintains internal state,
then this will not be preserved across successive calls to
g_convert()
,
g_convert_with_iconv()
or
g_convert_with_fallback()
. (An example of this is
the GNU C converter for CP1255 which does not emit a base character until it
knows that the next character is not a mark that could combine with the base
character.)
© 2005-2007 Nokia |