Monday, February 12, 2007

Choosing a page encoding

Choosing a page encoding Choose UTF-8 or another Unicode encoding for all content.

When selecting a page encoding, consider both current and future localization requirements, and the benefits of using the same encoding across all pages and all languages. These considerations make the use of Unicode an attractive choice for the following reasons:

Unicode supports many languages, enabling the use of a single encoding across all pages and forms, regardless of language.

Unicode allows many more languages to be mixed on a single page than almost any other choice. If the set of languages to be represented on a single page cannot be represented directly by any single native encoding (such as ISO-8859-1, Shift-JIS, etc.), then Unicode is almost certainly the best choice.

For dynamically-generated pages, a single encoding for all pages eliminates the need for server-side logic to determine the character encoding for each page served.

For interactive applications using forms, a single encoding eliminates the need for server-side logic to determine the character encoding of incoming form data.

Unicode enables a form in one language (e.g. English) to accept input in a different language (e.g. Chinese).

Unicode (UTF-8) forms will be easier to migrate to XForms.

UTF-8 and UTF-16 are both Unicode encodings. Since support for Unicode is currently limited to UTF-8 in many user agents, UTF-8 is usually the appropriate Unicode encoding. However, as user agent support for UTF-16 expands, UTF-16 will become an increasingly viable alternative.

Although there are other multi-script encodings (such as ISO-2022 and GB18030), Unicode generally provides the best combination of user agent and script support.

If you don't use a Unicode encoding, select an encoding that best supports the languages / characters to be included in the page text.

There are some situations where selecting a Unicode encoding is not practical. If content is encoded in a native encoding (legacy content or content originating from an external source) and the system lacks functionality for converting content between encodings, Unicode may greatly complicate implementation. If such a site is only required to serve single-script pages (containing languages that can be represented by a single native encoding), then the cost of using a Unicode encoding may outweigh the benefits. In this case, a native encoding (such as ISO-8859-1, Shift-JIS, etc.) may be a better choice.

Be sure to select an encoding that covers most of the characters required for the content, and (if it is a form) all of the characters that must be accepted as input.

Check that user agents (all agents that must render the page) adequately support the page encoding that you have selected. If not, you might need to use a more widely supported encoding to achieve an adequate degree of user agent support.

Not all user agents support all page encodings, so it is important to understand which user agents must be able to render the page, and be sure that they have adequate support for the page encoding you have selected.

In general, user agents are most likely to support the commonly-used native character encodings for the major languages used on the web. Support for less commonly used encodings depends on the user agent. Older user agents, or user agents that operate under severe memory limitations, may not support UTF-8.

It is important to note that support for a given encoding does not necessarily imply support for all writing systems that encoding supports. For example, a user agent might support UTF-8, but not correctly display bidirectional Arabic text encoded in UTF-8. To display a page correctly, a user agents must support both the page encoding and the writing system.

No comments: