Chapter 22. Database Collations

Collation overview

When you create a SQL Anywhere database, you specify a collating sequence or collation to be used by the database. A collation is a sorting order for characters in the database. Whenever the database compares strings, sorts strings, or carries out other string operations such as case conversion, it does so using the collation sequence. The database carries out sorting and string comparison when statements such as the following are submitted:

The database also uses character sets in identifiers (column names and so on). When deciding whether a string is a valid or unique identifier, the database uses the database collation.


Character sets in applications and databases

For character strings to be both sorted properly by the database and displayed properly in an application, the application and the database must use the same, or at least compatible, character sets. This section describes how the responsibilities for handling characters are divided between the database and the application.

Important components

Different aspects of character storage and display are separated out by operating systems. The following aspects are treated distinctly:

The database engine receives strings as a stream of hexadecimal numbers, which it associates with characters and sorts according to the collation specified when the database was created.

Character strings

It is up to the operating system of the computer on which the client application is running to handle the following aspects of character strings:

Notes


Character encodings

There are several different systems for encoding character sets as hexadecimal numbers. This section describes some of the more common ones.

Single-byte character sets and code pages

Many languages have few enough characters to be represented in a single-byte character set. In such a character set, each character is represented by a single two-digit hexadecimal number.
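The one-byte-per-character idea can be sketched in Python, whose standard codecs include several IBM code pages. The helper name hex_codes is hypothetical and is not part of SQL Anywhere; the sketch only illustrates how each character maps to a two-digit hexadecimal number.

```python
def hex_codes(text, codec):
    """Return the two-digit hexadecimal code of each character in a single-byte codec."""
    codes = []
    for ch in text:
        encoded = ch.encode(codec)
        # A single-byte character set must yield exactly one byte per character.
        if len(encoded) != 1:
            raise ValueError(f"{ch!r} is not a single byte in {codec}")
        codes.append(f"{encoded[0]:02X}")
    return codes


# The same character can have different codes in different character sets.
print(hex_codes("Aé", "cp850"))    # ['41', '82']
print(hex_codes("Aé", "latin-1"))  # ['41', 'E9']
```

The second call shows why the application and the database need compatible character sets: the accented character has a different code in each encoding.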

At most 256 characters can be represented in a single-byte set, so no one single-byte character set can hold all the characters, including accented characters, that are in use internationally. IBM developed a set of code pages in which each code page describes a set of characters appropriate for one national language. For example, code page 869 contains the Greek character set, while code page 850 contains an international character set suitable for representing many characters in a variety of languages.
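As a rough illustration of how code pages differ, the following Python sketch uses the cp869 and cp850 codecs from the Python standard library. The helper can_encode is a hypothetical name introduced here; nothing in the sketch describes how SQL Anywhere itself implements code pages.

```python
def can_encode(text, codec):
    """Return True if every character of text exists in the given code page."""
    try:
        text.encode(codec)
        return True
    except UnicodeEncodeError:
        return False


# The Greek letter alpha exists in code page 869 but not in code page 850.
print(can_encode("α", "cp869"))
print(can_encode("α", "cp850"))

# One byte, two meanings: the byte that stores alpha in code page 869
# is read as a different character under code page 850.
alpha_byte = "α".encode("cp869")
reinterpreted = alpha_byte.decode("cp850")
print(alpha_byte.hex(), repr(reinterpreted))
```

If a client writes bytes under one code page and they are read back under another, the stored numbers are unchanged but the characters they stand for are not.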

SQL Anywhere supports a set of single-byte collations (code pages and collation orderings) suitable for many languages of European origin.

For information about choosing a single-byte collation for your database, see "Choosing a character set".

Multibyte character sets

Some languages have many more than 256 characters, and these can be represented in multibyte character sets. In addition, there are character sets that use the much larger number of characters available in a multibyte representation to represent characters from many languages in a single, more universal, character set.

Multibyte character sets are of two types. Some are variable width, in which some characters are single-byte, others are double-byte, and so on. Others are fixed width, in which every character in the set occupies the same number of bytes. SQL Anywhere supports variable-width character sets.
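The difference between variable-width and fixed-width encodings can be seen by counting the bytes each character occupies. The sketch below uses Python's built-in shift_jis, utf-8, and utf-32-be codecs purely as illustrations; it does not imply that these are the multibyte character sets SQL Anywhere supports. The helper byte_widths is a hypothetical name.

```python
def byte_widths(text, codec):
    """Return the number of bytes each character of text occupies in the codec."""
    return [len(ch.encode(codec)) for ch in text]


sample = "A日"  # one Latin letter, one Japanese character

# Variable width: the Latin letter takes one byte, the Japanese character more.
print(byte_widths(sample, "shift_jis"))  # [1, 2]
print(byte_widths(sample, "utf-8"))      # [1, 3]

# Fixed width: every character takes the same number of bytes.
print(byte_widths(sample, "utf-32-be"))  # [4, 4]
```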

For information on the multibyte character sets supported by SQL Anywhere, see "Support for multibyte character sets".


Displaying your current character settings

Each operating system has its own system for handling character sets, encodings, and collation sequences. To find out information about the current settings on your operating system, you can:

Character sets are stored by the operating system and sent to a database as a set of hexadecimal numbers.
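One way to inspect the encoding a client program sees is to ask the language runtime. The following Python sketch is an assumption about a typical client environment, not a SQL Anywhere utility. It reports the encodings Python picks up from the operating system, and the function name current_character_settings is hypothetical.

```python
import locale
import sys


def current_character_settings():
    """Collect the character encodings Python reports for this system."""
    return {
        # Encoding the operating system prefers for text data.
        "preferred_encoding": locale.getpreferredencoding(False),
        # Encoding used to convert file names to and from bytes.
        "filesystem_encoding": sys.getfilesystemencoding(),
    }


for name, value in current_character_settings().items():
    print(f"{name}: {value}")
```

The values printed depend on the machine, so no particular output is shown here.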


Collation sequences

Roughly speaking, a collation sequence is a sorting order for characters in a character set encoding or code page. The collation sequence is based on the encoded value of the characters.

The collation sequence includes the notion of alphabetic ordering of letters, and extends it to include all characters in the character set, including digits and space characters.

The collation sequence includes more information than a simple ordering. Each character is assigned to a sort position, and the sort position defines the position of that character in any comparison or sorting of character strings.
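A minimal sketch of sorting by sort position follows, assuming a made-up collation in which the space character comes first, then digits, then lowercase letters. The table SORT_POSITION and the function sort_key are hypothetical and do not reproduce any collation shipped with SQL Anywhere.

```python
# Hypothetical collation: space first, then digits, then lowercase letters.
ORDER = " 0123456789abcdefghijklmnopqrstuvwxyz"
SORT_POSITION = {ch: pos for pos, ch in enumerate(ORDER)}


def sort_key(s):
    """Translate a string into the list of sort positions of its characters."""
    return [SORT_POSITION[ch] for ch in s]


words = ["b2", "a b", "ab", "a1"]
# Strings are compared position by position, not by raw character code.
print(sorted(words, key=sort_key))  # ['a b', 'a1', 'ab', 'b2']
```

Because every comparison goes through the table, changing the table changes the sort order without touching the stored data.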

Associating more than one character with each sort position

More than one character can be associated with each sort position. This is useful if you wish, for example, to treat an accented character the same as the character without an accent. Two characters with the same sort position are considered to be identical in all ways by the database. Therefore, if a collation assigned the characters a and e to the same sort position, then a query with the following search condition:

     WHERE col1 = 'want'

is satisfied by a row for which col1 contains the entry "went".

At each sort position, lower- and uppercase forms of a character can be indicated. For case-sensitive databases, the lower- and uppercase characters are not treated as equivalent. For case-insensitive databases, the lower- and uppercase versions of the character are considered equivalent.
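The effect of shared sort positions and of case sensitivity on equality can be sketched as follows. The table POSITIONS is a hypothetical collation covering only a handful of letters, with a and e deliberately given the same position to mirror the example above. The function names are illustrative assumptions, not SQL Anywhere internals.

```python
# Hypothetical collation fragment: 'a' and 'e' share sort position 1.
POSITIONS = {"a": 1, "e": 1, "n": 2, "t": 3, "w": 4}


def collation_key(s, case_sensitive):
    """Map each character to its sort position, keeping case only if required."""
    key = []
    for ch in s:
        pos = POSITIONS[ch.lower()]
        # In a case-sensitive database, upper- and lowercase forms stay distinct.
        key.append((pos, ch.isupper()) if case_sensitive else pos)
    return key


def collation_equal(left, right, case_sensitive=False):
    """Compare two strings the way the hypothetical collation would."""
    return collation_key(left, case_sensitive) == collation_key(right, case_sensitive)


print(collation_equal("want", "went"))                       # True
print(collation_equal("Want", "went"))                       # True
print(collation_equal("Want", "went", case_sensitive=True))  # False
```

The first result corresponds to the search condition above: with a and e at the same sort position, "want" and "went" compare as equal.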
