Chapter 22. Database Collations
Collation overview

When you create a SQL Anywhere database, you specify a collating sequence, or collation, for the database to use. A collation is a sorting order for the characters in the database. Whenever the database compares strings, sorts strings, or carries out other string operations such as case conversion, it does so using the collation. The database carries out sorting and string comparison when statements such as the following are submitted:

Queries with an ORDER BY clause.
Expressions that use string functions, such as LOCATE, SIMILAR, and SOUNDEX.
Conditions using the LIKE keyword.

The database also uses character sets in identifiers (column names and so on). In deciding whether a string is a valid or unique identifier, the database uses the database collation.

Character sets in applications and databases

For character strings to be both sorted properly by the database and displayed properly in an application, it is important that the application and the database are using the same, or at least compatible, character sets. This section describes how the responsibilities for handling characters are divided between the database and the application.

Important components

Operating systems separate the different aspects of character storage and display. The following aspects are treated distinctly:

Each operating system has a character set available to it. A character set is a set of symbols, including letters, digits, spaces and other symbols.
Each operating system employs a character encoding, in which each character is mapped onto one or more bytes of information, typically represented as hexadecimal numbers.
Characters are displayed on a screen using a font, which is a mapping between characters in the character set and their appearance.
Operating systems also use a keyboard mapping to map keys or key combinations on the keyboard to characters in the character set.
A collation is a combination of a character encoding (a map between characters and hexadecimal numbers) and a sorting order for the characters.

The database engine receives strings as a stream of hexadecimal numbers, which it associates with characters and sorts according to the collation specified when the database was created.
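As a rough illustration of the encoding step, the following Python sketch shows a string turned into the hexadecimal values that a code page assigns to its characters. Python is used here only for illustration and is not part of SQL Anywhere; the choice of code page 850 and the helper name to_hex are assumptions made for the example.

```python
def to_hex(text: str, code_page: str = "cp850") -> str:
    """Return the bytes of text in the given code page as two-digit hex values."""
    return text.encode(code_page).hex(" ")


# Each character of an ASCII-range string maps to one two-digit hexadecimal number.
print(to_hex("Abc"))  # 41 62 63
```

A receiver that knows the code page can map the same numbers back to characters; without that knowledge the numbers carry no meaning of their own.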

Character strings

The operating system of the computer on which the client application is running handles the following aspects of character strings:

Which character is stored when a particular key on the keyboard is typed.
What a character looks like on your computer screen.
What characters are available to the application.
What hexadecimal encoding is stored for each character.

Notes

For the database engine and the client application to handle and display information consistently, the operating system and the database must use compatible character sets and character set encodings. If the application itself sorts or compares strings, the collation sequences must also be compatible.
The ODBC interface provides mechanisms for translating strings on their way to the database engine, and on the way back, using a translation driver. For more information, see "Files needed for ODBC connections".
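The effect of incompatible character sets can be sketched in a few lines of Python (used only for illustration). The example assumes one side writes a character using code page 850 and the other side reads the same byte using Windows code page 1252; the two code pages are chosen for the example and are not taken from this section.

```python
# One side encodes an accented character using code page 850.
raw = "é".encode("cp850")

# A reader that uses the same code page recovers the original character.
same_code_page = raw.decode("cp850")

# A reader that assumes code page 1252 sees a different character for the same byte.
other_code_page = raw.decode("cp1252")

print(raw.hex())                  # 82
print(same_code_page == "é")      # True
print(other_code_page == "é")     # False
```

The stored byte is identical in both cases; only the interpretation differs, which is why a translation step or a shared character set is needed.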

Character encodings

There are several different systems for encoding character sets as hexadecimal numbers. This section lists some of the more common ones.

Single-byte character sets and code pages

Many languages have few enough characters to be represented in a single-byte character set. In such a character set, each character is represented by a single byte, written as a two-digit hexadecimal number.

At most 256 characters can be represented in a single-byte character set. No one single-byte character set can hold all the characters, including accented characters, that are used internationally. IBM developed a set of code pages, each of which describes a set of characters appropriate for one national language. For example, code page 869 contains the Greek character set, and code page 850 contains an international character set suitable for representing many characters in a variety of languages.
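The following Python sketch (illustration only) checks whether a character can be represented in a given code page, using the codecs that ship with Python for code pages 869 and 850. The helper name fits is hypothetical.

```python
def fits(text: str, code_page: str) -> bool:
    """Return True if every character of text exists in the code page."""
    try:
        text.encode(code_page)
        return True
    except UnicodeEncodeError:
        return False


print(fits("λ", "cp869"))          # True: Greek code page
print(fits("λ", "cp850"))          # False: no Greek lambda in code page 850
print(len("λ".encode("cp869")))    # 1: one byte per character
```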

SQL Anywhere supports a set of single-byte collations (code pages and collation orderings) suitable for many languages of European origin.

For information about choosing a single-byte collation for your database, see "Choosing a character set".

Multibyte character sets

Some languages have many more than 256 characters, and these can be represented in multibyte character sets. In addition, there are character sets that use the much larger number of characters available in a multibyte representation to represent characters from many languages in a single, more universal, character set.

Multibyte character sets are of two types. Some are variable width, in which some characters are single-byte characters, others are double-byte, and so on. Others are fixed width, in which all characters in the set have the same number of bytes. SQL Anywhere supports variable-width character sets.
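The difference between variable-width and fixed-width encodings can be seen in the following Python sketch (illustration only). Shift-JIS and UTF-32 are used as examples of each type; they are assumptions chosen for the sketch, not a statement of which encodings SQL Anywhere supports. The helper name byte_widths is hypothetical.

```python
def byte_widths(text: str, encoding: str) -> list[int]:
    """Return the number of bytes each character occupies in the encoding."""
    return [len(ch.encode(encoding)) for ch in text]


# Variable width: the Latin letter takes one byte, the kanji takes two.
print(byte_widths("A漢", "shift_jis"))   # [1, 2]

# Fixed width: every character takes the same number of bytes.
print(byte_widths("A漢", "utf-32-be"))   # [4, 4]
```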

For information on the multibyte character sets supported by SQL Anywhere, see "Support for multibyte character sets".

Displaying your current character settings

Each operating system has its own system for handling character sets, encodings, and collation sequences. To display the current settings on your operating system:

In DOS or OS/2, type chcp at the command prompt to display the current code page.
In Windows and Windows NT, see the International Settings in the Control Panel.

The operating system stores characters, and sends them to a database, as a sequence of hexadecimal numbers.

Collation sequences

Roughly speaking, a collation sequence is a sorting order for characters in a character set encoding or code page. The collation sequence is based on the encoded value of the characters.

The collation sequence includes the notion of alphabetic ordering of letters, and extends it to include all characters in the character set, including digits and space characters.

The collation sequence includes more information than a simple ordering. Each character is assigned to a sort position, and the sort position defines the position of that character in any comparison or sorting of character strings.
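The idea of a sort position can be sketched in Python (illustration only). The table below is a toy collation invented for the example: it covers only unaccented letters and gives the lowercase and uppercase forms of a letter the same position. It does not reproduce any collation shipped with SQL Anywhere.

```python
import string

# Hypothetical sort positions: a=0, b=1, ...; each uppercase letter shares
# the position of its lowercase form.
SORT_POSITION = {}
for position, letter in enumerate(string.ascii_lowercase):
    SORT_POSITION[letter] = position
    SORT_POSITION[letter.upper()] = position


def sort_key(text: str) -> list[int]:
    """Translate a string into the sort positions of its characters."""
    return [SORT_POSITION[ch] for ch in text]


words = ["banana", "Cherry", "apple"]

# Ordering by raw encoded values places every uppercase letter before lowercase.
by_encoded_value = sorted(words, key=lambda w: w.encode("cp850"))

# Ordering by sort position gives the expected alphabetic order.
by_sort_position = sorted(words, key=sort_key)

print(by_encoded_value)   # ['Cherry', 'apple', 'banana']
print(by_sort_position)   # ['apple', 'banana', 'Cherry']
```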

Associating more than one character with each sort position

More than one character can be associated with each sort position. This is useful if you wish, for example, to treat an accented character the same as the character without an accent. Two characters with the same sort position are considered to be identical in all ways by the database. Therefore, if a collation assigned the characters a and e to the same sort position, then a query with the following search condition:

     WHERE col1 = 'want'

is satisfied by a row for which col1 contains the entry "went".

At each sort position, lower- and uppercase forms of a character can be indicated. For case-sensitive databases, the lower- and uppercase characters are not treated as equivalent. For case-insensitive databases, the lower- and uppercase versions of the character are considered equivalent.
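The comparison rules in this section can be modeled with a short Python sketch (illustration only). The functions build_positions, sort_key, and equal_under are hypothetical names, and the table handles only unaccented letters; the sketch models the behavior described here, not the database engine's actual implementation.

```python
import string


def build_positions(equivalents=()):
    """Give each letter a sort position; letters in one group share a position."""
    positions = {ch: i for i, ch in enumerate(string.ascii_lowercase)}
    for group in equivalents:
        shared = min(positions[ch] for ch in group)
        for ch in group:
            positions[ch] = shared
    return positions


def sort_key(positions, text, case_sensitive):
    """Map text to sort positions, keeping case only when it is significant."""
    if case_sensitive:
        return [(positions[ch.lower()], ch.isupper()) for ch in text]
    return [positions[ch.lower()] for ch in text]


def equal_under(positions, left, right, case_sensitive=False):
    """Compare two strings by sort position rather than by character."""
    return (sort_key(positions, left, case_sensitive)
            == sort_key(positions, right, case_sensitive))


plain = build_positions()
merged = build_positions(equivalents=["ae"])  # a and e share one sort position

print(equal_under(merged, "want", "went"))                      # True
print(equal_under(plain, "want", "went"))                       # False
print(equal_under(plain, "Want", "want"))                       # True
print(equal_under(plain, "Want", "want", case_sensitive=True))  # False
```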