This section describes the collation file format. A collation file may include comment lines, a title line, a collation sequence section, an Encodings section, and a Properties section.
In the collation file, spaces are generally ignored. Comment lines start with either % or --.
The first non-comment line must be of the form:
Collation label (name)
In this statement:
Item | Description |
---|---|
Collation | is a required keyword |
label | is the collation label and appears in the system tables as SYS.SYSCOLLATION.collation_label and SYS.SYSINFO.default_collation. The label must be no more than 10 characters, and must not be the same as one of the built-in collations. (In particular, do not leave the collation label unchanged.) |
name | is a descriptive term, used for documentation purposes. The name should be no more than 128 characters |
For example, the Shift-JIS collation file contains the following collation line, with label SJIS and name (Japanese Shift-JIS Encoding):
Collation SJIS (Japanese Shift-JIS Encoding)
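The title-line rules can be checked mechanically. The following Python sketch is an illustration only, not part of DBINIT: the function name `parse_title_line` is hypothetical, and it assumes that the keyword is matched case-sensitively and that the name is the text between the parentheses.

```python
import re

# Assumed shape: the keyword "Collation", a label, then a name in parentheses.
TITLE_RE = re.compile(r"^\s*Collation\s+(\S+)\s*\((.*)\)\s*$")

def parse_title_line(line):
    """Return (label, name) for a title line, or raise ValueError."""
    match = TITLE_RE.match(line)
    if match is None:
        raise ValueError("title line must have the form: Collation label (name)")
    label, name = match.group(1), match.group(2)
    if len(label) > 10:
        raise ValueError("label must be no more than 10 characters")
    if len(name) > 128:
        raise ValueError("name should be no more than 128 characters")
    return label, name

print(parse_title_line("Collation SJIS (Japanese Shift-JIS Encoding)"))
```

The sketch does not check the label against the built-in collation labels, because that list is not given in this section.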
After the title line, each non-comment line describes one position in the collation. The ordering of the lines determines the sort ordering used by the database, and also determines the result of comparisons. Characters on lines appearing higher in the file (closer to the beginning) sort before characters that appear later.
The form of each line in the sequence is:
[sort-position] : character
or
[sort-position] : character [lowercase uppercase]
where:
Argument | Description |
---|---|
sort-position | is optional and specifies the position at which the characters on that line will sort. Smaller numbers represent a lesser value, so will sort closer to the beginning of the sorted item. Typically, the sort-position is omitted, and the characters sort immediately following the characters from the previous sort position |
character | is the character whose sort-position is being specified |
lowercase | is optional and specifies the lowercase equivalent of the character. If not specified, then the character has no lowercase equivalent |
uppercase | is optional and specifies the uppercase equivalent of the character. If not specified, then the character has no uppercase equivalent |
Multiple characters may appear on one line, separated by commas (,). In this case, these characters are sorted and compared as if they were the same character.
Each character and sort-position is specified in one of the following ways:
Specification | Description |
---|---|
\dnnn | Decimal number, using digits 0-9 (such as \d001) |
\xhh | Hexadecimal number, using 2 digits 0-9 and/or letters a-f or A-F (such as \xB4) |
'c' | Any character in place of c (such as ',') |
c | Any character other than quote ('), backslash (\), colon (:), or comma (,). These characters must use one of the previous forms |
The following are some sample lines for a collation:
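As an illustration, the following Python sketch parses a few hypothetical sample lines written in the forms described above. The names `parse_spec` and `parse_sequence_line` are invented for this example, and the sketch assumes that comment lines have already been removed and that a line uses either a comma-separated character list or a lowercase/uppercase pair, not both.

```python
import re

# One token is a quoted character, a \dnnn or \xhh escape, a separator, or a plain character.
# Whitespace is skipped, matching the rule that spaces are generally ignored.
TOKEN_RE = re.compile(r"'.'|\\d[0-9]+|\\x[0-9A-Fa-f]{2}|[:,]|[^\s'\\:,]")

def parse_spec(token):
    """Convert one character or sort-position specification to an integer value."""
    if token.startswith("\\d"):
        return int(token[2:], 10)
    if token.startswith("\\x"):
        return int(token[2:], 16)
    if len(token) == 3 and token[0] == "'" and token[2] == "'":
        return ord(token[1])
    if len(token) == 1 and token not in "'\\:,":
        return ord(token)
    raise ValueError(f"unrecognized specification: {token!r}")

def parse_sequence_line(line):
    """Return (sort_position, characters, lowercase, uppercase) for one line."""
    tokens = TOKEN_RE.findall(line)
    colon = tokens.index(":")  # raises ValueError if the separator is missing
    before, after = tokens[:colon], tokens[colon + 1:]
    if len(before) > 1 or not after:
        raise ValueError(f"malformed sequence line: {line!r}")
    position = parse_spec(before[0]) if before else None
    if "," in after:
        # Several characters that sort and compare as the same character.
        return position, [parse_spec(t) for t in after if t != ","], None, None
    if len(after) == 1:
        return position, [parse_spec(after[0])], None, None
    if len(after) == 3:
        return position, [parse_spec(after[0])], parse_spec(after[1]), parse_spec(after[2])
    raise ValueError(f"malformed sequence line: {line!r}")

# Hypothetical sample lines, not taken from a shipped collation file.
for sample in [": A a A", r"\d010 : \xB4", ": ',', ;"]:
    print(sample, "->", parse_sequence_line(sample))
```

The quoted form `':'` is a different token from the bare separator `:`, which is why the sketch can locate the separator with a simple list search.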
For databases using case-insensitive sorting and comparing (no -c specified on the DBINIT command line), the lowercase and uppercase mappings are used to find the lowercase and uppercase characters that will be sorted together.
For multibyte character sets, only the first byte of each character is listed in the collation sequence. All characters with the same first byte sort together and are ordered according to the value of the second byte. For example, the following is part of a Shift-JIS collation file:
: \xfb
: \xfc
: \xfd
In this collation, all characters with first byte \xfc come after all characters with first byte \xfb and before all characters with first byte \xfd. The two-byte character \xfc \x01 would be ordered before the two-byte character \xfc \x02.
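The ordering rule for double-byte characters can be modeled as a two-part sort key. The Python sketch below is a simplified model, not the database's actual comparison code; `make_sort_key` is a hypothetical helper, and the sketch assumes every first byte being compared appears in the collation sequence.

```python
def make_sort_key(first_byte_order):
    """Build a key function from the first bytes in collation order."""
    position = {byte: index for index, byte in enumerate(first_byte_order)}

    def sort_key(char_bytes):
        first = char_bytes[0]
        # A single-byte character has no second byte; -1 sorts it before any double-byte one.
        second = char_bytes[1] if len(char_bytes) > 1 else -1
        return (position[first], second)

    return sort_key

# The three first bytes from the Shift-JIS fragment above, in file order.
key = make_sort_key([0xFB, 0xFC, 0xFD])
chars = [b"\xfd\x01", b"\xfc\x02", b"\xfb\x7f", b"\xfc\x01"]
ordered = sorted(chars, key=key)
print(ordered)
```

Sorting with this key places every character with first byte \xfb first, then \xfc \x01 before \xfc \x02, then the characters with first byte \xfd.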
Any characters omitted from the collation will be added to the collation at the position equal to their binary value. DBINIT issues a message for each omitted character. However, it is recommended that any collation contain all 256 characters (first bytes).
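A collation can be checked for completeness before it is handed to DBINIT. The following Python sketch assumes the listed first bytes have already been collected as integers; `missing_first_bytes` is a hypothetical helper name.

```python
def missing_first_bytes(listed_bytes):
    """Return the byte values 0-255 that the collation sequence does not list."""
    return sorted(set(range(256)) - set(listed_bytes))

# Example: a sequence that lists everything except \x41 and \xfc.
listed = [b for b in range(256) if b not in (0x41, 0xFC)]
print([f"\\x{b:02x}" for b in missing_first_bytes(listed)])
```

An empty result means that all 256 first bytes are present, as recommended.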
The Encodings section is optional, and follows the collation sequence. It is not useful for single-byte character sets.
The Encodings section lists those combinations of bytes which are valid characters. The format of the section may be described by example.
The Shift-JIS Encodings section is as follows:
Encodings:
[\x00-\x80,\xa0-\xdf,\xf0-\xff]
[\x81-\x9f,\xe0-\xef][\x00-\xff]
The first line following the section title lists valid single-byte characters. The square brackets enclose a comma-separated list of ranges. Each range is listed as a hyphen-separated pair of values. In the Shift-JIS collation, values \x00 to \x80 are valid single-byte characters, but \x81 is not a valid single-byte character.
The second line following the section title lists valid double-byte characters. Any combination of a byte from the ranges in the first pair of brackets with a byte from the ranges in the second pair is a valid character. Therefore \x81 \x00 is a valid double-byte character, but \xd0 \x00 is not.
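The Encodings rules above can be sketched as a small validator. This Python example is illustrative only: the function names are hypothetical, and it assumes that every value in the section uses the \xhh form, as in the Shift-JIS example.

```python
import re

def parse_ranges(bracket_body):
    """Parse the text inside one pair of brackets into (low, high) integer pairs."""
    ranges = []
    for part in bracket_body.split(","):
        low, _, high = part.strip().partition("-")
        low_value = int(low[2:], 16)  # drop the leading "\x"
        high_value = int(high[2:], 16) if high else low_value
        ranges.append((low_value, high_value))
    return ranges

def parse_encoding_line(line):
    """Return one list of ranges per bracket group, that is, per byte of the character."""
    return [parse_ranges(body) for body in re.findall(r"\[([^\]]*)\]", line)]

def is_valid_character(char_bytes, encoding_lines):
    """True if the byte sequence matches any line of the Encodings section."""
    for groups in encoding_lines:
        if len(groups) == len(char_bytes) and all(
            any(low <= byte <= high for low, high in ranges)
            for byte, ranges in zip(char_bytes, groups)
        ):
            return True
    return False

SJIS_ENCODINGS = [
    parse_encoding_line(r"[\x00-\x80,\xa0-\xdf,\xf0-\xff]"),
    parse_encoding_line(r"[\x81-\x9f,\xe0-\xef][\x00-\xff]"),
]

for candidate in (b"\x80", b"\x81", b"\x81\x00", b"\xd0\x00"):
    print(candidate, is_valid_character(candidate, SJIS_ENCODINGS))
```

The four candidates correspond to the cases discussed above: \x80 and \x81 \x00 are accepted, while \x81 alone and \xd0 \x00 are rejected.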
The Properties section is optional, and follows the Encodings section. It is not useful for single-byte character sets.
If a Properties section is supplied, an Encodings section must be supplied also.
The Properties section lists the first-byte values that represent alphabetic characters, digits, or spaces.
The Shift-JIS Properties section is as follows:
Properties:
space: [\x09-\x0d,\x20]
digit: [\x30-\x39]
alpha: [\x41-\x5a,\x61-\x7a,\x81-\x9f,\xe0-\xef]
This indicates that characters with first bytes \x09 to \x0d, as well as \x20, are to be treated as space characters; digits are found in the range \x30 to \x39 inclusive; and alphabetic characters are found in the four ranges \x41-\x5a, \x61-\x7a, \x81-\x9f, and \xe0-\xef.
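The classification described by the Shift-JIS Properties section can be expressed as a lookup over first-byte ranges. The Python sketch below transcribes the ranges from the example by hand; `classify_first_byte` is a hypothetical helper, not a database function.

```python
# Ranges copied from the Shift-JIS Properties section above, as (low, high) pairs.
PROPERTIES = {
    "space": [(0x09, 0x0D), (0x20, 0x20)],
    "digit": [(0x30, 0x39)],
    "alpha": [(0x41, 0x5A), (0x61, 0x7A), (0x81, 0x9F), (0xE0, 0xEF)],
}

def classify_first_byte(byte):
    """Return 'space', 'digit', or 'alpha' for a first byte, or None if unlisted."""
    for name, ranges in PROPERTIES.items():
        if any(low <= byte <= high for low, high in ranges):
            return name
    return None

for value in (0x20, 0x35, 0x41, 0x81, 0x2C):
    print(f"\\x{value:02x}", classify_first_byte(value))
```

Note that the lead bytes of double-byte characters (\x81-\x9f and \xe0-\xef) are classified as alphabetic, while a byte such as \x2c (comma) has no property.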