Wintertree Software Inc.

Dictionaries

 




How to create a dictionary

Wintertree Software sells dictionaries in a number of languages, plus medical and legal dictionaries for English. Sometimes customers need to support languages for which Wintertree Software does not sell dictionaries. This article offers advice for customers who want to create their own dictionaries.

In a nutshell, the process is simple enough:

  1. Develop or acquire a word list in the target language

  2. Compress the word list

Acquiring a word list

A word list is a set of one or more text files containing the collection of words which will be used to validate spelling.

Unfortunately, acquiring a suitable word list can be difficult. If you know of another product that has a spelling checker for your target language, the manufacturer may be willing to license their word list to you. The publisher of a paper dictionary may have an electronic word list available for sale or licensing. Public domain or otherwise freely usable word lists can sometimes be downloaded from repository web sites, although the quality of such lists may be in question.

Developing your own word list is another possibility. (Note that developing a word list is more than simply "typing in the words from a dictionary," as that is likely a copyright violation.) The approach used by Wintertree Software involves amassing a large volume of text in the target language (ideally, millions of words from a variety of sources), performing statistical analysis to identify common words, and working with linguistic specialists in the target language to review and enhance the resulting list. The process is long and expensive.
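The "statistical analysis" step can be pictured as a simple frequency count over the collected text. The sketch below is only an illustration of that idea, not Wintertree Software's actual tooling: the function name `common_words`, the `min_count` threshold, and the rough letters-only tokenizer are all assumptions made for this example.

```python
import re
from collections import Counter

def common_words(text, min_count=2):
    """Return words seen at least min_count times, most frequent first.

    Hypothetical helper: the tokenizer and threshold are illustrative only.
    """
    # Runs of Unicode letters; real languages may also need apostrophes or hyphens.
    # Lower-casing discards capitalization, so a reviewer must restore it
    # for words that are always capitalized (proper nouns, for example).
    tokens = re.findall(r"[^\W\d_]+", text.lower())
    counts = Counter(tokens)
    return [word for word, n in counts.most_common() if n >= min_count]

sample = "The cat sat. The dog sat. A bird flew."
candidates = common_words(sample)
```

A list produced this way is only a starting point. Rare but valid words fall below any threshold, and frequent misspellings rise above it, which is why review by language specialists remains part of the process.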

Word list file format

The complete word list is stored in a set of text files, one word per line, with no extraneous spaces or punctuation before or after each word. The entire word list may be contained in a single text file. If the list is split across several files, all words beginning with the same first three letters must appear in the same file.
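These layout rules are easy to check mechanically before compression. The following is a minimal sketch under stated assumptions: the helper names `malformed_lines` and `split_prefixes` are invented for this example, words shorter than three characters are treated as their own prefix, and prefixes are compared exactly (case-sensitively).

```python
def malformed_lines(lines):
    """Return lines that are empty or carry leading/trailing whitespace."""
    return [line for line in lines if not line or line != line.strip()]

def split_prefixes(files, prefix_len=3):
    """Return prefixes that occur in more than one file.

    files maps a file name to the list of words it contains.
    """
    owner = {}       # prefix -> first file in which it was seen
    split = set()
    for name, words in files.items():
        for word in words:
            prefix = word[:prefix_len]
            if owner.setdefault(prefix, name) != name:
                split.add(prefix)
    return sorted(split)

files = {
    "a.txt": ["cat", "catalog", "dog"],
    "b.txt": ["cattle", "zebra"],
}
problems = split_prefixes(files)   # "cat..." words live in both files
```

An empty result from both checks means the files satisfy the one-word-per-line and three-letter grouping rules described above.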

Wintertree Software's spelling engines are case-sensitive, so capitalization should be used where required by the language. Words that may be either capitalized or uncapitalized should be in lower case. If a word appears in the list in capitalized form only, the spelling checker will report the uncapitalized form as a misspelling.
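The capitalization rule can be pictured with a toy lookup. This is not the actual spelling engine, only a sketch of the behavior described above; in particular, accepting the capitalized form of a lower-case entry is an assumption inferred from the advice to list such words in lower case.

```python
def is_accepted(word, word_list):
    """Toy model of a case-sensitive lookup (not the real engine)."""
    if word in word_list:
        return True
    # Assumed rule: a capitalized word passes if its lower-case form is listed.
    if word[:1].isupper():
        return (word[:1].lower() + word[1:]) in word_list
    return False

word_list = {"apple", "Paris"}
results = {w: is_accepted(w, word_list) for w in ["apple", "Apple", "Paris", "paris"]}
```

Here "apple", "Apple", and "Paris" are accepted, while "paris" is reported as a misspelling because the list contains only the capitalized form.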

All common forms of a word must be listed explicitly. For example, each singular and plural form must be listed, as must different inflections, tenses, etc. (To the spelling engine, a word is simply a string; the job of the spelling engine is to determine if the string is defined or not.)

The word list should be sorted according to the numerical value of the character codes in the character set used to represent the words. Duplicate words must be removed.
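Sorting by character code is not the same as alphabetical (locale) sorting: upper-case letters come before lower-case ones, and accented letters come after "z". A minimal sketch, assuming the words are held as Python strings and that the function name `sort_word_list` and its `encoding` parameter are chosen only for this example:

```python
def sort_word_list(words, encoding="iso8859_1"):
    """Remove duplicates and sort by character code in the target character set."""
    unique = set(words)
    # Comparing the encoded bytes orders words by their codes in that character set.
    return sorted(unique, key=lambda w: w.encode(encoding))

words = ["zebra", "Zebra", "apple", "éclair", "apple"]
ordered = sort_word_list(words)
# ordered == ["Zebra", "apple", "zebra", "éclair"]
```

For a word list in another ISO-8859 set, pass the matching codec name (for example `"iso8859_5"`) so that the ordering follows that set's byte values rather than Unicode code points.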

Character sets

If the dictionary is intended for use with Sentry Windows SDK version 5.14 or earlier, WSpell version 5.14 or earlier, Wintertree Spelling Server version 1.7 or earlier, or the single-byte spelling engine compiled from Sentry Source SDK, the word list must contain single-byte ISO-8859-1 characters (the lower 128 characters of ISO-8859-1 are identical to ASCII, and the upper 128 characters contain mainly accented letters used in Western European languages).
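Before targeting one of these single-byte engines, it can help to confirm that every word is representable in ISO-8859-1. The sketch below uses Python's standard `iso8859_1` codec; the function name `not_in_latin1` is an assumption for this example.

```python
def not_in_latin1(words):
    """Return the words containing characters outside ISO-8859-1."""
    rejected = []
    for word in words:
        try:
            word.encode("iso8859_1")
        except UnicodeEncodeError:
            rejected.append(word)
    return rejected

rejected = not_in_latin1(["café", "naïve", "łódź", "cœur"])
# "łódź" and "cœur" contain letters that ISO-8859-1 does not define
```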

If the dictionary is intended for use with Sentry Windows SDK version 5.15 or later, or WSpell version 5.15 or later, the word list may contain characters from any single set in the ISO-8859 family.
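Because the whole list must fit in one set from the family, a quick way to choose is to try each set in turn. This sketch relies on the ISO-8859 codecs shipped with Python; the function name `candidate_charsets` is hypothetical, and whether a given product release accepts every set listed here is not something the sketch can confirm.

```python
# Parts of ISO-8859 with a Python codec (part 12 was never published).
ISO_8859_PARTS = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 13, 14, 15, 16]

def candidate_charsets(words):
    """Return codec names of the ISO-8859 sets that can encode every word."""
    text = "".join(words)
    found = []
    for part in ISO_8859_PARTS:
        codec = f"iso8859_{part}"
        try:
            text.encode(codec)
        except UnicodeEncodeError:
            continue
        found.append(codec)
    return found

polish = candidate_charsets(["łódź", "żółw"])
russian = candidate_charsets(["привет", "слово"])
```

An empty result means no single ISO-8859 set covers the list, in which case a Unicode word list is the remaining option.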

If the dictionary is intended for use with Sentry Java SDK, Wintertree Spell Check Applet, Wintertree Spelling Server version 1.8 or later, or the Unicode spelling engine compiled from Sentry Source SDK, the word list may contain 2-byte Unicode characters (UCS-2). The first character of each word-list file must be a Unicode byte-order mark (BOM) identifying the byte ordering (big endian or little endian) used in the file. Variable-length UTF encodings such as UTF-8 are not supported; every character must occupy exactly two bytes, even if it appears in the ASCII set.
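A UCS-2 file with a leading BOM can be produced with Python's UTF-16 codecs, since UTF-16 and UCS-2 are byte-identical for characters up to U+FFFF. This is a sketch under assumptions: the function name `encode_ucs2` is invented here, and the `"\n"` line terminator is a guess, so check the SqLex instructions for the terminator it expects.

```python
import codecs

def encode_ucs2(words, little_endian=True):
    """Encode words, one per line, as 2-byte characters preceded by a BOM."""
    text = "".join(word + "\n" for word in words)  # line terminator is an assumption
    # Characters above U+FFFF would need surrogate pairs, which UCS-2 lacks.
    if any(ord(ch) > 0xFFFF for ch in text):
        raise ValueError("word list contains characters outside UCS-2")
    if little_endian:
        return codecs.BOM_UTF16_LE + text.encode("utf-16-le")
    return codecs.BOM_UTF16_BE + text.encode("utf-16-be")

data = encode_ucs2(["ab"])
# data == b"\xff\xfea\x00b\x00\n\x00"
```

The returned bytes can be written to the word-list file opened in binary mode. Every character, including plain ASCII letters, occupies two bytes in the output.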

If the dictionary is intended for use with Sentry Java SDK, Wintertree Spelling Server version 1.8 or later, or Wintertree Spell Check Applet, the word list may contain either ISO-8859-1 characters or Unicode characters.

Compressing the word list

Compared with acquiring or developing a word list, compressing a word list is relatively straightforward. You will need to purchase either Sentry Windows SDK or Sentry Java SDK, depending on the character set used in your word list.

The Sentry software development kits include the SqLex utility program, which will turn a word list into a compressed lexicon or dictionary (.clx file). Instructions for using SqLex are included with the Sentry products.




Copyright © 2005 Wintertree Software Inc. Last modified