README

Last Update: Wed, Feb 28 2007 10:39:12 +0000

Description

KLookup is a library for looking up han (漢字 - kanji, hànzì, hanja, and hán tự) by multiple radicals, stroke count, reading, meaning, etc.

At the moment, the default database component uses Jim Breen’s RADKFILE and KANJIDIC (and so is Japanese-specific), but a Unihan component is being developed.

There are a couple of interfaces to the library (command line and CGI), but they don’t yet take advantage of all of its features. Hopefully there will soon be an interface that can demonstrate everything the library does.

The term ‘han’ is used to refer to the Chinese characters used in four East-Asian languages known collectively as CJKV (Chinese, Japanese, Korean, Vietnamese).

Get

KLookup requires at least Ruby 1.8.

To check out the latest revision via Subversion:

 $ svn co svn://rubyforge.org/var/svn/klookup/klookup/trunk klookup

To install via RubyGems (as root, probably):

 # gem install -y klookup

To download a tarball or gem:

rubyforge.org/frs/?group_id=2661

Installation

When installing from a tarball or checkout, the installation method is:

 $ rake gem

And as root, probably:

 # gem install activesupport klookup

activesupport need only be installed once; it may be omitted when installing the KLookup gem again.

Running without installation

cklookup can be run from the bin/ directory without installation:

 $ cd bin
 $ ./cklookup

The entire directory can be dropped into a CGI-enabled path and klookup.cgi will work (the klookup.cgi from a RubyGems installation can also be symlinked into a CGI-enabled path).

One can also run script/server from the main project directory, which will serve the source tree at localhost:3000/ using WEBrick, and bin/klookup.cgi can be executed from there.

Developers

The classes you need to know about are in the KLookup::Lookup module: they are Kanji and Radical.

If you wish to use a different handler (and thus a different data source) for the back-end, you will need to set the handler, for example:

 KLookup::Lookup.handler=KLookup::Database::Unihan

The KLookup::Database::FlatFile handler (the current default) uses Jim Breen’s RADKFILE and KANJIDIC and so is Japanese-specific. This is the most mature data handler.

The KLookup::Database::Unihan handler is an interface to the Unihan.txt file distributed by the Unicode Consortium. It doesn’t work yet, but it will eventually turn KLookup into a more generic tool suited to more than just the Japanese language. (It also requires the unihan database, which can be downloaded as a gem.)
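As a rough illustration of what a swappable handler implies, here is a minimal, self-contained sketch of the pattern. It does not use KLookup's real classes: SketchLookup, SketchFlatFile, SketchUnihan and the source method are hypothetical stand-ins, and the interface the real handlers expose may well differ.

```ruby
# Hypothetical stand-ins; none of these names exist in KLookup.
module SketchFlatFile
  # Pretend back-end: reports which data source it would read.
  def self.source
    "RADKFILE and KANJIDIC"
  end
end

module SketchUnihan
  def self.source
    "Unihan.txt"
  end
end

module SketchLookup
  # Default handler, echoing the README's statement that FlatFile is the default.
  @handler = SketchFlatFile

  class << self
    # Reader and writer so callers can swap the back-end at run time.
    attr_accessor :handler

    # Every query is delegated to whichever handler is currently set.
    def source
      handler.source
    end
  end
end

puts SketchLookup.source            # answered by the default handler
SketchLookup.handler = SketchUnihan
puts SketchLookup.source            # answered by the replacement handler
```

The point of the design is that calling code only ever talks to the front-end module, so changing the data source is a single assignment.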

Hacking

Make sure you’ve got the latest revision (use SVN), and see list.yaml for a list of things that need doing.

Note also that rake commit (or rake ci) can be used in place of svn ci. It checks that every version-controlled file is mentioned in ChangeLog (except ChangeLog and list.yaml), then uses the most recent chunk of the ChangeLog as the commit message. It caught a typo the first time I tried to commit the code, so I’m convinced of its usefulness.
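The two checks can be sketched roughly as follows. This is not the project's actual Rake task: the method names are hypothetical, the sample ChangeLog text is made up, and the sketch assumes newest-first entries that each begin with an unindented YYYY-MM-DD header line.

```ruby
# Files exempt from the check, as listed in the README.
EXEMPT = ["ChangeLog", "list.yaml"]

# Return the version-controlled files that the ChangeLog never mentions.
def unmentioned_files(files, changelog_text)
  files.reject do |file|
    EXEMPT.include?(file) || changelog_text.include?(file)
  end
end

# Return the most recent entry, assuming newest-first entries that each
# start with an unindented "YYYY-MM-DD" header (an assumption, see above).
def latest_entry(changelog_text)
  entries = changelog_text.split(/^(?=\d{4}-\d{2}-\d{2})/)
  entries.reject { |entry| entry.strip.empty? }.first.to_s.strip
end

# Made-up sample data, for illustration only.
changelog = <<EOS
2007-02-28  Example Author

  * lib/example.rb: Fix typo.

2007-02-27  Example Author

  * README: Initial text.
EOS

files = ["README", "lib/example.rb", "lib/missing.rb", "ChangeLog", "list.yaml"]

p unmentioned_files(files, changelog)   # files a commit should be refused for
puts latest_entry(changelog)            # text that would become the commit message
```

A real task would obtain the file list from Subversion and abort the commit when the first list is non-empty.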

RUnicode

The RUnicode library is present because Ruby doesn’t handle Unicode properly (RUnicode only handles UTF-8). It is only used in two or three places in the source.

RUnicode isn’t included in the gem; instead, activesupport is required. This is a Rails-related library which includes ActiveSupport::Multibyte and String#chars for working on strings in an encoding-independent and encoding-aware manner.
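To make the underlying problem concrete: in Ruby 1.8 a String is a sequence of bytes, so a two-character han string is not two units long. The sketch below uses only core pack and unpack with the "U" (UTF-8) directive, not RUnicode or ActiveSupport, and is meant to illustrate the byte/character distinction rather than how KLookup itself deals with it.

```ruby
# Build the UTF-8 string for U+6F22 U+5B57 (漢字) from its code points.
han = [0x6F22, 0x5B57].pack("U*")

# "C*" unpacks raw bytes; each of these characters takes three bytes in UTF-8.
byte_count = han.unpack("C*").length

# "U*" decodes the UTF-8 and yields one integer per character.
code_points = han.unpack("U*")

puts "bytes: #{byte_count}, characters: #{code_points.length}"
```

Any code that indexes or slices such a string by byte position risks cutting a character in half, which is why a character-aware layer is needed.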

data/

data/ contains three files: a somewhat modified RADKFILE known as newradkfile, a UTF-8 KANJIDIC known as kanjidic, and an unmodified Unihan.txt taken from Unicode 5.0.0.

The RADKFILE and KANJIDIC are based on a Japanese encoding (JIS X 0208-1990) and so contain 6,355 characters. They were created by Jim Breen.

newradkfile has had most ‘pretend’ radicals replaced with ‘real’ radicals. For instance, the original file contained 犯 instead of 犭 and 艾 instead of 艹.

I decided this was a good idea so that KLookup doesn’t have to think about the mappings. It’s best not to ask how the file was generated… just be thankful for its existence (this is noted in list.yaml - I will get around to it when I feel like becoming very confused).
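The substitution described above amounts to a character-for-character mapping applied to the radical header lines. The sketch below shows the idea using only the two pairs named in this README. The constant and method names are hypothetical, it assumes header lines of the form "$ radical stroke-count", and it is not how newradkfile was actually generated.

```ruby
# Stand-in character => real radical; only the two pairs mentioned above.
REAL_RADICALS = {
  "犯" => "犭",
  "艾" => "艹"
}

# Rewrite a radical header line; any other line is returned unchanged so
# that kanji which merely look like a stand-in are never touched.
def replace_pretend_radical(line)
  marker, radical, rest = line.split(" ", 3)
  return line unless marker == "$"
  [marker, REAL_RADICALS.fetch(radical, radical), rest].compact.join(" ")
end

puts replace_pretend_radical("$ 犯 3")   # header line: radical is swapped
puts replace_pretend_radical("犯")       # not a header line: left alone
```

Restricting the rewrite to header lines matters because the stand-ins are ordinary kanji in their own right and also appear in the kanji lists.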

Unihan.txt contains information on 71,226 characters, including most of the information available in Jim Breen’s resources. I would therefore like to move towards using KLookup::Database::Unihan as the default handler in future. The Unihan handler doesn’t actually function yet (so you’ve maybe downloaded 28MB for nothing…), but it will do soon (I’m hoping for it to have at least minimal functionality by the next release).

Unihan is now in a separate directory in SVN:

 $ svn co svn://rubyforge.org/var/svn/klookup/unihan/trunk unihan

Resources

Books:

Data:

Final Year Project

This project is the final year project for my undergraduate Bachelor of Science degree in Computer Science (three years) at the University of Bradford.

You can check it out with:

 $ svn co svn://rubyforge.org/var/svn/klookup/report/trunk klookup
