# This file is last updated on Wed June 1 1994.
#
# All programs distributed in this directory
# are copyrighted by Man-Chi Pong (mcpong@cs.ust.hk),
# Department of Computer Science,
# The Hong Kong University of Science and Technology (HKUST) 1994.
# 
# Most likely I won't have time to enhance the program "wordg2b".
#	mcpong@cs.ust.hk (I'll be with HKUST till June 30, 1994).

WARNING:
The programs were developed under SunOS 4.1.3 on a SPARCstation 2.
The data files used by "wordg2b" may not work on other platforms.

/******************************************************************
Copyright 1994 by Man-Chi Pong.

                        All Rights Reserved

Permission to use, copy, modify, and distribute this software and its 
documentation for any purpose and without fee is hereby granted, 
provided that this copyright notice appear in all copies and that
both that copyright notice and this permission notice appear in 
supporting documentation.

DISCLAIMER:

I DISCLAIM ALL WARRANTIES WITH REGARD TO THIS SOFTWARE, INCLUDING
ALL IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS, IN NO EVENT SHALL
I BE LIABLE FOR ANY SPECIAL, INDIRECT OR CONSEQUENTIAL DAMAGES OR
ANY DAMAGES WHATSOEVER RESULTING FROM LOSS OF USE, DATA OR PROFITS,
WHETHER IN AN ACTION OF CONTRACT, NEGLIGENCE OR OTHER TORTIOUS ACTION,
ARISING OUT OF OR IN CONNECTION WITH THE USE OR PERFORMANCE OF THIS
SOFTWARE.

******************************************************************/

wordg2b Release 1.0
===================
This directory contains the files for creating the program "wordg2b",
which converts a file of Chinese characters (hanzi) in
GB2312-80 (GB) encoding to Big5 encoding.  It also contains
other utilities which help to build the data files used by "wordg2b".

----------------------------------------------------------------------
To create "wordg2b" (& other utilities), type "make all".
----------------------------------------------------------------------

----------------------------------------------------------------------
To create the manual page in the file "wordg2b.man",  type "make man".
----------------------------------------------------------------------

----------------------------------------------------------------------
To install "wordg2b", type "make install".
----------------------------------------------------------------------

----------------------------------------------------------------------
To run "wordg2b" after installation, type "wordg2b -h" to see the help
message.  Note that "wordg2b" cannot read from stdin or write to stdout:
use "-i" to specify an input file and "-o" to specify an output file.

(The option "-h" applies to all utility programs as well.)
----------------------------------------------------------------------
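
Because "wordg2b" works only on named files, a caller that holds the
text in memory has to go through temporary files.  The following is a
minimal Python sketch of that idea, not part of the distribution.  The
helper names "build_command" and "convert_bytes" are hypothetical, and
the sketch assumes that "wordg2b" is on the PATH, that each option
takes its file name as the next argument, and that the program exits
with a non-zero status on failure.

```python
import os
import subprocess
import tempfile


def build_command(in_path, out_path, program="wordg2b"):
    """Return the argument list for one run: -i <input> -o <output>."""
    return [program, "-i", in_path, "-o", out_path]


def convert_bytes(gb_bytes, program="wordg2b"):
    """Convert GB-encoded bytes to Big5 by going through temporary files."""
    with tempfile.TemporaryDirectory() as tmp:
        in_path = os.path.join(tmp, "input.gb")
        out_path = os.path.join(tmp, "output.big5")
        with open(in_path, "wb") as f:
            f.write(gb_bytes)
        # Assumption: a failed conversion yields a non-zero exit status.
        subprocess.run(build_command(in_path, out_path, program), check=True)
        with open(out_path, "rb") as f:
            return f.read()
```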

Note:

Since GB to Big5 conversion is a one-to-many mapping,
"wordg2b" makes use of the following tables based on
Chinese words to help resolve the ambiguity:

(N.B.: A Chinese word contains one or more hanzi.)

-- "DICT.FMM" and "DICT.BMM"
	-- They contain the dictionaries used in word segmentation
	   of the input file so that the segmented words would
	   be examined one by one and converted using the
	   following tables.

	   (They are generated by "make datafiles", which can be
	    triggered by "make install".)

-- a GB-Big5 char table ("g2bchar.tab")
	-- This is used when a word is not found in the following
	   tables, so that the individual character(s) of the word
	   have to be converted one by one by looking up this table.

	   (This file should not be changed.)

-- a GB-Big5 word table ("g2bword.tab")
	-- A table of two columns,
	   left  column contains words in GB,
	   right column contains words in Big5.

	   (It can be simply copied from "g2bword.tab.base".
	    Or, if the user has his own GB-Big5 word table, say,
	    "g2bword.+" generated by the program "addword",
	    then it can be formed by
	    "cat g2bword.tab.base g2bword.+ > g2bword.tab")

-- a GB-Big5 word-bigram table with 1st word being a single-char-word
   ("prefix.scw.bigram.tab")
	-- A table of four columns,
	   left  two columns contain word-bigram (2 words) in GB,
	   right two columns contain word-bigram (2 words) in Big5.

-- a GB-Big5 word-bigram table with 2nd word being a single-char-word
   ("suffix.scw.bigram.tab")
	-- A table of four columns,
	   left  two columns contain word-bigram (2 words) in GB,
	   right two columns contain word-bigram (2 words) in Big5.
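
The lookup order described above (word table first, character table
as the fallback) can be illustrated with a small Python sketch.  It is
not the code of "wordg2b": the two dictionaries are toy stand-ins for
"g2bword.tab" and "g2bchar.tab", the function names are hypothetical,
Unicode strings are used instead of GB and Big5 bytes, and the two
bigram tables are left out.

```python
# Toy stand-ins for "g2bword.tab" and "g2bchar.tab" (assumed contents).
WORD_TABLE = {"头发": "頭髮", "发展": "發展"}
CHAR_TABLE = {"头": "頭", "发": "發", "展": "展", "白": "白"}


def convert_word(word):
    """Convert one segmented word: whole-word lookup, then per character."""
    if word in WORD_TABLE:
        return WORD_TABLE[word]
    # Fallback: characters missing from the table pass through unchanged.
    return "".join(CHAR_TABLE.get(ch, ch) for ch in word)


def convert_words(words):
    """Convert a sequence of words that has already been segmented."""
    return "".join(convert_word(w) for w in words)
```

In this toy data the GB character "发" corresponds to either "發" or
"髮" in Big5.  The word table picks the right one for the two listed
words, while a word that is missing from it, such as "白发", falls back
to the character table and comes out as "白發" rather than "白髮".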

"wordg2b" first searches these files in the directory specified
in the environment variable WORDG2B_PATH, if it is defined;
and finally searches these files in the 'current working directory'.

E.g.
	setenv WORDG2B_PATH /usr/local/CHINESE/wordg2b

Then these files will first be searched for in the directory
"/usr/local/CHINESE/wordg2b", and finally in the user's
current working directory.
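
The search order can be sketched in Python as follows.  This is only
an illustration of the rule stated above.  The function name
"find_data_file" is hypothetical, and the sketch assumes that
WORDG2B_PATH names a single directory, not a list of directories.

```python
import os


def find_data_file(name, environ=None, cwd=None):
    """Return the first existing path for a data file, or None."""
    environ = os.environ if environ is None else environ
    cwd = os.getcwd() if cwd is None else cwd
    candidates = []
    path_dir = environ.get("WORDG2B_PATH")
    if path_dir:
        # The directory from the environment variable is tried first.
        candidates.append(os.path.join(path_dir, name))
    # The current working directory is tried last.
    candidates.append(os.path.join(cwd, name))
    for candidate in candidates:
        if os.path.isfile(candidate):
            return candidate
    return None
```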

(Note:
 Only "wordg2b" checks for the environment variable WORDG2B_PATH.
 Other utility programs, if they have to use the data files,
 assume that the data files will be in the same directory where
 the programs are run.  If the data files are in another directory,
 their full path names should be specified in command line arguments.)

----------------------------------------------------------------------
Some of the data files contain probabilities of the occurrences of
Chinese words.  The probabilities in the data files can be improved
through better training from representative text corpora.
Currently the probability data in those files are quite crude.
Future versions may include data files with better-trained
probabilities.
----------------------------------------------------------------------


======================================================================
# The user can enhance the distributed word tables so that           #
# the accuracy of conversion by the program "wordg2b" can be         #
# increased.  This can be done by using the program "AddDICT", etc.  #
# (See below for the details.)                                       #
======================================================================

======================================================================
The following describes how to enhance the word tables used by "wordg2b".
Do this when you wish to include some words which have not been converted
correctly from GB to Big5.  (You have to know the words in GB and Big5
beforehand.)  After completing the steps below, run "make install" again.
======================================================================

(1)
Check that the distributed files "DICT.FMM" and "DICT.BMM" exist.
They are the dictionary files for Forward Maximal Matching (FMM) and
Backward Maximal Matching (BMM) of words during word segmentation
by "wordg2b".
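
For readers unfamiliar with maximal matching, here is a minimal Python
sketch of the two scanning directions.  It is a textbook version under
stated assumptions, not the code of "wordg2b": the dictionary is a
plain set of Unicode strings, "max_len" is a hypothetical limit on
word length, and the way "wordg2b" combines the two results with the
word probabilities is not shown.

```python
def forward_max_match(text, dictionary, max_len=4):
    """Segment left to right, taking the longest dictionary word each time."""
    words, i = [], 0
    while i < len(text):
        for size in range(min(max_len, len(text) - i), 0, -1):
            piece = text[i:i + size]
            # A single character is always accepted as a word.
            if size == 1 or piece in dictionary:
                words.append(piece)
                i += size
                break
    return words


def backward_max_match(text, dictionary, max_len=4):
    """Segment right to left, taking the longest dictionary word each time."""
    words, j = [], len(text)
    while j > 0:
        for size in range(min(max_len, j), 0, -1):
            piece = text[j - size:j]
            if size == 1 or piece in dictionary:
                words.append(piece)
                j -= size
                break
    words.reverse()
    return words
```

The two directions can disagree.  With the toy dictionary
{"研究", "研究生", "生命", "起源"}, the text "研究生命起源" is segmented
differently by the two functions, which is the kind of ambiguity that
calls for both dictionaries.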

(2)
Run the interactive program "addword" to enter the desired words in
GB and Big5.
Two files "g2bword.+" and "wordprob.gb+" will be created.
E.g.
	% addword
	Will output to the following two files:
	caNameMonogramTableFile ='wordprob.gb+'
	caNameGBB5WordTableFile ='g2bword.+'
	input a word in GB (<ctrl-d> to terminate):
		<user input a word in GB>
	input the corresponding word in Big5:
		<user input the corresponding word in Big5>
	...
	... repeat inputting word in GB and word in Big5
	...
	input a word in GB (<ctrl-d> to terminate):
	<ctrl-d>

The file "g2bword.+" contains two fields per line:

	wordInGB wordInBig5
		^
		The two fields are separated by the character <tab>.

The file "wordprob.gb+" contains two fields per line:

	wordInGB floatingPointNumber
		^
		The two fields are separated by a space.

	The floatingPointNumber is the probability of occurrence
	of the word (wordInGB) among others in the distributed
	dictionary files "DICT.FMM" and "DICT.BMM".

	Since the base dictionary files don't contain your words,
	the program "addword" simply assumes that your added words
	occur with the minimal probability 2.0e-07.

	Note:
	The single-character word "的" (in GB) ["de" in pinyin]
	has the highest probability 1.916912e-02.

	If your word is still not converted properly after you have
	followed the steps in this section, you may try increasing
	the probability field of your word in the file "wordprob.gb+"
	and repeating the steps below, to increase the chance that
	your word will be converted correctly.
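
The two line formats, and the manual edit of a probability field, can
be sketched in Python as follows.  This is an illustration only.  The
function names are hypothetical, the lines are handled as raw bytes so
that the GB and Big5 text is left untouched, and writing the number in
"%e" style is an assumption based on the value 1.916912e-02 quoted
above.

```python
MIN_PROB = 2.0e-07  # the default that "addword" assigns, per the text above


def parse_wordprob_line(line):
    """Split b'wordInGB probability' (space-separated) into (bytes, float)."""
    word, prob = line.rstrip(b"\r\n").rsplit(b" ", 1)
    return word, float(prob.decode("ascii"))


def parse_g2bword_line(line):
    """Split b'wordInGB<tab>wordInBig5' into (gb_bytes, big5_bytes)."""
    gb, big5 = line.rstrip(b"\r\n").split(b"\t", 1)
    return gb, big5


def raise_probability(lines, target_word, new_prob):
    """Return "wordprob.gb+" lines with target_word's probability replaced."""
    out = []
    for line in lines:
        word, prob = parse_wordprob_line(line)
        if word == target_word:
            prob = new_prob
        out.append(word + b" " + ("%e" % prob).encode("ascii") + b"\n")
    return out
```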

(3)
Run "AddDICT" twice, once for FMM and once for BMM, to add
the contents of file "wordprob.gb+" (created by "addword")
to the base dictionary files "DICT.FMM" and "DICT.BMM".
e.g.
	% AddDICT -h
	... help messages of command line options printed ...
	% AddDICT -f -d DICT.FMM -i wordprob.gb+ -o DICT
	[If no error, then
		mv DICT DICT.FMM
	]
	% AddDICT -b -d DICT.BMM -i wordprob.gb+ -o DICT
	[If no error, then
		mv DICT DICT.BMM
	]

	[ Running "AddDICT -i wordprob.gb+ ..." more than once
	  does not increase the size of the new dictionary files,
	  because each new word is added only once.
	]
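
The "run, then move only if there was no error" pattern of this step
can be scripted.  The Python sketch below is a hedged illustration:
"rebuild_dictionary" is a hypothetical helper, and it assumes that
"AddDICT" reports an error through a non-zero exit status, which this
file does not state.  The two argument lists repeat the commands shown
above.

```python
import os
import subprocess

FMM_ARGV = ["AddDICT", "-f", "-d", "DICT.FMM", "-i", "wordprob.gb+", "-o", "DICT"]
BMM_ARGV = ["AddDICT", "-b", "-d", "DICT.BMM", "-i", "wordprob.gb+", "-o", "DICT"]


def rebuild_dictionary(argv, tmp_out, target):
    """Run one command; replace target with tmp_out only if it succeeded."""
    result = subprocess.run(argv)
    if result.returncode != 0:
        return False  # leave the existing dictionary untouched
    os.replace(tmp_out, target)  # same effect as "mv DICT DICT.FMM"
    return True
```

Under those assumptions the two calls would be
rebuild_dictionary(FMM_ARGV, "DICT", "DICT.FMM") followed by
rebuild_dictionary(BMM_ARGV, "DICT", "DICT.BMM").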

(4)
Concatenate the contents of file "g2bword.+" (created by "addword")
to the distributed GB-Big5-word-table "g2bword.tab.base"
to become the file "g2bword.tab", which will be used by
"wordg2b".

	cat g2bword.tab.base g2bword.+ > g2bword.tab
or
	# if your "g2bword.tab" is already different from "g2bword.tab.base",
	cat g2bword.+ >> g2bword.tab

	It doesn't matter whether "g2bword.+" is added to the end
	or the beginning of "g2bword.tab.base", though usually
	I would put it at the end.
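
For scripted setups, the two "cat" commands above have a direct
equivalent in Python.  This is a minimal sketch with hypothetical
function names.  The files are opened in binary mode so that the GB
and Big5 bytes are copied without any decoding.

```python
def build_word_table(base_path, extra_path, out_path):
    """Write base followed by extra to out_path, like "cat base extra > out"."""
    with open(out_path, "wb") as out:
        for path in (base_path, extra_path):
            with open(path, "rb") as src:
                out.write(src.read())


def append_to_word_table(extra_path, table_path):
    """Append extra to an existing table, like "cat extra >> table"."""
    with open(extra_path, "rb") as src, open(table_path, "ab") as table:
        table.write(src.read())
```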

** DO NOT MODIFY THE FILE "g2bword.tab.base"! **

(5)
After the above are done, the table files are set up with the
additional words (in GB and Big5) input by you.  Then you can
do "make install".

Then running "wordg2b" will make use of the additional words.

======================================================================
Utilities:
----------
(1)
After you have created the file "wordprob.gb+",
you can use the program "checkdup -i wordprob.gb+"
to check whether you have entered duplicated entries
in the file.

"checkdup" checks for neighbouring duplicated entries
of the form "GBword<ASCII-char>...", i.e., a word in GB
followed by any ASCII character (usually ' ' or '\t').
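
The check can be pictured with the following Python sketch.  It is one
reading of the description above, not the code of "checkdup".  The
function names are hypothetical, and a GB word is taken to be the
leading run of bytes with the high bit set.

```python
def leading_gb_word(line):
    """Return the leading run of non-ASCII bytes (the GB word) of a line."""
    end = 0
    while end < len(line) and line[end] >= 0x80:
        end += 1
    return line[:end]


def neighbouring_duplicates(lines):
    """Yield (line_number, word) when a line repeats the previous line's word."""
    previous = None
    for number, line in enumerate(lines, start=1):
        word = leading_gb_word(line)
        if word and word == previous:
            yield number, word
        previous = word
```

Under this reading, two identical entries that are separated by other
lines would not be reported.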

(2)
The program "hzcount" counts the number of hanzi
in a given file.  ASCII characters are ignored.
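
A minimal Python sketch of such a count is given below.  It is an
assumption about the method, not the code of "hzcount": every byte
with the high bit set is treated as the first byte of a two-byte
character, so two-byte punctuation is counted as well.  The second
byte is skipped because in Big5 it may fall in the ASCII range.

```python
def count_hanzi(data):
    """Count two-byte characters in GB- or Big5-encoded bytes."""
    count, i = 0, 0
    while i < len(data):
        if data[i] >= 0x80:
            count += 1
            i += 2  # skip the trail byte, which may look like ASCII in Big5
        else:
            i += 1  # ASCII characters are ignored
    return count
```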

(3)
The program "hzdiff" counts the number of hanzi that differ
between the two given files.  (It is assumed that the numbers of
hanzi in both files and the ASCII characters in both files
are the same.)
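
One plausible reading of this comparison is a position-by-position
check of the hanzi in the two files, for instance the output of
"wordg2b" against a hand-corrected reference.  The Python sketch below
follows that reading.  The function names are hypothetical, and the
real "hzdiff" may work differently.

```python
def hanzi_list(data):
    """Return the two-byte characters of GB- or Big5-encoded bytes, in order."""
    chars, i = [], 0
    while i < len(data):
        if data[i] >= 0x80:
            chars.append(data[i:i + 2])
            i += 2
        else:
            i += 1
    return chars


def count_differing_hanzi(data_a, data_b):
    """Count the positions at which the hanzi of the two inputs differ."""
    a, b = hanzi_list(data_a), hanzi_list(data_b)
    if len(a) != len(b):
        raise ValueError("inputs must contain the same number of hanzi")
    return sum(1 for x, y in zip(a, b) if x != y)
```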

======================================================================
[ Note:
  The conversion from Big5 to GB is essentially unambiguous: each
  Big5 hanzi maps to a single GB hanzi, if there is a matching hanzi.
  The user can use the program "hc" ("hanzi converter") archived at
  ifcss.org:/software/ to convert from Big5 to GB.
  "hc" is based on characters, not words.
]
======================================================================

======================================================================
If you would like to show your appreciation of this piece of work,
you could send a cheque of $10 (or above) to a charity for
third-world countries.

e.g.
"Oxfam Hong Kong"
 Ground Floor-3B, June Garden,
28 Tung Chau St., Tai Kok Tsui,
Kowloon, Hong Kong.
(If you like, you may state that this is a donation in your or my name.)

This is not a condition of using this software.
======================================================================

====================
# End of this file #
====================
