SBBIC Khmer Word Breaker Using ICU
We’ve been working on getting code into ICU so that Khmer Unicode text can be broken between words automatically, and the newest release of ICU now includes a Khmer word breaker. Using it directly is difficult unless you are a programmer, so we have made a small program that uses ICU and lets you run the Khmer word breaker on Linux (a Windows version will come soon). We’ve only tested it on Ubuntu 11.x, so please try it and let us know if you have any problems or suggestions for improvement.
The word breaker is currently dictionary based, so it works best on documents that are spelled correctly. In the future we hope to add code that deals better with “unknown” words.
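To give a feel for what “dictionary based” means, here is a minimal Python sketch of greedy longest-match segmentation. It is only an illustration and not the algorithm ICU actually uses; the function name `segment` and the three-word dictionary are made up for the example. Characters that match no dictionary entry are emitted one at a time, which is why misspelled or unknown words come out fragmented.

```python
def segment(text, dictionary):
    """Split text into words by always taking the longest dictionary match.

    Hypothetical helper for illustration only; this is not ICU's algorithm.
    """
    max_len = max((len(word) for word in dictionary), default=1)
    words = []
    pos = 0
    while pos < len(text):
        # Try the longest candidate first and shrink until something matches.
        for size in range(min(max_len, len(text) - pos), 0, -1):
            candidate = text[pos:pos + size]
            # A single character is always accepted so the loop makes progress
            # even when the text contains an unknown word.
            if size == 1 or candidate in dictionary:
                words.append(candidate)
                pos += size
                break
    return words


# Tiny example dictionary (an assumption; a real one has thousands of entries).
KNOWN_WORDS = ["លោក", "ស៊ីន", "ស៊ីសាមុត"]

if __name__ == "__main__":
    unbroken = "".join(KNOWN_WORDS)
    print(" ".join(segment(unbroken, set(KNOWN_WORDS))))
```

With a correct dictionary the name is recovered as three words. Remove one entry from the dictionary and that part of the text falls apart into single characters.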
To use the program in Ubuntu, place the Unicode .txt file you want to break in the same directory as sbbic-khmer-breaker.out. Then open a console in that directory and type: ./sbbic-khmer-breaker.out yourinputfile.txt youroutputfile.txt (replace the text file names with your own).
Again, if you have any issues, please don’t hesitate to ask in the comments.
DOWNLOAD: SBBIC Khmer Word Breaker Using ICU
New Khmer Unicode Word Breaker in the Works
We’ve been testing a new Java application to use for Khmer word breaking. As you know, Khmer does not use spaces between words, and that causes some difficulties when using Khmer with a computer.
We’ve tested a new Java application against the two current solutions, and the results are promising. Click here to download the unmodified source, or see the link at the bottom to download a built version with the latest Khmer dictionary. Special thanks to the author, Dave Jarvis, for his willingness to let us use his application and for helping us make it work with Khmer.
Here’s a look at the tests – we used the first paragraph of this page (after correcting some of its spelling): http://km.wikipedia.org/wiki/ស៊ីន_ស៊ីសាមុត
We inserted actual spaces so the word breaks would be visible. Also, at this stage the new application can only break short lines of text, so the input text was split into smaller parts (this was done for all the tests):
SBBIC’S NEW LINE SPLIT:
លោក ស៊ីន ស៊ីសាមុត
(១៩៣២-១៩៧៦)
គឺ ជា អ្នក និពន្ធ បទចំរៀង
និង ជា អ្នក ចំរៀង ខ្មែរ
ដ៏ ល្បីល្បាញ
នា អំលុង ទសវត្សរ៍ ឆ្នាំ ៥០ ដល់ ៧០
គាត់ មាន រហ័ស
នាម ថា ជា
អធិរាជ សំលេង មាស
លោក ស៊ីន ស៊ីសាមុត
ទទួលមរណភាព ក្នុង
របបប្រល័យពូជសាសន៍
ខ្មែរក្រហម,ខ្មែរ ក្រហម
នៅថ្ងៃទី១៨,នៅ ថ្ងៃ ទី ១៨
ខែឧសភា
ឆ្នាំ ១៩៧៦
ភាព ល្បីល្បាញ
របស់ លោក ស៊ីន ស៊ីសាមុត
បាន ពី ទឹក ដម សំលេង
ដ៏ ក្រអួន ក្រអៅ
ពីរោះ រណ្ដំ ចិត្ត
គួបផ្សំ និង បទចំរៀង
មនោសញ្ចេតនា គ្រប់
រស ជាតិ
លន្លង់លន្លោច
សប្បាយ កំសត់ ខ្លោចផ្សា -ល-
ដែល ជា ស្នាដៃ និពន្ធ
ផ្ទាល់ របស់ លោក
និង អ្នក និពន្ធ ដទៃ
ក្នុង ជំនាន់ លោក
PANCAMBODIA WORD WRAP:
លោក ស៊ីន ស៊ី សាមុត
(១៩៣២-១៩៧៦)
គឺជា អ្នកនិពន្ធ បទ ចំរៀង
និង ជា អ្នក ចំរៀង ខ្មែរ
ដ៏ ល្បីល្បាញ
នា អំលុង ទសវត្សរ៍ ឆ្នាំ ៥០ ដល់ ៧០
គាត់ មាន រ ហ័ ស
នាម ថា ជា
អធិរាជ សំលេង មាស
លោក ស៊ីន ស៊ី សាមុត
ទទួល ម រណ ភាព ក្នុង
របប ប្រល័យពូជសាសន៍
ខ្មែរក្រហម
នៅ ថ្ងៃទី ១៨
ខែ ឧសភា
ឆ្នាំ ១៩៧៦
ភាពល្បីល្បាញ
របស់លោក ស៊ីន ស៊ី សាមុត
បាន ពី ទឹកដម សំលេង
ដ៏ ក្រអួន ក្រ អៅ
ពីរោះរ ណ្ដំ ចិត្ត
គួប ផ្សំ និង បទ ចំ រៀ ង
មនោសញ្ចេតនា គ្រប់
រសជាតិ
ល ន្ល ង់ ល ន្លោ ច
សប្បាយ កំសត់ ខ្លោចផ្សា – ល -
ដែល ជា ស្នាដៃ និពន្ធ
ផ្ទាល់ របស់លោក
និង អ្នកនិពន្ធ ដទៃ
ក្នុង ជំនាន់ លោក
KHMER OS WORD BREAKER OUR DICTIONARY:
លោក ស៊ីន ស៊ីសា មុត
(១៩៣២ -១៩៧៦)
គ ឺ ជាអ្នក និពន្ធ បទចំរៀង
និ ង ជាអ្នក ចំរៀ ង ខ្មែរ
ដ៏ ល្បីល្បាញ
ន ា អំលុង ទសវត្សរ៍ ឆ្នាំ ៥០ ដល់ ៧០
គា ត់ មា នរហ ័ ស
នាម ថាជា
អធិរាជ សំលេង មាស
លោ ក ស៊ីន ស៊ីសា មុត
ទទួលមរណភាព ក្នុង
របបប្រល័យពូជសាសន៍
ខ្មែរក្រហម
នៅ ថ្ងៃទី១៨
ខែឧសភា
ឆ្នាំ១៩៧៦
ភាពល្បីល្បាញ
របស់លោក ស៊ីន ស៊ីសា មុត
បា នពី ទឹកដម សំលេង
ដ៏ ក្រអួន ក្រអៅ
ពីរោះ រណ្ដំ ចិត្ត
គួបផ្សំ និ ង បទចំរៀង
មនោសញ្ចេតនា គ្រប់
រសជាតិ
លន្លង់លន្លោច
សប្បាយ កំសត់ ខ្លោចផ្សា – ល -
ដែលជា ស្នាដៃ និពន្ធ
ផ្ទា ល់ របស់លោក
និ ង អ្នកនិពន្ធ ដទៃ
ក្នុ ង ជំនាន់ លោក
KHMER OS WORD BREAKER THEIR DICTIONARY:
លោក ស៊ីន ស៊ីសា មុត
(១៩៣២ -១៩៧៦)
គ ឺ ជាអ្នក និពន្ធ បទចំរៀង
និ ង ជាអ្នក ចំរៀ ង ខ្មែរ
ដ៏ ល្បីល្បាញ
ន ា អំលុង ទសវត្សរ៍ ឆ្នាំ ៥០ ដល់ ៧០
គា ត់ មា នរហ ័ ស
នាម ថាជា
អធិរាជ សំលេង មាស
លោ ក ស៊ីន ស៊ីសា មុត
ទទួលមរណភាព ក្នុង
របបប្រល័យពូជសាសន៍
ខ្មែរក្រហម
នៅ ថ្ងៃទី១៨
ខែឧសភា
ឆ្នាំ១៩៧៦
ភាពល្បីល្បាញ
របស់លោក ស៊ីន ស៊ីសា មុត
បា នពី ទឹកដម សំលេង
ដ៏ ក្រអួន ក្រអៅ
ពីរោះ រណ្ដំ ចិត្ត
គួបផ្សំ និ ង បទចំរៀង
មនោសញ្ចេតនា គ្រប់
រសជាតិ
លន្លង់លន្លោច
សប្បាយ កំសត់ ខ្លោចផ្សា – ល -
ដែលជា ស្នាដៃ និពន្ធ
ផ្ទា ល់ របស់លោក
និ ង អ្នកនិពន្ធ ដទៃ
ក្នុ ង ជំនាន់ លោក
You can download the built Java application with our dictionary and test document here: SBBIC WordSplit
To run it you will need Java. The command line is: java -Xmx1024m -Xms1024m -Dfile.encoding=UTF-8 -jar wordsplit.jar khmerlexicon.csv khmercolumns.txt >> results.txt
Please test it and keep us informed of any comments, ideas, or breakthroughs. If you wish to volunteer to help us with the next steps please let us know.
Future plans:
- Modify the application to feed in the first 20 characters or so, find the first word, then feed in the text starting from the next word, and so on (so we won’t have to break lines manually).
- Modify the application to allow word breaking rules for Khmer to help with accuracy (we need to collect rules for finding the end of Khmer words).
- Modify the application to accept the OpenOffice format
- Add support for Microsoft Word documents if possible
- Add a graphical user interface (GUI)
- Create an extension for OpenOffice that will process a document, and possibly break words automatically as one types.
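The first item in the list above could look roughly like the Python sketch below. It is only an outline of the idea, not code from the Java application: the names `first_word` and `split_in_windows` are hypothetical, and the stand-in word finder simply takes the longest dictionary word at the start of each window.

```python
def first_word(window, dictionary):
    """Return the longest dictionary word that starts the window.

    Falls back to the first character when nothing matches, so the
    caller always moves forward. Hypothetical stand-in for the real splitter.
    """
    for size in range(len(window), 0, -1):
        if window[:size] in dictionary:
            return window[:size]
    return window[:1]


def split_in_windows(text, dictionary, window_size=20):
    """Break a long text by repeatedly examining a short window of it."""
    words = []
    pos = 0
    while pos < len(text):
        # Feed in only the next window_size characters ...
        window = text[pos:pos + window_size]
        # ... keep just the first word found in that window ...
        word = first_word(window, dictionary)
        words.append(word)
        # ... and start the next window right after that word.
        pos += len(word)
    return words


if __name__ == "__main__":
    toy_dictionary = {"ab", "abc", "d"}  # assumption: toy entries for the demo
    print(split_in_windows("abcabd", toy_dictionary, window_size=4))
```

Because each step only looks at a short window, the input never has to be split into lines by hand. The window must be at least as long as the longest dictionary word, otherwise long words can never match.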
MOOR Online Legacy to Unicode Converter
MOOR Software has created an online tool that allows you to convert legacy fonts to Unicode. Visit their website and upload the document you want to convert.
KhmerOS Automatic Word Separation (ZWSP) Program
This program goes through a Khmer Unicode text in UTF-8 format and inserts ZWSP characters between the words. It separates words using an internal dictionary (based on the Chuon Nat dictionary). It can handle UTF-8 format files, even if these files are in HTML/XML. It can also deal with simple RTF files.
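For readers unfamiliar with ZWSP: it is the zero-width space, Unicode character U+200B, which is invisible but tells software where a line may break. The short Python sketch below shows the general idea of inserting it between words that have already been separated, and of making the breaks visible for checking. It is not the KhmerOS program’s code, and the function names are invented for the example.

```python
ZWSP = "\u200b"  # ZERO WIDTH SPACE


def join_with_zwsp(words):
    """Join already-separated words with invisible break opportunities."""
    return ZWSP.join(words)


def show_breaks(text):
    """Replace each ZWSP with a normal space so the breaks can be seen."""
    return text.replace(ZWSP, " ")


def strip_zwsp(text):
    """Remove every ZWSP, e.g. before running a word breaker a second time."""
    return text.replace(ZWSP, "")


if __name__ == "__main__":
    words = ["លោក", "ស៊ីន", "ស៊ីសាមុត"]  # words taken from the test text
    broken = join_with_zwsp(words)
    print(show_breaks(broken))
    # Stripping the ZWSPs gives back the original unbroken text.
    print(strip_zwsp(broken) == "".join(words))
```

Swapping ZWSP for ordinary spaces, as `show_breaks` does, is a quick way to compare the output of different word breakers by eye.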
Download: KhmerOS Automatic Word Separation (ZWSP) Program
NOTE: You need to have the Java Runtime Environment installed on your computer (which you can download here). The program runs on any platform that has Java installed.
PAN Khmer Font Encoding Converter
Use this PAN Localization program to convert non-Unicode Khmer documents into Unicode. However, you must install the whole package to get the stand-alone converter. The installer will install “five stand alone applications and one Microsoft Office plug in, extra fonts for Khmer conversion application, five stand alone java application and one plug in OpenOffice.org”
Download: PAN Khmer Font Encoding Converter
Mirror: Temporary Download Location because pancambodia.info is down – click here to download
Limon to Unicode and ABC to Unicode Converter
Use these Khmer OS programs to convert files that use the ABC or Limon Khmer fonts into Khmer Unicode.
Visit: Font Converters at Khmer OS
PAN Khmer Line Breaking Program
Use this PAN Localization program to insert breaks between the words of text that has been typed in Khmer Unicode.
Download: PAN Khmer Line Breaking Program
Mirror: Temporary Download Location because pancambodia.info is down – click here to download
PAN Khmer Collation and Sorting Program
Use this PAN Localization program to sort lists of Khmer words into the correct alphabetical order.
Download: PAN Khmer Collation and Sorting Program
Mirror: Temporary Download Location because pancambodia.info is down – click here to download