We are pleased to announce that the LibreOffice 3.6 pre-release (download: LibO-Dev_3.6.0.0.beta2_Win_x86_install_multi.msi or newer) now incorporates the latest ICU version, which can automatically line-break Khmer Unicode text (we posted about this previously here). This means you no longer have to manually add a zero-width space between words to get correct line breaks in your documents! The screenshots below show a sample document in LibreOffice 3.5 (which does not automatically line-break Khmer), a document with manual zero-width spaces added, and a document in LibreOffice Dev 3.6 with automatic Khmer line-breaking. As you can see, the results are looking good!

LibreOffice Without the New ICU Automatic Khmer Line-Breaking

LibreOffice with Manual Word-breaks Added

LibreOffice Dev 3.6 With Automatic Khmer Line-Breaking

Automatic word-breaking does not yet work for spell checking, so to spell check in Khmer you will still need to manually add zero-width spaces between words. Even so, this is a great step forward for the Khmer language on computers! Hopefully, in the near future, we will no longer need to add spaces between Khmer words manually for spell checking either.
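
For readers who still need manual zero-width spaces (for spell checking, for example), here is a minimal Python sketch of what "adding a zero-width space between words" means at the character level. It assumes the text has already been split into words by hand or by some other tool; the helper names `join_words` and `strip_zwsp` are hypothetical and are not part of LibreOffice or ICU.

```python
# U+200B ZERO WIDTH SPACE: invisible, but marks a place where a line may break.
ZWSP = "\u200b"


def join_words(words):
    """Join already-segmented words with a zero-width space between each pair."""
    return ZWSP.join(words)


def strip_zwsp(text):
    """Remove every zero-width space, restoring the unsegmented text."""
    return text.replace(ZWSP, "")


# Illustrative input: four Khmer words, segmented by hand.
words = ["ខ្ញុំ", "ស្រឡាញ់", "ភាសា", "ខ្មែរ"]
marked = join_words(words)

# The marked text looks identical on screen but is three characters longer.
print(len("".join(words)), len(marked))
```

The same `strip_zwsp` helper can be used to clean up a document once manual breaks are no longer needed.
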
Please try out the new LibreOffice pre-release and let us know how it works for you. If you find any line-breaking issues (for example, a line that breaks in the wrong place), please tell us in the comments so we can debug them and improve the accuracy of the word-breaker in ICU. Special thanks to George for helping us make this a reality.

13 Comments

  • I am curious how this line-breaking mechanism handles names of places and/or people (or other words not in the corpus). Any ideas?

    • Yes, because it is a dictionary-based line-breaker, it will have trouble with words that are not in the dictionary. We have implemented some rules that help (such as never breaking a word after the jung sign ្ ), but more work still needs to be done. If you have any insight, it would be appreciated. You can see the code here: http://source.icu-project.org/repos/icu/icu/trunk/source/common/dictbe.cpp

      We’ve been in contact with someone who has experience using a Hidden Markov Model with Khmer, but he has been quite busy and has not had time to figure out a way to implement it with ICU.

      • I used some rules to break the words correctly for the concordance. A developer friend of mine wrote a program that breaks the text into individual words, based on those rules and a word list of mine. So, the rules were very important. I can pass the rules on to you if you are interested.

        • Hi Adam,
          It would be great to get the rules – hopefully it will help the ICU break iterator perform with higher accuracy.

  • bunthearith
    July 11, 2012 9:24 am

    Is there any tutorial on how to do it?

    • Hello Bunthearith,
      Just download LibreOffice 3.6; when you write in Khmer, it will automatically line-break for you. If you have any trouble, let us know.

  • Can this program be used with Windows 8 64-bit?

  • How can I download this software?

  • I can't see the pictures. How does it work? I am using it, but I don't seem to see any Auto Breaking at all.

  • After I open a document, do I need to click a button to have the software insert zero-width spaces between words, or will it do so by itself once the document is open?

    Reply
    • Hello Boran,
      LibreOffice will automatically break Khmer words (you won’t be able to tell except at line breaks). You can also add your own zero-width spaces if you want to control it manually.
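
As a note on the dictionary discussion in the first comment thread: the Python sketch below illustrates the general idea of a dictionary-based word-breaker and why it struggles with names and other words that are not in its word list. It is not the ICU implementation in dictbe.cpp. It assumes a simple greedy longest-match strategy, a tiny hypothetical word list, and a hypothetical function name `segment`. The one rule it borrows from the thread is never to break directly after the "jung" sign, which is assumed here to be U+17D2 (named KHMER SIGN COENG in Unicode).

```python
# Assumed to be the "jung" sign from the thread: U+17D2 KHMER SIGN COENG,
# which turns the following consonant into a subscript.
COENG = "\u17d2"


def segment(text, words):
    """Greedy longest-match segmentation against a set of known words.

    Text that matches no dictionary word is collected into an 'unknown' run,
    which is why names and rare words come out unsegmented.
    """
    max_len = max(map(len, words), default=0)
    out = []
    unknown = ""
    i = 0
    while i < len(text):
        match = None
        # Rule from the thread: no break is allowed directly after COENG.
        can_break_here = i == 0 or text[i - 1] != COENG
        if can_break_here:
            # Try the longest candidate first, then shorter ones.
            for n in range(min(max_len, len(text) - i), 0, -1):
                cand = text[i:i + n]
                if cand in words and not cand.endswith(COENG):
                    match = cand
                    break
        if match:
            if unknown:
                out.append(unknown)
                unknown = ""
            out.append(match)
            i += len(match)
        else:
            unknown += text[i]
            i += 1
    if unknown:
        out.append(unknown)
    return out


# Hypothetical four-word dictionary, for illustration only.
dictionary = {"ខ្ញុំ", "ស្រឡាញ់", "ភាសា", "ខ្មែរ"}
print(segment("ខ្ញុំស្រឡាញ់ភាសាខ្មែរ", dictionary))
```

A production breaker needs much more than this (for instance, it should never split inside a consonant cluster at all, and it should weigh alternative segmentations rather than always taking the longest match), which is where additional rules or a statistical model would come in.
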
