Minh T. Nguyen

        "Enemy's Gate Is Down"

Moving the new Vietnamese Conversions utility out of beta

After a very long beta phase, I am finally moving my new Vietnamese Conversions utility out of its beta stage and releasing it without any modifications. I have received a few bug reports, but honestly I was not able to reproduce them and my rudimentary tests have not shown any conversion problems.

The new version supports rich-text editing, so you can paste Vietnamese legacy text (VPS, VNI, Vietnet/VIQR) into the textbox, formatting included, and convert it to Unicode while retaining the formatting. It’s pretty useful when dealing with pre-formatted text. In addition, you can now also print directly from this textbox (there you go, Mom, that feature you keep on asking me for). If you have any suggestions or comments, or find any bugs, shoot me an email. However, given that grad school will start for me in two weeks, marking the end of my social life as I know it for the next two years, I don’t know if I will get to accommodate any of your requests. :)


[Screenshot: Vietnamese Conversions, before the VNI->Unicode conversion]

[Screenshot: Vietnamese Conversions, after the VNI->Unicode conversion]

posted on Sunday, August 20, 2006 2:14 AM

Feedback

# re: Moving the new Vietnamese Conversions utility out of beta

Hey, what about converting VISCII? VISCII was easily the best of all the legacy formats, since it contained EVERY Vietnamese letter (except the imaginary Dong sign) as a single byte.

Unfortunately Microsoft killed VISCII when it stopped a couple of characters from appearing in its software.

But all my legacy text is in VISCII.

Support for the official TCVN standard character set would be good too. It was used a lot in Vietnam back in the legacy days.

Also, it is easy to detect the correct character set automatically. The trick is to interpret the text as each character set in turn, and count the number of times the rules of Vietnamese spelling are broken. Here is a list of things to check for when detecting the character set:

* Lowercase letters with capital letters afterward
* Two or more initial capitals followed by lowercase
* The vowel "a" or "e" with a mark above it, following an "a" or "e" without one
* More than one tone per word
* Tone not on the same letter as the circumflex, breve, or horn
* More than one circumflex, breve or horn, except in u+o+
* Invalid tones before "t", "c", or "p" (only sac and nang are allowed)
* Dashed D in the middle of words

There are also other things you should check for specific character sets, like VNI combinations that make no sense.

Then just pick the charset with the fewest errors.
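The scoring idea above can be sketched roughly as follows. This is only an illustration, not Carl's actual software: the function names (split_letters, word_errors, count_errors, detect_charset) are made up for this example, and it checks just a subset of the rules in the list. Python ships a cp1258 codec but none for VISCII, TCVN, VNI or VPS, so the candidates in the usage note are stand-ins; real use would need decoders for those charsets.

```python
import unicodedata
from itertools import groupby

TONES = {"\u0300", "\u0301", "\u0303", "\u0309", "\u0323"}  # huyen, sac, nga, hoi, nang
SHAPES = {"\u0302", "\u0306", "\u031b"}  # circumflex, breve, horn
STOP_TONES = {"\u0301", "\u0323"}  # sac and nang, the only tones allowed before t/c/p


def split_letters(word):
    """Split a word into (base letter, set of combining marks) pairs."""
    letters = []
    for ch in unicodedata.normalize("NFD", word):
        if unicodedata.combining(ch) and letters:
            letters[-1][1].add(ch)
        else:
            letters.append((ch, set()))
    return letters


def word_errors(word):
    """Count how many of the checked spelling rules one word breaks."""
    letters = split_letters(word)
    bases = [base for base, _ in letters]
    errors = 0
    # Lowercase letters with capital letters afterward.
    errors += sum(a.islower() and b.isupper() for a, b in zip(bases, bases[1:]))
    # Two or more initial capitals followed by lowercase.
    caps = 0
    while caps < len(bases) and bases[caps].isupper():
        caps += 1
    if 2 <= caps < len(bases) and bases[caps].islower():
        errors += 1
    toned = [i for i, (_, marks) in enumerate(letters) if marks & TONES]
    shaped = [i for i, (_, marks) in enumerate(letters) if marks & SHAPES]
    tone_marks = set().union(*(marks & TONES for _, marks in letters))
    # More than one tone per word.
    if sum(len(marks & TONES) for _, marks in letters) > 1:
        errors += 1
    # Tone not on the same letter as the circumflex, breve, or horn.
    if toned and shaped and not set(toned) & set(shaped):
        errors += 1
    # More than one circumflex, breve or horn, except in u+o+.
    if len(shaped) > 1:
        is_uo = (
            len(shaped) == 2
            and shaped[1] == shaped[0] + 1
            and bases[shaped[0]].lower() == "u"
            and bases[shaped[1]].lower() == "o"
            and all("\u031b" in letters[i][1] for i in shaped)
        )
        if not is_uo:
            errors += 1
    # Invalid tones before a final t, c, or p.
    if bases[-1].lower() in ("t", "c", "p") and tone_marks - STOP_TONES:
        errors += 1
    # Dashed D in the middle of words.
    errors += sum(base in ("đ", "Đ") for base in bases[1:])
    return errors


def count_errors(text):
    """Sum the rule violations over every word in the decoded text."""
    text = unicodedata.normalize("NFC", text)

    def in_word(ch):
        return ch.isalpha() or unicodedata.combining(ch) > 0

    words = ["".join(group) for is_word, group in groupby(text, key=in_word) if is_word]
    return sum(word_errors(word) for word in words)


def detect_charset(raw, candidates):
    """Decode raw bytes with each candidate codec and keep the lowest score."""
    best = None
    for name in candidates:
        try:
            text = raw.decode(name)
        except UnicodeDecodeError:
            continue  # bytes are not even valid in this charset
        score = count_errors(text)
        if best is None or score < best[1]:
            best = (name, score)
    return best
```

For example, detect_charset("Vi\u00ea\u0323t".encode("cp1258"), ["latin-1", "cp1258"]) should pick cp1258: read as latin-1 the same bytes become "Viêòt", which puts a tone on a different letter than the circumflex and uses a tone that is not allowed before "t".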

My software is slightly more complicated, since I also have to detect what language it is, whether it is Vietnamese written without any tones, and whether it is some mixture of VIQR and other charsets.
8/21/2006 4:25 PM | Carl Kenner

# re: Moving the new Vietnamese Conversions utility out of beta

How about releasing it as open source on SourceForge and letting other people develop it further? You can just sit back and be the overlord admin of the project :)
8/22/2006 12:41 AM | Loc
