.. | ||
LICENSE | ||
README.md | ||
utfutil.go |
utfutil
Utilities to make it easier to read text encoded as UTF-16.
Dealing with UTF-16 files from Windows.
Ever have code that worked for years until you received a file from a MS-Windows system that just didn't work at all? Looking at a hex dump you realize every other byte is \0. WTF? No, UTF. More specifically UTF-16LE with an optional BOM.
What does all that mean? Well, first you should read "The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!)" by Joel Spolsky.
Now you are an expert. You can spend an afternoon trying to figure out how the heck to put all that together and use golang.org/x/text/encoding/unicode
to decode UTF-16LE. However I've already done that for you. Now you can take the easy way out change ioutil.ReadFile() to utfutil.ReadFile(). Everything will just work.
utfutil.ReadFile() is the equivalent of ioutil.ReadFile()
OLD: Works with UTF8 and ASCII files:
data, err := ioutil.ReadFile(filename)
NEW: Works if someone gives you a Windows UTF-16LE file occasionally but normally you are processing UTF8 files:
data, err := utfutil.ReadFile(filename, utfutil.UTF8)
utfutil.OpenFile() is the equivalent of os.Open().
OLD: Works with UTF8 and ASCII files:
data, err := os.Open(filename)
NEW: Works if someone gives you a file with a BOM:
data, err := utfutil.OpenFile(filename, utfutil.HTML5)
utfutil.NewScanner() is for reading files line-by-line
It works like os.Open():
s, err := utfutil.NewScanner(filename, utfutil.HTML5)
Encoding hints:
What's that second argument all about?
Since it is impossible to guess 100% correctly if there is no BOM, the functions take a 2nd parameter of type "EncodingHint" where you specify the default encoding for BOM-less files.
UTF8 No BOM? Assume UTF-8
UTF16LE No BOM? Assume UTF 16 Little Endian
UTF16BE No BOM? Assume UTF 16 Big Endian
WINDOWS = UTF16LE (i.e. a reasonable guess if file is from MS-Windows)
POSIX = UTF8 (i.e. a reasonable guess if file is from Unix or Unix-like systems)
HTML5 = UTF8 (i.e. a reasonable guess if file is from the web)
Future Directions
If someone writes a golang equivalent of uchatdet, I'll add a hint called "AUTO" which uses it. That would be awesome. Volunteers?