zargony.com

#![desc = "Random thoughts of a software engineer"]

Ruby 1.9 and file encodings

Just out of curiosity, I took two of my recent Rails applications today and tried them with Ruby 1.9. It was surprisingly easy to make all tests pass without any warnings or errors, since Rails 2.3 already has quite good support for Ruby 1.9. The main gotcha, however, was file encodings. If you have source files that contain non-ASCII characters, Ruby 1.9 now needs to know which encoding the file was saved with. If you don't specify the encoding for such a file, you'll get an "invalid multibyte char (US-ASCII)" error message.

File encoding

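So Ruby 1.9 refuses to parse any file with non-ASCII characters if you don't specify the encoding. You can do so by adding a special Ruby comment (a so-called magic comment) on the first line of the file, or on the second line if the first one is a shebang line: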

# encoding: utf-8

This tells the Ruby parser to interpret the file content using UTF-8 encoding. Of course, you need to specify the correct encoding (meaning the encoding your editor used when the file was saved). Most POSIX systems like Linux and Mac OS X use UTF-8 by default nowadays. Windows, however, typically defaults to a legacy code page such as Windows-1252, which is closely related to Latin-1 (ISO-8859-1).
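One way to see what the magic comment does is to ask a file which encoding the parser assumed for it. The following is only a sketch: it writes two tiny source files to a temporary location and loads them. The helper name `source_encoding_of` and the global `$reported_encoding` are made up for this illustration; `__ENCODING__` is the built-in keyword that reports the source encoding of the file it appears in.

```ruby
require "tempfile"

# Hypothetical helper: writes `source` to a temporary .rb file, loads it,
# and returns whatever that file stored in $reported_encoding.
def source_encoding_of(source)
  $reported_encoding = nil
  Tempfile.create(["encoding_demo", ".rb"]) do |file|
    file.write(source)
    file.flush
    load file.path
  end
  $reported_encoding
end

# The loaded file simply records the encoding the parser used for it.
body = "$reported_encoding = __ENCODING__\n"

utf8   = source_encoding_of("# encoding: utf-8\n" + body)
latin1 = source_encoding_of("# encoding: iso-8859-1\n" + body)

puts utf8    # UTF-8
puts latin1  # ISO-8859-1
```

The same file body reports a different encoding depending only on the comment in its first line.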

Why bother?

Previously, Ruby didn't care about the encoding. The Ruby parser read everything in the source file byte by byte. For example, try asking for the length of a string that contains non-ASCII characters. With Ruby 1.8, it returns the number of bytes the string occupies:

"ä".length # returns 2 since 'ä' occupies 2 bytes in UTF-8

If you put "ä".length into a text file and save it using UTF-8 encoding, the letter 'ä' is stored as the two-byte sequence C3,A4 (that's how 'ä' is represented in UTF-8). Since Ruby 1.8 doesn't know about encodings and just reads bytes from the file, it sees two "letters" when parsing the file. In Ruby 1.8, bytes and characters are the same thing (one character is considered one byte and vice versa). This is fine for ASCII, but doesn't work with multibyte encodings like UTF-8. If you saved the same file using Latin-1 encoding, the letter 'ä' would be stored as the single byte E4, and Ruby 1.8 would return 1 when asked for the length.
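The byte values mentioned above can be checked from Ruby 1.9 onwards. This is a small sketch; the character is written as the escape "\u00E4" so that the example does not depend on how the snippet itself is saved, and the variable names are arbitrary.

```ruby
# "\u00E4" is the character 'ä', written as an escape sequence.
a_umlaut = "\u00E4"

utf8_bytes   = a_umlaut.encode("UTF-8").bytes.to_a
latin1_bytes = a_umlaut.encode("ISO-8859-1").bytes.to_a

# Format a list of byte values as comma-separated hex.
hex = lambda { |bytes| bytes.map { |b| format("%02X", b) }.join(",") }

puts hex.call(utf8_bytes)    # C3,A4
puts hex.call(latin1_bytes)  # E4

# Treating the UTF-8 data as raw bytes mimics what Ruby 1.8 saw.
raw = a_umlaut.dup.force_encoding("BINARY")
puts raw.length              # 2
```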

Ruby 1.9 however distinguishes between bytes and characters. If the parser encounters the byte sequence C3,A4 in a file, and you told it before that this file uses UTF-8 encoding, the parser knows that these two bytes are the representation of the single character 'ä'. Therefore Ruby 1.9 can correctly count the number of characters in a string, even if it contains multibyte characters:

# encoding: utf-8
"ä".length # returns 1 since Ruby 1.9 knows the encoding

Since there are several encoding standards, Ruby needs to know which one the file was saved with to correctly parse multibyte characters.
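To illustrate why the declared encoding matters, here is a sketch that takes the same two bytes and interprets them once as UTF-8 and once as Latin-1. The variable names are made up; `force_encoding` only changes the label on the bytes, it does not convert them.

```ruby
# The raw bytes C3,A4 as they might be read from a file.
bytes = [0xC3, 0xA4].pack("C*")

as_utf8   = bytes.dup.force_encoding("UTF-8")
as_latin1 = bytes.dup.force_encoding("ISO-8859-1")

puts as_utf8.length    # 1
puts as_latin1.length  # 2

# Converting both to UTF-8 shows which characters each reading stands for.
puts as_utf8 == "\u00E4"                          # true, a single 'ä'
puts as_latin1.encode("UTF-8") == "\u00C3\u00A4"  # true, the two characters 'Ã' and '¤'
```

Identical bytes, two different strings: without knowing the encoding, Ruby has no way of telling which one was meant.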

This also means that, depending on the encoding, the length of a string is no longer necessarily equal to the number of bytes the string occupies. This is by design, since in a multibyte encoding a character cannot necessarily be represented by a single byte.

In Ruby applications, you usually don't care about the number of bytes a string occupies. The most useful length measurement is the number of characters a string has. So String#length and String#size both return the number of characters in Ruby 1.9, not the number of bytes like Ruby 1.8 did. To get the number of bytes a string occupies, Ruby 1.9 has a String#bytesize method (which is rarely needed, I suppose).
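A quick sketch of the difference between the three methods, using the German word "Grüße" written with escapes so the snippet does not depend on its own file encoding:

```ruby
greeting = "Gr\u00FC\u00DFe"  # "Grüße"

puts greeting.length    # 5 characters
puts greeting.size      # 5, same as length
puts greeting.bytesize  # 7, since 'ü' and 'ß' take two bytes each in UTF-8
```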

Does it affect Rails?

Rarely, I'd say. I tried two Rails applications I wrote a while ago (which heavily use UTF-8 strings for German characters) and didn't notice any problems. However, there could be some minor differences in behavior. For example, when using validates_length_of in a model class, the check previously ran against the byte count, while with Ruby 1.9 the real character count is checked. If the underlying storage/database engine isn't aware of the encoding, this may lead to strange length problems.
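To make that mismatch concrete, here is a plain-Ruby sketch (no Rails involved, and validates_length_of itself is not called) of a value that passes a character-based limit but would exceed the same limit counted in bytes. The limit of 10 and both helper names are assumptions made up for this illustration.

```ruby
# Character-based check, the way String#length counts in Ruby 1.9.
def within_char_limit?(value, limit)
  value.length <= limit
end

# Byte-based check, the way an encoding-unaware store would count.
def within_byte_limit?(value, limit)
  value.bytesize <= limit
end

limit = 10                          # assumed limit for this example
name  = "Gr\u00FC\u00DFe Gott"      # "Grüße Gott": 10 characters, 12 bytes

puts within_char_limit?(name, limit)  # true
puts within_byte_limit?(name, limit)  # false
```

A value like this would be accepted by a character-based validation, yet might be truncated or rejected by a store that measures the same limit in bytes.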

I'm a little surprised that Ruby 1.9 does not use UTF-8 as its default source encoding (convention over configuration: UTF-8 seems to be the most widely used encoding nowadays, and it would be nice if it were assumed by default). But then, this might lead to weird problems if you don't use UTF-8 and Ruby doesn't detect it (detecting non-ASCII is easy, detecting non-UTF-8, however, wouldn't be reliable).
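The point about detection can be sketched with two small checks. Both helper names are hypothetical; `ascii_only?` and `valid_encoding?` are the String methods Ruby 1.9 provides for this. The first check is always reliable, the second one only catches some non-UTF-8 data.

```ruby
# Hypothetical helper: does this raw data contain any non-ASCII byte?
def non_ascii?(raw)
  !raw.dup.force_encoding("BINARY").ascii_only?
end

# Hypothetical helper: could this raw data be valid UTF-8?
def plausible_utf8?(raw)
  raw.dup.force_encoding("UTF-8").valid_encoding?
end

latin1_umlaut = [0xE4].pack("C*")        # 'ä' saved as Latin-1
ambiguous     = [0xC3, 0xA4].pack("C*")  # 'ä' saved as UTF-8, or 'Ã¤' saved as Latin-1

puts non_ascii?("abc")               # false
puts non_ascii?(latin1_umlaut)       # true
puts plausible_utf8?(latin1_umlaut)  # false, a lone E4 byte is not valid UTF-8
puts plausible_utf8?(ambiguous)      # true, although the file may have been Latin-1
```

The last line is the problematic case: the bytes are valid UTF-8, so a parser that assumed UTF-8 would silently accept them even if the author had saved the file as Latin-1.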

Btw, besides file encodings, there are a lot of other changes in Ruby 1.9. Check out the Changelog to find out more. And beware that you can do some strange things with encodings.