“There ain’t no such thing as plain text.” Joel Spolsky, “The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!)”.

This text assumes that the current locale uses UTF-8 encoding. Behavior might differ for other encodings. I used Ruby 2.0 for all examples.

I was recently doing some text parsing in my native language, Polish. Polish uses the Latin alphabet with a few additional accented letters, such as “ąęśćłóźżń”. Some unexpected issues occurred when I tried to use them with regular expressions in Ruby.

Accented characters

Let’s first take a look at how Polish characters are represented in bytes:

[cc lang="ruby" escaped="true"]
[11] pry(main)> "test".bytes.to_a
=> [116, 101, 115, 116]
[12] pry(main)> "teść".bytes.to_a
=> [116, 101, 197, 155, 196, 135]
[/cc]
The locale on my system is set to LANG=pl_PL.UTF-8, and we can see that “ś” and “ć” take 2 bytes each, while the ASCII letters take 1 byte each.
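
To make the difference between characters and bytes a bit more tangible, here is a minimal sketch. The helper name byte_profile is my own invention for illustration (it is not a Ruby built-in), and the sketch assumes the string is UTF-8 encoded.

```ruby
# Hypothetical helper (not from Ruby itself): summarizes how a string
# is stored, so characters and bytes can be compared side by side.
def byte_profile(str)
  {
    encoding:   str.encoding.name,                      # e.g. "UTF-8"
    characters: str.length,                             # counts characters
    bytes:      str.bytesize,                           # counts raw bytes
    per_char:   str.chars.map { |c| [c, c.bytesize] }   # bytes used by each character
  }
end

# "teść" written with \u escapes, so the literal is UTF-8
# regardless of the source file encoding.
word = "te\u015B\u0107"

p byte_profile(word)
p byte_profile("test")
```

For “teść” the profile reports 4 characters but 6 bytes, which is exactly the gap that shows up in the byte array above.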

Let’s start with some examples:

[cc lang="ruby" escaped="true"]
[5] pry(main)> /\w+/.match('test')
=> #<MatchData "test">
[6] pry(main)> /\w+/.match('teść')
=> #<MatchData "te">
[/cc]

As we can see, “test” was matched correctly, while “\w+” applied to “teść” matched only the first two characters. In Ruby, “\w” covers only ASCII letters, digits and the underscore, so the match stops at “ś”. One solution to this problem is to use a POSIX bracket expression, so that the regexp looks like:

[cc lang="ruby" escaped="true"]
[8] pry(main)> /[[:alpha:]]+/.match('teść')
=> #<MatchData "teść">
[/cc]

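which works great. Keep in mind that “[[:alpha:]]” matches letters only, without the digits and underscore that “\w” accepts. The other option is to use Unicode character properties such as “\p{Word}”, which are also described in the Ruby Regexp documentation.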

[cc lang="ruby" escaped="true"]
[9] pry(main)> /\p{Word}+/.match('teść')
=> #<MatchData "teść">
[/cc]
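
To see how these classes differ on the same input, the following sketch runs each of them over one sample string. The helper name matches_by_pattern and the sample text are made up for illustration; “\p{L}” (any Unicode letter) is an extra property I added for comparison, it is not used elsewhere in this post.

```ruby
# encoding: utf-8

# Hypothetical helper: scans the text with several character classes
# and returns what each of them considers a "word".
def matches_by_pattern(text)
  {
    '\w+'          => text.scan(/\w+/),           # ASCII letters, digits, underscore
    '[[:alpha:]]+' => text.scan(/[[:alpha:]]+/),  # Unicode letters only
    '\p{Word}+'    => text.scan(/\p{Word}+/),     # Unicode letters, marks, digits, underscore
    '\p{L}+'       => text.scan(/\p{L}+/)         # Unicode letters only
  }
end

sample = "teść_1 żółw"

matches_by_pattern(sample).each do |label, found|
  puts "#{label.ljust(14)} #{found.inspect}"
end
```

With this sample, “\w+” breaks the text into the fragments “te”, “_1” and “w”, both letter classes return “teść” and “żółw”, and “\p{Word}+” keeps “teść_1” in one piece. Which one is right depends on whether digits and underscores should count as part of a word.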


Whitespace characters

Let’s say we have text coming from a text editor or from web pages. I would assume that “\s” works as a sufficient whitespace separator. I would be wrong. Let’s take a look at the following example:

[cc lang="ruby" escaped="true"]
[1] pry(main)> /\w+\s\w+/.match "word word" # regular space
=> #<MatchData "word word">

[2] pry(main)> /\w+\s\w+/.match "word\u00A0word" # non-breaking space, written as \u00A0 to make it visible
=> nil
[/cc]


(Note that in the second example there is a non-breaking space between the words. In Windows you can insert one by pressing Alt+0160 (digits on the numeric keypad), in Vim by pressing <C-k> <space> <space>.)
So to make it work with non-breaking spaces, you can either convert the text by replacing non-breaking spaces with regular spaces or, if you want to preserve the information, match them explicitly:

[cc lang="ruby" escaped="true"]
[3] pry(main)> /\w+[[:space:]]\w+/.match "word\u00A0word" # non-breaking space, written as \u00A0
=> #<MatchData "word word">
# or
[4] pry(main)> /\w+[\s\u00A0]\w+/.match "word\u00A0word"
=> #<MatchData "word word">
[/cc]
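
The first option, converting non-breaking spaces before matching, could look roughly like the sketch below. The helper names normalize_nbsp and split_words are hypothetical and exist only for this example; I also assume that U+00A0 is the only special space you care about when normalizing.

```ruby
# Hypothetical helper: replaces only the non-breaking space (U+00A0)
# with a regular space and leaves all other whitespace untouched.
def normalize_nbsp(text)
  text.gsub("\u00A0", " ")
end

# Hypothetical helper: splits on runs of any Unicode whitespace,
# so the original text does not have to be modified at all.
def split_words(text)
  text.split(/[[:space:]]+/)
end

text = "word\u00A0word"

p text.split(/\s+/).length   # \s does not see the non-breaking space
p split_words(text)          # [[:space:]] does
p normalize_nbsp(text) =~ /\w+\s\w+/
```

Normalizing first keeps the later regular expressions simple, while splitting on “[[:space:]]” leaves the input as it was, which matters if the non-breaking spaces have to survive in the output.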

Well, that is it: a couple of thoughts worth remembering when creating regular expressions that are meant to work well with international character sets.

