“There ain’t no such thing as plain text.” Joel Spolsky, “The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!)”.
This text assumes that the current locale uses the UTF-8 encoding. Behavior might differ for other encodings. The examples were evaluated with Ruby 2.0.
I was recently doing some text parsing in my native language (which is Polish). Polish uses the Latin alphabet with a few additional accented letters, such as “ąęśćłóźżń”. Some unexpected issues occurred when I tried to use them with regular expressions in Ruby.
Accented characters
Let’s first take a look at how Polish characters are represented in bytes:
[cc lang=”ruby” escaped=”true”]
[11] pry(main)> "test".bytes.to_a
=> [116, 101, 115, 116]
[12] pry(main)> "teść".bytes.to_a
=> [116, 101, 197, 155, 196, 135]
[/cc]
The locale on my system is set to LANG=pl_PL.UTF-8, and we can see that “ś” and “ć” take 2 bytes each.
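To make the difference between characters and bytes easier to see, here is a minimal sketch. The helper name `bytes_per_char` is my own invention for illustration, and the sketch assumes the source file is read as UTF-8 (the default since Ruby 2.0):

```ruby
# Hypothetical helper: pairs each character with the bytes that encode it.
def bytes_per_char(text)
  text.each_char.map { |char| [char, char.bytes.to_a] }
end

word = "teść"
puts word.length    # number of characters: 4
puts word.bytesize  # number of bytes in UTF-8: 6

# Prints one line per character, e.g. "ś -> [197, 155]"
bytes_per_char(word).each do |char, bytes|
  puts "#{char} -> #{bytes.inspect}"
end
```

The ASCII letters take a single byte each, while every accented letter takes two, which is why the byte count is higher than the character count.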
Now let’s try some regular expressions on these words:
[cc lang=”ruby” escaped=”true”]
[5] pry(main)> /\w+/.match('test')
=> #<MatchData "test">
[6] pry(main)> /\w+/.match('teść')
=> #<MatchData "te">
[/cc]
As we can see, “test” was matched correctly, while “\w+” applied to “teść” matched only the first two characters. This is because in Ruby “\w” covers only ASCII letters, digits and the underscore. One solution to this problem is to use a POSIX bracket expression, so that the regexp looks like:
[cc lang=”ruby” escaped=”true”]
[8] pry(main)> /[[:alpha:]]+/.match('teść')
=> #<MatchData "teść">
[/cc]
This works great. The other option is to use Unicode character properties such as “\p{Word}”, which are also described in the Ruby Regexp documentation.
[cc lang=”ruby” escaped=”true”]
[9] pry(main)> /\p{Word}+/.match('teść')
=> #<MatchData "teść">
[/cc]
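The two approaches are not quite interchangeable, so here is a small sketch that puts the three patterns side by side. The constant names and the `extract_words` helper are hypothetical, made up for this illustration; the sample sentence is a common Polish pangram fragment:

```ruby
ASCII_WORD   = /\w+/           # \w covers only a-z, A-Z, 0-9 and _
POSIX_ALPHA  = /[[:alpha:]]+/  # letters only, Unicode-aware
UNICODE_WORD = /\p{Word}+/     # letters, marks, digits and connector punctuation such as _

# Hypothetical helper: returns every run of characters matched by pattern.
def extract_words(text, pattern)
  text.scan(pattern)
end

sentence = "Zażółć gęślą jaźń"
p extract_words(sentence, ASCII_WORD)    # => ["Za", "g", "l", "ja"]
p extract_words(sentence, POSIX_ALPHA)   # => ["Zażółć", "gęślą", "jaźń"]
p extract_words(sentence, UNICODE_WORD)  # => ["Zażółć", "gęślą", "jaźń"]

# The difference shows up once digits or underscores are involved.
p extract_words("teść_2", POSIX_ALPHA)   # => ["teść"]
p extract_words("teść_2", UNICODE_WORD)  # => ["teść_2"]
```

So “[[:alpha:]]” is the better fit when only letters should count as part of a word, while “\p{Word}” behaves like a Unicode-aware version of “\w”.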
Whitespace characters
Let’s say we have text coming from a text editor or from web pages. I would assume that “\s” is sufficient to match the space between words. I would be wrong. Let’s take a look at the following example:
[cc lang=”ruby” escaped=”true”]
[1] pry(main)> /\w+\s\w+/.match "word word" # regular space
=> #<MatchData "word word">
[2] pry(main)> /\w+\s\w+/.match "word\u00A0word" # non-breaking space
=> nil
[/cc]
(Note that in the second example there is a non-breaking space between the words. On Windows you can insert one by pressing Alt+0160 (digits on the numeric keypad), in Vim by pressing <C-k> <space> <space>.)
So to make it work with non-breaking spaces, you can either convert the text, replacing non-breaking spaces with regular ones, or, if you want to preserve that information, match them explicitly:
[cc lang=”ruby” escaped=”true”]
[3] pry(main)> /\w+[[:space:]]\w+/.match "word\u00A0word" # non-breaking space
=> #<MatchData "word word">
# or
[4] pry(main)> /\w+[\s\u00A0]\w+/.match "word\u00A0word"
=> #<MatchData "word word">
[/cc]
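Both options can be sketched as small helpers. The names `NBSP`, `normalize_spaces` and `split_words` are hypothetical, chosen only for this example; the first helper throws away the information about non-breaking spaces, the second leaves the text untouched and only splits on any Unicode whitespace:

```ruby
NBSP = "\u00A0" # non-breaking space

# Option 1: replace non-breaking spaces with regular ones, so that \s works.
def normalize_spaces(text)
  text.gsub(NBSP, " ")
end

# Option 2: keep the text as it is and split on any Unicode whitespace.
def split_words(text)
  text.split(/[[:space:]]+/)
end

text = "word#{NBSP}word"
p(text =~ /\w+\s\w+/)                   # => nil
p(normalize_spaces(text) =~ /\w+\s\w+/) # => 0
p split_words(text)                     # => ["word", "word"]
```

Which one fits better depends on whether the text is written back out later: a non-breaking space is usually there on purpose, so I would normalize only a copy used for matching.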
Well, that’s it: a couple of thoughts worth remembering when creating regular expressions that are meant to work with international character sets.