PSla Blog

Blog Piotra Ślatały | Peter Slatala's Blog

Ruby – Regex – Special characters

“There ain’t no such thing as plain text.” Joel Spolsky (on what every developer must know about Unicode).

This text assumes that the current locale uses UTF-8 encoding. Behavior might differ for other encodings. I use Ruby 2.0 for evaluation.

I was recently doing some text parsing in my native language (which is Polish). Polish uses the Latin alphabet with a few additional accented letters, such as “ąęśćłóźżń”. Some unexpected issues occurred when I tried to use them with regular expressions in Ruby.

Accent characters

Let’s first take a look at how Polish characters are represented in bytes:

[cc lang="ruby" escaped="true"]
[11] pry(main)> "test".bytes.to_a
=> [116, 101, 115, 116]
[12] pry(main)> "teść".bytes.to_a
=> [116, 101, 197, 155, 196, 135]
[/cc]
The locale on my system is set to LANG=pl_PL.UTF-8, and we can see that “ś” and “ć” take 2 bytes each.
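
The gap between characters and bytes can also be checked directly. The snippet below is a minimal sketch, assuming the source file is saved as UTF-8 (the default source encoding since Ruby 2.0); the variable names are only illustrative.

```ruby
word = "teść"

char_count = word.length     # number of characters
byte_count = word.bytesize   # number of bytes in the UTF-8 encoding

# Pair every character with the bytes that encode it.
bytes_per_char = word.each_char.map { |c| [c, c.bytes.to_a] }

puts "#{char_count} characters, #{byte_count} bytes"
bytes_per_char.each { |c, bytes| puts "#{c} -> #{bytes.inspect}" }
```

The ASCII letters map to one byte each, while “ś” and “ć” map to two, which is why the byte count is larger than the character count.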

Let’s start with some examples:

[cc lang="ruby" escaped="true"]

[5] pry(main)> /\w+/.match('test')
=> #<MatchData "test">
[6] pry(main)> /\w+/.match('teść')
=> #<MatchData "te">
[/cc]

By now we can see that “test” was matched correctly, while “\w+” applied to “teść” matched only the first two characters. This is because in Ruby “\w” covers only ASCII letters, digits and the underscore. One solution to this problem is to use a POSIX bracket expression, so that the regexp looks like:

[cc lang="ruby" escaped="true"]
[8] pry(main)> /[[:alpha:]]+/.match('teść')
=> #<MatchData "teść">
[/cc]

which works great. The other solution is to use Unicode character properties such as “\p{Word}”, which are also described in the Ruby Regexp documentation.

[cc lang="ruby" escaped="true"]
[9] pry(main)> /\p{Word}+/.match('teść')
=> #<MatchData "teść">
[/cc]
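
The three classes are not interchangeable, so it may help to compare them side by side. The following is a rough sketch, assuming UTF-8 source; the sample sentence and the variable names are made up for illustration.

```ruby
text = "Zażółć gęślą jaźń"

ascii_words   = text.scan(/\w+/)           # \w is ASCII-only in Ruby
alpha_words   = text.scan(/[[:alpha:]]+/)  # POSIX bracket expression, Unicode-aware
unicode_words = text.scan(/\p{Word}+/)     # Unicode character property

# [[:alpha:]] covers letters only; \p{Word} also accepts digits and "_".
alpha_only = "teść_2".match(/[[:alpha:]]+/)[0]
word_chars = "teść_2".match(/\p{Word}+/)[0]

p ascii_words
p alpha_words
p unicode_words
```

With “\w+” every accented letter ends the current match, so the words get cut into ASCII-only pieces. Which of the other two to pick depends on whether digits and underscores should count as part of a word.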

Whitespace characters

Let’s say we have text coming from a text editor or from web pages. I would assume that “\s” is a sufficient space separator. I would be wrong. Let’s take a look at the following example:

[cc lang="ruby" escaped="true"]

[1] pry(main)> /\w+\s\w+/.match "word word" # regular space
=> #<MatchData "word word">

[2] pry(main)> /\w+\s\w+/.match "word\u00A0word" # non-breaking space (U+00A0), written as an escape so it stays visible
=> nil

[/cc]

(Note that in the second example there is a non-breaking space, U+00A0, between the words. On Windows you can insert one by pressing Alt+0160 (digits on the numeric keypad), in Vim by pressing <C-k> <space> <space>.)
So to make it work with non-breaking spaces you can either replace them with regular spaces before matching or, if you want to preserve the information, match them explicitly:

[cc lang="ruby" escaped="true"]
[3] pry(main)> /\w+[[:space:]]\w+/.match "word\u00A0word" # non-breaking space
=> #<MatchData "word word">
# or, listing the non-breaking space explicitly next to \s
[4] pry(main)> /\w+[\s\u00A0]\w+/.match "word\u00A0word"
=> #<MatchData "word word">
[/cc]
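
The first option, replacing non-breaking spaces before matching, could look roughly like the sketch below. The helper name `normalize_spaces` and the `NBSP` constant are hypothetical, introduced only for this example; the code assumes UTF-8 strings.

```ruby
NBSP = "\u00A0" # non-breaking space

# Hypothetical helper: replace every non-breaking space with a regular one.
def normalize_spaces(text)
  text.tr(NBSP, " ")
end

raw = "word#{NBSP}word"

before = /\w+\s\w+/.match(raw)                    # \s does not cover U+00A0
after  = /\w+\s\w+/.match(normalize_spaces(raw))  # matches after normalization

# Alternatively, keep the text untouched and split on any Unicode whitespace.
parts = raw.split(/[[:space:]]+/)

p raw.codepoints.to_a # 160 marks the non-breaking space
```

Printing the codepoints is a handy way to spot a non-breaking space, since it looks exactly like a regular space on screen.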

Well, that is it: a couple of thoughts worth remembering when writing regular expressions that are meant to work with international character sets.
