Recently I needed to clean some address data at a customer. Although the national addresses were stored  highly normalized whenever it a foreign address was used all bets seemed to be off. For most part city field seems to contain a postal code as well as the actual city name. Problem being the format differed quit a lot.

  • Some cities had leading postal codes like (B1000 ANTWERP)
  • Some cities had trailing postal codes like (LEETON 2705 NSW)
  • Some city names consist of multiple words separated by tabs,space and/or punctional sings (like SAN PAULO or SAN-PAULO)
  • Some city names contain diacritics (like KECIÖREN)
  • The format of the postal code for a single country are written in many different formats. (B1000 ANTWERP, could be 1000 ANTWERP or B 1000 ANTWERP or B-1000 ANTWERP)

The goal here was to retain only the cityname with as little effort as possible. Of course one could develop an algorithm for each country specific, but since the country was not known for many occasions the only sane way was pattern matching. Hello REGEX! I don’t intend to be 100% correct, but as a long as I can clean 95% it’s good enough for my purpose

I decided to tackle the problem one by one.

Remove leading postal codes
To match leading postal code I developed the following regular expression, which breaks down to this

^               Try to match from the start of the string
[[:alpha:]]{0,3}    Optionally followed by 0 to 3 alphanumeric characters
[[:blank:]]?        Optionally followed by 1 blank character
[[:punct:]]?        Optionally followed by 1 punctual character
Followed by either
[[:digit:]]*      Optionally 0 or more digits
([[:blank:]]|[[:punct:]])?   Optionally followed by a blank or a punctual character
[[:digit:]]+                 Followed by 1 or more digits
Or
[[:digit:]]{2}                          2 digits
([[:blank:]]|[[:punct:]])?   Followed by 1  or 2 blank or a punctual characters
([[:blank:]]|[[:punct:]])?   Followed by 3 digits
[[:blank:]]+      Followed by 1 or more blanks

Remove leading postal codes
To match trailing postal code I developed the following regular expression, which breaks down to this
[[:blank:]]                  Try to match  a blank character
[[:alnum:]]{0,3}             Optionally followed by 0 to 3 alphanumeric characters
([[:blank:]]|[[:punct:]])?   Optionally followed by a blank or a punctual character
[[:digit:]]+                 Followed by 1 or more digits
[[:blank:]]?                 Optionally followed by 1 blank character
[[:alnum:]]*                 Optionally followed by 0 or more alphanumeric characters
$                            Followed by the end of the string

Using a the REGEXP_REPLACE function I replace any matched string by a zero string, essentially removing the trailing postal code if found.

Removing non alpha characters
Even after removing most leading and trailing postalcodes, there are some extremes left that are not matched by the previous regex (and never will be because these are to messed up to take into account).

Since city names do not tend to incorporate anything but alphanumeric characters, it’s a safe bet to simply remove any non alphanumeric character left. It’s a bit of a horse pill but it does help to clean to tight corners :)

Before I met regex I would try to accomplish the same result using the translate function in Oracle, however regex is much cleaner,better and easier. To remove the non alpha characters I simple did the following:
[^[:alpha:]]+        Try to match 1 or more non alpha characters (the ^ inverses the class [[:alpha:]])
Using a the REGEXP_REPLACE function I simply replace any non alpha character  by a space

Removing non alpha characters
The original string might contain double spaces, either original or as a result of replacing an non alphanumeric by space. To remove extra spaces again I use regex:
[[:blank:]]{2,}       Try to match a 2 or more spaces
Using a the REGEXP_REPLACE function I simply replace any number of spaces by found by a single space

Remove diacritics
Whoever invented these signs must not have been working with computers. There a pain! I want them gone and replaced by their respective base characters. As far as I know there is no way of doing this using regex. Lucky me there is a need trick using nlssort to sort accent insensitive and converting that output back to characters.

All together now!
When you put all of the above techniques together into a single statement, you can very easily clean up a messed up input like the example below.

Have fun!

INPUT ALL_TOGETHER
angered-goteborg ANGERED GOTEBORG
angered – goteborg ANGERED GOTEBORG
8767 eim gl schweiz EIM GL SCHWEIZ
8767 eim gl schweiz EIM GL SCHWEIZ
04910020 JOAPESSOA JOAPESSOA
CEP 58055 320 JOAPESSOA JOAPESSOA
KAYSER 4 KAYSER
4 KAYSER KAYSER
keciören KECIOREN
kecioren KECIOREN
2400 MOL-WEZEL MOL WEZEL
B2400 MOL-WEZEL MOL WEZEL
B- 2400 MOL-WEZEL MOL WEZEL
B 2400 MOL-WEZEL MOL WEZEL
B-2400 MOL-WEZEL MOL WEZEL
* ONBEKEND
. ONBEKEND
ONBEKEND
DO6484 QUEDLLINBURG QUEDLLINBURG
O6484 QUEDLLINBURG QUEDLLINBURG
47 400 RACIBORZ RACIBORZ
47-400 RACIBORZ RACIBORZ
SAN PAULO SAN PAULO
SAN-PAULO SAN PAULO
1453 vésenaz VESENAZ
vésenaz VESENAZ
WANMIASSA 2705 NSW WANMIASSA
WANMIASSA ACT 2903 WANMIASSA
WANMIASSA ACT 29035 WANMIASSA
WILLENHALL WV133QD WILLENHALL
WILLENHALL CV5 6F2 WILLENHALL
WILLENHALL WV133QD WILLENHALL

Posted in Oracle

Leave a Reply