Recently I needed to clean some address data at a customer. Although the national addresses were stored  highly normalized whenever it a foreign address was used all bets seemed to be off. For most part city field seems to contain a postal code as well as the actual city name. Problem being the format differed quit a lot.

  • Some cities had leading postal codes like (B1000 ANTWERP)
  • Some cities had trailing postal codes like (LEETON 2705 NSW)
  • Some city names consist of multiple words separated by tabs,space and/or punctional sings (like SAN PAULO or SAN-PAULO)
  • Some city names contain diacritics (like KECIÖREN)
  • The format of the postal code for a single country are written in many different formats. (B1000 ANTWERP, could be 1000 ANTWERP or B 1000 ANTWERP or B-1000 ANTWERP)

The goal here was to retain only the cityname with as little effort as possible. Of course one could develop an algorithm for each country specific, but since the country was not known for many occasions the only sane way was pattern matching. Hello REGEX! I don’t intend to be 100% correct, but as a long as I can clean 95% it’s good enough for my purpose

I decided to tackle the problem one by one.

Remove leading postal codes
To match leading postal code I developed the following regular expression, which breaks down to this

^               Try to match from the start of the string
[[:alpha:]]{0,3}    Optionally followed by 0 to 3 alphanumeric characters
[[:blank:]]?        Optionally followed by 1 blank character
[[:punct:]]?        Optionally followed by 1 punctual character
Followed by either
[[:digit:]]*      Optionally 0 or more digits
([[:blank:]]|[[:punct:]])?   Optionally followed by a blank or a punctual character
[[:digit:]]+                 Followed by 1 or more digits
Or
[[:digit:]]{2}                          2 digits
([[:blank:]]|[[:punct:]])?   Followed by 1  or 2 blank or a punctual characters
([[:blank:]]|[[:punct:]])?   Followed by 3 digits
[[:blank:]]+      Followed by 1 or more blanks

Remove leading postal codes
To match trailing postal code I developed the following regular expression, which breaks down to this
[[:blank:]]                  Try to match  a blank character
[[:alnum:]]{0,3}             Optionally followed by 0 to 3 alphanumeric characters
([[:blank:]]|[[:punct:]])?   Optionally followed by a blank or a punctual character
[[:digit:]]+                 Followed by 1 or more digits
[[:blank:]]?                 Optionally followed by 1 blank character
[[:alnum:]]*                 Optionally followed by 0 or more alphanumeric characters
$                            Followed by the end of the string

Using a the REGEXP_REPLACE function I replace any matched string by a zero string, essentially removing the trailing postal code if found.

Removing non alpha characters
Even after removing most leading and trailing postalcodes, there are some extremes left that are not matched by the previous regex (and never will be because these are to messed up to take into account).

Since city names do not tend to incorporate anything but alphanumeric characters, it’s a safe bet to simply remove any non alphanumeric character left. It’s a bit of a horse pill but it does help to clean to tight corners :)

Before I met regex I would try to accomplish the same result using the translate function in Oracle, however regex is much cleaner,better and easier. To remove the non alpha characters I simple did the following:
[^[:alpha:]]+        Try to match 1 or more non alpha characters (the ^ inverses the class [[:alpha:]])
Using a the REGEXP_REPLACE function I simply replace any non alpha character  by a space

Removing non alpha characters
The original string might contain double spaces, either original or as a result of replacing an non alphanumeric by space. To remove extra spaces again I use regex:
[[:blank:]]{2,}       Try to match a 2 or more spaces
Using a the REGEXP_REPLACE function I simply replace any number of spaces by found by a single space

Remove diacritics
Whoever invented these signs must not have been working with computers. There a pain! I want them gone and replaced by their respective base characters. As far as I know there is no way of doing this using regex. Lucky me there is a need trick using nlssort to sort accent insensitive and converting that output back to characters.

All together now!
When you put all of the above techniques together into a single statement, you can very easily clean up a messed up input like the example below.

Have fun!

INPUT ALL_TOGETHER
angered-goteborg ANGERED GOTEBORG
angered – goteborg ANGERED GOTEBORG
8767 eim gl schweiz EIM GL SCHWEIZ
8767 eim gl schweiz EIM GL SCHWEIZ
04910020 JOAPESSOA JOAPESSOA
CEP 58055 320 JOAPESSOA JOAPESSOA
KAYSER 4 KAYSER
4 KAYSER KAYSER
keciören KECIOREN
kecioren KECIOREN
2400 MOL-WEZEL MOL WEZEL
B2400 MOL-WEZEL MOL WEZEL
B- 2400 MOL-WEZEL MOL WEZEL
B 2400 MOL-WEZEL MOL WEZEL
B-2400 MOL-WEZEL MOL WEZEL
* ONBEKEND
. ONBEKEND
ONBEKEND
DO6484 QUEDLLINBURG QUEDLLINBURG
O6484 QUEDLLINBURG QUEDLLINBURG
47 400 RACIBORZ RACIBORZ
47-400 RACIBORZ RACIBORZ
SAN PAULO SAN PAULO
SAN-PAULO SAN PAULO
1453 vésenaz VESENAZ
vésenaz VESENAZ
WANMIASSA 2705 NSW WANMIASSA
WANMIASSA ACT 2903 WANMIASSA
WANMIASSA ACT 29035 WANMIASSA
WILLENHALL WV133QD WILLENHALL
WILLENHALL CV5 6F2 WILLENHALL
WILLENHALL WV133QD WILLENHALL

Posted in Oracle

Execute immediate results in ora-00911 invalid character

Note to self: when receiving this error while executing DML using execute immediate, check for a semicolon (;) at the end of your statement. REMOVE!

Read more ›


Posted in Oracle, PEBKAC, Pl/Sql Tagged with: , ,

Unintended recursiveness

Today I learned another PL/SQL feature… the hard way.

I was busy writing  some new code to import and process meta data. When testing the procedures performing a simple import  my procedure call kept running indefinitely and the OS load of my VM would go over 5. Definitely some code looping out of control. Been there, done that, doesn’t happen that often anymore. The other option would be some really bad performing SQL. Problem is the code I was testing contained neither. Killed my process, reviewed my code,reviewed the logging, didn’t see anything wrong, blamed my VM and tried again and expecting different behaviour…

Read more ›


Posted in Oracle Tagged with: ,

DBMS_ASSERT.SQL_OBJECT_NAME

DBMS_ASSERT is one of the oracle supplied pl/sql packages which contains useful functionality. Amongst others it supplies the function dbms_assert.sql_object_name. Although the function is very useful, it’s poorly documented and has several caveats you should be aware that are not mentioned.

Read more ›


Posted in Oracle Tagged with: ,

About triggers

A trigger is a named PL/SQL unit that is stored in the database and fired (run) in response to a specified event. The specified event is defined on (associated with) either a table, a view, a schema, or the database, and it is one of these:

  • The database manipulation language (DML) statement DELETE, INSERT, or UPDATE
  • A database definition language (DDL) statement such as CREATE, ALTER, or DROP
  • A database operation such as SERVERERROR, LOGON, LOGOFF, STARTUP, or SHUTDOWN

Read more ›


Posted in Oracle, Triggers Tagged with: ,

Compound triggers

If you put business rules into triggers validating your data your could easily grow a considerable forest around your tables. Making it hard to see for anyone exactly how all these triggers work together or even worse.. fight each other. Oracle recognized this and introduced the compound trigger. Basically it’s a special package… for triggers in which you can consolidate all your different triggers. What’s even better is that like packages these compound triggers can have a global declarations section. Again just like your packages.

Read more ›


Posted in Oracle, Pl/Sql Tagged with: , , ,

Predictable execution order of triggers

Prior to 11g if you defined multiple triggers of the same type on the same table it was impossible to guarantee which trigger would fire first. Oracle 11g release includes a new instruction FOLLOWS. This allows you to specify which trigger “FOLLOWS” the other triggers.
The basic syntax is:

FOLLOWS <trigger_name>, <trigger_name2>, ...

Read more ›


Posted in Oracle, Triggers Tagged with: , ,

PL/SQL Function Result Cache

With release 1 of 11g Oracle introduced PL/SQL Function Result Cache. As of release 2 of 11g Oracle further improved PL/SQL Function Result Cache. I’m not saying it’s the best invention since slices bread, but it’s up there. This really is a very nifty function I intent to use a lot. If you’re still on 10g then combined with conditional compiling you can already start preparing for this feature today!

Simply put: any table function returning data from tables, that are queried more often then updated benefits from the use of Function result Cache

Read more ›


Posted in Oracle, Pl/Sql Tagged with: , ,

Use named and mixed notation from SQL

For Pl/SQL program invocations it has named & mixed has been around for quite some time.However it you attempted the same function call from SQL prior to Oracle 11g this would lead to an error.

With Oracle 11g you can now use named & mixed notation when you call a function from SQL. This makes it much easier to call a function with lots of formal parameters of which you only want to specify 1 or 2 parameters

Read more ›


Posted in Oracle, Pl/Sql Tagged with: , ,

Dynamic SQL Enhancements

Changes have been made to both Native Dynamic SQL (NDS) and the DBMS_SQL package:

Changes to NDS:

  • the 32k limit has been lifted; EXECUTE IMMEDIATE now allows CLOB values passed in
  • the 32k limit has been lifted; OPEN FORALL now allows CLOB values passed in

Changes to DBMS_SQL package:

  • the 32k limit has been lifted; The DBMS_SQL.PARSE is now overloaded for CLOB argument
  • New TO_REFCURSOR function; Allowing switching to NDS from DBMS_SQL
  • New TO_CURSOR_NUMBER function; Allowing switching to DBMS_SQL from NDS
  • All data types that native dynamic SQL supports are supported.

Posted in Oracle, Pl/Sql Tagged with: , ,