Microsoft Access to PostgreSQL

Recently a client wanted to convert some Microsoft Access databases into PostgreSQL. There are several ways to do the conversion, including commercial tools, but my client choose to first try using the Open Source mdbtools (also mdbtools on Sourceforge; but that appears to be well out of date).

Ubuntu 14.04LTS packages mdbtools 0.7.1, which is the latest release, but that version is about 18 months old. After testing it on the conversion I found several issues including:

incorrect handling of namespace (SCHEMA) values around INDEX and CONSTRAINT names in the PostgreSQL schema export (Github Pull Request)
boolean data values only exported as 0 and 1 bare integer values, but PostgreSQL does not implicitly convert 0/1 to boolean; it expects, eg, SQL Standard TRUE and FALSE (GitHub Pull Request)
the detection of index flags (eg, UNIQUE) on Jet 4 databases was apparently looking in the wrong place for the flags, resulting in indexes being marked UNIQUE or not marked UNIQUE apparently at random (GitHub Pull Request)

so we ended up building from source, with our own patches. I have collected these patches in a mdbtools fork with postges_fixes on GitHub; hopefully eventually they will all get merged in upstream, but there are a dozen outstanding GitHub pull requests.

In addition to those problems, on one database we found that it had a multi-field FOREIGN KEY/REFERENCES pair, and mdbtools does not know how to handle multi-field FOREIGN KEY/REFERENCES fields, so it attempts to export them as two single-key references... which does not work:

ERROR:  there is no unique constraint matching given keys for referenced table...

(that error can also happen if the UNIQUE index constraint is not being detected properly; so I had a lot more of those errors before I fixed the detection of which indexes were UNIQUE!)

Since it only affected one FOREIGN KEY relationship between one pair of tables in one database, for now we worked around that one by using sed to change the first FOREIGN KEY relationship on that pair of tables to mention both keys, and omit the second one that mdbtools exported.

Along the way we also found a few other issues:

mdbtools seems to default to double quoting both field names and values, but PostgreSQL expects field names to be double quoted, and values to be single quoted (fortunately the -q command line flag can override that).
It seems to be legal in Microsoft Access to have table names that have spaces in them; but this means that the simple approach of using mdb-tables -1 DB to list the tables and iterate over them in a shell script does not work correctly. We fixed this by using a Python script to manage the conversion.
Mostly the internal Microsoft Access tables (eg, MSys...) are not exported, but occassionally some of these appear in things like FOREIGN KEY references; we chose just to omit the FOREIGN KEY references to MSys from our schema export. (MSysNavPaneGroup seemed to be the main one)
If you create the FOREIGN KEY constraints at the beginning, and then attempt to load data into the tables in an arbitrary order, then the FOREIGN KEY constraints will almost inevitably fail. Since we were already changing mdbtools to properly support CONSTRAINT names on PostgreSQL, we also changed mdbtools to export a line that allowed those CONSTRAINTs to be DEFERRABLE, eg:
```
ALTER TABLE "Orders" ALTER CONSTRAINT "Orders_ShipVia_fk" DEFERRABLE INITIALLY IMMEDIATE;
```
and then our load script can issue:
```
BEGIN;
SET CONSTRAINTS ALL DEFERRED;
[... schema ...]
[... data   ...]
COMMIT;
```
and providing the CONSTRAINTs all match at the end, the COMMIT will succeed. (The other approach is to defer adding the FOREIGN KEY CONSTRAINTs until the end.)

With all of those those work arounds the approach becomes basically (psuedo code):

mdb-schema -N SCHEMA DB postgres
mdb-tables -1 DB        # In Python wrapper script, split on newline

for TABLE in TABLES:
    mdb-export -D '%Y-%m-%d %H:%M:%S' -I postgres -N SCHEMA \
               -q "'" -B DB TABLE

with the BEGIN;/COMMIT wrapped around the whole lot as above. (This requires the patched mdbtools, to add the -B flag to get boolean values as TRUE and FALSE.)

SCHEMA is the destination schema in the PostgreSQL database that you want to put the tables into. Create the schema with:

CREATE SCHEMA ...;

When testing it is very useful to do:

\set ON_ERROR_STOP

if you are sending the data straight to psql, and also useful to run psql -b (which sets ECHO) to have it echo out the SQL that caused the error -- otherwise debugging is very hit and miss!

Because of the transaction wrapped around everything this combination just lets you try over and over again until it an run until completion.

And by putting it into its own SCHEMA, you can do:

BEGIN;
DROP SCHEMA ... CASCADE;

to remove it all and try again if you found, eg, data conversion errors. (Be careful with that command -- it will drop everything in that Schema! Make sure you named your test SCHEMA not a production one before you hit return, or COMMIT it, or you will be finding out in a hurry if your backup scripts worked....)

Database history

Microsoft Access uses the JET ("Joint Engine Technology") database technology. In particular it uses the "JET Red" database engine, which is a (basically) Microsoft Access specific fork of what was previously a Microsoft Windows "embedded database" library.

There is also "JET Blue" which is another fork starting with basically the same source. JET Blue later became the Extensible Storage Engine ("ESE"), and is used as the embedded database in lots of Microsoft products (Exchange, Active Directory, WINS, etc). Microsoft describes the two JET implementations as commpletely different, but looking at the two it is obvious they share a common history. ESE has its own Windows API, including things like JET_INDEXCREATE (but the API will not work with Microsoft Access, aka JET Red, databases).

Some more JET Red vs JET Blue history.

Other useful references

In no particular order:

jackcess Pure-Java library for reading/writing Microsoft Access databases; which seems to have the best debugged understanding of the file formats. (Also on Sourceforge, which appears to be relatively current.)
mdbtools HACKING guide to the JET Red format, which is mostly, but not completely, accurate
Jabakobob's reverse engineering of the JET Red format (which appears to make some of the same mistakes as mdbtools, due to basing their analysis in part on mdbtools analysis; in particular the index flags, like UNIQUE are shown what appears to be 4 bytes too early in both the Jakakbob's and mdbtools analysis)
Archive Team Microsoft Access file format reference, mostly covering older versions
JET (Red) Database Manipulation Tools wikibooks page; most, but not all, of which are Microsoft Windows based.
List of Microsoft Access size limitations, at least for older versions of Microsoft Access
Historical article on repairing Microsoft Access database corruption, and other historical hints on recovery
Extensible Storage Engine binary format (reverse engineered) from libesedb an Open Source library to access Microsoft ESE embedded databases; while definitely a different database to Microsoft Access (JET Red) there are lots of useful hints in the documentation of a "related" database format.
Forensics Wiki page on ESE
mdbtools for Microsoft Access to MySQL
Syntax of Microsoft Access CONSTRAINT Clause
GUI approach to creating multi-field relationships in Microsoft Access
Microsoft Access SQL to create table relationships
Microsoft Access Guide to Table Relationships
Microsoft Access hidden system tables
Microsoft Access example database: Northwind, and schema as seen in a MySQL export
Microsoft Access to creating multi-field indexes, and another example on StackOverflow