Recently a client wanted to convert some Microsoft Access databases into PostgreSQL. There are several ways to do the conversion, including commercial tools, but my client choose to first try using the Open Source mdbtools (also mdbtools on Sourceforge; but that appears to be well out of date).
Ubuntu 14.04LTS packages mdbtools 0.7.1, which is the latest
release, but that
version is about 18 months old. After testing it on the conversion
I found several issues including:
incorrect handling of namespace (SCHEMA) values around INDEX and CONSTRAINT names in the PostgreSQL schema export (Github Pull Request)
boolean data values only exported as
0and1bare integer values, but PostgreSQL does not implicitly convert0/1to boolean; it expects, eg, SQL StandardTRUEandFALSE(GitHub Pull Request)the detection of index flags (eg,
UNIQUE) on Jet 4 databases was apparently looking in the wrong place for the flags, resulting in indexes being markedUNIQUEor not markedUNIQUEapparently at random (GitHub Pull Request)
so we ended up building from source, with our own patches. I have collected these patches in a mdbtools fork with postges_fixes on GitHub; hopefully eventually they will all get merged in upstream, but there are a dozen outstanding GitHub pull requests.
In addition to those problems, on one database we found that it had
a multi-field FOREIGN KEY/REFERENCES pair, and mdbtools does
not know how to handle multi-field FOREIGN KEY/REFERENCES fields, so it attempts to export them as
two single-key references... which does not work:
ERROR: there is no unique constraint matching given keys for referenced table...
(that error can also happen if the UNIQUE index constraint is not
being detected properly; so I had a lot more of those errors before I
fixed the detection of which indexes were UNIQUE!)
Since it only affected one FOREIGN KEY relationship between one pair
of tables in one database, for now we worked around that one by
using sed to change the first FOREIGN KEY relationship on that
pair of tables to mention both keys, and omit the second one that
mdbtools exported.
Along the way we also found a few other issues:
mdbtoolsseems to default to double quoting both field names and values, but PostgreSQL expects field names to be double quoted, and values to be single quoted (fortunately the-qcommand line flag can override that).It seems to be legal in Microsoft Access to have table names that have spaces in them; but this means that the simple approach of using
mdb-tables -1 DBto list the tables and iterate over them in a shell script does not work correctly. We fixed this by using a Python script to manage the conversion.Mostly the internal Microsoft Access tables (eg,
MSys...) are not exported, but occassionally some of these appear in things likeFOREIGN KEYreferences; we chose just to omit theFOREIGN KEYreferences toMSysfrom our schema export. (MSysNavPaneGroupseemed to be the main one)If you create the
FOREIGN KEYconstraints at the beginning, and then attempt to load data into the tables in an arbitrary order, then theFOREIGN KEYconstraints will almost inevitably fail. Since we were already changingmdbtoolsto properly supportCONSTRAINTnames on PostgreSQL, we also changedmdbtoolsto export a line that allowed thoseCONSTRAINTs to beDEFERRABLE, eg:ALTER TABLE "Orders" ALTER CONSTRAINT "Orders_ShipVia_fk" DEFERRABLE INITIALLY IMMEDIATE;and then our load script can issue:
BEGIN; SET CONSTRAINTS ALL DEFERRED; [... schema ...] [... data ...] COMMIT;and providing the
CONSTRAINTs all match at the end, theCOMMITwill succeed. (The other approach is to defer adding theFOREIGN KEYCONSTRAINTs until the end.)
With all of those those work arounds the approach becomes basically (psuedo code):
mdb-schema -N SCHEMA DB postgres
mdb-tables -1 DB # In Python wrapper script, split on newline
for TABLE in TABLES:
mdb-export -D '%Y-%m-%d %H:%M:%S' -I postgres -N SCHEMA \
-q "'" -B DB TABLE
with the BEGIN;/COMMIT wrapped around the whole lot as above.
(This requires the patched mdbtools, to add the -B flag to get
boolean values as TRUE and FALSE.)
SCHEMA is the destination schema in the PostgreSQL database that
you want to put the tables into. Create the schema with:
CREATE SCHEMA ...;
When testing it is very useful to do:
\set ON_ERROR_STOP
if you are sending the data straight to psql, and also useful to run
psql -b (which sets ECHO) to have it echo out the SQL that caused
the error -- otherwise debugging is very hit and miss!
Because of the transaction wrapped around everything this combination just lets you try over and over again until it an run until completion.
And by putting it into its own SCHEMA, you can do:
BEGIN;
DROP SCHEMA ... CASCADE;
to remove it all and try again if you found, eg, data conversion errors.
(Be careful with that command -- it will drop everything in that
Schema! Make sure you named your test SCHEMA not a production one
before you hit return, or COMMIT it, or you will be finding out in
a hurry if your backup scripts worked....)
Database history
Microsoft Access
uses the JET ("Joint Engine
Technology")
database technology. In particular it uses the "JET Red" database
engine, which is a (basically) Microsoft Access specific
fork
of what was previously a Microsoft Windows "embedded database"
library.
There is also "JET Blue" which is another fork starting with basically the same source. JET Blue later became the Extensible Storage Engine ("ESE"), and is used as the embedded database in lots of Microsoft products (Exchange, Active Directory, WINS, etc). Microsoft describes the two JET implementations as commpletely different, but looking at the two it is obvious they share a common history. ESE has its own Windows API, including things like JET_INDEXCREATE (but the API will not work with Microsoft Access, aka JET Red, databases).
Some more JET Red vs JET Blue history.
Other useful references
In no particular order:
jackcess Pure-Java library for reading/writing Microsoft Access databases; which seems to have the best debugged understanding of the file formats. (Also on Sourceforge, which appears to be relatively current.)
mdbtools HACKING guide to the JET Red format, which is mostly, but not completely, accurate
Jabakobob's reverse engineering of the JET Red format (which appears to make some of the same mistakes as
mdbtools, due to basing their analysis in part onmdbtoolsanalysis; in particular the index flags, likeUNIQUEare shown what appears to be 4 bytes too early in both the Jakakbob's andmdbtoolsanalysis)Archive Team Microsoft Access file format reference, mostly covering older versions
JET (Red) Database Manipulation Tools wikibooks page; most, but not all, of which are Microsoft Windows based.
List of Microsoft Access size limitations, at least for older versions of Microsoft Access
Historical article on repairing Microsoft Access database corruption, and other historical hints on recovery
Extensible Storage Engine binary format (reverse engineered) from libesedb an Open Source library to access Microsoft ESE embedded databases; while definitely a different database to Microsoft Access (JET Red) there are lots of useful hints in the documentation of a "related" database format.
GUI approach to creating multi-field relationships in Microsoft Access
Microsoft Access example database: Northwind, and schema as seen in a MySQL export
Microsoft Access to creating multi-field indexes, and another example on StackOverflow