Recently a client wanted to convert some Microsoft Access databases into PostgreSQL. There are several ways to do the conversion, including commercial tools, but my client choose to first try using the Open Source mdbtools (also mdbtools on Sourceforge; but that appears to be well out of date).
Ubuntu 14.04LTS packages mdbtools
0.7.1, which is the latest
release, but that
version is about 18 months old. After testing it on the conversion
I found several issues including:
incorrect handling of namespace (SCHEMA) values around INDEX and CONSTRAINT names in the PostgreSQL schema export (Github Pull Request)
boolean data values only exported as
0
and1
bare integer values, but PostgreSQL does not implicitly convert0
/1
to boolean; it expects, eg, SQL StandardTRUE
andFALSE
(GitHub Pull Request)the detection of index flags (eg,
UNIQUE
) on Jet 4 databases was apparently looking in the wrong place for the flags, resulting in indexes being markedUNIQUE
or not markedUNIQUE
apparently at random (GitHub Pull Request)
so we ended up building from source, with our own patches. I have collected these patches in a mdbtools fork with postges_fixes on GitHub; hopefully eventually they will all get merged in upstream, but there are a dozen outstanding GitHub pull requests.
In addition to those problems, on one database we found that it had
a multi-field FOREIGN KEY
/REFERENCES
pair, and mdbtools
does
not know how to handle multi-field FOREIGN KEY
/REFERENCES
fields, so it attempts to export them as
two single-key references... which does not work:
ERROR: there is no unique constraint matching given keys for referenced table...
(that error can also happen if the UNIQUE
index constraint is not
being detected properly; so I had a lot more of those errors before I
fixed the detection of which indexes were UNIQUE
!)
Since it only affected one FOREIGN KEY
relationship between one pair
of tables in one database, for now we worked around that one by
using sed
to change the first FOREIGN KEY
relationship on that
pair of tables to mention both keys, and omit the second one that
mdbtools
exported.
Along the way we also found a few other issues:
mdbtools
seems to default to double quoting both field names and values, but PostgreSQL expects field names to be double quoted, and values to be single quoted (fortunately the-q
command line flag can override that).It seems to be legal in Microsoft Access to have table names that have spaces in them; but this means that the simple approach of using
mdb-tables -1 DB
to list the tables and iterate over them in a shell script does not work correctly. We fixed this by using a Python script to manage the conversion.Mostly the internal Microsoft Access tables (eg,
MSys...
) are not exported, but occassionally some of these appear in things likeFOREIGN KEY
references; we chose just to omit theFOREIGN KEY
references toMSys
from our schema export. (MSysNavPaneGroup
seemed to be the main one)If you create the
FOREIGN KEY
constraints at the beginning, and then attempt to load data into the tables in an arbitrary order, then theFOREIGN KEY
constraints will almost inevitably fail. Since we were already changingmdbtools
to properly supportCONSTRAINT
names on PostgreSQL, we also changedmdbtools
to export a line that allowed thoseCONSTRAINT
s to beDEFERRABLE
, eg:ALTER TABLE "Orders" ALTER CONSTRAINT "Orders_ShipVia_fk" DEFERRABLE INITIALLY IMMEDIATE;
and then our load script can issue:
BEGIN; SET CONSTRAINTS ALL DEFERRED; [... schema ...] [... data ...] COMMIT;
and providing the
CONSTRAINT
s all match at the end, theCOMMIT
will succeed. (The other approach is to defer adding theFOREIGN KEY
CONSTRAINT
s until the end.)
With all of those those work arounds the approach becomes basically (psuedo code):
mdb-schema -N SCHEMA DB postgres
mdb-tables -1 DB # In Python wrapper script, split on newline
for TABLE in TABLES:
mdb-export -D '%Y-%m-%d %H:%M:%S' -I postgres -N SCHEMA \
-q "'" -B DB TABLE
with the BEGIN;
/COMMIT
wrapped around the whole lot as above.
(This requires the patched mdbtools
, to add the -B
flag to get
boolean values as TRUE
and FALSE
.)
SCHEMA
is the destination schema in the PostgreSQL database that
you want to put the tables into. Create the schema with:
CREATE SCHEMA ...;
When testing it is very useful to do:
\set ON_ERROR_STOP
if you are sending the data straight to psql
, and also useful to run
psql -b
(which sets ECHO
) to have it echo out the SQL that caused
the error -- otherwise debugging is very hit and miss!
Because of the transaction wrapped around everything this combination just lets you try over and over again until it an run until completion.
And by putting it into its own SCHEMA
, you can do:
BEGIN;
DROP SCHEMA ... CASCADE;
to remove it all and try again if you found, eg, data conversion errors.
(Be careful with that command -- it will drop everything in that
Schema! Make sure you named your test SCHEMA
not a production one
before you hit return, or COMMIT
it, or you will be finding out in
a hurry if your backup scripts worked....)
Database history
Microsoft Access
uses the JET
("Joint Engine
Technology")
database technology. In particular it uses the "JET Red" database
engine, which is a (basically) Microsoft Access specific
fork
of what was previously a Microsoft Windows "embedded database"
library.
There is also "JET Blue" which is another fork starting with basically the same source. JET Blue later became the Extensible Storage Engine ("ESE"), and is used as the embedded database in lots of Microsoft products (Exchange, Active Directory, WINS, etc). Microsoft describes the two JET implementations as commpletely different, but looking at the two it is obvious they share a common history. ESE has its own Windows API, including things like JET_INDEXCREATE (but the API will not work with Microsoft Access, aka JET Red, databases).
Some more JET Red vs JET Blue history.
Other useful references
In no particular order:
jackcess Pure-Java library for reading/writing Microsoft Access databases; which seems to have the best debugged understanding of the file formats. (Also on Sourceforge, which appears to be relatively current.)
mdbtools HACKING guide to the JET Red format, which is mostly, but not completely, accurate
Jabakobob's reverse engineering of the JET Red format (which appears to make some of the same mistakes as
mdbtools
, due to basing their analysis in part onmdbtools
analysis; in particular the index flags, likeUNIQUE
are shown what appears to be 4 bytes too early in both the Jakakbob's andmdbtools
analysis)Archive Team Microsoft Access file format reference, mostly covering older versions
JET (Red) Database Manipulation Tools wikibooks page; most, but not all, of which are Microsoft Windows based.
List of Microsoft Access size limitations, at least for older versions of Microsoft Access
Historical article on repairing Microsoft Access database corruption, and other historical hints on recovery
Extensible Storage Engine binary format (reverse engineered) from libesedb an Open Source library to access Microsoft ESE embedded databases; while definitely a different database to Microsoft Access (JET Red) there are lots of useful hints in the documentation of a "related" database format.
GUI approach to creating multi-field relationships in Microsoft Access
Microsoft Access example database: Northwind, and schema as seen in a MySQL export
Microsoft Access to creating multi-field indexes, and another example on StackOverflow