In the last blog post about migrating from MSSQL to PostgreSQL, I left one detail for later -- resetting all the sequences after importing data. I created two PL/pgSQL functions to do the work of selecting the maximum value in the primary key and updating the sequence.
CREATE OR REPLACE FUNCTION get_pk_max(table_name VARCHAR(100), pk_name VARCHAR(100))
RETURNS INTEGER AS $$
DECLARE i INTEGER;
DECLARE q TEXT;
BEGIN
q := 'SELECT max(' || pk_name || ') FROM ' || table_name;
EXECUTE(q) INTO i;
IF i IS NULL THEN
RETURN 0;
ELSE
RETURN i;
END IF;
END;
$$ LANGUAGE plpgsql;
CREATE OR REPLACE FUNCTION reset_seq(table_name VARCHAR(100), pk_name VARCHAR(100), seq_name VARCHAR(100))
RETURNS INTEGER AS $$
DECLARE i INTEGER;
BEGIN
SELECT get_pk_max(table_name, pk_name) INTO i;
RETURN setval(seq_name, i + 1, false);
END;
$$ LANGUAGE plpgsql;
The reset_seq() function is called with from the PostgreSQL client for each table by a shell script:
for T in `cat ${DB_INFO_DIR}/tables`; do
if [ -f ${DB_INFO_DIR}/${T}.sequence ]; then
echo "Processing $T"
PK=`gawk 'BEGIN { RS="\000"; FS="\n" }; { print $1 }' ${DB_INFO_DIR}/${T}.sequence`
SEQ=`gawk 'BEGIN { RS="\000"; FS="\n" }; { print $2 }' ${DB_INFO_DIR}/${T}.sequence`
psql -c "SELECT reset_seq('ll_${T}', '${PK}', '${SEQ}');"
else
echo "No sequence information for $T"
fi
done
This loops over every table name in the new database. The table name, prefixed with ll_, is the first argument to reset_seq(). The second and third arguments are read from a file containing two lines; the first line contains the name of the table's primary key, and the second line contains the name of the sequence.
The shell script executes very quickly, resetting all the sequences in the database in a few seconds.
The following are notes on a migration from Microsoft SQL Server 2000 to PostgreSQL 8. The old database has 60 tables, some with a few million rows, and about 7 Gb of data in total. There were several issues to deal with:
- The order of the columns in the tables of each schema aren't consistent. The older database has had columns added over time, while the new database schema has been created from the most current SQL scripts.
- The data exported using bcp will need reformatting to be imported using psql.
- Fields with NULL values will need special consideration. The bcp utility is not consistent in the use of 0x00 (the null character) to indicate a null value.
- The MSSQL database has a mix of character encodings; The PostgreSQL database uses UTF8.
- The import must be automated so it can be performed after the existing database is taken offline, and before the new one is put into production. This ensures that all user information is consistent.
- The data will not be imported in a single transaction. Therefore, the order of the import must be correct for the foreign key relationships that exist.
- The indexes and sequences of the new database must be set. (TODO)
The MSSQL database is hosted on a Windows server in a remote facility. I can open a VNC connection through an SSH tunnel, but the responsiveness is unacceptable. Instead, I use the Cygwin port of the OpenSSH daemon, running as a service, and providing a bash shell.
The first step in preparing to export the data is to get a list of table names. I used the isql utility to connect to the server:
isql -Q "select name from sysobjects where type = 'U'" \
-S $MSSQL_SERVER -U $MSSQL_USERNAME -P $MSSQL_PASSWORD \
| grep -Eo 'tbl_[a-z_]+' \
| sed -s 's/tbl_//g' \
| sort > tables
The result is file containing one table name (with the tbl_ prefix removed) per line. I copied the file as tables_to_export, and removed the tables that were not necessary to export. Incidentally, I don't know if the Microsoft documentation on these utilities will be accessible in the future. To be safe, I've created PDF archives of the pages on MSDN. The following bit of shell script loops over each line in the list of tables to be exported:
TEMPFILE=`mktemp`
for T in `cat tables_to_export`; do
bcp "tbl_${T}" format $TEMPFILE -f "${EXPORT_DIR}/${T}.fmt" -c -k \
-S $MSSQL_SERVER -U $MSSQL_USERNAME -P $MSSQL_PASSWORD
bcp "tbl_${T}" out "${EXPORT_DIR}/${T}" -o "${EXPORT_DIR}/${T}.out" -c -k \
-r '_~R~_' -t '_~F~_' \
-S $MSSQL_SERVER -U $MSSQL_USERNAME -P $MSSQL_PASSWORD
done
There are two runs of bcp for each table; the first documents the format of the table being exported, and the second performs the actual data export. The command line switches -c and -k specify character data, and to keep null values, rather than export with column defaults. I selected _~R~_ and _~F~_ for row and field terminators, rather than \n and \t, because the data itself contains new lines and tabs. After completing the export, I created a tarball and transferred it to an administration shell account on the same network as the new server. This little detail is important -- the new PostgreSQL server does not have file system access to the data files. See the documentation on the COPY command for more information.
The tarball is extracted to the local /tmp directory, rather than the NFS-mounted home directory for speed -- an important consideration when processing large files. The first task is to escape the embedded backslash, new line, and tab characters, convert the nulls, and then strip out the row and field terminators. A fairly straightforward Gawk program does the work:
BEGIN {
RS="_~R~_"
FS="_~F~_"
OFS="\t"
}
{
gsub(/\\/, "\\\\")
gsub(/\000/, "\\N")
gsub(/\r\n/, "\\n")
gsub(/\n/, "\\n")
gsub(/\t/, "\\t")
$1 = $1
print $0
}
The data in some files required extra massaging, such as inserting missing null values or removing obsolete fields. For each data file that requires extra processing, I created a file of the same name containing additional gawk rules. The processing script checks for the extra rule files and applies them when found. The new files are created with .tab extensions, indicating that the file is ready for import.
This result is a set of data files that can be used by psql. However, the order of the columns must be dealt with. For that, I wrote a script to loop over all of the exported table names (bundled into the tarball), and compare the column order to the new database:
# Create a column list using bcp format files
for T in `cat ${DBOLD_DIR}/tables_to_export`; do
gawk 'BEGIN { FS="[ ]+" } /[ ]+/ { print $7 }' \
"${DBOLD_DATADIR}/${T}.fmt" > "${DBOLD_WORK}/${T}.columns"
done
# Create a list of all the tables in the new database
psql -c '\dt *' | gawk 'BEGIN { FS=" " } /public/ { print $3 }' \
| sed -e 's/^ll_//' > $DBNEW_TABLES
# Create column list for each table in the new database
for T in `cat $DBNEW_TABLES`; do
psql -c "\\d ll_${T}" \
| gawk 'BEGIN { FS=" " } /Column/ { next } /^[ ][a-z]/ { print $1 }' \
> "${DBNEW_WORK}/${T}.columns"
done
# Compare each exported table to the new table
for T in `cat ${DBOLD_DIR}/tables_to_export` ; do
echo "Table: $T"
if [ ! -f ${DBNEW_DIR}/${T}.columns ]; then
echo "Does not exist in new database."
else
# Quietly check for differences
diff -w -i -q "${DBNEW_DIR}/${T}.columns" "${DBOLD_DIR}/${T}.columns" > /dev/null
if [ $? != 0 ]; then
# Show differences side-by-side
diff -w -i -y "${DBNEW_DIR}/${T}.columns" "${DBOLD_DIR}/${T}.columns"
else
echo "No differences."
fi
fi
echo
done
I went to the effort to programmatically compare the old tables to the new tables for two reasons: 1) it's highly likely that I would have made errors doing it by visual inspection, and 2) the resulting output is used to create a column list used during the import of the data. These column list files are placed into a directory that will be checked by the import script. If a file is found matching the name of the table, the column list is used instead of a straight copy.
I don't yet have a tool for examining the SQL scripts, and determining the dependency order. Fortunately, this only needs to be done once (for this project). I created a file containing the table names in the order that they must be imported.
Finally, the data can be imported!
for T in `cat ${DB_IMPORT_INFO_DIR}/table_import_order`; do
echo "Processing $T"
# If the column order is special, use the column list
if [ -f ${DB_IMPORT_INFO_DIR}/${T} ]; then
Q=`head -n 1 ${DB_IMPORT_INFO_DIR}/${T}`
psql -c "COPY ll_${T} (${Q}) FROM STDIN" < ${DB_DATA_DIR}/${T}.tab
# Otherwise, perform a straight copy
else
psql -c "COPY ll_${T} FROM STDIN" < ${DB_DATA_DIR}/${T}.tab
fi
done
I should note that I set the PGCLIENTENCODING environment variable to LATIN1 in the import script. This makes the conversion to UTF8 explicit, and eliminates any byte sequence errors. Initially, I tried setting \encoding LATIN1 in the ~/.psqlrc file, as well as specifying -v encoding=LATIN1 on the psql command line -- neither of these methods cause the correct character encoding to be set during the import.
As indicated above, the SQL to set the indexes and sequences needs to be completed. Other than that, the project is nearly complete.
This is a note to myself, and any other PostgreSQL user that might also be bitten by this circumstance. I like to have my psql client configured without AUTOCOMMIT enabled. This allows me to run long scripts that don't change anything unless it all succeeds. So, I set it to false in my ~/.psqlrc file.
Recently I was making some changes to a database table and got sidetracked before committing the ALTER TABLE. Hours later, I wondered why my web app was completely unresponsive for some requests. It turns out that the JDBC driver was locked (and doesn't timeout) because of my open psql connection. D'oh! How easy it is to overlook the simple things.
PostgreSQL
|
Posted
12/6/07
@ 12:41 PM
by Joseph Lamoree