Hex Appeal: Handling Binary Data for DynamoDB to Postgres Migration with AWS Glue ETL Job
Are you thinking of migrating DynamoDB (DDB) data to Postgres? One of the challenges you may encounter is Glue job not supporting Bytea
data type target Postgres, and hence handling the binary data. This post will walk you through a solution we found, of course with the help of AWSome support, to handle binary data in an AWS Glue Job.
Let’s consider a DDB table with following schema:
id
: N (number)token
: B (binary)username
: S (string)lastupdatedby
: SS (string array)
As per our goal, we will migrate this table from DDB to Postgres using AWS Glue job. However, when we try to map the binary data, token
column, to Postgres, we run into an issue because we can’t use Bytea
data type for Postgres, as it is not supported by Glue, therefore, if create token
as a string column in Postgres table, some encoded string gets stored which is no-decodable.
To solve the problem, we need to convert the token
column to a hexadecimal string, for which we can use PySpark’s base64
function. If you are looking for a quick solution, you can refer to this sample script. Here’s how we converted and migrated the data:
Prerequisites:
We assume you have a database created in Glue, a Connector for creating JDBC crawler, two crawlers — DDB and Postgres, and source and destination tables. The crawlers will crawl both the tables, and create Glue tables in Glue database. These Glue tables will be then used in Glue visual editor. Source Data Catalogue will be Glue’s DDB crawled table, and destination will be Postgres with the Glue’s crawled Postgres table.
Step 0: For a quick start, Glue’s visual editor is great tool. Create a new Job like below, and you will have an auto-generated script in the Script
tab:
Step 1: Create Dynamic Frame from the DDB table:
DataCatalogtable_node1 = glueContext.create_dynamic_frame.from_catalog( database="ddb_to_postgres_migration_db", table_name="my_table", transformation_ctx="DataCatalogtable_node1", )
Step 2: Convert the binary columns to hexadecimal string using PySpark’s base64
function:
df = DataCatalogtable_node1.toDF() df2 = df.withColumn("token_hex", F.base64(F.col("token")).cast("string"))
This creates a PySpark dataframe from our DynamicFrame and using base64
function, converts the token
column to a hexadecimal string. We gave this column a new name — token_hex
.
Step 3: Map the columns to Postres:
ApplyMapping_node2 = ApplyMapping.apply( frame=DynamicFrame.fromDF(df2.drop("token"), glueContext, "df2"), mappings=[ ("id", "number", "id", "number"), ("username", "string", "username", "string"), ("lastupdatedby", "array", "lastupdatedby", "array"), ("token_hex", "string", "token", "binary"), ], transformation_ctx="ApplyMapping_node2", )
Here we map the columns from our PySpark dataframe to our Postgres table. We dropped the original token
column and mapped the new token_hex
column to the token
column in Postgres. Note that we specify the binary
data type for token
column in Postgres, even though we are mapping a string column to it. This is because we will be inserting a hexadecimal string, which is essentially a string of bytes.
With these changes, you will be able to run the job. Once the job succeeds, you can confirm if the data is available in Postgres table. In order to verify the binary data, you can read the binary data, and try converting it, something like with the python
code below:
hex_str = [<your hex strings from Postgres table>] for s in hex_str: bin_data = binascii.unhexlify(s) plain_str = bin_data.decode("utf-8") print(plain_str)
That’s pretty much it. Congrats! You have migrated your DDB data, including binary data, to Postgres using AWS Glue. Of course this approach can be extended to include multiple binary columns, and to handled that you will need to cast all the binary columns like this:
df = DataCatalogtable_node1.toDF() df2 = df.withColumn("token_hex", F.base64(F.col("token")).cast("string")) \ .withColumn("column_1_hex", F.base64(F.col("column_1")).cast("string")) \ ... .withColumn("column_n_hex", F.base64(F.col("column_n")).cast("string"))
And drop the hex-casted columns:
frame=DynamicFrame.fromDF(df2.drop("column_1", "column_2", ..., "column_n"), glueContext, "df2")
And, finally adjust the mapping accordingly:
("column_1_hex", "string", "column_1", "binary"), ("column_2_hex", "string", "column_2", "binary"), ... ("column_n_hex", "string", "column_n", "binary"),
In summary, handling binary data in an AWS Glue job can be a challenge, but with a little bit of PySpark magic, we can easily convert binary data to hexadecimal strings and map it to our destination database.
Happy Migrating!