20x Faster Ingestion with Rockset’s New DynamoDB Connector



Since its introduction in 2012, Amazon DynamoDB has been one of the most popular NoSQL databases in the cloud. Unlike a traditional RDBMS, DynamoDB scales horizontally, obviating the need for careful capacity planning, resharding, and database maintenance. As a result, DynamoDB is the database of choice for companies building event-driven architectures and user-friendly, performant applications at scale, and it is central to many modern applications in ad tech, gaming, IoT, and financial services.

However, while DynamoDB is great for real-time transactions, it does not do as well for analytics workloads. Analytics workloads are where Rockset shines. To enable these workloads, Rockset provides a fully managed sync to DynamoDB tables with its built-in connector. The data from DynamoDB is automatically indexed in an inverted index, a column index, and a row index, which can then be queried quickly and efficiently.

As such, the DynamoDB connector is one of our most widely used data connectors. We see users move massive amounts of data, terabytes' worth, using the DynamoDB connector. Given the scale of that use, we soon uncovered shortcomings with our connector.

How the DynamoDB Connector Currently Works with the Scan API

At a high level, we ingest data into Rockset using the current connector in two phases:


[Figure: dynamodb-rockset-connector-v1]

  1. Initial Dump: This phase uses DynamoDB’s Scan API for a one-time scan of the entire table.
  2. Streaming: This phase uses DynamoDB’s Streams API and consumes continuous updates made to a DynamoDB table in a streaming fashion.

Roughly, the initial dump gives us a snapshot of the data, on top of which the updates from the streaming phase are applied. While the initial dump using the Scan API works well for small tables, it does not always do well for large data dumps.
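To make the two-phase model concrete, here is a minimal Python sketch of how a snapshot and a stream of changes could be combined. It is purely illustrative and is not Rockset's implementation: the record shape (`event`, `key`, `item`) and the function name `apply_stream` are assumptions, loosely modeled on the INSERT, MODIFY, and REMOVE event types that DynamoDB Streams emits.

```python
from typing import Any, Dict, Iterable

Snapshot = Dict[str, Dict[str, Any]]


def apply_stream(snapshot: Snapshot, records: Iterable[Dict[str, Any]]) -> Snapshot:
    """Apply change records, in order, on top of a snapshot keyed by primary key."""
    table = dict(snapshot)  # copy so the caller's snapshot is left untouched
    for record in records:
        key = record["key"]
        if record["event"] in ("INSERT", "MODIFY"):
            table[key] = record["item"]  # upsert the latest image of the item
        elif record["event"] == "REMOVE":
            table.pop(key, None)  # removing a missing key is a no-op
    return table


# Hypothetical data: a two-item snapshot followed by three changes.
snapshot = {"user#1": {"score": 10}, "user#2": {"score": 7}}
changes = [
    {"event": "MODIFY", "key": "user#1", "item": {"score": 11}},
    {"event": "REMOVE", "key": "user#2"},
    {"event": "INSERT", "key": "user#3", "item": {"score": 3}},
]
result = apply_stream(snapshot, changes)
print(result)
```

The sketch only works if every change made after the snapshot was taken is still available in the stream, which is why the length of the initial dump matters.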

There are two main issues with the DynamoDB initial dump as it stands today:

  • Unconfigurable segment sizes: DynamoDB does not always balance segments uniformly, sometimes leading to a straggler segment that is inordinately larger than the others. Because parallelism is at segment granularity, we have seen straggler segments increase the total ingestion time for several users in production.
  • Fixed DynamoDB stream retention: DynamoDB Streams capture change records in a log for up to 24 hours. This means that if the initial dump takes longer than 24 hours, the shards that were checkpointed at the start of the initial dump will have expired by then, leading to data loss.
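The two issues compound, as a small back-of-the-envelope model shows. The sketch below assumes one worker per segment and a constant per-worker read rate; the segment sizes and the rate are made-up numbers, and only the 24-hour retention window comes from the text above.

```python
from datetime import timedelta

STREAM_RETENTION = timedelta(hours=24)  # DynamoDB Streams keeps records for up to 24 hours


def dump_duration(segment_gb, gb_per_hour_per_worker):
    """With one worker per segment, the dump ends when the largest segment ends."""
    return timedelta(hours=max(segment_gb) / gb_per_hour_per_worker)


def outlives_stream(duration):
    """True if the dump runs longer than the stream retention window."""
    return duration > STREAM_RETENTION


balanced = [250, 250, 250, 250]  # 1,000 GB spread evenly (hypothetical)
skewed = [100, 100, 100, 700]    # the same 1,000 GB with one straggler (hypothetical)
rate = 25                        # GB per hour per worker (hypothetical)

for name, segments in (("balanced", balanced), ("skewed", skewed)):
    duration = dump_duration(segments, rate)
    print(name, duration, "exceeds retention:", outlives_stream(duration))
```

With these numbers the balanced layout finishes in 10 hours, while the skewed layout takes 28 hours for the same amount of data and overruns the retention window.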

Improving the DynamoDB Connector with Export to S3

When AWS announced the launch of new functionality that allows you to export DynamoDB table data to Amazon S3, we started evaluating this approach to see whether it would help overcome the shortcomings of the older approach.

At a high level, instead of using the Scan API to get a snapshot of the data, we use the new export-table-to-S3 functionality. Because the export is not a drop-in replacement for the Scan API, we also tweaked the streaming phase, which, together with the export to S3, forms the basis of our new connector.
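For readers who want to try the export themselves, the following is a minimal sketch of starting one with boto3's `export_table_to_point_in_time` call. It shows the AWS API only and says nothing about how Rockset invokes it internally; the helper names `build_export_request` and `start_export`, the bucket name, and the prefix are assumptions. The export feature requires point-in-time recovery to be enabled on the table.

```python
from typing import Any, Dict


def build_export_request(table_arn: str, bucket: str, prefix: str) -> Dict[str, Any]:
    """Build keyword arguments for DynamoDB's ExportTableToPointInTime operation."""
    return {
        "TableArn": table_arn,
        "S3Bucket": bucket,
        "S3Prefix": prefix,
        "ExportFormat": "DYNAMODB_JSON",
    }


def start_export(client: Any, table_arn: str, bucket: str, prefix: str) -> str:
    """Start an export and return its ARN. `client` is a boto3 DynamoDB client."""
    response = client.export_table_to_point_in_time(
        **build_export_request(table_arn, bucket, prefix)
    )
    return response["ExportDescription"]["ExportArn"]


# Usage (not run here; needs AWS credentials and point-in-time recovery enabled):
#   import boto3
#   arn = start_export(boto3.client("dynamodb"), TABLE_ARN, "my-export-bucket", "exports/")
```

Passing the client in as an argument keeps the helper easy to exercise with a stub before pointing it at a real table.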

[Figure: dynamodb-rockset-connector-v2]

While the old connector took almost 20 hours to ingest 1TB end to end with a production workload running on the DynamoDB table, the new connector takes only about 1 hour, end to end. What’s more, ingesting 20TB from DynamoDB takes only 3.5 hours, end to end! All you need to provide is an S3 bucket!
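The headline figure follows directly from these numbers. The short calculation below treats "almost 20 hours" and "about 1 hour" as exactly 20 and 1, so the results are approximations rather than measured values.

```python
old_hours_per_tb = 20.0  # old connector: "almost 20 hours" for 1TB, rounded
new_hours_per_tb = 1.0   # new connector: "about 1 hour" for 1TB, rounded

speedup = old_hours_per_tb / new_hours_per_tb
large_run_tb_per_hour = 20.0 / 3.5  # 20TB ingested in 3.5 hours

print(f"speedup on 1TB: ~{speedup:.0f}x")
print(f"throughput on 20TB: ~{large_run_tb_per_hour:.1f} TB/hour")
```

The larger run sustains a higher hourly rate than the 1TB run, so the 20TB job takes well under 20 times as long as the 1TB job.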

Benefits of the new approach:

  • Does not affect the provisioned read capacity of the DynamoDB table, and thus does not affect any production workload running on it
  • The export process is a lot faster than custom table-scan solutions
  • S3 tasks can be configured to spread the load evenly, so we don’t have to deal with a heavily imbalanced segment as we do with DynamoDB scans
  • Checkpointing with S3 comes for free (we just recently built support for this)

We’re opening up access for public beta, and cannot wait for you to take this for a spin! Sign up here.

Happy ingesting and happy querying!


