This post is co-written with Nofar Diamant and Matan Safri from AppsFlyer.
AppsFlyer develops a leading measurement solution focused on privacy, which enables marketers to gauge the effectiveness of their marketing activities and integrates them with the broader marketing world, managing a vast volume of 100 billion events every day. AppsFlyer empowers digital marketers to precisely identify and allocate credit to the various consumer interactions that lead up to an app install, using in-depth analytics.
Part of AppsFlyer’s offering is the Audiences Segmentation product, which allows app owners to precisely target and reengage users based on their behavior and demographics. This includes a feature that provides real-time estimation of audience sizes within specific user segments, referred to as the Estimation feature.
To provide users with real-time estimation of audience size, the AppsFlyer team originally used Apache HBase, an open-source distributed database. However, as the workload grew to 23 TB, the HBase architecture needed to be revisited to meet service level agreements (SLAs) for response time and reliability.
This post explores how AppsFlyer modernized their Audiences Segmentation product by using Amazon Athena. Athena is a powerful and versatile serverless query service provided by AWS. It’s designed to make it simple for users to analyze data stored in Amazon Simple Storage Service (Amazon S3) using standard SQL queries.
We dive into the various optimization techniques AppsFlyer employed, such as partition projection, sorting, parallel query runs, and query result reuse. We share the challenges the team faced and the strategies they adopted to unlock the true potential of Athena in a use case with low-latency requirements. Additionally, we discuss the thorough testing, monitoring, and rollout process that resulted in a successful transition to the new Athena architecture.
Audiences Segmentation legacy architecture and modernization drivers
Audience segmentation involves defining targeted audiences in AppsFlyer’s UI, represented by a directed tree structure with set operations and atomic criteria as nodes and leaves, respectively.
The following diagram shows an example of audience segmentation on the AppsFlyer Audiences management console and its translation to the tree structure described, with the two atomic criteria as the leaves and the set operation between them as the node.
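The tree structure described above can be sketched in a few lines of Python. This is only an illustration: the class names, the field names, and the in-memory lookup of user IDs per criterion are assumptions made for the example, not AppsFlyer's implementation (which works on sketches rather than raw user sets).

```python
from dataclasses import dataclass
from typing import Union


@dataclass(frozen=True)
class Criterion:
    """Leaf: an atomic criterion (hypothetical fields)."""
    app_id: str
    event_name: str


@dataclass(frozen=True)
class SetOp:
    """Inner node: a set operation applied to two subtrees."""
    op: str  # "union", "intersection", or "difference"
    left: "Node"
    right: "Node"


Node = Union[Criterion, SetOp]


def evaluate(node, users_by_criterion):
    """Recursively resolve a segment tree to a set of user IDs."""
    if isinstance(node, Criterion):
        return users_by_criterion.get(node, frozenset())
    left = evaluate(node.left, users_by_criterion)
    right = evaluate(node.right, users_by_criterion)
    if node.op == "union":
        return left | right
    if node.op == "intersection":
        return left & right
    if node.op == "difference":
        return left - right
    raise ValueError(f"unknown set operation: {node.op}")


# Two atomic criteria as leaves, one set operation as the node.
purchased = Criterion("com.demo", "af_purchase")
installed = Criterion("com.demo", "install")  # hypothetical event name
segment = SetOp("intersection", purchased, installed)
```

Each leaf resolves independently, which is why a leaf maps naturally to one query (or a few) against the events data.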
To provide users with real-time estimation of audience size, the AppsFlyer team used a framework called Theta Sketches, which is an efficient data structure for counting distinct elements. These sketches enhance scalability and analytical capabilities, and they were originally stored in the HBase database.
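To make the idea of a distinct-count sketch concrete, the following is a deliberately simplified K-minimum-values (KMV) estimator, a technique from the same family of ideas as Theta Sketches. It is a teaching sketch under stated assumptions, not the Apache DataSketches implementation and not AppsFlyer's code; the class name, the default `k`, and the choice of hash function are all assumptions.

```python
import hashlib


class KmvSketch:
    """Keeps the k smallest hash values seen and estimates distinct count from them."""

    def __init__(self, k=512):
        self.k = k
        self.hashes = set()  # at most k hash values, each in [0, 1)

    @staticmethod
    def _hash(item):
        digest = hashlib.sha256(item.encode("utf-8")).digest()
        return int.from_bytes(digest[:8], "big") / 2**64

    def update(self, item):
        h = self._hash(item)
        if h in self.hashes:
            return  # duplicates never change the sketch
        if len(self.hashes) < self.k:
            self.hashes.add(h)
            return
        largest = max(self.hashes)
        if h < largest:
            self.hashes.remove(largest)
            self.hashes.add(h)

    def union(self, other):
        """Merge two sketches; the result equals a sketch built over both inputs."""
        merged = KmvSketch(min(self.k, other.k))
        merged.hashes = set(sorted(self.hashes | other.hashes)[: merged.k])
        return merged

    def estimate(self):
        if len(self.hashes) < self.k:
            return float(len(self.hashes))  # exact while below capacity
        return (self.k - 1) / max(self.hashes)


day1, day2 = KmvSketch(), KmvSketch()
for i in range(6000):
    day1.update(f"user-{i}")
for i in range(4000, 10000):
    day2.update(f"user-{i}")
print(round(day1.union(day2).estimate()))  # approximates the 10,000 distinct users
```

The sketch stays a fixed size regardless of how many users it has seen, and sketches can be combined with set operations, which is what makes this kind of structure a good fit for evaluating segment trees.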
HBase is an open source, distributed, columnar database, designed to handle large volumes of data across commodity hardware with horizontal scalability.
Original data structure
In this post, we focus on the events table, the largest table initially stored in HBase. The table had the schema date | app-id | event-name | event-value | sketch and was partitioned by date and app-id.
The following diagram showcases the high-level original architecture of the AppsFlyer Estimations system.
The architecture featured an Airflow ETL process that initiates jobs to create sketch files from the source dataset, followed by the import of these files into HBase. Users could then use an API service to query HBase and retrieve estimations of user counts according to the audience segment criteria set up in the UI.
To learn more about the previous HBase architecture, see Applied Probability – Counting Large Set of Unstructured Events with Theta Sketches.
Over time, the workload exceeded the size for which the HBase implementation was originally designed, reaching a storage size of 23 TB. It became apparent that in order to meet AppsFlyer’s SLA for response time and reliability, the HBase architecture needed to be revisited.
As previously mentioned, the use case centers on daily interactions by customers with the UI. This requires adherence to a standard UI SLA that provides quick response times and the capacity to handle a substantial number of daily requests, while accommodating the current data volume and potential future expansion.
Furthermore, due to the high cost associated with operating and maintaining HBase, the aim was to find an alternative that is managed, straightforward, and cost-effective, and that wouldn’t significantly complicate the existing system architecture.
Following thorough team discussions and consultations with the AWS experts, the team concluded that a solution using Amazon S3 and Athena stood out as the most cost-effective and straightforward choice. The primary concern was query latency, and the team was particularly cautious to avoid any adverse effects on the overall customer experience.
The following diagram illustrates the new architecture using Athena. Notice that import-..-sketches-to-hbase and HBase were omitted, and Athena was added to query data in Amazon S3.
Schema design and partition projection for performance enhancement
In this section, we discuss the process of schema design in the new architecture and the different performance optimization methods that the team used, including partition projection.
Merging data for partition reduction
To evaluate whether Athena could be used to support Audiences Segmentation, an initial proof of concept was conducted. The scope was limited to events arriving from three app-ids (approximately 3 GB of data) partitioned by app-id and by date, using the same partitioning schema that was used in the HBase implementation. As the team scaled up to include the entire dataset with 10,000 app-ids for a 1-month time range (reaching approximately 150 GB of data), the team started to see more slow queries, especially for queries that spanned significant time ranges. The team dived deep and discovered that Athena spent significant time on the query planning stage due to the large number of partitions (7.3 million) that it loaded from the AWS Glue Data Catalog (for more information about using Athena with AWS Glue, see Integration with AWS Glue).
This led the team to examine partition indexing. Athena partition indexes provide a way to create metadata indexes on partition columns, allowing Athena to prune the data scan at the partition level, which can reduce the amount of data that needs to be read from Amazon S3. Partition indexing shortened the time of partition discovery in the query planning stage, but the improvement wasn’t substantial enough to meet the required query latency SLA.
As an alternative to partition indexing, the team evaluated a strategy to reduce the number of partitions by reducing data granularity from daily to monthly. This method consolidated daily data into monthly aggregates by merging day-level sketches into monthly composite sketches using the Theta Sketches union capability. For example, for a month of data, instead of having 30 rows per month, the team united those rows into a single row, effectively slashing the row count by 97%.
This method reduced the time needed for the partition discovery phase, which initially required approximately 10–15 seconds, by 30%, and it also reduced the amount of data that had to be scanned. However, latency still fell short of the goals set by the UI’s responsiveness standards.
Furthermore, the merging process inadvertently compromised the precision of the data, leading to the exploration of other solutions.
Partition projection as an enhancement multiplier
At this point, the team decided to explore partition projection in Athena.
Partition projection in Athena allows you to improve query efficiency by projecting the metadata of your partitions. It virtually generates and discovers partitions as needed without the need for the partitions to be explicitly defined in the database catalog beforehand.
This feature is particularly useful when dealing with large numbers of partitions, or when partitions are created rapidly, as in the case of streaming data.
As we explained earlier, in this particular use case, each leaf is an access pattern that is translated into a query that must contain a date range, app-id, and event-name. This led the team to define the projection columns by using the date type for the date range and the injected type for app-id and event-name.
Rather than scanning and loading all partition metadata from the catalog, Athena can generate the partitions to query using configured rules and values from the query. This avoids the need to load and filter partitions from the catalog by generating them in the moment.
The projection process helped avoid performance issues caused by a high number of partitions, eliminating the latency from partition discovery during query runs.
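Partition projection is configured through table properties. As a rough sketch of what such a configuration could look like, the helper below assembles the properties for one date-typed column and any number of injected columns, and renders them as a `TBLPROPERTIES` clause. The function names, the column names `date` and `app_id`, the start of the date range, and the date format are assumptions for illustration; the post does not show the team's actual table definition.

```python
def projection_properties(location_template, date_column, date_range,
                          injected_columns, date_format="yyyy-MM-dd"):
    """Build Athena partition projection table properties as a dict."""
    props = {
        "projection.enabled": "true",
        f"projection.{date_column}.type": "date",
        f"projection.{date_column}.range": date_range,
        f"projection.{date_column}.format": date_format,
        "storage.location.template": location_template,
    }
    for column in injected_columns:
        # Injected columns take their values from the query's WHERE clause.
        props[f"projection.{column}.type"] = "injected"
    return props


def to_tblproperties(props):
    """Render the properties as a TBLPROPERTIES clause for a CREATE TABLE statement."""
    body = ",\n".join(f"  '{key}' = '{value}'" for key, value in sorted(props.items()))
    return f"TBLPROPERTIES (\n{body}\n)"


props = projection_properties(
    # Placeholders are assumed to match the partition column names.
    location_template="s3://bucket/table_root/date=${date}/app_id=${app_id}",
    date_column="date",
    date_range="2022-01-01,NOW",  # hypothetical start date
    injected_columns=["app_id"],
)
print(to_tblproperties(props))
```

With the injected type, Athena does not enumerate possible values; each query is expected to supply the value (for example, `app_id = 'com.demo'`) so the partition location can be computed directly.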
Because partition projection eliminated the dependency between the number of partitions and query runtime, the team could experiment with an additional partition: event-name. Partitioning by three columns (date, app-id, and event-name) reduced the amount of scanned data, resulting in a 10% improvement in query performance compared to using partition projection with data partitioned only by date and app-id.
The following diagram illustrates the high-level data flow of sketch file creation. Focusing on the sketch writing process (write-events-estimation-sketches), writing to Amazon S3 with three partition fields caused the process to run twice as long as in the original architecture, because it produced 20 times more sketch files in Amazon S3.
This prompted the team to drop the event-name partition and compromise on two partitions, date and app-id, resulting in the following partition structure:
s3://bucket/table_root/date=${day}/app_id=${app_id}
Using Parquet file format
In the new architecture, the team used the Parquet file format. Apache Parquet is an open source, column-oriented data file format designed for efficient data storage and retrieval. Each Parquet file contains metadata, such as the minimum and maximum values of columns, that allows the query engine to skip loading unneeded data. This optimization reduces the amount of data that needs to be scanned, because Athena can skip or quickly navigate through sections of the Parquet file that are irrelevant to the query. As a result, query performance improves significantly.
Parquet is particularly effective when querying sorted fields, because it allows Athena to apply predicate pushdown optimization and quickly identify and access the relevant data segments. To learn more about this capability in the Parquet file format, see Understanding columnar storage formats.
Recognizing this advantage, the team decided to sort by event-name to enhance query performance, achieving a 10% improvement compared to non-sorted data. Initially, they tried partitioning by event-name to optimize performance, but this approach increased writing time to Amazon S3. Sorting delivered a query time improvement without the ingestion overhead.
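The effect of sorting on min/max-based skipping can be illustrated with a small pure-Python simulation. It does not read real Parquet files; it only models consecutive row groups that each record the minimum and maximum value of one column. The event names, row counts, and row group size are made up for the example.

```python
def row_group_stats(values, rows_per_group):
    """Split values into consecutive row groups and keep (min, max) for each group."""
    groups = [values[i:i + rows_per_group] for i in range(0, len(values), rows_per_group)]
    return [(min(group), max(group)) for group in groups]


def groups_to_read(stats, wanted):
    """Count row groups whose [min, max] range could contain the wanted value."""
    return sum(1 for low, high in stats if low <= wanted <= high)


names = [f"event_{i:02d}" for i in range(20)]  # hypothetical event names
interleaved = names * 500                      # 10,000 rows, event names mixed together
ordered = sorted(interleaved)                  # same rows, sorted by event name

unsorted_reads = groups_to_read(row_group_stats(interleaved, 1000), "event_07")
sorted_reads = groups_to_read(row_group_stats(ordered, 1000), "event_07")
print(unsorted_reads, sorted_reads)  # prints "10 1"
```

When event names are mixed, every row group's range covers the requested name, so nothing can be skipped. After sorting, only the row group that actually holds the name matches.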
Query optimization and parallel queries
The team discovered that performance could be improved further by running parallel queries. Instead of a single query over a long window of time, multiple queries were run over shorter windows. Even though this increased the complexity of the solution, it improved performance by about 20% on average.
For instance, consider a scenario where a user requests the estimated size of app com.demo and event af_purchase between April 2024 and the end of June 2024 (as illustrated earlier, the segmentation is defined by the user and then translated to an atomic leaf, which is then broken down into multiple queries depending on the date range). The following diagram illustrates the process of breaking down the initial 3-month query into two separate queries of up to 60 days each, running them concurrently, and then merging the results.
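The split-and-merge flow for this example could look roughly like the following. The function names and the `run_query` and `merge` callables are placeholders introduced for illustration; in the real system the per-window step would be an Athena query and the merge step would combine sketches.

```python
from concurrent.futures import ThreadPoolExecutor
from datetime import date, timedelta


def split_date_range(start, end, max_days=60):
    """Split the inclusive range [start, end] into windows of at most max_days days."""
    windows = []
    cursor = start
    while cursor <= end:
        window_end = min(cursor + timedelta(days=max_days - 1), end)
        windows.append((cursor, window_end))
        cursor = window_end + timedelta(days=1)
    return windows


def run_in_parallel(windows, run_query, merge):
    """Run one query per window concurrently, then merge the partial results."""
    with ThreadPoolExecutor(max_workers=max(1, len(windows))) as pool:
        partials = list(pool.map(lambda window: run_query(*window), windows))
    return merge(partials)


windows = split_date_range(date(2024, 4, 1), date(2024, 6, 30))
for window_start, window_end in windows:
    print(window_start, window_end)
# 2024-04-01 2024-05-30
# 2024-05-31 2024-06-30
```

Threads are a reasonable fit here under the assumption that each worker mostly waits on a remote query rather than doing CPU-bound work.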
Reducing results set size
While analyzing performance bottlenecks, examining the different types and properties of the queries, and analyzing the different stages of the query run, it became clear that specific queries were slow in fetching query results. This problem wasn’t rooted in the actual query run, but in data transfer from Amazon S3 at the GetQueryResults phase, due to query results containing a large number of rows (a single result can contain millions of rows).
The initial approach of handling multiple key-value permutations in a single sketch inflated the number of rows considerably. To overcome this, the team introduced a new event-attr-key field to split sketches into distinct key-value pairs.
The final schema looked as follows:
date | app-id | event-name | event-attr-key | event-attr-value | sketch
This refactoring resulted in a drastic reduction of result rows, which significantly expedited the GetQueryResults process, improving overall query runtime by 90%.
Athena query results reuse
To address a common use case in the Audiences Segmentation GUI where users often make subtle adjustments to their queries, such as adjusting filters or slightly altering time windows, the team used the Athena query results reuse feature. This feature improves query performance and reduces costs by caching and reusing the results of previous queries. It plays a pivotal role, particularly when taking into account the recent improvements involving the splitting of date ranges. The ability to reuse and swiftly retrieve results means that these minor yet frequent changes no longer require a full query reprocessing.
As a result, the latency of repeated query runs was reduced by up to 80%, enhancing the user experience by providing faster insights. This optimization not only accelerates data retrieval but also significantly reduces costs because there’s no need to rescan data for every minor change.
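Result reuse is requested per query. As a minimal sketch, the helper below builds the request parameters for the Athena `StartQueryExecution` API with result reuse enabled. The helper name, SQL text, table, database, workgroup, and maximum age are hypothetical; the post does not state the values the team used.

```python
def build_query_request(sql, database, workgroup, max_age_minutes=60):
    """Build StartQueryExecution parameters that allow reuse of recent results."""
    return {
        "QueryString": sql,
        "QueryExecutionContext": {"Database": database},
        "WorkGroup": workgroup,
        "ResultReuseConfiguration": {
            "ResultReuseByAgeConfiguration": {
                "Enabled": True,
                # Results older than this are not reused (assumed value).
                "MaxAgeInMinutes": max_age_minutes,
            }
        },
    }


request = build_query_request(
    sql=(
        "SELECT sketch FROM events "  # hypothetical table and columns
        "WHERE app_id = 'com.demo' AND event_name = 'af_purchase' "
        "AND date BETWEEN '2024-04-01' AND '2024-05-30'"
    ),
    database="estimations",   # hypothetical database name
    workgroup="primary",
)
```

With boto3, the dictionary would be passed as `boto3.client("athena").start_query_execution(**request)`. Reuse depends on the query text matching a previous run, so keeping the generated SQL for each date window stable is what would let a slightly modified segment still hit cached results for the windows that did not change.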
Solution rollout: Testing and monitoring
In this section, we discuss the process of rolling out the new architecture, including testing and monitoring.
Fixing Amazon S3 slowdown errors
During the solution testing phase, the team developed an automation process designed to assess the different audiences within the system, using the data organized within the newly implemented schema. The methodology involved a comparative analysis of results obtained from HBase against those derived from Athena.
While running these tests, the team examined the accuracy of the estimations retrieved and also the latency change.
In this testing phase, the team encountered some failures when running many concurrent queries at once. These failures were caused by Amazon S3 throttling due to too many GET requests to the same prefix produced by concurrent Athena queries.
To handle the throttling (slowdown errors), the team added a retry mechanism for query runs with an exponential back-off strategy (the wait time increases exponentially, with a random offset to prevent concurrent retries).
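A retry loop of this kind can be sketched as follows. The exception class, function name, and default delays are assumptions for illustration; the post does not describe how the team detects a slowdown error or which delay values they chose.

```python
import random
import time


class ThrottledError(Exception):
    """Hypothetical stand-in for a query failure caused by S3 slowdown errors."""


def run_with_backoff(run_query, max_attempts=5, base_delay=0.5, max_delay=30.0,
                     sleep=time.sleep, rng=random.random):
    """Retry run_query on throttling, doubling the wait each time and adding jitter."""
    for attempt in range(max_attempts):
        try:
            return run_query()
        except ThrottledError:
            if attempt == max_attempts - 1:
                raise  # out of attempts: surface the failure to the caller
            delay = min(max_delay, base_delay * 2 ** attempt)
            # The random offset spreads out retries from queries that failed together.
            sleep(delay + rng() * base_delay)
```

The `sleep` and `rng` parameters exist only so the behavior can be tested without real waiting. In production code the defaults would be used.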
Rollout preparations
At first, the team initiated a 1-month backfilling process as a cost-conscious approach, prioritizing accuracy validation before committing to a comprehensive 2-year backfill.
The backfilling process included running the Spark job (write-events-estimation-sketches) over the desired time range. The job read from the data warehouse, created sketches from the data, and wrote them to files in the specific schema that the team defined. Additionally, because the team used partition projection, they could skip the process of updating the Data Catalog with every partition being added.
This step-by-step approach allowed them to confirm the correctness of their solution before proceeding with the entire historical dataset.
With confidence in the accuracy achieved during the initial phase, the team systematically expanded the backfilling process to encompass the full 2-year timeframe, assuring a thorough and reliable implementation.
Before the official release of the updated solution, a robust monitoring strategy was implemented to safeguard stability. Key monitors were configured to assess critical aspects, such as query and API latency, error rates, and API availability.
After the data was stored in Amazon S3 as Parquet files, the following rollout process was designed:
- Keep both HBase and Athena writing processes running, stop reading from HBase, and start reading from Athena.
- Stop writing to HBase.
- Sunset HBase.
Improvements and optimizations with Athena
The migration from HBase to Athena, using partition projection and optimized data structures, has not only resulted in a 10% improvement in query performance, but has also significantly boosted overall system stability by scanning only the necessary data partitions. In addition, the transition to a serverless model with Athena has achieved an impressive 80% reduction in monthly costs compared to the previous setup. This is due to eliminating infrastructure management expenses and aligning costs directly with usage, thereby positioning the organization for more efficient operations, improved data analysis, and superior business outcomes.
The following table summarizes the improvements and the optimizations implemented by the team.
Area of Improvement | Action Taken | Measured Improvement |
Athena partition projection | Partition projection over the large number of partitions, avoiding limiting the number of partitions; partition by event_name and app_id | Hundreds of percent improvement in query performance. This was the most significant improvement, which allowed the solution to be feasible. |
Partitioning and sorting | Partitioning by app_id and sorting event_name with daily granularity | 100% improvement in jobs calculating the sketches. 5% latency in query performance. |
Time range queries | Splitting long time range queries into multiple queries running in parallel | 20% improvement in query performance. |
Reducing results set size | Schema refactoring | 90% improvement in overall query time. |
Query result reuse | Supporting Athena query results reuse | 80% improvement in queries run more than once in the given time. |
Conclusion
In this post, we showed how Athena became the main component of the AppsFlyer Audiences Segmentation offering. We explored various optimization techniques such as data merging, partition projection, schema redesign, parallel queries, the Parquet file format, and query result reuse.
We hope our experience provides valuable insights to enhance the performance of your Athena-based applications. Additionally, we recommend checking out Athena performance best practices for further guidance.
About the Authors
Nofar Diamant is a software team lead at AppsFlyer with a current focus on fraud protection. Before diving into this realm, she led the Retargeting team at AppsFlyer, which is the subject of this post. In her spare time, Nofar enjoys sports and is passionate about mentoring women in technology. She is dedicated to shifting the industry’s gender demographics by increasing the presence of women in engineering roles and encouraging them to succeed.
Matan Safri is a backend developer specializing in big data in the Retargeting team at AppsFlyer. Before joining AppsFlyer, Matan was a backend developer in the IDF and completed an MSc in electrical engineering, majoring in computers, at BGU university. In his spare time, he enjoys wave surfing, yoga, traveling, and playing the guitar.
Michael Pelts is a Principal Solutions Architect at AWS. In this position, he works with major AWS customers, assisting them in developing innovative cloud-based solutions. Michael enjoys the creativity and problem-solving involved in building effective cloud architectures. He also likes sharing his extensive experience in SaaS, analytics, and other domains, empowering customers to elevate their cloud expertise.
Orgad Kimchi is a Senior Technical Account Manager at Amazon Web Services. He serves as the customer’s advocate and assists his customers in achieving cloud operational excellence, focusing on architecture and AI/ML in alignment with their business goals.