by Nga Tran

Scaling throughput and performance in a sharding database system

feature
Apr 12, 2022 | 11 mins

Understand the two dimensions of scaling for database query and ingest workloads, and how sharding can make scaling elastic, or not.


Scaling throughput and performance are critical design topics for all distributed databases, and sharding is usually a part of the solution. However, a design that increases throughput does not always help with performance, and vice versa. Even when a design supports both, scaling them up and down at the same time is not always easy.

This post will describe these two types of scaling for both query and ingest workloads, and discuss sharding techniques that make them elastic. Before we dive into the database world, let us first walk through an example of elastic throughput and performance scaling from daily life.

Scaling effects in a fast food restaurant

Nancy is opening a fast food restaurant and laying out the scenarios to optimize her operational costs on different days of the week. Figure 1 illustrates her business on a quiet day. For the restaurant to be open, two lines must remain staffed: drive-thru and walk-in. Each requires one employee to cover. On average, each employee needs six minutes to process an order, so the two employees should be able to cover the restaurant's expected throughput of 20 customers per hour.


Figure 1: The restaurant operation on a quiet day.

Let's assume that an order can be processed in parallel by at most two people, one making drinks and the other making food. Nancy's employees are trained to go and help with the other line if their line is empty. Doubling up on a single line reduces the order processing time to three minutes and helps keep throughput steady when customers enter the lines at various intervals.

Figure 2 shows a busier day with around 50% more customers. Adding an employee should cover the 50% increase in throughput. Nancy asks her team to be flexible:

  • If only one customer comes to a line at a time, one employee should run between the two lines to help, reducing the processing time so they will be available to serve new customers immediately.
  • If a few customers walk in at the same time, employees should open a new line so that at least two walk-in customers can be served at once, because Nancy knows walk-in customers tend to be happier when their orders are taken immediately but are quite tolerant of the six-minute processing time.

Figure 2: The operation that covers 50% more customers.

To smoothly handle the busiest days of the year, which draw some 80 customers per hour, Nancy builds a total of four counters: one drive-thru and three walk-ins, as shown in Figure 3. Since adding a third person to help with an order won't reduce the order time further, she plans to staff up to two employees per counter. A few days a year, when the town holds a big event and closes the street (making the drive-thru inaccessible), Nancy accepts that her max throughput will be 60 customers per hour.


Figure 3: The operation on a busy day.

Nancy's order handling strategy elastically scales customer throughput (i.e., scales as needed) while also applying flexibility to make order processing time (i.e., performance) faster. Important points to notice:

  1. The max performance scaling factor (the max number of employees who can help with one order) is two. Nancy cannot change this factor if she wants to stick with the same food offerings.
  2. The max throughput is 80 customers per hour, because the max number of counters is four. Nancy could change this factor if she has room to add more counters to her restaurant.

Scaling effects in a sharding database system

Similar to the operation at a fast food restaurant, a database system should be built to support elastic scaling of throughput and performance for both query and ingest workloads.

Query workload

Term definitions:

  • Query throughput scaling: the ability to scale up and down the number of queries executed in a defined amount of time, such as a second or a minute.
  • Query performance scaling: the ability to make a query run faster or slower.
  • Elastic scaling: the ability to scale throughput or performance up and down easily based on traffic or other needs.

Examples

Let's assume our sales data is stored in an accessible location such as a local disk, a remote disk, or cloud storage. Three teams in the company, Reporting, Marketing, and Sales, want to query this data frequently. Our first setup, illustrated in Figure 4, has one query node that receives all queries from all three teams, reads the data, and returns the query results.


Figure 4: One query node handles all requests.

At first this setup works well, but as more and more queries are added, the wait time for results grows quite large. Worse, queries are often lost due to timeouts. To deal with the increasing query throughput, a new setup, shown in Figure 5, provides four query nodes. Each of these nodes works independently for a different business purpose: one for the Reporting team, one for the Marketing team, one for the Sales team focusing on small customers, and one for the Sales team focusing on large customers.


Figure 5: Add more query nodes, one for each business purpose, to handle more throughput.

The new setup keeps up well with the high throughput and no queries get lost. However, for some time-sensitive queries that the teams need to react to immediately, waiting several minutes for results is not good enough. To solve this problem, the data is split equally into four shards, where each shard contains the data of 12 or 13 states, as shown in Figure 6. Because the Reporting team runs the most latency-sensitive queries, a query cluster of four nodes is built for them to perform queries four times faster. The Marketing team is still happy with its single-node setup, so data from all shards is directed to that one node.


Figure 6: Shard data and add Query Nodes to handle sharded data in parallel.
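This fan-out pattern can be sketched in a few lines of Python. The shard contents and the `query_shard` function below are invented for illustration; the point is that once the data is split into four shards, four workers can scan them in parallel and merge the partial results, which is roughly how the four-node Reporting cluster answers a query four times faster:

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical data: total sales per state, pre-split into four shards
# (each shard would hold 12 or 13 states in the real setup).
shards = [
    {"CA": 120, "TX": 95, "NY": 88},   # shard 1
    {"FL": 70, "WA": 40, "OR": 22},    # shard 2
    {"IL": 55, "OH": 33, "GA": 41},    # shard 3
    {"MA": 28, "CO": 25, "AZ": 30},    # shard 4
]

def query_shard(shard, min_sales):
    """Partial query: states in one shard with sales above a threshold."""
    return {state: s for state, s in shard.items() if s >= min_sales}

def run_query(min_sales):
    # One worker per shard, mimicking a four-node query cluster.
    with ThreadPoolExecutor(max_workers=len(shards)) as pool:
        partials = pool.map(query_shard, shards, [min_sales] * len(shards))
    # Merge the partial results from all shards.
    merged = {}
    for part in partials:
        merged.update(part)
    return merged

print(run_query(50))  # {'CA': 120, 'TX': 95, 'NY': 88, 'FL': 70, 'IL': 55}
```

The Marketing team's single-node setup corresponds to running the same merge over all four shards with one worker; the result is identical, only slower.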

The Sales team does not deal with time-sensitive queries, but as this team grows larger, the number of query requests keeps increasing. Therefore, the Sales team should take advantage of performance scaling to improve throughput and avoid reaching max throughput in the near future. This is done by replacing the two independent query nodes with two independent query clusters, one of four nodes and the other of two nodes, based on their respective growth.


Figure 7: Adjust the size of the Reporting cluster based on the Reporting team's performance needs and shut down a Sales cluster based on the Sales team's throughput needs.

During times of the year when the Reporting team does not need to handle time-sensitive queries, two query nodes of its cluster are temporarily removed to save resources, as shown in Figure 7. Similarly, when the Sales team does not need to handle high throughput workloads, it temporarily removes one of its clusters and directs all queries to the remaining one.

The teams are happy with their elastic scaling setup, which allows all teams to scale throughput up and down easily by adding or removing query clusters. However, the Reporting team notices that its query performance stops improving at four query nodes; scaling the cluster beyond that limit doesn't help. Thus we can say that the Reporting team's query throughput scaling is fully elastic, but its query performance scaling is only elastic to a scale factor of four.

The only way the Reporting team can scale query performance further is to split the data into more and smaller shards, which is not trivial. We'll discuss this next.

Ingest workload

Term definitions:

  • Ingest throughput scaling: the ability to scale up and down the amount of ingested data in a defined amount of time, such as a second or a minute.
  • Ingest performance scaling: the ability to increase or decrease the speed of ingesting a set of data into the system.

Examples


Figure 8: One ingest node handles all ingested data.

In order to have four shards of sales data as described above, the ingest data must be sharded at load time. Figure 8 illustrates an ingest node that takes all ingest requests, shards the data accordingly, handles pre-ingest work, and then saves the data to the right shard.
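At load time each incoming row must be routed to exactly one shard. Here is a minimal sketch of that routing step, assuming hash-based sharding on a hypothetical `state` column (the four shards of 12 or 13 states could equally be defined by explicit ranges):

```python
import hashlib

NUM_SHARDS = 4

def shard_for(state: str) -> int:
    """Map a shard key (here, the state) to a shard number.

    A stable hash (not Python's randomized built-in hash()) keeps the
    mapping consistent across ingest nodes and restarts.
    """
    digest = hashlib.sha256(state.encode()).digest()
    return int.from_bytes(digest[:4], "big") % NUM_SHARDS

def ingest(rows):
    """Group incoming rows into per-shard batches before writing."""
    batches = {i: [] for i in range(NUM_SHARDS)}
    for row in rows:
        batches[shard_for(row["state"])].append(row)
    return batches  # each batch would then be written to its shard's storage
```

Because every ingest node computes the same mapping, nodes can be added or removed to scale ingest throughput without reshuffling data that is already stored.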

However, as the ingest data increases, one ingest node can no longer keep up with the requests and ingest data gets lost. Thus a new setup, shown in Figure 9, adds more ingest nodes, each handling a different set of write requests, to support higher ingest throughput.


Figure 9: Add ingest nodes, each handling a subset of write requests, to support more throughput.

Even though the new setup handles a higher volume of ingest throughput and no data gets lost, the increasing demand for lower ingest latency makes the teams think they need to change the setup further. The ingest nodes that need lower ingest latency are converted into ingest clusters, shown in Figure 10.

Here each cluster includes a shard node that is responsible for sharding the incoming data, plus additional ingest nodes. Each ingest node is responsible for processing the pre-ingest work for its assigned shards and sending the data to the right shard storage. The performance of Ingest Cluster 2 is twice that of Ingest Node 1, as the latency is now around half that of the previous setup. Ingest Cluster 3 is around four times as fast as Ingest Node 1.


Figure 10: Convert ingest nodes to ingest clusters to speed up data ingest.
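How such a cluster divides the work can be sketched as follows. The `route` rule and the sort that stands in for pre-ingest work are assumptions for illustration; what matters is the shape of the computation: a shard node splits one write batch into per-shard sub-batches, and the ingest workers then process those sub-batches in parallel:

```python
from concurrent.futures import ThreadPoolExecutor

NUM_SHARDS = 4

def route(row):
    """Shard node: pick a shard for a row (toy rule: by first letter)."""
    return (ord(row["state"][0]) - ord("A")) % NUM_SHARDS

def preingest(shard_id, rows):
    """Ingest node: pre-ingest work for one shard's sub-batch
    (sorting stands in for deduplication, compression, etc.)."""
    return shard_id, sorted(rows, key=lambda r: r["state"])

def ingest_batch(rows, num_workers):
    """Cluster: shard the batch, then process sub-batches in parallel."""
    sub = {i: [] for i in range(NUM_SHARDS)}
    for row in rows:
        sub[route(row)].append(row)
    with ThreadPoolExecutor(max_workers=num_workers) as pool:
        done = dict(pool.map(lambda kv: preingest(*kv), sub.items()))
    return done  # per-shard results, ready to write to shard storage
```

With `num_workers=1` this degenerates to Ingest Node 1's behavior; with one worker per shard the batch latency drops to roughly that of the slowest sub-batch, which is the source of the two-times and four-times speedups above.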

During times of the year when latency is not critical, a couple of nodes are temporarily removed from Ingest Cluster 3 to save resources. When ingest throughput is minimal, Ingest Cluster 2 and Ingest Cluster 3 are even shut down and all write requests are directed to Ingest Node 1.

As with their query workloads, the Reporting, Marketing, and Sales teams are very happy with the elastic scaling setup for their ingest workloads. However, they notice that even though ingest throughput scales up and down easily by adding and removing ingest clusters, once Ingest Cluster 3 has reached its scale factor of four, adding more ingest nodes to the cluster doesn't improve performance. Thus we can say that its ingest throughput scaling is fully elastic, but its ingest performance scaling is only elastic to a scale factor of four.

Preparing for future elasticity

As demonstrated in the examples, the query and ingest throughput scaling of the setups in Figure 6 and Figure 10 is fully elastic, but their performance scaling is only elastic to a scale factor of four. To support a higher performance scaling factor, the data should be split into smaller shards, e.g., one shard per state. However, when we then run with a smaller scale factor, many shards must be mapped to one query node in the query cluster. Similarly, one ingest node must handle the data of many shards.
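That many-shards-to-one-node mapping can be sketched as a simple round-robin assignment; the shard and node names below are invented for illustration:

```python
def assign_shards(shards, nodes):
    """Round-robin many small shards onto fewer nodes.

    With 50 per-state shards, a 4-node cluster holds 12 or 13 shards
    per node; growing toward 50 nodes reaches one shard per node, the
    point past which adding nodes no longer speeds up a full scan.
    """
    assignment = {node: [] for node in nodes}
    for i, shard in enumerate(shards):
        assignment[nodes[i % len(nodes)]].append(shard)
    return assignment

states = [f"state_{i:02d}" for i in range(50)]   # one shard per state
nodes = [f"query_node_{i}" for i in range(4)]
layout = assign_shards(states, nodes)
# 50 shards over 4 nodes -> nodes hold 13, 13, 12, 12 shards
```

Splitting into per-state shards thus raises the ceiling on the performance scaling factor without forcing a large cluster: the same shards can be spread over 4 nodes on quiet days and 50 on the busiest ones.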

A limitation of performance scaling is that increasing the scale factor (i.e., splitting data into smaller shards) does not mean the system will scale as expected, due to the overhead or limitations of each use case, as we saw in Nancy's fast food restaurant, where the max performance scaling factor was two employees per order.

The elastic throughput and performance scaling described in this post are just examples to help us understand their role in a database system. The real designs to support them are a lot more complicated and need to consider more factors.

Nga Tran is a staff software engineer at InfluxData, and a member of the IOx team, which is building the next-generation time series storage engine for InfluxDB. Before InfluxData, Nga had been with Vertica Analytic DBMS for over a decade. She was one of the key engineers who built the query optimizer for Vertica, and later, ran Vertica's engineering team.


New Tech Forum provides a venue to explore and discuss emerging enterprise technology in unprecedented depth and breadth. The selection is subjective, based on our pick of the technologies we believe to be important and of greatest interest to InfoWorld readers. InfoWorld does not accept marketing collateral for publication and reserves the right to edit all contributed content. Send all inquiries to newtechforum@infoworld.com.