
5.3 Response time and throughput

5.3.1 Observations

Response and service time

Figure 5.3 shows observed response and service time for both single server and dual server setups with an increasing number of concurrent clients. The response time starts out at approximately 200 ms for both single and dual server setups. By subtracting the round trip time (RTT) needed to transmit requests and corresponding results, we can calculate the service time, also included in Figure 5.3. We have used an RTT of 55 ms in our calculations, based on average ICMP ping results from the client network. The service time is initially approximately 80 ms, which we want to further decompose.
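To make the calculation explicit, the following minimal sketch (ours, not from the thesis) derives the service time by subtracting the 55 ms average RTT from a measured response time; the sample response times in the loop are hypothetical placeholders, not measured values.

    # Service time = total response time minus the network round trip.
    # The 55 ms RTT is the average ICMP ping from the client network;
    # the sample response times below are hypothetical placeholders.
    CLIENT_RTT_MS = 55.0

    def service_time_ms(response_time_ms: float, rtt_ms: float = CLIENT_RTT_MS) -> float:
        """Derive service time by subtracting the round trip time."""
        return response_time_ms - rtt_ms

    for response in (140.0, 300.0, 500.0):  # hypothetical samples
        print(f"response={response:.0f} ms -> service={service_time_ms(response):.0f} ms")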

Figure 5.3: Cloud search response time [plot of time (ms) vs. concurrent clients; response time and service time for the single server, dual server, and single server baseline setups]

When a request arrives at the cloud search server, a thread is created to handle the request, after which the request is parsed and processed, including an index lookup. This means that we can decompose the service time into two distinct components: thread creation/request processing, and index lookup. Since our baseline experiment has the same creation and processing overhead, excluding the index lookup, we are able to isolate the overhead of thread creation/request processing from the index lookup. In Figure 5.3, this overhead is visible as the single server baseline service time, starting out at about 50 ms. This tells us that a single index lookup initially takes approximately 30 ms, which is confirmed by the initial storage response time in Figure 5.4, showing the relationship between the total service time and the response time for lookups against cloud storage.
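As a minimal sketch of this decomposition (the helper is ours; the values are read off Figures 5.3 and 5.4), the lookup component falls out as a simple subtraction:

    # Isolate the index-lookup component by subtracting the baseline
    # service time, which includes thread creation and request
    # processing but performs no index lookup.
    def index_lookup_ms(service_ms: float, baseline_service_ms: float) -> float:
        return service_ms - baseline_service_ms

    # Initial values from Figure 5.3: ~80 ms total service time and
    # ~50 ms baseline leave ~30 ms for the index lookup itself.
    print(index_lookup_ms(80.0, 50.0))  # 30.0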

Figure 5.4: Cloud storage response time [plot of time (ms) vs. concurrent clients; service time and storage response time for the single server, dual server, and single server baseline setups]

With a single server, the response time is stable up to about 125 concurrent clients, after which it increases linearly with the number of concurrent clients. Given this linear increase, a single server reaches our maximum acceptable response time at approximately 400 concurrent clients.

Throughput and capacity

In order to pinpoint the bottleneck limiting the number of concurrent clients within an acceptable response time, we need to examine the throughput and capacity of our system.

Throughput is an expression of the rate at which a system performs work. In our case, we express the throughput as the number of search queries performed per second (QPS). For our search service, we calculate the average throughput $T$ for $Q_{\text{performed}}$ queries performed over $T_{\text{runtime}}$ seconds using the following formula:

$$T = \frac{Q_{\text{performed}}}{T_{\text{runtime}}}$$
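As a quick illustration of the formula, the sketch below computes the average throughput; the query count and runtime are hypothetical values, chosen only to land on the roughly 650 QPS single server capacity reported at the end of this section.

    # Average throughput T = Q_performed / T_runtime, in queries/second.
    def throughput_qps(queries_performed: int, runtime_seconds: float) -> float:
        return queries_performed / runtime_seconds

    print(throughput_qps(39_000, 60.0))  # 650.0 QPS (hypothetical counts)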

Closely related is the capacity of a system, which is an expression of the maximum throughput a system can handle without queuing requests. Workloads with rates less than the capacity can be handled directly without queuing. Workloads with rates higher than the capacity mean that requests must be queued for delayed processing, since the system is unable to cope. When the system is composed of multiple operations, the bottleneck operation (i.e., the operation that takes the longest time to complete) defines the total capacity, as sketched below.
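The sketch below illustrates the bottleneck rule under a simple assumed model in which requests stream through the operations as a pipeline and a fixed number of requests are handled concurrently; the operation times are the initial values from Figures 5.3 and 5.4, while the worker count and the model itself are our assumptions.

    # Capacity is bounded by the bottleneck (longest-running) operation.
    # Assumes a pipelined model where `workers` requests are serviced
    # concurrently; both the model and the worker count are assumptions.
    def capacity_qps(operation_seconds: dict[str, float], workers: int = 1) -> float:
        bottleneck = max(operation_seconds.values())
        return workers / bottleneck

    ops = {"thread creation/processing": 0.050, "index lookup": 0.030}
    print(capacity_qps(ops, workers=33))  # 660.0 QPS with 33 workers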

The observed throughput and capacity of our search system are shown in Figure 5.5. The solid lines indicate the average throughput as the workload varies with the number of concurrent clients, while the dashed lines indicate the estimated capacity of the different setups, based on the maximum throughput observed across all workloads.

When the throughput is equal to the capacity of a system, it is said to be saturated.

When the system is saturated, the response time for individual requests increases, as requests are forced to wait longer in the queue before they are processed. This means that while the service time starts out as the dominating factor in the total response time, it is quickly surpassed by the queuing delay once the system saturates.
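As a back-of-the-envelope illustration (ours, not a measurement from the thesis), any sustained arrival rate above the capacity leaves a backlog that grows linearly with time, which is what drives the queuing delay past saturation:

    # Backlog growth under sustained overload, in queries/second.
    # Uses the ~650 QPS single server capacity from this section;
    # the arrival rates and duration are hypothetical.
    def backlog(arrival_qps: float, capacity_qps: float, seconds: float) -> float:
        return max(0.0, arrival_qps - capacity_qps) * seconds

    print(backlog(800.0, 650.0, 10.0))  # 1500 requests queued after 10 s
    print(backlog(600.0, 650.0, 10.0))  # 0.0 -- below capacity, no backlog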

Figure 5.3 clearly shows this pattern: the initial increase in response time at approximately 125 concurrent clients corresponds to the point at which the throughput starts flattening in Figure 5.5. At about 175 concurrent clients, the single server throughput reaches its capacity, meaning that it is unable to perform more queries per second than the current workload. This is reflected in the response time, which starts to increase linearly with the rate of incoming requests.

When the rate of incoming requests is higher than the capacity, queues will in theory grow infinitely, since the server is unable to cope with the incoming rate. This means that no matter how many clients we add, the rate of completion will not increase, only the response times will. In practice, however, incoming requests would soon start to be dropped, since queues are not unbounded in reality, and the resulting service would become unreliable for users.

Figure 5.5: Cloud search throughput [plot of throughput (queries/second) vs. concurrent clients; single server, dual server, and single server baseline, with averaged capacity estimates for each]

Based on these observations, we conclude that a single server in our search service has a capacity of approximately 650 queries per second.
