In document Moving into the Cloud (pages 105-109)

5.3 Response time and throughput

5.3.2 Potential bottlenecks

Because the slowest operation in a system, the bottleneck, limits its total capacity, we are eager to identify the bottleneck operation in our search service. In the following section, we will look at some potential bottlenecks, discussing whether our observations indicate that they are in fact limiting system capacity.

Service limitations

Although the S3 documentation does not specifically detail any service limitations, it is highly likely that Amazon has implemented certain limitations on incoming requests to prevent abuse and attacks on the S3 service. One example is Denial of Service (DoS) attacks, in which large numbers of illegitimate requests are sent to a server with the intent of exhausting all available resources, rendering the server unable to provide legitimate service. For example, limitations are likely in place to restrict the maximum number of requests within a certain time interval for a single object or bucket, or received from an individual IP address or pair of account credentials.

However, we see no indications that the capacity of our search system is limited by any imposed service limitation. In fact, our figures show that our service was able to achieve a linearly increasing request rate when requesting objects from the same bucket using the same credentials at multiple servers. Since the S3 service is designed to be massively scalable, any imposed limitations are likely far higher than our current usage patterns, requiring significantly more concurrent client resources to reach.

Insufficient replication

S3 is designed to transparently provide replication, and the replication degree is dynamically adjusted by the system based on the request patterns of specific buckets and objects. The degree of replication cannot currently be influenced by the user, and the replication mechanism is not documented. Consequently, the specific thresholds and delays for increasing replication are unknown. Replication is handled at the bucket level, meaning that all objects in the same bucket are equally replicated.

Despite the unknown internal replication mechanism of S3, object requests must clearly be distributed among some set of servers with access to a replica of the requested object. This load distribution is possible at the DNS level (changing the IP address corresponding to the object URI) as well as internally when fetching the object from disk. Currently, it seems that a combination of the two is used.

Should a requested object be located in a bucket with too few replicas to process incoming requests at the current rate, requests must be queued for processing, resulting in higher service times. If such insufficient replication were a bottleneck in our benchmarks, the cloud search service would not have been able to scale almost linearly when adding more servers, since additional servers would be contending for the same replicas. Also, Figure 5.7 indicates that the service time of the cloud storage service remains relatively constant throughout our tests, indicating that queuing delay is not a factor in our tests. We conclude that there are no signs that the S3 service provides insufficient replication for our tests, although we have no way of establishing or adjusting the exact degree of replication.

VM I/O limitations

Running in a virtualized environment means that I/O resources are shared between a number of virtual machines sharing physical hardware. Since the cloud search service does not use hard disks directly, we are primarily concerned with the network bandwidth available to an instance. The EC2 documentation states that the service does not impose fixed limits on available bandwidth, except for guaranteeing a minimum share of the physical bandwidth to each VM sharing physical hardware. This means that an instance is often able to achieve a higher bandwidth than its strictly allotted share, at the cost of some variance.

To protect the internal network health, Amazon might additionally perform traffic shaping and filtering, for example to enforce a maximum rate of outgoing TCP packets from an instance. Such shaping might lead to situations in which a bottleneck is formed by limiting incoming or outgoing network traffic. Performing a large amount of cloud storage requests results in a large number of small HTTP requests, potentially affected by traffic shaping or network rate limiting.

To investigate whether the number of parallel HTTP connections could be a bottleneck, we ran our experiments replacing the cloud storage access with an HTTP request to a static document located on Amazon web servers. This service consistently operated with a capacity similar to the baseline service, confirming that traffic shaping or rate limiting is not the bottleneck in our service.
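The thesis does not include the benchmarking code itself; the sketch below shows one minimal way such a throughput measurement could be structured. All names and parameters are our own assumptions, and a no-op callable stands in for the HTTP request to the static document:

```python
import time
from concurrent.futures import ThreadPoolExecutor

def measure_throughput(request_fn, num_requests, num_threads):
    """Issue num_requests calls to request_fn across num_threads worker
    threads and return the observed throughput in requests per second."""
    start = time.monotonic()
    with ThreadPoolExecutor(max_workers=num_threads) as pool:
        # Drain the iterator so we wait for every request to complete.
        for _ in pool.map(lambda _: request_fn(), range(num_requests)):
            pass
    elapsed = time.monotonic() - start
    return num_requests / elapsed

# In the experiment, request_fn would perform an HTTP GET (e.g. via
# urllib.request.urlopen) against the static document's URL.
qps = measure_throughput(lambda: None, num_requests=1000, num_threads=50)
```

Comparing the measured rate against the baseline scenario, as done above, isolates the HTTP transport from the cloud storage backend.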

Server resource exhaustion

If the VMM is unable to provide sufficient resources like CPU and memory to the service, it will be unable to process more requests. We monitored CPU load and memory consumption during our experiments to ensure that we did not exhaust system resources, and we saw no signs that limited CPU and memory were affecting the capacity.

As long as clients have sufficient bandwidth and the server has resources to process the incoming rate of requests, the throughput of the baseline scenario shown in Figure 5.3 should scale linearly with load. We have executed the experiment with the client load spread over more machines to ensure that the limitations are not due to lack of bandwidth in the clients. This implies that the capacity we observe in the baseline scenario is limited by server resources, and represents the maximum throughput achievable on a single server, independently of cloud storage performance.

Little’s law [77] is well known in queuing theory, and describes the mean number of requests in a system at any time, L, as a function of the arrival rate of new requests, λ, and the expected time, W, spent processing a single request:

L = λW

Little’s law is often used in software benchmarking to ensure that observed results are not limited by bottlenecks in the test setup. In a multi-threaded scenario, it allows us to calculate the effective concurrency, i.e., the average number of threads actually running at any time.

We suspect that thread creation and management is the most resource-intensive task in the otherwise minimal baseline scenario. We therefore want to investigate whether thread creation and management overhead in our server processes is in fact limiting the throughput. The results shown in Figure 5.5 indicate that a single server is able to handle a throughput of approximately 1500 queries per second in this baseline scenario (i.e., without accessing the cloud storage service). Using Little’s law, we are able to calculate the average number of concurrent threads (units) active in the system based on the maximum throughput (arrival rate) and corresponding unit service time:

L = 1500 units/s · 0.075 s/unit = 112.5 units
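This arithmetic can be reproduced directly; the snippet below is a minimal illustration using the throughput and service time given above, not the benchmarking code used in the thesis:

```python
def effective_concurrency(arrival_rate_qps, service_time_s):
    """Little's law, L = lambda * W: the mean number of requests
    (here, active threads) in the system at any time."""
    return arrival_rate_qps * service_time_s

# Baseline scenario: 1500 queries/s at a 75 ms mean service time.
print(effective_concurrency(1500, 0.075))  # 112.5
```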

Using the same calculation with Little’s law, we arrive at an equivalent mean number of concurrent threads per server of 105 for both the single and dual server scenarios, including cloud storage access. These numbers tell us that the actual achieved concurrency is almost equivalent in all scenarios, independently of the work performed in each case.

This leads us to conclude that thread creation and management overhead is in fact the current bottleneck in our cloud search service, and that the current system is unable to handle more than approximately 100-110 concurrent requests per server, limiting throughput to 650 QPS. To address this, we would have liked to implement a more efficient version of the client based on asynchronous I/O operations instead of threads, but time constraints prevented us from completing this for this thesis.
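The asynchronous client we propose was not implemented; as a rough sketch of the idea, the following Python asyncio fragment bounds the number of in-flight requests with a semaphore instead of dedicating a thread to each request. The fetch function, its simulated 75 ms delay, and all names are illustrative assumptions, not code from the thesis:

```python
import asyncio

async def fetch(object_key):
    # Stand-in for a non-blocking cloud storage request; a real client
    # would issue the HTTP GET with an asynchronous HTTP library.
    await asyncio.sleep(0.075)  # simulate the ~75 ms mean service time
    return object_key

async def run_queries(keys, max_in_flight):
    # A semaphore caps concurrency; requests waiting on I/O cost no
    # thread stacks or context switches, unlike thread-per-request.
    sem = asyncio.Semaphore(max_in_flight)

    async def bounded_fetch(key):
        async with sem:
            return await fetch(key)

    return await asyncio.gather(*(bounded_fetch(k) for k in keys))

results = asyncio.run(
    run_queries([f"doc-{i}" for i in range(200)], max_in_flight=100))
```

With 100 requests in flight, a single event-loop thread replaces the 100-110 worker threads identified above as the bottleneck, which is the motivation for the suggested redesign.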