Hornet:  Parallel Data Transfer From Multiple Servers
      Allen Miu, Eugene Shih, Hari Balakrishnan
M. I. T. Laboratory for Computer Science
545 Technology Square, Cambridge, MA 02139
{aklmiu,eugene,hari}@lcs.mit.edu

6.892 Project Proposal
October 1, 1999


 


Research Problem:

We will investigate whether we can decrease file download time by developing a system that downloads data in parallel from multiple servers spread across a wide-area network.

Introduction:

The conventional method for downloading a file from the Internet is to open a connection between the client and a single server.  Download performance is then limited by the load on the server, the capacity of the bottleneck link, and traffic fluctuations anywhere along the path.  To help balance load and improve download performance, various service providers have begun deploying mirror servers in different network domains.  As a result, clients can now choose which server to download from.

Presently, users are seldom given accurate metrics (if any) to help them make the best choice.  Even when the "optimal path" has been chosen, throughput can still drop in the face of changing traffic patterns.

Rather than focusing on the problem of finding the "optimal path" to improve download performance, we propose a scheme, which we call "paraloading," that downloads different parts of a file in parallel from (initially) all of the mirror servers.  This approach has at least two advantages.  First, the optimal path is automatically among the parallel connections between the client and the mirror servers, without the complexity of estimating network performance metrics.  Second, we can balance load among the parallel connections to further improve download performance; for example, connections that become heavily congested can be dropped.  This contrasts with the single-connection case, where dynamic load balancing across different connections cannot be easily deployed.
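To make the idea concrete, here is a minimal sketch in Python (with simulated mirrors; all names are our own illustration, not part of any proposed API) of chunk-level paraloading: idle connections claim the next unfetched chunk, so faster mirrors naturally carry more of the load.

```python
import threading
import queue
import time

def paraload(fetchers, num_chunks):
    """Download num_chunks chunks in parallel from several 'mirror'
    fetchers (hypothetical stand-ins for per-server connections).
    Each fetcher is a function chunk_index -> bytes.  Idle workers
    pull the next unclaimed chunk, so faster mirrors do more work."""
    todo = queue.Queue()
    for i in range(num_chunks):
        todo.put(i)
    results = {}
    lock = threading.Lock()

    def worker(fetch):
        while True:
            try:
                i = todo.get_nowait()
            except queue.Empty:
                return          # no chunks left for this connection
            data = fetch(i)
            with lock:
                results[i] = data

    threads = [threading.Thread(target=worker, args=(f,)) for f in fetchers]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    # reassemble chunks in order
    return b"".join(results[i] for i in range(num_chunks))

# Two simulated mirrors serving the same 8-chunk file.
FILE = [bytes([65 + i]) * 4 for i in range(8)]   # b'AAAA', b'BBBB', ...
fast = lambda i: FILE[i]
slow = lambda i: (time.sleep(0.01), FILE[i])[1]  # 10 ms slower per chunk
print(paraload([fast, slow], 8) == b"".join(FILE))  # → True
```

In a real deployment the fetchers would be TCP connections issuing byte-range requests to distinct mirrors; the work-stealing structure is the point of the sketch.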

There are many issues that need to be settled before a parallel downloading scheme can be widely deployed.  First, we need to demonstrate that paraloading is feasible and achieves good performance in today's network.  Our belief is that current network traffic is concentrated around busy servers.  Hence, we hope to achieve good performance by "spreading" the traffic sources so that the paths to the wide-area servers are disjoint from pockets of heavy congestion.

We will need to develop a good understanding of the network topology and traffic patterns in order to design a suitable architecture and algorithms for paraloading.  To achieve the goal of reducing download latency, we plan to develop a set of algorithms that dynamically balance load across the parallel connections.

Research Methodology:

From here on, we will use the term `serial-loading' to distinguish single-connection downloads from `paraloading.'

Step 1

We will collect data to support our claim that parallel downloading is feasible and enhances download performance.  We will simulate parallel downloads by using a script to perform serial and parallel ftp downloads from mirror sites at various ISPs.  We will use traceroute and tcpdump extensively to collect routing information and gain insight into the current network topology and traffic patterns.
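As a rough illustration of the comparison this step will make, the following toy harness (in Python, with simulated mirrors and delays; the real experiments will use actual ftp transfers and wall-clock measurements) times a serial download against a paraload of the same chunks:

```python
import time
from concurrent.futures import ThreadPoolExecutor

def timed(label, fn):
    """Run fn() and report its elapsed wall-clock time."""
    t0 = time.monotonic()
    result = fn()
    dt = time.monotonic() - t0
    print(f"{label}: {dt:.3f}s")
    return result, dt

def serial_load(fetchers, chunks):
    # one "connection": the first mirror fetches every chunk in order
    return b"".join(fetchers[0](i) for i in range(chunks))

def parallel_load(fetchers, chunks):
    # round-robin chunks across mirrors, fetched concurrently
    with ThreadPoolExecutor(len(fetchers)) as pool:
        futs = [pool.submit(fetchers[i % len(fetchers)], i)
                for i in range(chunks)]
        return b"".join(f.result() for f in futs)

# Simulated mirror with a 5 ms per-chunk "network" delay.
def mirror(i):
    time.sleep(0.005)
    return b"x" * 16

_, t_serial = timed("serial", lambda: serial_load([mirror], 32))
_, t_par = timed("parallel", lambda: parallel_load([mirror] * 4, 32))
assert t_par < t_serial  # paraloading wins when mirrors are independent
```

The simulated mirrors here are independent by construction; whether real mirror paths are sufficiently disjoint is exactly what the measurement study must determine.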
Step 2
We will analyze the data to determine whether paraloading achieves better download performance than serial-loading.

We will then construct ns simulation models based on the tcpdump and traceroute data.  Through simulation, we want to explore how different traffic patterns (e.g., congestion near the server, near the client, or in between) affect the performance of paraloading.

If time permits, we also want to explore how paraloading could affect traffic patterns.  More specifically, we want to examine TCP more closely to find out how to determine whether two different TCP connections share the same bottleneck.  (This will become very important as we develop algorithms that minimize the use of congested links.)
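One candidate heuristic, sketched below under the assumption that we can sample per-connection delays: two connections whose delay series are strongly correlated likely share a bottleneck queue. Both the Pearson-correlation test and the threshold are illustrative choices for this sketch, not settled design decisions.

```python
def pearson(xs, ys):
    """Sample correlation coefficient of two equal-length delay series."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sum((x - mx) ** 2 for x in xs) ** 0.5
    sy = sum((y - my) ** 2 for y in ys) ** 0.5
    return cov / (sx * sy)

def shared_bottleneck(delays_a, delays_b, threshold=0.7):
    # Heuristic: strongly correlated delays suggest a common queue.
    return pearson(delays_a, delays_b) > threshold

# Two flows through the same (simulated) congested queue see the same
# delay spikes; an independent flow does not.
spikes = [1, 1, 9, 1, 8, 1, 1, 9, 1, 7]
flow_a = [d + 0.2 for d in spikes]
flow_b = [d + 0.1 for d in spikes]
flow_c = [2, 3, 2, 2, 3, 2, 3, 2, 2, 3]
print(shared_bottleneck(flow_a, flow_b))  # → True
print(shared_bottleneck(flow_a, flow_c))  # → False
```

In practice the delay samples would come from tcpdump traces of the parallel connections, and noise, clock resolution, and sample alignment would all complicate the picture.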

Finally, we should also make theoretical efficiency comparisons with other proposed parallel data transfer schemes (notably, the Digital Fountain project).
Step 3
Once we have determined that paraloading is a good idea, we can begin a full-scale design of the system architecture.  This may involve modifying the DNS server to obtain mirror server information, developing a set of APIs that perform paraloading over TCP, developing a protocol to support paraloading, and modifying ftp and/or a web browser to use the API.
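A hypothetical shape for such an API (all names and the mirror registry here are purely illustrative; chunk striping is elided, and this sketch degenerates to simple failover among mirrors):

```python
# Sketch of what a paraloading API layered over TCP might look like.
# Nothing here is a committed design; names are placeholders.

def resolve_mirrors(hostname, registry):
    """Stand-in for a DNS extension that returns all mirrors of a host."""
    return registry.get(hostname, [hostname])

class Paraloader:
    def __init__(self, mirrors):
        self.mirrors = mirrors      # one logical connection per mirror

    def fetch(self, path, opener):
        """opener(mirror, path) -> bytes; chunking/striping elided.
        Degenerate sketch: try mirrors in order, return first success."""
        for m in self.mirrors:
            try:
                return opener(m, path)
            except OSError:
                continue
        raise OSError("all mirrors failed")

# Illustrative registry standing in for DNS-provided mirror lists.
REGISTRY = {"ftp.example.org": ["m1.example.org", "m2.example.org"]}
mirrors = resolve_mirrors("ftp.example.org", REGISTRY)
loader = Paraloader(mirrors)
data = loader.fetch("/pub/file", lambda m, p: f"{m}:{p}".encode())
print(data)  # → b'm1.example.org:/pub/file'
```

An application such as ftp or a web browser would call the API with a hostname and path and remain unaware of how many mirrors serve the transfer underneath.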

We will use the results of the data analysis in Step 2 to help us develop load-balancing algorithms that both maximize performance and minimize the use of congested links.
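One simple policy we might start from (illustrative only, not the proposal's final algorithm): periodically measure per-connection throughput, drop connections far slower than the best one, and reassign work in proportion to the speeds of the survivors.

```python
def rebalance(throughputs, drop_ratio=0.25):
    """Toy load-balancing policy.  throughputs maps each connection to
    its measured throughput (e.g. KB/s).  Connections slower than
    drop_ratio times the best are dropped (likely congested), and each
    survivor gets a share of the remaining work proportional to its
    speed.  Returns {connection: fraction_of_work}."""
    best = max(throughputs.values())
    keep = {c: t for c, t in throughputs.items() if t >= drop_ratio * best}
    total = sum(keep.values())
    return {c: t / total for c, t in keep.items()}

# mirror-c is an order of magnitude slower than the best, so it is
# dropped and its share is split between the remaining connections.
shares = rebalance({"mirror-a": 120.0, "mirror-b": 100.0, "mirror-c": 5.0})
print(sorted(shares))                 # → ['mirror-a', 'mirror-b']
print(round(shares["mirror-a"], 3))   # → 0.545
```

The drop threshold and the proportional-share rule are exactly the kind of parameters the Step 2 measurements should help us tune.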

Step 4
We will then run tests after implementing the paraloading system.  We will refine the experiments of Steps 1 and 2 above and perform the same set of analyses to evaluate how well our system works.
Resources:

To conduct our paraloading experiments, we will need several ISP accounts.

Tentative Schedule:

[Schedule table: weeks 1-8]




Last modified on September 30, 1999