Scaling Internet Routers Using Optics (4/26)

3 May

Context

 
This paper was written over three years ago and made predictions about where technology would be three years after the time of writing.  At that time, many components were already starting to fall behind the improvements predicted by Moore’s Law, largely due to power constraints.  As a result, it has become harder to scale routers because of the power requirements of electronics.
 
Demand has also decreased.  After the bubble burst there was an oversupply of network capacity, so service providers had excess capacity.  However, everyone was caught off guard by the rise of video, which is now 40-50% of the traffic in the US during peak hours.  ISPs are playing catch-up, and their approach is generally to deploy many small routers to incrementally improve capacity instead of deploying a few large routers.  As a result, the technology of this paper has not been a focus.  The capacity of some of the largest routers is ~100 Tb/s, approximately what the paper proposed, but they use a different approach.  The paper predicted 160 Gb/s line cards.  These exist now, though they did not arrive three years after the paper.  These cards tend to have four 40 Gb/s links because link technology hasn’t increased as fast as line-card speeds.
 
Also note that in both this paper and the previous paper the router was designed for fixed-length cells.  The biggest routers all use fixed-length packets internally, which provides benefits for pipelining and fixed internal timings.  If packets were variable sizes then the pipeline stages would be delayed by the longer packets.  Splitting packets into cells is cheap but not free; reconstructing the packet is much harder because there may be reordering issues.  The whole packet must make its way through before it can be sent out, so the first cell to arrive has to wait for the last cell.  There are also the costs of transmitting extra header overhead.  Fragmentation can be an issue, since a 1-byte chunk still takes up a whole cell.  Common cell sizes are 64 and 128 bytes; TCP ACKs are 64 bytes, for example, but any fixed size will have some inefficiency.  All the largest routers do this internally, allowing them to run a fixed internal clock.
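To make the overheads concrete, here is a minimal Python sketch of cell segmentation.  The 64-byte payload and 8-byte cell header are illustrative numbers, not taken from any particular router.

    # Sketch: splitting a packet into fixed-length cells, as the big routers
    # do internally.  Cell and header sizes here are assumptions.

    CELL_PAYLOAD = 64      # bytes of packet data per cell
    CELL_HEADER = 8        # assumed per-cell header (sequence number, etc.)

    def segment(packet: bytes):
        """Split a packet into fixed-length cells, padding the last one."""
        cells = []
        for i in range(0, len(packet), CELL_PAYLOAD):
            chunk = packet[i:i + CELL_PAYLOAD]
            pad = CELL_PAYLOAD - len(chunk)      # wasted bytes in the last cell
            cells.append(chunk + b"\x00" * pad)
        return cells

    # A 65-byte packet needs two cells: 128 payload bytes plus 16 header
    # bytes on the wire to carry 65 bytes -- the fragmentation cost above.
    cells = segment(b"x" * 65)
    wire_bytes = len(cells) * (CELL_PAYLOAD + CELL_HEADER)
    print(len(cells), wire_bytes)    # 2 cells, 144 bytes for 65 bytes of data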
 

Background

 
This paper is largely about load-balancing.  The goal of load-balancing is to take an arbitrary traffic matrix and get predictable performance from it.
 
Imagine N inputs and N outputs of capacity R that are fully connected by an internal mesh.  Every crosslink has to have capacity R because we do not know in advance what our traffic pattern will be, and any input might saturate a single output.  This means the mesh needs total capacity N^2 R.  If we know the traffic pattern in advance we can size the mesh at NR.  We pay a factor of N penalty for not knowing the traffic in advance.
 
If the traffic is uniform then the crosslinks can have capacity R/N, leading to total capacity NR.  Can we make the traffic look sufficiently uniform?
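A quick sketch of that arithmetic, with made-up values of N and R:

    # Sketch: the factor-of-N penalty, for an illustrative N and R.
    N, R = 16, 10e9                      # 16 ports at 10 Gb/s (assumed values)

    worst_case_mesh = N * N * R          # every crosslink provisioned at R
    known_traffic = N * R                # mesh sized to a known traffic matrix
    uniform_mesh = N * N * (R / N)       # crosslinks at R/N if traffic is uniform

    print(worst_case_mesh / known_traffic)   # factor of N = 16 penalty
    print(uniform_mesh == known_traffic)     # True: uniform traffic recovers NR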
 
The approach is to use three stages.  The input stage load-balances over an intermediate stage, which queues the packets.  The intermediate stage then serves packets to the output stage, which reassembles each packet and sends it out.  This makes the traffic look “sufficiently uniform”.  C.-S. Chang made this observation (it was also made independently by Valiant), showing that we can get 100% throughput as long as we have “weakly mixing” traffic.
 
This system was termed a Valiant load-balanced switch.  Its switching fabrics are stateless, and the internal links only have to have capacity R/N.  There is no longer any arbitration.  It requires N intermediate stages with N virtual output queues (VOQs) each.  Each queue must be written at rate R and read at rate R simultaneously, so each needs a memory bandwidth of 2R.
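Here is a minimal sketch of the two-stage idea, with round-robin spreading standing in for the first mesh.  The real design spreads fixed-size cells with careful timing; all of that is omitted here.

    # Sketch of a Valiant load-balanced switch: spread cells over the
    # intermediate stages regardless of destination, queue them in VOQs,
    # then serve each output round-robin across the intermediates.
    from collections import deque

    N = 4
    voq = [[deque() for _ in range(N)] for _ in range(N)]  # voq[mid][out]
    spread_ptr = [0] * N                 # per-input round-robin pointer

    def stage1(inp, cell, out):
        """Input stage: spread cells uniformly over intermediate stages."""
        mid = spread_ptr[inp]
        spread_ptr[inp] = (mid + 1) % N
        voq[mid][out].append(cell)       # intermediate stage queues by output

    def stage2(out):
        """Output stage: pull the next cell destined to `out`, round-robin."""
        for mid in range(N):
            if voq[mid][out]:
                return voq[mid][out].popleft()
        return None

    stage1(0, "cell-a", out=2)
    stage1(0, "cell-b", out=2)           # lands in a different intermediate
    print(stage2(2), stage2(2))          # cell-a cell-b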
 
We skip the technicalities of weak mixing; most physical arrival processes are weakly mixing, and it is actually hard to construct one that is not.
 

Intuition

If you look at the arrival rate at the intermediate stage, it is the vector of arrival rates at the first stage transformed by a uniform matrix U and scaled by 1/N, that is
 
b = (1/N) U a
 
The capacity of the second mesh is 
 
C = (R/N) U
 
For the second mesh, the arrival rate must be less than the service rate. 
 
b - C = (1/N)(U a - R U) < 0
 
Since every entry of a is less than R, this must hold.  The derivation of the above inequality is where the weakly-mixing requirement manifests.  If we can implement U in both stages, we get 100% throughput.  As long as the two stages are not permuting in lockstep we get the same result.  Total switching capacity is 2NR, but no part of the system has to operate faster than the line rate.
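The inequality is easy to sanity-check numerically.  This sketch, assuming numpy and a random admissible rate matrix (rows and columns summing to less than R), verifies that uniform spreading keeps every second-mesh link below its R/N capacity:

    # Sketch: for any admissible rate matrix lam (no input or output
    # oversubscribed), the load on every second-mesh link stays below R/N.
    import numpy as np

    N, R = 8, 1.0
    lam = np.random.rand(N, N)
    lam *= (R * 0.99) / max(lam.sum(0).max(), lam.sum(1).max())  # admissible

    # After uniform spreading, intermediate j sends lam[:, k].sum() / N
    # to output k -- the same load for every intermediate j.
    second_mesh_load = lam.sum(axis=0) / N
    assert (second_mesh_load < R / N).all()     # never exceeds link capacity
    print(second_mesh_load.max(), R / N)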
 

Performance

How does this system compare to one with centralized arbitration?
 
* Without the centralized arbiter it is a lot less complex, and can be better distributed.  
* If you could build a scheduler that could work fast enough, you wouldn’t have to build the extra capacity.  
* Fairness or guarantees are easier to make with a scheduler.
 
A natural question is where the actual switching takes place in this system; nowhere have we mentioned looking up forwarding tables.  The actual switching takes place in the assignment to a VOQ at the intermediate stage.  When an intermediate stage receives a packet it looks up the packet’s header to figure out which queue to put the packet into.
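A toy sketch of that lookup, with a plain dict standing in for the real longest-prefix-match forwarding table (the table contents are made up):

    # Sketch: the "switching" is just header lookup -> VOQ selection.
    from collections import deque

    forwarding_table = {"10.0.0.0/8": 2, "192.168.0.0/16": 5}  # prefix -> port
    voq = {port: deque() for port in forwarding_table.values()}

    def enqueue(cell_prefix, cell):
        out = forwarding_table[cell_prefix]   # the switching decision
        voq[out].append(cell)

    enqueue("10.0.0.0/8", "cell-1")
    print(len(voq[2]))   # 1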
 
The internet does not provide any sequencing guarantees.  However, many applications and TCP assume that traffic within a flow will arrive in order in the absence of congestion.  This has essentially become a requirement on network hardware.  Many load-balancing systems now balance per-flow instead of per-packet to preserve packet order.
 
In our intermediate stage the queues can have different lengths, so when a new set of cells is spread they may experience different delays before being sent to the output, leading to mis-sequencing.  One approach is uniform frame spreading, which keeps the queue-length difference at zero by filling incomplete frames with dummy cells.
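A sketch of the padding rule, assuming a frame is exactly N cells:

    # Sketch of uniform frame spreading: only spread frames of exactly N
    # cells, padding short frames with dummies so every intermediate queue
    # grows by the same amount and order is preserved.
    N = 4
    DUMMY = None

    def make_frame(cells):
        """Pad a partial frame to length N with dummy cells before spreading."""
        frame = list(cells[:N])
        frame += [DUMMY] * (N - len(frame))
        return frame

    print(make_frame(["c1", "c2"]))   # ['c1', 'c2', None, None]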
 
The main contribution of the paper is a scheme called Full Ordered Frames First (FOFF): serve the full frames, which cannot cause mis-sequencing, first, then handle the leftovers.  This requires queues at the input, from which the system spreads each flow uniformly over the intermediate stage.
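A sketch of just the FOFF scheduling rule; the real scheme also needs per-flow input queues and a small reordering buffer at the output, which are omitted here, and the queue contents are made up.

    # Sketch of Full Ordered Frames First: complete frames of N cells go
    # first (they cannot mis-sequence), then the partial leftovers.
    from collections import deque

    N = 4
    flow_queues = {"flow-a": deque(["a1", "a2", "a3", "a4", "a5"]),
                   "flow-b": deque(["b1", "b2"])}

    def next_batch():
        # First pass: any flow holding a full frame of N cells goes first.
        for q in flow_queues.values():
            if len(q) >= N:
                return [q.popleft() for _ in range(N)]
        # Otherwise spread a left-over partial frame.
        for q in flow_queues.values():
            if q:
                return [q.popleft() for _ in range(min(N, len(q)))]
        return []

    print(next_batch())   # ['a1', 'a2', 'a3', 'a4']  (full frame first)
    print(next_batch())   # ['a5']  (leftovers afterwards)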
 
In the end, we’ve built a parallel output-queued switch.  Expected delay is within a constant of output-queued switches, and it has 100% throughput.
 

Optics

The paper is titled after the optics, though they aren’t its most important contribution.  Still, the optical design has some interesting properties.
 
One observation is that we can put the three stages on one line card.  The input sends to the first mesh, which does the spreading into the intermediate stage stored on the same line card.  The second mesh then maps from the intermediate stages to the outputs.  This takes advantage of each stage needing N components.
 
For the spreading, any spreading device that takes in N channels at 2R/N each will work.  There just has to be a way to send N items on the wire.  Examples: a full uniform mesh, a round-robin crossbar, or static WDM.
 
The Arrayed Waveguide Grating Router (AWGR) is of particular interest.  It sends wavelength i on input port j to output port (i + j − 1) mod N.  It is totally passive and quite small, using almost zero power.  It does the shifting automatically and is fairly easy to implement with a single fiber and multiplexed lasers.  Today, 64-128 wavelengths on a single fiber are feasible.
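The routing rule is simple enough to write down directly; this sketch numbers ports and wavelengths from 1 to match the formula above:

    # Sketch of the AWGR routing rule: wavelength i entering input port j
    # exits output port (i + j - 1) mod N.
    N = 8

    def awgr_output(wavelength_i, input_j):
        return (wavelength_i + input_j - 1) % N

    # From input port 1, the N wavelengths fan out to all N output ports,
    # so one fiber with N multiplexed lasers reaches every linecard.
    print([awgr_output(i, 1) for i in range(1, N + 1)])   # [1, 2, ..., 7, 0]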
 

Failures and Configuration

When a linecard is missing or has failed, a whole set of input, intermediate, and output stages is missing.  We can then no longer serve rate R, because the capacity coming out of each stage is only (N − 1) * R / N.  This means the system can no longer serve rate R for any given flow.  In short, we must be able to reconfigure something; fortunately the reconfiguration does not have to happen at the rate of packets, only at the rate linecards are added or removed.
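A sketch of the arithmetic, with illustrative values: with one of N linecards gone, spreading over the remaining N − 1 stages needs each crosslink to carry R/(N − 1) instead of R/N, which is why something must be re-provisioned when cards come and go.

    # Sketch: per-crosslink load before and after a single linecard failure.
    N, R = 16, 160e9                        # assumed port count and line rate

    per_link_full = R / N                   # crosslink load with all N cards
    per_link_degraded = R / (N - 1)         # needed to restore rate R after a loss
    print(per_link_full, per_link_degraded)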
 

Final Thoughts

The scaling provided by the load-balancing is what allowed optics to be used.  However, it turns out the load-balancing, not the optics, was the more significant contribution.  How would we model the total buffer size of this style of switch to compare it to another?  It can be modeled simply as the sum of the VOQs.  The distributed nature gives the intuition that the switch processes in parallel, ensuring that no interconnect sits idle while another is busy, which leads to high throughput.