In this article, we cover what 40GbE is and how it works, and outline the key areas of latency sensitivity related to 40GbE. We also cover why using a 40GbE MAC/PCS in FPGA designs is preferable when trading on an exchange that provides a 40GbE handoff. Finally, we examine the latency incurred when using a switching platform to convert between 40GbE and 10GbE.


What is 40Gb Ethernet?


In 2010, 40GbE was officially approved as Institute of Electrical and Electronics Engineers (IEEE) standard 802.3ba, which enables Ethernet frames to be transmitted at a rate of 40 Gbit/s. In practice, this is achieved by transmitting and receiving data over four lanes. To fully understand how this works, we need to go back to how this fits into the Open Systems Interconnection (OSI) model.



When transmitting, data is passed from the Data Link Layer (Layer 2) MAC down into the Physical Layer (Layer 1). The Physical Coding Sublayer (PCS) is responsible for encoding data bits into code groups for transmission via the Physical Medium Attachment (PMA) sublayer, and for decoding code groups on reception. Within the PCS, 40GbE uses Multilane Distribution (MLD).


At 40GbE, data is transmitted and received using a 64b/66b coding scheme. The MLD distributes 66-bit blocks across four 10.3125 Gbit/s lanes.


When transmitting, each 66-bit block is sent in round-robin fashion on the next available lane, and alignment markers are periodically inserted on every lane so that the receiver can identify the lane order and measure the skew between lanes. When receiving, the lanes are first reordered and deskewed based on the markers, and each 66-bit block is then recovered in its original order.
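The distribution and reassembly described above can be sketched in a few lines of Python. This is a simplified illustration, not the 802.3ba implementation: the real marker interval is 16,383 blocks per lane and each lane carries its own unique 66-bit alignment marker, both of which are reduced to toy stand-ins here.

```python
# Simplified sketch of 40GbE PCS Multilane Distribution (MLD).
NUM_LANES = 4
MARKER_INTERVAL = 4       # illustrative only; 802.3ba uses 16383 blocks
ALIGNMENT_MARKER = "AM"   # stand-in for a lane's 66-bit alignment marker

def distribute(blocks):
    """Round-robin 66-bit blocks across the lanes, inserting markers."""
    lanes = [[] for _ in range(NUM_LANES)]
    for sent, block in enumerate(blocks):
        if sent % (MARKER_INTERVAL * NUM_LANES) == 0:
            for lane in lanes:           # markers go on every lane at once
                lane.append(ALIGNMENT_MARKER)
        lanes[sent % NUM_LANES].append(block)
    return lanes

def reassemble(lanes):
    """Receiver side: strip markers, then interleave lanes back in order."""
    stripped = [[b for b in lane if b != ALIGNMENT_MARKER] for lane in lanes]
    out = []
    for i in range(max(len(lane) for lane in stripped)):
        for lane in stripped:
            if i < len(lane):
                out.append(lane[i])
    return out

blocks = [f"B{i}" for i in range(10)]
assert reassemble(distribute(blocks)) == blocks
```

In hardware, the per-lane markers also let the receiver work out which physical lane carries which logical lane and compensate for inter-lane skew; this sketch assumes the lanes arrive in order and already aligned.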



40GbE for electronic trading


After the official approval of the IEEE standard, exchanges started offering 40GbE handoffs, primarily to carry growing market data volumes over a single link. Only recently has optimizing latency over 40GbE links become critical for trading firms.


To demonstrate where in electronic trading 40GbE would commonly be used, it is best to look at an example of a trading venue or exchange that provides 40GbE connectivity. In this example, communications between the exchange infrastructure and the end-user-facing connectivity switch operate at 40 Gbit/s.



To illustrate the difference in performance between native 40GbE and 10GbE connectivity, we take an example where the exchange sends a UDP frame to the end-user, who in turn reacts after receiving the first 128 bytes of the packet by sending a 128-byte TCP packet back to the exchange.


In this scenario, the end-user is receiving and sending over a 40GbE path between the exchange infrastructure and the end-user application. We compare that to a scenario where the end-user application is using a 10GbE path and the exchange infrastructure remains at 40GbE.


The goal of this exercise is to demonstrate how the 40GbE and 10GbE standards differ. When making our calculations, we assume that all other components remain equal and do not consider any potential latency added by the network switch or end-user application.


Using the example of a typical payload received by the end-user application, we assume that the application logic trigger point is at the end of the payload. In line with the standard at many exchanges, the incoming (market data) frame is UDP and the outgoing (order) frame is TCP.



Assuming all else remains equal, the table below outlines the latency delta between receiving, triggering on and sending a frame to the exchange over 10GbE compared with 40GbE. Because 40GbE serializes bits four times faster, receiving and processing a 128-byte frame over a 10GbE link is correspondingly slower.


We assume that a trading application needs to check the incoming UDP Frame Check Sequence (FCS) before deciding to send an order back to the exchange, so the application must wait for all 128 bytes of the frame to be serialized.
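The serialization-time gap can be verified with simple arithmetic. Below is a minimal Python sketch using the 10 and 40 Gbit/s data rates (the 64b/66b overhead is carried by the higher 10.3125 Gbit/s line rate, so it does not change these figures):

```python
FRAME_BYTES = 128  # trigger point assumed in the scenario above

def serialization_ns(num_bytes, data_rate_gbps):
    """Wire time to serialize num_bytes at the given data rate."""
    return num_bytes * 8 / data_rate_gbps  # bits / (Gbit/s) = ns

t10 = serialization_ns(FRAME_BYTES, 10)  # 102.4 ns
t40 = serialization_ns(FRAME_BYTES, 40)  # 25.6 ns
print(round(t10 - t40, 1))  # 76.8 -> the 10GbE link adds ~77 ns on receive
```

A similar delta applies on the transmit side when serializing the 128-byte TCP response, so the penalty is incurred in each direction of the round trip.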


Theoretical example:


[1]: This is the theoretical lowest latency possible for going from 40GbE to 10GbE. To the best of our knowledge, no device currently exists that is capable of achieving this minimum delta and additional latency will be incurred in a real-life setting.


[2]: This is the theoretical lowest latency possible for going from 10GbE to 40GbE with a device capable of starting to forward the egress frame before the end of the ingress frame has been received.
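The reason a 10GbE-to-40GbE device must buffer part of the frame is an underrun constraint: the egress port drains bytes four times faster than the ingress port supplies them. Here is a simplified model of that constraint, using our 128-byte example frame and ignoring any switch processing time:

```python
FRAME_BYTES = 128
IN_NS_PER_BYTE = 0.8   # 10 Gbit/s ingress: 8 bits / 10 Gbit/s
OUT_NS_PER_BYTE = 0.2  # 40 Gbit/s egress: 8 bits / 40 Gbit/s

def min_egress_start_ns(frame_bytes):
    """Earliest time egress can start without ever underrunning.

    Byte i finishes arriving at (i + 1) * IN_NS_PER_BYTE and is
    transmitted at start + i * OUT_NS_PER_BYTE; the last byte is
    the binding constraint because egress drains faster.
    """
    return max((i + 1) * IN_NS_PER_BYTE - i * OUT_NS_PER_BYTE
               for i in range(frame_bytes))

# Roughly three quarters of the frame must be buffered before egress:
print(round(min_egress_start_ns(FRAME_BYTES), 1))  # 77.0
```

In other words, even an ideal rate-converting device cannot begin the 40GbE egress until about 77 ns of the 102.4 ns ingress has elapsed for this frame size.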


Practical example:


We want to compare the latency of using 40GbE or 10GbE FPGA MAC interfaces with the latency of converting 10GbE to 40GbE, or vice versa, within a network switch. Many switches have a ‘cut-through’ mode which permits lower latency when both ports operate at the same speed.


In most commonly deployed switches, we observe around 550–650 ns for this ‘cut-through’ mode. However, when converting between 10GbE and 40GbE, most switches fall back to a ‘store and forward’ mode, which introduces between 600 ns and 2.4 μs of latency depending on the packet size.
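The frame-size dependence of store-and-forward mode follows directly from having to buffer the entire frame at the ingress rate before re-serializing it. A rough model is sketched below; the processing_ns parameter is a placeholder for the switch's internal pipeline, not a measured value:

```python
def store_and_forward_ns(frame_bytes, ingress_gbps, processing_ns=0.0):
    """Whole frame is received at the ingress rate before egress begins."""
    return frame_bytes * 8 / ingress_gbps + processing_ns

# Buffering time alone, 10 Gbit/s ingress converting up to 40GbE:
print(store_and_forward_ns(128, 10))   # 102.4 ns
print(store_and_forward_ns(1500, 10))  # 1200.0 ns
```

Adding a few hundred nanoseconds of pipeline overhead to these buffering figures spans much of the 600 ns to 2.4 μs range quoted above.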


Taking this into account, our table is updated below:


[4]: Store-and-forward latency for a 128-byte frame from 40GbE to 10GbE
[5]: Store-and-forward latency for a 128-byte frame from 10GbE to 40GbE


What to look for when considering 40GbE


Based on these results and our own experience with low latency trading infrastructure, we can recommend the following tips for electronic trading firms that have a 40GbE handoff from an exchange. The list below assumes that the path between the connectivity/edge switch and the exchange infrastructure always remains at 40GbE.


  • Ensure that the exchange is using 40GbE connectivity internally and is not simply upscaling the link at the edge.
  • Attempt to keep end user applications connected at 40GbE throughout the path to avoid store and forward latency being introduced by the switches.
  • Design FPGA applications that can process data from the frame as it comes off the wire, rather than waiting until the complete frame is received, and that can take advantage of a 40GbE-compatible MAC.
  • When multiple trading applications are required locally (subsequent to the exchange hand-off), leverage Layer 1 technology for distribution, and if possible, 40GbE muxing on the return path to the exchange.
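The third tip above can be quantified: an FPGA pipeline that triggers on data as it is deserialized fires as soon as the decision bytes are on the wire, while one that waits for the complete frame pays for the full serialization. A hedged sketch at 40 Gbit/s, with hypothetical frame and trigger sizes:

```python
NS_PER_BYTE_40G = 0.2  # 8 bits / 40 Gbit/s

def trigger_time_ns(trigger_byte, frame_bytes, streaming):
    """Wire time until application logic can fire."""
    bytes_needed = trigger_byte if streaming else frame_bytes
    return bytes_needed * NS_PER_BYTE_40G

# Decision data sits in the first 128 bytes of a 512-byte frame:
print(trigger_time_ns(128, 512, streaming=True))   # 25.6
print(trigger_time_ns(128, 512, streaming=False))  # 102.4
```

The gap grows with frame size, which is why streaming-oriented FPGA designs preserve the advantage of the faster 40GbE serialization instead of giving it back in buffering.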

