Enyx 10G TCP ULL Performance Report

Version

3.0.0

Context

This document provides performance measurements for the Enyx 10G TCP ULL core in two configurations.

The measurements are automatically imported from behavioral simulations of the Enyx 10G TCP ULL core.

The two test configurations measured are:

  • Store & Forward

    • Tx: A Store & Forward FIFO is used to compute the TCP header size and checksum values from user provided payload.

    • Rx: A Store & Forward FIFO is used to allow packets with MAC FCS or TCP layer checksum errors to be dropped. Payloads received on the user interface are only from valid packets.

  • Cut-through

    • Tx: Tx Cut-through mode bypasses the Store & Forward FIFO. It is enabled when the user provides both the pre-computed size and the pre-computed checksum of the TCP payload on the User Tx interface.

    • Rx: In Rx Cut-through mode, neither the TCP nor the IP layer checksum is verified, and the Rx Store & Forward FIFO is bypassed. FCS errors detected by the MAC/PCS layer are still reported to the user on EOP.

The generic parameters of the Enyx 10G TCP ULL core used for all testing in this report are listed in the section Generic parameters applied to the Enyx 10G TCP ULL stack.

Resources & Working frequencies

The tables below show Enyx 10G TCP ULL implementation results on our supported FPGA families. The compilations include only the Enyx 10G TCP ULL stack and therefore give the maximum reachable frequency for each configuration. Resource usage is averaged over 10 different runs, and the maximum frequency is the best achieved across these 10 runs, using either Vivado 2020.1 or Quartus 20.1 Standard. The parameters applied to the tools are listed in the section Constraints applied to FPGA compiler tools for Resources and Working frequencies estimations.

These results are all obtained with:

  • the Enyx 10G TCP ULL Tx Retransmission memory instantiated as external memory. It is therefore not accounted for in the following tables, since only FPGA RAM blocks are reported here.

  • a 160 MHz memory clock, since this does not affect the latency of the design but greatly improves timing. It only affects the latency of packets that need to be retransmitted to the peer, which are rarely as latency sensitive as the original packets.

Resources are reported as follows:

  • Logic

    • in ALM for Intel

    • in LUT for Xilinx

  • Registers in K

  • Memory as block memory usage

    • M20k for Intel

    • 36k for Xilinx

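Table 1 Resources summary for the Enyx 10G TCP ULL stack with the congestion feature enabled.

| Family | Device | Nb of Sessions | Data width (bits) | Logic | Logic % | Registers | Registers % | Memory | Memory % |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Virtex US+ | VU9P | 2 | 64 | 35558 | 4 | 54908 | 3 | 16 | 1 |
| Virtex US+ | VU9P | 64 | 64 | 42675 | 4 | 66817 | 3 | 17 | 1 |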

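Table 2 Resources summary for the Enyx 10G TCP ULL stack without the congestion feature enabled.

| Family | Device | Nb of Sessions | Data width (bits) | Logic | Logic % | Registers | Registers % | Memory | Memory % |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Virtex US+ | VU9P | 2 | 32 | 30791 | 3 | 48612 | 3 | 17 | 1 |
| Virtex US+ | VU9P | 64 | 32 | 37899 | 4 | 60558 | 3 | 17 | 1 |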

A Recommended Frequency is provided for each use case, showing reasonable working frequencies for the Enyx 10G TCP ULL stack (without congestion feature enabled) in a typical scenario, along with the best frequency achieved on the specified target amongst all runs.

Table 3 Recommended clock frequency for the Enyx 10G TCP ULL stack.

| Family | Device | Speed Grade | Nb of Sessions | Data width (bits) | Freq (MHz) | Best Freq Achieved (MHz) |
| --- | --- | --- | --- | --- | --- | --- |
| Virtex US+ | VU9P | 2 | 64 | 64 | 250 | 311 |
| Virtex US+ | VU9P | 2 | 2 | 64 | 250 | 334 |
| Virtex US+ | VU9P | 2 | 64 | 32 | 322 | 387 |
| Virtex US+ | VU9P | 2 | 2 | 32 | 322 | 410 |
| Virtex US+ | VU9P | 3 | 64 | 64 | 250 | 314 |
| Virtex US+ | VU9P | 3 | 2 | 64 | 250 | 337 |
| Virtex US+ | VU9P | 3 | 64 | 32 | 322 | 385 |
| Virtex US+ | VU9P | 3 | 2 | 32 | 322 | 418 |

Latency Testing

In order to measure the latency of the Enyx 10G TCP ULL Stack, the following test is performed:

  • For Rx latency:

    • Packets are sent out from a MAC core to the Enyx 10G TCP ULL stack, one by one, from 1 to 2048 Bytes with a step of 8 Bytes.

    • Packets are throttled to 10 Gbps in order to emulate a 10 Gbps link.

    • Packets are immediately timestamped when they enter the Enyx 10G TCP ULL stack, at the Start of Packet (SoP).

    • As soon as the packets are received on the Enyx 10G TCP ULL Rx User Out interface, a second timestamp is performed, at the Start of Packet (SoP).

    • The lower the data width of the User Rx interface, the lower the latency can be since the SoP can be output faster on the interface without having to wait for the complete data width to be received.

    • The two timestamps are compared to give the Enyx 10G TCP ULL Rx latency for each packet size.

    _images/TCP_perfs_latency_rx.svg

    Figure 1 Enyx 10G TCP ULL RX Latency Testing Scenario

  • For Tx latency:

    • Packets are pushed to the Enyx 10G TCP ULL Tx User In interface, one by one, from 1 to 2048 Bytes with a step of 8 Bytes.

    • Packets are sent out as soon as possible, so a higher input bandwidth (Clk * width for the User Tx interface) results in lower latencies in all Store & Forward scenarios.

    • Packets are immediately timestamped when they enter the Enyx 10G TCP ULL stack, at the Start of Packet (SoP).

    • As soon as the packets are received on the Enyx MAC core, a second timestamp is performed, at the Start of Packet (SoP).

    • The two timestamps are compared to give the Enyx 10G TCP ULL Tx latency for each packet size.

    _images/TCP_perfs_latency_tx.svg

    Figure 2 Enyx 10G TCP ULL TX Latency Testing Scenario

Latencies are all measured from Start-of-Packet (SoP) to Start-of-Packet (SoP).
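As a rough illustration of the throttling and input-bandwidth points above, the sketch below computes the raw bandwidth of a streaming interface (Clk * width) and the time a frame occupies a 10 Gbps link. The function names are hypothetical, the 64-bit width and 322 MHz clock are only example values, and the frame size counts the 54 protocol header bytes plus the payload only (preamble, FCS and inter-frame gap are ignored), so this is a simplified model and not the test bench itself.

```python
def interface_bandwidth_mbps(clock_mhz: float, data_width_bits: int) -> float:
    """Raw bandwidth of a streaming interface that moves one word per clock cycle."""
    return clock_mhz * data_width_bits


def wire_time_ns(frame_bytes: int, link_gbps: float = 10.0) -> float:
    """Time needed to serialize frame_bytes on the link (1 Gbit/s equals 1 bit/ns)."""
    return frame_bytes * 8 / link_gbps


# Example values only: a 64-bit user interface clocked at 322 MHz.
user_bw = interface_bandwidth_mbps(322.0, 64)

# Assumed frame: 54 header bytes (Ethernet/IP/TCP) plus a 64-byte payload.
frame_time = wire_time_ns(54 + 64)

print(f"user interface bandwidth: {user_bw:.0f} Mbit/s")
print(f"time on a 10 Gbps link:   {frame_time:.1f} ns")
```

With these example values the user interface can offer roughly twice the 10 Gbps line rate, which is why the input side is not the limiting factor in this simplified model.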

Latency Summary

Table 4 Enyx 10G TCP ULL Latency Summary table

| TCP Payload Size (Bytes) | 1 | 8 | 16 | 32 | 64 | 128 | 256 | 512 | 1024 | 2048 |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 10G RTT Cut-Through - 322 MHz (ns) | 52.75 | 55.86 | 55.86 | 55.86 | 55.86 | 55.86 | 55.86 | 55.86 | 55.86 | 55.86 |
| • 10G Tx Cut-Through - 322 MHz (ns) | 6.21 | 6.21 | 6.21 | 6.21 | 6.21 | 6.21 | 6.21 | 6.21 | 6.21 | 6.21 |
| • 10G Rx Cut-Through - 322 MHz (ns) | 46.54 | 49.65 | 49.65 | 49.65 | 49.65 | 49.65 | 49.65 | 49.65 | 49.65 | 49.65 |
| 10G RTT - 322 MHz (ns) | 68.27 | 71.37 | 77.58 | 93.09 | 124.13 | 189.29 | 316.51 | 570.96 | 1079.86 | 2094.54 |
| • 10G Tx - 322 MHz (ns) | 6.21 | 6.21 | 6.21 | 9.31 | 15.52 | 27.93 | 52.75 | 102.4 | 201.7 | 400.29 |
| • 10G Rx - 322 MHz (ns) | 62.06 | 65.16 | 71.37 | 83.78 | 108.61 | 161.36 | 263.76 | 468.56 | 878.16 | 1694.25 |
| 10G RTT Cut-Through - 350 MHz (ns) | 51.42 | 54.28 | 54.28 | 54.28 | 54.28 | 54.28 | 54.28 | 54.28 | 54.28 | 54.28 |
| • 10G Tx Cut-Through - 350 MHz (ns) | 5.71 | 5.71 | 5.71 | 5.71 | 5.71 | 5.71 | 5.71 | 5.71 | 5.71 | 5.71 |
| • 10G Rx Cut-Through - 350 MHz (ns) | 45.71 | 48.57 | 48.57 | 48.57 | 48.57 | 48.57 | 48.57 | 48.57 | 48.57 | 48.57 |
| 10G RTT - 350 MHz (ns) | 65.71 | 68.57 | 77.14 | 91.43 | 122.86 | 185.71 | 311.43 | 560.0 | 1062.85 | 2062.85 |
| • 10G Tx - 350 MHz (ns) | 5.71 | 5.71 | 5.71 | 8.57 | 14.29 | 25.71 | 48.57 | 94.29 | 185.71 | 368.57 |
| • 10G Rx - 350 MHz (ns) | 60.0 | 62.86 | 71.43 | 82.86 | 108.57 | 160.0 | 262.86 | 465.71 | 877.14 | 1694.28 |
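In Table 4, each RTT row appears to be the sum of the Tx and Rx rows listed beneath it. The report does not state this explicitly, so the short check below is only a sanity check on the published 322 MHz Store & Forward figures; the helper name and the tolerance are assumptions made for this illustration.

```python
# 322 MHz Store & Forward rows copied from Table 4 (payloads of 1 to 2048 bytes).
tx_322 = [6.21, 6.21, 6.21, 9.31, 15.52, 27.93, 52.75, 102.4, 201.7, 400.29]
rx_322 = [62.06, 65.16, 71.37, 83.78, 108.61, 161.36, 263.76, 468.56, 878.16, 1694.25]
rtt_322 = [68.27, 71.37, 77.58, 93.09, 124.13, 189.29, 316.51, 570.96, 1079.86, 2094.54]


def rtt_matches_sum(tx, rx, rtt, tol_ns=0.011):
    """Return True when every RTT value equals Tx + Rx within tol_ns.

    The tolerance allows for the two-decimal rounding of the published values.
    """
    return all(abs(t + r - total) <= tol_ns for t, r, total in zip(tx, rx, rtt))


consistent = rtt_matches_sum(tx_322, rx_322, rtt_322)
print("RTT equals Tx + Rx for every payload size:", consistent)
```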

10G Tx

Table 5 Enyx 10G TCP ULL TX Latency table

| TCP Payload Size (Bytes) | 1 | 8 | 16 | 32 | 64 | 128 | 256 | 512 | 1024 | 2048 |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 350 MHz - No Tx Cut-Through (ns) | 5.71 | 5.71 | 5.71 | 8.57 | 14.29 | 25.71 | 48.57 | 94.29 | 185.71 | 368.57 |
| 322 MHz - No Tx Cut-Through (ns) | 6.21 | 6.21 | 6.21 | 9.31 | 15.52 | 27.93 | 52.75 | 102.4 | 201.7 | 400.29 |

_images/TCP_10G_TX_Latency.svg

Figure 3 Enyx 10G TCP ULL TX Latency diagram

10G Tx Cut-Through

Table 6 Enyx 10G TCP ULL TX Cut-Through Latency table

| TCP Payload Size (Bytes) | 1 | 8 | 16 | 32 | 64 | 128 | 256 | 512 | 1024 | 2048 |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 350 MHz - With Tx Cut-Through (ns) | 5.71 | 5.71 | 5.71 | 5.71 | 5.71 | 5.71 | 5.71 | 5.71 | 5.71 | 5.71 |
| 322 MHz - With Tx Cut-Through (ns) | 6.21 | 6.21 | 6.21 | 6.21 | 6.21 | 6.21 | 6.21 | 6.21 | 6.21 | 6.21 |

_images/TCP_10G_TX_Latency_CutThrough.svg

Figure 4 Enyx 10G TCP ULL TX Cut-Through Latency diagram

10G Rx

Note

Latency measurements are provided from input start-of-packet (SOP) to output start-of-packet (SOP). As the input packet contains the Ethernet/IP/TCP headers and the output packet does not, the latencies provided include incompressible protocol header deserialisation.

The following formula outlines the computation of this protocol header deserialisation time:

  • Clock period * ( CEIL[ (Protocol header size in bytes + Payload size in bytes) / (Number of bytes per clock cycle) ] - 1 + Enyx 10G TCP ULL clk cycles )

In the case of the Enyx 10G TCP ULL core with a 32-bit data width running at a 322 MHz clock:

  • 3.1 * ( CEIL [ (54 + 1) / (4) ] - 1 + Enyx 10G TCP ULL clk cycles )

  • = 3.1 * ( 13 + Enyx 10G TCP ULL clk cycles )

  • = 40.3 ns + (3.1 * Enyx 10G TCP ULL clk cycles )

As a result, the below latency measurements include 40.3 ns of Ethernet/IP/TCP header deserialization.
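The formula above can be sketched as a small function. This is only an illustration of the arithmetic in the note: the function name is hypothetical, the 3.1 ns period is the rounded 322 MHz clock period used in the note, and the stack's own clock cycles are left out so that only the header deserialization part is computed.

```python
import math


def header_deserialization_ns(clock_period_ns: float,
                              header_bytes: int,
                              payload_bytes: int,
                              bytes_per_cycle: int) -> float:
    """Deserialization time of the formula, without the stack's own clock cycles."""
    cycles = math.ceil((header_bytes + payload_bytes) / bytes_per_cycle) - 1
    return clock_period_ns * cycles


# Values from the note: 54 header bytes, 1 payload byte, 32-bit width (4 bytes per cycle).
deser_ns = header_deserialization_ns(3.1, 54, 1, 4)
print(f"header deserialization: {deser_ns:.1f} ns")
```

The result reproduces the 40.3 ns figure: CEIL(55 / 4) - 1 gives 13 clock cycles of 3.1 ns each.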

Table 7 Enyx 10G TCP ULL RX Latency table

| TCP Payload Size (Bytes) | 1 | 8 | 16 | 32 | 64 | 128 | 256 | 512 | 1024 | 2048 |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 350 MHz - No Rx Cut-Through (ns) | 60.0 | 62.86 | 71.43 | 82.86 | 108.57 | 160.0 | 262.86 | 465.71 | 877.14 | 1694.28 |
| 322 MHz - No Rx Cut-Through (ns) | 62.06 | 65.16 | 71.37 | 83.78 | 108.61 | 161.36 | 263.76 | 468.56 | 878.16 | 1694.25 |

_images/TCP_10G_RX_Latency.svg

Figure 5 Enyx 10G TCP ULL RX Latency diagram

10G Rx Cut-Through

Note

Latency measurements are provided from input start-of-packet (SOP) to output start-of-packet (SOP). As the input packet contains the Ethernet/IP/TCP headers and the output packet does not, the latencies provided include incompressible protocol header deserialisation.

The following formula outlines the computation of this protocol header deserialisation time:

  • Clock period * ( CEIL[ (Protocol header size in bytes + Payload size in bytes) / (Number of bytes per clock cycle) ] - 1 + Enyx 10G TCP ULL clk cycles )

In the case of the Enyx 10G TCP ULL core with a 32-bit data width running at a 322 MHz clock:

  • 3.1 * ( CEIL [ (54 + 1) / (4) ] - 1 + Enyx 10G TCP ULL clk cycles )

  • = 3.1 * ( 13 + Enyx 10G TCP ULL clk cycles )

  • = 40.3 ns + (3.1 * Enyx 10G TCP ULL clk cycles )

As a result, the below latency measurements include 40.3 ns of Ethernet/IP/TCP header deserialization.

Table 8 Enyx 10G TCP ULL RX Cut-Through Latency table

| TCP Payload Size (Bytes) | 1 | 8 | 16 | 32 | 64 | 128 | 256 | 512 | 1024 | 2048 |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 350 MHz - With Rx Cut-Through (ns) | 45.71 | 48.57 | 48.57 | 48.57 | 48.57 | 48.57 | 48.57 | 48.57 | 48.57 | 48.57 |
| 322 MHz - With Rx Cut-Through (ns) | 46.54 | 49.65 | 49.65 | 49.65 | 49.65 | 49.65 | 49.65 | 49.65 | 49.65 | 49.65 |

_images/TCP_10G_RX_Latency_CutThrough.svg

Figure 6 Enyx 10G TCP ULL RX Cut-Through Latency diagram
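For the 1-byte case worked in the note of this section, the formula can be solved for the remaining "Enyx 10G TCP ULL clk cycles" term using the 322 MHz entry of Table 8 (46.54 ns). The sketch below does this under the same rounded 3.1 ns clock period; the function and constant names are hypothetical, and the result is an estimate derived from the published figures, not a specification of the core.

```python
import math

CLOCK_PERIOD_NS = 3.1   # 322 MHz period, rounded as in the note
HEADER_BYTES = 54       # Ethernet/IP/TCP header size used in the note
BYTES_PER_CYCLE = 4     # 32-bit data width


def residual_stack_cycles(measured_ns: float, payload_bytes: int) -> float:
    """Clock cycles left once the header deserialization cycles are removed."""
    deser_cycles = math.ceil((HEADER_BYTES + payload_bytes) / BYTES_PER_CYCLE) - 1
    return measured_ns / CLOCK_PERIOD_NS - deser_cycles


# 1-byte payload, 322 MHz, Rx Cut-Through latency taken from Table 8.
cycles = residual_stack_cycles(46.54, 1)
print(f"estimated stack cycles: {cycles:.2f}")
```

Under these assumptions the remainder is close to 2 clock cycles, the small excess coming from the rounding of the clock period.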

Bandwidth Testing

In order to measure the bandwidth of the Enyx 10G TCP ULL Stack, the following test is performed:

  • For Rx bandwidth:

    • 100 packets are sent from a MAC core to the Enyx 10G TCP ULL stack for each packet size from 1 to 512 Bytes with a step of 16 Bytes.

    • The packets are sent at the maximum possible data rate (for the 10G Rx bandwidth test, the packets are sent at 10 Gbps).

    • If the Enyx 10G TCP ULL stack drops any of the 100 packets, the test for this packet size fails and is repeated at a lower bandwidth.

    • As soon as the Enyx 10G TCP ULL stack drops none of the packets, the test for this packet size succeeds and the current bandwidth is recorded.

    _images/TCP_perfs_bandwidth_rx.svg

    Figure 7 Enyx 10G TCP ULL RX Bandwidth Testing Scenario

  • For Tx bandwidth:

    • 100 packets are pushed to the Enyx 10G TCP ULL stack on the TCP Tx User Interface for each packet size from 1 to 512 Bytes with a step of 16 Bytes.

    • The packets are sent at the maximum possible data rate (TCP_USR_DATA_WIDTH * CLOCK).

    • The output of the Enyx 10G TCP ULL stack to the MAC goes to a bandwidth limiter module, whose sole purpose is to limit its own bandwidth capability to the selected Ethernet speed (for the 10G Tx bandwidth test, the bandwidth limiter module operates at a maximum speed of 10 Gbps).

    • The Tx bandwidth is measured directly between the Enyx 10G TCP ULL stack and the bandwidth limiter module.

    _images/TCP_perfs_bandwidth_tx.svg

    Figure 8 Enyx 10G TCP ULL TX Bandwidth Testing Scenario
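The Rx procedure described above (send 100 packets, retry at a lower rate on any drop, record the first rate with no drop) can be sketched as a simple search loop. This is a minimal sketch under stated assumptions: `send_burst` is a hypothetical callback standing in for the simulation test bench, and the 100 Mbit/s step is arbitrary, since the report does not state by how much the bandwidth is reduced between attempts.

```python
def find_max_lossless_rate(send_burst, start_mbps=10_000, step_mbps=100, packets=100):
    """Return the highest tested rate at which no packet of the burst is dropped.

    send_burst(rate_mbps, packets) is a hypothetical hook that sends the burst
    at the given rate and returns the number of packets dropped by the stack.
    """
    rate = start_mbps
    while rate > 0:
        if send_burst(rate, packets) == 0:
            return rate          # test succeeds: record the current bandwidth
        rate -= step_mbps        # test fails: repeat at a lower bandwidth
    return 0


# Toy stand-in for the device under test: drops packets above 9500 Mbit/s.
def fake_dut(rate_mbps, packets):
    return packets if rate_mbps > 9_500 else 0


print(find_max_lossless_rate(fake_dut))
```

In the report every packet size reaches 10000 Mbit/s, which in this sketch corresponds to the loop returning on its first iteration.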

Bandwidth Summary

Table 9 Enyx 10G TCP ULL Tx Bandwidth Summary table

| TCP Payload Size (Bytes) | 1 | 16 | 32 | 64 | 128 | 256 | 512 |
| --- | --- | --- | --- | --- | --- | --- | --- |
| 10G Tx - 350 MHz (Mbit/s) | 10000 | 10000 | 10000 | 10000 | 10000 | 10000 | 10000 |
| 10G Tx Cut-Through - 350 MHz (Mbit/s) | 10000 | 10000 | 10000 | 10000 | 10000 | 10000 | 10000 |
| 10G Tx - 322 MHz (Mbit/s) | 10000 | 10000 | 10000 | 10000 | 10000 | 10000 | 10000 |
| 10G Tx Cut-Through - 322 MHz (Mbit/s) | 10000 | 10000 | 10000 | 10000 | 10000 | 10000 | 10000 |

Table 10 Enyx 10G TCP ULL Rx Bandwidth Summary table

| TCP Payload Size (Bytes) | 1 | 16 | 32 | 64 | 128 | 256 | 512 |
| --- | --- | --- | --- | --- | --- | --- | --- |
| 10G Rx - 350 MHz (Mbit/s) | 10000 | 10000 | 10000 | 10000 | 10000 | 10000 | 10000 |
| 10G Rx Cut-Through - 350 MHz (Mbit/s) | 10000 | 10000 | 10000 | 10000 | 10000 | 10000 | 10000 |
| 10G Rx - 322 MHz (Mbit/s) | 10000 | 10000 | 10000 | 10000 | 10000 | 10000 | 10000 |
| 10G Rx Cut-Through - 322 MHz (Mbit/s) | 10000 | 10000 | 10000 | 10000 | 10000 | 10000 | 10000 |

10G Tx

Table 11 Enyx 10G TCP ULL TX Bandwidth table

| TCP Payload Size (Bytes) | 1 | 16 | 32 | 64 | 128 | 256 | 512 |
| --- | --- | --- | --- | --- | --- | --- | --- |
| 350 MHz - No Tx Cut-Through (Mbit/s) | 10000 | 10000 | 10000 | 10000 | 10000 | 10000 | 10000 |
| 322 MHz - No Tx Cut-Through (Mbit/s) | 10000 | 10000 | 10000 | 10000 | 10000 | 10000 | 10000 |

_images/TCP_10G_TX_Bandwidth.svg

Figure 9 Enyx 10G TCP ULL TX Bandwidth diagram

10G Tx Cut-Through

Table 12 Enyx 10G TCP ULL TX Cut-Through Bandwidth table

| TCP Payload Size (Bytes) | 1 | 16 | 32 | 64 | 128 | 256 | 512 |
| --- | --- | --- | --- | --- | --- | --- | --- |
| 350 MHz - With Tx Cut-Through (Mbit/s) | 10000 | 10000 | 10000 | 10000 | 10000 | 10000 | 10000 |
| 322 MHz - With Tx Cut-Through (Mbit/s) | 10000 | 10000 | 10000 | 10000 | 10000 | 10000 | 10000 |

_images/TCP_10G_TX_Bandwidth_CutThrough.svg

Figure 10 Enyx 10G TCP ULL TX Cut-Through Bandwidth diagram

10G Rx

Table 13 Enyx 10G TCP ULL RX Bandwidth table

| TCP Payload Size (Bytes) | 1 | 16 | 32 | 64 | 128 | 256 | 512 |
| --- | --- | --- | --- | --- | --- | --- | --- |
| 350 MHz - No Rx Cut-Through (Mbit/s) | 10000 | 10000 | 10000 | 10000 | 10000 | 10000 | 10000 |
| 322 MHz - No Rx Cut-Through (Mbit/s) | 10000 | 10000 | 10000 | 10000 | 10000 | 10000 | 10000 |

_images/TCP_10G_RX_Bandwidth.svg

Figure 11 Enyx 10G TCP ULL RX Bandwidth diagram

10G Rx Cut-Through

Table 14 Enyx 10G TCP ULL RX Cut-Through Bandwidth table

| TCP Payload Size (Bytes) | 1 | 16 | 32 | 64 | 128 | 256 | 512 |
| --- | --- | --- | --- | --- | --- | --- | --- |
| 350 MHz - With Rx Cut-Through (Mbit/s) | 10000 | 10000 | 10000 | 10000 | 10000 | 10000 | 10000 |
| 322 MHz - With Rx Cut-Through (Mbit/s) | 10000 | 10000 | 10000 | 10000 | 10000 | 10000 | 10000 |

_images/TCP_10G_RX_Bandwidth_CutThrough.svg

Figure 12 Enyx 10G TCP ULL RX Cut-Through Bandwidth diagram

Constraints applied to FPGA compiler tools for Resources and Working frequencies estimations

Hereafter are the constraints that are provided to the default FPGA compiler tools for the Resources and Working frequencies estimations.

Virtex UltraScale+ (-2 speed grade) targets

################ FPGA ####################
set_property part xcvu9p-flgb2104-2-e [current_project]

# Parse the Vivado version string; the fanout limit is only applied to releases before 2020.
regexp {Vivado v(\d+)\.(\d).*SW Build (\d+).*IP Build (\d+)} [version] matched major minor sw_build ip_build
if {$major < 2020} {set_property STEPS.SYNTH_DESIGN.ARGS.FANOUT_LIMIT 400 [get_runs synth_*]}

set_property strategy Flow_PerfOptimized_high [get_runs synth_1]
set_property STEPS.SYNTH_DESIGN.ARGS.ASSERT true [get_runs synth_1]

#set_property STEPS.SYNTH_DESIGN.ARGS.FSM_EXTRACTION one_hot [get_runs synth_*]
#set_property STEPS.SYNTH_DESIGN.ARGS.KEEP_EQUIVALENT_REGISTERS true [get_runs synth_*]
#set_property STEPS.SYNTH_DESIGN.ARGS.RESOURCE_SHARING off [get_runs synth_*]
#set_property STEPS.SYNTH_DESIGN.ARGS.NO_LC true [get_runs synth_*]
#set_property STEPS.SYNTH_DESIGN.ARGS.SHREG_MIN_SIZE 5 [get_runs synth_*]

Virtex UltraScale+ (-3 speed grade) targets

################ FPGA ####################
set_property part xcvu9p-flgb2104-3-e [current_project]

# Parse the Vivado version string; the fanout limit is only applied to releases before 2020.
regexp {Vivado v(\d+)\.(\d).*SW Build (\d+).*IP Build (\d+)} [version] matched major minor sw_build ip_build
if {$major < 2020} {set_property STEPS.SYNTH_DESIGN.ARGS.FANOUT_LIMIT 400 [get_runs synth_*]}

set_property STEPS.SYNTH_DESIGN.ARGS.FSM_EXTRACTION one_hot [get_runs synth_*]
set_property STEPS.SYNTH_DESIGN.ARGS.KEEP_EQUIVALENT_REGISTERS true [get_runs synth_*]
set_property STEPS.SYNTH_DESIGN.ARGS.RESOURCE_SHARING off [get_runs synth_*]
set_property STEPS.SYNTH_DESIGN.ARGS.NO_LC true [get_runs synth_*]
set_property STEPS.SYNTH_DESIGN.ARGS.SHREG_MIN_SIZE 5 [get_runs synth_*]

Generic parameters applied to the Enyx 10G TCP ULL stack

The generic parameters used for the Enyx 10G TCP ULL stack throughout all Resources, Latency and Bandwidth tests are listed below:

  • Fixed parameters:

    • DEBUG_MODE_EN=0

    • USER2TCP_DATA_WIDTH=64

    • TCP2USER_DATA_WIDTH=64

    • EMI2USER_DATA_WIDTH=64

    • MM_ADDR_WIDTH=12

    • MTU=1500

    • VLAN_COUNT=1

    • MAC_ADDRESS_COUNT=1

    • VIRTUAL_INTERFACES_COUNT=1

    • PEER_IPV4_ADDRESS_COUNT=16

    • RX_FIFO_PACKET_COUNT=3

    • RX_REORDERING_EN=0

    • RX_OOS_SEQNUM_EN=0

    • TX_DROP_IF_NOT_ESTABLISHED_EN=0

    • TX_CONGESTION_CONTROL_EN=0

    • TX_PUSH_BIT_VALUE=0

    • EMI_STATUS_EN=0

    • EMI_CREDIT_EN=0

    • INSTANT_ACK_EN=0

    • ARP_SERVER_EN=1

    • ARP_TABLE_ENTRY_COUNT=32

    • ARP_GRATUITOUS_REFRESH_EN=1

    • ICMP_SERVER_EN=1

    • RX_REORDERING_MEM_ADDR_WIDTH=15

    • RX_REORDERING_MEM_DATA_WIDTH=128

    • RX_REORDERING_MEM_MASK_WIDTH=16

    • RX_REORDERING_MEM_INTERNAL_RAM_TYPE=AUTO

    • RX_REORDERING_MEM_INTERNAL_LATENCY=6

    • TX_RETRANSMIT_MEM_ADDR_WIDTH=16

    • TX_RETRANSMIT_MEM_DATA_WIDTH=128

    • TX_RETRANSMIT_MEM_MASK_WIDTH=16

    • TX_RETRANSMIT_MEM_EXTERNAL_EN=1

    • TX_RETRANSMIT_MEM_EXTERNAL_FULL_DUPLEX_EN=1

    • TX_RETRANSMIT_MEM_EXTERNAL_LATENCY=2

    • TX_RETRANSMIT_MEM_INTERNAL_RAM_TYPE=AUTO

    • TX_RETRANSMIT_MEM_INTERNAL_LATENCY=6

    • TX_RETRANSMIT_MEM_INTERNAL_DUAL_CLOCK_EN=1

    • RX_OUTPUT_PIPE_COUNT=0

    • TX_OUTPUT_PIPE_COUNT=0

  • Parameters that may be changed per test:

    • MAC2TCP_DATA_WIDTH=32 for all latency and bandwidth tests, set to explicit values for the resources tests.

    • TCP2MAC_DATA_WIDTH=32 for all latency and bandwidth tests, set to explicit values for the resources tests.

    • SESSION_COUNT=16 for all latency and bandwidth tests, set to explicit values for the resources tests.

    • RX_CHECKSUM_VERIFICATION_EN=0 for the specific Rx cut-through latency tests, 1 for all other tests.
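The per-test parameters above can be pictured as a set of defaults with one override. The sketch below is purely illustrative: the dictionary layout, the function name and the test label "rx_cut_through_latency" are hypothetical and do not reflect how the Enyx test bench is actually configured. The resources tests, which set the widths and session count to explicit values, are not modelled here.

```python
# Defaults used for the latency and bandwidth tests, as listed above.
LATENCY_BANDWIDTH_DEFAULTS = {
    "MAC2TCP_DATA_WIDTH": 32,
    "TCP2MAC_DATA_WIDTH": 32,
    "SESSION_COUNT": 16,
    "RX_CHECKSUM_VERIFICATION_EN": 1,
}


def generics_for(test_name: str) -> dict:
    """Return the per-test generics; test_name is a hypothetical label."""
    params = dict(LATENCY_BANDWIDTH_DEFAULTS)
    if test_name == "rx_cut_through_latency":
        # Checksum verification is disabled only for the Rx cut-through latency tests.
        params["RX_CHECKSUM_VERIFICATION_EN"] = 0
    return params


print(generics_for("rx_cut_through_latency"))
```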