Enyx 10G MAC/PCS ULL Performance Report

Version

2.3.0

This document was automatically generated and includes measurements resulting from behavioral simulations of the Enyx 10G MAC/PCS ULL core.

Latency Testing

The latency of the Enyx 10G MAC/PCS ULL core is measured according to the following methodology:

  • A packet generator, hosted in the FPGA user logic, generates payloads of a size ranging from 1 to 64 bytes with random data content.

  • The generated payloads are forwarded to the Media Access Control (MAC) mac_tx interface where they are time-stamped at Start of Packet (SoP).

  • The resulting Ethernet packets (Ethernet header + generated payload) are sent through the FPGA vendor pma_tx simulation model, looped back, and returned through the FPGA vendor pma_rx simulation model.

  • The looped-back Ethernet packets are forwarded to the MAC, which outputs two events through the mac_rx interface, both time-stamped by the user logic:

    • The Start of Frame (SoF) signal, raised as soon as the MAC receives the Ethernet preamble.

    • The looped-back payload, forwarded to the user logic via the mac_rx interface.

  • These two receipt timestamps and the transmission timestamp are then used to calculate the following two latency metrics for each payload size:

    • A SoP-to-SoF latency

    • A SoP-to-SoP latency

_images/MACPCS_perfs.svg

Figure 1 Enyx 10G MAC/PCS ULL Latency Testing Scenario
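The two metrics described above can be sketched as simple timestamp differences. This is only an illustration of the arithmetic: the class, field and function names are hypothetical, and the timestamps in the usage lines are made-up values, not measurements from this report.

```python
from dataclasses import dataclass


@dataclass
class LoopbackTimestamps:
    """Timestamps (in ns) captured by the user logic for one looped-back packet."""
    tx_sop_ns: float  # Start of Packet seen on the mac_tx interface
    rx_sof_ns: float  # Start of Frame raised on the mac_rx interface
    rx_sop_ns: float  # Start of the payload forwarded on the mac_rx interface


def latency_metrics(ts: LoopbackTimestamps) -> dict:
    """Return the SoP-to-SoF and SoP-to-SoP latencies for one packet."""
    return {
        "sop_to_sof_ns": ts.rx_sof_ns - ts.tx_sop_ns,
        "sop_to_sop_ns": ts.rx_sop_ns - ts.tx_sop_ns,
    }


def min_max(values):
    """Minimum and maximum over all samples (e.g. over all payload sizes)."""
    values = list(values)
    return min(values), max(values)


if __name__ == "__main__":
    # Made-up timestamps, for illustration only.
    samples = [
        LoopbackTimestamps(tx_sop_ns=100.0, rx_sof_ns=130.0, rx_sop_ns=136.0),
        LoopbackTimestamps(tx_sop_ns=200.0, rx_sof_ns=232.0, rx_sop_ns=240.0),
    ]
    metrics = [latency_metrics(s) for s in samples]
    print(min_max(m["sop_to_sof_ns"] for m in metrics))
    print(min_max(m["sop_to_sop_ns"] for m in metrics))
```

The minimum and maximum columns of the latency tables correspond to this kind of reduction over the tested payload sizes.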

Virtex UltraScale+ targets

The following measurements include the latency of Xilinx Virtex UltraScale+ GTY transceivers.

Table 1 Xilinx UltraScale+ GTY Transceiver latency

Item       Absolute Minimum
---------  ----------------------------------
RX Path    133 + 0.7 * (10 - 2) UI = 13.44 ns
TX Path    141 + 0.3 * (10 - 2) UI = 13.90 ns
Total RTT  282 UI = 27.345 ns

Note

The GTY transceiver RX and TX latencies used in these measurements are the ones detailed by Xilinx in this document

The GTY transceivers are configured as follows: 32-bit internal data width, running frequency 322.265625 MHz, line rate 10.3125 Gbps.

Use the RX and TX lines under “Total – absolute minimum”.

Add “TX Note 2) 1)” to the TX path.

Add “RX Note 3) 1)” to the RX path.

The lines can be added together to get the RTT in UI.

1 UI = 1/10.3125 ns
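As a quick check of the conversion, the UI expressions in Table 1 can be evaluated and converted to nanoseconds. This is a minimal sketch; the constant and function names are illustrative, and the expressions are copied as written in the table.

```python
# Line rate of the link; the note gives 1 UI = 1/10.3125 ns.
LINE_RATE_GBPS = 10.3125


def ui_to_ns(ui: float) -> float:
    """Convert a duration in unit intervals (UI) to nanoseconds."""
    return ui / LINE_RATE_GBPS


# Expressions copied as written in Table 1.
RX_PATH_UI = 133 + 0.7 * (10 - 2)
TX_PATH_UI = 141 + 0.3 * (10 - 2)
TOTAL_RTT_UI = RX_PATH_UI + TX_PATH_UI

if __name__ == "__main__":
    for label, ui in (("RX Path", RX_PATH_UI),
                      ("TX Path", TX_PATH_UI),
                      ("Total RTT", TOTAL_RTT_UI)):
        print(f"{label}: {ui:.1f} UI = {ui_to_ns(ui):.3f} ns")
```

The results agree with Table 1 to within the rounding used there.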

Table 2 Enyx 10G MAC/PCS ULL RTT Latency for Virtex UltraScale+ target - SoP-to-SoP

Configuration                                              Minimum (ns)  Maximum (ns)
---------------------------------------------------------  ------------  ------------
No FIFO - DATA_WIDTH = 32b - 322 MHz PMA Rx and Tx clocks  35.42         41.64
Rx FIFO - DATA_WIDTH = 32b - 322 MHz PMA Tx clock          46.54         52.75
Tx & Rx FIFO - DATA_WIDTH = 32b - 350 MHz user clock       57.14         62.86

Table 3 Enyx 10G MAC/PCS ULL RTT Latency for Virtex UltraScale+ target - SoP-to-SoF

Configuration                                              Minimum (ns)  Maximum (ns)
---------------------------------------------------------  ------------  ------------
No FIFO - DATA_WIDTH = 32b - 322 MHz PMA Rx and Tx clocks  29.22         35.43
Rx FIFO - DATA_WIDTH = 32b - 322 MHz PMA Tx clock          37.24         43.44
Tx & Rx FIFO - DATA_WIDTH = 32b - 350 MHz user clock       45.71         54.29

_images/MACPCS_10G_vusp_Latency.svg

Figure 2 Enyx 10G MAC/PCS ULL RTT Latency diagram for Virtex UltraScale+ targets - SoP-to-SoP
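Because the reported round-trip figures include the GTY transceivers, a rough estimate of the remaining latency can be obtained by subtracting the Table 1 total. The sketch below does this for the SoP-to-SoP values of Table 2. It assumes the transceiver contribution equals the absolute minimum of 282 UI; the names are illustrative and the result is an estimate, not a figure published in this report.

```python
# Transceiver round trip from Table 1: 282 UI at 10.3125 Gbps.
GTY_RTT_NS = 282 / 10.3125

# SoP-to-SoP (minimum, maximum) in ns, copied from Table 2.
SOP_TO_SOP_NS = {
    "No FIFO": (35.42, 41.64),
    "Rx FIFO": (46.54, 52.75),
    "Tx & Rx FIFO": (57.14, 62.86),
}


def without_transceivers(measured_ns: float, gty_rtt_ns: float = GTY_RTT_NS) -> float:
    """Estimate the round-trip latency left once the GTY share is removed."""
    return measured_ns - gty_rtt_ns


if __name__ == "__main__":
    for scenario, (lo, hi) in SOP_TO_SOP_NS.items():
        print(f"{scenario}: {without_transceivers(lo):.2f} ns "
              f"to {without_transceivers(hi):.2f} ns")
```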

Resources & Frequencies

The table below shows the Enyx 10G MAC/PCS ULL resource usage along with the timing margins. It does not include the logic used by the transceivers (PMA).

The MAC Rx, MAC Tx, PCS Rx and PCS Tx blocks all run at 322.265625 MHz, each on its respective PMA clock.

  • In the No FIFO scenario, the user clock is set to the respective PMA clocks so everything is running at 322.265625 MHz.

  • In the Rx FIFO scenario, both user clocks are set to the pma_tx_clk so everything is running at 322.265625 MHz.

  • In the Tx & Rx FIFO scenario, the user clock is set to 350 MHz; the MAC user Tx and Rx interfaces are therefore connected to the Dual Clock FIFOs.
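The 322.265625 MHz figure is consistent with the 10.3125 Gbps line rate divided by the 32-bit data width. The sketch below shows that relation and the clock periods of the two user clock options; the function names are illustrative.

```python
def pma_clock_mhz(line_rate_gbps: float, data_width_bits: int) -> float:
    """Parallel clock frequency for a given line rate and data width."""
    return line_rate_gbps * 1000.0 / data_width_bits


def period_ns(freq_mhz: float) -> float:
    """Clock period in nanoseconds for a frequency in MHz."""
    return 1000.0 / freq_mhz


if __name__ == "__main__":
    pma = pma_clock_mhz(10.3125, 32)
    print(f"PMA clock: {pma} MHz, period {period_ns(pma):.3f} ns")
    print(f"350 MHz user clock: period {period_ns(350.0):.3f} ns")
```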

Table 4 Resources and timing summary.

Family      Device  Speed Grade  Scenario      Data width  Slack (ns)  Logic  %  Registers  %  Memory  %
----------  ------  -----------  ------------  ----------  ----------  -----  -  ---------  -  ------  -
Virtex US+  VU9P    2            No FIFO       32b         0.298       1523   1  1245       1  0       0
Virtex US+  VU9P    2            Rx FIFO       32b         0.133       1670   1  1406       1  0       0
Virtex US+  VU9P    2            Tx & Rx FIFO  32b         0.166       1762   1  1556       1  0       0
Virtex US+  VU9P    3            Rx FIFO       32b         0.192       1646   1  1407       1  0       0
Virtex US+  VU9P    3            Tx & Rx FIFO  32b         0.22        1739   1  1556       1  0       0

Note

  • Logic is expressed in ALM for Intel and in LUT for Xilinx

  • Memory, i.e. block memory usage, is expressed in M20K blocks for Intel and in 36 Kb blocks for Xilinx

  • These resource utilization and slack metrics are averages of the values obtained from 10 synthesis and place-and-route runs.

The constraints applied to the FPGA toolchain are provided in this section: FPGA compiler constraints.

The configuration of the Enyx 10G MAC/PCS ULL used for these tests is detailed in the section: Enyx 10G MAC/PCS ULL core configuration.

Testing scenarios description

The performance of the Enyx 10G MAC/PCS ULL core is measured in three configurations, which differ in the clock used to drive the user logic connected to the MAC.

No FIFO configuration

  • DUAL_CLOCK_FIFO_RX_ENABLE must be set to false

  • DUAL_CLOCK_FIFO_TX_ENABLE must be set to false

_images/enyx_macpcs_10g_nofifo.svg

Figure 3 Enyx 10G MAC/PCS ULL architecture with no Dual clock FIFOs: ‘No FIFO’

Rx FIFO configuration

  • DUAL_CLOCK_FIFO_RX_ENABLE must be set to true

  • DUAL_CLOCK_FIFO_TX_ENABLE must be set to false

_images/enyx_macpcs_10g_notx.svg

Figure 4 Enyx 10G MAC/PCS ULL architecture with only the Rx Dual clock FIFO: ‘Rx FIFO’

Note

Tx FIFO and Rx FIFO configurations and results are similar. Tx FIFO configuration is therefore not tested.

Tx & Rx FIFO configuration

  • DUAL_CLOCK_FIFO_RX_ENABLE must be set to true

  • DUAL_CLOCK_FIFO_TX_ENABLE must be set to true

_images/enyx_macpcs_10g.svg

Figure 5 Enyx 10G MAC/PCS ULL architecture with both Tx and Rx Dual clock FIFOs: ‘Tx & Rx FIFO’
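The three tested scenarios differ only in the two FIFO enable parameters, which can be summarized as a small lookup. This is an illustrative sketch: the dictionary and function are hypothetical helpers, not part of the core's deliverables.

```python
# Scenario name -> FIFO enable parameters, as listed in this section.
SCENARIOS = {
    "No FIFO": {"DUAL_CLOCK_FIFO_RX_ENABLE": False, "DUAL_CLOCK_FIFO_TX_ENABLE": False},
    "Rx FIFO": {"DUAL_CLOCK_FIFO_RX_ENABLE": True, "DUAL_CLOCK_FIFO_TX_ENABLE": False},
    "Tx & Rx FIFO": {"DUAL_CLOCK_FIFO_RX_ENABLE": True, "DUAL_CLOCK_FIFO_TX_ENABLE": True},
}


def scenario_name(rx_enable: bool, tx_enable: bool):
    """Return the tested scenario matching the two flags, or None if untested."""
    for name, flags in SCENARIOS.items():
        if (flags["DUAL_CLOCK_FIFO_RX_ENABLE"] == rx_enable
                and flags["DUAL_CLOCK_FIFO_TX_ENABLE"] == tx_enable):
            return name
    # The Tx-FIFO-only combination is not tested in this report.
    return None


if __name__ == "__main__":
    print(scenario_name(True, False))
    print(scenario_name(False, True))
```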

FPGA compiler constraints

Below are the constraints provided to the FPGA compiler tools for the resource and operating frequency estimates.

Virtex UltraScale+ (-2 speed grade) targets

################ FPGA ####################
set_property part xcvu9p-flgb2104-2-e [current_project]

regexp {Vivado v(\d+)\.(\d).*SW Build (\d+).*IP Build (\d+)} [version] matched major minor sw_build ip_build
if {$major < 2020} {set_property STEPS.SYNTH_DESIGN.ARGS.FANOUT_LIMIT 400 [get_runs synth_*]}

set_property strategy Flow_PerfOptimized_high [get_runs synth_1]
set_property STEPS.SYNTH_DESIGN.ARGS.ASSERT true [get_runs synth_1]

#set_property STEPS.SYNTH_DESIGN.ARGS.FSM_EXTRACTION one_hot [get_runs synth_*]
#set_property STEPS.SYNTH_DESIGN.ARGS.KEEP_EQUIVALENT_REGISTERS true [get_runs synth_*]
#set_property STEPS.SYNTH_DESIGN.ARGS.RESOURCE_SHARING off [get_runs synth_*]
#set_property STEPS.SYNTH_DESIGN.ARGS.NO_LC true [get_runs synth_*]
#set_property STEPS.SYNTH_DESIGN.ARGS.SHREG_MIN_SIZE 5 [get_runs synth_*]

Virtex UltraScale+ (-3 speed grade) targets

################ FPGA ####################
set_property part xcvu9p-flgb2104-3-e [current_project]

regexp {Vivado v(\d+)\.(\d).*SW Build (\d+).*IP Build (\d+)} [version] matched major minor sw_build ip_build
if {$major < 2020} {set_property STEPS.SYNTH_DESIGN.ARGS.FANOUT_LIMIT 400 [get_runs synth_*]}

set_property STEPS.SYNTH_DESIGN.ARGS.FSM_EXTRACTION one_hot [get_runs synth_*]
set_property STEPS.SYNTH_DESIGN.ARGS.KEEP_EQUIVALENT_REGISTERS true [get_runs synth_*]
set_property STEPS.SYNTH_DESIGN.ARGS.RESOURCE_SHARING off [get_runs synth_*]
set_property STEPS.SYNTH_DESIGN.ARGS.NO_LC true [get_runs synth_*]
set_property STEPS.SYNTH_DESIGN.ARGS.SHREG_MIN_SIZE 5 [get_runs synth_*]

Enyx 10G MAC/PCS ULL core configuration

Below are the generic parameters used for the Enyx 10G MAC/PCS ULL core under test:

  • The PMA is configured with a data width of 32 bits running at 322.265625 MHz.

  • Fixed parameters throughout all scenarios and FPGA targets:

    • PCS_RX_LATENCY_MODE = 0 // Fully asynchronous

    • PCS_TX_LATENCY_MODE = 0 // Fully asynchronous

    • MAC_RX_LATENCY_MODE = 0 // Fully asynchronous

    • MAC_TX_LATENCY_MODE = 0 // Fully asynchronous

    • DUAL_CLOCK_FIFO_RX_DEPTH = 16

    • DUAL_CLOCK_FIFO_TX_DEPTH = 16

    • MAC_RX_DATA_WIDTH = 32

    • MAC_RX_CRC32_LATENCY = 1

    • MAC_RX_CRC32_DISABLE = 0

    • MAC_RX_CRC32_REMOVE = 1

    • MAC_RX_MIN_PKT_LENGTH = 64

    • MAC_RX_ERROR_DELAYED_ENABLE = 0

    • MAC_TX_DATA_WIDTH = 32

    • MAC_TX_CRC32_DISABLE = 0

    • MAC_TX_CRC32_CORRUPTED_ON_ERROR = 0

    • MAC_TX_CRC32_ADD = 1

    • MAC_TX_PADDING_SIZE = 64

    • MAC_TX_PAUSE_LENGTH = 0

    • MAC_TX_IFG_COUNT = 12

  • Parameters specific to each scenario:

    • Dual clock FIFOs are activated or not:

      • DUAL_CLOCK_FIFO_RX_ENABLE = 0 or 1

      • DUAL_CLOCK_FIFO_TX_ENABLE = 0 or 1

    • User clock frequency is set to either 322.265625 MHz (PMA clock) or 350 MHz:

      • MAC_RX_CLK_FREQ_KHZ = 322266 or 350000

      • MAC_TX_CLK_FREQ_KHZ = 322266 or 350000
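The kHz values above match the user clock frequencies rounded to the nearest kHz. The sketch below derives them per scenario. It assumes both parameters take the same value within a scenario, which follows from the scenario descriptions but is not stated as a rule; the helper names are illustrative.

```python
PMA_CLOCK_MHZ = 322.265625
USER_CLOCK_MHZ = 350.0

# User clock per scenario, following the Resources & Frequencies section.
USER_CLOCK_BY_SCENARIO_MHZ = {
    "No FIFO": PMA_CLOCK_MHZ,
    "Rx FIFO": PMA_CLOCK_MHZ,
    "Tx & Rx FIFO": USER_CLOCK_MHZ,
}


def clk_freq_khz(freq_mhz: float) -> int:
    """Frequency in kHz, rounded to the nearest integer."""
    return round(freq_mhz * 1000)


def clock_generics(scenario: str) -> dict:
    """Clock frequency parameters for one scenario (same value for Rx and Tx)."""
    khz = clk_freq_khz(USER_CLOCK_BY_SCENARIO_MHZ[scenario])
    return {"MAC_RX_CLK_FREQ_KHZ": khz, "MAC_TX_CLK_FREQ_KHZ": khz}


if __name__ == "__main__":
    for scenario in USER_CLOCK_BY_SCENARIO_MHZ:
        print(scenario, clock_generics(scenario))
```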