1. Context

This document provides performance measurements of the Enyx HFP DMA engine in multiple configurations.

The measurements are imported automatically from nightly hardware tests of the Enyx HFP DMA engine, in both PCIe Gen2 x8 and Gen3 x8 configurations.

The server used in the test setup is the following:

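The document is split up as follows:

  • Resources & Working frequencies

  • Latency Testing

  • Bandwidth Testing

  • Constraints applied to FPGA compiler tools for Resources and Working frequencies estimations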

2. Resources & Working frequencies

The table below shows HFP implementation results on our supported FPGA families. Each compilation includes only the HFP core, so the reported frequency is the maximum reachable for each configuration. Resource usage is averaged over 10 different runs, and the frequency is the maximum achieved across these 10 runs, on either Vivado 2019.2 or Quartus 16.0 Standard. The parameters applied to the tools are provided in Section 5, Constraints applied to FPGA compiler tools for Resources and Working frequencies estimations.

  • Logic

    • in ALMs for Intel

    • in LUTs for Xilinx

  • Registers

  • Memory as block memory usage

    • M20K blocks for Intel

    • 36Kb block RAMs for Xilinx

Table 2.1 Resources summary for the Enyx HFP core

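| Family | Device | Speed Grade | CPU to Accelerator | Accelerator to CPU | Acc to CPU buffer size | NETIF | NETIF buffer size | Logic | Logic % | Registers | Registers % | Memory | Memory % |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Arria 10 | GX 1150 | 1 | 1 | 1 | 14 | 1 | 14 | 16224 | 4 | 31596 | 2 | 63 | 3 |
| Arria 10 | GX 1150 | 1 | 1 | 1 | 14 | 8 | 14 | 26950 | 7 | 56873 | 4 | 156 | 6 |
| Arria 10 | GX 1150 | 1 | 1 | 8 | 14 | 1 | 14 | 22109 | 6 | 44791 | 3 | 65 | 3 |
| Arria 10 | GX 1150 | 1 | 8 | 1 | 14 | 1 | 14 | 20620 | 5 | 43000 | 3 | 154 | 6 |
| Arria 10 | GX 1150 | 1 | 8 | 8 | 14 | 8 | 14 | 37610 | 9 | 81134 | 5 | 249 | 10 |
| Arria 10 | GX 1150 | 1 | 8 | 8 | 14 | 8 | 20 | 37676 | 9 | 81351 | 5 | 249 | 10 |
| Arria 10 | GX 1150 | 1 | 8 | 8 | 20 | 8 | 14 | 37629 | 9 | 81359 | 5 | 249 | 10 |
| Stratix V | GX A7 | 2 | 1 | 1 | 14 | 1 | 14 | 15798 | 7 | 30771 | 4 | 63 | 3 |
| Stratix V | GX A7 | 2 | 1 | 1 | 14 | 8 | 14 | 26140 | 12 | 55147 | 6 | 156 | 7 |
| Stratix V | GX A7 | 2 | 1 | 8 | 14 | 1 | 14 | 21644 | 10 | 44135 | 5 | 65 | 3 |
| Stratix V | GX A7 | 2 | 8 | 1 | 14 | 1 | 14 | 19807 | 9 | 42565 | 5 | 154 | 7 |
| Stratix V | GX A7 | 2 | 8 | 8 | 14 | 8 | 14 | 36442 | 16 | 81418 | 9 | 249 | 10 |
| Stratix V | GX A7 | 2 | 8 | 8 | 14 | 8 | 20 | 37010 | 16 | 83120 | 9 | 249 | 10 |
| Stratix V | GX A7 | 2 | 8 | 8 | 20 | 8 | 14 | 36994 | 16 | 83097 | 9 | 249 | 10 |
| Virtex US+ | VU9P | 2 | 1 | 1 | 8 | 14 | 14 | 53717 | 5 | 84483 | 4 | 50 | 3 |
| Virtex US+ | VU9P | 2 | 8 | 8 | 8 | 14 | 14 | 69170 | 6 | 111818 | 5 | 64 | 3 |
| Virtex US+ | VU9P | 2 | 8 | 8 | 8 | 14 | 20 | 69124 | 6 | 111984 | 5 | 64 | 3 |
| Virtex US+ | VU9P | 2 | 8 | 8 | 8 | 20 | 14 | 82886 | 8 | 135759 | 6 | 76 | 4 |
| Virtex US+ | VU9P | 3 | 1 | 1 | 8 | 14 | 14 | 53667 | 5 | 84469 | 4 | 50 | 3 |
| Virtex US+ | VU9P | 3 | 8 | 8 | 8 | 14 | 14 | 69135 | 6 | 111818 | 5 | 64 | 3 |
| Virtex US+ | VU9P | 3 | 8 | 8 | 8 | 14 | 20 | 69124 | 6 | 111982 | 5 | 64 | 3 |
| Virtex US+ | VU9P | 3 | 8 | 8 | 8 | 20 | 14 | 82878 | 8 | 135759 | 6 | 76 | 4 |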

A Recommended Frequency is provided for each use case, showing reasonable working frequencies for the Enyx HFP core in a typical scenario, along with the best frequency achieved on the specified target amongst all runs.

Table 2.2 Recommended clock frequency for the Enyx HFP core.

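| Family | Device | Speed Grade | Recommended Freq (MHz) | Best Freq Achieved (MHz) |
| --- | --- | --- | --- | --- |
| Arria 10 | GX 1150 | 1 | 250.0 | 304 |
| Stratix V | GX A7 | 2 | 250.0 | 281 |
| Virtex US+ | VU9P | 2 | 250.0 | 279 |
| Virtex US+ | VU9P | 3 | 250.0 | 283 |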

3. Latency Testing

To measure the latency of the Enyx HFP DMA engine, the following test is performed:

  • Random packets are sent one by one from a pattern generator core to the Enyx HFP DMA engine, with sizes from 1 to 2015 Bytes in steps of 19 Bytes (to cover as many alignment cases as possible).

  • Each packet is timestamped as it exits the pattern generator, at the Start of Packet (SoP).

  • Packets are then sent to the HFP engine, and the software loops them back to the FPGA.

  • As soon as the Start of Packet is detected in the FPGA after the HFP core, a signal is sent to the pattern generator core (the packets themselves are dropped at this point in the FPGA).

  • The pattern generator core compares the two timestamps to compute the HFP engine round-trip time (RTT) latency for each packet size.

  • 1,000,000 packets are sent for each packet size.
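The size sweep used by this test can be sketched in a few lines. This is only an illustration of the arithmetic, not the pattern generator itself: the helper name `sweep_sizes` is hypothetical, and the 64-byte alignment window is an assumption chosen to show why an odd step such as 19 is useful.

```python
def sweep_sizes(first=1, last=2015, step=19):
    """Return the packet sizes (in bytes) visited by the sweep, `last` included."""
    return list(range(first, last + 1, step))


sizes = sweep_sizes()
print(len(sizes), sizes[:4], sizes[-1])

# Assumed 64-byte window: 19 and 64 share no common factor, so a sweep
# with more than 64 steps visits every possible size modulo 64.
residues = {size % 64 for size in sizes}
print(len(residues))
```

Under these assumptions the sweep contains 107 sizes, and the sizes reported in the tables of this document (1, 20, 58, 134, 248, 514 and 1027 Bytes) are all members of it.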

_images/HFP_perfs_latency.svg

Figure 3.1 HFP Latency Testing Scenario

Latencies are all measured from Start-of-Packet (SoP) to Start-of-Packet (SoP).

Only the minimum measured latency on the entire run is presented in the following tables and diagrams.
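A minimal sketch of the timestamp comparison and minimum selection for one packet size follows. The tick-based representation, the function name `min_rtt_ns` and the 4 ns tick period are assumptions made for the example (this document does not state the timestamp clock), and the sample tick values are invented purely to exercise the code.

```python
def min_rtt_ns(samples, tick_ns=4.0):
    """Return the smallest round-trip latency in nanoseconds.

    `samples` is an iterable of (sop_out_tick, sop_back_tick) pairs: the
    timestamp taken at SoP when a packet leaves the pattern generator, and
    the timestamp taken when its SoP is detected again after the HFP core.
    `tick_ns` is an assumed timestamp period, not a documented value.
    """
    deltas = [back - out for out, back in samples]
    if not deltas:
        raise ValueError("no samples")
    if min(deltas) < 0:
        raise ValueError("return timestamp precedes send timestamp")
    return min(deltas) * tick_ns


# Invented tick values for three packets of the same size.
example = [(100, 350), (400, 640), (900, 1175)]
print(min_rtt_ns(example))
```

Reporting the minimum rather than the mean filters out host-side jitter, which is consistent with the tables showing a single value per packet size.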

3.1. Latency Summary

_images/latency_summary.svg

Figure 3.2 Enyx HFP RTT Latency Summary diagram

Table 3.1 Enyx HFP RTT Latency Summary table

| Size (Bytes) | 1 | 20 | 58 | 134 | 248 | 514 | 1027 |
| --- | --- | --- | --- | --- | --- | --- | --- |
| Latency (ns) - VUSP Gen 3x8 | 728.0 | 736.0 | 756.0 | 768.0 | 792.0 | 820.0 | 892.0 |
| Latency (ns) - VUSP Gen 2x8 | 872.0 | 892.0 | 916.0 | 940.0 | 972.0 | 1044.0 | 1188.0 |
| Latency (ns) - A10 GX Gen 3x8 | 864.0 | 868.0 | 880.0 | 896.0 | 916.0 | 944.0 | 1016.0 |
| Latency (ns) - SV GX Gen 2x8 | 996.0 | 1012.0 | 1020.0 | 1044.0 | 1084.0 | 1156.0 | 1300.0 |
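One way to read the summary is as a fixed cost plus a size-dependent cost. The sketch below computes the average latency added per byte between the smallest and largest sizes of Table 3.1. It is plain arithmetic on the published minima, not an additional measurement, and the names `LATENCY_NS` and `added_ns_per_byte` are hypothetical.

```python
SIZES = [1, 20, 58, 134, 248, 514, 1027]

# Minimum RTT latencies in ns, copied from Table 3.1.
LATENCY_NS = {
    "VUSP Gen 3x8": [728, 736, 756, 768, 792, 820, 892],
    "VUSP Gen 2x8": [872, 892, 916, 940, 972, 1044, 1188],
    "A10 GX Gen 3x8": [864, 868, 880, 896, 916, 944, 1016],
    "SV GX Gen 2x8": [996, 1012, 1020, 1044, 1084, 1156, 1300],
}


def added_ns_per_byte(latencies, sizes=SIZES):
    """Average extra latency per byte between the first and last sizes."""
    return (latencies[-1] - latencies[0]) / (sizes[-1] - sizes[0])


for target, latencies in LATENCY_NS.items():
    print(f"{target}: {added_ns_per_byte(latencies):.3f} ns/byte")
```

With these figures the Gen3 x8 targets add roughly 0.15 to 0.16 ns per byte and the Gen2 x8 targets roughly 0.30 to 0.31 ns per byte, about twice as much.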

3.2. Arria 10 GX Targets

3.2.1. Gen3 x8

_images/latency_rxa10lp_gen3_rxtx.svg

Figure 3.3 Enyx HFP RTT Latency diagram for Arria 10 GX target with PCIe Gen3 x8

Table 3.2 Enyx HFP RTT Latency table for Arria 10 GX target with PCIe Gen3 x8

| Size (Bytes) | 1 | 20 | 58 | 134 | 248 | 514 | 1027 |
| --- | --- | --- | --- | --- | --- | --- | --- |
| Latency (ns) - A10 GX Gen 3x8 | 864.0 | 868.0 | 880.0 | 896.0 | 916.0 | 944.0 | 1016.0 |

3.3. Stratix V GX targets

3.3.1. Gen2 x8

_images/latency_fpb1_gen2_rxtx.svg

Figure 3.4 Enyx HFP RTT Latency diagram for Stratix V GX targets with PCIe Gen2 x8

Table 3.3 Enyx HFP RTT Latency table for Stratix V GX target with PCIe Gen2 x8

| Size (Bytes) | 1 | 20 | 58 | 134 | 248 | 514 | 1027 |
| --- | --- | --- | --- | --- | --- | --- | --- |
| Latency (ns) - SV GX Gen 2x8 | 996.0 | 1012.0 | 1020.0 | 1044.0 | 1084.0 | 1156.0 | 1300.0 |

3.4. Virtex UltraScale+ targets

3.4.1. Gen3 x8

_images/latency_fpb2_gen3_rxtx.svg

Figure 3.5 Enyx HFP RTT Latency diagram for Virtex UltraScale+ targets with PCIe Gen3 x8

Table 3.4 Enyx HFP RTT Latency table for Virtex UltraScale+ target with PCIe Gen3 x8

| Size (Bytes) | 1 | 20 | 58 | 134 | 248 | 514 | 1027 |
| --- | --- | --- | --- | --- | --- | --- | --- |
| Latency (ns) - VUSP Gen 3x8 | 728.0 | 736.0 | 756.0 | 768.0 | 792.0 | 820.0 | 892.0 |

3.4.2. Gen2 x8

_images/latency_fpb2_gen2_rxtx.svg

Figure 3.6 Enyx HFP RTT Latency diagram for Virtex UltraScale+ targets with PCIe Gen2 x8

Table 3.5 Enyx HFP RTT Latency table for Virtex UltraScale+ target with PCIe Gen2 x8

| Size (Bytes) | 1 | 20 | 58 | 134 | 248 | 514 | 1027 |
| --- | --- | --- | --- | --- | --- | --- | --- |
| Latency (ns) - VUSP Gen 2x8 | 872.0 | 892.0 | 916.0 | 940.0 | 972.0 | 1044.0 | 1188.0 |

4. Bandwidth Testing

To measure the bandwidth of the Enyx HFP DMA engine, the following test is performed:

  • For each packet size from 1 to 2015 Bytes, in steps of 19 Bytes (to cover as many alignment cases as possible), a software executable sends one million packets to the Enyx HFP DMA engine.

  • The packets are sent at the maximum possible data rate.

  • The hardware logic loops the packets back to the software.

  • The software eventually receives all one million packets.

  • Once the last packet is received, the bandwidth is computed by dividing the amount of data transferred by the total time taken for the transfer.
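The final division can be sketched as follows. The function name `bandwidth_gbps` is hypothetical, and two points are assumptions rather than facts from this document: that only the packet bytes themselves are counted, and that Gbps means 10^9 bits per second. The timing in the usage line is invented for illustration.

```python
def bandwidth_gbps(packet_count, packet_size_bytes, elapsed_seconds):
    """Bandwidth in Gbps (assumed 10**9 bits per second) for a fixed-size transfer."""
    if elapsed_seconds <= 0:
        raise ValueError("elapsed_seconds must be positive")
    total_bits = packet_count * packet_size_bytes * 8
    return total_bits / elapsed_seconds / 1e9


# Invented timing: one million 1000-byte packets looped back in 0.5 s.
print(bandwidth_gbps(1_000_000, 1000, 0.5))
```

Because the clock runs over the whole transfer, the result is an average sustained rate for that packet size rather than a peak rate.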

_images/HFP_perfs_bandwidth.svg

Figure 4.1 HFP Bandwidth Testing Scenario

4.1. Bandwidth Summary

_images/bandwidth_summary.svg

Figure 4.2 Enyx HFP Bandwidth Summary diagram

Table 4.1 Enyx HFP Bandwidth Summary table

| Size (Bytes) | 1 | 20 | 58 | 134 | 248 | 514 | 1027 |
| --- | --- | --- | --- | --- | --- | --- | --- |
| Bandwidth (Gbps) - VUSP Gen 3x8 | 0.07 | 1.41 | 3.77 | 8.56 | 14.92 | 25.47 | 27.5 |
| Bandwidth (Gbps) - VUSP Gen 2x8 | 0.07 | 1.41 | 3.78 | 8.58 | 14.86 | 17.21 | 18.3 |
| Bandwidth (Gbps) - A10 GX Gen 3x8 | 0.07 | 1.36 | 3.65 | 8.27 | 14.43 | 25.55 | 27.5 |
| Bandwidth (Gbps) - SV GX Gen 2x8 | 0.07 | 1.36 | 3.64 | 8.28 | 14.44 | 17.48 | 18.57 |
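The bandwidth figures can also be converted into packet rates, which makes the small-packet behaviour easier to read. The sketch below does this for the VUSP Gen 3x8 row of Table 4.1. It is arithmetic on the published values only, assumes Gbps means 10^9 bits per second, and uses a hypothetical helper name, `packets_per_second`.

```python
SIZES = [1, 20, 58, 134, 248, 514, 1027]

# Bandwidth in Gbps for VUSP Gen 3x8, copied from Table 4.1.
VUSP_GEN3_GBPS = [0.07, 1.41, 3.77, 8.56, 14.92, 25.47, 27.5]


def packets_per_second(gbps, size_bytes):
    """Packet rate implied by a bandwidth figure for a given packet size."""
    return gbps * 1e9 / (size_bytes * 8)


for size, gbps in zip(SIZES, VUSP_GEN3_GBPS):
    rate_mpps = packets_per_second(gbps, size) / 1e6
    print(f"{size:5d} B: {rate_mpps:5.2f} Mpps")
```

For sizes up to 248 Bytes the implied rate stays between roughly 7.5 and 8.8 million packets per second, which suggests that a per-packet cost dominates for small packets, while the larger sizes approach a throughput ceiling instead. The 1-Byte figure is coarse because 0.07 Gbps is rounded to two decimals.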

4.2. Arria 10 GX Targets

4.2.1. Gen3 x8

_images/bandwidth_rxa10lp_gen3.svg

Figure 4.3 Enyx HFP Bandwidth diagram for Arria 10 GX target with PCIe Gen3 x8

Table 4.2 Enyx HFP Bandwidth table for Arria 10 GX target with PCIe Gen3 x8

| Size (Bytes) | 1 | 20 | 58 | 134 | 248 | 514 | 1027 |
| --- | --- | --- | --- | --- | --- | --- | --- |
| Bandwidth (Gbps) - A10 GX Gen 3x8 | 0.07 | 1.36 | 3.65 | 8.27 | 14.43 | 25.55 | 27.5 |

4.3. Stratix V GX targets

4.3.1. Gen2 x8

_images/bandwidth_fpb1_gen2.svg

Figure 4.4 Enyx HFP Bandwidth diagram for Stratix V GX targets with PCIe Gen2 x8

Table 4.3 Enyx HFP Bandwidth table for Stratix V GX target with PCIe Gen2 x8

| Size (Bytes) | 1 | 20 | 58 | 134 | 248 | 514 | 1027 |
| --- | --- | --- | --- | --- | --- | --- | --- |
| Bandwidth (Gbps) - SV GX Gen 2x8 | 0.07 | 1.36 | 3.64 | 8.28 | 14.44 | 17.48 | 18.57 |

4.4. Virtex UltraScale+ targets

4.4.1. Gen3 x8

_images/bandwidth_fpb2_gen3.svg

Figure 4.5 Enyx HFP Bandwidth diagram for Virtex UltraScale+ targets with PCIe Gen3 x8

Table 4.4 Enyx HFP Bandwidth table for Virtex UltraScale+ target with PCIe Gen3 x8

| Size (Bytes) | 1 | 20 | 58 | 134 | 248 | 514 | 1027 |
| --- | --- | --- | --- | --- | --- | --- | --- |
| Bandwidth (Gbps) - VUSP Gen 3x8 | 0.07 | 1.41 | 3.77 | 8.56 | 14.92 | 25.47 | 27.5 |

4.4.2. Gen2 x8

_images/bandwidth_fpb2_gen2.svg

Figure 4.6 Enyx HFP Bandwidth diagram for Virtex UltraScale+ targets with PCIe Gen2 x8

Table 4.5 Enyx HFP Bandwidth table for Virtex UltraScale+ target with PCIe Gen2 x8

| Size (Bytes) | 1 | 20 | 58 | 134 | 248 | 514 | 1027 |
| --- | --- | --- | --- | --- | --- | --- | --- |
| Bandwidth (Gbps) - VUSP Gen 2x8 | 0.07 | 1.41 | 3.78 | 8.58 | 14.86 | 17.21 | 18.3 |

5. Constraints applied to FPGA compiler tools for Resources and Working frequencies estimations

The following constraints are applied to the FPGA compiler tools for the Resources and Working frequencies estimations.

5.1. Stratix V targets

# Extract the version number and the edition from the Quartus version string.
regexp {[\.0-9]+} $quartus(version) quartus_version
regexp {Full|Standard|Pro} $quartus(version) quartus_edition
set quartus_version_major [lindex [regexp -all -inline {[0-9]+} $quartus_version] 0]
set quartus_version_minor [lindex [regexp -all -inline {[0-9]+} $quartus_version] 1]

# Settings applied whatever the Quartus version.
set_global_assignment -name FLOW_ENABLE_IO_ASSIGNMENT_ANALYSIS ON
set_global_assignment -name OPTIMIZATION_TECHNIQUE SPEED
set_global_assignment -name SYNTH_TIMING_DRIVEN_SYNTHESIS ON
set_global_assignment -name OPTIMIZE_HOLD_TIMING "ALL PATHS"
set_global_assignment -name FITTER_EFFORT "STANDARD FIT"
set_global_assignment -name ALLOW_POWER_UP_DONT_CARE OFF
set_global_assignment -name SYNTH_PROTECT_SDC_CONSTRAINT ON

# Settings applied from Quartus 15 onwards.
if {$quartus_version_major >= 15} {
    set_global_assignment -name OPTIMIZATION_MODE "HIGH PERFORMANCE EFFORT"
    set_global_assignment -name PROGRAMMABLE_POWER_TECHNOLOGY_SETTING "FORCE ALL USED TILES TO HIGH SPEED"
    set_global_assignment -name PERIPHERY_TO_CORE_PLACEMENT_AND_ROUTING_OPTIMIZATION AUTO
}
# Physical synthesis settings, applied up to and including Quartus 16.0.
if {($quartus_version_major == 16 && $quartus_version_minor == 0) || ($quartus_version_major < 16)} {
    set_global_assignment -name PHYSICAL_SYNTHESIS_REGISTER_DUPLICATION ON
    set_global_assignment -name PHYSICAL_SYNTHESIS_COMBO_LOGIC ON
    set_global_assignment -name PHYSICAL_SYNTHESIS_REGISTER_RETIMING ON
    set_global_assignment -name PHYSICAL_SYNTHESIS_ASYNCHRONOUS_SIGNAL_PIPELINING ON
    set_global_assignment -name PHYSICAL_SYNTHESIS_COMBO_LOGIC_FOR_AREA ON
    set_global_assignment -name PHYSICAL_SYNTHESIS_MAP_LOGIC_TO_MEMORY_FOR_AREA ON
    set_global_assignment -name PHYSICAL_SYNTHESIS_EFFORT EXTRA
}

5.2. Arria 10 targets

# Extract the version number and the edition from the Quartus version string.
regexp {[\.0-9]+} $quartus(version) quartus_version
regexp {Full|Standard|Pro} $quartus(version) quartus_edition
set quartus_version_major [lindex [regexp -all -inline {[0-9]+} $quartus_version] 0]
set quartus_version_minor [lindex [regexp -all -inline {[0-9]+} $quartus_version] 1]

# Settings applied whatever the Quartus version.
set_global_assignment -name FLOW_ENABLE_IO_ASSIGNMENT_ANALYSIS ON
set_global_assignment -name OPTIMIZATION_TECHNIQUE SPEED
set_global_assignment -name SYNTH_TIMING_DRIVEN_SYNTHESIS ON
set_global_assignment -name OPTIMIZE_HOLD_TIMING "ALL PATHS"
set_global_assignment -name FITTER_EFFORT "STANDARD FIT"
set_global_assignment -name ALLOW_POWER_UP_DONT_CARE ON
set_global_assignment -name SYNTH_PROTECT_SDC_CONSTRAINT ON
set_global_assignment -name PROGRAMMABLE_POWER_TECHNOLOGY_SETTING "FORCE ALL USED TILES TO HIGH SPEED"
set_global_assignment -name PERIPHERY_TO_CORE_PLACEMENT_AND_ROUTING_OPTIMIZATION AUTO
set_global_assignment -name AUTO_GLOBAL_REGISTER_CONTROLS OFF
set_global_assignment -name OPTIMIZE_POWER_DURING_SYNTHESIS OFF
set_global_assignment -name OPTIMIZE_POWER_DURING_FITTING OFF
set_global_assignment -name ALLOW_REGISTER_MERGING ON
set_global_assignment -name ALLOW_REGISTER_RETIMING ON
set_global_assignment -name ALM_REGISTER_PACKING_EFFORT LOW
set_global_assignment -name ROUTER_TIMING_OPTIMIZATION_LEVEL MAXIMUM
set_global_assignment -name ECO_OPTIMIZE_TIMING ON
set_global_assignment -name AUTO_DELAY_CHAINS ON
set_global_assignment -name AUTO_GLOBAL_CLOCK ON

# Quartus 19 and later: maximum placement effort.
# Quartus 16 to 18: aggressive performance mode with physical synthesis settings.
if {$quartus_version_major >= 19} {
    set_global_assignment -name OPTIMIZATION_MODE "HIGH PERFORMANCE EFFORT WITH MAXIMUM PLACEMENT EFFORT"
    set_global_assignment -name GLOBAL_PLACEMENT_EFFORT "MAXIMUM EFFORT"
} elseif {$quartus_version_major >= 16} {
    set_global_assignment -name OPTIMIZATION_MODE "AGGRESSIVE PERFORMANCE"
    set_global_assignment -name PHYSICAL_SYNTHESIS_REGISTER_DUPLICATION ON
    set_global_assignment -name PHYSICAL_SYNTHESIS_COMBO_LOGIC ON
    set_global_assignment -name PHYSICAL_SYNTHESIS_REGISTER_RETIMING ON
    set_global_assignment -name PHYSICAL_SYNTHESIS_ASYNCHRONOUS_SIGNAL_PIPELINING ON
    set_global_assignment -name PHYSICAL_SYNTHESIS_COMBO_LOGIC_FOR_AREA OFF
    set_global_assignment -name PHYSICAL_SYNTHESIS_MAP_LOGIC_TO_MEMORY_FOR_AREA OFF
    set_global_assignment -name PHYSICAL_SYNTHESIS_EFFORT EXTRA
}

5.3. Virtex UltraScale+ (-2 speed grade) targets

################ FPGA ####################
set_property part xcvu9p-flgb2104-2-e [current_project]

# Parse the Vivado version banner into major, minor and build numbers.
regexp {Vivado v(\d+)\.(\d).*SW Build (\d+).*IP Build (\d+)} [version] matched major minor sw_build ip_build
# The synthesis fanout limit is only set for Vivado releases before 2020.
if {$major < 2020} {set_property STEPS.SYNTH_DESIGN.ARGS.FANOUT_LIMIT 400 [get_runs synth_*]}

# Synthesis options applied to all synthesis runs.
set_property STEPS.SYNTH_DESIGN.ARGS.FSM_EXTRACTION one_hot [get_runs synth_*]
set_property STEPS.SYNTH_DESIGN.ARGS.KEEP_EQUIVALENT_REGISTERS true [get_runs synth_*]
set_property STEPS.SYNTH_DESIGN.ARGS.RESOURCE_SHARING off [get_runs synth_*]
set_property STEPS.SYNTH_DESIGN.ARGS.NO_LC true [get_runs synth_*]
set_property STEPS.SYNTH_DESIGN.ARGS.SHREG_MIN_SIZE 5 [get_runs synth_*]

# Implementation strategy applied to all implementation runs.
set_property strategy Performance_BalanceSLLs [get_runs impl_*]

5.4. Virtex UltraScale+ (-3 speed grade) targets

################ FPGA ####################
set_property part xcvu9p-flgb2104-3-e [current_project]

# Parse the Vivado version banner into major, minor and build numbers.
regexp {Vivado v(\d+)\.(\d).*SW Build (\d+).*IP Build (\d+)} [version] matched major minor sw_build ip_build
# The synthesis fanout limit is only set for Vivado releases before 2020.
if {$major < 2020} {set_property STEPS.SYNTH_DESIGN.ARGS.FANOUT_LIMIT 400 [get_runs synth_*]}

# Synthesis options applied to all synthesis runs.
set_property STEPS.SYNTH_DESIGN.ARGS.FSM_EXTRACTION one_hot [get_runs synth_*]
set_property STEPS.SYNTH_DESIGN.ARGS.KEEP_EQUIVALENT_REGISTERS true [get_runs synth_*]
set_property STEPS.SYNTH_DESIGN.ARGS.RESOURCE_SHARING off [get_runs synth_*]
set_property STEPS.SYNTH_DESIGN.ARGS.NO_LC true [get_runs synth_*]
set_property STEPS.SYNTH_DESIGN.ARGS.SHREG_MIN_SIZE 5 [get_runs synth_*]