Ticket Name: TDA4VMXEVM: Performance of TIOVX kernels

Query Text:

Part Number: TDA4VMXEVM

Hello all,

SUMMARY:

I've been benchmarking the J721EX board running the SDK auto Linux and OpenVX for C66 DSPs (TIOVX library). In summary, I've been observing performance at least one order of magnitude lower than I'd expect from datasheets and manuals.

TI MEASURED PERFORMANCE ANALYSIS:

Please consider the table at http://software-dl.ti.com/jacinto7/esd/processor-sdk-rtos-jacinto7/latest/exports/docs/tiovx/docs/user_guide/TIOVX_PERFORMANCE_J7ES_LINUX.html

Entry #80 for the Multiply OpenVX kernel states:

    Index  Kernel    Variant          Frame Size (Pixels)  Graph Performance (msec)  Node Performance (msec)
    80     Multiply  S16 x S16 = S16  640x480 (307200)     4.057000                  3.978000

This means that 307200 int16_t elements are multiplied by 307200 int16_t elements, one by one, producing 307200 int16_t elements.

Then, I disassembled the released C66 firmware loaded by Linux:

    ~/psdk_rtos_auto_j7_06_01_01_12/c6000_7.4.24/bin/dis6x --all ~/psdk_rtos_auto_j7_06_01_01_12/vision_apps/out/J7/C66/SYSBIOS/release/vx_app_tirtos_linux_c6x_1.out > /tmp/c66.lst

And counted the number of cycles spent in the inner loop of VXLIB_multiply_i16s_i16s_o16s_core:

    ad1e86d8           $C$L32:
    ad1e86d8     0d66        SPLOOP 3
    ad1e86da     5947 ||     MV.L2X A18,B18
    ad1e86dc ec081000        .fphead n, h, W, BU, nobr, nosat, 1100000b
    ad1e86e0           $C$L33:
    ad1e86e0     2ce7        SPMASK L1,L2
    ad1e86e2     1581 ||^    ADD.L2X A19,8,B16
    ad1e86e4 044c5765 ||     LDDW.D1T1 *A19++[2],A9:A8
    ad1e86e8 024857e7 ||     LDDW.D2T2 *B18++[2],B5:B4
    ad1e86ec 09490058 ||^    ADD.L1 8,A18,A18
    ad1e86f0 02485764        LDDW.D1T1 *A18++[2],A5:A4
    ad1e86f4 044057e6        LDDW.D2T2 *B16++[2],B9:B8
    ad1e86f8 00004000        NOP 3
    ad1e86fc e0280003        .fphead n, h, W, BU, nobr, nosat, 0000001b
    ad1e8700 12209032        DMPY2.M2X B5:B4,A9:A8,B7:B6:B5:B4
    ad1e8704 12209030        DMPY2.M1X A5:A4,B9:B8,A7:A6:A5:A4
    ad1e8708 00002000        NOP 2
    ad1e870c 0310a01b        PACK2.L2 B5,B4,B6
    ad1e8710 0398eff2 ||     PACK2.S2 B7,B6,B7
    ad1e8714 1440d033        DMPY2.M2X B7:B6,A17:A16,B11:B10:B9:B8
    ad1e8718 0310a019 ||     PACK2.L1 A5,A4,A6
    ad1e871c 0398eff0 ||     PACK2.S1 A7,A6,A7
    ad1e8720 1440c030        DMPY2.M1 A7:A6,A17:A16,A11:A10:A9:A8
    ad1e8724     ac66        SPMASK D2
    ad1e8726     39d7 ||^    MV.D2X A3,B17
    ad1e8728 00430001        SPMASK D1
    ad1e872c 018d0940 ||^    ADD.D1 A3,0x8,A3
    ad1e8730 0a21201b        PACK2.L2 B9,B8,B20
    ad1e8734 0aa96ff2 ||     PACK2.S2 B11,B10,B21
    ad1e8738 0a4457c7        STDW.D2T2 B21:B20,*B17++[2]
    ad1e873c e0400004        .fphead n, l, W, BU, nobr, nosat, 0000010b
    ad1e8740 0a212019 ||     PACK2.L1 A9,A8,A20
    ad1e8744 0aa96ff0 ||     PACK2.S1 A11,A10,A21
    ad1e8748 0c034001        SPKERNEL 3,0
    ad1e874c 0a0c5744 ||     STDW.D1T1 A21:A20,*A3++[2]
    ad1e8750           $C$L34:

Between SPLOOP and SPKERNEL, there are 20 cycles, and each DMPY2 processes 4 elements, resulting in 8 multiplications per loop iteration (the 3rd and 4th DMPY2 apply a constant scaling factor). Therefore, as a first-order approximation, each element takes 20/8 = 2.5 cycles to be processed. Compared to the best measured result of 0.5 cycles/element in http://software-dl.ti.com/jacinto7/esd/processor-sdk-rtos-jacinto7/latest/exports/docs/vxlib_c66x_1_1_4_0/docs/VXLIB_c66x_TestReport.html, my estimate is 5 times larger. I could not find a better inner loop in the disassembled firmware.

In total, the 307200 multiplications should take 2.5*307200 = 768000 cycles. We can finally estimate the DSP clock frequency as (768000 cycles) / (3.978 ms) = 193061840 cycles/s = 193 MHz. The TDA4VM datasheet states that the C66 DSP can be clocked up to 1.35 GHz, so this would mean that the evaluation kit DSP is underclocked by a factor of about 7.

MY FP IMPLEMENTATION:

I've implemented and benchmarked FP kernels (with SIMD FP multiply QMPYSP) myself and observed a consistent underclocking factor of 10. In other words, the measured performance is consistently 10x lower than I'd expect, even when processing multiplications of 5120 x 3840 image planes = 79 MB.
Some hypotheses I can imagine are:

1) DSP is underclocked
2) L2 HW prefetcher in C66 not initialized
3) DDR bus not at full speed 1866 MHz

Could someone please help here?

Thanks,
Fernando A. Endo

Responses:

Hello again,

Just a mistake in my computations: the 20 cycles refer to the dynamic length of the loop. The iteration interval (one stage length) is 3. Because there is no data dependency between the first two DMPY2, they can fit in one stage. The same holds for the last two DMPY2. So, the cycles/element is actually (3 cycles) / ((2 DMPY2) * (4 elements/DMPY2)) = 3/8 = 0.375 cycles/element.

Following the same logic as in the previous message, the DSP frequency should be only 29 MHz, almost 50x slower than the peak frequency!

Regards,
Fernando

Fernando,

The loop you mentioned with 3/8 (0.375) cycles/element is the fastest inner loop at line 163 (labeled case 1B in the source code comments). This is for aligned pointers and overflow_policy == VXLIB_CONVERT_POLICY_WRAP. This closely matches the curve fit equation of the performance results for Mode 1 in the test report you mentioned: http://software-dl.ti.com/jacinto7/esd/processor-sdk-rtos-jacinto7/latest/exports/docs/vxlib_c66x_1_1_4_0/docs/VXLIB_c66x_TestReport.html

    Mode 1: scale is integer; width == stride; WRAP
    Test vectors run: 3
    Formula: Cycles: 0.36895*N + 142
    Where: N = width * height

Please be aware that this test report is trying to communicate the absolute best baseline that can be achieved from the DSP core code, as it is run on a simulator which assumes that all code and data is in L1 memory (no memory hierarchy, therefore no cache stalls). At the VXLIB level, we wanted to show this baseline so you can see what the core loops can achieve relative to each other, without considering memory hierarchy latencies.
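The arithmetic in the posts above can be re-checked with a few lines of Python. This is only a sketch: the helper names are made up for illustration, the inputs are the figures quoted in the thread, and the "implied clock" is simply what falls out if one assumes the core loop runs with no memory stalls at all.

```python
# Sanity check of the cycle estimates quoted in the thread.
# Helper names are illustrative; all numbers come from the posts above.

N = 640 * 480            # 307200 elements per frame
NODE_TIME_S = 3.978e-3   # measured node time from the TIOVX table


def cycles_per_element(cycles_per_iteration, elements_per_iteration):
    """Cycles spent per output element for one loop iteration."""
    return cycles_per_iteration / elements_per_iteration


def implied_clock_hz(cpe, n_elements, time_s):
    """Clock that would explain time_s if the loop never stalled."""
    return cpe * n_elements / time_s


# First estimate: 20 cycles (dynamic loop length) per 8 elements.
naive_cpe = cycles_per_element(20, 8)
# Corrected estimate: iteration interval of 3 cycles per 8 elements.
ii_cpe = cycles_per_element(3, 8)

# Simulator curve fit for Mode 1 from the VXLIB test report.
fit_cycles = 0.36895 * N + 142

for label, cpe in (("dynamic length", naive_cpe), ("iteration interval", ii_cpe)):
    mhz = implied_clock_hz(cpe, N, NODE_TIME_S) / 1e6
    print(f"{label}: {cpe} cycles/element, implied clock {mhz:.0f} MHz")
print(f"curve fit: {fit_cycles / N:.4f} cycles/element")
```

The curve fit and the 3/8 iteration-interval figure agree to within about 2%, which supports reading the loop as one iteration every 3 cycles.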
The actual TDA4x board performance from OpenVX is shared in the other table you mentioned: http://software-dl.ti.com/jacinto7/esd/processor-sdk-rtos-jacinto7/latest/exports/docs/tiovx/docs/user_guide/TIOVX_PERFORMANCE_J7ES_LINUX.html

As of now, these VXLIB kernels are operating from DDR via L1 and L2 caches. DMA optimizations (for example using BAM framework) bring on average about 2x speed improvement, and are available on TDA2/3, but there is a pending DMA library update needed for TDA4 before we can enable the BAM DMA optimizations. So for now, these numbers are reflecting the cache-only mode for these kernels (as indicated at the top of the test report).

So the discrepancy you are seeing is primarily due to the stalls due to cache misses through the L1/L2 and to DDR.

    DSP Speed = 1.35 GHz
    Pixels = 307200
    Time = 3.978 ms
    Effective Cycles Per Pixel with cache stall/memory latencies = 0.003978 s * 1.35 GHz / 307200 = 17.48 cycles / element

Many of these kernels in VXLIB are highly optimized at the DSP, but are so simple that they are I/O bound (Add/absdiff/mult, etc). Simply reading images from DDR through the cache and doing a pixelwise multiply before writing back to DDR heavily underutilizes the DSP, since most of the time the DSP will be stalled on cache misses. The memory system through the caches simply cannot feed this kernel fast enough to avoid stalling a DSP that is running at full speed.

However, you shouldn't think of this as a blanket (17.48/0.375) 47x degradation due to memory, because the effect is not linear. Assuming you keep the interface the same such that the kernel reads and writes the same amount of data, but the compute did much more than a multiply and took 18 cycles per pixel, then the actual result with the memory system would still be 18-19 cycles per element (estimate), because the compute and the I/O latencies can largely be done in parallel and are much more balanced.
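The effective cycles-per-element figure and the "not linear" remark can be illustrated numerically. The sketch below reproduces the 17.48 figure from the numbers above, then applies a deliberately crude overlap model (total cost is the larger of compute and memory cost). That model is an assumption made here for illustration, not something TI states; real hardware overlaps only partially, which is presumably why the estimate above is 18-19 rather than exactly 18.

```python
# Effective cost per element on the board, plus an idealised overlap model.
# overlapped_cpe() is a hypothetical helper, not a TI formula.

DSP_HZ = 1.35e9          # DSP clock from the reply above
N = 307200               # pixels per frame
NODE_TIME_S = 3.978e-3   # measured node time
CORE_LOOP_CPE = 3 / 8    # simulator-level cost of the multiply core loop

effective_cpe = NODE_TIME_S * DSP_HZ / N
slowdown = effective_cpe / CORE_LOOP_CPE


def overlapped_cpe(compute_cpe, memory_cpe):
    """Lower bound on cost when compute and memory traffic overlap perfectly."""
    return max(compute_cpe, memory_cpe)


# Assumption: the measured multiply kernel is entirely memory bound, so its
# effective cost is taken as the memory cost for this interface.
memory_cpe = effective_cpe

print(f"effective: {effective_cpe:.2f} cycles/element ({slowdown:.0f}x the core loop)")
print(f"multiply kernel bound: {overlapped_cpe(CORE_LOOP_CPE, memory_cpe):.2f}")
print(f"18-cycle kernel bound: {overlapped_cpe(18, memory_cpe):.2f}")
```

Under this model a kernel with 18 cycles of compute per pixel costs about the same as the trivial multiply, which is the sense in which the 47x ratio is not a blanket penalty.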
This highlights how one should optimize loops and algorithms running on the DSP to get maximum performance. If your loop takes a relatively high number of cycles (like > 17 in this case), it may be worth spending time to optimize the loop to bring the cycle time down. If the optimized loop still takes > 17 cycles, then using DMA to bring data into L2SRAM from DDR will not improve performance, since the compute is the bottleneck. However, if the optimized loop takes significantly fewer cycles than this, then the memory I/O is the bottleneck, and using DMA to ping/pong blocks of data into L2SRAM in parallel with compute can bring further improvements, since we are reducing the latencies of the memory hierarchy. This was the purpose of using BAM, and we hope to enable that framework on TDA4 in the coming year.

Another optimization to consider: if you are cascading several kernels (loops) on the DSP for the whole image, you might consider tiling and putting intermediate results in L2SRAM, so that you only pay the latencies of L1 cache and not the more expensive L2 cache to DDR.

If in the end you meet real time and all of your loops are still I/O bound, you can save power by reducing the clock speed of the DSP until the compute time is more balanced with the I/O time.

Please let me know if this makes sense and if you have any follow-up questions.

Regards,
Jesse

Hello Jesse,

Thanks for your detailed explanation. I still have some follow-up questions:

Jesse Villarreal said:
"As of now, these VXLIB kernels are operating from DDR via L1 and L2 caches. DMA optimizations (for example using BAM framework) bring on average about 2x speed improvement, and are available on TDA2/3, but there is a pending DMA library update needed for TDA4 before we can enable the BAM DMA optimizations. So for now, these numbers are reflecting the cache-only mode for these kernels (as indicated at the top of the test report)."
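For readers unfamiliar with the ping/pong scheme mentioned in Jesse's reply, here is a structural sketch in Python. It is not BAM and uses no TI API: plain lists stand in for DDR and L2SRAM, and the hypothetical `fetch_block()` stands in for a DMA transfer. It runs sequentially, so it shows only the buffer alternation, not the speedup.

```python
# Structural sketch of ping/pong (double) buffering for an elementwise multiply.
# All names are illustrative; fetch_block() is a stand-in for a DMA transfer.

def fetch_block(src0, src1, start, size):
    """Copy one block of both inputs into 'scratch' memory."""
    return src0[start:start + size], src1[start:start + size]


def multiply_ping_pong(src0, src1, block_size):
    """Multiply two equal-length sequences block by block using two buffers."""
    assert len(src0) == len(src1)
    out = []
    starts = list(range(0, len(src0), block_size))
    if not starts:
        return out

    buffers = [None, None]  # ping and pong
    buffers[0] = fetch_block(src0, src1, starts[0], block_size)

    for i in range(len(starts)):
        cur = i % 2
        if i + 1 < len(starts):
            # On real hardware this transfer would be started here and run in
            # the background while the compute below works on the other buffer.
            buffers[1 - cur] = fetch_block(src0, src1, starts[i + 1], block_size)
        a, b = buffers[cur]
        out.extend(x * y for x, y in zip(a, b))  # compute on the current block
    return out


a = list(range(10))
b = [2] * 10
print(multiply_ping_pong(a, b, block_size=4))
```

On the DSP the same structure would be written in C, with the two buffers placed in L2SRAM and the fetch replaced by whatever DMA mechanism is available.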
So, basically, with BAM working on TDA4, we would get around a 2x speedup over a non-BAM sequence of kernels. Then, instead of the 47x degradation, we would get a 23x slowdown compared to a full L1 hit ratio. I'm not yet convinced that a 23x slowdown is a good result on average. Could you please give us a full example of the BAM results? If possible the best result, with a long pipeline of kernels.

Jesse Villarreal said:
"So the discrepancy you are seeing is primarily due to the stalls due to cache misses through the L1/L2 and to DDR."

Your conclusion seems fair, but only if no hardware prefetchers are present in the cache hierarchy. Basically, without prefetching, every cache line miss (64 bytes for L1 and 128 for L2) will have to pay the DDR latency, which should be on the order of hundreds of CPU cycles. However, the C66 has an L2 hardware prefetcher. In this case, if it has been properly designed, in a stream processing kernel the DDR latency should be paid only a few times, until the prefetcher warms up and detects the 2 source streams and 1 destination stream.

So, my next questions are:

1) Is the L2 hardware prefetcher enabled by default in the TDA4 RTOS SDK?
2) According to the C66x CorePac User's Guide (Rev. C), the prefetcher type seems to be a stride prefetcher. Is it possible to set up the prefetcher parameters? For instance, set the number of prefetch requests to DDR once a stream has been detected.
3) As far as I understood, the L1 program cache is permanently disabled in the current TDA4 revision. Will the L1 program cache be enabled in future silicon revisions? What's the performance loss by not having it enabled?
4) Do you have an estimate of when the BAM-plugin will be available for the TDA4?

Thanks for your help,
Fernando

Fernando,

Here are some answers to your questions:

Fernando Endo said:
"Is the L2 hardware prefetcher enabled by default in the TDA4 RTOS SDK?"

As far as I know, there is no way to turn it off in SW, so yes it is enabled.
Fernando Endo said:
"According to the C66x CorePac User's Guide (Rev. C), the prefetcher type seems to be a stride prefetcher. Is it possible to set up the prefetcher parameters? For instance, set the number of prefetch requests to DDR once a stream has been detected."

No, this is not configurable.

Fernando Endo said:
"As far as I understood, the L1 program cache is permanently disabled in the current TDA4 revision. Will the L1 program cache be enabled in future silicon revisions? What's the performance loss by not having it enabled?"

What documentation or discussion led to this conclusion? The L1 program cache is not disabled in C66x on TDA4.

Fernando Endo said:
"Do you have an estimate of when the BAM-plugin will be available for the TDA4?"

Current estimate given our priorities is end of year 2020. Please let me know if you will need this earlier or later based on your schedule and we may be able to adjust the priority. Our understanding is that since this is a performance optimization feature, it is typically needed after initial development but before production/optimization phases.

Additional comments: the number of lines that will be prefetched by the prefetcher does not cover the entire latency trip to DDR memory. In addition, the L2 controller itself can only see a few cache line misses at a time. So the functioning of the prefetcher should not be expected to reduce the DDR latency penalty to zero after time.

In the SDK, the L2 memory is configured for 64 KB cache, and the rest is set to addressable RAM. We did this in anticipation of people using the L2 RAM as a scratchpad for DMA (either custom DMA or using BAM in the future, for example). In your experiments, if you are not using L2RAM, then you can configure full L2 memory as cache to get better cache performance.

Regards,
Jesse

Hello Jesse,

Thanks again for your detailed answers. Here is some discussion and details requested:

Jesse Villarreal said:
"What documentation or discussion led to this conclusion? The L1 program cache is not disabled in C66x on TDA4."

There is a note in the "SPRUIL1A – May 2019 – Revised November 2019", AM752x/DRA829/TDA4xM Technical Reference Manual, section 6.4.1.1 C66SS Features: "NOTE: The C66x L1P memory is disabled (not supported) in this device."

Jesse Villarreal said:
"the number of lines that will be prefetched by the prefetcher does not cover the entire latency trip to DDR memory. In addition, the L2 controller itself can only see a few cache line misses at a time. So the functioning of the prefetcher should not be expected to reduce the DDR latency penalty to zero after time."

Yes, I agree; that's why I asked if it is possible to change the prefetcher configuration, especially the number of requests and/or the prefetching distance (i.e., prefetch more cache lines in advance and/or prefetch one cache line at a time that is foreseen to be accessed after a configurable time in the future). By tuning these parameters per kernel, we could satisfactorily hide the DDR latency.

Jesse Villarreal said:
"In the SDK, the L2 memory is configured for 64 KB cache, and the rest is set to addressable RAM. We did this in anticipation of people using the L2 RAM as a scratchpad for DMA (either custom DMA or using BAM in the future, for example). In your experiments, if you are not using L2RAM, then you can configure full L2 memory as cache to get better cache performance."

Good to know, thanks!

Kind regards,
Fernando A. Endo

Fernando,

Fernando Endo said:
"NOTE: The C66x L1P memory is disabled (not supported) in this device."

This must be referring to the addressable RAM option for L1P. In some devices, the L1 and L2 memories can be configured to be cache, or addressable RAM, or a combination of the two. In the case of TDA4, the 32 KB L1P is fixed to be full cache and cannot be configured as addressable RAM. I will file a ticket to see if this note can be clarified to avoid confusion.

Thanks,
Jesse