Try out the following procedure using the template coprocessor provided in Lab 1. Please try it using the original Lab 1 template first.
Once you have completed the above and gotten it to work, please attempt the assignment below.
The assignment essentially involves combining Lab 1 and Lab 2, such that the data can be streamed from RealTerm to the C code running on ARM Cortex A9, the processing done in hardware using the coprocessor, and the results sent back to RealTerm. You can optionally do the processing in software (C) too and compare the results, but the results have to be sent back to the console nevertheless.
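The optional software comparison mentioned above could look something like the sketch below. It is only an illustration: the function names, the row-major layout of A, and the absence of any scaling step are assumptions, so apply whatever scaling your Lab 1 specification requires. If the elements of A and B are at most 255, each sum is at most 8 x 255 x 255 = 520200, which fits comfortably in 32 bits.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define A_ROWS 64 /* rows of A, also the length of RES */
#define A_COLS 8  /* columns of A, also the length of B */

/* Software reference: res[i] = sum over j of a[i][j] * b[j].
 * Assumptions (not from the lab text): A is stored row-major in a flat
 * array, and no scaling is applied to the sums. */
void matvec_ref(const uint32_t *a, const uint32_t *b, uint32_t *res)
{
    assert(a != NULL && b != NULL && res != NULL);
    for (size_t i = 0; i < A_ROWS; i++) {
        uint32_t acc = 0;
        for (size_t j = 0; j < A_COLS; j++) {
            acc += a[i * A_COLS + j] * b[j];
        }
        res[i] = acc;
    }
}

/* Returns how many of the A_ROWS entries differ between the hardware
 * result and the software reference. */
size_t count_mismatches(const uint32_t *hw, const uint32_t *sw)
{
    assert(hw != NULL && sw != NULL);
    size_t mismatches = 0;
    for (size_t i = 0; i < A_ROWS; i++) {
        if (hw[i] != sw[i]) {
            mismatches++;
        }
    }
    return mismatches;
}
```

A mismatch count of zero suggests the coprocessor and the C code agree; a non-zero count is a good place to set a breakpoint and inspect the individual entries.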
Please follow the procedure below.
- Change your Lab 1 HDL code to accommodate the bigger matrix sizes for A, B, and RES (64x8, 8x1, and 64x1 respectively). The A.csv file contains some values of 256, which was a mistake. Please open it in a text editor and replace all 256s with 255s.
- Test it thoroughly using a testbench. Note that the first version of the testbench provided in Lab 1 had a bug: it did not check M_AXIS_TVALID when M_AXIS_TLAST was asserted. Make sure your Lab 1 design asserts M_AXIS_TLAST correctly, which can be tested using the second version of the testbench. You will have to modify the .mem files and the testbench to deal with the bigger matrix. There were also some other cases that were not tested by the Lab 1 testbench, such as the non-continuous assertion of S_AXIS_TVALID and M_AXIS_TREADY. Hopefully, your design handles these correctly.
- Now, integrate this coprocessor using the same procedure you followed for the original Lab 1 template coprocessor. You might want to have a look at the Modifying the Coprocessor page to see how to re-package your modified coprocessor. You will have to modify the test_fifo_myip_v1_0.c file as appropriate. To send more than 512 words, you will need to either increase the Transmit FIFO Depth of the AXI Stream FIFO in Vivado (double click on the peripheral to change it) or split the data over two transmit operations in the C code. The first approach uses more hardware, but takes less time to send the full data.
- Initially, hard code the test cases.
- Later, you can modify it to deal with the data streamed from RealTerm.
- An even better way to wait until the coprocessor responds would be to use interrupts. This requires the AXI FIFO interrupt output to be connected appropriately, with appropriate changes to the program.
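The "two transmit operations" option in the procedure above can be sketched as follows. This is a minimal illustration, not the template's code: `send_words_fn`, `send_in_chunks`, `chunk_log`, and `log_sender` are hypothetical names, and the callback stands in for whatever FIFO driver calls your modified test_fifo_myip_v1_0.c uses for one transmit operation. The 512-word limit is taken from the text; the usable limit for your FIFO configuration may differ.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical callback: sends `count` words as ONE transmit operation.
 * Returns 0 on success. A real version would wrap the FIFO driver calls. */
typedef int (*send_words_fn)(const uint32_t *words, size_t count, void *ctx);

/* Sends `total` words in chunks of at most `max_chunk` words.
 * Returns the number of transmit operations used, or -1 if a send fails. */
int send_in_chunks(const uint32_t *words, size_t total, size_t max_chunk,
                   send_words_fn send, void *ctx)
{
    assert(words != NULL && send != NULL && max_chunk > 0);
    int operations = 0;
    size_t sent = 0;
    while (sent < total) {
        size_t count = total - sent;
        if (count > max_chunk) {
            count = max_chunk;
        }
        if (send(words + sent, count, ctx) != 0) {
            return -1;
        }
        sent += count;
        operations++;
    }
    return operations;
}

/* Dry-run sender for trying the chunking logic without hardware:
 * it only records the size of each transmit operation. */
struct chunk_log {
    size_t counts[8];
    size_t n;
};

int log_sender(const uint32_t *words, size_t count, void *ctx)
{
    struct chunk_log *log = ctx;
    (void)words; /* payload is not inspected in the dry run */
    if (log->n >= 8) {
        return -1;
    }
    log->counts[log->n++] = count;
    return 0;
}
```

Assuming A and B are sent together with one element per word, the total is 64 x 8 + 8 = 520 words, which this sketch would split into operations of 512 and 8 words. Splitting the transfer will likely leave a gap in which S_AXIS_TVALID is deasserted, so the coprocessor needs to tolerate non-continuous S_AXIS_TVALID.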
Assignment 3 (7 marks)
Demonstrate in the first hour of week 4.
Upload an archive containing either the myip_1.0 folder or the .xsa file, along with the C/H files, the .v/.vhd RTL and testbench, and the input/output test files used for the demo (only those files you have created/modified, not the entire project folder) to LumiNUS by 11:59 PM, 3rd June 2020.
It should be as a .zip archive, with the filename 01_<group_no>_3.zip for Tuesday groups and 02_<group_no>_3.zip for Wednesday groups.
Please DO NOT upload the whole project!
- Debugging in Lab 3 is hard. It is your hardware, interacting with your software. It is hard to identify and isolate issues.
- You can check the software functionality by setting breakpoints just before sending data to the coprocessor / just after receiving the data from the coprocessor.
- If the software appears to be OK, then the problem is likely in your hardware. This can be due to:
- A missing connection in the block diagram - 'Validate Design' running ok does not guarantee everything is ok. It only checks for some very essential stuff, which might not be good enough for proper functionality.
- The IP not being updated. Changing the HDL code alone is not sufficient. You need to repackage the IP (see the last part of Packaging the Coprocessor as an IP to know how to do this). Then try regenerating sources. If the worst comes to the worst, package the IP afresh.
- A functional issue with your co-processor. This typically involves not asserting M_AXIS_TLAST and M_AXIS_TVALID correctly, and not dealing with M_AXIS_TREADY properly. Some symptoms and what to check:
- If the hardware doesn't respond with the data, maybe you should check if M_AXIS_TLAST is asserted at all.
- If it doesn't respond with the correct amount of data, it could be M_AXIS_TLAST getting asserted at the wrong time.
- If it responds with incorrect data, check the correspondence between M_AXIS_TVALID and M_AXIS_TDATA.
- You will likely get a critical warning: "[Timing 38-282] The design failed to meet the timing requirements. Please see the timing summary report for details on the timing violations."
- This is not a warning that you should normally ignore. It comes up because doing a 32-bit multiplication in 1 clock cycle (10 ns) is not easy. The timing analysis tool is complaining that the design may not work at 100 MHz.
- However, timing analysis tools are typically conservative, and a design that doesn't meet timing might still work on hardware. This is why overclocking GPUs and CPUs is possible. However, stability and correctness of results are not guaranteed and could vary from chip to chip (the so-called 'silicon lottery' due to fabrication process variations) and also with the temperature of the chip (delay is temperature dependent).
- It is very unlikely to cause an issue in your case (assuming your hardware design, i.e., your modifications to the co-processor, isn't done too badly). When I attempted it, the critical path delay exceeded the clock period by just 0.231 ns even for a full 32-bit by 32-bit multiplication (an 8-bit by 8-bit multiplication will take even less), which isn't too bad and is likely well within the tolerance built into the calculations done by the timing tool.
- We will see more about timing analysis in the Topic on 'Timing' later in the semester.
- There are 3 ways to fix the problem above. However, we need to be mindful of the implications of the solutions below. Options 1 and 3 will allow you to use the built-in combinational multiplier.
- Use a slower clock than 100 MHz for the module which performs multiplication. This can be done using the Clocking Wizard IP (preferred) or using the PS to generate a slower clock (but this will need interface modifications to the co-processor to bring in the extra clock). Another variant is to use a clock divider to generate the slower clock internally, but this is generally not recommended: you need to be careful about asserting the AXIS bus signals at the correct time and for the correct duration. For example, S_AXIS_TREADY being asserted for double the duration can cause data to be missed in reception, and M_AXIS_TVALID being asserted for twice the duration can cause the slave to capture the same data twice.
- Implement a sequential multiplier yourself, which performs shift-and-add over 32 clock cycles to multiply two 32-bit numbers (32 can be brought down to a smaller number by throwing in more hardware).
- Allow the multiplication to take 2 cycles (or more if need be) combinationally. This requires writing your HDL code such that the result of the multiplication is captured only once every 2 cycles, by enabling the capture flip-flop only once every 2 cycles using some sort of counting mechanism (a 1-bit counter will suffice for 2-cycle operation). You will also need to ensure that the inputs to the multiplier don't change during these 2 cycles. Further, to inform the synthesis tool that it is OK to take 2 cycles (20 ns for a 100 MHz clock), you will need to apply a set_multicycle_path constraint (more on this in the chapter on timing) in the .xdc file for the path involving the multiplier. This helps by 1) suppressing the warning, since the tool can account for this in the timing analysis, and 2) avoiding the excessive logic and/or routing resources used when the synthesis / place & route tool tries to fit the path into 10 ns.
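For the sequential multiplier option above, a bit-accurate C model of shift-and-add can be handy as a golden reference when writing the testbench. The sketch below is an illustration only: `shift_add_mul32` and its `bits_per_cycle` parameter are hypothetical, and each outer loop iteration stands for one clock cycle of an imagined hardware implementation. Processing more multiplier bits per cycle models the "throwing in more hardware" trade-off.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Returns a * b computed by shift-and-add, consuming `bits_per_cycle`
 * multiplier bits per modelled clock cycle. If cycles_out is non-NULL,
 * it receives the number of modelled cycles (32 / bits_per_cycle). */
uint64_t shift_add_mul32(uint32_t a, uint32_t b,
                         unsigned bits_per_cycle, unsigned *cycles_out)
{
    /* bits_per_cycle must divide 32: 1, 2, 4, 8, 16 or 32. */
    assert(bits_per_cycle >= 1 && bits_per_cycle <= 32 &&
           32 % bits_per_cycle == 0);

    uint64_t product = 0;
    unsigned cycles = 0;
    for (unsigned pos = 0; pos < 32; pos += bits_per_cycle) {
        /* Within one cycle, hardware would add up to bits_per_cycle
         * shifted copies of the multiplicand. */
        for (unsigned k = 0; k < bits_per_cycle; k++) {
            if ((b >> (pos + k)) & 1u) {
                product += (uint64_t)a << (pos + k);
            }
        }
        cycles++;
    }
    if (cycles_out != NULL) {
        *cycles_out = cycles;
    }
    return product;
}
```

With `bits_per_cycle` set to 1 the model takes 32 cycles, matching the basic scheme described above; with 4 it takes 8 cycles at the cost of more adders per cycle.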