
Try out the following procedure using the template coprocessor provided in Lab 1. Please try it using the original Lab 1 template first. 

Once you have completed the above and got it working, please attempt the assignment below.

Assignment 3

The assignment essentially involves combining Lab 1 and Lab 2, such that the data can be streamed from RealTerm to the C code running on ARM Cortex A9, the processing done in hardware using the coprocessor, and the results sent back to RealTerm. You can optionally do the processing in software (C) too and compare the results, but the results have to be sent back to the console nevertheless.
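
For the optional software comparison, a reference model in C might look like the sketch below. It assumes each element of RES is the row-by-column sum of products divided by 256, as in Lab 1, and that A is stored row by row in a flat array. The names `matmul_ref`, `first_mismatch`, `A_ROWS`, and `A_COLS` are illustrative and not part of the provided template.

```c
#include <assert.h>
#include <stdint.h>

#define A_ROWS 64  /* rows of A, and number of elements in RES */
#define A_COLS 8   /* columns of A, and number of elements in B */

/* Software reference: res[i] = (sum over j of A[i][j] * B[j]) / 256.
 * A is assumed to be stored row by row; elements are assumed to be
 * 8-bit values carried in 32-bit words. */
void matmul_ref(const uint32_t *a, const uint32_t *b, uint32_t *res)
{
    for (int i = 0; i < A_ROWS; i++) {
        uint32_t acc = 0;  /* at most 255*255*8, far below 2^32 */
        for (int j = 0; j < A_COLS; j++) {
            acc += a[i * A_COLS + j] * b[j];
        }
        res[i] = acc >> 8;  /* divide by 256 */
    }
}

/* Returns the index of the first element where the hardware and software
 * results differ, or -1 if all n elements agree. */
int first_mismatch(const uint32_t *hw, const uint32_t *sw, int n)
{
    assert(n >= 0);
    for (int i = 0; i < n; i++) {
        if (hw[i] != sw[i]) {
            return i;
        }
    }
    return -1;
}
```

Calling `first_mismatch` on the coprocessor output and the `matmul_ref` output gives a quick pass/fail check before the results are sent back to the console.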

Please follow the procedure below.

  • Change your Lab 1 HDL code to accommodate the bigger matrix sizes for A, B, and RES (64x8, 8x1, 64x1 respectively). The A.csv file has some numbers which are 256, which was a mistake. Please open it in a text editor and replace all 256s by 255s.
  • Test it thoroughly using a testbench. Note that the first version of the testbench provided in Lab 1 had a bug that M_AXIS_TVALID wasn't checked when M_AXIS_TLAST was asserted. Make sure your Lab 1 design is able to give M_AXIS_TLAST correctly, which can be tested using the second version of the testbench.  You will have to modify the .mem files and the testbench to deal with the bigger matrix. There were also some other cases that were not tested by the Lab 1 testbench, such as the non-continuous assertion of S_AXIS_TVALID and M_AXIS_TREADY. Hopefully, these should be fine in your design.
  • Now, integrate this coprocessor using the same procedure you had followed for the original Lab 1 template coprocessor.  You might want to have a look at the Modifying the Coprocessor page to see how to re-package your modified coprocessor. You will have to modify the test_fifo_myip_v1_0.c file as appropriate. You will need to either increase the Transmit FIFO Depth of AXI Stream FIFO in Vivado (double click on the peripheral to change it) or send it over two transmit operations in C code to send more than 512 words. The first approach will take more hardware, but takes less time to send the full data over.
    • Initially, hard code the test cases.
    • Later, you can modify it to deal with the data streamed from RealTerm.
  • An even better way to wait until the coprocessor responds would be to use interrupts. This requires the AXI FIFO interrupt output to be connected appropriately, with appropriate changes to the program.
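
The option of sending the data over more than one transmit operation can be sketched as below. This is a host-testable outline only: `send_chunk_fn` is a hypothetical callback that would wrap whatever FIFO driver calls your program uses to transmit one packet, and `send_in_chunks`, `count_words`, and `FIFO_DEPTH_WORDS` are illustrative names. It also assumes A and B are sent together as 64x8 + 8 = 520 words, one element per word.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define FIFO_DEPTH_WORDS 512u  /* assumed default Transmit FIFO Depth */

/* Hypothetical callback: transmits `count` words as one operation and
 * returns 0 on success. On the board it would wrap the FIFO driver calls. */
typedef int (*send_chunk_fn)(const uint32_t *words, size_t count, void *ctx);

/* Sends `total` words in chunks of at most `max_chunk` words each.
 * Returns the number of transmit operations used, or -1 on failure. */
int send_in_chunks(const uint32_t *words, size_t total, size_t max_chunk,
                   send_chunk_fn send, void *ctx)
{
    assert(max_chunk > 0);
    int ops = 0;
    size_t sent = 0;
    while (sent < total) {
        size_t n = total - sent;
        if (n > max_chunk) {
            n = max_chunk;
        }
        if (send(words + sent, n, ctx) != 0) {
            return -1;
        }
        sent += n;
        ops++;
    }
    return ops;
}

/* Example callback for testing on a PC: it only totals the words "sent". */
int count_words(const uint32_t *words, size_t count, void *ctx)
{
    (void)words;
    *(size_t *)ctx += count;
    return 0;
}
```

With 520 words and a chunk size of 260, this results in two transmit operations. Choosing a chunk size comfortably below the FIFO depth leaves some margin, since the usable capacity may be slightly less than the nominal depth.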

Submission Info

Assignment 3 (7 marks)

Demonstrate in the first hour of week 4.


Upload an archive containing either the myip_1.0 folder or the .xsa file, the C/H files, the .v/.vhd RTL and testbench, and the input/output test files used for the demo (only those files you have created/modified, not the entire project folder) to LumiNUS by 11:59 PM, 3rd June 2020.

It should be as a .zip archive, with the filename  01_<group_no> for Tuesday groups and 02_<group_no> for Wednesday groups.

Please DO NOT upload the whole project!


  • Debugging in Lab 3 is hard. It is your hardware, interacting with your software. It is hard to identify and isolate issues.
    • You can check the software functionality by setting breakpoints just before sending data to the coprocessor / just after receiving the data from the coprocessor.
    • If the software appears to be ok, then it is likely your hardware. This can be
      • A missing connection in the block diagram -  'Validate Design' running ok does not guarantee everything is ok. It only checks for some very essential stuff, which might not be good enough for proper functionality.
      • The IP not being updated. Changing the HDL code alone is not sufficient. You need to repackage the IP (see the last part of Packaging the Coprocessor as an IP to know how to do this). Then try regenerating sources. Worst come to the worst, package the IP afresh.
      • A functional issue with your co-processor. This typically involves not asserting M_AXIS_TLAST and M_AXIS_TVALID correctly, and not dealing with M_AXIS_TREADY properly. Some possible reasons:
        • If the hardware doesn't respond with the data, maybe you should check if M_AXIS_TLAST is asserted at all. 
        • If it doesn't respond with the correct amount of data, it could be M_AXIS_TLAST getting asserted at the wrong time.
        • If it responds with incorrect data, check the correspondence between M_AXIS_TVALID and M_AXIS_TDATA.
  • You will likely get a critical warning : "[Timing 38-282] The design failed to meet the timing requirements. Please see the timing summary report for details on the timing violations.".
    • This is not a warning that you should normally ignore. It comes up because doing a 32-bit multiplication in 1 clock cycle (10 ns) is not easy. The timing analysis tool is complaining that the design may not work at 100 MHz.
    • However, timing analysis tools are typically conservative, and a design that doesn't meet the timing might still work on hardware. This is why overclocking GPUs and CPUs is possible. However, stability and correctness of results is not guaranteed and could vary from chip to chip (so-called 'silicon lottery' due to fabrication process variations) and also on the temperature of the chip (delay is temperature dependent).
    • It is very unlikely to cause an issue in your case (assuming your hardware design, i.e., your modifications to the co-processor, isn't done too badly). When I attempted it, the critical path delay exceeded the period by just 0.231 ns even for a full 32-bit by 32-bit multiplication (an 8-bit by 8-bit multiplication will take even less), which isn't too bad and is likely well within the tolerance built into the calculations done by the timing tool.
    • We will see more about timing analysis in the Topic on 'Timing' later in the semester.
    • There are 4 ways to fix the problem above. However, we need to be mindful of the implications of each. Options 1, 2, and 4 will allow you to use the built-in combinational multiplier; option 3 replaces it with a sequential one.
      1. Use a slower clock than 100 MHz for the module which performs multiplication. This can be done using clocking wizard IP (preferred) / using PS to generate a slower clock (but will need interface modifications to the co-processor to bring in the extra clock).
      2. Use a clock divider to generate a slower clock internally. This is generally not recommended. You need to be careful about asserting the AXIS bus signals at the correct time and for the correct duration. For example, S_AXIS_TREADY being on for double the time duration can cause data to be missed in reception, and M_AXIS_TVALID being asserted for twice the time duration can cause the slave to capture the same data two times.
      3. Use a sequential multiplier yourself, which performs shift and add over 32 clock cycles (32 can be brought down to a smaller number by throwing in more hardware) to perform multiplication of two 32-bit numbers.
      4. Allow the multiplication to take 2 cycles (or more if need be) combinationally. This will require creating your design (writing HDL code) such that the results of multiplication are captured only once every 2 cycles, by enabling the capture flip-flop only once every 2 cycles using some sort of counting mechanism (a 1-bit counter will suffice for 2-cycle operations). You will also need to ensure that the inputs to the multiplier don't change during these 2 cycles. Further, to inform the synthesis tool that it is OK to take 2 cycles (20 ns for a 100 MHz clock), you will need to apply a set_multicycle_path constraint (more on this in the chapter on timing) in the .xdc file for the path involving the multiplier. This will help in 1) suppressing the warning by allowing the tool to account for this fact in the timing analysis, and 2) avoiding excessive use of logic and/or routing resources when the synthesis / place & route tool tries to fit it into 10 ns.
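
The shift-and-add scheme in option 3 can be modelled in C as below, with each loop iteration standing in for one clock cycle of the sequential multiplier. This is a behavioural sketch for checking expected values on a PC, not HDL; `shift_add_mul32` and `check_shift_add` are illustrative names.

```c
#include <assert.h>
#include <stdint.h>

/* Bit-serial shift-and-add: one partial product is considered per iteration,
 * mirroring one clock cycle of a sequential multiplier.
 * Returns the full 64-bit product of two 32-bit operands. */
uint64_t shift_add_mul32(uint32_t multiplicand, uint32_t multiplier)
{
    uint64_t product = 0;
    uint64_t shifted = multiplicand;  /* multiplicand << cycle */
    for (int cycle = 0; cycle < 32; cycle++) {
        if (multiplier & 1u) {        /* add only when the current bit is 1 */
            product += shifted;
        }
        shifted <<= 1;                /* next bit has twice the weight */
        multiplier >>= 1;             /* move on to the next multiplier bit */
    }
    return product;
}

/* Self-check against the built-in multiplication. */
void check_shift_add(uint32_t a, uint32_t b)
{
    assert(shift_add_mul32(a, b) == (uint64_t)a * b);
}
```

For 8-bit operands, only the lowest 8 multiplier bits can be set, so a hardware version could stop after 8 cycles instead of 32.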


  1. Hi,

    When I followed the first few steps to modify the HDL wrapper code, it displayed about 37 warnings. Is it still OK, or did I miss something? I tried both VHDL and Verilog, and both had multiple warnings. Thank you.

    1. Can you take a screenshot of the warnings you received?

        1. Yes these warnings are safe to ignore, as not everything is connected.

  2. Hi,

    When I run the sample code given to us (lab3_coprocessor.c), it gets stuck in the while loop (Waiting for AXI DMA). Is there anything wrong with the original sample code, or is it supposed to be like that? I simply followed the steps in the lab manual.

    1. You are supposed to see the summation results in the console.

      If you suspect the AXI DMA gets stuck and does not send values back, try doing a simulation and see what's happening inside your coprocessor.

  3. 1) It doesn't matter. When you look at coprocessor.c, appropriate type definitions/conversions are done before and after AXI DMA transfer. Even if the original input is not 'u32', it gets converted to the format when it is stored in TX buffer.

    2) Up to you. Displaying it in different number representations requires nothing more than a small change in your function anyways.

  4. Hey!

    I am currently trying to get the co-processor to work. I added new peripherals to the first project, i.e., next to the TFT peripherals. Now when I generate the bitstream, export hardware, launch SDK and import the co_processor.c file, I run into an error that the "xaxidma" library is missing. I checked the location where it should be, and the library is missing from there as well. As far as I understand, it has something to do with the DMA peripheral (Direct Memory Access). I then literally quadruple-checked that all settings there were as shown in the step-by-step guide.

    Does anyone have any idea what's wrong?

    Your comrade Janno

    1. An image of the error. 

      1. Chances are that you have not exported the latest hardware to SDK. Please check the system.mss (or the .hdf file) to see if the AXI DMA peripheral is present.

  5. Hi,

    For the files to be submitted,  can I simply submit the zip file which is generated by clicking 'archive project' from the file menu?  BTW, the custom ip folder is under the 'ipdefs' folder inside the zip file.

    1. If you want to do that, please remember to clean up project files before zipping it. If the file size is large, try avoiding this method, as for this particular lab we are more interested in your custom IP files, C/H files and test files related to your original design. 

  6. Hello! I think I have a problem changing my Verilog code. When I want to modify it, I change it, save it, repackage it, refresh all in the repository manager in the main project and upgrade myip. Then I run synthesis and implementation again, generate the bitstream, export and launch SDK. The code for myip in the main project seems to be changed, but once I run it I don't see any difference. The example code worked fine for me, and without modifying anything else I tried assigning a random value instead of the sum to M_AXIS_TDATA just to see if there would be any difference in the output, but there was none. Plus, when I try to modify the code to send 2 values instead of 4, I get stuck after DMA to Device finishes, and I suspect it is because myip is still not updated properly and waits for 2 more values. Can you spot anything I am doing wrong?

    1. Did you generate output products after repackaging and upgrading ip? If not, can you try that and see if it makes any difference?

      1. To generate output products I have the options Global/Out of context per IP/Out of context per block design. I generate using the default Out of context per IP and it doesn't make difference. I can try the others, too.

        1. Yea try global synthesis?

          1. Both others didn't work. P.S. I checked your post below - these settings were alright for me, so no difference from there too.

            1. I see. Another group reported the same problem, but we didn't encounter any of these ourselves, so still trying to figure out what could have possibly gone wrong. Meanwhile, if you need to work on your project, I'm afraid you will have to re-create a new project with your modified ip files every time you make changes:( 

              A tip to speed up your block design – save your tcl commands into a .tcl file and load it to your new project instead of redrawing everything (Remember we mentioned that in Lab1 briefing?). 

              Btw, did your groupmate encounter the same issue?

              1. We're moving the same project files on a thumb drive between both, but I'm not sure we have worked on his computer after I realized why my code is not changing no matter what. Thank you, I'll try to get prepared for the never ending project recreation...

                1. 💪🏻. You can try doing simulation (very helpful though not compulsory for this lab) for your custom ip if you need to debug – that reduces the time you need to recreate & reimplement as well😅

  7. FYI. 

    AR# 64110

    Vivado IP Flows - After repackaging a custom IP, the coreRevision is not updated in the component.xml


    When I repackage my IP in IP Packager, I expect the IP revision to be automatically updated. 

    However, I see that the coreRevison value does not get automatically updated in the component.xml.

    Because the revision has not changed, if I run "Report IP status" in a project which uses this IP from the repository, there is no notification that the IP core is out of date.

    Is this expected behavior and can it be avoided?


    This issue occurs when the "Close IP Packager Window" IP Packager option is enabled or when the "Add IP to the IP Catalog of the current project" IP Packager option  is disabled.

    To avoid this issue, ensure that the IP Packager settings for the project are set as follows:

    • "Close IP Packager Window" option is not selected.
    • "Add IP to the IP Catalog of the current project" is selected.


  8. Hello, 

    In the TFT controller part of the lab, where am I expected to see the output? I have added the lab3_vga.c as my source file in the application project. I'm not sure about what to do next. 

    1. Connect the VGA cable to the board and Monitor, configure the FPGA and run the application after you have created the hardware correctly, incorporating the TFT IP and its pin (location) constraints. You should be able to see colored bars on the screen (go through the C code to see how it does that). 

  9.  I'm unable to select the application path. There is no .elf file in my project. What could be the potential issue?

    1. .elf is the actual executable which runs on the ARM Cortex A9 processor. The .elf file should be generated as long as the application build in SDK is successful. You can use the search/browse option to find the .elf. Normally, it should show the .elf once you click search/browse. If it doesn't, you will have to browse to the directory containing the .elf file (which is easy to find, if need be using the search function in Windows Explorer).

      1. Could there be any reason why the application build in SDK is unsuccessful, given that my project does not contain any .elf file? I have tried browsing for it; it should generally be within the sdk→debug folder in the project. I also tried relaunching my SDK multiple times. But for some unknown reason I'm unable to generate the file. I can post a few more screenshots if that helps to understand the problem.

        I even tried creating the project from scratch twice but in vain. 

        1. If the build is unsuccessful, you will be able to see the reasons in the SDK console when you build (just like how it is for any IDE). If it is successful, it will clearly show that xxxx/elf has been created. General reasons include syntax errors in C-code, not having the relevant dependency projects in the workspace, not including the appropriate header files etc.

          You could try cleaning and rebuilding the project.

          This is applicable not just for lab 3, but for any eclipse-based project/IDE. 

          1. When I try building my project, ' Nothing to build for project' is printed on the console. I have added my C file in the src folder, so I'm really not sure why it says that. Am I missing something?

            1. If things don't work, it is always better to start with a 'safe' option. Try creating and running the built-in "Hello World" application. If it works, simply copy paste the contents of your C file into the hello world C file, build and run.

  10. Hi,

    In the overall design layout, I notice there are 2 AXI interconnects and only 1 DMA. Since there can be multiple masters and slaves on a single AXI interconnect block, why are 2 AXI interconnects needed instead of 1? And if there are 2 AXI interconnect blocks, why is only 1 DMA block needed instead of 2 as well?


    1. Hi Nicholas,

      The AXI interconnect on the left is used by the ARM processor to access various peripherals - ARM processor (in the PS) alone is the master - the slaves are the various devices such as GPIO, the slave interface of the DMA controller, slave interface of the TFT controller etc (in the PL). The processor can read/write the various registers in these AXI slaves to control them or get information from them (for example, to initiate DMA transfers, read switches connected to GPIO etc).

      The one on the right is used by the master interfaces on AXI DMA and TFT controller (in the PL) to access the DDR memory, as these peripherals access the memory directly - the slave interface is ultimately the memory controller (in the PS). 

      To summarize, in the first case, the master is in PS, slaves are in PL. In the second case, the masters are in PL, the slave is in PS. Hence the need for 2 interconnects.

      The number of DMA controllers is not related to the number of interconnects (except that the presence of DMA controller requires 2 interconnects as it has both master and slave interfaces). A DMA controller takes commands from the CPU (through S_AXI) to either read data from memory (through M_AXI_MM2S) and write it to the co-processor through M_AXIS_MM2S or receive data through S_AXIS_S2MM and write it to main memory through M_AXI_S2MM. The number of DMA controllers depends on the number of co-processors.

  11. Hi,

    What exactly does "width of buffer length register" refer to?


    1. That determines the maximum number of bytes that can be handled in one DMA operation (max number of bytes = 2^width-1). It isn't much of a concern with the low amount of data that we send /receive in lab 3, even the minimum width (8) is good enough.
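
As a quick numerical check of the relation above, the sketch below evaluates 2^width - 1 for a given width. `max_dma_bytes` is an illustrative helper, not a driver function.

```c
#include <assert.h>
#include <stdint.h>

/* Maximum number of bytes per DMA operation for a given
 * "width of buffer length register", using the relation 2^width - 1. */
uint64_t max_dma_bytes(unsigned width_bits)
{
    assert(width_bits >= 1 && width_bits < 64);
    return (UINT64_C(1) << width_bits) - 1;
}
```

For example, a width of 8 gives 255 bytes per operation, and a width of 14 gives 16383 bytes.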

  12. In Lab 1, the co-processor we developed only worked for up to final values of (127*127*4), which would definitely fit in 8 bits of data after dividing by 256.

    However, now that the matrix sizes have changed, the maximum value has now become (255*255*8), which requires 19 bits. Even after dividing by 256, the biggest result could (theoretically) be larger than 8 bits (though this doesn't happen with these particular matrix inputs since all the result values happen to fit in 8 bits due to it representing a probability from 0-1)

    Do we need to modify our co-processor to handle the theoretical maximum 11-bit output value?

    1. For now, we can just assume that no element of the result will exceed 8 bits.

      It is in fact common to use some extra bits for intermediate results to preserve precision. Many arithmetic circuits also have some form of saturation logic to ensure that the result is clamped to the maximum representable value in case the arithmetically correct result exceeds the range. 

      This is the reason why FPGAs with built-in multipliers tend to have sizes that look a bit odd, such as 18x18 or 25x18.

      Teh Nian Fei , in this context, it is also worth mentioning that some applications such as machine learning / neural networks are generally highly tolerant of the lack of precision. There has been a recent surge in interest in for this reason.
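
The saturation idea mentioned in the reply above can be illustrated with a small C model. It assumes the accumulator is divided by 256 and the result must fit in 8 bits; `saturate_to_bits` and `scale_and_saturate8` are illustrative names, and the assignment itself does not require this logic.

```c
#include <assert.h>
#include <stdint.h>

/* Clamp `value` to the largest number representable in `bits` bits. */
uint32_t saturate_to_bits(uint32_t value, unsigned bits)
{
    assert(bits >= 1 && bits <= 32);
    uint32_t max = (bits == 32) ? UINT32_MAX : ((UINT32_C(1) << bits) - 1);
    return (value > max) ? max : value;
}

/* Divide a wide accumulator by 256, then clamp the result to 8 bits. */
uint32_t scale_and_saturate8(uint32_t acc)
{
    return saturate_to_bits(acc >> 8, 8);
}
```

With the theoretical maximum accumulator value of 255*255*8, the scaled result is 2032, which is clamped to 255. The Lab 1 maximum of 127*127*4 scales to 252 and passes through unchanged.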

  13. Another issue: A.csv has some 256 values which are incompatible with the assumption that all values are 8-bit

    1. Oh oops, my bad again. Thanks for noticing. Please replace all 256s by 255s.

  14. Hi Prof, is a recording of Interactive Lecture 3 going to be uploaded?

    1. It is up, though it may take 2-3 hours for YouTube to finish processing it.

  15. Hi all,

    I am stuck in the loop of waiting for transmission in line 109. I suspect the reason is due to my co-processor not accepting data properly. Could the assertion of S_axis_ready within the coprocessor be a reason for this loop? Thanks.

    1. Yes, should be a case of coprocessor not accepting all the words written to the FIFO. Does it happen for the first vector itself or only for the second? If it happens for the first test vector itself, you are likely deasserting s_axis_ready too early (perhaps a wrong number of bits used for the counter counting the number of words received?). A simulation of the coprocessor with the correct input size should help identify the issue.

      1. You could also try commenting out the part where you wait indefinitely, and see whether the coprocessor responds. If it does, the response might give you clues as to which all data it has received. 

        1. My coprocessor is working fine. I decided to check how much of the input was fed into the coprocessor, and it was only 509 elements, which caused my coprocessor to continue waiting for the rest of the inputs. I decided to split the transmission in half and it works all right now. Why is this the case?

          1. The default buffer size is 512. You can increase the buffer size in your AXI Stream FIFO settings and send it in one go. The way you are doing it is perfectly fine as well!

            Thanks for noticing this. I am sure it was a good debugging experience (big grin). I have included it in the wiki for the benefit of others.

  16. Hi, I am met with the error: "Failing in receive complete ... "  after putting in the A+B.csv through realterm and receiving a bunch of zeros.

    This can mean that my hardware is not reacting to my inputs. In the simulation in vivado my code works fine, and all the flags seem to be correct.

    Is there any way to access M_AXIS_DATA/LAST/READY from vitis so that I can see the status of the interface....

    1. Since you're getting some zeros back, perhaps the hardware is giving you back something, which might give you clues regarding what could be wrong. Have you checked ReceiveLength?

      While it's indeed possible to check the various signals, it's not particularly easy to debug using that. If you wish to try, you can Google about "Integrated Logic Analyzer". This has to be inserted, connected and configured properly in Vivado (not Vitis). 

      1. Now there is an improvement. I redid my wrapper from scratch, at least the coprocessor is giving me some good outputs.


        I am getting some multiplications correct, some (about 6 values) are wrong as seen in the screenshot.

        However, in vivado test bench, all are correct and pass the test. 

        Is it possible that the clock speed plays a part here or is there other possible scenarios like the unknown FPGA delays for certain clocks at some point of time?

        1. Maybe you want to refer to my comment above about the 256 values in A.csv (or Rajesh's update in red in the assignment brief above)

          1. Yep, that is indeed the case (smile)

            1. I made a stupid mistake....

              I made the changes (from 256 to 255) to the csv files, then I sent the unedited version through RealTerm...AHHHH.........

              Thank you very much 😁😁😁😁😁😁😁😁😁😁😁

  17. Hi,

    Sometimes I face issues when debugging, such as the debugger launch getting stuck at 40% (launching delegate...), getting stuck at 70% (Launching : dow <elf file location>), and the program getting stuck in boot, with the following output from the XSCT Console: 

    Info: ARM Cortex-A9 MPCore #0 (target 2) Stopped at 0xffffff28 (Suspended)
    _vector_table() at asm_vectors.S: 71
    71: B _boot

    How should I resolve this problem?

    Thank you.

    1. Well, the irony is that most debuggers are buggy. Embedded system debuggers are buggier. 

      Things you could try are - ensuring you build your code just before debugging, programming the fpga afresh (which I guess you are doing by default), clean and rebuild, closing and opening Vitis, trying with a new workspace, rebooting computer, in that order.

  18. Hey, I'd like to check if it is alright if I packed both A and B into a single file instead of sending two separate files? 

  19. Regarding submission, I would like to ask:

    Do we need to submit the archive for the vivado file containing my_ip (Basically the vivado project we generate the xsa wrapper)?

    (Because this file can be 70MB+ easily)

    1. No, it is not the whole Vivado project folder. Just the ip_repo>myip_1.0 folder (the exact location of this depends on whether the packaging was done inside the Lab 2 project or elsewhere), which is usually less than 1 MB. This folder is portable, can be used with other projects too.

      Alternatively, you can also just upload the .xsa file. I have edited the requirements so that uploading the .xsa is OK too.

  20. Rajesh Chandrasekhara Panicker For the practice problems, will the solutions be uploaded? 

    1. It will be uploaded for problem set #2, not for others. I tend to mention a lot more things than just the answers, so I prefer that people learn from the videos (smile)