edgetpu: Installation failing on Raspberry Pi CM4 for PCI-E driver
Following the installation guide for the M.2, I get several compilation errors when it’s trying to install gasket. Here is the log of the make process: gasket-make.log
It seems it’s mostly the same few errors:
invalid use of undefined type ‘struct msix_entry’
implicit declaration of function ‘writeq_relaxed’; did you mean ‘writel_relaxed’
implicit declaration of function ‘readq_relaxed’; did you mean ‘readw_relaxed’
implicit declaration of function ‘pci_disable_msix’; did you mean ‘pci_disable_sriov’
This is using gcc version 8.3.0 using the latest Raspbian with Kernel 5.4.51-v7l+
Unsure whether this is compiler, kernel header or code issues.
About this issue
- State: open
- Created 4 years ago
- Comments: 93 (13 by maintainers)
Ok,
I’ve committed my changes as a fork of google-coral/libedgetpu and google/gasket-driver here:
https://github.com/ghollingworth/libedgetpu https://github.com/ghollingworth/gasket-driver
I’ve tried to keep my changes to an absolute minimum, so they’re easy to understand
To force the CM4 to use a window of 0x01000040 you just need to copy the pcie-32bit-dma-overlay-dts and change the dma-ranges to:
But as I said above, this will help for a few accesses, but then the upper 32 bits of the page table address registers change to something else and this no longer works (it can still be used as a proof of concept).
So, having spent a little time understanding this problem I’ve come to the following conclusions:
Doing a downstream 64bit access (either read or write) through the PCIe interface is broken into two 32bit accesses incorrectly such that the least significant word is repeated. So if you do
*(unsigned long long *) addr = 0x12345678aabbccdd;
It’ll actually write 0xaabbccddaabbccdd to the address. This is only in the downstream direction (root to device); the upstream memory bus is not affected (i.e. when the device does an access to main memory it doesn’t matter what width of data is used).
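The fault described above can be modelled in a few lines of C. This is only a software sketch of the observed behaviour, assuming the repeated-low-word description is accurate; the function names are hypothetical and none of this is driver code.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical software model (not driver code): returns the value the
 * device would end up holding after a 64-bit downstream write, given the
 * observed fault where the least significant word is repeated. */
static uint64_t model_broken_write64(uint64_t value)
{
    uint64_t low = value & 0xffffffffULL; /* least significant 32 bits */
    return (low << 32) | low;             /* low word lands in both halves */
}

/* Reproduces the example from the comment above. */
static void model_broken_write64_example(void)
{
    assert(model_broken_write64(0x12345678aabbccddULL) ==
           0xaabbccddaabbccddULL);
}
```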
This means that a possible solution would instead be to do the following:
*(unsigned long *) addr = 0xaabbccdd;
*(unsigned long *) (addr+4) = 0x12345678;
This will work correctly through the downstream interface sending two separate 32bit writes to the hardware.
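A slightly more portable sketch of the same split uses fixed-width types, since `unsigned long` is only 32 bits wide on the 32-bit kernel. The helper names are hypothetical, and an ordinary array stands in for the device register so the sketch can run in user space; a real driver would go through the kernel’s MMIO accessors instead.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper: perform a 64-bit register write as two separate
 * 32-bit stores, least significant word first. */
static void write64_as_two_32(volatile uint32_t *reg, uint64_t value)
{
    reg[0] = (uint32_t)(value & 0xffffffffULL); /* offset +0: low word  */
    reg[1] = (uint32_t)(value >> 32);           /* offset +4: high word */
}

/* Hypothetical helper: read the register back as two 32-bit loads. */
static uint64_t read64_as_two_32(const volatile uint32_t *reg)
{
    uint64_t low = reg[0];
    uint64_t high = reg[1];
    return (high << 32) | low;
}

/* A plain array stands in for the memory-mapped register here. */
static void write64_as_two_32_example(void)
{
    uint32_t fake_reg[2] = {0, 0};

    write64_as_two_32(fake_reg, 0x12345678aabbccddULL);
    assert(fake_reg[0] == 0xaabbccddu);
    assert(fake_reg[1] == 0x12345678u);
    assert(read64_as_two_32(fake_reg) == 0x12345678aabbccddULL);
}
```

The word order (low first) matches the two stores shown above; whether the device tolerates the opposite order is not something this sketch can tell you.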
Doing this I’ve been able to get further with the code such that it seems to correctly set up the registers in the Google Coral device. The problem occurs when the device tries to access data from main memory (in the upstream direction). It does this through some page tables on the device which first need to be programmed up to point to the scattered user data in physical memory.
These page table registers/memories on the Google Coral device cannot be written to 32bits at a time, for example, if you write the previous values as two 32bit accesses (least significant word first) to one of these page table registers and read back the data you get:
print("%04x", *(unsigned long *) addr);
0xyyyyyyyy
print("%04x", *(unsigned long *) (addr+4));
0x12345678
Or if you write most significant word first you get:
print("%04x", *(unsigned long *) addr);
0xaabbccdd
print("%04x", *(unsigned long *) (addr+4));
0xyyyyyyyy
Whichever word gets written second will be correct, but the other word looks like something hanging around in a latch somewhere (0xyyyyyyyy just means it’s not absolutely certain what this number will be). I did find that, in general, if you only wrote to the lower 32 bits, the upper 32 bits got set to 0x01000040.
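To make the readback pattern concrete, here is a toy C model of one page table register as described above. It is a guess at the visible behaviour only, not at the actual hardware design; the struct, the function names, and the `stale` parameter (standing in for the unpredictable 0xyyyyyyyy value) are all hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical model of one 64-bit page table register as observed in
 * this thread: a 32-bit write stores its own half correctly but leaves
 * the other half holding a value the writer did not choose. */
struct pt_reg_model {
    uint32_t lo;
    uint32_t hi;
};

/* `stale` stands in for the unpredictable 0xyyyyyyyy value; the real
 * hardware gives no guarantee about what it will be. */
static void pt_model_write32(struct pt_reg_model *reg, int upper_half,
                             uint32_t value, uint32_t stale)
{
    if (upper_half) {
        reg->hi = value;
        reg->lo = stale;
    } else {
        reg->lo = value;
        reg->hi = stale;
    }
}

/* Writing only the lower word, with the upper word ending up as the
 * 0x01000040 value that was reported in general. */
static void pt_model_example(void)
{
    struct pt_reg_model reg = {0, 0};

    pt_model_write32(&reg, 0, 0xaabbccddu, 0x01000040u);
    assert(reg.lo == 0xaabbccddu);
    assert(reg.hi == 0x01000040u);
}
```

In this model no sequence of 32-bit writes can leave both halves under the writer’s control, which is why the split-write workaround does not help for these registers.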
So I changed the upstream PCIe window to be 0x01000040 and forced the driver to only use simple page table entries (rather than extended page table entries)…
Suddenly, I was getting some upstream accesses working… Around about the first couple of kB of accesses get through correctly! RESULT!
Actually no, because soon after that the most significant word changes to something else, it seems that the 0x01000040 is not fixed (kind of obvious really) and suddenly the PCIe root just fails the transaction and you end up in the same place.
So, this now requires a Google Coral hardware engineer to look at the verilog around the page table AXI bus interface to see whether it is possible somehow to set both words using only 32bit accesses. This may be possible using some internally programmable master that can write to those registers (a DMA controller or processor of some type).
Will create a pull request against google-coral/libedgetpu and mbrooksx/gasket-driver with my changes, limited to just what is required to at least get this far…
I completely agree about the potential with the combination. At this point, it looks like an irreparable hardware issue with the antiquated CM4 PCIe module. I have forced all the allocations into simple mapping (see above for more info about this) so that all the virtual addresses are 32-bit, as well as previously setting all reads/writes to 32-bit. However, the device itself (in hardware) makes reads/writes in the coherent cache, and all of these reads/writes are 64-bit.
For now, the plan is to wait until the office is open so we can use a PCIe analyzer and confirm this hypothesis. But there don’t appear to be any additional changes that we can make in SW: the device expecting a host to be able to perform 64-bit reads/writes is built into the hardware.
USB is still the recommendation for the CM4. USB2.0 is possible out of the box, and USB3.0 may be possible although extra design considerations are required (more info here: https://coral.ai/products/accelerator-module/).
@timonsku : Yes, I’m actively working with the people in the Pi forum discussion. While MSI-X isn’t technically supported by the BCM2711, as you saw from that patch if SW indicates it works then the PCIe hardware is actually able to map some MSI-X interrupts correctly.
We’ve validated further than you have (including MSI-X). Your errors are because you’re building for the 32-bit kernel but the driver expects 64-bit read/write (which is why writeq/readq don’t exist). My plan is to customize the driver for the Pi (including 32-bit workarounds) and likely submit it to the Pi kernel rather than trying to update our DKMS package. Will keep you informed of the status.
@SamueldeFaria could you maybe specify which product you chose instead? This may be helpful for others who follow, or who stumble on this issue in the future.
It would be so sad if it would never be possible to use the Coral Boards via PCIE on the CM4. The combo is the perfect high performance - low power - compact formfactor - multi camera - mainline kernel supported - embedded inference platform. Please please find a way to make it useable.
incredible this issue is still open after 2 years
Any updates on this issue?
There was a tentative plan to investigate further when a PCIe Analyzer became available. Have these tests been done?
Thanks!
Although I cannot help, I come here every day hoping to see that they can work together ^^
I unfortunately don’t have an estimated date. The CM4 PCIe hardware is antiquated, and there are endless hacks required to try to have it operate competently (note that the TPU is a PCIe bus master, and I don’t see any evidence of a bus master ever being tested with the CM4). We haven’t been receiving the support needed from the Pi team, so for now it’s continuing to try things to understand the issues with communication (at this point it seems an issue with the shared memory). It may be within the next few weeks for operation (in which case I would post the hacked up version for your evaluation while we decide the best way to release this without polluting the main Coral codebase). I will keep this thread up to date.
Depending on the board configuration, USB may be a better choice.
If someone at Google is working on it, or is going to, it would be nice to get a very rough ETA (weeks, months) on when we can expect to know whether or not the TPU will ever work over PCIe on a CM4. I’ll be creating a new revision of my product’s PCB in a few weeks, and if there’s very little chance the PCIe TPU will work anytime soon, I’ll have to switch both to USB.
I’d like to add some clarifications to the comments from SamueldeFaria above. The reason the Coral team could not provide host board design support for the CM4 and Coral Accelerator is that we won’t know whether a given board design will work until the board is physically made and tuned, given the PHY trace reason already stated. We asked what product a developer is making, or what customer they serve, not to obtain confidential product information or out of any nefarious intention, as the comment might have implied, but simply to get a sense of whether the product has a compelling use case or an important customer that might merit extra assistance from our engineering team. Because of the sheer volume of support requests we receive about using the CM4 with the Coral Accelerator, we use this information to assess and prioritize the inquiries, so that we can allocate limited engineering support resources to help as many key customers as possible. Of course, anyone who is not comfortable telling us what they are developing does not need to reply to our follow-up.
Regardless, some customers have successfully implemented a USB3 interface with the Coral Accelerator Module in their product designs, since the PCIe interface with the CM4 is out of the question at the moment. One of them is Upverter, who offer board design tools and services to other customers who’d like to leverage their experience and expertise in designing CM4-based host boards with the Coral Accelerator integrated. Anyone who would like to try a USB3 interface can also try their tools and services. Thanks!
According to Gumstix and Google, their solution is CM4 -> PCIe -> USB3 -> Coral TPU, so you still get the fast performance instead of CM4 -> USB2 -> Coral TPU. Gumstix is using ASM1142 USBH1 to Coral Accelerator Module.
As long as this issue is still open, there’s no working solution yet for CM4 -> PCIe -> Coral TPU.
Thanks for your kind response. My questions are: 1 - Why don’t you provide complete and professional-looking datasheets like all the other manufacturers? 2 - What’s the idea behind providing that link? A company that designs hardware, as we do? Thanks but no thanks. I’m glad I didn’t provide any information after all.
The product is working now with a solution from another manufacturer. They were really helpful and provided all the documentation needed. No client ID, product type/idea, production numbers or … Will look to them again if needed in another product. As far as I’m concerned, I will not look to you for any other hardware solutions you may have.
Has anyone had a go at this? I’ve done a bit of debugging and hacking myself and got the kernel module to load and libedgetpu to start an inference (although it never finishes, some event is missing, and there is an HIB error?).
There are some changes needed in both the kernel module and the user-space drivers, so far primarily replacing 64bit memory accesses with two 32bit ones. My progress is here for the module which I have updated to the latest version from the dkms package and here for libedgetpu, but these changes are of course nowhere near merge-quality.
This is what libedgetpu logs:
Also the only interrupt firing seems to be the fatal error one:
I’m just going to do a shameless plug here: https://github.com/will127534/Coral-USB3-M2-Module A full opensourced design with CTS test passed.
No it doesn’t… It’s no more promising than any of the other products that will not work due to hardware limitations of the BCM2711 and the Google Coral device.
Designed an M.2 card with the Coral Accelerator Module that seems to work fine with the Piunora CM4 baseboard. Test suggestions are welcome.
This has nothing to do with this issue. PCIe is used on the RPi4 for USB3 and needs customization to make use of it. I think your best chance is still to get the USB Accelerator, which uses USB3.
At Google I/O 2021, the Coral team announced companies they were working with to develop TPU projects; Gumstix was one. Gumstix has a Pixhawk development board which uses Coral and the CM4. I asked the Coral team if that unit works (since it uses the PCIe interface to talk to the TPU). Their response was:
“Unfortunately, we haven’t been able to run the TPU on a 32-bit system (estaban - am assuming this means any?) . Please refer to this issue: https://github.com/google-coral/edgetpu/issues/280 (estaban - this posting). The CM4 has a 32-bit bus, and despite changing both the driver and userspace (see bug for links to GitHub repos with those changes) - the device still is hardcoded to issue 64-bit operations. We expect that the 32-bit host simply omits the upper word, leading to invalid read/writes (as reflected in the HIB error).”
The bottom of the email has the following bug report fields: Status: In-progress; Priority: Medium; Status Detail: Assigned.
Not solved but maybe not dropped… and maybe “no” 32-bit processor can use it. Seems important for an Edge TPU.
R/Estaban
@n1mda - Coral and CM4 are a no go. Coral seems to work on the Pi 5 (and hopefully the CM5 when it is released), as it has a more compliant PCIe bus.
Really appreciate you spending so much time and energy on this,
Thanks Manoj. I have posted the query on the sales link, but it usually takes a long time to hear back from them. Since this is a bit urgent, I was hoping someone from Google here can connect me to the right team to take this forward.
@vebmaster - So far I haven’t been able to get that compute module (nor Pine64’s SOQuartz) to get to a state where I can test it but TPUs are some of the first devices I’m planning on testing!
Will this problem occur wherever the “Raspberry Pi CM4” is used, or are there exceptions?
Did I understand correctly that at the moment (December 2021) it makes no sense to buy “Raspberry Pi CM4” to use “Coral edge Mini PCIe” and “Coral edge m.2”?
There hasn’t been an update to the gasket driver (which I’ve now moved to https://github.com/google/gasket-driver) or libedgetpu to enable the 32-bit operation required by the CM4. The only way we’re aware of to communicate between the CM4 and the TPU is via USB3.
This can be accomplished by starting with a known-good design from Gumstix and customizing in Upverter (based on this board) or by reaching out to Coral Sales to discuss how to build your own USB3 design.
As for the performance of USB3 + CM4, here is the models_benchmark output on the Gumstix PoE camera (this can be compared to the tested CTS runs). You’ll see it significantly outperforms the USB2 design (Dev Board Mini) but is slightly slower than the x86 USB3 (due to the more powerful CPU) or the Dev Board (due to the slight latency added by the PCIe-USB bridge on the CM4 design).
The accelerator module supports PCIe, USB3, and USB2. USB3 requires working with us on extra design considerations to ensure that the design will work properly. As we validate more designs, we may make this information generally available, but we want to ensure it can work across many designs (instead of setting people up for failure).
As for performance, it’s a significant difference. I would recommend referring to the CTS outputs. The bottom of the CTS outputs contains benchmarks; specifically, I’d compare the Dev Board (A53 + PCIe) and Dev Board Mini (A35 + USB2). While the Dev Board is PCIe, it’s a fairer comparison than x86+USB (also in the CTS outputs) because of the much faster CPU time on the beefier platform.
@julled - Yeah this new overlay is unrelated (and frankly I can’t believe that overlay is actually useful). Thanks for finding the source.
Choosing to believe this is still possible… here are my current DMESG and libedgetpu logs: (Kernel: 5.10.23-v8+ (aarch64) with gasket/apex modules and libedgetpu from mbrooksx’s repos, custom Buildroot rootfs)
DMESG
libedgetpu (verbosity=10)