dual core VPU? · Issue #14 · hermanhermitage/videocoreiv

dual core VPU? #14


Open
fanoush opened this issue Feb 8, 2019 · 11 comments

Comments

@fanoush
fanoush commented Feb 8, 2019

Hello. Are there actually two VPU cores? There are several hints about it in different places, but nothing definite. If so, do you know how to start the second one, or how it behaves at boot time? Are the two cores equivalent and able to do the same things?

@christinaa
Collaborator
christinaa commented Feb 27, 2019

Yes, there is a second core. The early code in my firmware (IIRC) actually checks whether it's running on core 0 or core 1; if it's on core 1, it asks cprman to shut it down and then goes into a loop. Since my firmware is only an SPL-type firmware, it does not use the second core. As for the second question, I guess they're equivalent, but I never personally looked into the MMIO ranges responsible for multicore execution, or into any ISA extensions for it.
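
For illustration only, a minimal C sketch of that kind of early check (hypothetical names and structure, not the actual rpi-open-firmware code):

/* Hypothetical sketch of the early core check described above;
 * this is not the actual rpi-open-firmware code. */
void firmware_entry(unsigned int cpuid)
{
    if (cpuid == 1) {
        /* second core: (optionally ask cprman to gate it) then park forever */
        for (;;)
            ;
    }
    /* core 0 carries on with the real init ... */
}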

@fanoush
Author
fanoush commented Feb 28, 2019

Thank you. I later also checked Brcm_Android_ICS_Graphics_Stack.tar.gz released by Broadcom, and there is interesting stuff in brcm_usrlib\dag\vmcsx, e.g. vcinclude\hardware_vc4.h (and everything in vcinclude\bcm2708_chip). vmcsx\helpers also has tons of interesting VPU assembly. Some include files mention VPU0 and VPU1, so yes, there do appear to be two VPU cores after all. I was just surprised that no details are documented here.

BTW, thanks for your firmware; together with the vc4 gcc toolchain it is a very good starting point. I am thinking about porting MicroPython, Espruino, or another small interpreter to the VC4 VPU - the $5 Zero may be good enough for typical 'arduino' stuff even with just the VPU running and the ARM core turned off. Another reason is to have something to poke VC4 registers from - a scripting language for figuring things out more interactively than compiling C fragments over and over.

the early code in my firmware (IIRC) actually checks if it's on core 0 or core 1 and if it's on core 1 it asks cprman to shut it down and then goes into a loop

Could you point me to it? I checked your code before and noticed that the _main function takes an 'unsigned int cpuid', but it runs the uart and sdram init code
https://github.com/christinaa/rpi-open-firmware/blob/master/romstage.c#L130
and _main is called from https://github.com/christinaa/rpi-open-firmware/blob/master/start.s#L117, and I did not see any code there that stops the second core or runs different code based on cpuid (or some other id). That was the second reason I was not sure there really are two cores running. So if that check happens somewhere later, is the sdram, pll and uart init code called twice, once by each VPU core?

BTW, what is your source of VC4 info (if you can/want to answer)? There are a lot of magic constants in your code (pll, sdram setup) which make it hard to figure things out, but you obviously know them from somewhere.

@thubble
Collaborator
thubble commented Feb 28, 2019

The second core is initially not executing (although, based on what Kristina said, it may be powered on?). To start it, simply write the start address to IC1_WAKEUP (0x7e002834) and it will immediately start executing there.

The currently executing core is determined by bit 16 of the value returned by the version instruction (set = core 1, clear = core 0). The first thing the default bootcode.bin executes is this:

version r0          ; read the version/ID word into r0
btest r0, 16        ; test bit 16: which core is this?
bne L_Core1Entry    ; bit set -> branch to the core-1 entry point
;Core0-only code here

As far as I'm aware the 2 cores are identical. There is only 1 vector register file, so all vector code uses mutexes in the default firmware.
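
For illustration, a bare-metal C sketch of the wake-up write described above. The IC1_WAKEUP address is the one quoted in this comment; core1_entry is a hypothetical entry routine, and any extra setup that may be needed beforehand is not covered here.

/* Sketch only: start the second VPU core by writing its entry address to
 * IC1_WAKEUP, as described above.  core1_entry is a hypothetical routine
 * placed at an address core 1 can execute from. */
#include <stdint.h>

#define IC1_WAKEUP ((volatile uint32_t *)0x7e002834)

extern void core1_entry(void);

static void start_core1(void)
{
    *IC1_WAKEUP = (uint32_t)(uintptr_t)core1_entry; /* core 1 starts executing here */
}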

@fanoush
Author
fanoush commented Mar 4, 2019

Oh, thank you. That's interesting. And it is great someone is listening :-)

Also, I have some other questions, unrelated to the topic, that you may know about. How can I control the 128K L2 cache after I enable DRAM and then want to turn it off? Is it always at the same address as at boot? I guess not, since it is possibly just a prepopulated mapping for L2-cached address 0 (?) without any backing store (?).

I saw there is also some bootrom RAM area (?) that could be used for running code. Are there other spare memory-mapped SRAM buffers that could possibly be (ab)used for data or code, like e.g. the memory for USB endpoints or something like that? Basically I am checking how much RAM there is without enabling SDRAM, or when it is put to sleep. Also, the bootcode.bin code starts at a nonzero offset - is the memory above it usable? Why doesn't it start at offset zero, then?

@christinaa
Collaborator
christinaa commented Mar 4, 2019

The entire VC4 side of my firmware runs in the VPU cache, which is 128K if I recall correctly. The ARM stuff runs in SDRAM, since it cannot run in that mode. If you want to load a second-stage firmware (like start.elf) onto the VPU, you would have to copy the bootcode into an SDRAM region without cache (the whole address space is partitioned into 4 "mirrors"; in other words, 2 bits of the address determine the cache properties of that access).
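
As a rough sketch of the "mirrors" idea: the top two bits of a bus address select the cache behaviour of the access. The specific encodings below are an assumption taken from the public BCM2835 documentation, not from this firmware.

/* Assumed alias encodings (from the public BCM2835 docs, not this firmware):
 *   0x0... L1+L2 cached, 0x4... L2 coherent, 0x8... L2 only, 0xC... uncached. */
#include <stdint.h>

#define ALIAS_MASK     0xC0000000u
#define ALIAS_UNCACHED 0xC0000000u

static inline uint32_t to_uncached(uint32_t bus_addr)
{
    /* keep the low 30 bits, force the alias bits to the uncached mirror */
    return (bus_addr & ~ALIAS_MASK) | ALIAS_UNCACHED;
}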

Once the VPU is running in RAM, changing anything about RAM or accessing the SDRAM controller requires an undocumented cache dance before it becomes cooperative (at which point you can manually do stuff like query the MR registers). It is roughly this (writable and executable code). You will have to do that every time you want to do something like reclock SDRAM once you have disabled cache-as-RAM the first time. Also, I think the cache-as-RAM to RAM execution transition requires a similar trampoline (this is much, much smaller than 128KB):

Note: 
   CLSCn = Cache Line Sized Chunk n
   ECLSCn = End of Cache Line Sized Chunk n

[       Start       ]
[SetCond,JmpTo CLSC1]
[      CLSC1        ]
[     FUN_PART      ] <- If condition is not set the stuff below will actually run fully.
[CondJump to  ECLSC1]
[Code/SDRAM disable ] <- Code doesn't exec with cond, just jumped over to prime cache. 
[  New SDRAM param  ] <- Just data, jumped over by code regardless
[      ECLSC1       ]
[      CLSC2        ]
[CondJump to  ECLSC2]
[       Code        ] <- Same, jump over, without executing.
[      ECLSC2       ]
[      CLSC3        ]
[CondJump to  ECLSC3]
[       Code        ] <- Etc ...
[      ECLSC3       ]
...... etc etc ......
[      CLSCn        ]
[CondJump to  ECLSCn]
[      Code         ] <- Will reenable SDRAM when runs
[      ECLSCn       ]
[    UnsetCond      ] <- All the cache aligned chunks are in cache now
[   JmpTo FUN_PART  ] <- Fun part begins: run same code from cache.

During the dance the ARM is stalling, and the VPU will lock up if it accesses any memory while the SDRAM controller is off, hence the need to copy the data into that region. I don't know why bootram is not used for this; either it's too expensive to copy code there and run from there, or it requires fully enabled cache-as-RAM, which may need a lot more teardown.

(I'll note that I did attempt that, and once the SDRAM controller is on, doing the above from bootram locks up when trying to access some of the SDRAM controller MMIOs (especially when trying to use the MR registers), but the above somehow doesn't. Who knows why.)

This hardware is odd.

@fanoush
Author
fanoush commented Mar 7, 2019

There is only 1 vector register file, so all vector code uses mutexes in the default firmware.

This was just explained here and in a follow-up post:
https://www.raspberrypi.org/forums/viewtopic.php?f=29&t=234167#p1432851
so there are 2 register files but only one vector unit.

EDIT: and also here about how it is shared between VPUs
https://www.raspberrypi.org/forums/viewtopic.php?f=29&t=234167#p1438883

@thubble
Collaborator
thubble commented Mar 9, 2019

There is only 1 vector register file, so all vector code uses mutexes in the default firmware.

This was just explained here and in a follow-up post:
https://www.raspberrypi.org/forums/viewtopic.php?f=29&t=234167#p1432851
so there are 2 register files but only one vector unit.

EDIT: and also here about how it is shared between VPUs
https://www.raspberrypi.org/forums/viewtopic.php?f=29&t=234167#p1438883

Ah, thanks - that clears some things up! I noticed in the stock firmware that a lot of VRF-accessing code uses locks ("vclib_obtain_VRF()"). I assumed it was to avoid the 2 cores writing to a single VRF simultaneously, but now that I think about it, 2 threads on the same core would have the same issue.

@DarumasLegs

I have a random question for this audience: I am looking for a way to get super-accurate (sub-millisecond latency) timestamps for individual video frames while running Raspivid and/or Picamera. I am developing a multi-camera video system using RPi Compute Modules and RPi Camera Modules v2.1, and I need to log timestamps from an RTC on the Compute Modules as closely as possible to the time the camera sensor either starts or finishes imaging a frame. As it is now, with Raspivid and Picamera, the presentation timestamps are not fine-grained enough (I need millisecond accuracy), and are only captured when the frame makes it through the GPU to the CPU. I want a signal, either from the camera module directly or from the GPU, as closely as possible to the time the light hits the image sensor and the sensor images each frame. Is this possible?

@phire
Collaborator
phire commented Oct 12, 2019

want a signal [snip] as closely as possible to the time the light hit the image sensor and the sensor imaged each frame.

This isn't really possible. The camera module is a CMOS sensor and there isn't really a single time for either of those events.

Light is collected for many milliseconds, quite possibly a full 16ms (a shorter collection results in less blur, but a less accurate representation of the light). The camera module signals the pixels to start collecting light, and they will keep collecting until the camera module tells them to stop, summing the result into an analog value.

Then there is the scan-out process. With light collection stopped, the camera module scans across each row, one pixel at a time, reading the stored charge with an analog-to-digital converter. These digital values are streamed down the cable to the SoC.

At the highest resolution/framerate, this scan-out process takes a full 16ms (when running in 60fps mode) to read the entire frame of data. Why a full 16ms? Because if it took less time to scan out, the module would support a higher resolution/framerate.

Which brings up another issue... If it takes a full 16ms to collect light, and a full 16ms to scan out a full frame of data, then how is it doing both at the same time?

For a cheap camera module like this, the answer is a rolling shutter. Assuming 60fps again, the first line ends its 16ms of exposure at 0ms, and that line is then scanned out over 0.016ms. At 0.016ms the second line stops its 16ms of exposure, the first line starts its next 16ms of exposure, and the second line is scanned out.

The very last line of the frame won't be scanned out until about 16ms after the first line, and will have collected its light over an almost completely different time period than the first line.

This works fine for still images, but creates a noticeable distortion when objects move quickly across the frame.

All the numbers in this comment are simplified and rounded based on generic CMOS sensors, but will hopefully get my point across.
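
For reference, the same simplified numbers worked out in a tiny C program (assuming a hypothetical 60fps mode with about 1000 active rows):

/* Worked version of the simplified numbers above: a hypothetical 60fps mode
 * with ~1000 active rows. */
#include <stdio.h>

int main(void)
{
    const double fps      = 60.0;
    const double rows     = 1000.0;            /* assumed row count */
    const double frame_ms = 1000.0 / fps;      /* ~16.7 ms per frame */
    const double row_ms   = frame_ms / rows;   /* ~0.017 ms to scan out one row */

    printf("frame period:        %.2f ms\n", frame_ms);
    printf("per-row readout:     %.3f ms\n", row_ms);
    printf("first-to-last skew:  %.2f ms\n", frame_ms - row_ms);
    return 0;
}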

However, for your use case you might want to look into hacked high-framerate recording modes, which sacrifice resolution and noise for much higher framerates; that might get you millisecond resolution.

@DarumasLegs

Thank you for the rapid and thoughtful reply - I really appreciate it!

I understand that frames are imaged line by line with a rolling shutter. My use case only requires 720p at 30fps, and I need timestamps for the frames accurate to within 1/30 s (within one frame). Is it possible to know precisely when the imaging of the first line begins, or alternatively when the imaging of the last line ends? Either when the light is collected or when the scan-out begins? I don't mind a little latency (particularly if it's within 1/30 second) as long as it's constant and deterministic. My software can adjust the times if necessary after the video files are uploaded to my application in the cloud.

@phire
Collaborator
phire commented Oct 13, 2019

Cool - if your use case is OK with a rolling shutter then you should be able to get something working.

One potential option might be to use a really short external synchronisation pulse of light to measure the timing, triggering it with a GPIO.
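
A rough Linux-side sketch of that idea: pulse an LED from a GPIO and record when the pulse happened, so the flash can later be located in the recorded frames. This uses the legacy sysfs GPIO interface; GPIO 17 and the 1ms pulse length are arbitrary choices, and the pin must already be exported and configured as an output.

/* Sketch: fire a short light pulse from a GPIO and timestamp it.
 * Assumes GPIO 17 is already exported and set to "out" via sysfs. */
#include <stdio.h>
#include <time.h>
#include <unistd.h>

static void gpio_write(const char *path, const char *val)
{
    FILE *f = fopen(path, "w");
    if (f) { fputs(val, f); fclose(f); }
}

int main(void)
{
    const char *value = "/sys/class/gpio/gpio17/value";
    struct timespec ts;

    clock_gettime(CLOCK_MONOTONIC, &ts);   /* timestamp of the pulse start */
    gpio_write(value, "1");
    usleep(1000);                          /* ~1 ms flash */
    gpio_write(value, "0");

    printf("pulse at %ld.%09ld s (CLOCK_MONOTONIC)\n",
           (long)ts.tv_sec, ts.tv_nsec);
    return 0;
}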

As for a software solution: unfortunately I know next to nothing about the CSI-2 interface or the ISP block. But take a look at the raspiraw source code.

It directly controls the CSI-2 interface and camera modules via I²C. I think it DMAs camera data directly into userspace memory before writing it to disk, bypassing any processing or latency inherent to the ISP block. It also outputs timestamps with a resolution of several microseconds.

Take a look - either you can use it directly for your use case, or modify it to meet your needs.
