napari: Slider is slow with time series of large 2d images
🐛 Bug
For a time series of 8k x 8k float64 images it takes around 200ms to switch slices. For 16k x 16k it takes around 1500ms. In both cases, moving the slider through many slices feels slow and laggy.
To Reproduce
```python
import numpy as np
import napari

# 8k images
napari.view_image(
    np.random.random((2, 8192, 8192)), name='two 8k 2d images'
)

# 16k images
napari.view_image(
    np.random.random((2, 16384, 16384)), name='two 16k 2d images'
)
```
Note: 16384 is the max texture size on a MacBook with an AMD Radeon Pro 5300M (4 GB). Any image larger than that will be downsampled to 16384. So we basically just cannot view images larger than that unless they are multiscale.
Expected behavior
- Switching images is much faster.
- To the degree it's not faster, the loading should be incremental and interruptible.
You want to be able to interactively "dial in" what slice you care about, moving the slider freely without delays. It's okay if the image takes a bit to fully load once you stop moving.
About this issue
- State: open
- Created 4 years ago
- Comments: 23 (23 by maintainers)
No problem @sofroniewn, it's hard to parse exactly what's happening, you just know it feels slow and laggy. BTW, here's a proposed subjective scale for scientific imaging:
60Hz - 30Hz - 20Hz - 10Hz - 5Hz - 1Hz (ranging from great at the top end down to awful at the bottom)
For games 60Hz is more the minimum, but often in scientific viz the content is not moving. The "animation" is only for view manipulation; it's functional, not aesthetic. I think for us 60Hz is great but really 20Hz is probably fine in most cases. But 5Hz is awful.
I will write up the basic ideas we've been talking about. I think solving this for 2d images is pretty straightforward. This is stuff people have been doing for decades; we don't have to invent anything, just implement a decent version of it.
Solving it for 3d images, labels, points, shapes and meshes is another matter, but I think we have to start somewhere. I think the machinery we create for 2d images will be necessary for the other problems, but it won't be sufficient, especially for geometry.
I think the bottom line is chunks/tiles are needed as much for going from RAM to VRAM as they are for going from disk/network to RAM. You might think "well, my large image fits in RAM so I'm good", but a 16k x 16k x 32bit image is 1GB. Any operation you do on 1GB is going to be slow relative to a frame, which is only 16.7ms.
Separately, a full screen 4k display can only show 8M pixels, not the full 268M pixels. That alone suggests waiting while that 1GB is copied to the card in order to draw 3% of it is not a good idea, assuming you were fully zoomed in.
I think the very general lesson here is both disk and memory have sectors/blocks/pages. Everything is 4KB or 16KB or some small size. At the level of numpy you can trivially create a giant uniform block of memory, but that's just an abstraction. Any actual machinery that processes it with real hardware needs to break it down into smaller pieces.
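To make that concrete, here is a minimal sketch (names hypothetical, not napari code) of decomposing a big numpy image into fixed-size tiles. The slices are zero-copy views, so only the code that actually processes each tile touches real memory, one small piece at a time:

```python
import numpy as np

def iter_tiles(image, tile=512):
    """Yield (y, x, view) for each tile of a 2D image.

    The views are zero-copy slices into the big array; the work of
    moving bytes happens only when a tile is actually consumed.
    """
    h, w = image.shape
    for y in range(0, h, tile):
        for x in range(0, w, tile):
            yield y, x, image[y:y + tile, x:x + tile]

image = np.zeros((8192, 8192), dtype=np.float32)
tiles = list(iter_tiles(image))
# An 8k x 8k image decomposes into a 16 x 16 grid of 512 x 512 tiles,
# i.e. 256 independently processable pieces.
```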
I don't think the slider draws all 511 slices! But it can easily start falling behind through normal usage on things that are slow (and to @jni's point, will always be slow) to "render" (where right now "render" might include some crazy dask computation to "load" the data). For example, if I move the slider, then stop, then move in the other direction, then move again, etc., I can easily get to the point where it looks like many calls to `_set_view_slice` have been requested and they are all backed up, say 15 calls or something. I think what @jni is asking for at the beginning is really just a queue where we can know that we've requested 15 calls and just drop the first 14 and only execute the last one. I'm not sure how hard that is to do as a standalone thing, or if it is something that will fall out of the proposed tiled rendering scheme. Making sure that everyone is on the same page about the above scenario seems important though.
I guess not blocking on loading into VRAM is the key: that means the scenario above with the slider will always be fast. I think the point @jni was making was that sometimes we'll have a lazy computation set up to go RAM -> RAM that might be very slow, and we can't get blocked on that (and want to drop excess calls).
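A minimal sketch of that "drop the first 14, execute only the last" idea, using a hypothetical single-slot queue (illustrative only, not napari's implementation):

```python
import threading

class LatestOnlyQueue:
    """Single-slot queue: a new request clobbers any pending one,
    so a backlog of 15 slider moves collapses to just the last."""

    def __init__(self):
        self._lock = threading.Lock()
        self._item = None
        self._event = threading.Event()

    def put(self, item):
        with self._lock:
            self._item = item          # silently drop whatever was pending
        self._event.set()

    def get(self):
        self._event.wait()             # block until something is pending
        with self._lock:
            item, self._item = self._item, None
            self._event.clear()
        return item

q = LatestOnlyQueue()
for step in range(15):                 # 15 backed-up slice requests
    q.put(step)
assert q.get() == 14                   # only the most recent survives
```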
Yes, I think we will still preserve this speed. I think the experience @jni liked with the 7k x 7k image will be even better, as we will progressively load it (I think?) so that you see something low res right away, but then fairly quickly the high res thing, and then all the panning/zooming is still being done on the GPU. I think what @jni wants to avoid is my rather poor "multiscale" code, which right now forces you to go back to the CPU every time you pan/zoom to fetch a new "tile".
I think a few diagrams would help @pwinston. I'm pretty excited about how all these conversations are going, but there's a lot going on, so it's definitely good to make sure we're all keeping up and understanding.
It'd be helpful to understand Neuroglancer and BigDataViewer and maybe some others, although that's challenging; the code can be very dense. Sometimes it's good to just get started on something, then dive into the other packages for ideas once we start to hit issues and you have more context. There's a chance you'll need to rewrite things, but that's not horrible if you learned a lot. Of course, if we can just ask around, that's probably worth it.
I can somewhat picture a totally generic octree where "images" is kind of a plugin: a plugin that at a minimum specifies how to store images at any node (2d or 3d) and how to downsample images. For images downsampling is literally just downsampling, but for other types of data it could be something totally different. So then you can create new "plugins" for other types of data. Mesh decimation can be super involved and complicated, like 100x harder than downsampling images, but you can maybe start simple.
One not totally obvious thing you can do is figurative types of "downsampling", where you don't try to be visually accurate at all, but at the higher levels you have bounding boxes or blobs or transparent rectangles. Like they do with maps sometimes: zoomed out you get giant aggregated circles, but if you zoom in you see the actual points.
But that's a side point. The main idea I think is an octree with "plugins" for each datatype, where the plugins can get more sophisticated over time.
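A toy sketch of that plugin idea, with the tree collapsed to a flat pyramid of levels for brevity (all names hypothetical). The generic machinery only knows how to build levels; the per-datatype behaviour lives entirely in the `downsample` function you pass in:

```python
import numpy as np

def downsample_image(a):
    # the "image plugin": plain 2x2 mean downsampling
    h, w = a.shape
    return a.reshape(h // 2, 2, w // 2, 2).mean(axis=(1, 3))

def build_pyramid(data, downsample, min_size=512):
    """Generic multi-level structure; swap in a different
    `downsample` (mesh decimation, point aggregation, ...)
    for other datatypes."""
    levels = [data]
    while max(levels[-1].shape) > min_size:
        levels.append(downsample(levels[-1]))
    return levels

levels = build_pyramid(np.zeros((4096, 4096), dtype=np.float32),
                       downsample_image)
# levels have shapes 4096 -> 2048 -> 1024 -> 512
```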
The hardest thing about Raveller's quadtree was that there was one operation where the user could actually modify the pixels. The pain there was that it invalidated every level of the quadtree; you had to recompute them all on the fly. If we had to do that for meshes, it could get seriously complicated.
Performance Monitoring Results
Using the performance monitoring stuff from #1262, here is what it looks like to move between two 8k images using the "next" button:
Selected Times In Above Diagram:
Where:
- `data.astype`: conversion from `float64` to `float32` in `napari._vispy.VispyImageLayer._on_data_change`
- `data = (data - clim[0])`: triggers a copy in `vispy.visuals.ImageVisual._build_texture`
- `data /= clim[1] - clim[0]`: in `vispy.visuals.ImageVisual._build_texture`
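As a rough illustration of why those lines cost so much (a sketch, not vispy's actual code): the `astype` and the `clim` subtraction each make a separate full-size pass and allocation. A single ufunc call with an output `dtype` can merge the conversion and the subtraction into one pass, halving the allocations:

```python
import numpy as np

clim = (0.2, 0.8)
data = np.random.random((2048, 2048))        # float64 source array

# Path resembling the trace above: two full-size allocations.
a = data.astype(np.float32)                  # allocation 1 (dtype conversion)
a = a - clim[0]                              # allocation 2 (the copy in _build_texture)
a /= clim[1] - clim[0]                       # in place, but still a full pass

# One ufunc call converts and subtracts in a single pass.
b = np.subtract(data, clim[0], dtype=np.float32)
b /= clim[1] - clim[0]

assert b.dtype == np.float32
assert np.allclose(a, b)
```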
Notes:
- `flush_commands` is waiting on the card to draw. This alone is maybe okay for 8k but way too slow for 16k. It's not clear how much of this is just the card drawing a texture that big, and how much is other overhead that vispy is adding. Would have to be investigated.
- Not sure exactly what's happening in the `clim` related lines, but they are slow; need to explain what's happening there and whether it can be avoided or sped up.
Performance Today
Ideally the total time to switch slices is under 16.7ms (60Hz), but 50ms (20Hz) might be pretty reasonable. Where we are today:
So we are 93 times too slow for a 16k image if we want it to go at 60Hz.
Tiled (Chunked) Rendering
If we can speed these things up that's great and will help a lot. Beyond that, though, I suspect we ultimately need a tiled renderer here just as much as we do with multi-scale.
It's tempting to think we have two different types of data in napari: multi-scale (big) and in-memory (small). But really I think all data, whether in-memory or not, needs to be treated as if it were big.
For multi-scale data that's chunked on disk the path is Disk/Network -> RAM -> VRAM; there, tiles benefit us in both hops. For in-memory data the path is just RAM -> VRAM, but tiles are just as critical for that one hop. And unlike disk/network, where we can do stuff with threads, as far as I know paging to VRAM must be done in the main thread. We can only page a small amount of data each frame, so it has to be done in chunks.
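A sketch of what budgeted RAM -> VRAM paging could look like, with a stand-in for the GPU upload and a made-up 4MB per-frame byte budget (all names hypothetical):

```python
import numpy as np
from collections import deque

FRAME_BUDGET_BYTES = 4 * 2**20   # hypothetical per-frame upload budget: 4 MB

upload_queue = deque()           # tiles waiting to go RAM -> VRAM

def enqueue_image(image, tile=512):
    """Split an image into tiles and queue them for upload."""
    h, w = image.shape
    for y in range(0, h, tile):
        for x in range(0, w, tile):
            upload_queue.append(image[y:y + tile, x:x + tile])

def on_draw(upload_to_gpu):
    """Run once per frame on the main thread: upload tiles until the
    byte budget is spent, then draw with whatever is resident."""
    spent = 0
    while upload_queue and spent < FRAME_BUDGET_BYTES:
        t = upload_queue.popleft()
        upload_to_gpu(t)         # stand-in for e.g. a texture sub-image update
        spent += t.nbytes

enqueue_image(np.zeros((8192, 8192), dtype=np.float32))  # 256 MB image
frames = 0
while upload_queue:
    on_draw(lambda t: None)      # no-op upload for this sketch
    frames += 1
# the 256 MB upload spreads across many short frames
# instead of one long stall
```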
Image Sizes and Tile Sizes
Because of squaring, it's not intuitive just how big these big images are. A 16k x 16k image is 1024x bigger than a 512 x 512 tile. That alone is kind of surprising.
256MB or 1024MB is just a lot of data to move around in RAM or send to the GPU, and a lot to move as one solid block. It's much easier and better to move 0.25MB to 1MB chunks. In a tight loop, moving a lot of data in small chunks won't be much slower than a single big move, but it will be vastly more granular and interruptible.
Also, to cover a 4k screen (3840 x 2160) you only need 12% of the 8k image or 3% of the 16k one, assuming you have downsampled imagery. So rendering the full 8k or 16k is overkill: you are sending all the data to the GPU just so it can downsample it.
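The arithmetic behind those ratios, spelled out:

```python
# Pixel-count arithmetic behind the percentages above.
screen_4k = 3840 * 2160          # ~8.3 M pixels visible on a 4k display
img_8k = 8192 ** 2               # ~67 M pixels
img_16k = 16384 ** 2             # ~268 M pixels

pct_8k = round(100 * screen_4k / img_8k)     # 12 (% of the 8k image visible)
pct_16k = round(100 * screen_4k / img_16k)   # 3  (% of the 16k image visible)
n_tiles = img_16k // (512 * 512)             # 1024 tiles in a 16k image
```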
Benefits of Tiles
Recommendations
- Speed up or avoid the `clim` stuff. This will help a lot.
Re the slider performance, here's what I think is happening with the "one intermediate point":
In other words, it's always moving to the current mouse position, and is one step behind if you're moving.
Re
I think this is low-hanging fruit. Here's the line where the image gets instantiated and blocks the UI when there's IO/compute associated with it:
https://github.com/napari/napari/blob/c04484fc08694d23b4790387c3618d9fb4fad00f/napari/layers/image/image.py#L527
and then hereās where it gets set to data, which triggers a vispy draw:
https://github.com/napari/napari/blob/c04484fc08694d23b4790387c3618d9fb4fad00f/napari/layers/image/image.py#L538-L539
From what I can tell, we should have the first line trigger compute/IO on a new thread, and the Future get saved somewhere handy. When a new Future gets added to this 1-element queue, the older future is canceled and clobbered. Then there can be a 60Hz poll on whether the Future is done and the data_raw setting part can happen. Am I missing some steps @pwinston?
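A rough sketch of that cancel-and-clobber pattern using `concurrent.futures` (the names and the 1-element "queue" are illustrative, not napari code):

```python
import time
from concurrent.futures import ThreadPoolExecutor

executor = ThreadPoolExecutor(max_workers=1)
pending = None  # the 1-element "queue" of in-flight slice loads

def request_slice(load_fn):
    """Cancel any in-flight load and submit the newest one."""
    global pending
    if pending is not None:
        pending.cancel()            # no-op if it has already started running
    pending = executor.submit(load_fn)
    return pending

def slow_load(i):
    time.sleep(0.05)                # stand-in for IO / dask compute
    return i

# Simulate 15 rapid-fire slider moves backing up behind a slow load.
futures = [request_slice(lambda i=i: slow_load(i)) for i in range(15)]
result = futures[-1].result()       # only the last request must finish
assert result == 14
assert any(f.cancelled() for f in futures[:-1])
```

A real version would poll `pending.done()` from a ~60Hz timer on the main thread instead of blocking on `result()`.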
Yep, I am totally among those. And don't get me started on televisions with "fancy" interpolation to make movement smoother!
As an aside about frame rate, here is NVIDIA arguing you need their most expensive cards to run popular games at 240Hz to improve your K/D (kill-to-death ratio): https://www.nvidia.com/en-us/geforce/news/geforce-gives-you-the-edge-in-battle-royale/
Once I saw a demo of a specialized display that ran at 600Hz. The presenter had 120Hz and 300Hz versions too, and with his demo you could see that 600Hz was in fact clearly better. But the demo was of rapidly spinning fan blades! So the ideal frame rate depends highly on what you are looking at.
Related to that, movies are historically 24Hz, and what most people don't realize is that this highly constrains what shots you can do. If panning, you have to pan at a certain slow rate or it will look awful. Movie people just know this and plan accordingly. With digital they've shot some movies at 48Hz and 60Hz and it changes things a lot. But some people say they look awful and "fake"!
This implies that for OpenGL you can use multiple threads but there's really no point; it doesn't help performance because "there's usually just 1 GPU": https://www.khronos.org/opengl/wiki/OpenGL_and_multithreading
But this implies you can do it in some cases, though it's complicated: https://developer.apple.com/library/archive/documentation/GraphicsImaging/Conceptual/OpenGL-MacProgGuide/opengl_threading/opengl_threading.html
I think you can use multiple threads in WebGL and Neuroglancer does.
At any rate, I think these 2 design goals are so basic they'd benefit us while using any API:
I'm going to focus on vispy+OpenGL only, but I just think this is heading in a good direction regardless. I'll include links like these in the references. It'd be nice to have a good rendering page that links to various rendering-related things.
@tlambert03 : I love this.
I know it's not the topic of this thread, but it may come up with asynchronous loading of remote chunked data sources (zarr): image tearing is terrible, too.
[This entire conversation brings back memories of poor CATMAID rendering performance. It used to be ~10Hz @ ~1080p, with tiles "tearing" as they loaded. With WebGL, it can do ~30Hz @ 4K.]