napari: Slider is slow with time series of large 2d images

šŸ› Bug

For a time series of 8k x 8k float64 images it takes around 200ms to switch slices; for 16k x 16k it takes around 1500ms. In both cases moving the slider through many slices feels slow and laggy.

To Reproduce

import napari
import numpy as np

# 8k images
napari.view_image(
    np.random.random((2, 8192, 8192)), name='two 8k 2d images'
)

# 16k images
napari.view_image(
    np.random.random((2, 16384, 16384)), name='two 16k 2d images'
)

Note: 16384 is the max texture size on a MacBook with an AMD Radeon Pro 5300M (4 GB). Any image larger than that will be downsampled to 16384. So we basically just cannot view images larger than that unless they are multiscale.

Expected behavior

  1. Switching images is much faster.
  2. To the degree it’s not faster, the loading should be incremental and interruptible.

You want to be able to interactively “dial in” what slice you care about, moving the slider freely without delays. It’s okay if the image takes a bit to fully load once you stop moving.

About this issue

  • State: open
  • Created 4 years ago
  • Comments: 23 (23 by maintainers)

Most upvoted comments

No problem @sofroniewn, it’s hard to parse exactly what’s happening; you just know it feels slow and laggy. BTW, here’s a proposed subjective scale for scientific imaging:

60Hz 😁 30Hz 🙂 20Hz 😑 10Hz 😞 5Hz 😔 1Hz 💀

For games 60Hz is more the minimum, but often in scientific viz the content is not moving. The “animation” is only for view manipulation; it’s functional not aesthetic. I think for us 60Hz is great but really 20Hz is probably fine in most cases. But 5Hz is awful.

I will write up the basic ideas we’ve been talking about. I think solving this for 2d images is pretty straightforward. This is stuff people have been doing for decades. We don’t have to invent anything just implement a decent version of it.

Solving it for 3d images, labels, points, shapes and meshes is another matter, but I think we have to start somewhere. I think the machinery we create for 2d images will be necessary for the other problems, but it won’t be sufficient, especially for geometry.

I think the bottom line is chunks/tiles are needed as much for going from RAM to VRAM as they are for going from disk/network to RAM. You might think “well my large image fits in RAM so I’m good” but a 16k x 16k x 32-bit image is 1GB. Any operation you do on 1GB is going to be slow relative to a frame, which is only 16.7ms.

Separately, a full-screen 4k display can only show 8M pixels, not the full 268M pixels. That alone suggests that waiting while that 1GB is copied to the card in order to draw 3% of it is not a good idea, assuming you were fully zoomed in.

I think the very general lesson here is both disk and memory have sectors/blocks/pages. Everything is 4kb or 16kb or some small size. At the level of numpy you can trivially create a giant uniform block of memory, but that’s just an abstraction. Any actual machinery that processes it with real hardware needs to break it down into smaller pieces.

If you drag the slider from 0 to 511 it does not draw all the slices from 0 to 511, that would be horrific. It will skip through drawing just a few slices.

I think this is a key point to nail down here. I’ve heard it feared in previous discussions that this sort of computation backup was happening… but I’ve also wondered whether that’s actually the case, and I agree with @pwinston here (though this seems like a pretty easy thing to “just prove”).

I don’t think the slider draws all 511 slices!!! But it can easily start falling behind through normal usage on things that are slow (and to @jni’s point will always be slow) to “render” (where right now “render” might include some crazy dask computation to “load” the data). For example, if I move the slider, then stop, then move in the other direction, then move again, etc., I can easily get to the point where it looks like many calls to _set_view_slice have been requested and they are all backed up, say 15 calls or something. I think what @jni is asking for at the beginning is really just a queue where we can know that we’ve requested 15 calls and just drop the first 14 and only execute the last one. I’m not sure how hard that is to do as a standalone thing, or if it’s something that will fall out of the proposed tiled rendering scheme. Making sure that everyone is on the same page about the above scenario seems important though.
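The “drop the first 14, execute only the last” idea can be sketched as a tiny latest-only queue. This is a hypothetical standalone version, not napari code:

```python
import threading

class LatestOnlyQueue:
    """Keep only the newest item: stale slice requests get clobbered
    instead of piling up behind a slow consumer."""

    def __init__(self):
        self._lock = threading.Lock()
        self._item = None
        self._ready = threading.Event()

    def put(self, item):
        with self._lock:
            self._item = item        # drop whatever was still pending
            self._ready.set()

    def get(self):
        self._ready.wait()
        with self._lock:
            item, self._item = self._item, None
            self._ready.clear()
            return item

q = LatestOnlyQueue()
for i in range(15):                  # fifteen backed-up slider events...
    q.put(i)
print(q.get())                       # → 14: only the last request survives
```

A real version would have the consumer (the slicing code) loop on `get()` in its own thread, so the producer side (the slider) never waits.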

What I’m calling “tiled rendering” decouples rendering from loading, so rendering never blocks. We only draw tiles that are already in VRAM, so that’s always fast. Today we block rendering while loading stuff into RAM from disk or network; it would no longer do that. And today we block rendering while loading large textures into VRAM; it would no longer do that either.

So in the proposed system rendering just never blocks, it always draws at 60Hz, the slider always moves freely. At the same time though stuff is loaded from disk into RAM and from RAM into VRAM as fast as possible.

I guess not blocking on loading into VRAM is the key that means the scenario above with the slider will always be fast. I think the point @jni was making was that sometimes we’ll have a lazy computation set up to go RAM -> RAM that might be very slow, and we can’t get blocked on that (and want to drop excess calls).

You mention 7k x 7k being fast today. It will be just as fast with tiles, because the tiles will all be in VRAM. Consider a 16k x 16k texture with 512x512 tiles. That’s 1024 tiles so 2048 triangles. A game or simulation will draw millions of triangles per frame. Drawing 2048 triangles probably leaves 98% of the GPU silicon dark and unused!
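Checking that tile arithmetic:

```python
image_px = 16384
tile_px = 512
tiles = (image_px // tile_px) ** 2     # a 32 x 32 grid of tiles
triangles = tiles * 2                  # two triangles per tile quad
print(tiles, triangles)                # → 1024 2048
```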

Yes, I think we will still preserve this speed. I think the experience that @jni liked with the 7k x 7k image will be even better, as we will progressively load it (I think?) so that you see something low-res right away, then fairly quickly the high-res thing, and then all the panning/zooming is still being done on the GPU. I think what @jni wants to avoid is my rather poor “multiscale” code, which right now forces you to go back to the CPU every time you pan/zoom to fetch a new “tile”.

I think I need to draw up diagrams/docs for what I mean by “tiled rendering” here. Basically a proposal for what I think we need to fix this issue and #845. It’s a big project and certainly we want to make sure everyone understands it and is on board. On the other hand we don’t want to do a big design up front, but we do need to agree it’s worth doing.

I think a few diagrams would help @pwinston. I’m pretty excited about how all these conversations are going, but there’s a lot going on so definitely good to make sure we’re all keeping up and understanding.

It’d be helpful to understand Neuroglancer and BigDataViewer and maybe some others, although that’s challenging; the code can be very dense. Sometimes it’s good to just get started on something, then dive into the other packages for ideas once we start to hit issues and have more context. There’s a chance you need to rewrite things, but that’s not horrible if you learned a lot. Of course, if we can just ask around, that’s probably worth it.

I can somewhat picture a totally generic octree where “images” is kind of a plugin. A plugin that at a minimum specifies how to store images at any node (2d or 3d) and how to downsample them. For images downsampling is literally just downsampling, but for other types of data it could be something totally different. So then you can create new “plugins” for other types of data. Mesh decimation can be super involved and complicated, like 100x harder than downsampling images, but you could maybe start simple.
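A minimal sketch of that “plugin” idea, where the per-datatype piece is just the downsample function passed in (all names here are hypothetical; the image plugin is a plain 2x2 mean, and a mesh plugin would pass a decimator instead):

```python
import numpy as np

def mean_downsample(img):
    # Image "plugin": 2x2 block averaging; assumes even dimensions.
    h, w = img.shape
    return img.reshape(h // 2, 2, w // 2, 2).mean(axis=(1, 3))

def build_pyramid(data, downsample, min_size=512):
    """Build coarse levels of a tree, delegating the datatype-specific
    work to the `downsample` callable."""
    levels = [data]
    while max(levels[-1].shape) > min_size:
        levels.append(downsample(levels[-1]))
    return levels

pyramid = build_pyramid(np.zeros((4096, 4096), np.float32), mean_downsample)
print([lvl.shape for lvl in pyramid])
# → [(4096, 4096), (2048, 2048), (1024, 1024), (512, 512)]
```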

One not totally obvious thing you can do is figurative types of “downsampling”. So you don’t try to be visually accurate at all, but at the higher levels you have bounding boxes or blobs or transparent rectangles. Like they do with maps sometimes: zoomed out you see giant aggregated circles, but if you zoom in you see the actual points:

[Screenshot: map with aggregated circle markers that resolve into individual points when zoomed in]

But that’s a side point. The main idea I think is an octree with “plugins” for each datatype, where the plugins can get more sophisticated over time.

The hardest thing about Raveller’s quadtree was that there was one operation where the user could actually modify the pixels. The pain there was that it invalidated every level of the quadtree; you had to recompute them all on the fly. If we had to do that for meshes, it could get seriously complicated.

Performance Monitoring Results

Using the performance monitoring stuff from #1262, here’s what it looks like to move between two 8k images using the “next” button:

[Animated capture: napari-slow-8k]

Selected Times In Above Diagram:

step              8k time (ms)   16k time (ms)
overall                    208            1551
Dims.set_point              85             714
Paint                      114             818
1 - data.astype             64             625
2 - clim[0]                 43             578
3 - data /=                 22             105
flush_commands              39             125

Where:

  1. The data.astype conversion from float64 to float32 in napari._vispy.VispyImageLayer._on_data_change
  2. The line data = (data - clim[0]) which triggers a copy in vispy.visuals.ImageVisual._build_texture
  3. The line data /= clim[1] - clim[0] in vispy.visuals.ImageVisual._build_texture
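A rough sketch of what those three passes cost, and how the second one might be folded into an in-place operation. This is a simplification of the vispy path with hypothetical variable names, and a deliberately small array:

```python
import numpy as np

clim = (0.2, 0.8)
data = np.random.random((1024, 1024))   # float64 source slice

# Today's path: three separate full-size passes over the data
d32 = data.astype(np.float32)           # 1. copy, float64 -> float32
d32 = d32 - clim[0]                     # 2. another full-size copy
d32 /= clim[1] - clim[0]                # 3. in-place, but a third pass

# Fewer allocations: convert once, then normalize in place on the
# float32 copy using ufunc `out=` so no extra arrays are created
out = data.astype(np.float32)
np.subtract(out, clim[0], out=out)
np.divide(out, clim[1] - clim[0], out=out)
```

This only removes the extra copy in step 2; whether the remaining passes can also be avoided (e.g. by normalizing in the shader, or converting to float32 at load time) would need to be checked against what vispy supports.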

Notes:

  1. The last line flush_commands is waiting on the card to draw. This alone is maybe okay for 8k but way too slow for 16k. It’s not clear how much of this is just the card drawing a texture that big, and how much is other overhead that vispy is adding. Would have to be investigated.
  2. Converting from float64 to float32 is really slow. Could we do this on load?
  3. Not sure about the two clim-related lines, but they are slow; need to explain what’s happening there and whether it can be avoided or sped up.

Performance Today

Ideally the total time to switch slices is under 16.7 ms (60Hz) but 50ms (20Hz) might be pretty reasonable. Where we are today:

Image Size   Over 60Hz Goal   Over 20Hz Goal
8k                      12X               4X
16k                     93X              31X

So we are 93 times too slow for a 16k image if we want it to go at 60Hz.

Tiled (Chunked) Rendering

If we can speed these things up that’s great and will help a lot. Beyond that though I suspect ultimately we need a tiled renderer here just as much as we do with multi-scale.

It’s tempting to think we have two different types of data in napari: multi-scale (big) and in-memory (small). But really I think all data, whether in-memory or not, needs to be treated as if it were big.

For multi-scale data that’s chunked on disk the path is Disk/Network -> RAM -> VRAM. There tiles benefit us in both hops. For in-memory data the path is just RAM -> VRAM, but tiles are just as critical for that one hop. And unlike disk/network where we can do stuff with threads, as far as I know paging to VRAM must be done in the main thread. We can only page a small amount of data each frame, so it has to be done in chunks.
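That per-frame paging might look something like this sketch. All names are hypothetical, and `upload` stands in for whatever actually copies one tile’s texture to the GPU on the main thread:

```python
import numpy as np

FRAME_BUDGET_BYTES = 4 * 2**20        # assumption: ~4 MiB of uploads per frame

def upload_some(pending_tiles, upload):
    """Upload tiles until the per-frame budget is spent; the remainder
    carries over to the next frame, so drawing never waits on one
    giant copy."""
    spent = 0
    while pending_tiles and spent < FRAME_BUDGET_BYTES:
        tile = pending_tiles.pop(0)
        upload(tile)                  # e.g. one glTexSubImage2D-sized copy
        spent += tile.nbytes
    return pending_tiles              # still-pending tiles for next frame
```

With 1 MiB tiles this uploads four tiles per frame and leaves the rest for the next frame, which is what makes the loading interruptible.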

Image Sizes and Tile Sizes

Image Size            Memory   Number of 512x512 tiles   Number of 256x256 tiles
8k x 8k x 4 bytes      256MB                       256                      1024
16k x 16k x 4 bytes   1024MB                      1024                      4096

Tile Size             Memory
256 x 256 x 4 bytes   0.25MB
512 x 512 x 4 bytes      1MB

Because of squaring it’s not intuitive just how big these big images are. A 16k x 16k image is 1024X bigger than a 512x512 tile. That alone is kind of surprising.

256MB or 1024MB is just a lot of data to move around in RAM or send to the GPU as one solid block. It’s much easier and better to move 0.25MB to 1MB chunks. In a tight loop, moving a lot of data in small chunks won’t be much slower than a single big move, but it will be vastly more granular and interruptible.

Also to cover a 4k screen (3840 x 2160) you only need 12% of the 8k image or 3% of the 16k one, assuming you have downsampled imagery. So rendering the full 8k or 16k is overkill, you are sending all the data to the GPU just so it can downsample it.
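The screen-coverage numbers check out:

```python
screen = 3840 * 2160               # ~8.3M pixels on a 4k display
img_8k = 8192 * 8192               # ~67M pixels
img_16k = 16384 * 16384            # ~268M pixels
print(f"{screen / img_8k:.0%}")    # → 12%
print(f"{screen / img_16k:.0%}")   # → 3%
```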

Benefits of Tiles

  1. Incremental streams of small updates instead of one huge one.
  2. If downsampled versions are available, we can fill the screen with far less data.

Recommendations

  1. We fix the 32-bit conversion and the clim stuff. This will help a lot.
  2. We create a tiled renderer that works for these in-memory cases just like it does for #845 and multi-scale.

Re the slider performance, here’s what I think is happening with the “one intermediate point”:

  • start sliding. This immediately triggers set_view_slice for the neighboring slice, but because it takes (say) 5 seconds per slice and runs on the main thread, the UI blocks. You’re moving the mouse but the slider is stuck at its original spot.
  • as soon as it unblocks, it moves to the current mouse position, and takes another 5s to render.

In other words, it’s always moving to the current mouse position, and is one step behind if you’re moving.

Re

Don’t do IO or random compute in the render thread.

I think this is low-hanging fruit. Here’s the line where the image gets instantiated and blocks the UI when there’s IO/compute associated with it:

https://github.com/napari/napari/blob/c04484fc08694d23b4790387c3618d9fb4fad00f/napari/layers/image/image.py#L527

and then here’s where it gets set to data, which triggers a vispy draw:

https://github.com/napari/napari/blob/c04484fc08694d23b4790387c3618d9fb4fad00f/napari/layers/image/image.py#L538-L539

From what I can tell, we should have the first line trigger compute/IO on a new thread, and the Future get saved somewhere handy. When a new Future gets added to this 1-element queue, the older future is canceled and clobbered. Then there can be a 60Hz poll on whether the Future is done and the data_raw setting part can happen. Am I missing some steps @pwinston?
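A minimal sketch of that scheme, assuming a `concurrent.futures` worker and a GUI-side poll. All names here are hypothetical; a real version would drive `poll` from a ~60Hz Qt timer and feed the result to the data_raw setter:

```python
from concurrent.futures import ThreadPoolExecutor

class SliceLoader:
    """Keep only the newest load Future; older pending loads are
    cancelled and clobbered when a new request arrives."""

    def __init__(self):
        self._executor = ThreadPoolExecutor(max_workers=1)
        self._future = None

    def request(self, load_fn, index):
        if self._future is not None:
            self._future.cancel()      # drop the stale request if still queued
        self._future = self._executor.submit(load_fn, index)

    def poll(self):
        """Called at ~60Hz from the GUI; returns data once it's ready."""
        if self._future is not None and self._future.done():
            fut, self._future = self._future, None
            if not fut.cancelled():
                return fut.result()
        return None
```

Note `Future.cancel` only succeeds if the task hasn’t started; a load already running completes but its result is simply discarded, which matches the “1-element queue” behavior described above.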

With digital they’ve shot some movies at 48Hz and 60Hz and it changes things a lot. But some people say they look awful and “fake”!

Yep, I am totally among those. 😂 And don’t get me started on televisions with “fancy” interpolation to make movement smoother!

As an aside about frame rate, here is NVIDIA arguing you need their most expensive cards to run popular games at 240Hz to improve your K/D (kill-to-death-ratio): https://www.nvidia.com/en-us/geforce/news/geforce-gives-you-the-edge-in-battle-royale/

Once I saw a demo of a specialized display that ran at 600Hz. He had a 120Hz and 300Hz version and with his demo you could see 600Hz was in fact clearly better. But the demo was of rapidly turning fan blades! So the ideal frame rate depends highly on what you are looking at.

Related to that, movies are historically 24Hz, and what most people don’t realize is that this highly constrains what shots you can do. If panning, you have to pan at a certain slow rate or it will look awful. Movie people just know this and plan accordingly. With digital they’ve shot some movies at 48Hz and 60Hz and it changes things a lot. But some people say they look awful and “fake”!

This implies that for OpenGL you can use multiple threads but there’s really no point; it doesn’t help performance because “there’s usually just 1 GPU”: https://www.khronos.org/opengl/wiki/OpenGL_and_multithreading

But this implies you can do it in some cases but it’s complicated: https://developer.apple.com/library/archive/documentation/GraphicsImaging/Conceptual/OpenGL-MacProgGuide/opengl_threading/opengl_threading.html

I think you can use multiple threads in WebGL and Neuroglancer does.

At any rate I think these 2 design goals are so basic they’d benefit us while using any API:

  1. Don’t do IO or random compute in the render thread.
  2. Break up large resources into smaller resources.

I’m going to focus on vispy+opengl only, but I just think this is heading in a good direction regardless. I’ll include links like these in the references. It’d be nice to have a good rendering page that links to various rendering-related things.

@tlambert03 : I love this.

I know it’s not the topic of this thread, but it may come up with asynchronous loading of remote chunked data sources (zarr): image tearing is terrible, too.

[This entire conversation brings back memories of poor CATMAID rendering performance. It used to be ~10Hz @ ~1080p, with tiles “tearing” as they loaded. With WebGL, it can do ~30Hz @ 4K.]