bevy: entity_ref `get_component_with_type` is slow

What problem does this solve or what need does it fill?

I’m trying to get component data from EntityIds (Vec<Entity>) with world.get(entity)… I’m experiencing an “unexpected” performance drop (compared to Vec<Rc>).

Looking at the code of World::get<T: Component>(&self, entity: Entity), there is so much indirection… Getting the component index from a hashmap, then going through three levels of indirection to get the component “Column”, plus a branch on the storage type.

What solution would you like?

Is it possible to somehow … prefetch/cache that archetype table_row/column?

That’s just too much… That’s waaaay slower than just the cache misses with Rc.

Additional context

I use Vec<Entity> not that often… But sometimes you just have to use it …

About this issue

  • Original URL
  • State: closed
  • Created 3 years ago
  • Reactions: 1
  • Comments: 28 (12 by maintainers)

Most upvoted comments

Just took a look. Let’s start with:

entity_iter

First: there is a methodology issue. This is an apples-to-oranges comparison because you’re running the entire scheduling infrastructure for the Query benchmark and comparing that to a single raw vec iterator. The scheduling infrastructure is cheap to run, but it will still register in a small way for benchmarks like this.

Additionally, even after accounting for that, this is still apples to oranges because the Vec benchmark doesn’t do change detection. Functionally, change detection performs the same as iterating and writing to another component (because it needs to fetch the tracking state and write to it when Mut<Point> is mutated).

After disabling change detection (currently a manual process of commenting out the relevant code, but will ultimately be configurable per-component once we switch to #[derive(Component)]) I get:

(benchmark results image)

Then, after removing the Schedule/Stage/System in favor of iterating a raw Query, I get:

(benchmark results image)

After righting the scales, we are actually faster than iterating a raw Vec (by an extremely small margin). Despite the fact that we provide way more functionality than a simple Vec, we still come out on top because the extra work required to accomplish that functionality amortizes to zero.

The test:

fn entity_iter_raw_query_bench(criterion: &mut Criterion) {
    // Setup world
    let mut world = World::default();
    let mut query = world.query::<&mut Point>();

    // Setup test entities
    for i in 0..ENTITIES_COUNT {
        world.spawn().insert(Point {
            x: i as u32,
            y: i as u32,
        });
    }

    // Run systems
    criterion.bench_function("entity iter", |b| {
        b.iter(|| {
            query.for_each_mut(&mut world, |mut point| {
                point.x += 1;
                point.y += 1;
            })
        })
    });
}

Additionally, it seems like there is a general misunderstanding about how relevant numbers like 2x are in this context. Thanks to the cache-friendly nature of both implementations, we are talking about extremely small timescales here. The second you start doing anything meaningful inside these iterators, that work will dwarf the iteration costs. We get a ~30% increase in cost just by throwing in some more ops:

    criterion.bench_function("entity iter", |b| {
        b.iter(|| {
            query.for_each_mut(&mut world, |mut point| {
                point.x += (10 - point.y).pow(4);
                point.y += 1;
            })
        })
    });

(benchmark results image)

Doing any amount of work on these scales registers in a big way.

From the perspective of a game engine, the benefits of change detection far outweigh the 2x cost (by enabling us to optimize / skip expensive operations each frame and opening up extremely useful coding patterns). The cost won’t meaningfully register for anything but the most resource constrained benchmarks and in the very near future you will be able to easily opt out for those scenarios.

Let’s segue into the next benchmark:

entity_get

After accounting for change detection, here are the results:

(benchmark results image)

box_unique is ~4.4x faster than entity. First, even if this were an apples-to-apples comparison, these numbers would be acceptable because, as we’ve previously covered, we’re working on very small timescales here. We are also quite competitive when compared to other ECS implementations. So why isn’t box_unique a fair comparison to ECS get(Entity) operations?

Let’s compare the inner loops of the two benchmarks:

  • box: iterate an array of boxes and deref each box
    • for each item (a box), to find the component we need to do the following:
      1. deref a pointer
  • entity: iterate an array of entities and look up each entity in the query
    • for each item (an entity), we need to:
      1. map it to a location in an archetype (this is an array lookup, which is a pointer offset then a pointer deref)
      2. validate that the location is a valid entity (comparison and a branch)
      3. get the pointer to the storage for a specific component in the archetype (a sparse set lookup, which amounts to two array lookups, which equals two pointer offsets and two pointer derefs)
      4. get the component from the storage pointer (which amounts to a pointer offset and a pointer deref)

This amounts to the totals:

  • box: 1 pointer deref
  • entity: 4 pointer derefs and 4 pointer offsets

And that doesn’t even account for the extra branching to ensure Entity and sparse set lookups are valid. In total, the 4.4x number makes perfect sense!

If you don’t think too hard about it, it sure seems like we’re doing a lot of unnecessary work. We should just use Boxes! What were we thinking?

Let’s do some small (zero cost) refactors so we can start building our Box ECS:

  • Box indices are now Entity(usize)
  • PointBoxes is now Components<T> { components: [Box<T>; BOXES_COUNT] } (in practice this would actually be a Vec<T> for faster iteration)

Now let’s pretend we want Entity(0) to have components [A, B] and Entity(1) to have [B]. How will we record this?

Here is the simplest path forward:

Use an option (or null pointer) to indicate whether a specific item in the array exists (ex: Vec<Option<T>> or [MaybeUninit<Box<T>>; BOXES_COUNT]). This is known as a “Sparse Array”. But now we’ve created a couple of problems relative to PointBoxes:

  • All Component<T> arrays must be sized to accommodate the entire entity space. They are no longer “densely packed”. If Component A has 10 entities and Component B has 2 entities, both arrays need to be length 10.
    • This memory usage problem can be fixed by storing only valid components in a dense Vec<T>, then having another Vec<Option<usize>> array to determine whether or not the entity has component T and where it is located in the dense Vec. This is called a “Sparse Set” and is a popular ECS variant.
  • When looking up an entity’s component, we now need to branch to determine whether or not the entity has that component.
  • When iterating multiple “sparse arrays”, we no longer have linear component iterations (we need to either “hop around” the arrays using sparse sets or do the extremely expensive operation of checking each component index individually for each component)

Ex: The specs ECS “vec storage” is basically a sparse array. I had benchmarks handy that compare its entity lookups to Bevy’s

(benchmark results image)

Funnily enough, specs is about 4x faster than bevy_system, which makes sense given what we just talked about / the performance of your box benchmark!

However sparse sets have significant downsides:

  1. If you want to do multiple component lookups for an entity, you need to branch for each component type.
  2. Iterating sparse sets (especially when iterating the entities from multiple components at the same time) is expensive. Way more branching and hopping around in a cache-unfriendly way. This is slow (when compared to linear dense vec iterators).

The solution to this is “archetypal ecs”.

Entities belong to a specific archetype, with per-archetype storages that are densely packed and perfectly aligned.

This adds the indirection of finding the archetype, but it means:

  1. You only need to find the archetype once. After that, you already have the component storage locations and you don’t need any more branching to check if the storages contain the components. This means that the more components you add to query.get(entity), the faster it gets relative to sparse sets.
  2. Iterating can be a cache friendly linear scan. No need to hop around sparse sets!

Archetypal and Sparse Set ECS are the two most popular ECS implementations. They are a result of the best minds in the business putting their heads together and making informed compromises. They each have their tradeoffs, but they provide functionality that a simple Vec<T> cannot. It is literally a limitation in the laws of physics. You cannot have the flexibility of ECS without more branching and indirection. If you want the performance of an array, you are stuck with the rigidity of an array. If you think you can do better … awesome! Just be aware that this is an extremely competitive space and that we are extremely competitive in this space (https://github.com/rust-gamedev/ecs_bench_suite). Don’t expect to find anything but small / marginal wins (or wins that come at the cost of other benchmarks / features).

@cart is this write up in a dev blog somewhere? I learned so much just reading it.

Not currently. I might ultimately consolidate this and other things like the Bevy ECS V2 description (which is basically a blog post by itself). But that takes time and I’ve got renderer code to write right now 😄

For a given world borrow and set of entity lookups using that borrow, we can assume the lookup state is stable because nothing can add/remove entities or components during that time. But we can’t store that in QueryState, which exists outside of a given World borrow.

We might be able to cache the Fetch and Filter instances inside of System Queries though (because components cannot be added / removed in a way that invalidates these pointers when the system is running). I can’t assert that is valid off the top of my head though.