assertj: Improve the performance of `containsExactly` from quadratic to linear complexity

Summary

The performance of the containsExactly function of Iterables is currently quadratic, while it could be linear. It seems to be because of the IterableDiff class it uses, which checks the extra and missing elements with an algorithm suitable for checking the elements in any order. However, this is not necessary, since they have to be present in the exact order and so a more efficient linear solution could look something like this:

boolean compare(expected, actual) {
    if (expected.size() != actual.size()) return false;
    for (int i = 0; i < expected.size(); i++) {
       if (expected.get(i) != actual.get(i)) return false;
    }
    return true;
}

We ran a scalability analysis on various assertions, which you can find here if you are interested.

About this issue

  • Original URL
  • State: closed
  • Created 2 years ago
  • Reactions: 3
  • Comments: 27 (27 by maintainers)

Most upvoted comments

containsExactly is not a good fit for some Set like HashSets (but it’s fine for SortedSets or LinkedHashSet), on this one we expect users to know the semantics of the iterables being tested. If the iteration order is not consistent then the assertion will likely fail and that is fine since the iterable did not honor the assertion contract.

containsExactlyInAnyOrder can be used with HashSet aspointed out by @sarajuhosova.

Plus, it seems that that IterableDiff_Test lacks testing with iterables that aren’t ArrayLists where the ordering could be different, like HashSets.

I agree that we could have used different iterable types but only those with consistent iteration order (so no HashSet).

In my experience we should not rely on HashSet iteration even though it could look consistent, I remember on one project tests that relied on this and failed once we migrated to a newer version of java that changes the underlying HashSet iteration order.

So moving forward, we should continue treating iterable as if they had a consistent iteration order for containsExactly and let users the responsibility to choose the correct assertion for HashSet, what we can do though is to enhance the javadoc with a warning to prefer containsExactlyInAnyOrder over containsExactly for HashSet or any iterable whose iteration order is not predictable.

Good discussion btw!

one comment about the issue is that we want to keep the error message as it is.

it’s probably for another issue, it list was introduced after newArrayList, we now prefer to use list and refactor when we change a piece of code.

Thanks @joel-costigliola ! I’ll probably get to fixing my code during this week after my last midterm.

Rough idea, I’m just unsure on the length check:

public void assertContainsExactly(AssertionInfo info, Iterable<?> actual, Object[] values) {
    checkIsNotNull(values);
    assertNotNull(info, actual);

    List<Object> actualAsList = newArrayList(actual);

    // length check
    if (actualAsList.size() != values.length) {
      IterableDiff<Object> diff = diff(actualAsList, asList(values), comparisonStrategy);
      throw failures.failure(info,
        shouldContainExactly(actual, asList(values), diff.missing, diff.unexpected,
          comparisonStrategy));
    }

    // actual and values have the same number elements but are they equivalent and in the same order?
    int i = 0;
    // use actualAsList instead of actual in case actual is a singly-passable iterable
    for (Object elementFromActual : actualAsList) {
      // if the objects are not equal, begin the error handling process
      if (!areEqual(elementFromActual, values[i])) {
        IterableDiff<Object> diff = diff(actualAsList, asList(values), comparisonStrategy);
        if (diff.differencesFound()) {
          throw failures.failure(info,
            shouldContainExactly(actual, asList(values), diff.missing, diff.unexpected,
              comparisonStrategy));
        } else {
          throw failures.failure(info, elementsDifferAtIndex(elementFromActual, values[i], i, comparisonStrategy));
        }
      }
      i++;
    }
    // If here, then done!
  }

I feel like you could still first check the order and pass if it’s correct. If it’s wrong, you remember the index and only then instantiate diff. It if finds no differences in elements, you return the index you remembered. That way you just move the loop in front of the diff and provide an efficient exit if they are equal.

So then to be clear, given the following two iterables: actual = [0, 1, 2, 3, 4, 5] expected = [9, 9, 0, 1, 2, 3, 4]

What would be the resulting unexpected actual list and missing actual list with these new changes? My guess was: unexpected actual list = [5] missing actual list = [9, 9]

But I’m not so sure anymore since my head’s kinda boggled right now.

On your statement about iterators, it was my understanding that iterators generally follow an order that is based on the implementation, but whether that order is known by the client or not depends on the nature of the data structure. For example, ArrayLists are purposefully designed to have ordering that is clear to the user at all times, while HashSets do not preserve an order that is consistently known by the user in order to prioritize speed.

I’ll have to hold off make any edits until that gets sorted I guess.

Hmm, I second your thoughts on this! But it seems as though our test suite is using exclusively ArrayLists to test both containsExactly and containsExactlyInAnyOrder. Making the distinction is pretty important as you said, so that may be something that @joel-costigliola will have give some input on, since it seems to be more of a design decision that is out of my hands 😄

I guess here the question also is what the difference for unordered datastructures (such as sets) should be between containsExactly and containsExactlyInAnyOrder. I would argue that to keep the meaning of these the same across all data structures, unordered items should only have the second one 🤔 An iterator, though, is ordered (I guess?), and if you are verifying that instead of the map it comes from, there is still the distinction between these two methods.

@nith2001 and I would like to work on this together.