docarray: Initialization and construction of a `Document` and `DocumentArray` takes too long

Describe the bug

Initialization and construction of a Document and DocumentArray take far longer than a simple list initialization:

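from time import time

# Assumed import path for the docarray version under discussion (the thread links to docarray/base_document)
from docarray import BaseDocument, DocumentArray

# MyDoc is not defined in the original report; a single int field is assumed here
class MyDoc(BaseDocument):
    number: int

n_docs = 1_000_000
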
# list initialisation
start_time = time()
numbers = [i for i in range(n_docs)]
duration = time() - start_time
print(f"duration list init = {duration}")

# DocumentArray initialisation
start_time = time()
numbers = DocumentArray[MyDoc]([MyDoc(number=i) for i in range(n_docs)])
duration = time() - start_time
print(f"duration da init = {duration}")

# construction of list of Documents
start_time = time()
numbers = [MyDoc.construct(number=i) for i in range(n_docs)]
duration = time() - start_time
print(f"duration list construct = {duration}")

# construction of DocumentArray of Documents
start_time = time()
numbers = DocumentArray[MyDoc]([MyDoc.construct(number=i) for i in range(n_docs)])
duration = time() - start_time
print(f"duration da construct = {duration}")

output:

duration list init         =  0.035775423049926 s
duration da init           = 11.883232831954956 s
duration list construct    = 10.953930139541626 s
duration da construct      =  9.914959907531738 s

Expected behavior

Initialization and construction should be quicker.

About this issue

State: closed
Created a year ago
Comments: 15 (15 by maintainers)

Most upvoted comments

We could consider using something like this crabcrab? It seems to be about an order of magnitude faster than the built-in uuid4, according to their benchmarks: built-in uuid4, mean time: 2.56e-6; their uuid4, mean time: 1.68e-7

I am not sure that this is mature enough tbh

samsja on Mar 1, 2023
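To reproduce the "built-in uuid4, mean time" figure quoted above on your own machine, a minimal sketch along these lines may help. It only measures the standard library's `uuid.uuid4`; the helper name `mean_call_time` is hypothetical, and the result is assumed to be in seconds per call (the quoted benchmark does not state its unit). Any other zero-argument ID generator could be passed in the same way for comparison.

```python
import timeit
import uuid


def mean_call_time(func, number=100_000):
    """Return the mean wall-clock time of one call to func, in seconds."""
    # timeit.timeit returns the total time for `number` calls
    total = timeit.timeit(func, number=number)
    return total / number


if __name__ == "__main__":
    mean = mean_call_time(uuid.uuid4)
    print(f"built-in uuid4, mean time: {mean:.2e} s")
```

The absolute numbers depend on hardware and Python version, so only the ratio between two generators measured on the same machine is meaningful.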

For the id, we checked with @anna-charlotte, and we could remove the [parse_obj](https://github.com/docarray/docarray/blob/13cc6696924033ca93a1ce52bd422a4a1ab93d2b/docarray/base_document/document.py#L21) in the BaseDocument class

This should remove most of the overhead. @anna-charlotte will update the benchmark we did offline together

samsja on Mar 1, 2023

@anna-charlotte it would be nice to see what kind of speed up pydantic v2 would offer.

You could do the same experiment, but instead of creating a document you create a pydantic base model instance using pydantic v2.

Overall I think we cannot make this faster. That is on pydantic, and pydantic is doing nothing fancy there so that is just on python being slow for these custom things.

Could you post the results with dataclass as well?

samsja on Mar 1, 2023
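The dataclass comparison requested above could be sketched roughly as follows, using only the standard library. `MyDataclassDoc` and `time_construction` are hypothetical names introduced for illustration, and the single `number` field is an assumption mirroring the benchmark in the report; this is not the benchmark the maintainers actually ran.

```python
from dataclasses import dataclass
from time import perf_counter


@dataclass
class MyDataclassDoc:
    """Hypothetical stand-in for MyDoc, with one assumed int field."""
    number: int


def time_construction(factory, n):
    """Build a list of n objects via factory(i); return (seconds, items)."""
    start = perf_counter()
    items = [factory(i) for i in range(n)]
    return perf_counter() - start, items


if __name__ == "__main__":
    n_docs = 1_000_000

    # Baseline: plain list of ints, as in the report
    duration, _ = time_construction(int, n_docs)
    print(f"duration list init = {duration}")

    # Plain dataclass instances: no validation, no id generation
    duration, _ = time_construction(lambda i: MyDataclassDoc(number=i), n_docs)
    print(f"duration dataclass init = {duration}")
```

A plain dataclass does no validation and generates no id, so the gap between this figure and the Document timings gives a rough sense of how much of the cost comes from those extra steps rather than from Python object creation itself.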