docarray: Initialization and construction of a `Document` and `DocumentArray` takes too long
Describe the bug
Initializing and constructing `Document` and `DocumentArray` objects takes far longer than initializing a plain list:
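from time import time

from docarray import BaseDocument, DocumentArray


# MyDoc was not shown in the original report; this minimal definition is assumed
class MyDoc(BaseDocument):
    number: int


n_docs = 1_000_000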
# list initialisation
start_time = time()
numbers = [i for i in range(n_docs)]
duration = time() - start_time
print(f"duration list init = {duration}")
# DocumentArray initialisation
start_time = time()
numbers = DocumentArray[MyDoc]([MyDoc(number=i) for i in range(n_docs)])
duration = time() - start_time
print(f"duration da init = {duration}")
# construction of list of Documents
start_time = time()
numbers = [MyDoc.construct(number=i) for i in range(n_docs)]
duration = time() - start_time
print(f"duration list construct = {duration}")
# construction of DocumentArray of Documents
start_time = time()
numbers = DocumentArray[MyDoc]([MyDoc.construct(number=i) for i in range(n_docs)])
duration = time() - start_time
print(f"duration da construct = {duration}")
output:
duration list init = 0.035775423049926 s
duration da init = 11.883232831954956 s
duration list construct = 10.953930139541626 s
duration da construct = 9.914959907531738 s
Expected behavior
Quicker.
About this issue
- State: closed
- Created a year ago
- Comments: 15 (15 by maintainers)
I am not sure that this is mature enough tbh
For `id`, we checked with @anna-charlotte and we could remove the [parse_obj](https://github.com/docarray/docarray/blob/13cc6696924033ca93a1ce52bd422a4a1ab93d2b/docarray/base_document/document.py#L21) in the `BaseDocument` class. This should remove most of the overhead. @anna-charlotte will update the benchmark we did offline together.
@anna-charlotte it would be nice to see what kind of speedup pydantic v2 would offer.
You could run the same experiment, but instead of creating a document you create a pydantic `BaseModel` instance using pydantic v2.
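For reference, a minimal sketch of what that experiment might look like with a bare pydantic model. `MyModel` and `timed` are hypothetical names, not part of docarray, and `n_docs` is reduced so it runs quickly. The sketch picks `model_construct` when it exists (pydantic v2) and falls back to `construct` (pydantic v1), so the same script can be run under both versions:

```python
from time import perf_counter

from pydantic import BaseModel


class MyModel(BaseModel):  # hypothetical stand-in for MyDoc, without docarray
    number: int


def timed(factory, n):
    """Build n objects with factory(i) and return (seconds, objects)."""
    start = perf_counter()
    items = [factory(i) for i in range(n)]
    return perf_counter() - start, items


# pydantic v2 exposes model_construct; pydantic v1 only has construct
construct = getattr(MyModel, "model_construct", None) or MyModel.construct

n_docs = 100_000  # smaller than in the report, to keep the run short

# validated initialisation
init_duration, init_items = timed(lambda i: MyModel(number=i), n_docs)
print(f"duration model init = {init_duration:.4f} s")

# construction without validation
construct_duration, construct_items = timed(lambda i: construct(number=i), n_docs)
print(f"duration model construct = {construct_duration:.4f} s")
```

Running this once per pydantic version would give numbers comparable to the ones above, minus whatever docarray adds on top.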
Overall, I think we cannot make this faster. That is on pydantic, and pydantic is doing nothing fancy there, so it is really just Python being slow for these custom objects.
Could you post the results with dataclass as well?
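A rough sketch of how the dataclass comparison could be run using only the standard library. `MyDataclass`, `PlainClass`, and `timed` are hypothetical names chosen for illustration, and `n_docs` is reduced. A hand-written class is included as a baseline to show how much of the cost comes from creating Python objects at all:

```python
from dataclasses import dataclass
from time import perf_counter


@dataclass
class MyDataclass:  # hypothetical dataclass counterpart of MyDoc
    number: int


class PlainClass:  # hand-written baseline with no generated methods
    __slots__ = ("number",)

    def __init__(self, number: int) -> None:
        self.number = number


def timed(factory, n):
    """Build n objects with factory(i) and return (seconds, objects)."""
    start = perf_counter()
    items = [factory(i) for i in range(n)]
    return perf_counter() - start, items


n_docs = 100_000  # smaller than in the report, to keep the run short

results = {}
for label, factory in [
    ("list of ints", int),
    ("dataclass", MyDataclass),
    ("plain class", PlainClass),
]:
    duration, items = timed(factory, n_docs)
    results[label] = items
    print(f"duration {label} = {duration:.4f} s")
```

No timings are quoted here since they depend on the machine and Python version.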