mongoengine: Mongoengine is very slow on large documents compared to native pymongo usage

(See also this StackOverflow question)

I have the following mongoengine model:

from mongoengine import Document, DateTimeField, DictField

class MyModel(Document):
    date = DateTimeField(required=True)
    data_dict_1 = DictField(required=False)
    data_dict_2 = DictField(required=True)

In some cases the document in the DB can be very large (around 5-10MB), and the data_dict fields contain complex nested documents (dict of lists of dicts, etc…).

I have encountered two (possibly related) issues:

  1. When I run a native pymongo find_one() query, it returns within a second. When I run MyModel.objects.first(), it takes 5-10 seconds.
  2. When I query a single large document from the DB and then access one of its fields, it takes 10-20 seconds just to do the following (see the timing sketch after this list):

     m = MyModel.objects.first()
     val = m.data_dict_1.get(some_key)
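
A minimal way to time the two paths side by side, assuming a local MongoDB and the model above (the db name and the use of the soft-private _get_collection() helper are illustrative):

    import time
    from mongoengine import connect

    connect("mydb")  # db name is illustrative

    coll = MyModel._get_collection()  # the underlying pymongo collection
    t0 = time.time()
    doc = coll.find_one()             # raw BSON decoded to a plain dict
    print("pymongo find_one(): %.2fs" % (time.time() - t0))

    t0 = time.time()
    m = MyModel.objects.first()       # eagerly converts every nested value
    print("MyModel.objects.first(): %.2fs" % (time.time() - t0))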

The data in the object does not contain any references to other objects, so it is not an issue of object dereferencing. I suspect it is related to some inefficiency in mongoengine's internal data representation, which affects both document object construction and field access.

About this issue

  • State: open
  • Created 8 years ago
  • Reactions: 10
  • Comments: 47 (21 by maintainers)

Most upvoted comments

You do whatever you want in your own project 😃 I’ll make sure to check how you dealt with that (compared to pymodm) when I work on that, out of curiosity. I understand the reasons that made you move away from mongoengine, but I would appreciate it if we could keep the discussions in the mongoengine project constructive.

…well…i just got annoyed with mongoengine enough to google what’s what and find this…great.

should be on current versions of pymongo and mongoengine per pip install -U.

here’s my output a la @apolkosnik (the dict and embed result screenshots are omitted here).

console:

    pymongo with dict took 0.06s
    pymongo with embed took 0.06s
    mongoengine with dict took 16.72s
    mongoengine with embed took 0.74s
    mongoengine with dict as_pymongo() took 0.06s
    mongoengine with embed as_pymongo() took 0.06s
    mongoengine aggregation with dict took 0.11s
    mongoengine aggregation with embed took 0.11s

if DictField is the issue, then please, for the love of all that is holy, let us know what to change it to or fix it. watching mongo and pymongo respond almost immediately and then waiting close to a minute for mongoengine to…do whatever it’s doing…is kind of a massive bottleneck. dig the rest of the package, but if this can’t be resolved on the package side…
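
Worth noting: the fast paths in the benchmark above are real queryset methods, and they double as workarounds when you only need raw data. A sketch of both, using the MyModel example from the issue (process() is a placeholder):

    # Skip MongoEngine document construction entirely; yields plain dicts.
    for raw in MyModel.objects.as_pymongo():
        process(raw)

    # Or restrict which fields are fetched and converted.
    m = MyModel.objects.only("date").first()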

Is there a plan to add lazy init to MongoEngine any time soon?

I’ve seen your post many times and also all the references it created in our tickets; I didn’t need you to elaborate. If you don’t like MongoEngine and gave up on improving it, that’s ok, but if we could keep the discussions in the MongoEngine project (thus its Issues) focused on actually improving MongoEngine, that would be more helpful.

@benjhastings If your perf trouble comes from a too-big document… well, there is nothing that can save you right now 😦 I guess the DictField could be improved (or a RawDictField could be created, roughly as sketched below) to do no deserialization at all on the data.
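
No such field exists in MongoEngine today; as a rough sketch of what the maintainer alludes to, a custom field could subclass BaseField and pass values through untouched. The hook names follow MongoEngine's custom-field conventions, but treat this as an illustration, not a drop-in:

    from mongoengine.base import BaseField

    class RawDictField(BaseField):
        """Hypothetical field storing a dict verbatim, skipping the
        per-key deserialization that DictField performs."""

        def to_python(self, value):
            return value  # hand the raw dict back untouched

        def to_mongo(self, value):
            return value  # stored as-is; must already be BSON-compatible

        def validate(self, value):
            if not isinstance(value, dict):
                self.error("RawDictField only accepts dict values")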

@wojcikstefan first of all, thank you for your contributions to Mongoengine.

We are using Mongoengine heavily in production and running into this issue. Is this something you are actively looking into?

The entire approach of eager conversion is potentially fundamentally flawed. Lazy conversion on first access would defer all conversion overhead to the point where the structure is actually accessed, eliminating it completely when no access is made. (Beyond that, such a situation indicates that proper use of .only() and related helpers is warranted.)
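
To illustrate the idea in isolation (this is not MongoEngine's actual machinery), a lazy wrapper can postpone per-value conversion until a key is first read and cache the result:

    class LazyDoc:
        """Sketch: convert raw BSON values only on first access."""

        def __init__(self, raw, convert):
            self._raw = raw          # dict straight from pymongo
            self._convert = convert  # per-value conversion callable
            self._cache = {}

        def __getitem__(self, key):
            if key not in self._cache:
                self._cache[key] = self._convert(self._raw[key])
            return self._cache[key]

Construction then costs nothing; the conversion price is paid per key, and only for keys that are actually touched.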

I completed an initial lazy version myself 5 years ago, with minor deficits corrected later (e.g. from_mongo of an already-cast Document). I hope you don’t mind that I didn’t wait.

@pikeas Yes; with some variance for additional optimization here and further complication there, the underlying mechanism remains “eager”: upon retrieval of a record, MongoEngine recursively casts every element of the document it can to native Python types via repeated to_python invocations.

This contrasts with my own DAO’s approach (if I’m going to be fixing everything, I might as well start from scratch), which is purely lazy: transformers that “cast” (or just generally process) MongoDB values to native values are executed on attribute access, and bypassed by dictionary dereferencing. The Document class’ equivalent from_mongo factory class method only performs the outermost Document object lookup and wrapping. Mine was written after many years of MongoEngine use and frustration with the lack of progress on numerous fronts. Parts are still enjoyably crazy, but at least I can very exactly explain the “crazy” in mine. 😉

Edited to add: Note that a double underscore-prefixed (dunderscore / soft private) initializer argument is available to disable eager conversion. The underlying machinery iteratively utilizes both explicit to_python invocation and indirect invocation via setters (L125), which doesn’t make it much easier to follow. 🙁
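
The comment doesn’t name the argument; in the MongoEngine source it appears to be __auto_convert on BaseDocument.__init__, a soft-private detail that may change between versions, so treat the following as an unverified sketch:

    # Assumption: BaseDocument.__init__ honors a soft-private
    # __auto_convert kwarg; check mongoengine/base/document.py for
    # your installed version before relying on this.
    raw = MyModel._get_collection().find_one()  # plain dict from pymongo
    raw.pop("_id", None)                        # init takes field values only

    m = MyModel(__auto_convert=False, **raw)    # values kept as-is, no casting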

Using my silly simple benchmark, the latest deserialization numbers:

  • MongoEngine 0.20.0: 0.4517s (≈5× longer)
  • Marrow Mongo 2.0 (next): 0.0860s

Admittedly, other areas differ in the opposite direction. Unsaved instance construction is faster under MongoEngine:

  • MongoEngine: 0.0332s
  • Marrow Mongo: 0.2672s (≈8× longer; Marrow Mongo shifts most of the responsibility to the initializer, leaving zero work at save time: no waiting until save for a validation error, for example; make the assignment and you get your ValueError. The Document instance is directly usable as a dictionary with native PyMongo APIs.)

As some absolutely direct <ShamelessSelfPromotion> I’d like to point out again that I also offer an alternative, designed directly to alleviate issues with ME that I’ve encountered or submitted but never had corrected, e.g.:

  • promotion/demotion;
  • no distinction between embedded and top-level documents (there should be none: allow embedding of top-level documents, with collection and active-record behaviour isolated and optional);
  • lazy conversion (not eager, let alone eager sub-findMany and conversion of References, or worse, Lists of References…);
  • minimal interposing (I don’t track document dirty state);
  • inline comparison generating filter documents (an alternative to parametric querying, which is… limiting);
  • extremely rich and expressive type conversions across most field types (ObjectId ~= datetime, but also anything date-like, such as timedelta);
  • 99.27% test coverage, 100% if you ignore two codepaths rarely hit (unless you dir() or star-import specific modules…).

My package even has an opinion on how one should store localized data, something a naive approach harshly penalizes. (Naive being {"en": "English Text", "fr": "Texte Français", …}; don’t do that.)

Marrow Mongo (see also: WIP documentation manual)

Using the parametric helpers, the syntax is nearly identical to MongoEngine’s, keeping most of the same operator prefixes and suffixes to maintain that compatibility:

q1 = F(Foo, age__gt=30)  # {'age': {'$gt': 30}}
q2 = (Foo.age > 30)  # {'age': {'$gt': 30}}
q3 = F(Foo, not__age__gt=30)  # {'age': {'$not': {'$gt': 30}}}
q4 = F(Foo, attribute__name__exists=False)  # {'attribute.name': {'$exists': False}}

Combinable using & and | operators. There are many more interesting things you can do, though. (Direct iteration of filter sets is currently planned.)
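
Based on that description, combining the two styles might look like the following (illustrative; the name field and the merged result are assumptions, not documented output):

    # Merge a comparison-style filter with a parametric one via &.
    q = (Foo.age > 30) & F(Foo, name__exists=True)
    # e.g. {'age': {'$gt': 30}, 'name': {'$exists': True}}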

from datetime import timedelta

# Iterate all threads created or replied to within the last 7 days.
for record in (Thread.id | Thread.reply.id) >= -timedelta(days=7):
    ...