ChatterBot: pymongo.errors.OperationFailure: distinct too big, 16mb cap

When trying to train from and use the Ubuntu Dialog Corpus with the MongoDB Storage Adapter I’m hitting the following exception. The code is pretty much identical to the Ubuntu Corpus example in this repo.

I believe the issue is related to MongoDB not being able to handle strings over 160 characters (which there are in the Ubuntu Corpus). So this should either be resolved somehow, or support should be dropped, as it’s currently “broken”.

Traceback (most recent call last):
  File "./InfraBot.py", line 198, in <module>
    main()
  File "./InfraBot.py", line 194, in main
    bot(args)
  File "./InfraBot.py", line 91, in __call__
    r = self.bot.get_response("are you there?")
  File "/usr/local/lib/python3.6/site-packages/chatterbot/chatterbot.py", line 114, in get_response
    statement, response = self.generate_response(input_statement, session_id)
  File "/usr/local/lib/python3.6/site-packages/chatterbot/chatterbot.py", line 134, in generate_response
    response = self.logic.process(input_statement)
  File "/usr/local/lib/python3.6/site-packages/chatterbot/logic/multi_adapter.py", line 39, in process
    output = adapter.process(statement)
  File "/usr/local/lib/python3.6/site-packages/chatterbot/logic/best_match.py", line 54, in process
    closest_match = self.get(input_statement)
  File "/usr/local/lib/python3.6/site-packages/chatterbot/logic/best_match.py", line 16, in get
    statement_list = self.chatbot.storage.get_response_statements()
  File "/usr/local/lib/python3.6/site-packages/chatterbot/storage/mongodb.py", line 275, in get_response_statements
    response_query = self.statements.distinct('in_response_to.text')
  File "/usr/local/lib/python3.6/site-packages/pymongo/collection.py", line 2030, in distinct
    collation=collation)["values"]
  File "/usr/local/lib/python3.6/site-packages/pymongo/collection.py", line 232, in _command
    collation=collation)
  File "/usr/local/lib/python3.6/site-packages/pymongo/pool.py", line 419, in command
    collation=collation)
  File "/usr/local/lib/python3.6/site-packages/pymongo/network.py", line 116, in command
    parse_write_concern_error=parse_write_concern_error)
  File "/usr/local/lib/python3.6/site-packages/pymongo/helpers.py", line 210, in _check_command_response
    raise OperationFailure(msg % errmsg, code, response)
pymongo.errors.OperationFailure: distinct too big, 16mb cap

About this issue

  • State: closed
  • Created 7 years ago
  • Reactions: 1
  • Comments: 21 (2 by maintainers)

Most upvoted comments

@gunthercox Many thanks for your efforts. ChatterBot is very cool 😃 I do agree about the Ubuntu corpus, though. Since it won’t “work” with ChatterBot at the moment, it’s probably better to remove it from the docs, or at least make folks aware that it won’t work right now. The alternative is, as happened to me, to spend days on training and then realize that it doesn’t work. Again, thanks for your efforts!

The maximum BSON document size is 16 megabytes.
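To get a feel for how quickly a list of distinct strings reaches that cap, here is a rough sketch that adds up the encoded size of a BSON array of strings. The function names are made up for illustration and are not part of ChatterBot or PyMongo. The real `distinct` reply wraps the array in a slightly larger document, so treat the number as a lower bound.

```python
MAX_BSON_BYTES = 16 * 1024 * 1024  # MongoDB's maximum BSON document size


def bson_string_array_size(values):
    """Return the encoded size in bytes of a BSON array of strings."""
    size = 4 + 1  # int32 length prefix + trailing null byte of the array
    for index, value in enumerate(values):
        key = str(index)  # array elements are keyed "0", "1", "2", ...
        size += 1  # element type byte (0x02 = string)
        size += len(key) + 1  # key stored as a null-terminated cstring
        size += 4  # int32 length of the string value
        size += len(value.encode("utf-8")) + 1  # UTF-8 bytes + null terminator
    return size


def fits_in_one_document(values):
    """True if the array alone would stay within the 16 MB cap."""
    return bson_string_array_size(values) <= MAX_BSON_BYTES
```

A corpus with a few hundred thousand distinct utterances of around a hundred bytes each is already past the limit by this estimate.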

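There is a lot of discussion of this, along with workarounds, on Stack Overflow.

Most recommend using GridFS for storing large documents. For this query, the relevant workaround is switching from ‘.distinct’ to ‘.aggregate’ (as mentioned above in this thread too): ‘.distinct’ returns all values in a single result document, which is subject to the 16 MB cap, while ‘.aggregate’ returns a cursor.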

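This is untested, but it would probably look something like this (mongodb.py):

def get_response_statements(self):
    """
    Return only statements that are in response to another statement.
    A statement must exist which lists the closest matching statement in the
    in_response_to field. Otherwise, the logic adapter may find a closest
    matching statement that does not have a known response.
    """
    # response_query = self.statements.distinct('in_response_to.text')  # current
    # aggregate() takes a pipeline (a list of stages) and returns a cursor,
    # so the result is not limited to a single 16 MB document.
    response_query = self.statements.aggregate([
        # Assumption: in_response_to is an array of sub-documents, so it is
        # unwound first to get one text value per group key.
        {'$unwind': '$in_response_to'},
        {'$group': {'_id': '$in_response_to.text'}}
    ])

    _statement_query = {
        'text': {
            # '$in': response_query  # current
            # Each aggregate result is a dict like {'_id': <text>}
            '$in': [result['_id'] for result in response_query]
        }
    }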

    _statement_query.update(self.base_query.value())

    statement_query = self.statements.find(_statement_query)

    # Convert each matching MongoDB document into a statement object
    return [
        self.mongo_to_object(statement)
        for statement in statement_query
    ]
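One caveat with the approach above: the ‘$in’ list is itself sent to the server inside a BSON query document, so a very large list of texts can hit the same 16 MB cap on the way in. A possible way around that is to split the texts into size-bounded batches and run one query per batch. The sketch below is only an illustration. The helper name, the 8 MB budget, and the per-element overhead are all assumptions, not anything from ChatterBot or PyMongo.

```python
from typing import Iterable, Iterator, List

# Assumed budget: half of the 16 MB cap, leaving room for the rest of the query
MAX_BATCH_BYTES = 8 * 1024 * 1024

# Rough per-element BSON overhead (type byte, array key, length, terminator)
PER_ITEM_OVERHEAD = 16


def batch_by_size(texts: Iterable[str],
                  max_bytes: int = MAX_BATCH_BYTES) -> Iterator[List[str]]:
    """Yield lists of texts whose estimated encoded size stays under max_bytes."""
    batch: List[str] = []
    batch_bytes = 0
    for text in texts:
        size = len(text.encode("utf-8")) + PER_ITEM_OVERHEAD
        # Start a new batch when adding this text would exceed the budget.
        # A single oversized text still gets a batch of its own.
        if batch and batch_bytes + size > max_bytes:
            yield batch
            batch, batch_bytes = [], 0
        batch.append(text)
        batch_bytes += size
    if batch:
        yield batch
```

Each batch could then be used as the ‘$in’ value of a separate `self.statements.find(...)` call, with the results concatenated. That costs more round trips but keeps every query document small.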

http://api.mongodb.com/python/current/examples/aggregation.html
http://docs.mongodb.org/manual/reference/gridfs/