meilisearch: Unclear error when there is not enough space left

Describe the bug

When trying to index documents into a Kubernetes volume that does not have enough space left, Meilisearch returns the following error:

{
  "details": {
    "indexedDocuments": 0,
    "receivedDocuments": 46206
  },
  "duration": "PT512.665928885S",
  "enqueuedAt": "2022-03-17T14:58:24.465745303Z",
  "error": {
    "code": "internal",
    "link": "https://docs.meilisearch.com/errors#internal",
    "message": "I/O error (os error 5)",
    "type": "internal"
  },
  "finishedAt": "2022-03-17T15:06:57.168374834Z",
  "indexUid": "bgg",
  "startedAt": "2022-03-17T14:58:24.502445949Z",
  "status": "failed",
  "type": "documentAddition",
  "uid": 3
}

This error might also happen outside of a Kubernetes environment, but this has not been tested.

To Reproduce

Steps to reproduce the behavior:

  1. Create a Meilisearch instance in a Kubernetes cluster and use a persistent volume to store the data.ms
  2. Index documents until the volume is full

Expected behavior

A clear, documented error should be returned.

Meilisearch version: v0.26.0


EDIT from @curquiza

How?

When Meilisearch does not have enough space left on the machine, you get the following error:

{
    "message": "I/O error (os error 5).",
    "code": "internal",
    "type": "internal",
    "link": "https://docs.meilisearch.com/errors#internal"
}

We want to replace

  • the code (not the type) internal with no_space_left_on_device
  • the link #internal with #no_space_left_on_device

⚠️ This is what the specification already mentioned and it looks like Meilisearch does not follow it yet: https://github.com/meilisearch/specifications/blob/main/text/0061-error-format-and-definitions.md#no_space_left_on_device
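For illustration only, here is a minimal sketch (hypothetical names, not Meilisearch's actual internals) of how a single error-code value can drive both fields, so that introducing a no_space_left_on_device code automatically produces the matching #no_space_left_on_device link:

    // Hypothetical sketch: one enum variant drives both the `code` and the
    // `link` fields of the error payload.
    #[derive(Debug, Clone, Copy)]
    enum Code {
        Internal,
        NoSpaceLeftOnDevice,
    }

    impl Code {
        fn name(&self) -> &'static str {
            match self {
                Code::Internal => "internal",
                Code::NoSpaceLeftOnDevice => "no_space_left_on_device",
            }
        }

        fn link(&self) -> String {
            format!("https://docs.meilisearch.com/errors#{}", self.name())
        }
    }

    fn main() {
        let code = Code::NoSpaceLeftOnDevice;
        // Prints: no_space_left_on_device
        //         https://docs.meilisearch.com/errors#no_space_left_on_device
        println!("{}\n{}", code.name(), code.link());
    }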

Impacted teams

Since this error is already in the spec (and so in the docs) and is supposed to exist, there is no one to ping. https://docs.meilisearch.com/reference/errors/error_codes.html#no-space-left-on-device

About this issue

  • State: closed
  • Created 2 years ago
  • Comments: 29 (26 by maintainers)

Most upvoted comments

OK SO BIG NEWS, it’s worse than I thought.

It’s actually reproducible, but I have NO IDEA how we could figure out what makes the first or the second error be thrown. What I know is that if I index the movies.json dataset first and then nested_movies.json, I get os error 28. But if I index nested_movies.json first and then movies.json, I get os error 5.

Ok, so after a meeting with @dureuill and another with @nicolasvienot, we realized that entirely overriding the os error 5 could have huge drawbacks, so the final proposition is to:

  1. Create a no_space_left_on_device error code for the os error 28 (which happens sometimes)
  2. Create an io_error code for the os error 5
  3. Keep letting the kernel generate the initial error message for the os error 5, but append some extra info at the end of it (see the sketch below): Input/output error (os error 5). This error generally happens when you have no space left on device or when your database doesn't have read or write right.
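A rough sketch of that mapping (hypothetical names and structure; the actual implementation in Meilisearch may differ), keyed off the raw OS error number:

    use std::io;

    // Hypothetical error codes mirroring the proposition above.
    #[derive(Debug)]
    enum Code {
        NoSpaceLeftOnDevice, // os error 28 (ENOSPC)
        IoError,             // os error 5 (EIO)
        Internal,            // everything else, unchanged
    }

    // Map a raw OS error to the proposed error code and user-facing message.
    fn classify(err: &io::Error) -> (Code, String) {
        match err.raw_os_error() {
            // 28 = ENOSPC on Linux: the cause is known, use the dedicated code.
            Some(28) => (Code::NoSpaceLeftOnDevice, err.to_string()),
            // 5 = EIO on Linux: keep the kernel message but append the hint.
            Some(5) => (
                Code::IoError,
                format!(
                    "{err}. This error generally happens when you have no space \
                     left on device or when your database doesn't have read or \
                     write right."
                ),
            ),
            // Any other I/O error stays a generic internal error.
            _ => (Code::Internal, err.to_string()),
        }
    }

    fn main() {
        println!("{:?}", classify(&io::Error::from_raw_os_error(28)));
        println!("{:?}", classify(&io::Error::from_raw_os_error(5)));
    }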

Implementation:

  • Create and merge the PR
  • Update the spec

@meilisearch/docs-team You might be interested in this change; it introduces error changes: no_space_left_on_device is already in your docs, but io_error will be a newly created one. The spec will be updated accordingly. Also, you might want to review the error message 👀

@gmourier Yes, totally, and it is already a problem. We display the space used in the volume, and the value does not seem accurate. We need to investigate more, but I think this should be a separate GitHub issue.

@nicolasvienot is it possible that the system you were testing had many incremental updates (a huge /tasks backlog) to the index? I’m running into the same issue, where my index should only be a few MB big but the task queue is cluttered with old processed tasks (hitting the /tasks route will also kill my process because the response is too big), as we currently have many small updates to the index.

(as you wrote, this should be a separate GH issue, I just wanted to point it out here)

I can reproduce the issue as well; I created a really smol partition on my Linux machine and started indexing documents; here’s the result I got:

{
  "details": {
    "indexedDocuments": 0,
    "receivedDocuments": 19547
  },
  "duration": "PT11.489335705S",
  "enqueuedAt": "2022-07-05T13:25:58.967517187Z",
  "error": {
    "code": "internal",
    "link": "https://docs.meilisearch.com/errors#internal",
    "message": "An internal error has occurred. `Input/output error (os error 5)`.",
    "type": "internal"
  },
  "finishedAt": "2022-07-05T13:26:10.463413698Z",
  "indexUid": "mieli",
  "startedAt": "2022-07-05T13:25:58.974077993Z",
  "status": "failed",
  "type": "documentAdditionOrUpdate",
  "uid": 1
}

I’m running under Linux 5.18.8.

Hey @meilisearch/cloud-team,

Are you sure it is an os error 5 that you get and not an os error 28? The 5 one seems to be access denied, while the 28 is related to a lack of space on the device.

I validate your message suggestion!

It’s not sending them in the wrong direction, it’s giving a clue to the users. We say “this might be due to…” and not “this is definitely…”. When reading this error:

  • the users will check their available space. On the SaaS, you can check it easily, and I personally ran into the problem (missing space) quickly when using it, so this clue can definitely help SaaS users. Users who don’t use the SaaS would also check the available space: this is a first investigation step for them. Users would rather fix the problem on their own than contact support and wait for an answer.
  • the users who still have enough space will report the error by saying “I also checked that I have enough space on my device, which is the case, but I still have the issue”. That is even better for us, since we will not even have to ask the question “do you have enough space on your device?”

We are investigating how to fix this; it is not an easy improvement to make. We will try to do it for v1, but it is impossible for v0.30.0, sorry! 😢

@gmourier 👍 already working with that version.

Won’t the fact that LMDB can’t shrink the space previously used when emptying an index (without deleting it) cause problems on the cloud side when displaying the used space? @nicolasvienot

What I observed (with Meilisearch v0.27 and v0.28) is that the data.ms/data.mdb file in the root folder stays at the same size even if all indexes were removed (the index files live in the data.ms/indexes folder anyway). As I have a huge number of tasks in Meilisearch that are not cleared automatically, my guess was/is that the tasks table takes up all the space in that file.

Hey @mmachatschek !

(hitting the /tasks route will also kill my process because the size is too big)

We have released v0.28, which finally brings tasks pagination and filtering (by index, type, and status) capabilities. You should not have this problem anymore if you upgrade to this version!

Lol, I wanted to reproduce the issue for @Kerollmops. I did the EXACT same thing and it threw another error:

{
  "details": {
    "indexedDocuments": 0,
    "receivedDocuments": 10271
  },
  "duration": "PT9.279810746S",
  "enqueuedAt": "2022-07-05T17:30:30.714754378Z",
  "error": {
    "code": "internal",
    "link": "https://docs.meilisearch.com/errors#internal",
    "message": "No space left on device (os error 28)",
    "type": "internal"
  },
  "finishedAt": "2022-07-05T17:30:40.009485687Z",
  "indexUid": "mieli",
  "startedAt": "2022-07-05T17:30:30.729674941Z",
  "status": "failed",
  "type": "documentAdditionOrUpdate",
  "uid": 1
}

And here is what we get from the system:

    Err(
        Milli(
            IoError(
                Os {
                    code: 28,
                    kind: StorageFull,
                    message: "No space left on device",
                },
            ),
        ),
    )

I then redid the same thing and got the first error again:

{
  "details": {
    "indexedDocuments": 0,
    "receivedDocuments": 19547
  },
  "duration": "PT11.699438861S",
  "enqueuedAt": "2022-07-05T17:32:36.427322701Z",
  "error": {
    "code": "internal",
    "link": "https://docs.meilisearch.com/errors#internal",
    "message": "An internal error has occurred. `Input/output error (os error 5)`.",
    "type": "internal"
  },
  "finishedAt": "2022-07-05T17:32:48.14102773Z",
  "indexUid": "mieli",
  "startedAt": "2022-07-05T17:32:36.441588869Z",
  "status": "failed",
  "type": "documentAdditionOrUpdate",
  "uid": 1
}

    Err(
        Internal(
            Io(
                Os {
                    code: 5,
                    kind: Uncategorized,
                    message: "Input/output error",
                },
            ),
        ),
    )

Note that for the same os error 5, the two systems return different text: Tamo’s says Input/output error while Nico’s says I/O error. However, it shouldn’t be an issue if we are able to directly catch the io::Error raw number.
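To make “directly catch the io::Error raw number” concrete, here is a tiny sketch: raw_os_error() exposes the numeric errno, which stays the same even when the platform phrases the message differently:

    use std::io;

    fn main() {
        let err = io::Error::from_raw_os_error(5);
        // The textual message can vary ("I/O error" vs "Input/output error"),
        // but the raw errno is stable, so matching on it is reliable.
        assert_eq!(err.raw_os_error(), Some(5));
        println!("message: {err}, raw: {:?}", err.raw_os_error());
    }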

Hey @Kerollmops, I just did the test again, trying to index documents when there is no space left on the volume. Here is the failed task with Meilisearch v0.27.2:

        {
            "uid": 24,
            "indexUid": "movies20",
            "status": "failed",
            "type": "documentAddition",
            "details": {
                "receivedDocuments": 31968,
                "indexedDocuments": 0
            },
            "error": {
                "message": "An internal error has occurred. `I/O error (os error 5)`.",
                "code": "internal",
                "type": "internal",
                "link": "https://docs.meilisearch.com/errors#internal"
            },
            "duration": "PT92.749877984S",
            "enqueuedAt": "2022-07-04T22:57:33.046582055Z",
            "startedAt": "2022-07-04T23:17:53.517704393Z",
            "finishedAt": "2022-07-04T23:19:26.267582377Z"
        },

There should not be any access-denied issue, as the previous task went well.

[Screenshot: Capture d’écran 2022-07-05 à 01 51 59]

@Kerollmops is it technically doable?

I would say yes, we must be able to retrieve the os error code (5) and change that.

As discussed with @gmourier, here is my suggestion to improve this specific user message:

"error": {
     "code": "internal",
     "link": "https://docs.meilisearch.com/errors#internal",
     "message": "I/O error (os error 5). This might be due to a lack of space on the device. If not, please contact us.",
     "type": "internal"
},

@Kerollmops is it technically doable?

@gmourier do you validate this solution?

If I get two yeses, I will mark this issue as a good first issue and will update the spec once it’s fixed 😇

Without creating a new error category, since the consequences for debugging could be bad, maybe we could just add a sentence to the error message like “This might be due to a lack of space on the device”? This would orient the user without impacting us. WDYT @gmourier?