meilisearch: Unclear error when there is not enough space left
Describe the bug
When trying to index documents in a Kubernetes volume that does not have enough space left, Meilisearch returns the following error:
```json
{
  "details": {
    "indexedDocuments": 0,
    "receivedDocuments": 46206
  },
  "duration": "PT512.665928885S",
  "enqueuedAt": "2022-03-17T14:58:24.465745303Z",
  "error": {
    "code": "internal",
    "link": "https://docs.meilisearch.com/errors#internal",
    "message": "I/O error (os error 5)",
    "type": "internal"
  },
  "finishedAt": "2022-03-17T15:06:57.168374834Z",
  "indexUid": "bgg",
  "startedAt": "2022-03-17T14:58:24.502445949Z",
  "status": "failed",
  "type": "documentAddition",
  "uid": 3
}
```
This error might also happen outside of a Kubernetes environment, but this has not been tested.
To Reproduce
Steps to reproduce the behavior:
- Create a Meilisearch instance in a Kubernetes cluster and use a persistent volume to store the `data.ms`
- Index documents until the volume is full

Expected behavior
A clear, documented error should be returned.
Meilisearch version: v0.26.0
EDIT from @curquiza
How?
When Meilisearch does not have enough space left on the machine, you get the following error:
```json
{
  "message": "I/O error (os error 5).",
  "code": "internal",
  "type": "internal",
  "link": "https://docs.meilisearch.com/errors#internal"
}
```
We want to replace:
- the code (not the type): `internal` by `no_space_left_on_device`
- the link: `#internal` by `#no_space_left_on_device`
⚠️ This is what the specification already mentioned and it looks like Meilisearch does not follow it yet: https://github.com/meilisearch/specifications/blob/main/text/0061-error-format-and-definitions.md#no_space_left_on_device
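For illustration, here is a sketch of the payload after that change, built with `serde_json` only to show the shape; everything except the `code` and `link` fields is copied unchanged from the example above:

```rust
use serde_json::json;

fn main() {
    // Illustrative only: the same error payload as above, with `code` and
    // `link` replaced as described in the edit.
    let expected = json!({
        "message": "I/O error (os error 5).",
        "code": "no_space_left_on_device",
        "type": "internal",
        "link": "https://docs.meilisearch.com/errors#no_space_left_on_device"
    });
    println!("{}", serde_json::to_string_pretty(&expected).unwrap());
}
```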
Impacted teams
Since this error is already in the spec (and so in the docs) and supposed to exist, there is no team to ping. https://docs.meilisearch.com/reference/errors/error_codes.html#no-space-left-on-device
About this issue
- Original URL
- State: closed
- Created 2 years ago
- Comments: 29 (26 by maintainers)
OK SO BIG NEWS, it’s worse than I thought.
It’s actually reproducible, but I have NO IDEA how we could understand what makes the first or second error to be thrown. What I know is that if I index the `movies.json` dataset first and then `nested_movies.json`, I get os error 28. But if I index `nested_movies.json` first and then `movies.json`, then I get the os error 5.

Ok, so after a meeting with @dureuill and another with @nicolasvienot, we realized that overriding the `os error 5` entirely could have huge drawbacks, so the final proposition is to:
- keep the `no_space_left_on_device` error code for the os error 28 (that happens sometimes)
- introduce `io_error` for the os error 5, with the message: “Input/output error (os error 5). This error generally happens when you have no space left on device or when your database doesn't have read or write rights.”
Implementation:
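A minimal sketch of such a mapping, assuming the underlying `std::io::Error` is reachable where the task error is built (the helper name and the string codes below are illustrative, not Meilisearch’s actual internals):

```rust
use std::io;

/// Hypothetical helper: pick an error code from the raw OS error number,
/// following the proposition above.
fn error_code_for(err: &io::Error) -> &'static str {
    match err.raw_os_error() {
        // ENOSPC (28): the device has no space left.
        Some(28) => "no_space_left_on_device",
        // EIO (5) and any other I/O failure: generic `io_error`.
        _ => "io_error",
    }
}

fn main() {
    let enospc = io::Error::from_raw_os_error(28);
    let eio = io::Error::from_raw_os_error(5);
    assert_eq!(error_code_for(&enospc), "no_space_left_on_device");
    assert_eq!(error_code_for(&eio), "io_error");
}
```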
@meilisearch/docs-team You might be interested by this change; it introduces error changes: `no_space_left_on_device` is already in your docs, but `io_error` will be a newly created one. The spec will be updated accordingly. Also, you might want to review the error message 👀

@nicolasvienot is it possible that the system you were testing had many incremental updates (huge /tasks backlog) to the index? I’m running into the same issue, where my index should be just a few MB big but the task queue is cluttered with old processed tasks (hitting the /tasks route will also kill my process because the size is too big), as we currently have many small updates to the index.
(as you wrote, this should be a separate GH issue, I just wanted to point it out here)
I can reproduce the issue as well; I created a really smol partition on my Linux machine and started indexing documents; here’s the result I got:
I’m running under Linux 5.18.8.
Hey @meilisearch/cloud-team,
Are you sure it is an os error 5 that you get and not an os error 28? The 5 one seems to be access denied, while the 28 is related to a lack of space on the device.
I validate your message suggestion!
It’s not sending them in the wrong direction, it’s giving a clue to the users when reading this error: we say “this might be due to…” and not “this is definitely…”.
We are investigating how to fix this; it is not an easy improvement to do. We will try to do it for v1, but it’s impossible for v0.30.0, sorry! 😢
@gmourier 👍 already working with that version.
What I observed (with Meilisearch v0.27 and v0.28) is that the `data.ms/data.mdb` file in the root folder stays at the same file size even if all indexes were removed (the index files live in the `data.ms/indexes` folder anyway). As I have a huge amount of tasks in Meilisearch that are not cleared automatically, my guess was/is that the tasks table will take up all the space in that file.

Hey @mmachatschek!
We have released v0.28, which finally brings tasks pagination and filtering (by index, type, and status) capabilities. You should not have this problem anymore if you upgrade to this version!
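For illustration, a minimal client-side sketch (assuming a local instance on the default port without a master key, and using `reqwest` only as an example HTTP client; the exact filter and pagination parameter names should be checked against the v0.28 docs):

```rust
// Cargo deps (illustrative): reqwest = { version = "0.11", features = ["blocking"] }
fn main() -> Result<(), Box<dyn std::error::Error>> {
    // Filter the task list by status and page through it instead of
    // fetching the whole backlog at once.
    let body = reqwest::blocking::get(
        "http://localhost:7700/tasks?status=succeeded&limit=20",
    )?
    .text()?;
    println!("{body}");
    Ok(())
}
```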
Lol, I wanted to reproduce the issue for @Kerollmops. I did the EXACT same thing and it threw another error:
And here is what we get from the system:
I then redid the same thing and got the first error again:
Note that for the same os error 5, both systems return a different text: the Tamo one says Input/Output error, while the Nico one says I/O error. However, it shouldn’t be an issue if we are able to directly catch the `io::Error` raw number.

Hey @Kerollmops, I just did the test again, trying to index documents when there is no space left on the volume. Here is the failed task with Meilisearch `v0.27.2`:
There should not be any access denied as the previous task went well.
I would say yes, we must be able to retrieve the os error code (5) and change that.
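As a tiny standalone illustration (plain `std`, not Meilisearch code) of why the differing wording does not matter: the raw OS error number is available on `std::io::Error` regardless of how the platform formats the message.

```rust
use std::io;

fn main() {
    // Both systems report the same raw number even if the displayed text
    // ("I/O error" vs "Input/output error") differs.
    let err = io::Error::from_raw_os_error(5);
    println!("text: {err}");                    // platform-dependent wording
    println!("raw:  {:?}", err.raw_os_error()); // always Some(5)
}
```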
As discussed with @gmourier, here is my suggestion to improve this specific user message
@Kerollmops is it technically doable?
@gmourier do you validate this solution?
If I get 2 yes, I will make this issue as
good first issue
and will update the spec once it’s fixed 😇@gmourier Yes, totally, and it is already a problem. We display the space used in the volume, and the value does not seem accurate. We need to investigate more, but I think this should be a separate GitHub issue.
Won’t the fact that LMDB can’t shrink the space previously used when emptying an index (without deleting it) cause problems on the cloud side when displaying the used space? @nicolasvienot
Without creating a new error category, since the consequences for debugging could be bad, maybe we could just add a sentence to the error message like “This might be due to a lack of space on the device”? This would orient the user without impacting us. WDYT @gmourier?