bbolt: Deleting a large bucket causes indefinite writes

Hi,

I work at the Internet Archive on the Wayback Machine team. We started using bbolt in new software that we are developing internally, and we think we have discovered a possible bug.

We have a database with many large buckets, and at some point we deleted a bucket that contained a few hundred million keys. Following that, bbolt started writing indefinitely for days (it was not just a short I/O spike; we are talking about roughly a week of writing at 300 MB/s). We have no idea what it was writing, or to where, but it was doing write I/O.

Restarting the software that uses bbolt didn’t fix it. I had to run bbolt compact on the database (which reduced it from 211 GB to 29 GB); after that, when we restarted the software, it no longer maxed out write I/O.

I then tried to reproduce it by deleting another big bucket (44M keys), and it caused the same issue. Then I stopped our software, ran bbolt compact again, restarted it, and the issue disappeared.

About this issue

  • Original URL
  • State: open
  • Created a year ago
  • Comments: 46 (24 by maintainers)

Most upvoted comments

Sorry for the wait; it took some time to add a couple of buckets with 50M 32-byte keys in them.

I created a super naive test where you put something first, delete the bucket, and then put something into another bucket. The tests indeed show a 10x–20x increase in the time the put takes:

PUT took 17.335668ms
<delete>
PUT took 166.673496ms

The delete takes some time, but I do not really see a large chunk of continuous I/O on the second put, even when running the fsync. @CorentinB, can you maybe share a little about how you’re using bbolt? What kind of “requests” do you process, and how do they translate to the respective functions?

That the compaction makes the process faster is not super surprising, for the same reason we suggest people regularly defrag etcd. Fragmentation does take a toll; I’m not sure whether that’s the issue here for the latency increase.


code for reference here: https://github.com/tjungblu/bbolt/commit/2effd9c24a8c6418a947abdbf5eb8afda221ec05#diff-2c97e5ffc11514491315dcfdfe66968d3c22aae95764f385d23f717717783131R13-R48

Talking about “scale” issues: we had to implement our own stats system to keep track of bucket sizes, because from what I understand the Stats() method reads the entire bucket to return the number of keys. It was way too slow for our usage.

We think it could be much faster to keep track of the number of keys as we add/remove them from the bucket, in real time, with an atomic integer. That’s potentially something I’ll open an issue for, and maybe try to make a PR, if it’s something you all would like to see implemented in bbolt.

I wonder under what circumstances you recommend we do the above (append to a file first, then write to boltDB)? Is it whenever we have write-heavy loads? What are the symptoms if we don’t follow this recommendation for write-heavy loads? Will we be seeing the same behaviour as in this issue?

There are two prerequisites to follow this async solution:

  1. Eventual consistency is acceptable. Once the request data is persisted in the append-only file (WAL: Write-Ahead Log), the data will not be lost. Other users/clients might get stale data right after a successful write, but eventually they will get the new data once the entries in the WAL files are applied to bboltDB.
  2. User experience matters; in other words, you expect clients to get fast responses.

Makes me wonder what would happen if I were to delete a big bucket (then we are back to our original issue from this thread…).

The main reason for the big pages is to make the free-page list smaller. If your pages are 256KB, there are 64x fewer pages, so 64x less I/O on transaction commit to sync the free list. Maybe the 64x smaller freelist will be good enough for your use-case.

That’s interesting.

  1. Out of curiosity, you might try to log the size of the free-pages list that gets written with each transaction:

https://github.com/etcd-io/bbolt/blob/505fc0f7af3c9bba93a80fc33918c90c1b0517ad/tx.go#L238

The list is not differential, so with each transaction it’s written entirely.


  2. NoFreelistSync seems like a good fix. Another approach for speedup is to use significantly bigger pages. In that case there would be: a) fewer nodes in general; b) better prefetching on reads and writes; c) fewer pages in the freelist (reducing not only the cost of flushing it to disk, which NoFreelistSync turns off, but also the cost of scanning the in-memory list to find an empty page); d) please also use the map-based freelist (instead of the list-based one).

I think it’s fair in 1.4 to change the defaults to: a) no-sync for free pages, b) the map-based freelist implementation: https://github.com/etcd-io/bbolt/blob/505fc0f7af3c9bba93a80fc33918c90c1b0517ad/db.go#L1263-L1266
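For reference, both settings are already available today as options when opening the database (shown as a configuration fragment; this sketch assumes the `go.etcd.io/bbolt` package imported as `bolt`):

```go
db, err := bolt.Open("my.db", 0600, &bolt.Options{
	NoFreelistSync: true,                 // don't flush the freelist on each commit
	FreelistType:   bolt.FreelistMapType, // map-based freelist instead of array-based
})
```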

…what NoFreelistSync implies, so I’m not sure I can use it in production though…

It’s a tradeoff. If the freelist isn’t synced to disk, it may take a while to open the db; see also https://github.com/etcd-io/bbolt/issues/392#issuecomment-1407852612. But the benefit is better performance when committing each transaction, because bbolt doesn’t need to sync the freelist.

I am thinking we should probably set NoFreelistSync to true by default. Usually users open the db once and then do as many transactions as they want, so loading the freelist on startup is a one-time operation, and taking a little longer there might be acceptable. But taking a long time for each commit is obviously more expensive.

any comments? @ptabor @tjungblu

I’ll try db.NoFreelistSync = true and report here 😃 Thanks a lot @tjungblu !

Or is simply writing 44M keys and then deleting the bucket enough to trigger this?

I don’t have anything to share publicly to reproduce it, sadly, but I would guess that yes, doing that should show the same thing. If not, then there is more digging to do…