bbolt: Deleting a large bucket causes indefinite writes
Hi,
I work at the Internet Archive on the Wayback Machine team. We started using bbolt in new software that we are developing internally, and we think we discovered a possible bug.
We have a database with many large buckets, and at some point we deleted a bucket that contained a few hundred million keys. Following that, bbolt started writing indefinitely for days (it was not just a short I/O spike; we are talking about roughly a week of writing at 300MB/s). We have no idea what it was writing, or to where, but it was doing write I/O.
Restarting the software that uses bbolt didn’t fix it. I had to run `bbolt compact` on the database (which reduced it from 211GB to 29GB), and after that, when we restarted the software, it no longer maxed out write I/O.
I then tried to reproduce it by deleting another big bucket (44M keys), and it caused the same issue. Then I stopped our software, ran `bbolt compact` again, and restarted it; the issue disappeared.
About this issue
- State: open
- Created a year ago
- Comments: 46 (24 by maintainers)
Sorry for the wait, it took some time to add a couple of buckets with 50M 32-byte keys in them.
I created a super naive test where you put something first, delete the bucket and then put something into another bucket. The tests indeed show a 10x-20x increase in the time it takes:
The delete takes some time, but I do not really see it having a large chunk of continuous IO on the second put - even when running the fsync. @CorentinB can you maybe share a little how you’re using bbolt? So what kind of “requests” you process and how they translate to the respective functions?
That the compaction makes the process faster is not super surprising, for the same reason we suggest people regularly defrag etcd. Fragmentation does take a toll; not sure whether that’s the issue here for the latency increase.
code for reference here: https://github.com/tjungblu/bbolt/commit/2effd9c24a8c6418a947abdbf5eb8afda221ec05#diff-2c97e5ffc11514491315dcfdfe66968d3c22aae95764f385d23f717717783131R13-R48
Talking about “scale” issues, we had to implement our own stats system to keep track of buckets’ sizes because, from what I understand, the Stats() method reads the entire bucket to return the number of keys. So it was way too slow for our usage.
We think it could be way faster to keep track of the number of keys as we add/remove them from the bucket in real time with an atomic int variable. That’s potentially something I’ll open an issue for, and maybe try to make a PR, if that’s something you all would like to see implemented in bbolt.
There are two prerequisites to follow this async solution:
The main reason for the big pages is to make the freelist smaller. If your pages are 256KB, there are 64x fewer pages, so 64x less I/O on transaction commit to sync the free list. Maybe the 64x smaller usage will be good enough for your use case.
That’s interesting.
https://github.com/etcd-io/bbolt/blob/505fc0f7af3c9bba93a80fc33918c90c1b0517ad/tx.go#L238
The list is not differential, so with each transaction it’s written entirely.
I think that it’s fair, in 1.4, to change the defaults to: a) no-sync for free pages, and b) the map-based freelist implementation: https://github.com/etcd-io/bbolt/blob/505fc0f7af3c9bba93a80fc33918c90c1b0517ad/db.go#L1263-L1266
It’s a tradeoff. If the freelist isn’t synced to disk, then it may take a while when you open the db, see also https://github.com/etcd-io/bbolt/issues/392#issuecomment-1407852612. But the benefit is that you get better performance when committing each transaction because bbolt doesn’t need to sync freelist.
I am thinking we should probably set `NoFreelistSync` to `true` by default? Usually users just open the db once and do as many transactions as they want, so loading the freelist on startup is a one-time operation; taking a little longer there might be acceptable. But taking a long time on every commit is obviously more expensive. Any comments? @ptabor @tjungblu
I’ll try `db.NoFreelistSync = true` and report here 😃 Thanks a lot @tjungblu!
I don’t have anything to give you publicly to reproduce it, sadly, but I guess that yes, do that and you should be seeing the same thing. If not, then there is more digging to do…