chia-blockchain: [Bug] plots beyond ~4400 = harvester 100.0 load, cache_hit: false, plots check hangs before challenges

What happened?

I've noticed that for the last few releases, chia_harvester has been pegging a thread continuously while farming.

Info:

  • System has >20k plots, direct-attached. Single harvester.
  • plot_refresh_callback completes in 15 seconds and proof checks are typically 0.4-1 sec.
  • Aside from chia_harvester constantly pegging its thread, all else appears to function normally.

Elaboration:

  • Reinstalled chia-blockchain from scratch, only importing keys and the mainnet/wallet DBs. No change.
  • Experimented with varying numbers of plots and noted that below ~4400 plots, chia_harvester no longer pegs a thread (load dropped to 0.0). Adding 200 plots back made the load jump back to 100.0 indefinitely.
  • Experimented with various harvester config settings (num_threads, parallel_reads, batch_size). No change.
  • Noted that upon startup, with >4400 plots, the found_plot messages from the harvester transition from cache_hit: True to cache_hit: False (a quick log-tally sketch follows this list).
  • Also noted that attempting to run chia plots check on any of the drives/plots with cache_hit: False results in that check hanging indefinitely before it issues a single challenge.
  • Rewards are tracking for my total plot count (not 4400), so while the cache_hit: False plots cause high harvester CPU usage and cannot be checked, they are still successfully farming.
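
As a quick way to quantify the cache_hit transition described in this list, here is a rough Python sketch (not part of chia) that tallies cache_hit: True vs cache_hit: False across the harvester's found_plot log lines. It assumes the default mainnet log location and that the lines contain those literal strings; adjust the path and the exact casing to whatever your debug.log actually shows.

 from collections import Counter
 from pathlib import Path

 # Hypothetical helper, not part of chia: tally cache_hit values in found_plot log lines.
 log_path = Path.home() / ".chia" / "mainnet" / "log" / "debug.log"  # adjust if your CHIA_ROOT differs

 counts = Counter()
 with open(log_path, encoding="utf-8", errors="replace") as log_file:
     for line in log_file:
         if "found_plot" not in line:
             continue
         if "cache_hit: True" in line:
             counts["cache_hit: True"] += 1
         elif "cache_hit: False" in line:
             counts["cache_hit: False"] += 1

 for label, count in counts.items():
     print(f"{label}: {count}")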

Possible causes:

  • This feels like high plot counts not playing nicely with plot_refresh / chia.plotting.cache: one of the harvester threads pegs indefinitely while attempting to cache the portion of plots beyond some maximum, and perhaps that same thread then fails to respond to a plots check of those same plots (a conceptual sketch of this suspected loop follows).
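
To make that hypothesis concrete, below is a minimal conceptual sketch in Python. It is not chia's actual refresh code and every name in it is hypothetical; the point is only that if persisting the cache for the plots beyond the threshold keeps failing, each refresh cycle redoes the same expensive serialization, which would keep one harvester thread pegged.

 import time

 def refresh_loop(plots, cache, serialize, write_cache, interval_seconds=120):
     # Hypothetical sketch of the suspected behaviour; not chia's implementation.
     while True:
         uncached = [p for p in plots if p not in cache]  # the plots reported with cache_hit: False
         blob = serialize(uncached)                       # expensive when many plots are uncached
         try:
             write_cache(blob)                            # suspected to fail past some size limit
             cache.update(uncached)                       # only reached if the write succeeds
         except ValueError:
             pass                                         # nothing persisted, so the next cycle repeats the work
         time.sleep(interval_seconds)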

Version

1.5.0

What platform are you using?

Linux

What ui mode are you using?

CLI

Relevant log output

No response

About this issue

  • Original URL
  • State: closed
  • Created 2 years ago
  • Reactions: 1
  • Comments: 21 (6 by maintainers)

Most upvoted comments

Okay so… it turned out that the reason for all this is plots created via the bladebit RAM plotter, where the DiskProver serializes into 524,659 bytes, which:

  • Obviously takes a very long time based on the number of those plots
  • Lets the cache grow like crazy, so that we end up with a number of bytes which doesn’t fit into uint32 -> Value 5794656522 does not fit into uint32 while we serialize the length of the bytes (illustrated in the sketch after this list).
  • Leads to the refresh thread constantly working on the serialization; as soon as it’s done, it fails to write for the reason above, and in the next refresh event it tries the same again. This seems to be the reason for the 100% peg.
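
For reference, the arithmetic behind that overflow: a uint32 length prefix can describe at most 2**32 - 1 bytes, and the serialized cache here is larger than that. The snippet below is a plain-struct illustration of the same constraint, not chia's streamable code.

 import struct

 UINT32_MAX = 2**32 - 1            # 4,294,967,295
 cache_length = 5_794_656_522      # the value from the error message above

 print(cache_length > UINT32_MAX)  # True -> this length cannot be stored as a uint32
 try:
     struct.pack(">I", cache_length)   # packing into a 4-byte unsigned int hits the same limit
 except struct.error as err:
     print(err)                        # struct rejects values that do not fit in 4 bytes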

The reason why the DiskProver serializes into such a huge blob is that those plots seem to have 65,536 C2 entries.

Table pointers from a plot in question with table_begin_pointers[10] - table_begin_pointers[9] -> 262,144:

table_begin_pointers = {std::vector<unsigned long long>} size=11
 [0] = {unsigned long long} 0
 [1] = {unsigned long long} 262144
 [2] = {unsigned long long} 14839185408
 [3] = {unsigned long long} 28822208512
 [4] = {unsigned long long} 42911924224
 [5] = {unsigned long long} 57272958976
 [6] = {unsigned long long} 72367734784
 [7] = {unsigned long long} 89824165888
 [8] = {unsigned long long} 107538284544
 [9] = {unsigned long long} 107540119552
 [10] = {unsigned long long} 107540381696

Table pointers from a normally working plot with table_begin_pointers[10] - table_begin_pointers[9] -> 176:

table_begin_pointers = {std::vector<unsigned long long>} size=11
 [0] = {unsigned long long} 0
 [1] = {unsigned long long} 252
 [2] = {unsigned long long} 14839436976
 [3] = {unsigned long long} 28822365051
 [4] = {unsigned long long} 42911861451
 [5] = {unsigned long long} 57273202401
 [6] = {unsigned long long} 72368924901
 [7] = {unsigned long long} 89827257426
 [8] = {unsigned long long} 107543532882
 [9] = {unsigned long long} 107545250830
 [10] = {unsigned long long} 107545251006
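
To make the comparison concrete, here is a small Python check over the two dumps above. It just takes the difference of the last two table_begin_pointers entries, i.e. the region attributed to the C2 entries, which comes out to 262,144 bytes for the plot in question versus 176 bytes for the normal plot.

 # Pointer values copied from the two debugger dumps above.
 bladebit_plot = [0, 262144, 14839185408, 28822208512, 42911924224, 57272958976,
                  72367734784, 89824165888, 107538284544, 107540119552, 107540381696]
 normal_plot = [0, 252, 14839436976, 28822365051, 42911861451, 57273202401,
                72368924901, 89827257426, 107543532882, 107545250830, 107545251006]

 for name, pointers in (("plot in question", bladebit_plot), ("normal plot", normal_plot)):
     c2_region_bytes = pointers[10] - pointers[9]
     print(f"{name}: table_begin_pointers[10] - table_begin_pointers[9] = {c2_region_bytes}")
 # Prints 262144 for the plot in question and 176 for the normal plot.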

I’m going to talk with @harold-b about this and will post an update once we’ve figured this out.

My system automatically deletes the C:\Users\Administrator.* folders every time it starts up. The cache problem mentioned here does not exist for me.

It could still be a caching-related issue since it would create a new cache on the next startup (and the cache is then used while the harvester runs). Either way, we won’t know unless we can figure out a way to tell what those pegged harvester threads are doing.

Updated to 1.5.1 and cleared all settings, starting clean.

  • chia_harvester still remains at a constant 100.0 load while farming with >~4k plots.
  • Still seeing cache_hit: false on a large portion of plots.
  • chia plots check of the previously troublesome ranges now takes a long time to start issuing challenges (with its process pegged at 100.0 during a delay of several minutes per 1k plots in the selected range), but it does eventually begin and completes without error.
  • Confirmed with another large farmer that they, too, are seeing chia_harvester remain at 100.0 load while farming.