bleve: Scorch persister stuck after segment creation fails?

On low-memory devices, we see frequent memory allocation errors when submitting indexing batches to scorch:

2019/11/03 08:18:50 [DVR] Fetching guide data for 262 stations in USA-WI48449-X @ 2019-11-18 2:00PM
2019/11/03 08:19:02 [DVR]   indexed 1761 airings (247 channels) [1s fetch, 10s index]
2019/11/03 08:19:03 [DVR]   indexed 112 movies (35 channels) [0s fetch, 0s index]
2019/11/03 08:19:03 [DVR] Fetching guide data for 262 stations in USA-WI48449-X @ 2019-11-18 8:00PM
2019/11/03 08:19:07 [DVR] Error indexing airings: error opening new segment at USA-WI48449-X.airings/store/00000000668c.zap, cannot allocate memory
2019/11/03 08:19:08 [DVR] Fetching guide data for 262 stations in USA-WI48449-X @ 2019-11-19 2:00AM
2019/11/03 08:19:12 [DVR] Error indexing airings: cannot allocate memory
2019/11/03 08:19:26 [IDX] Pruned 3839 expired airings from USA-WI48449-X in 10.116005734s.

Then, a few hours later, the whole process crashes. The stack trace shows the persister sitting inside pausePersisterForMergerCatchUp() while the Batch call is still stuck in prepareSegment():

runtime/cgo: pthread_create failed: Resource temporarily unavailable
SIGABRT: abort
PC=0xb68bd6 m=2 sigcode=4294967290

goroutine 0 [idle]:
runtime: unknown pc 0xb68bd6
stack: frame={sp:0x66c9a87c, fp:0x0} stack=[0x6649b034,0x66c9ac34)
66c9a7fc:  00b6bcf1  00000000  00000000  00000000 
66c9a80c:  00000000  00000000  00000000  00000000 
66c9a81c:  00000000  00000000  00000000  00000000 
66c9a82c:  00000000  00000000  00000000  00000000 
66c9a83c:  00000000  00000000  00000000  00000000 
66c9a84c:  00000000  00000000  00000000  00000000 
66c9a85c:  00000000  00000000  00000000  00000000 
66c9a86c:  01afd358  66c9b300  00000001  01afd358 
66c9a87c: <01afd358  00b6bc7b  01afd358  66c9b300 
66c9a88c:  00b6be9b  00000000  00000000  00000000 
66c9a89c:  00000000  66c9a8cd  010d770c  00000000 
66c9a8ac:  010cb153  66c90043  00b698ef  010cb120 
66c9a8bc:  00000005  4d5f434c  41535345  2f534547 
66c9a8cc:  6362696c  006f6d2e <github.com/blevesearch/bleve/index/scorch/segment/zap.mergeStoredAndRemap.func2+802>  00b6989d  00000000 
66c9a8dc:  00000000  00000043  00000000  00000000 
66c9a8ec:  00000000  00000000  00000000  00000000 
runtime: unknown pc 0xb68bd6
stack: frame={sp:0x66c9a87c, fp:0x0} stack=[0x6649b034,0x66c9ac34)
66c9a7fc:  00b6bcf1  00000000  00000000  00000000 
66c9a80c:  00000000  00000000  00000000  00000000 
66c9a81c:  00000000  00000000  00000000  00000000 
66c9a82c:  00000000  00000000  00000000  00000000 
66c9a83c:  00000000  00000000  00000000  00000000 
66c9a84c:  00000000  00000000  00000000  00000000 
66c9a85c:  00000000  00000000  00000000  00000000 
66c9a86c:  01afd358  66c9b300  00000001  01afd358 
66c9a87c: <01afd358  00b6bc7b  01afd358  66c9b300 
66c9a88c:  00b6be9b  00000000  00000000  00000000 
66c9a89c:  00000000  66c9a8cd  010d770c  00000000 
66c9a8ac:  010cb153  66c90043  00b698ef  010cb120 
66c9a8bc:  00000005  4d5f434c  41535345  2f534547 
66c9a8cc:  6362696c  006f6d2e <github.com/blevesearch/bleve/index/scorch/segment/zap.mergeStoredAndRemap.func2+802>  00b6989d  00000000 
66c9a8dc:  00000000  00000043  00000000  00000000 
66c9a8ec:  00000000  00000000  00000000  00000000 

goroutine 16 [chan receive, 519 minutes]:
github.com/blevesearch/bleve/index/scorch.(*Scorch).prepareSegment(0x25c4a80, 0x1047268, 0x2bbe500, 0x336e000, 0xff, 0x100, 0x2620e40, 0x0, 0x0, 0x0)
	github.com/blevesearch/bleve@v0.8.2-0.20191010234049-157461a2aeb6/index/scorch/scorch.go:425 +0x3e4
github.com/blevesearch/bleve/index/scorch.(*Scorch).Batch(0x25c4a80, 0x30acac0, 0x0, 0x0)
	github.com/blevesearch/bleve@v0.8.2-0.20191010234049-157461a2aeb6/index/scorch/scorch.go:361 +0x6ec
github.com/blevesearch/bleve.(*indexImpl).Batch(0x28fcdc0, 0x2620e60, 0x0, 0x0)
	github.com/blevesearch/bleve@v0.8.2-0.20191010234049-157461a2aeb6/index_impl.go:310 +0x94
github.com/fancybits/channels-server/dvr.(*Recorder).indexAirings(0x25ce8c0, 0x29e5490, 0xd, 0x30560a0, 0x309fd01, 0x0, 0x0, 0xd55233f0, 0xe, 0x1ae94c0, ...)


goroutine 50 [select, 2006 minutes]:
github.com/blevesearch/bleve/index/scorch.(*Scorch).pausePersisterForMergerCatchUp(0x25c4a80, 0x1310, 0x0, 0x112a, 0x0, 0x0, 0x0, 0x0, 0x2ac21b0, 0x112a, ...)
	github.com/blevesearch/bleve@v0.8.2-0.20191010234049-157461a2aeb6/index/scorch/persister.go:295 +0x2d0
github.com/blevesearch/bleve/index/scorch.(*Scorch).persisterLoop(0x25c4a80)
	github.com/blevesearch/bleve@v0.8.2-0.20191010234049-157461a2aeb6/index/scorch/persister.go:117 +0x5f4
created by github.com/blevesearch/bleve/index/scorch.(*Scorch).Open
	github.com/blevesearch/bleve@v0.8.2-0.20191010234049-157461a2aeb6/index/scorch/scorch.go:170 +0xc8


...


trap    0x6
error   0x0
oldmask 0x0
r0      0x0
r1      0x7fcd
r2      0x6
r3      0x7fcd
r4      0x6
r5      0x66c9b7c0
r6      0x2
r7      0x10c
r8      0x1
r9      0xe0
r10     0x2400540
fp      0x66c9a9bc
ip      0x10c
sp      0x66c9a87c
lr      0xb6bc7b
pc      0xb68bd6
cpsr    0x20000030
fault   0x0

About this issue

  • State: open
  • Created 5 years ago
  • Comments: 15 (2 by maintainers)

Most upvoted comments

I think we had identified some paths that have this exact behavior: some error happens that is essentially unrecoverable by scorch, and scorch will just retry it indefinitely. I don’t remember the details, but I think it was slightly less trivial to fix because some of the components are decoupled, which makes the problem harder to detect.