oj: Seeing segfaults and stuck processes following 3.13.3

Possibly related to code changed in https://github.com/ohler55/oj/pull/695

After we upgraded to 3.13.3 we started getting some new alerts from our infra:

Segfaults
Seeing stuck processes (no longer responding, spinning on CPU)

The backtraces point back to oj_calc_hash_key

gdb) bt
#0  __GI_raise (sig=sig@entry=6) at ../sysdeps/unix/sysv/linux/raise.c:50
#1  0x00007fbe5f3ed535 in __GI_abort () at abort.c:79
#2  0x00007fbe5f9d375b in die () at error.c:664
#3  rb_bug_for_fatal_signal (default_sighandler=0x0, sig=sig@entry=11, ctx=ctx@entry=0x7fbe12011ac0, fmt=fmt@entry=0x7fbe5fc69f8b "Segmentation fault at %p") at error.c:664
#4  0x00007fbe5fb9243b in sigsegv (sig=11, info=0x7fbe12011bf0, ctx=0x7fbe12011ac0) at signal.c:946
#5  <signal handler called>
#6  locking_intern (c=0x7fbe5952f710,
    key=0x7fbddd93607e "wiki\":false,\"reviewable_id\":null,\"reviewable_score_count\":0,\"reviewable_score_pending_count\":0,\"topic_posts_count\":2,\"topic_filtered_posts_count\":2,\"topic_archetype\":\"regular\",\"category_slug\":\"android"..., len=4) at cache.c:210
#7  0x00007fbe5f0640aa in oj_calc_hash_key (pi=<optimized out>, parent=<optimized out>) at strict.c:47
#8  0x00007fbe5f02e13d in hash_set_value (pi=0x7fbe132fb680, parent=0x7fbe132fc8d8, value=0) at compat.c:158
#9  0x00007fbe5f04d8a7 in add_value (pi=0x7fbe132fb680, rval=<optimized out>) at parse.c:83
#10 0x00007fbe5f050212 in read_false (pi=0x7fbe132fb680) at parse.c:127
#11 oj_parse2 (pi=pi@entry=0x7fbe132fb680) at parse.c:752
#12 0x00007fbe5f050529 in protect_parse (pip=pip@entry=140454342407808) at parse.c:959
#13 0x00007fbe5fa6cf74 in rb_protect (proc=proc@entry=0x7fbe5f050520 <protect_parse>, data=data@entry=140454342407808, pstate=pstate@entry=0x7fbe132fb57c) at eval.c:1087
#14 0x00007fbe5f05067e in oj_pi_parse (argc=argc@entry=1, argv=argv@entry=0x7fbe132fb678, pi=pi@entry=0x7fbe132fb680, json=json@entry=0x0, len=len@entry=0, yieldOk=yieldOk@entry=0)
    at parse.c:1068
#15 0x00007fbe5f0436f5 in mimic_parse_core (argc=<optimized out>, argv=0x7fbe132ff398, bang=<optimized out>, self=<optimized out>) at mimic_json.c:595
#16 0x00007fbe5fc05b09 in vm_call_cfunc_with_frame (empty_kw_splat=<optimized out>, cd=0x7fbe48464390, calling=<optimized out>, reg_cfp=0x7fbe133fe4b8, ec=0x7fbe34ed9b50)
    at vm_insnhelper.c:2514
#17 vm_call_cfunc (ec=0x7fbe34ed9b50, reg_cfp=0x7fbe133fe4b8, calling=<optimized out>, cd=0x7fbe48464390) at vm_insnhelper.c:2539
#18 0x00007fbe5fc10f42 in vm_sendish (block_handler=<optimized out>, method_explorer=<optimized out>, cd=<optimized out>, reg_cfp=<optimized out>, ec=<optimized out>) at vm_insnhelper.c:4023
#19 vm_exec_core (ec=0x7fbddd936082, initial=87400465847158) at insns.def:801

We are using Ruby 2.7.2p137 (we are going to upgrade very shortly to 2.7.4)

Issue is quite extensive, we opted to downgrade to 3.13.2 now to see if it resolves per:

https://github.com/discourse/discourse/commit/0183d51070d33abc66c8681b43a5ce7571333b13

The segfault is the primary issue, you can see the scale here, usually it is no segfaults for weeks across hundreds of processes.

Will update here to confirm if 3.13.2 resolves the issue.

About this issue

Original URL
State: closed
Created 3 years ago
Reactions: 2
Comments: 24 (18 by maintainers)

Links to this issue

Who’s at Fault: Tracking Down Segfaults in Production

Commits related to this issue

Revert "Build(deps): Bump oj from 3.13.2 to 3.13.3 (#14202)" This reverts commit 1a65f0bfbbd32887a3c90fdaa894487c21f8467a. New Oj gem has issues see: https://github.com/ohler55/oj/issues/699 — committed to discourse/discourse by SamSaffron 3 years ago
DEV: Update oj gem https://github.com/ohler55/oj/issues/699 was fixed back in September 2021. — committed to discourse/discourse by CvX 2 years ago
DEV: Update oj gem (#15713) https://github.com/ohler55/oj/issues/699 was fixed back in September 2021. — committed to discourse/discourse by CvX 2 years ago

Most upvoted comments

Thanks. Turn off the cache with Oj.default_options = { cache_keys: false, cache_strings: 0 }

End of the day here so will have to call it a night pretty soon but will pick it up tomorrow. If the cache turned off fixes the issue that at least narrows it down. I’ll look whether Ruby 2.7.x GC occurs during normal evaluations as well.

ohler55 on Sep 2, 2021

Released v3.13.4 for the fix.

ohler55 on Sep 4, 2021

We also had some SEGV after updating to 3.13.3 (cc @etiennebarrie @adrianna-chang-shopify), here’s the backtrace (very similar to Sam’s):

(gdb) bt
#0  __GI_raise (sig=sig@entry=6) at ../sysdeps/unix/sysv/linux/raise.c:50
#1  0x00007f64c3a0e859 in __GI_abort () at abort.c:79
#2  0x000055a89436dd31 in die () at error.c:773
#3  rb_bug (fmt=fmt@entry=0x55a8945f8960 "try to mark T_NONE object") at error.c:773
#4  0x000055a89438f5a8 in gc_mark_ptr (objspace=0x7f64c326c000, obj=140070595921664) at gc.c:6114
#5  0x000055a894390ee0 in gc_mark_children (objspace=0x7f64c326c000, obj=140066757234120) at gc.c:6378
#6  0x000055a8943919f8 in gc_mark_stacked_objects (incremental=1, count=2147483647, objspace=0x7f64c326c000) at gc.c:6471
#7  gc_mark_stacked_objects_incremental (count=2147483647, objspace=0x7f64c326c000) at gc.c:6505
#8  gc_marks_rest (objspace=0x7f64c326c000) at gc.c:7448
#9  0x000055a894395b42 in gc_marks_continue (objspace=0x7f64c326c000, heap=0x7f64c326c028) at gc.c:7492
#10 0x000055a894395ddb in heap_prepare (heap=<optimized out>, objspace=0x7f64c326c000) at gc.c:2044
#11 heap_next_freepage (heap=<optimized out>, objspace=<optimized out>) at gc.c:2200
#12 ractor_cache_slots (cr=<optimized out>, cr=<optimized out>, objspace=<optimized out>) at gc.c:2219
#13 newobj_slowpath (wb_protected=1, cr=0x7f64c3377000, objspace=0x7f64c326c000, flags=8193, klass=140070569509720) at gc.c:2265
#14 newobj_slowpath_wb_protected (klass=140070569509720, flags=8193, objspace=0x7f64c326c000, cr=0x7f64c3377000) at gc.c:2284
#15 0x000055a894395fdd in newobj_of0 (cr=<optimized out>, wb_protected=1, flags=8193, klass=<optimized out>) at gc.c:2322
#16 newobj_of (wb_protected=1, v3=52, v2=52, v1=52, flags=8193, klass=<optimized out>) at gc.c:2333
#17 rb_class_allocate_instance (klass=<optimized out>) at gc.c:2488
#18 0x000055a894400113 in class_call_alloc_func (klass=140070569509720, allocator=<optimized out>) at object.c:2064
#19 rb_class_alloc (klass=140070569509720) at object.c:2036
#20 rb_class_new_instance_pass_kw (argc=2, argv=0x7f64c3277c60, klass=140070569509720) at object.c:2109
#21 0x000055a89451a37b in vm_call_cfunc_with_frame (ec=0x7f64c3271050, reg_cfp=0x7f64c3374928, calling=<optimized out>) at vm_insnhelper.c:2926
#22 0x000055a89452f360 in vm_sendish (block_handler=<optimized out>, method_explorer=<optimized out>, cd=<optimized out>, reg_cfp=<optimized out>, ec=<optimized out>) at vm_callinfo.h:336
#23 vm_exec_core (ec=0x7f64c3271050, initial=140724985337088) at insns.def:789
#24 0x000055a894524c46 in rb_vm_exec (ec=<optimized out>, mjit_enable_p=<optimized out>) at vm.c:2163
#25 0x000055a894525682 in invoke_bmethod (ec=0x7f64c3271050, iseq=<optimized out>, self=140066684037480, me=0x7f64bab61770, type=<optimized out>, opt_pc=0, captured=<optimized out>) at vm.c:1293
#26 0x000055a894525e36 in vm_call_bmethod_body (argv=0x7ffd16c31e80, calling=0x7ffd16c31f40, ec=0x7f64c3271050) at vm_callinfo.h:329
#27 vm_call_bmethod (ec=0x7f64c3271050, cfp=0x7f64c3374b58, calling=0x7ffd16c31f40) at vm_insnhelper.c:3005
#28 0x000055a89452f360 in vm_sendish (block_handler=<optimized out>, method_explorer=<optimized out>, cd=<optimized out>, reg_cfp=<optimized out>, ec=<optimized out>) at vm_callinfo.h:336
#29 vm_exec_core (ec=0x7f64c3271050, initial=140724985337088) at insns.def:789
#30 0x000055a894524c46 in rb_vm_exec (ec=<optimized out>, mjit_enable_p=<optimized out>) at vm.c:2163
#31 0x000055a894525682 in invoke_bmethod (ec=0x7f64c3271050, iseq=<optimized out>, self=140066684034840, me=0x7f63d341cab0, type=<optimized out>, opt_pc=0, captured=<optimized out>) at vm.c:1293
#32 0x000055a894529d03 in vm_call_bmethod_body (argv=0x7f64c3277948, calling=0x7ffd16c322a0, ec=0x7f64c3271050) at vm_callinfo.h:329
#33 vm_call0_body (ec=0x7f64c3271050, calling=0x7ffd16c322a0, argv=0x7f64c3277948) at vm_eval.c:205
#34 0x000055a89452c899 in vm_call0_cc (kw_splat=<optimized out>, cc=<optimized out>, argv=0x7f64c3277948, argc=0, id=2348191, recv=140066684034840, ec=0x7f64c3271050) at vm_eval.c:72
#35 rb_call0 (ec=0x7f64c3271050, recv=140066684034840, mid=2348191, argc=0, argv=0x7f64c3277948, call_scope=<optimized out>, self=52) at vm_eval.c:505
#36 0x000055a89452ce82 in send_internal (argc=0, argv=0x7f64c3277948, recv=140066684034840, scope=<optimized out>) at vm_eval.c:1245
#37 0x000055a89451a37b in vm_call_cfunc_with_frame (ec=0x7f64c3271050, reg_cfp=0x7f64c3374d18, calling=<optimized out>) at vm_insnhelper.c:2926
#38 0x000055a89452f360 in vm_sendish (block_handler=<optimized out>, method_explorer=<optimized out>, cd=<optimized out>, reg_cfp=<optimized out>, ec=<optimized out>) at vm_callinfo.h:336
#39 vm_exec_core (ec=0x7f64c3271050, initial=140724985337088) at insns.def:789
#40 0x000055a89452547d in rb_vm_exec (ec=<optimized out>, mjit_enable_p=<optimized out>) at vm.c:2172
#41 0x000055a8945285df in invoke_block (captured=<optimized out>, opt_pc=<optimized out>, type=<optimized out>, cref=0x0, self=140070626928760, iseq=0x7f64b684bf80, ec=0x7f64c3271050) at vm.c:1263
#42 invoke_iseq_block_from_c (me=0x0, is_lambda=<optimized out>, cref=0x0, passed_block_handler=0, kw_splat=0, argv=0x7ffd16c326d8, argc=1, self=140070626928760, captured=<optimized out>, ec=0x7f64c3271050) at vm.c:1335
#43 invoke_block_from_c_bh (argc=<optimized out>, kw_splat=<optimized out>, passed_block_handler=<optimized out>, cref=<optimized out>, is_lambda=<optimized out>, force_blockarg=<optimized out>, argv=<optimized out>, block_handler=<optimized out>, ec=<optimized out>) at vm.c:1353
#44 vm_yield (argc=1, kw_splat=0, argv=0x7ffd16c326d8, ec=0x7f64c3271050) at vm.c:1398

Note that we’re on 3.0.2.

I personally doubt https://github.com/ohler55/oj/pull/695 is the cause, because we use rb_enc_interned_str() extensively already via msgpack-ruby, and we’ve never seen this.

The backtrace seem to indicate that some Ruby object is in an invalid state, I’m not so familiar with Oj’s codebase so not sure which it could be.

We’ll try to reproduce this with GC.stress = true.

casperisfine on Sep 2, 2021

Pushed changes to the ‘thread-protect’ branch. http://www.ohler.com/oj/doc/file.Options.html updated as well.

ohler55 on Sep 3, 2021

I have a backtrace from the same background process (running on a different machine) that segfaulted from the OP, however this was in a different location. I can’t see any reference to OJ in the full backtrace, but this occurred around the same time after upgrading to 3.13.3.

This looks like the cause was during GC.

(gdb) bt 25
#0  __GI_raise (sig=sig@entry=6) at ../sysdeps/unix/sysv/linux/raise.c:50
#1  0x00007fd4d981d535 in __GI_abort () at abort.c:79
#2  0x00007fd4d9e03756 in die () at error.c:647
#3  rb_bug (fmt=fmt@entry=0x7fd4da06b6ae "try to mark T_NONE object") at error.c:647
#4  0x00007fd4d9e03b27 in gc_mark_ptr (objspace=0x7fd4d8c8a100, obj=140551651948288) at gc.c:5175
#5  0x00007fd4d9eb31eb in gc_mark (obj=<optimized out>, objspace=0x7fd4d8c8a100) at gc.c:5205
#6  gc_mark_values (values=<optimized out>, n=<optimized out>, objspace=0x7fd4d8c8a100) at gc.c:4713
#7  gc_mark_imemo (obj=140551651945640, objspace=0x7fd4d8c8a100) at gc.c:5251
#8  gc_mark_children (objspace=0x7fd4d8c8a100, obj=140551651945640) at gc.c:5334
#9  0x00007fd4d9ebab71 in gc_mark_stacked_objects (incremental=1, count=3693, objspace=0x7fd4d8c8a100) at gc.c:5506
#10 gc_mark_stacked_objects_incremental (count=3693, objspace=0x7fd4d8c8a100) at gc.c:5540
#11 gc_marks_step (slots=<optimized out>, objspace=0x7fd4d8c8a100) at gc.c:6445
#12 gc_marks_continue (heap=0x7fd4d8c8a128, objspace=0x7fd4d8c8a100) at gc.c:6505
#13 heap_prepare (heap=0x7fd4d8c8a128, objspace=0x7fd4d8c8a100) at gc.c:2020
#14 heap_get_freeobj_from_next_freepage (heap=<optimized out>, objspace=<optimized out>) at gc.c:2037
#15 heap_get_freeobj (heap=0x7fd4d8c8a128, objspace=0x7fd4d8c8a100) at gc.c:2076
#16 newobj_slowpath (wb_protected=1, objspace=0x7fd4d8c8a100, v3=0, v2=0, v1=0, flags=5, klass=140552148301240) at gc.c:2220
#17 newobj_slowpath_wb_protected (klass=140552148301240, flags=5, v1=0, v2=0, v3=0, objspace=0x7fd4d8c8a100) at gc.c:2232
#18 0x00007fd4d9fd5fa5 in str_alloc (klass=<optimized out>) at string.c:750
#19 str_new0 (klass=<optimized out>,
    ptr=0x7fd475d3c600 "{\"retry\":true,\"queue\":\"default\",\"class\":\"Jobs::UnpinTopic\",\"args\":[{\"topic_id\":5875,\"current_site_id\":\"akimbo26\"}],\"jid\":\"cd286b99ac003d8fe731fefe\",\"created_at\":1584936702.2181027}\r\n$18\r\n33141873599.2"..., len=180, termlen=1) at string.c:750
#20 0x00007fd4d9fe91e1 in str_substr (empty=1, len=<optimized out>, beg=140550486607360, str=<optimized out>) at string.c:2607
#21 rb_str_substr (str=str@entry=140551056786840, beg=beg@entry=0, len=<optimized out>) at string.c:2586
#22 0x00007fd4d9feeefd in rb_str_aref_m (argc=2, str=140551056786840, argv=0x7fd48fefb360) at string.c:4565
#23 rb_str_aref_m (str=140551056786840, argv=0x7fd48fefb360, argc=2) at string.c:4556
#24 rb_str_slice_bang (argc=2, argv=<optimized out>, str=140551056786840) at string.c:4873

If you need more information, I will have to try and run the background task under gdb and catch a fault as it occurs.

fitzy101 on Sep 2, 2021