oj: Seeing segfaults and stuck processes following 3.13.3
Possibly related to code changed in https://github.com/ohler55/oj/pull/695
After we upgraded to 3.13.3 we started getting some new alerts from our infra:
- Segfaults
- Seeing stuck processes (no longer responding, spinning on CPU)
The backtraces point back to oj_calc_hash_key
(gdb) bt
#0 __GI_raise (sig=sig@entry=6) at ../sysdeps/unix/sysv/linux/raise.c:50
#1 0x00007fbe5f3ed535 in __GI_abort () at abort.c:79
#2 0x00007fbe5f9d375b in die () at error.c:664
#3 rb_bug_for_fatal_signal (default_sighandler=0x0, sig=sig@entry=11, ctx=ctx@entry=0x7fbe12011ac0, fmt=fmt@entry=0x7fbe5fc69f8b "Segmentation fault at %p") at error.c:664
#4 0x00007fbe5fb9243b in sigsegv (sig=11, info=0x7fbe12011bf0, ctx=0x7fbe12011ac0) at signal.c:946
#5 <signal handler called>
#6 locking_intern (c=0x7fbe5952f710,
key=0x7fbddd93607e "wiki\":false,\"reviewable_id\":null,\"reviewable_score_count\":0,\"reviewable_score_pending_count\":0,\"topic_posts_count\":2,\"topic_filtered_posts_count\":2,\"topic_archetype\":\"regular\",\"category_slug\":\"android"..., len=4) at cache.c:210
#7 0x00007fbe5f0640aa in oj_calc_hash_key (pi=<optimized out>, parent=<optimized out>) at strict.c:47
#8 0x00007fbe5f02e13d in hash_set_value (pi=0x7fbe132fb680, parent=0x7fbe132fc8d8, value=0) at compat.c:158
#9 0x00007fbe5f04d8a7 in add_value (pi=0x7fbe132fb680, rval=<optimized out>) at parse.c:83
#10 0x00007fbe5f050212 in read_false (pi=0x7fbe132fb680) at parse.c:127
#11 oj_parse2 (pi=pi@entry=0x7fbe132fb680) at parse.c:752
#12 0x00007fbe5f050529 in protect_parse (pip=pip@entry=140454342407808) at parse.c:959
#13 0x00007fbe5fa6cf74 in rb_protect (proc=proc@entry=0x7fbe5f050520 <protect_parse>, data=data@entry=140454342407808, pstate=pstate@entry=0x7fbe132fb57c) at eval.c:1087
#14 0x00007fbe5f05067e in oj_pi_parse (argc=argc@entry=1, argv=argv@entry=0x7fbe132fb678, pi=pi@entry=0x7fbe132fb680, json=json@entry=0x0, len=len@entry=0, yieldOk=yieldOk@entry=0)
at parse.c:1068
#15 0x00007fbe5f0436f5 in mimic_parse_core (argc=<optimized out>, argv=0x7fbe132ff398, bang=<optimized out>, self=<optimized out>) at mimic_json.c:595
#16 0x00007fbe5fc05b09 in vm_call_cfunc_with_frame (empty_kw_splat=<optimized out>, cd=0x7fbe48464390, calling=<optimized out>, reg_cfp=0x7fbe133fe4b8, ec=0x7fbe34ed9b50)
at vm_insnhelper.c:2514
#17 vm_call_cfunc (ec=0x7fbe34ed9b50, reg_cfp=0x7fbe133fe4b8, calling=<optimized out>, cd=0x7fbe48464390) at vm_insnhelper.c:2539
#18 0x00007fbe5fc10f42 in vm_sendish (block_handler=<optimized out>, method_explorer=<optimized out>, cd=<optimized out>, reg_cfp=<optimized out>, ec=<optimized out>) at vm_insnhelper.c:4023
#19 vm_exec_core (ec=0x7fbddd936082, initial=87400465847158) at insns.def:801
We are using Ruby 2.7.2p137 (we are going to upgrade very shortly to 2.7.4)
The issue is quite extensive, so we opted to downgrade to 3.13.2 for now to see if that resolves it, per:
https://github.com/discourse/discourse/commit/0183d51070d33abc66c8681b43a5ce7571333b13
The segfault is the primary issue. You can see the scale here: usually there are no segfaults for weeks across hundreds of processes.

Will update here to confirm if 3.13.2 resolves the issue.
About this issue
- State: closed
- Created 3 years ago
- Reactions: 2
- Comments: 24 (18 by maintainers)
Commits related to this issue
- Revert "Build(deps): Bump oj from 3.13.2 to 3.13.3 (#14202)" This reverts commit 1a65f0bfbbd32887a3c90fdaa894487c21f8467a. New Oj gem has issues see: https://github.com/ohler55/oj/issues/699 — committed to discourse/discourse by SamSaffron 3 years ago
- DEV: Update oj gem https://github.com/ohler55/oj/issues/699 was fixed back in September 2021. — committed to discourse/discourse by CvX 2 years ago
- DEV: Update oj gem (#15713) https://github.com/ohler55/oj/issues/699 was fixed back in September 2021. — committed to discourse/discourse by CvX 2 years ago
Thanks. Turn off the cache with
Oj.default_options = { cache_keys: false, cache_strings: 0 }
End of the day here, so I will have to call it a night pretty soon, but I will pick it up tomorrow. If turning the cache off fixes the issue, that at least narrows it down. I’ll also look into whether Ruby 2.7.x GC occurs during normal evaluations.
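As a rough illustration of the suggestion above, the cache-off setting could be applied once at boot. This is only a sketch: the option names are copied verbatim from the comment above and should be checked against the Oj options documentation for your version, and `OJ_CACHE_OFF` and `disable_oj_caches!` are hypothetical names introduced here. The `require` is guarded so the snippet still loads where the gem is absent.

```ruby
# Load Oj if it is available; otherwise leave it undefined.
begin
  require 'oj'
rescue LoadError
  # Oj is not installed in this environment; nothing to configure.
end

# Hypothetical constant holding the options exactly as given in the thread.
OJ_CACHE_OFF = { cache_keys: false, cache_strings: 0 }.freeze

# Hypothetical helper: apply the cache-off options if Oj is loaded.
# Returns true when the options were applied, false when Oj is missing.
def disable_oj_caches!
  return false unless defined?(Oj)

  Oj.default_options = OJ_CACHE_OFF
  true
end
```

Calling `disable_oj_caches!` from an initializer, before any parsing happens, would make the setting apply process-wide.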
Released v3.13.4 for the fix.
We also had some SEGVs after updating to 3.13.3 (cc @etiennebarrie @adrianna-chang-shopify). Here’s the backtrace (very similar to Sam’s):
Note that we’re on Ruby 3.0.2.
I personally doubt https://github.com/ohler55/oj/pull/695 is the cause, because we already use rb_enc_interned_str() extensively via msgpack-ruby, and we’ve never seen this.
The backtrace seems to indicate that some Ruby object is in an invalid state. I’m not very familiar with Oj’s codebase, so I’m not sure which object it could be.
We’ll try to reproduce this with GC.stress = true.
Pushed changes to the ‘thread-protect’ branch. http://www.ohler.com/oj/doc/file.Options.html updated as well.
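For anyone wanting to try the GC.stress approach mentioned above, a minimal sketch follows. It uses the stdlib JSON parser so that it runs anywhere; to exercise Oj, swap in the Oj call you suspect. The payload and the helper names `with_gc_stress` and `stress_parse` are hypothetical, and the keys are only loosely modelled on those visible in the first backtrace.

```ruby
require 'json'

# Run the block with GC.stress enabled, then restore the previous setting.
# With GC.stress on, Ruby runs the GC at every opportunity, which makes
# objects that are not properly protected from collection fail much sooner.
def with_gc_stress
  previous = GC.stress
  GC.stress = true
  yield
ensure
  GC.stress = previous
end

# Hypothetical sample document.
SAMPLE_JSON = JSON.generate(
  'wiki' => false,
  'reviewable_id' => nil,
  'topic_posts_count' => 2,
  'category_slug' => 'android'
)

# Parse the same document repeatedly under GC stress and return the last
# result. Keep the iteration count small: GC.stress is very slow.
def stress_parse(iterations = 20)
  with_gc_stress do
    iterations.times.map { JSON.parse(SAMPLE_JSON) }.last
  end
end
```

A crash or corrupted result under `stress_parse` would point at a GC-related bug in the parser under test; a clean run does not prove its absence.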
I have a backtrace from the same background process as in the OP (running on a different machine) that also segfaulted; however, this crash was in a different location. I can’t see any reference to Oj in the full backtrace, but it occurred around the same time, after upgrading to 3.13.3.
It looks like the crash happened during GC.
If you need more information, I will have to try and run the background task under gdb and catch a fault as it occurs.