reverb: segmentation fault

Hello,

I am using reverb 0.4.0 in tf-agents 0.9.0 through the ReverbReplayBuffer and ReverbAddTrajectoryObserver. Ray actors push experience to the reverb server. Currently, though, it is configured with just 1 actor, and experience pushing completes in a blocking manner before agent training in the main loop.

I am seeing segmentation faults happening at random times, always within the main process that samples from reverb to train the agent.

I was wondering if you have some hints about why this might be happening, or where I can start debugging?

*** SIGSEGV received at time=1629367150 on cpu 4 ***
PC: @     0x7f328092219f  (unknown)  deepmind::reverb::(anonymous namespace)::LocalSamplerWorker::FetchSamples()
    @     0x7f32d0810980       2320  (unknown)
    @     0x7f328091a14f        144  deepmind::reverb::Sampler::RunWorker()
    @     0x7f32cd7cd039  (unknown)  execute_native_thread_routine
    @     0x7f2be000ec70  (unknown)  (unknown)
    @     0x7f3280965ca0  (unknown)  (unknown)
    @ 0x75058b4808ec8348  (unknown)  (unknown)
Segmentation fault (core dumped)
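
A generic starting point for debugging, not specific to Reverb: the trace above only shows the native C++ frames of the sampler worker thread. Python's standard-library faulthandler module can additionally dump the Python-level tracebacks of all threads when the process receives SIGSEGV, which may show what the training loop was doing at the moment of the crash. Below is a minimal sketch; the log file name is a hypothetical choice, and the call is assumed to be made at the top of the training script, before the replay buffer and dataset are created.

```python
import faulthandler
import os
import tempfile

# Hypothetical log location; any writable path works.
log_path = os.path.join(tempfile.gettempdir(), "trainer_faulthandler.log")

# The file object must stay open for the lifetime of the process,
# because faulthandler writes to its file descriptor from the signal handler.
crash_log = open(log_path, "w")

# On SIGSEGV, SIGFPE, SIGABRT, SIGBUS or SIGILL, dump the Python
# tracebacks of all threads to crash_log.
faulthandler.enable(file=crash_log, all_threads=True)

# ... build the agent, replay buffer and dataset, then train as usual ...
```

The same handler can be enabled without code changes by running the script with `python -X faulthandler` or by setting the `PYTHONFAULTHANDLER` environment variable, in which case the tracebacks go to stderr. Note that faulthandler reports Python frames only, so it complements the native trace rather than replacing it.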

Most upvoted comments

That note doesn’t apply to Reverb, since the iterator for Reverb is aware of updates to the service. Basically it should just work 😃

On Tue, Sep 7, 2021 at 11:05 AM Samarth Brahmbhatt @.***> wrote:

I’ll leave this open until we can figure out what’s going on or move everyone over to TrajectoryWriter. Thanks for the report and the repro, and for the additional details about parallel write/read (that should work just fine).

The reason I thought concurrent write/read would not work is the blue note in the documentation of ReverbReplayBuffer.as_dataset(): https://www.tensorflow.org/agents/api_docs/python/tf_agents/replay_buffers/ReverbReplayBuffer#as_dataset

[image: Screenshot from 2021-09-07 11-01-28] https://user-images.githubusercontent.com/2848070/132390408-d312e024-8d9a-4da2-b10c-a48ce7e92b29.png

If you want to test concurrent write/read, revert commit ebfba0ab7c474b3831279a07e5e65e8af98f4269 in the repro repository: https://github.com/samarth-robo/reverb_segfault_repro/commit/ebfba0ab7c474b3831279a07e5e65e8af98f4269
