ponyc: Segmentation fault when actor receives a reference to itself via a class created in a different actor.

_EDIT:_ Skip down to https://github.com/ponylang/ponyc/issues/1118#issuecomment-238431412 to see the most reduced example of the issue.

I gutted the HTTP server down to this, which I think is a reproduction of the seg fault in #937. I kept the original type names so that they could be somewhat related back to that package if necessary. Seems to have something to do with the partial function this~answer(). At least if I interrupt anything after that assignment, it the seg fault disappeared.

interface val ResponseHandler
  fun val apply(request: Payload val, response: Payload val): Any

interface val RequestHandler
  fun val apply(request: Payload): Any


primitive Handle
  fun val apply(request: Payload) =>
    (consume request).respond(Payload)


actor _ServerConnection
  let _handler: RequestHandler

  new create(h: RequestHandler) => _handler = h

  be dispatch(request: Payload) =>
    request.handler = recover this~answer() end
    _handler(consume request)

  be answer(request: Payload val, response: Payload val) =>
    None


class iso Payload
  var handler: (ResponseHandler | None) = None

  fun iso respond(response': Payload) =>
    try
      let h = (handler) as ResponseHandler
      h(consume this, consume response')
    end


actor Main
  new create(env: Env) =>
    let t = Test

    for i in Range(0, 10_000) do
      t.do_it()
    end


actor Test
  be do_it() =>
    _ServerConnection(Handle).dispatch(Payload)

This could probably be further reduced, but I wanted to maintain at least a slight semblance to the original code… and it’s the middle of the night so I’m going to 😴.

About this issue

Original URL
State: closed
Created 8 years ago
Comments: 57 (52 by maintainers)

Commits related to this issue

Fix segfault caused by unsafe garbage collection optimization (#4256) Prior to this commit, there was an unsafe optimization in the Pony runtime. The optimization is detailed in the ORCA paper on t... — committed to ponylang/ponyc by SeanTAllen 2 years ago

Most upvoted comments

@peterzandbergen - it definitely still needs to be fixed, and it’s still marked as high priority. Runtime segmentation faults are not acceptable .

jemc on Sep 28, 2016

To summarize what we discussed in the Zulip thread:

The current Pony runtime has a correctness bug due to what is usually a valid optimization, but in this case is not.

Specifically, the Pony runtime traces immutable (val) objects shallowly - that is, it skips tracing of fields within such objects. This saves time by reducing how much tracing has to happen, and it is described as a safe optimization in the ORCA paper, because the “outer” val object acts as an upper bound on the lifetime of the “inner” objects referred to by its fields.

While that optimization is safe within the limited scope of what was considered in the ORCA paper, the reasoning ignores the counting of actor references (which was outside the scope of the ORCA paper).

If a val object has references (either directly as its fields, or transitively as fields of its fields) to any actors, those actors need to be traced. Hence, for such an object we cannot keep this optimization in place.

But for val objects which are known via static analysis to not possibly refer to any actors, this optimization is safe and we’d like to keep it in place if possible, to keep the part of the benefit of this optimization for some workloads.

As such, we want to add a new kind of static analysis to the compiler that can classify any given data type as “definitely contains no actor references” or “may possibly contain an actor reference”. If we can mark an type with the internal designation contains_no_actors, then it is valid for that type to participate in the above mentioned optimization, and the compiler should generate a trace function for that type which uses the optimized path when immutable. Otherwise, it would need to take a new pessimistic path for the sake of correctness, tracing it at runtime so that any actors it may contain are traced.

To determine if a type should be marked as contains_no_actors:

If any field type is an actor type, or a composite type (tuple, union, intersection) referring to an actor type => return false.
If any field type refers (possibly within a composite type) to a type which is not marked contains_no_actors => return false.
- This implies we need to be able to do this analysis recursively, and it also implies we need a mechanism to break self-recursion.
- We can likely use a similar mechanism to that used in subtype checking; that is, we can have an “assumptions list” for things that we are temporarily assuming to be marked contains_no_actors, and every time we recurse into a type we push it onto that list, such that we will surely terminate and no type will mark itself as contains_no_actors without some cause which is not itself.
If the type under consideration is an abstract type (such as an interface or trait), and reachability analysis shows that the abstract type has in the reachable program subsumed any type which is not marked contains_no_actors => return false.
- This also implies recursive analysis - this time recursing into subsumed types instead of into fields.
Otherwise, the type has been shown to not possibly contain any actors => return true.

jemc on Jun 29, 2022

no one is suggesting not fixing. the concern is, can we find a way to fix without having a performance impact.

SeanTAllen on Sep 28, 2016

I think I got it. This is a premature free issue.

Full tracing of val objects sent in messages is deferred until a GC cycle on the owner. This means the _ServerConnection assigned to the Payload and then sent in a message (via the Payload) has a reference count greater than 0 but isn’t aware of it until the Payload is traced. By then, the _ServerConnection might have been GC’d, which of course results in bad stuff happening.

Praetonus on Sep 22, 2016

@SeanTAllen Thanks, and yes, with the Main and Test actors from the example above it, it should reproduce the seg fault. I’ll put a full version of the smallest example in this comment.

It’ll be interesting to see if this actually fixes the httpserver example. There was some other odd behavior that I noted in issue #937 which is where my investigation began, but this is certainly a piece of the puzzle if not the entire thing.

Here’s the most simplified yet complete code that I could find to reproduce the issue:

actor Main
  new create(env: Env) =>
    let t = Test

    for i in Range(0, 10_000) do
      t.do_it()
    end


actor Test
  be do_it() => _ServerConnection.dispatch(Payload)


class iso Payload
  var handler: (_ServerConnection | None) = None


actor _ServerConnection
  be dispatch(p: Payload) =>
    p.handler = recover this end
    this.answer(consume p) // <-- Pass the Payload that references this _ServerConnection

  be answer(p: Payload val) => None

Notes:

In the more complete examples, the _ServerConnection is assigned to the Payload via a partial function binding instead of being assigned directly.
If the Payload object is created directly in dispatch() instead of at the call location in do_it(), there’s no seg fault.
If you assign None to the .handler field instead of the _ServerConnection object, there’s no seg fault.
The seg fault is intermittent. It’s more likely to happen using a Range with 10_000, than with 100.

Perelandric on Aug 10, 2016