ockam: Make Rust Errors more idiomatic

I wrote

 let mut forwarding_address = String::new();
 std::io::stdin().read_line(&mut forwarding_address)?;

Got

error[E0277]: `?` couldn't convert the error to `ockam::Error`
  --> examples/alice.rs:15:54
   |
15 |   std::io::stdin().read_line(&mut forwarding_address)?;
   |                                                      ^ the trait `From<std::io::Error>` is not implemented for `ockam::Error`
   |
   = note: the question mark operation (`?`) implicitly performs a conversion on the error value using the `From` trait
   = help: the following implementations were found:
             <ockam::Error as From<CredentialError>>
             <ockam::Error as From<NodeError>>
             <ockam::Error as From<OckamError>>
             <ockam::Error as From<ockam_core::routing::error::RouteError>>
           and 7 others
   = note: required by `from`

Update:

I think we need to take a deeper look at the design of our Errors so that they carry source, backtrace, cause etc. as they propagate within Ockam or surface to a user. We chose the code and domain approach because we’ll need to expose them over FFI. But that shouldn’t hurt the experience of Rust users. Rust Users should get source, backtrace, cause etc.

About this issue

  • Original URL
  • State: open
  • Created 3 years ago
  • Comments: 24 (22 by maintainers)

Most upvoted comments

I’ve been thinking about this a bit, and feel like I should write down my thoughts in case I’m getting something massively wrong about the ockam codebase — totally possible, I’m very new. A lot of this is informed by my work on the stdlib and mozilla/application-services (which had some similar constraints, but, well, fewer).

I think a design like the std::io::Error type is probably most appropriate for this. This should allow us to avoid giving up the ability to expose things on the other side of the FFI cleanly (at least in some cases). In particular, the fact that in io::Error, you always have an ErrorKind — for our errors we could enforce always having an error code and a domain (which I would change to a &'static str, or even a &'static CStr, or something that wraps it anyway, for FFI compat).

It hopefully also would allow us to avoid allocation in many cases, see the internal io::Error::const_new function (or if https://github.com/rust-lang/rust/pull/87869 has landed, io::const_io_error!… unless it has been renamed, which seems possible), and even for the custom error case, eventually (perhaps not immediately), something based on unsize or stackbox could be used to have a Box<dyn Error+Send+Sync> that doesn’t use the heap (it’s also possible that heapless has options for non-heap-allocating boxed trait objects, I haven’t looked). It would also allow natural support for backtraces, of course, on platforms and configurations where they’re available.

Unfortunately, the approach you get with something like thiserror on an enum tends to be better if you don’t have to ferry user errors through it. It also grows fairly large over time, and is hard to expose to the FFI, except manually. We could probably design our own proc macro for this (mostly for assigning error codes, which is otherwise tedious) and I do still consider that an option (I haven’t made up my mind here), but… My experience is it’s a bit of a headache, and in https://github.com/mozilla/application-services became a bit unwieldy. This approach feels sloppy and bad to use from user code unless you curtail that, which can be very difficult (although it’s probably the cleanest and simplest to write, ironically).

Also, direct use of anyhow (or similar, such as eyre) from library crates isn’t great, since those types don’t implement Error (they can’t for coherence reasons), which means users can’t wrap them. While we could invent something similar (I’m pretty familiar with the internals, and have patches in anyhow’s low level guts), it wouldn’t get the ergonomics we need, and I’m unsure that we could do it in a way that maintians both the no_std support we want, as well as the ability to expose to the FFI.

All that said, one constraint I haven’t fully worked through is serialization. In #967 the errors were reworked due to the need to be deserializable, and that… complicates the design somewhat, especially the desire to have static data. Hrm.

Sorry that this is a bit disjointed, I felt like getting it down when I had the thoughts rather than waiting until morning (which may have been advisable).

(This is slightly related to #1562, which is partially about FFI error ergonomics)

Some further thoughts on this came to me over the weekend.

For one, I don’t really know what the use of the error domain is, besides helping out the other side of the FFI, perhaps.

The second takes some explanation:

One hesitation I have with the io::Error-like scheme is that it won’t work with ?. I alluded to this kind of issue before by mentioning that anyhow/eyre can’t implement Error for coherence reasons, but essentially the issue is:

In order for ? to work, some error type (lets call it MyError but obviously in this case it would be ockam_core::Error — I’m being vague just because anyhow::Error and eyre::Error hit the same issue) needs to be From<UserErrorType>. If you want to handle this for all possible user error types, you want to do this as

// This impl is referenced to later as (*1)
impl<E: std::error::Error> From<E> for MyError { ... snip ... }

Then, to allow users to interact with our errors nicely, we need to implement std::error::Error. This will let them put it in a type-erased error container like Box<dyn Error>, io::Error, or even {anyhow, eyre}::Error (or failure, from back in the day):

// This impl is referenced to later as (*2)
impl std::error::Error for MyError { ... snip ... }

Unfortunately, we can’t. Consider the case above where we invoke (*1) on MyError, that is, we’re calling From<MyError> for MyError. Well, there are actually two of these. There’s the (*1) we wrote above, and then there’s one in the stdlib. Every type T already impls From<T>. This is partially so that x.into() is a noop if you’re already the required type, and generally is useful for generic programming. The coherence issue is that you can’t have both of these.

(There are additional complexities too — From<E> for Box<dyn Error> existing tends not to help matters when trying to work around this issue…)

There are two solutions to this:

  1. Get rid of From<E: Error> (e.g. (*1)), the way io::Error does, and have users go through an Error::new-constructor which hurts ergonomics.
  2. Get rid of impl std::error::Error (e.g. (*2)), the way anyhow/eyre (or failure in the past…) do, and make it harder for your users to “get away” from your error type. I think there might be things we can do here, but we don’t really want to be in a space competing with anyhow/eyre on our error type — it would be needless.

I’ve been leaning towards 1, part of this is that I don’t know how we can ever have a From impl that assigns the correct Error domain, and assigns a reasonable Error code that the user chooses1

But the downside here is that it’s friction for anybody writing an ockam::Worker implementation — kind of a lot of friction. Users would have to quite frequently do .map_err(|e| ockam::Error::new(e, ...))? everywhere you need to return an error inside an implementation of your ockam::Worker, which kind of sucks2. However, the solution-space here could just be to tweak the Worker trait, or the API on Context that accepts a worker… Which makes me think that this is related to the problems in https://github.com/ockam-network/ockam/issues/1564 or at least could share part of a solution.

That is, one way you could imagine this working is:

  1. We add an type Error: std::error::Error + Send + Sync to ockam::Worker.
  2. Users write code for an ockam::Worker<Error = UserErrorType, ...>.
  3. We automatically rewrite it to use ockam::Error by converting the return values, which effectively type-erases the Error parameter.

This still has a few issues. The main one being that now we might end up with ockam::Errors that wrap ockam::Errors, which is kinda janky. In practice it’s not that big of a deal… but an approach based fully on adjusting the #[ockam::worker] (and possibly #[ockam::node]) proc macros might be able to do some magic under the hood to avoid it (faking specialization via autoref, for example).

I haven’t gotten that far in experimenting here (it basically came to me over the weekend that this approach might be solve more problems), but so far it seems promising.


1: I’m also not sure how to ensure users pick a good error code, though — even as it is we probably have cases where the error code contains a u32 payload, which depending on the value, can theoretically cause it to overlap with another error code. It’s not clear this can happen in practice, though, probably not.

I also have been imagining that maybe implementing a proc-macro that auto-assigns error domains / codes like #[ockam::error(code = ...)], similar to a thiserror-style thing. I’m not sure this is worth the trouble, though.

2: It’s not as bad for io::Error of course, because you don’t implement the io traits yourself very often, you just use them. Whereas, I get the impression that implementing ockam::Workers is a significant part of the well, “whole point”.

Multiple messages are great. Don’t hold back … your train of thought is insightful and I’m learning a lot. Same for thoughts above from @antoinevg & @adrianbenavides.

All that said, one constraint I haven’t fully worked through is serialization. In #967 the errors were reworked due to the need to be deserializable, and that… complicates the design somewhat, especially the desire to have static data. Hrm.

Having slept on it, I think there are two approaches here I hadn’t really considered yesterday (and it’s possible that the right solution will involve both in some capacity).

  1. Use heapless::String. Even if we have the stdlib, it doesn’t seem unreasonable to put a max length on the “error domain” string, although maybe I’m mistaken here.

  2. Have deserialization convert a &'static str into a String.

For example:

// This is a private inner type that isn't exposed outside of ockam_core.
enum Repr {
   Static { code: u32, domain: &'static str, ... },
   NonStatic { code: u32, domain: String, ... },
   // ...
}

When serializing Repr::Static, it would contain data independent of which variant it used, but it would deserialize as Repr::NonStatic. This means we’d need to write manual Serialize/Deserialize implementations, but I don’t think this is the end of the world.

A variant of this approach is to pack the &'static str and String into one bundle, similar to a Cow<'static, str>. (Other crates to see along these lines are https://github.com/thomcc/arcstr, and https://github.com/maciejhirsz/beef, although I don’t think of these we’d actually use).

I still have to work through my thoughts here on whether or not this actually buys us anything (aside from some perf), but it’s an interesting consideration I hadn’t thought of when making the writeup last night.

(Sorry for all the messages, this is both one of my first issues here, and also something I think is critically important to having a good-feeling API, so I’d rather over-share my thought process than under-share it. It also serves to document some of the logic for future users)

Yep, it’s a great and easily-overlooked example of Rust API design (and matklad’s whole blog is top-notch, I highly recommend it).

Great article about std::io::Error for everyone (like me!) who didn’t realize just how neat the implementation is:

https://matklad.github.io/2020/10/15/study-of-std-io-error.html

https://github.com/ockam-network/ockam/issues/1655#issuecomment-894653125 @mrinalwadhwa In your example, you are writing an application so I don’t think you should return ockam::Result but anyhow::Result

+1 to use anyhow when writing client applications.

https://github.com/ockam-network/ockam/issues/1655#issuecomment-896053001 So the big question is, without losing these advantages, how can we get features like:

  • backtraces / error-chaining
  • converting a numeric error code back to a string representation of its originating enum type
  • human-readable error messages

I agree with @antoinevg that the best thing for now is to wait. Given my recent experience (which you can read below), points 1 (backtraces) and 2 (u32 to strrepr) should be easier down the road if errors don’t contain parameters. Finally, point 3 (human-readable error messages) should be feasible if the core::fmt::Display trait is implemented, right?


Here’s my two cents on whether enum’s variants should contain parameters to build custom messages or not and how this impact the above three points.

In the project I’ve been working on for the past two years (close to 100k lines) we use thiserror. We don’t use any feature other than the Error and Display trait implementations to save us from writing some boilerplate code. That is, we don’t make use of the backtrace or source features.

We follow a similar approach to what’s done in the ockam crates. We have a struct like the following:

pub struct Error {
    code: ErrorCode,
    friendly_message: String,
    internal_message: String,
}

where:

  • ErrorCode: is a string code, where multiple errors can return the same code (project requirements)
  • internal_message: prints the Display implementation, which contains a detailed description of the error
  • friendly_message: prints the message we want to display to the final user (API), usually a simplified version of the internal_message

Then we have a bunch of enums with the errors of a given layer (api, commands, infrastructure, core, …) implementing the From<AbcdError> for Error, where each variant can have arbitrary parameters. For example:

#[derive(thiserror::Error, Debug)]
pub enum CustomError {
    #[error("score below threshold [expected={}][actual={}]", _0, _1)]
    ScoreBelowThreshold(u32, u32),
    #[error("unknown error")]
    Unknown,
}

We really struggled for a while deciding how to deal with error handling. We were also new to Rust. And, honestly, we are not 100% happy with this approach, but it has proved to be pretty solid so far. Having a good error handling strategy can make a huge difference in the developer experience, that’s for sure.

The biggest disadvantage we’ve found: variants with parameters make debugging easier but are limiting if you want to use thiserror’s source and backtrace features, for example. Having parameters also prevents us from converting an ErrorCode back to the original error. So unless it’s strictly necessary, I would use the parameters to build log messages and keep the errors variants without parameters.

Hey, I would like to help implement this, but I’m new to ockam. I see the error definition has an error code and domain string.

pub struct Error {
    code: u32,

    #[cfg(feature = "std")]
    domain: String,
}

What should those be when there is a std::io::Error?

Over the weekend I’ve been revisiting the error design, which has been fairly fruitful — I didn’t really understand the issues or design constraints last time… and I think the proposals I made would not have helped.

Or, they’d help a little, but would have left many of the same problems that exist in the current design[^1]. (In my defense, it was my first task).

[^1]: For example, the error codes would be just as bad. A proc-macro to help generating them would solve essentially none of their problems, even if it would be nifty. Backtrace/SpanTrace would be nice to have (they are are better than nothing), but they are not going be magic — our notion of a “call” for us is not the same as a stack frame, and its probably not a tracing span either… Etc

Anyway, after a lot more understanding of ockam, how it’s used, and its constraints, and a bunch of research… I think I have come to a good understanding about the use cases ockam_core::Error needs to support, and how to get there.

Note that… this is mostly ignoring things that I think are bigger issues that I intend to address in another write up (that I have in progress), like swallowing errors, serialization, and several more.

Anyway, here’s my current thinking about what ockam_core::Error should be good at:

ockam_core::Error requirements

ockam_core::Error has (at least) three important use cases:

  1. Communicating errors in a way that other code could interpret and interpret and handle. This should work across API and protocol boundaries, even to remote machines.

    The intent behind the current error design seems to address this case, as you need to know both of those things. The implementation is very hard to use in practice though, and having only this seems to cause problems.

    It is worth noting, that the current node implementation ignores errors from message handling after logging them. In order to support doing better things than this, the system probably needs to be able to distinguish between different severities of errors.

    That said, I have another writeup on this that I’m working on, so I won’t get into it here too much.

  2. Communicating an error in a way that someone who is not an expert in ockam[^2] can understand the error quickly, and if the error indicates a problem they need to fix, resolve the problem easily – even if they are .

    This also should work across remote machines, but probably doesn’t need to be that much more than decent error message.

    Needless to say, we ockam_core::Error’s current implementation does not help here.

  3. Communicating an error with sufficient context and detail so that a developer working with or debugging Ockam (or ockam-using code) can identify and fix the problem.

    Largely speaking, this needs to work very well on the same machine, but (at least in debug builds or when an environment variable is enabled), it should have some support for transmitting it remotely. (That said, I think most of the unserializable stuff here can be handled by converting to a string or similar).

    This is also only relevant for certain kinds of errors — routine I/O failures probably don’t need this kind of handling, at least by default

    As you probably already know, ockam_core::Error currently has no support for this.

[^2]: Perhaps they’re a developer getting started with it, someone who is operating a ockam-using tool, or invoking the Ockam CLI tool.

How to get there (… roughly)

(Caveat: I have a bit more of the design worked out than this, enough to have cleared up most holes, but still not 100% and probably am not summarizing it in enough details anyway)

I think all three of these have different solutions. My current thinking for how to handle the use cases above is (in reverse order):

  • For the 3rd of those — lots of context aimed at developers debugging Ockam — is the easiest[^3], since it’s what the Rust ecosystem has focused most on. Location::caller() and #[track_caller] to get the file/line/etc where the error is generated, and even collecting a Backtrace and/or SpanTrace at the site of capture if an env var is flipped. (Anyway, I’m aware that I’m handwaving this one, and my plan here is actually pretty cool, but I haven’t finished it and want to move on)

[^3]: Assuming fixing the log-and-ignore error handling we have in some places.

  • For the 2nd — error messages a technical non-expert can understand — probably requires either requesting an explicit error message. It might be enough to accept a cause error too. This seems an easy-ish one, and will help a lot especially moving forward, but if you can’t tell

  • Finally, the first despite “sort of working for that use case” has several very significant areas where it can be improved — I don’t think our crates should be in charge of assigning their own error codes for the most part (and I think we want to store real enums, meaning we get a human readable name for the code, etc).

    Fixing this without causing issues for FFI and/or targets that can’t rely on the message/metadata being present is… the hardest — see Appendix A for the current sketch of the design.

Anyway, thats all I have now.

Appendix: Replacing ockam_core::Error::code.

Basic idea: two new enums (at least) are added to ockam_core that describe “canonical error codes” and “canonical error origins”. Something like:

  • enum ockam_core::ErrorOrigin, which describes the component at the level of detail like Transport, Vault, Node, etc (probably things like Application and Unknown too).

    We can probably automatically assign this in most cases based on the module_path of the code that produced the error, which we should be able to get.

    Note that this explicitly is not at the level of granularity of crates, modules, or even conceptual components — (e.g, ockam_transport_tcp and ockam_transport_websocket are both ErrorOrigin::Transport)

  • enum ockam_core::ErrorKind, which looks a lot like std::io::ErrorKind, but with a set of errors likely to happen inside ockam, rather than a general set for all I/O.

The main idea behind these is that most code that checks errors will probably prefer writing checks like

use ockam_core::{ErrorKind, ErrorOrigin};
let is_transpor_timeout = (
    e.origin() == ockam_core::ErrorOrigin::Transport &&
    e.kind() == ockam_core::ErrorKind::Timeout
);
if is_transport_timeout { ... }

which then handles all transports, rather than having to figure out which error code number comes from TCP, which from WS, etc.

Nice to see love for dtolnay/thiserror and dtolnay/anyhow! IMO they’re excellent defaults when it comes to Rust error handling.

That said, Ockam has some special needs.

@mrinalwadhwa already mentioned the need to support FFI but our existing error handling strategy also has a number of important advantages for targeting embedded (no_std) platforms:

  1. The code: u32 field only consumes 4 bytes and the domain: String field is optional. This matters because we want to be able to target resource constrained platforms. 1
  2. We rely on explicit conversion to a universal ockam_core::Error type which allows us to have per-crate error definitions while avoiding the Box<dyn std::error::Error> pattern. This matters because no_std:
    • does not have access to the std::error::Error trait and
    • does not have access to a heap allocated Box type. 2

So the big question is, without losing these advantages, how can we get features like:

  • backtraces / error-chaining
  • converting a numeric error code back to a string representation of its originating enum type
  • human-readable error messages

It would be nice to also support these on no_std but, just as with FFI targets, it’s not the end of the world if we can’t have all of them.

i.e. Neither thiserror or anyhow 3 provide full no_std support but this is not necessarily a problem if it’s possible to integrate them with our existing error implementation in a way that won’t require a bunch of #[cfg(not(feature="std"))] boilerplate to avoid breaking no_std builds.

In general, the situation is not great when it comes to idiomatic unified error-handling and no_std and is unlikely to improve in the future:

So maybe we should also be looking at using a collection of simple crates & code to implement our wish list individually?

A couple that spring to mind are:

  • yaahc/displaydoc - provides thiserror-ish human-readable errors via a doc comment and string interpolation (Also has no_std support)
  • core-error/core-error - an attempt to provide a unified Error trait implementation for no_std
  • richardanaya/no_error - A simple error library for no_std + no_alloc Rust

Can anyone suggest some more?

Finally, Rust’s error-handling story is starting to solidify but there are still a few big changes coming (e.g. RFC-2504) so it would be nice to not lock the library or dependent apps into anything that’s going to be deprecated.


[1] One change I would like to see here is to define domain as a &'static str rather than String as, again, String needs alloc when we’re targeting no_std.

[2] We can get Box via alloc but requiring alloc for something as fundamental as Error handling is probably not a great idea.

[3] The anyhow crate does provide partial no_std support but this comes at the cost of relying on alloc.

As for extending Error information with stacktrace and additional data, that would be great, but I’m not very familiar with how that works and how that can or can’t be compatible with our Actor framework. There’s some research work to be done by someone

@mrinalwadhwa I’m not sure we should implement Fromstd::io::Error for ockam::Error. That may lead to worse errors experience. E.g. in this case when std::io::Error is thrown, from a function caller side I would expect error that says “Invalid forwarder address user input” or something like that, but if we implement default conversion from std::io::Error caller will get something like “IO error”, which is less informative.

@mcastorina thank you for picking this up 🙏 @jared-s @SanjoDeundiak @antoinevg what are your thoughts on good values here.

I think we need to take a deeper look at the design of our Errors so that they carry source, backtrace, cause etc. as they propagate within Ockam or surface to a user. We chose the code and domain approach because we’ll need to expose them over FFI. But that shouldn’t hurt the experience of Rust users. Rust Users should get source, backtrace, cause etc.