libzmq: PUB crash when SUB exceeded SNDHWM


Issue description

An assertion in mtrie.cpp fails when all of the following conditions hold:

  • A PUB socket is connected to many SUB sockets.
  • A SUB socket subscribes to and unsubscribes from many prefixes.
  • zmq_getsockopt() is called with ZMQ_EVENTS on the SUB sockets.

Assertion failed: erased == 1 (src/mtrie.cpp:297)
[1]    30266 abort (core dumped)  ./a.out

Environment

  • libzmq version (commit hash if unreleased): 4.2.0 and 4.2.3
  • OS: Ubuntu 16.04 LTS

Minimal test code / Steps to reproduce the issue

To reproduce the crash, create one PUB socket and many SUB sockets.

The test runs this sequence (pseudo-code), with many prefixes to subscribe and unsubscribe:

  • pub.connect(sub) or sub.connect(pub)
  • pub.getsockopt(ZMQ_EVENTS)
  • sub.subscribe(prefix)
  • sub.getsockopt(ZMQ_EVENTS)
  • sub.unsubscribe(prefix)
  • sub.getsockopt(ZMQ_EVENTS)

Calling getsockopt(ZMQ_EVENTS) after a SUB socket's SUBSCRIBE/UNSUBSCRIBE, or after the PUB socket's zmq_connect(), crashes with an assertion failure in mtrie_t::rm_helper.

The pub_to_sub variable switches the PUB<->SUB connection topology.

#include "zmq.h"
#include <stdio.h>

// Set 1 or 0 to switch the PUB<->SUB connection topology.
static int pub_to_sub = 1;

void gen_topic(int n, char* topic)
{
    // Simple hash function to generate a subscription prefix from a number.
    // Unsigned arithmetic avoids signed overflow; topic must hold 9 bytes
    // (8 hex digits plus the terminating NUL).
    unsigned int h = (unsigned int) n * 2654435761u;
    snprintf(topic, 9, "%08x", h);
}

void getsockopt_events_within_many_subscriptions(void* sub)
{
    char topic[9];  // 8 hex digits + NUL
    int events;     // ZMQ_EVENTS is an int bitmask
    size_t events_len = sizeof(events);

    for (int j = 0; j < 10000; ++j)
    {
        gen_topic(j, topic);
        zmq_setsockopt(sub, ZMQ_SUBSCRIBE, topic, 8);
        // CRASH: Get ZMQ_EVENTS from a SUB socket.
        zmq_getsockopt(sub, ZMQ_EVENTS, &events, &events_len);
    }
    for (int j = 0; j < 10000; ++j)
    {
        gen_topic(j, topic);
        zmq_setsockopt(sub, ZMQ_UNSUBSCRIBE, topic, 8);
        // CRASH: Get ZMQ_EVENTS from a SUB socket.
        zmq_getsockopt(sub, ZMQ_EVENTS, &events, &events_len);
    }
}

int main()
{
    printf("%d.%d.%d\n", ZMQ_VERSION_MAJOR, ZMQ_VERSION_MINOR, ZMQ_VERSION_PATCH);

    void *context = zmq_ctx_new();
    void *pub = zmq_socket(context, ZMQ_PUB);
    void *sub;

    char addr[256]; size_t addr_len = 256;
    char opt[256];  size_t opt_len  = 256;

    if (pub_to_sub)
    {
        // PUB->SUB
        for (int i = 0; i < 100; ++i)
        {
            sub = zmq_socket(context, ZMQ_SUB);

            zmq_bind(sub, "tcp://127.0.0.1:*");
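            // zmq_getsockopt() overwrites addr_len with the actual length,
            // so reset it to the buffer size before every call.
            addr_len = sizeof(addr);
            zmq_getsockopt(sub, ZMQ_LAST_ENDPOINT, addr, &addr_len);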
            zmq_connect(pub, addr);

            getsockopt_events_within_many_subscriptions(sub);
        }
    }
    else
    {
        // SUB->PUB
        zmq_bind(pub, "tcp://127.0.0.1:*");
        zmq_getsockopt(pub, ZMQ_LAST_ENDPOINT, addr, &addr_len);
        for (int i = 0; i < 100; ++i)
        {
            sub = zmq_socket(context, ZMQ_SUB);

            zmq_connect(sub, addr);

            getsockopt_events_within_many_subscriptions(sub);

            // CRASH: Get ZMQ_EVENTS from the PUB socket.
            zmq_getsockopt(pub, ZMQ_EVENTS, opt, &opt_len);
        }
    }
}
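
The subscription prefixes in the test are eight hex digits long, so any buffer filled by a printf-style call needs nine bytes once the terminating NUL is counted. The following is a minimal standalone sketch of that prefix generation, independent of libzmq. The names make_prefix and PREFIX_LEN are hypothetical and are not part of the original report; the multiplier is the one used by gen_topic.

```c
#include <assert.h>
#include <stdint.h>
#include <stdio.h>

/* Length of a prefix in bytes, excluding the terminating NUL. */
#define PREFIX_LEN 8

/* Hypothetical helper: writes the 8-hex-digit prefix for n into out.
   out_size must be at least PREFIX_LEN + 1 to leave room for the NUL.
   Returns the number of characters written, excluding the NUL. */
static int make_prefix(uint32_t n, char *out, size_t out_size)
{
    assert(out_size >= PREFIX_LEN + 1);
    /* Unsigned multiplication wraps modulo 2^32, which is well defined. */
    uint32_t h = (uint32_t) (n * UINT32_C(2654435761));
    return snprintf(out, out_size, "%08x", (unsigned) h);
}
```

A caller would declare char prefix[PREFIX_LEN + 1], call make_prefix(j, prefix, sizeof prefix), and pass PREFIX_LEN as the option length to zmq_setsockopt().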

What’s the actual result? (include assertion message & call stack if applicable)

$ gcc zmq_events_crash.c -L ~/usr/local/lib -lzmq && ./a.out
4.2.3
Assertion failed: erased == 1 (src/mtrie.cpp:297)
[1]    30266 abort (core dumped)  ./a.out

What’s the expected result?

$ gcc zmq_events_crash.c -L ~/usr/local/lib -lzmq && ./a.out
4.2.3
$ echo $?
0

When SUB sockets connect to the PUB socket, this crash doesn’t happen.

About this issue

  • State: closed
  • Created 6 years ago
  • Comments: 60 (29 by maintainers)

Most upvoted comments

@bluca Maybe I'll find some time tomorrow to add sufficient tests, so that we can discuss the consistency of mtrie's behaviour.

At the moment my impression is that the assertion is too strict within mtrie, but it may well be worth an assertion at the call site. I did not dig into the larger picture yet.
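
To illustrate the trade-off described in this comment, here is a toy sketch in C. It is not libzmq's mtrie_t, and every name in it (pipe_set, pipe_set_add, pipe_set_rm, MAX_PIPES) is hypothetical. The removal function reports whether anything was erased instead of asserting, which leaves the decision about strictness to the caller.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define MAX_PIPES 16

/* Toy stand-in for the set of pipes attached to one prefix node. */
typedef struct {
    int pipes[MAX_PIPES];
    size_t count;
} pipe_set;

/* Adds pipe to the set. Returns false if it is already present
   or the set is full. */
static bool pipe_set_add(pipe_set *s, int pipe)
{
    for (size_t i = 0; i < s->count; ++i)
        if (s->pipes[i] == pipe)
            return false;
    if (s->count == MAX_PIPES)
        return false;
    s->pipes[s->count++] = pipe;
    return true;
}

/* Removes pipe from the set. Returns true if it was erased and
   false if it was absent; it never aborts. */
static bool pipe_set_rm(pipe_set *s, int pipe)
{
    for (size_t i = 0; i < s->count; ++i) {
        if (s->pipes[i] == pipe) {
            /* Fill the hole with the last element. */
            s->pipes[i] = s->pipes[--s->count];
            return true;
        }
    }
    return false;
}
```

A call site whose invariants guarantee that the pipe is present can still write bool erased = pipe_set_rm(&s, pipe); assert(erased); while a call site that may see duplicate removals can ignore the return value.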

I renamed the branch 😉