dubbo: "No provider available in XXXX" RpcException after server IO Busy

Dubbo Version

dubbo-2.5.9。But we modified the default SPI of the Transporter, using netty4:

@SPI("netty4")
public interface Transporter{
    
}

The issue

The service registry has current provider information of the service, but when the service consumer invokes the service, it reports the RpcException of “No provider available in XXXX”.

Our analysis process

Step 1

We use the gcore command to dump the memory of jvm, and use Eclipse MAT analysis it, find a large number of connections with abnormal state. IN com.alibaba.dubbo.remoting.transport.netty4.NettyClient, closed=false but channel.ch.state=4. This caused DubboInvoker fail to get available providers. The order of judgment is as follows:

/*
  com.alibaba.dubbo.rpc.cluster.support.AvailableCluster
*/
    public <T> Invoker<T> join(Directory<T> directory) throws RpcException {
        return new AbstractClusterInvoker<T>(directory) {
            public Result doInvoke(Invocation invocation, List<Invoker<T>> invokers, LoadBalance loadbalance) throws RpcException {
                for (Invoker<T> invoker : invokers) {
                    if (invoker.isAvailable()) {//ㄟ(▔▽▔)ㄏ
                        return invoker.invoke(invocation);
                    }
                }
                throw new RpcException("No provider available in " + invokers);
            }
        };
    }
/*
  com.alibaba.dubbo.registry.integration.RegistryDirectory
*/
    public boolean isAvailable() {
        if (isDestroyed()) {
            return false;
        }
        Map<String, Invoker<T>> localUrlInvokerMap = urlInvokerMap;
        if (localUrlInvokerMap != null && localUrlInvokerMap.size() > 0) {
            for (Invoker<T> invoker : new ArrayList<Invoker<T>>(localUrlInvokerMap.values())) {
                if (invoker.isAvailable()) {//ㄟ(▔▽▔)ㄏ
                    return true;
                }
            }
        }
        return false;
    }
/*
  com.alibaba.dubbo.rpc.protocol.dubbo.DubboInvoker
*/
    public boolean isAvailable() {
        if (!super.isAvailable())
            return false;
        for (ExchangeClient client : clients) {
            if (client.isConnected() && !client.hasAttribute(Constants.CHANNEL_ATTRIBUTE_READONLY_KEY)) {//ㄟ(▔▽▔)ㄏ
                return true;
            }
        }
        return false;
    }
/*
  com.alibaba.dubbo.remoting.transport.netty4.NettyChannel
*/
    public boolean isConnected() {
        return channel.isActive();//ㄟ(▔▽▔)ㄏ
    }
/*
  io.netty.channel.socket.nio.NioSocketChannel
*/
    public boolean isActive() {
        java.nio.channels.SocketChannel ch = this.javaChannel();
        return ch.isOpen() && ch.isConnected();//ㄟ(▔▽▔)ㄏ
    }
/*
  sun.nio.ch.SocketChannelImpl
*/
    public boolean isConnected() {
        synchronized(this.stateLock) {
            return this.state == 2;//ㄟ(▔▽▔)ㄏ
        }
    }

At the same time, this abnormal close=false state also causes the HeartBeatTask to fail to handle reconnection.

/*
  com.alibaba.dubbo.remoting.exchange.support.header.HeartBeatTask
*/
    public void run() {
        try {
            long now = System.currentTimeMillis();
            for (Channel channel : channelProvider.getChannels()) {
                if (channel.isClosed()) {//ㄟ(▔▽▔)ㄏ
                    continue;
                }
                //SKIPPPP
        } catch (Throwable t) {
            logger.warn("Unhandled exception when heartbeat, cause: " + t.getMessage(), t);
        }
    }

Step 2

We analyzed the server’s nmon file and found DISKBUSY=100%, CPU=100%, NETWORK≈0 for up to ten minutes before the problem occurred.

Step 3

Further analysis of Dubbo’s heartbeat reconnection mechanism:

If the read time on the channel exceeds 3 heartbeats (default 1 minute) is not updated, the consumer will reconnect the connection of NettyClient. Reconnect process will first use disconnect() to close the existing connection and establish a new connection via connect().

/*
  com.alibaba.dubbo.remoting.exchange.support.header.HeartBeatTask
*/
    if (lastRead != null && now - lastRead > heartbeatTimeout) {
        logger.warn("Close channel " + channel
                + ", because heartbeat read idle time out: " + heartbeatTimeout + "ms");
        if (channel instanceof Client) {
            try {
                ((Client) channel).reconnect();//ㄟ(▔▽▔)ㄏ
            } catch (Exception e) {
                //do nothing
            }
        } else {
            channel.close();
        }
    }
/*
  com.alibaba.dubbo.remoting.transport.AbstractClient
*/
    public void reconnect() throws RemotingException {
        disconnect();//ㄟ(▔▽▔)ㄏ
        connect();
    }
    
    public void disconnect() {
        connectLock.lock();
        try {
            destroyConnectStatusCheckCommand();
            try {
                Channel channel = getChannel();
                if (channel != null) {
                    channel.close();//ㄟ(▔▽▔)ㄏ
                }
            } catch (Throwable e) {
                logger.warn(e.getMessage(), e);
            }
            try {
                doDisConnect();
            } catch (Throwable e) {
                logger.warn(e.getMessage(), e);
            }
        } finally {
            connectLock.unlock();
        }
    }
/*
  com.alibaba.dubbo.remoting.transport.netty4.NettyChannel
*/
    public void close() {
        try {
            super.close();
        } catch (Exception e) {
            logger.warn(e.getMessage(), e);
        }
        try {
            removeChannelIfDisconnected(channel);
        } catch (Exception e) {
            logger.warn(e.getMessage(), e);
        }
        try {
            attributes.clear();
        } catch (Exception e) {
            logger.warn(e.getMessage(), e);
        }
        try {
            if (logger.isInfoEnabled()) {
                logger.info("Close netty channel " + channel);
            }
            channel.close();//ㄟ(▔▽▔)ㄏ
        } catch (Exception e) {
            logger.warn(e.getMessage(), e);
        }
    }

We look at Netty4’s disconnect() processing, which is quite different from Netty3.:

In Netty3, close() sends an OUT event before returning the Futrue object, and the OUT event directly sets the state of the channel to ST_CLOSED.。

/*
  org.jboss.netty.channel.Channels
*/
    public static ChannelFuture close(Channel channel) {
        ChannelFuture future = channel.getCloseFuture();
        channel.getPipeline().sendDownstream(new DownstreamChannelStateEvent(channel, future, ChannelState.OPEN, Boolean.FALSE));
        return future;
    }

Netty4 puts the OUT event into the IO thread and immediately returns the Future object. At this point, it is determined that the channnel state is still available, that is, channel.isActive() returns true.

/*
  io.netty.channel.AbstractChannelHandlerContext
*/
    public ChannelFuture close(final ChannelPromise promise) {
        if (!this.validatePromise(promise, false)) {
            return promise;
        } else {
            final AbstractChannelHandlerContext next = this.findContextOutbound();
            EventExecutor executor = next.executor();
            if (executor.inEventLoop()) {
                next.invokeClose(promise);
            } else {
                safeExecute(executor, new OneTimeTask() {
                    public void run() {
                        next.invokeClose(promise);
                    }
                }, promise, (Object)null);
            }
            return promise;
        }
    }

Our Conclusions

AbstractClient.reconnect() expects to reestablish the connection immediately after disconnecting. This mode does not cause logic problems under Netty3. However, under Netty4, the asynchronous processing mode makes it easy to execute the sequence of AbstractClient.disconnect()–>AbstractClient.connect()–>run(){next.invokeClose(promise)} Especially when there is a transient exception in server IO.

We have simulated the phenomenon of closed=false but channel.ch.state=4 through simple code, and can revisit the scene in the test environment by limiting the server IO.

About this issue

  • Original URL
  • State: closed
  • Created 5 years ago
  • Reactions: 3
  • Comments: 18 (12 by maintainers)

Most upvoted comments

Version

dubbo-2.5.9。我们修改了Transporter的默认SPI,使用netty4:

@SPI("netty4")
public interface Transporter{

}

Issue

服务注册中心有服务的提供者信息,且提供者配置都正常,但服务消费者调用服务时,报"No provider available in XXXX"的RpcException。

Analysis

Step 1

通过gcore命令dump了jvm的内存,使用mat分析发现大量异常状态的连接,即com.alibaba.dubbo.remoting.transport.netty4.NettyClient中,closed=false但channel.ch.state=4。这就造成DubboInvoker在判断服务提供方是否可用时,都不可用。其判断顺序如下:

/*
  com.alibaba.dubbo.rpc.cluster.support.AvailableCluster
*/
    public  Invoker join(Directory directory) throws RpcException {
        return new AbstractClusterInvoker(directory) {
            public Result doInvoke(Invocation invocation, List> invokers, LoadBalance loadbalance) throws RpcException {
                for (Invoker invoker : invokers) {
                    if (invoker.isAvailable()) {//ㄟ(▔▽▔)ㄏ
                        return invoker.invoke(invocation);
                    }
                }
                throw new RpcException("No provider available in " + invokers);
            }
        };
    }
/*
  com.alibaba.dubbo.registry.integration.RegistryDirectory
*/
    public boolean isAvailable() {
        if (isDestroyed()) {
            return false;
        }
        Map> localUrlInvokerMap = urlInvokerMap;
        if (localUrlInvokerMap != null && localUrlInvokerMap.size() > 0) {
            for (Invoker invoker : new ArrayList>(localUrlInvokerMap.values())) {
                if (invoker.isAvailable()) {//ㄟ(▔▽▔)ㄏ
                    return true;
                }
            }
        }
        return false;
    }
/*
  com.alibaba.dubbo.rpc.protocol.dubbo.DubboInvoker
*/
    public boolean isAvailable() {
        if (!super.isAvailable())
            return false;
        for (ExchangeClient client : clients) {
            if (client.isConnected() && !client.hasAttribute(Constants.CHANNEL_ATTRIBUTE_READONLY_KEY)) {//ㄟ(▔▽▔)ㄏ
                return true;
            }
        }
        return false;
    }
/*
  com.alibaba.dubbo.remoting.transport.netty4.NettyChannel
*/
    public boolean isConnected() {
        return channel.isActive();//ㄟ(▔▽▔)ㄏ
    }
/*
  io.netty.channel.socket.nio.NioSocketChannel
*/
    public boolean isActive() {
        java.nio.channels.SocketChannel ch = this.javaChannel();
        return ch.isOpen() && ch.isConnected();//ㄟ(▔▽▔)ㄏ
    }
/*
  sun.nio.ch.SocketChannelImpl
*/
    public boolean isConnected() {
        synchronized(this.stateLock) {
            return this.state == 2;//ㄟ(▔▽▔)ㄏ
        }
    }

同时,这种异常连接因为closed=false也造成HeartBeatTask无法处理重连

/*
  com.alibaba.dubbo.remoting.exchange.support.header.HeartBeatTask
*/
    public void run() {
        try {
            long now = System.currentTimeMillis();
            for (Channel channel : channelProvider.getChannels()) {
                if (channel.isClosed()) {//ㄟ(▔▽▔)ㄏ
                    continue;
                }
                //SKIPPPP
        } catch (Throwable t) {
            logger.warn("Unhandled exception when heartbeat, cause: " + t.getMessage(), t);
        }
    }

Step 2

分析服务器的nmon文件,发现出现问题前时刻DISKBUSY=100%,CPU=100%,NETWORK≈0。

Step 3

进一步分析Dubbo的心跳重连机制:

如果channel上读取时间超过3个heartbeat(默认1分钟)未更新,消费者会重新建立NettyClient底层连接,重建连接会先吊用disconnect()关闭已有连接,并通过connect()建立新连接。

/*
  com.alibaba.dubbo.remoting.exchange.support.header.HeartBeatTask
*/
    if (lastRead != null && now - lastRead > heartbeatTimeout) {
        logger.warn("Close channel " + channel
                + ", because heartbeat read idle time out: " + heartbeatTimeout + "ms");
        if (channel instanceof Client) {
            try {
                ((Client) channel).reconnect();//ㄟ(▔▽▔)ㄏ
            } catch (Exception e) {
                //do nothing
            }
        } else {
            channel.close();
        }
    }
/*
  com.alibaba.dubbo.remoting.transport.AbstractClient
*/
    public void reconnect() throws RemotingException {
        disconnect();//ㄟ(▔▽▔)ㄏ
        connect();
    }

    public void disconnect() {
        connectLock.lock();
        try {
            destroyConnectStatusCheckCommand();
            try {
                Channel channel = getChannel();
                if (channel != null) {
                    channel.close();//ㄟ(▔▽▔)ㄏ
                }
            } catch (Throwable e) {
                logger.warn(e.getMessage(), e);
            }
            try {
                doDisConnect();
            } catch (Throwable e) {
                logger.warn(e.getMessage(), e);
            }
        } finally {
            connectLock.unlock();
        }
    }
/*
  com.alibaba.dubbo.remoting.transport.netty4.NettyChannel
*/
    public void close() {
        try {
            super.close();
        } catch (Exception e) {
            logger.warn(e.getMessage(), e);
        }
        try {
            removeChannelIfDisconnected(channel);
        } catch (Exception e) {
            logger.warn(e.getMessage(), e);
        }
        try {
            attributes.clear();
        } catch (Exception e) {
            logger.warn(e.getMessage(), e);
        }
        try {
            if (logger.isInfoEnabled()) {
                logger.info("Close netty channel " + channel);
            }
            channel.close();//ㄟ(▔▽▔)ㄏ
        } catch (Exception e) {
            logger.warn(e.getMessage(), e);
        }
    }

查看Netty4的disconnect()处理,与Netty3有较大的区别:

Netty3中close()返回Futrue对象前发送OUT事件,OUT事件将channel的state状态直接置为ST_CLOSED。

/*
  org.jboss.netty.channel.Channels
*/
    public static ChannelFuture close(Channel channel) {
        ChannelFuture future = channel.getCloseFuture();
        channel.getPipeline().sendDownstream(new DownstreamChannelStateEvent(channel, future, ChannelState.OPEN, Boolean.FALSE));
        return future;
    }

Netty4中将OUT事件放到IO线程处理,并立即返回Future对象,此时判断channnel状态依然为可用,即channel.isActive()返回true。

/*
  io.netty.channel.AbstractChannelHandlerContext
*/
    public ChannelFuture close(final ChannelPromise promise) {
        if (!this.validatePromise(promise, false)) {
            return promise;
        } else {
            final AbstractChannelHandlerContext next = this.findContextOutbound();
            EventExecutor executor = next.executor();
            if (executor.inEventLoop()) {
                next.invokeClose(promise);
            } else {
                safeExecute(executor, new OneTimeTask() {
                    public void run() {
                        next.invokeClose(promise);
                    }
                }, promise, (Object)null);
            }
            return promise;
        }
    }

初步猜测

AbstractClient.reconnect()期望的是断开连接后立刻重建连接,这种模式在Netty3下,不会出现逻辑问题。但在Netty4下,采用异步处理的模式,很容易出现AbstractClient.disconnect()–>AbstractClient.connect()–>run(){next.invokeClose(promise)}的执行顺序,特别是在服务器IO出现短暂异常时。

我们已经通过简单代码来模拟出现closed=false但channel.ch.state=4的现象,并且可以在测试环境通过限制服务器IO的方式重现场景。