dubbo: "No provider available in XXXX" RpcException after server IO Busy
Dubbo Version
dubbo-2.5.9, but we modified the default SPI of the Transporter to use netty4:
@SPI("netty4")
public interface Transporter{
}
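As a quick sanity check (our own sketch, not part of the original report), one can ask Dubbo's ExtensionLoader which Transporter the modified @SPI default now resolves to:

public class TransporterSpiCheck {
    public static void main(String[] args) {
        com.alibaba.dubbo.remoting.Transporter t =
                com.alibaba.dubbo.common.extension.ExtensionLoader
                        .getExtensionLoader(com.alibaba.dubbo.remoting.Transporter.class)
                        .getDefaultExtension(); // follows the @SPI("netty4") default
        // Expected: com.alibaba.dubbo.remoting.transport.netty4.NettyTransporter
        System.out.println(t.getClass().getName());
    }
}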
The issue
The service registry holds the provider information for the service and the provider configuration is normal, yet when the consumer invokes the service it throws an RpcException: "No provider available in XXXX".
Our analysis process
Step 1
We dumped the JVM's memory with the gcore command and analyzed it with Eclipse MAT, finding a large number of connections in an abnormal state: in com.alibaba.dubbo.remoting.transport.netty4.NettyClient, closed=false but channel.ch.state=4, i.e. the client does not consider itself closed while the underlying NIO socket channel is already dead. This makes DubboInvoker treat every provider as unavailable. The chain of availability checks is as follows (the ㄟ(▔▽▔)ㄏ marker flags the decisive line in each class):
/*
com.alibaba.dubbo.rpc.cluster.support.AvailableCluster
*/
public <T> Invoker<T> join(Directory<T> directory) throws RpcException {
    return new AbstractClusterInvoker<T>(directory) {
        public Result doInvoke(Invocation invocation, List<Invoker<T>> invokers, LoadBalance loadbalance) throws RpcException {
            for (Invoker<T> invoker : invokers) {
                if (invoker.isAvailable()) { // ㄟ(▔▽▔)ㄏ
                    return invoker.invoke(invocation);
                }
            }
            throw new RpcException("No provider available in " + invokers);
        }
    };
}

/*
com.alibaba.dubbo.registry.integration.RegistryDirectory
*/
public boolean isAvailable() {
    if (isDestroyed()) {
        return false;
    }
    Map<String, Invoker<T>> localUrlInvokerMap = urlInvokerMap;
    if (localUrlInvokerMap != null && localUrlInvokerMap.size() > 0) {
        for (Invoker<T> invoker : new ArrayList<Invoker<T>>(localUrlInvokerMap.values())) {
            if (invoker.isAvailable()) { // ㄟ(▔▽▔)ㄏ
                return true;
            }
        }
    }
    return false;
}

/*
com.alibaba.dubbo.rpc.protocol.dubbo.DubboInvoker
*/
public boolean isAvailable() {
    if (!super.isAvailable())
        return false;
    for (ExchangeClient client : clients) {
        if (client.isConnected() && !client.hasAttribute(Constants.CHANNEL_ATTRIBUTE_READONLY_KEY)) { // ㄟ(▔▽▔)ㄏ
            return true;
        }
    }
    return false;
}

/*
com.alibaba.dubbo.remoting.transport.netty4.NettyChannel
*/
public boolean isConnected() {
    return channel.isActive(); // ㄟ(▔▽▔)ㄏ
}

/*
io.netty.channel.socket.nio.NioSocketChannel
*/
public boolean isActive() {
    java.nio.channels.SocketChannel ch = this.javaChannel();
    return ch.isOpen() && ch.isConnected(); // ㄟ(▔▽▔)ㄏ
}

/*
sun.nio.ch.SocketChannelImpl
*/
public boolean isConnected() {
    synchronized (this.stateLock) {
        return this.state == 2; // ㄟ(▔▽▔)ㄏ 2 = ST_CONNECTED; our dumps showed state = 4
    }
}
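To make the bottom of this chain concrete, here is a small sketch of ours (assuming some listener on 127.0.0.1:8080, a hypothetical address) showing how plain java.nio reports the state that every layer above consumes:

import java.net.InetSocketAddress;
import java.nio.channels.SocketChannel;

public class StateChainDemo {
    public static void main(String[] args) throws Exception {
        SocketChannel ch = SocketChannel.open(new InetSocketAddress("127.0.0.1", 8080));
        // Equivalent to NioSocketChannel.isActive(): true while connected
        System.out.println(ch.isOpen() && ch.isConnected());
        ch.close(); // SocketChannelImpl.state leaves ST_CONNECTED(2)
        // Now isActive() -> isConnected() -> isAvailable() all collapse to false
        System.out.println(ch.isOpen() && ch.isConnected());
    }
}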
At the same time, because these abnormal connections still report closed=false, HeartBeatTask is also unable to recover them by reconnecting:
/*
com.alibaba.dubbo.remoting.exchange.support.header.HeartBeatTask
*/
public void run() {
    try {
        long now = System.currentTimeMillis();
        for (Channel channel : channelProvider.getChannels()) {
            if (channel.isClosed()) { // ㄟ(▔▽▔)ㄏ
                continue;
            }
            // ... heartbeat send and reconnect handling skipped ...
        }
    } catch (Throwable t) {
        logger.warn("Unhandled exception when heartbeat, cause: " + t.getMessage(), t);
    }
}
Step 2
We analyzed the server’s nmon file and found DISKBUSY=100%, CPU=100%, NETWORK≈0 for up to ten minutes before the problem occurred.
Step 3
Further analysis of Dubbo’s heartbeat reconnection mechanism:
If the channel's last-read timestamp is not updated within the read-idle timeout, which defaults to 3 heartbeat intervals (the interval itself defaults to 1 minute), the consumer reconnects the NettyClient. The reconnect first closes the existing connection via disconnect(), then establishes a new one via connect().
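For reference on those defaults (paraphrased from the 2.5.x sources, so details may vary by release): the dubbo protocol puts heartbeat=60000 on the client URL if absent, and HeaderExchangeClient derives the read-idle timeout as 3x the heartbeat:

/*
com.alibaba.dubbo.rpc.protocol.dubbo.DubboProtocol#initClient (paraphrased)
*/
url = url.addParameterIfAbsent(Constants.HEARTBEAT_KEY, String.valueOf(Constants.DEFAULT_HEARTBEAT)); // 60 * 1000 ms

/*
com.alibaba.dubbo.remoting.exchange.support.header.HeaderExchangeClient (paraphrased)
*/
this.heartbeatTimeout = client.getUrl().getParameter(Constants.HEARTBEAT_TIMEOUT_KEY, heartbeat * 3); // 180 000 ms by default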
/*
com.alibaba.dubbo.remoting.exchange.support.header.HeartBeatTask
*/
if (lastRead != null && now - lastRead > heartbeatTimeout) {
    logger.warn("Close channel " + channel
            + ", because heartbeat read idle time out: " + heartbeatTimeout + "ms");
    if (channel instanceof Client) {
        try {
            ((Client) channel).reconnect(); // ㄟ(▔▽▔)ㄏ
        } catch (Exception e) {
            // do nothing
        }
    } else {
        channel.close();
    }
}
/*
com.alibaba.dubbo.remoting.transport.AbstractClient
*/
public void reconnect() throws RemotingException {
    disconnect(); // ㄟ(▔▽▔)ㄏ
    connect();
}

public void disconnect() {
    connectLock.lock();
    try {
        destroyConnectStatusCheckCommand();
        try {
            Channel channel = getChannel();
            if (channel != null) {
                channel.close(); // ㄟ(▔▽▔)ㄏ
            }
        } catch (Throwable e) {
            logger.warn(e.getMessage(), e);
        }
        try {
            doDisConnect();
        } catch (Throwable e) {
            logger.warn(e.getMessage(), e);
        }
    } finally {
        connectLock.unlock();
    }
}
/*
com.alibaba.dubbo.remoting.transport.netty4.NettyChannel
*/
public void close() {
    try {
        super.close();
    } catch (Exception e) {
        logger.warn(e.getMessage(), e);
    }
    try {
        removeChannelIfDisconnected(channel);
    } catch (Exception e) {
        logger.warn(e.getMessage(), e);
    }
    try {
        attributes.clear();
    } catch (Exception e) {
        logger.warn(e.getMessage(), e);
    }
    try {
        if (logger.isInfoEnabled()) {
            logger.info("Close netty channel " + channel);
        }
        channel.close(); // ㄟ(▔▽▔)ㄏ
    } catch (Exception e) {
        logger.warn(e.getMessage(), e);
    }
}
We then looked at how Netty4 handles the close triggered by disconnect(), which is quite different from Netty3.
In Netty3, close() sends the downstream (OUT) event synchronously before returning the Future object, and that event directly sets the channel state to ST_CLOSED.
/*
org.jboss.netty.channel.Channels
*/
public static ChannelFuture close(Channel channel) {
    ChannelFuture future = channel.getCloseFuture();
    channel.getPipeline().sendDownstream(
            new DownstreamChannelStateEvent(channel, future, ChannelState.OPEN, Boolean.FALSE));
    return future;
}
Netty4, by contrast, hands the outbound close event to the IO (event loop) thread and returns the Future object immediately. At that moment the channel still looks usable, i.e. channel.isActive() still returns true.
/*
io.netty.channel.AbstractChannelHandlerContext
*/
public ChannelFuture close(final ChannelPromise promise) {
    if (!this.validatePromise(promise, false)) {
        return promise;
    } else {
        final AbstractChannelHandlerContext next = this.findContextOutbound();
        EventExecutor executor = next.executor();
        if (executor.inEventLoop()) {
            next.invokeClose(promise);
        } else {
            safeExecute(executor, new OneTimeTask() {
                public void run() {
                    next.invokeClose(promise); // the actual close runs later, on the event loop
                }
            }, promise, (Object) null);
        }
        return promise;
    }
}
Our Conclusions
AbstractClient.reconnect() expects to tear down the connection and reestablish it immediately, one step after the other. Under Netty3 this pattern causes no logic problem. Under Netty4, however, the asynchronous close makes it easy for the actual execution order to become AbstractClient.disconnect() -> AbstractClient.connect() -> run(){ next.invokeClose(promise) }, especially when the server hits a transient IO stall. In that order, connect() can still observe an active channel (the queued close has not yet run on the event loop) and return without establishing a new connection; the deferred invokeClose() then kills the underlying NIO channel while the client-side closed flag stays false.
We have reproduced the closed=false but channel.ch.state=4 state with a small piece of code, and we can recreate the full scenario in a test environment by throttling server IO.
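The simulation code is not attached to this issue; the following minimal sketch of ours (with a hypothetical local listener on 127.0.0.1:8080) illustrates the same race on bare Netty4, where isActive() can still report true after close() has returned:

import io.netty.bootstrap.Bootstrap;
import io.netty.channel.Channel;
import io.netty.channel.ChannelFuture;
import io.netty.channel.ChannelInboundHandlerAdapter;
import io.netty.channel.EventLoopGroup;
import io.netty.channel.nio.NioEventLoopGroup;
import io.netty.channel.socket.nio.NioSocketChannel;

public class AsyncCloseDemo {
    public static void main(String[] args) throws Exception {
        EventLoopGroup group = new NioEventLoopGroup(1);
        try {
            Channel ch = new Bootstrap()
                    .group(group)
                    .channel(NioSocketChannel.class)
                    .handler(new ChannelInboundHandlerAdapter())
                    .connect("127.0.0.1", 8080)
                    .sync()
                    .channel();

            // close() is called from the main thread, so the real close is queued
            // onto the event loop and this call returns a promise immediately.
            ChannelFuture closeFuture = ch.close();

            // Race window: a caller like AbstractClient.connect() checking
            // isActive() here can still see true and skip reconnecting.
            System.out.println("isActive right after close(): " + ch.isActive());

            closeFuture.sync();
            System.out.println("isActive after close completes: " + ch.isActive());
        } finally {
            group.shutdownGracefully();
        }
    }
}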
About this issue
- State: closed
- Created 5 years ago
- Reactions: 3
- Comments: 18 (12 by maintainers)