lisk-sdk: Application halts clean up when processing block

Expected behavior

The app should clean up all the modules gracefully.

Actual behavior

Cleanup of block processing gets stuck, so the application never restarts and keeps logging the message: Waiting for block processing to finish...

Steps to reproduce

Unclear. Occurred once during the snapshotting process.

Which version(s) does this affect? (Environment, OS, etc…)

1.X

About this issue

  • State: closed
  • Created 6 years ago
  • Comments: 20 (18 by maintainers)

Most upvoted comments

To do: check whether pg-promise v8.5.4 solves the problem.

@diego-G Version 8.5.3 started supporting timeouts at the low level, and handles broken connections better.

So upgrading to the latest driver, plus configuring that timeout might be able to help here.

Plus, version 8.5.4 now properly releases dead connections, which is important for a process that runs for a long time.
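A minimal sketch of what "configuring that timeout" could look like. It assumes the option is `query_timeout` (in milliseconds) on the connection object, which pg-promise passes through to the underlying pg driver; the host, database and user values are placeholders, not the project's real configuration.

```javascript
// Connection object in the shape pg-promise accepts.
// All values below are placeholders for illustration only.
const connection = {
  host: 'localhost',
  port: 5432,
  database: 'lisk_example', // hypothetical database name
  user: 'lisk',             // hypothetical user
  // Assumed option name: the query call fails after this many
  // milliseconds instead of waiting forever on a dead connection.
  query_timeout: 30000,
};

// With pg-promise >= 8.5.3 installed, it would be used roughly like this:
// const pgp = require('pg-promise')();
// const db = pgp(connection);
```

Choosing the value is a trade-off: it has to be longer than the slowest legitimate query, otherwise healthy block writes would start failing.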

@nazarhussain F.Y.I.

@nazarhussain not sure what else to add here, except to drop a link: Query timeout in pg-promise.

I’m keeping an eye on this one 😉

@SargeKhan We got a DB connection error, which triggered the cleanup:

https://github.com/LiskHQ/lisk/blob/9fc3ea2873756f79fb4403c8038aa897be87b147/db/index.js#L88-L95

That triggered async.eachSeries to clean up all modules:

https://github.com/LiskHQ/lisk/blob/9fc3ea2873756f79fb4403c8038aa897be87b147/app.js#L764-L768

However, the blocks module will not clean up until its active SQL query has finished executing, and that query is never going to finish because the connection has been dropped.

https://github.com/LiskHQ/lisk/blob/9fc3ea2873756f79fb4403c8038aa897be87b147/modules/blocks.js#L248-L256

So it’s a race condition, and it could happen to any other module in the future. There are probably three possible solutions:

  1. Set a timeout in the module cleanup process, and skip any module whose cleanup takes too long
  2. If there is a DB connection error, simply log a fatal error and exit the process
  3. Add a timeout for the atomic block write in its promise chain
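The first option could look roughly like the sketch below. It is not the project's code: it assumes each module exposes a promise-returning `cleanup()` (the real modules may be callback-based), and `withTimeout` and `cleanupModules` are hypothetical helper names.

```javascript
// Hypothetical helper: rejects if `promise` does not settle within `ms`.
function withTimeout(promise, ms, label) {
  let timer;
  const deadline = new Promise((_, reject) => {
    timer = setTimeout(
      () => reject(new Error(`${label} cleanup timed out after ${ms} ms`)),
      ms
    );
  });
  // Clear the timer either way so it does not keep the process alive.
  return Promise.race([promise, deadline]).finally(() => clearTimeout(timer));
}

// Clean up modules one at a time, in order. A module that fails or
// exceeds the deadline is recorded and skipped so the rest still run.
async function cleanupModules(modules, ms) {
  const skipped = [];
  for (const [name, mod] of Object.entries(modules)) {
    try {
      await withTimeout(mod.cleanup(), ms, name);
    } catch (err) {
      skipped.push(name);
    }
  }
  return skipped;
}
```

A caller would then log the returned names and continue with the restart or exit, instead of waiting indefinitely on the stuck module.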

pg-promise has no query timeout feature available directly, so we have to use bluebird’s timeout method at the start of the promise chain: just add a timeout call after the tx call.

https://github.com/LiskHQ/lisk/blob/9fc3ea2873756f79fb4403c8038aa897be87b147/modules/blocks/chain.js#L501-L508
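A sketch of the idea, under stated assumptions: native promises stand in for bluebird's `.timeout(ms)` so the example runs without dependencies, and `runTx` is a hypothetical function standing in for the real `tx` call that writes the block. `TimeoutError`, `timeout` and `saveBlockWithTimeout` are illustrative names only.

```javascript
// Error type used to tell a timeout apart from an ordinary query failure.
class TimeoutError extends Error {}

// Native stand-in for bluebird's `.timeout(ms)`.
function timeout(promise, ms) {
  let timer;
  const deadline = new Promise((_, reject) => {
    timer = setTimeout(
      () => reject(new TimeoutError(`timed out after ${ms} ms`)),
      ms
    );
  });
  return Promise.race([promise, deadline]).finally(() => clearTimeout(timer));
}

// `runTx` is hypothetical: it represents the transaction that saves a block.
function saveBlockWithTimeout(runTx, ms) {
  return timeout(runTx(), ms).catch((err) => {
    if (err instanceof TimeoutError) {
      // The write did not settle in time; label the error so the caller
      // can stop waiting on block processing.
      err.message = `Block write ${err.message}`;
    }
    throw err;
  });
}
```

Note that a timeout of this kind only stops the caller from waiting. It does not cancel the query on the database side, so the caller still has to decide whether the block was written.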

@vitaly-t can provide more input on this.