redash: Redis decode_responses rq incompatibility
Redash sets the decode_responses flag on its Redis connection. This is incompatible with the rq library - https://github.com/rq/rq/issues/1188.
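To make the incompatibility concrete, here is a rough sketch of what the flag effectively does inside redis-py (simplified; decode_reply is an illustrative stand-in, not the real API):

```python
import pickle

def decode_reply(reply: bytes, decode_responses: bool):
    # Approximation of redis-py behavior: with decode_responses=True, every
    # bulk string reply is decoded as UTF-8 before being returned.
    return reply.decode("utf-8") if decode_responses else reply

# Fine for Redash's own keys, which hold UTF-8 text:
print(decode_reply(b"query_result:42", True))  # 'query_result:42'

# But rq stores pickled (binary) job data, which is generally not valid UTF-8:
job_blob = pickle.dumps(("redash.tasks.execute_query", (1,)))
try:
    decode_reply(job_blob, True)
except UnicodeDecodeError:
    print("rq job payload is not valid UTF-8")
```

This is why a single shared connection with decode_responses=True breaks rq: rq expects raw bytes back from Redis.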
We are having some issues, and I suspect this might be the underlying cause. Primarily:
- scheduled queries look like they’re being executed multiple times
- occasionally a query gets stuck: the job running the query disappears, but the Redis entry remains, consuming a slot in the query pool.
The second issue eventually consumes the entire query pool and prevents any query from running. When this happens, we have to manually flush the Redis cache.
We are running Redash v10.1.0 from the redash/redash:10.1.0.b50633 docker image.
We found https://github.com/getredash/redash/issues/5801, which looks like the same issue we are having. We have upgraded rq to 1.10.1 and rq-scheduler to 0.10.0 in our environment. This does seem to have alleviated the stuck queries issue, but it doesn’t seem to have changed the multiple executions of scheduled queries. However, we are now seeing the error
UnicodeDecodeError: 'utf-8' codec can't decode byte 0x9c in position 1: invalid start byte
repeated throughout the logs. The stack trace here matches the linked issue in the rq library. I am not sure if upgrading rq caused these errors - I more suspect that upgrading rq has improved the error logging and is surfacing an existing issue.
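For what it’s worth, the specific byte in that error is consistent with zlib-compressed data (rq compresses some job fields with zlib): a zlib stream at the default compression level begins with the bytes 0x78 0x9c, so UTF-8 decoding fails on byte 0x9c at position 1. A quick stdlib check:

```python
import zlib

# A zlib stream at the default compression level starts with 0x78 0x9c.
blob = zlib.compress(b"pickled job data would go here")
print(blob[:2])  # b'x\x9c'

try:
    blob.decode("utf-8")
except UnicodeDecodeError as exc:
    # "'utf-8' codec can't decode byte 0x9c in position 1: invalid start byte"
    print(exc)
```

That matches the logged error exactly, which supports the theory that a decoding connection is being handed compressed rq data.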
About this issue
- Original URL
- State: open
- Created 10 months ago
- Comments: 18 (16 by maintainers)
imo waiting for 60s for a test isn’t that big of a deal, I’d be fine with that. Messing with the internals of Redis sounds more error-prone and in my opinion doesn’t give enough lift that it’s worth doing instead.
I’ve had a chance to look into this. I found one place where the wrong connection was being passed to an RQ function leading to the UnicodeDecodeError. https://github.com/getredash/redash/pull/6539
While looking for the problem, I noticed that the enqueue_query method creates a Redis pipeline to watch a value, but the pipeline is never reset. I don’t think this is related to the issue at hand, but it seems like something that should be fixed. I’ve created a PR https://github.com/getredash/redash/pull/6540 that calls reset inside a finally block. The other option is to change line 39 to with redis_connection.pipeline() as pipe:, but that adds another layer of nesting and the code in that function is already quite nested.

I believe the actual source of the problem is the HardLimitingWorker.
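Going back to the pipeline cleanup point for a moment, a minimal sketch of the two options, using a stand-in class (the real object is redis-py’s Pipeline; names here are illustrative):

```python
class FakePipeline:
    """Stand-in for redis-py's Pipeline, which releases its connection on reset()."""

    def __init__(self):
        self.was_reset = False

    def watch(self, key):
        pass  # would issue WATCH in redis-py

    def reset(self):
        self.was_reset = True

    # redis-py's real Pipeline also supports the context-manager protocol,
    # calling reset() on exit:
    def __enter__(self):
        return self

    def __exit__(self, *exc):
        self.reset()

# Option 1: explicit try/finally (the approach in the PR)
pipe = FakePipeline()
try:
    pipe.watch("query_hash")
    # ... enqueue logic ...
finally:
    pipe.reset()

# Option 2: context manager - same cleanup guarantee, one more level of nesting
with FakePipeline() as pipe2:
    pipe2.watch("query_hash")
    # ... enqueue logic ...
```

Both guarantee the pipeline’s connection is released even if the enqueue logic raises; the difference is purely stylistic.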
This performs a heartbeat on the Worker, but not on the Job. Once a query is picked off the queue for execution, RQ gives the job a default timeout of 60s. Normally this would be pushed forward every time the job heartbeat is called. Nothing happens immediately when this timeout expires; the next time the RQ StartedJobRegistry.cleanup method is called, the job status is set to failed. The job will continue running and will still be listed (as execute_query) on the Redash Admin RQ status page. However, the query will not appear on the query tab of that page, and to a user who has the query open it will look like it has stopped executing, possibly leading them to click refresh and start another copy of the query.

The StartedJobRegistry.cleanup method is called as a result of calling get_job_ids, and somewhat ironically it looks like this happens in two places in Redash -

Rather than adding a call to the job heartbeat from within the HardLimitingWorker, I believe the best fix in this case is to remove the HardLimitingWorker entirely. It looks like it was cloned from an early version of RQ, and its purpose is to hard-kill a job that runs for too long without respecting its time limit. This functionality has since been baked directly into RQ: it was added in version 1.2.1 and has undergone a number of iterations of bugfixes.
I have managed to reproduce this locally for ad-hoc queries. I’m not certain if this is also the cause of the scheduled queries running multiple times yet, but will give the fixes a go and get back to you with the results.
This does indeed probably stem from Redis, but I agree that it’s probably unrelated to the original error. AFAIK Redis by default returns bytes, and you have to decode with UTF-8 to get a str, so that’s what the error in the tests is complaining about.