Cold shutdown causes task loss #8583
-
I have the same problem when using celery==5.2.7 with prefork workers and task_acks_late = True. Log output:

[INFO/MainProcess] Task subtask_execution_lib.tasks.cpu_task.cpu_task["my_task_id"] received
[uwsgi] | - [uwsgi-daemons] stopping daemon (pid: 36411): /bin/celery -A worker_a worker --pool=prefork -Ofair --loglevel=info -n celery@local
worker: Cold shutdown (MainProcess)

In my case the task remains in the started state until I manually revoke it.
-
Hey y'all.
Recently I started adding more telemetry and monitoring around my Celery cluster, and I've noticed that some tasks are lost when a worker shuts down. My Celery stack is deployed to AWS using ECS Fargate as the platform. I'm also using the REMAP_SIGTERM=SIGQUIT setting to make sure my Celery workers handle signals properly (I want them to shut down using the cold method).
My workers are started using the following command:
celery report:
So, what exactly have I noticed? While going through the logs, I spotted cases where Celery received a shutdown event, logged as worker: Cold shutdown (MainProcess) (for example, due to scaling activity, a release, or a spot instance being interrupted). Usually, when this happens, I also see a Restoring 1 unacknowledged message(s) log line right after.

I run all of my tasks with acks_late=True and reject_on_worker_lost=True. However, even with those settings, some tasks are lost when a cold shutdown occurs. I can definitely see cases where Celery picks up a task (the Received task log line from the celery.worker.strategy logger appears) and a shutdown event follows a few seconds later. I would expect the task to stop immediately and Celery to return it to the ready queue. Also, from time to time the task continues to run and even finishes, so Celery stores the result in the database but does not remove the message from the unacked queue, which can cause the task to run twice (or more).

Has anyone noticed similar behavior, and can anyone provide some guidance on how to fix this?
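For anyone trying to reproduce this, here is a minimal sketch of the two acknowledgement options as global settings. It assumes a standalone config module (the name celeryconfig.py is hypothetical) rather than the per-task acks_late / reject_on_worker_lost arguments mentioned above; task_acks_late and task_reject_on_worker_lost are the documented global equivalents. It only illustrates the configuration under discussion and is not a fix for the task loss.

```python
# celeryconfig.py (hypothetical module name), loaded with
# app.config_from_object("celeryconfig") on an existing Celery app.

# Acknowledge the broker message only after the task has finished,
# instead of right before it starts executing.
task_acks_late = True

# If the process executing the task exits abruptly or is killed,
# requeue the message instead of acknowledging it.
task_reject_on_worker_lost = True
```

With late acknowledgement a message can be delivered more than once, so tasks configured this way generally need to be idempotent.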