Description
[SIP-143] Proposal for Global Async Task Framework
Motivation
Note: This replaces [SIP-141] Global Async Queries 2.0 which aimed at completing [SIP-39] Global Async Query Support.
Proposed Change
Superset currently has varied and often opaque solutions for executing async operations (all require Celery):
- SQL Lab supports async query execution, and utilizes long polling for checking for results
- Thumbnails and Alerts & Reports are executed as background Celery tasks that are scheduled by Celery Beat
- Chart queries can be executed async by enabling the GAQ feature flag, and supports both WebSocket and long polling
- Scheduled cron jobs (added via CeleryConfig in superset_config.py)
Currently none of the above support deduplication or cancellation, or even viewing which tasks are queued or executing. Executing long-running queries synchronously in the web workers is especially troublesome: the web workers can become unresponsive if many long-running queries run simultaneously. This has led many orgs to extend their webserver timeouts so that long-running queries do not time out. Moving these queries to the async workers will both free up the web workers and make it possible to decrease web worker timeouts significantly, while still supporting arbitrarily long query execution times.
In addition, beyond sharing Celery as the execution backend, there is limited sharing of code, like utils or ORM models, which has led to significant code duplication. This increases the risk of regressions, limits reusability of good functionality and adds significant maintenance burden.
For this reason this SIP recommends adding a new Global Async Task Framework (GATF), which will introduce the following:
- A new ORM model with a menu which makes it possible to view and cancel queued or executing tasks. Admins will have access to all tasks, while regular users will only be able to view tasks they have spawned. This model will be ephemeral by nature, i.e. the task entries will be removed once they are completed.
- Add locking for all tasks to ensure deduplication. This applies particularly to async chart queries and thumbnails, which currently can cause significant resource waste.
- Deprecate long polling in both chart and SQL Lab queries - going forward only WebSockets would be supported.
New or Changed Public Interfaces
Model
A new ORM model will be introduced for async tasks with a string based identifier. When a new task is created, an entry is added to the table if it's not already there. For instance, for thumbnails, we would use the digest as the identifier. And for chart queries, we would use the cache key and so on. If the entry is already there, we consider the task already locked, and don't schedule a new one. The model will look as follows:
from sqlalchemy import Column, DateTime, Enum, Integer, String, Text


class AsyncTask(Base):  # Base is assumed to be the declarative base used for Superset models
    __tablename__ = "async_tasks"

    id = Column(Integer, primary_key=True)
    task_id = Column(String(256), unique=True, nullable=False, index=True)
    # Named separately from the status enum so the two database enum types do not collide
    task_type = Column(Enum(..., name="task_type"), nullable=False)
    task_name = Column(String(256), nullable=False)
    status = Column(
        Enum("PENDING", "IN_PROGRESS", "SUCCESS", "REVOKED", "FAILURE", name="task_status"),
        nullable=False,
    )
    created_at = Column(DateTime, nullable=False)
    updated_at = Column(DateTime, nullable=False)
    ended_at = Column(DateTime, nullable=True)  # unset until the task has finished
    error = Column(String, nullable=True)
    state = Column(Text, nullable=True)  # JSON serializable
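The locking behaviour described above can lean on the unique constraint on task_id: whoever inserts the row first owns the task, and everyone else sees a constraint violation and skips scheduling. The following is a minimal sketch of that idea, using the standard library's sqlite3 as a stand-in for the metadata database. The try_acquire_task helper, the trimmed-down table and the example identifier are all hypothetical and not part of the proposal.

```python
import sqlite3
from datetime import datetime, timezone


def create_table(conn: sqlite3.Connection) -> None:
    # Trimmed-down stand-in for the async_tasks table; only the columns
    # needed to demonstrate locking are included.
    conn.execute(
        """
        CREATE TABLE async_tasks (
            id INTEGER PRIMARY KEY,
            task_id TEXT NOT NULL UNIQUE,
            status TEXT NOT NULL,
            created_at TEXT NOT NULL
        )
        """
    )


def try_acquire_task(conn: sqlite3.Connection, task_id: str) -> bool:
    """Return True if this caller created the entry and should schedule the task."""
    try:
        with conn:  # commits on success, rolls back on error
            conn.execute(
                "INSERT INTO async_tasks (task_id, status, created_at) VALUES (?, ?, ?)",
                (task_id, "PENDING", datetime.now(timezone.utc).isoformat()),
            )
        return True
    except sqlite3.IntegrityError:
        # The unique constraint fired: an identical task is already queued or running.
        return False


conn = sqlite3.connect(":memory:")
create_table(conn)
first = try_acquire_task(conn, "thumbnail-digest-abc123")   # creates the entry
second = try_acquire_task(conn, "thumbnail-digest-abc123")  # duplicate, skipped
print(first, second)
```

Relying on the database constraint rather than a read-then-write check means two workers racing on the same identifier cannot both believe they acquired the task.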
As per SIP-43, we'll introduce at least a DAO for this (maybe also a set of commands). For abstracting the main GATF logic, a new decorator will be introduced, which wraps the task in a thread that can be killed as needed (the final implementation will look different, this is just to give an understanding of the main logic):
import threading
from functools import wraps
from typing import Any, Callable

# AsyncTask, TaskStatus and db are assumed to come from the model above
# and from Superset's SQLAlchemy setup.

TASK_SLEEP_INTERVAL_SEC = 5


def async_task(f: Callable[..., Any]) -> Callable[..., Any]:
    @wraps(f)
    def wrapper(*args: Any, **kwargs: Any) -> Any:
        task_id = kwargs.get("task_id")
        if not task_id:
            raise ValueError("task_id is required for cancelable tasks")

        # Look up by the string identifier, not the integer primary key
        task = AsyncTask.query.filter_by(task_id=task_id).one_or_none()
        if task is None:
            raise Exception(f"Task not found: {task_id}")
        if task.status != TaskStatus.PENDING:
            raise Exception(
                f"Task {task_id} is already in progress, current status: {task.status}"
            )

        task.status = TaskStatus.IN_PROGRESS
        db.session.commit()

        cancel_event = threading.Event()  # tells the task to stop
        done_event = threading.Event()  # tells the monitor to stop

        def monitor_status() -> None:
            # Poll the task entry; a missing or REVOKED entry means "cancel"
            while not done_event.is_set():
                current = AsyncTask.query.filter_by(task_id=task_id).one_or_none()
                if current is None or current.status == TaskStatus.REVOKED:
                    cancel_event.set()
                    break
                # Wakes up early once the task has finished
                done_event.wait(TASK_SLEEP_INTERVAL_SEC)

        monitor_thread = threading.Thread(target=monitor_status)
        monitor_thread.start()

        try:
            return f(*args, cancel_event=cancel_event, **kwargs)
        finally:
            # Entries are ephemeral: stop the monitor, then remove the entry
            # whether the task succeeded, failed or was cancelled
            done_event.set()
            monitor_thread.join()
            db.session.delete(task)
            db.session.commit()

    return wrapper
and when used, the task will just be decorated as follows:
@celery_app.task(name="my_task")
@async_task
def my_task(task_id: str, cancel_event: threading.Event) -> None:
    # add logic here that checks cancel_event periodically
    ...
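To illustrate what "checks cancel_event periodically" might look like inside a task body, here is a small self-contained sketch. The process_rows function and its doubling logic are hypothetical placeholders for real work such as rendering a thumbnail or fetching query results in batches.

```python
import threading
from typing import Iterable, List


def process_rows(rows: Iterable[int], cancel_event: threading.Event) -> List[int]:
    """Hypothetical task body: doubles each row, stopping early if cancelled."""
    processed: List[int] = []
    for row in rows:
        # Check the flag between units of work so cancellation takes effect quickly.
        if cancel_event.is_set():
            break
        processed.append(row * 2)
    return processed


cancel_event = threading.Event()
completed = process_rows([1, 2, 3], cancel_event)

cancel_event.set()  # simulates the monitor thread seeing a REVOKED status
cancelled = process_rows([1, 2, 3], cancel_event)
print(completed, cancelled)
```

Cancellation in this style is cooperative: the task only stops at the points where it checks the event, so a single long blocking call between checks will delay the response to a cancel request.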
Notification method
We propose making WebSockets the sole mechanism for broadcasting task completion. This means that we will remove long polling support from async chart queries, and replace long polling in SQL Lab with WebSockets.
Frontend changes
Charts will display the current task status, and have a simple mechanism for cancelling queries if needed.
New dependencies
None - however, going forward, the WebSocket server will be mandatory for both SQL Lab and async chart queries.
Migration Plan and Compatibility
Phase 1 - Thumbnails
As a first step, we migrate Thumbnails to GATF, as they tend to be fairly long-running tasks that currently lack deduplication. In addition, the main thumbnail rendering functionality is fairly straightforward, and will not require extensive changes. Migrating Thumbnails will require implementing all interfaces, like the GATF ORM model, UI and decorators/context managers. At the end of this phase, thumbnail rendering will be deduplicated, and Admins will be able to both see and cancel running Thumbnail tasks via the Async task list view.
Phase 2 - GAQ
In the second phase, we clean up GAQ by simplifying the architecture (see details about redundant form data cache keys etc from SIP-141) and remove long polling support. We will also migrate the GAQ metadata from Redis to the new ORM model.
At the end of this phase, we will have removed long polling from GAQ, and will have both chart query deduplication and cancellation support.
Phase 3 - The rest
In the final phase, we migrate the remaining async tasks to GATF. This mainly covers Alerts & Reports, but also any tasks that are triggered via Celery Beat that implement the provided context managers/decorators, like cache warmup, log rotation etc. At the end of this phase, it will be possible to cancel any queued or running async task via the UI.
Rejected Alternatives
SIP-39 and SIP-141 were rejected in favor of making a more general purpose Async Task Framework.