Knowing when library has been reloaded #729


Open
rgaudin opened this issue Feb 4, 2025 · 5 comments · May be fixed by #734

Comments

@rgaudin
Member

rgaudin commented Feb 4, 2025

This is related to #728 but independent.

Our use case: kiwix-serve sits behind a Varnish cache for library.kiwix.org because it cannot handle the load on its own.
The cache does have a time-based expiration for all its entries, but that is not relevant here.

Because we frequently publish new ZIM files, we frequently (at most once per hour at the moment) regenerate the library XML file.
When we do, we want to invalidate our cache entries related to the Catalog.

Our problem is that we don't know when kiwix-serve has actually reloaded the library and is ready to serve new data.

If we invalidate the cache too soon, chances are an incoming request arrives before the refresh, and we end up storing the old data in the cache instead of the new data.

To work around this, we now wait 10 seconds after writing the XML file to disk before purging the cache. Of course, that's arbitrary, ugly and fragile.
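
For illustration, the workaround looks roughly like this (the library path and Varnish purge endpoint below are placeholders, not our actual setup):

    import time
    import requests

    def refresh_library(xml_content):
        # Write the regenerated library XML where kiwix-serve looks for it.
        with open("/data/library/library.xml", "w") as f:
            f.write(xml_content)

        # Arbitrary delay, hoping kiwix-serve has reloaded the library by then.
        time.sleep(10)

        # Purge the Catalog entries from Varnish, here via a custom PURGE
        # method that the VCL must be configured to accept from trusted clients.
        requests.request("PURGE", "http://varnish:6081/catalog/v2/", timeout=5)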

How could we be informed that the library has been reloaded?

@evrial

evrial commented Feb 23, 2025

No library - no problem.
#735

@rgaudin
Member Author

rgaudin commented Feb 24, 2025

@evrial:

  • library allows customizing/overriding a number of metadata fields. Without it we'd be stuck with the in-ZIM metadata.
  • watching a directory would absolutely not solve this issue. An external program would still not know when kiwix-serve has detected and reloaded a specific ZIM. For the cache example above, there would be no improvement.

@Optimus-NP

Hi @rgaudin @evrial @kelson42,

I've added my thoughts in the comment below. Requesting your response.

@Optimus-NP

Proposal: Automating Cache Purge with Kafka in Kiwix

Overview

This proposal outlines a Kafka-based architecture to streamline the process of purging the cache when the Kiwix server reloads its library. Instead of invoking the purge API manually from library-maintain.py, this approach leverages Kafka for event-driven communication, enhancing scalability and reliability.

Architecture Design

Components

  1. Kiwix Server

    • Reloads the library path upon updates.
    • Publishes a LibraryReloaded event to a Kafka topic.
  2. Kafka Broker

    • Acts as a messaging layer for event-driven communication.
    • Stores the LibraryReloaded event in a dedicated topic.
  3. Cache Purge Daemon

    • Subscribes to the LibraryReloaded topic.
    • Listens for updates and triggers the cache purge automatically.

Workflow

  1. Library Update

    • library-maintain.py regenerates the library XML file and kiwix-serve reloads it.
  2. Kafka Event Emission

    • Kiwix server publishes a LibraryReloaded message to the Kafka topic kiwix.server.events (an example payload is sketched after this list).
  3. Daemon Processing

    • A background daemon subscribes to Kafka topic kiwix.server.events.
    • On receiving a LibraryReloaded message, it executes the cache purge process.
  4. Automatic Cache Purge

    • The daemon clears outdated cache files or triggers a specific API internally.
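
For illustration, the LibraryReloaded event could carry a small JSON payload; the field names below are assumptions, not an existing kiwix-serve format:

    import json
    import time

    # Hypothetical LibraryReloaded payload, serialized as the message body
    # published to the kiwix.server.events topic.
    event = json.dumps({
        "type": "LibraryReloaded",
        "library_path": "/data/library/library.xml",
        "reloaded_at": int(time.time()),
    })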

Implementation Details

Kafka Setup

  • Deploy a Kafka broker with a kiwix.server.events topic.
  • Configure retention policies to avoid excessive storage (see the sketch below).
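
As a sketch, the topic and its retention policy could be created with the kafka-python admin client (the broker address and one-day retention are placeholder choices):

    from kafka.admin import KafkaAdminClient, NewTopic

    admin = KafkaAdminClient(bootstrap_servers="localhost:9092")
    admin.create_topics([
        NewTopic(
            name="kiwix.server.events",
            num_partitions=1,
            replication_factor=1,
            # Keep events for one day to avoid excessive storage.
            topic_configs={"retention.ms": str(24 * 60 * 60 * 1000)},
        )
    ])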

Kiwix Server Integration

  • Modify the server to publish a message upon successful library reload. A sketch using the librdkafka C API (error handling kept minimal):
    #include <librdkafka/rdkafka.h>
    #include <string>

    void produce_message(const std::string& broker, const std::string& topic, const std::string& message) {
        char errstr[512];

        // Point the producer at the configured broker.
        rd_kafka_conf_t* conf = rd_kafka_conf_new();
        rd_kafka_conf_set(conf, "bootstrap.servers", broker.c_str(), errstr, sizeof(errstr));

        rd_kafka_t* rk = rd_kafka_new(RD_KAFKA_PRODUCER, conf, errstr, sizeof(errstr));
        if (!rk) {
            rd_kafka_conf_destroy(conf);  // conf is only owned by the handle on success
            return;
        }

        // Publish the event; RD_KAFKA_PARTITION_UA lets Kafka pick the partition.
        rd_kafka_topic_t* rkt = rd_kafka_topic_new(rk, topic.c_str(), NULL);
        rd_kafka_produce(rkt, RD_KAFKA_PARTITION_UA, RD_KAFKA_MSG_F_COPY,
                         const_cast<char*>(message.c_str()), message.size(),
                         NULL, 0, NULL);

        rd_kafka_flush(rk, 1000);  // wait up to 1s for delivery before tearing down
        rd_kafka_topic_destroy(rkt);
        rd_kafka_destroy(rk);
    }
    

Cache Purge Daemon

  • Subscribe to the Kafka topic and trigger cache purge:
    from kafka import KafkaConsumer  # kafka-python client

    consumer = KafkaConsumer('kiwix.server.events', bootstrap_servers='localhost:9092')
    for message in consumer:
        if b'LibraryReloaded' in message.value:
            purge_cache()  # Custom cache purge function (sketched below)
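
A minimal purge_cache() sketch, assuming Varnish is configured (via VCL) to accept a custom PURGE method from trusted clients; the URL is a placeholder:

    import requests

    def purge_cache():
        # Hypothetical Varnish endpoint for the cached Catalog entries.
        response = requests.request("PURGE", "http://varnish:6081/catalog/v2/entries", timeout=5)
        response.raise_for_status()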
    

Benefits

  • Decoupling: Eliminates direct dependencies between library-maintain.py and the purge process.
  • Scalability: Kafka allows multiple services to listen for updates without modifying the core logic.
  • Reliability: Ensures cache purge happens even if the initial process fails due to transient issues.

Conclusion

This Kafka-based approach improves the efficiency of cache purging by making it event-driven, reducing manual intervention, and enhancing scalability. By implementing this architecture, Kiwix can achieve a more resilient and automated workflow for managing library updates.

@kelson42
Contributor

@Optimus-NP We cannot rely on another architecture element (like Kafka) to do that. It should work with Varnish and Kiwix Server only.
