Skip to content

Update Kiwix-serve cache strategy #650

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
kelson42 opened this issue Dec 11, 2021 · 20 comments · Fixed by #833
Closed

Update Kiwix-serve cache strategy #650

kelson42 opened this issue Dec 11, 2021 · 20 comments · Fixed by #833
Assignees
Milestone

Comments

@kelson42
Copy link
Collaborator

kelson42 commented Dec 11, 2021

Kiwix serve delivers many types of content:

  1. ZIM content
  2. Kiwix-serve static content
  3. Kiwix-serve dynamic content (API, search, etc.)

For each type, here is how we want to cache:

  1. ZIM content: cache as long as the (concerned) content does not change
  2. Kiwix-serve static content: cache as long as we don't run a different version of kiwix-serve
  3. Kiwix-serve dynamic content: cache as long as the (concerned) library has not been updated

Which implies 3 different strategies...

ZIM content

ZIM content does give us no freedom about the URLs calls and we have no way to garanty that what is available at the URL at two different times is the same content. Therefore the content can be cached but must be revalidate (with Etag issued from the ZIM ID). To avoid a systematic revalidation we can set a max-age of one hour like Cache-Control: max-age=3600, must-revalidate.

Kiwix-serve static content

This is the easier case, we need to "cache busting", for example by adding ?UNIQUE_ID at the end of the URL of any of the resources. This UNIQUE_ID should be fixed at compilation time. Cache-Control: max-age=31536000, immutable then should be enough (so no other related HTTP headers, like Etag, Last-Modified or Expires).

Kiwix-serve dynamic content

The dynamic content is consistent and can be cached as long as the library is not updated or the server restarted. I would propose to force systematically revalidation with an Etag which is changed each time the internal library is changed. The cache policy would look like Cache-Control: max-age=0, must-revalidate.

@kelson42
Copy link
Collaborator Author

kelson42 commented Dec 11, 2021

@veloman-yunkan I want to make a new attempt to sortout this problem of cache. Because the problem is broader than the API or Etag, I have closed #598, but thank you very much again for your explanation which are perfectly correct I think now. Would you be able please to assess this ticket and tell me if you disagree with something? Or maybe you would have an other approach to do it better?

@kelson42
Copy link
Collaborator Author

kelson42 commented Feb 3, 2022

While you are waiting and interacting with @mgautierfr on #679 and I suspect it might take weeks ; I would really appreciate to have you looking at this ticket.

@veloman-yunkan
Copy link
Collaborator

While you are waiting and interacting with @mgautierfr on #679 and I suspect it might take weeks ; I would really appreciate to have you looking at this ticket.

@kelson42 OK, will do.

@veloman-yunkan
Copy link
Collaborator

== Kiwix-serve static content ==

This is the easier case, we need to "cache busting", for example by adding ?UNIQUE_ID at the end of the URL of any of the resources. This UNIQUE_ID should be fixed at compilation time. Cache-Control: max-age=31536000, immutable then should be enough (so no other related HTTP headers, like Etag, Last-Modified or Expires).

This presents some maintenance challenge. kiwix-serve static resources are referenced in other static resources. For example, static/templates/index.html references, among others, static/skin/isotope.pkgd.min.js, static/skin/iso6391To3.js, static/skin/index.js:

<script src="{{root}}/skin/isotope.pkgd.min.js" defer></script>
<script src="{{root}}/skin/iso6391To3.js"></script>
<script type="text/javascript" src="{{root}}/skin/index.js" defer></script>

By adopting this scheme, whenever we update a static resource and change its unique id, we must make sure that all relative URLs pointing to that resource are updated too. Without automated frontend testing we will need to be careful in order to prevent mismatches between the resource ids and the cross references in sibling resources.

I will try to figure out how to circumvent this difficulty.

@kelson42
Copy link
Collaborator Author

kelson42 commented Feb 6, 2022

By adopting this scheme, whenever we update a static resource and change its unique id, we must make sure that all relative URLs pointing to that resource are updated too. Without automated frontend testing we will need to be careful in order to prevent mismatches between the resource ids and the cross references in sibling resources.

From this comment, I'm not 100% sure we have a common understanding about how to do this "cache busting". Whatever the value of ?UNIQUE_ID (in the script nodes URLs for example), kiwix-serve will deliver the proper content because the js resource name stays the same (everything before the question mark). The only question is if the browser cache will handle things properly (redownload or use cache version)?

From the technical side the should probably be a random string given by Meson to the compiler (like the version number) and it should be then set as template value and replaced accordingly. It does not really matter if we have a new UNIQUE_ID even if a (js) resource does not have changed beetween two compilations, the overhead from the user perspective will be neglictable. Doing something better would IMO complexify the code too much.

@veloman-yunkan
Copy link
Collaborator

@kelson42 I never said that the solution won't work. My point was that a straightforward implementation will add some maintenance headache. For example, static/templates/index.html contains the following line:

<script type="text/javascript" src="{{root}}/skin/index.js" defer></script>

We can change it to

<script type="text/javascript" src="{{root}}/skin/index.js?v1" defer></script>

and then, when we modify static/skin/index.js, change it again to

<script type="text/javascript" src="{{root}}/skin/index.js?v2" defer></script>

and so on. I am looking for an approach that will work automatically.

P.S. I reread my previous comment and found the sentence which suggested that I doubted the viability of the proposed solution. I wrote

Without automated frontend testing we will need to be careful in order to prevent mismatches between the resource ids and the cross references in sibling resources.

Indeed, if in the server we simply ignore the ?UNIQUE_ID token, then nothing will work worse compared to what we have now (but forgetting to update the ?UNIQUE_ID component for a modified resource will still lead to the issue nicely summarized by @mgautierfr).

@veloman-yunkan
Copy link
Collaborator

I am looking for an approach that will work automatically.

What about the following scheme?

The quasi-revision¹ of a resource can be obtained automatically by the following command:

git log --oneline static/skin/STATIC_RESOURCE_PATH|wc -l

When new commits affecting static/skin/STATIC_RESOURCE_PATH are added this number only grows and can serve as a base for deriving the UNIQUE_ID token for that resource.

We can have a script that preprocesses the templates in static/templates/ by appending thus obtained ?UNIQUE_ID suffixes to URLs starting with {{root}}/skin/.


@kelson42
Copy link
Collaborator Author

kelson42 commented Feb 6, 2022

When new commits affecting static/skin/STATIC_RESOURCE_PATH are added this number only grows and can serve as a base for deriving the UNIQUE_ID token for that resource.

OK

We can have a script that preprocesses the templates in static/templates/ by appending thus obtained ?UNIQUE_ID suffixes to URLs starting with {{root}}/skin/.

I think it is better to to that in meson and C++. Your approach seems less clean and robust to me... but you are the dev. and as long as it works and @mgautierfr agree with it, this is fine to me.

@veloman-yunkan
Copy link
Collaborator

The quasi-revision¹ of a resource can be obtained automatically by the following command:

git log --oneline static/skin/STATIC_RESOURCE_PATH|wc -l

When new commits affecting static/skin/STATIC_RESOURCE_PATH are added this number only grows and can serve as a base for deriving the UNIQUE_ID token for that resource.

We can have a script that preprocesses the templates in static/templates/ by appending thus obtained ?UNIQUE_ID suffixes to URLs starting with {{root}}/skin/.

Relying on git for obtaining the UNIQUE_ID of a resource will create obstacles when building libkiwix from downloaded (rather than cloned) source. A workaround is to add to the repository a file with revisions of all static resources and update it in an automatic way using the above method.

@veloman-yunkan
Copy link
Collaborator

The solution to static resource versioning must not be fooled by a git rebase operation. This requirement cancels the approach set forth in my previous comment. Instead, we will have to use a resource unique id derived from some hash of the resource content.

@veloman-yunkan
Copy link
Collaborator

== ZIM content ==

ZIM content does give us no freedom about the URLs calls and we have no way to garanty that what is available at the URL at two different times is the same content. Therefore the content can be cached but must be revalidate (with Etag issued from the ZIM ID). To avoid a systematic revalidation we can set a max-age of one hour like Cache-Control: max-age=3600, must-revalidate.

How should this interact with the decoration of text/html content? If Etag is issued from the ZIM ID only, then a change in the taskbar templates will not invalidate the cache and we may arrive at a situation where out-dated taskbar code will work in combination with updated front-end code served via the /skin endpoint. Even simpler, restarting the server with a different value for the --urlRootLocation option will break the taskbar because of the cached responses with the old value of {{root}} in

<form class="kiwixsearch" method="GET" action="{{root}}/search" id="kiwixsearchform">
{{#hascontent}}<input type="hidden" name="content" value="{{content}}" />{{/hascontent}}
<label for="kiwixsearchbox">&#x1f50d;</label>
<input autocomplete="off" class="ui-autocomplete-input" id="kiwixsearchbox" name="pattern" type="text" title="Search '{{title}}'" aria-label="Search '{{title}}'">
</form>
</div>
<input type="checkbox" id="kiwix_button_show_toggle">
<label for="kiwix_button_show_toggle"><img src="{{root}}/skin/caret.png" alt=""></label>
<div class="kiwix_button_cont">
{{#withlibrarybutton}}
<a id="kiwix_serve_taskbar_library_button" title="Go to welcome page" aria-label="Go to welcome page" href="{{root}}/"><button>&#x1f3e0;</button></a>
{{/withlibrarybutton}}
{{#hascontent}}
<a id="kiwix_serve_taskbar_home_button" title="Go to the main page of '{{title}}'" aria-label="Go to the main page of '{{title}}'" href="{{root}}/{{content}}/"><button>{{title}}</button></a>
<a id="kiwix_serve_taskbar_random_button" title="Go to a randomly selected page" aria-label="Go to a randomly selected page"
href="{{root}}/random?content={{#urlencoded}}{{{content}}}{{/urlencoded}}"><button>&#x1F3B2;</button></a>

@veloman-yunkan
Copy link
Collaborator

This may be a good time to once again reconsider #394

@kelson42
Copy link
Collaborator Author

@veloman-yunkan Your two last comments are pertinent IMO. In this ticket I simply have chosen to ignore the fact that part (taskbar) of each article is chrome and is conceptually different... and that treating it in an undifferentiated manner would lead to unexpected behaviours. This is just a limitation of the current design IMO.

See my comment at #394 (comment). #394 is part of the current milestone 10.1.0 and could be implemented ASAP. Considering that @mgautierfr will be pretty busy in the next weeks with many tickets related to compilation/CI/CD, you are currently in first row to run this effort. This makes even more sense if you feel impaired by #394 for this ticket. So, if you want, go for it.

@kelson42 kelson42 modified the milestones: 10.1.0, 10.2.0, 10.3.0 Mar 23, 2022
@kelson42 kelson42 modified the milestones: 10.2.0, 10.3.0 Apr 7, 2022
@rgaudin
Copy link
Member

rgaudin commented Apr 22, 2022

Stepping into the discussion as we are revamping library.kiwix.org's frontend (reverse proxy). As discussed with @kelson42, we'd want kiwix-serve to respond properly and not hack its responses into the proxy.

From what I understand, here's to do:

  • Use iframe instead of introduce taskbar. #394 must be implemented ; with an iframe pointing to /raw.
  • A unique library hash must be set/computed on each library refresh. Could be a UUID (assuming every restart/reload is library change) or a digest of the book IDs (better caching but more costly to generate).
  • /catalog/ and /search|suggest/ requests for multi-ZIM to be ETag'd with the library hash ID plus no-cache.
  • /catalog/ and /search|suggest/ requests for single-ZIM (v2/illustration for instance) to be ETag'd with the BookID plus a long max-age.
  • /raw/ responses to be ETag'd with the BookID plus a long max-age.
  • /skin|external/ can follow two paths:
    • each resource is hashed and linked-to using the query-param trick. Ensures resources that have not been updated across versions are kept in cache but setup/maintenance is important. Either no ETag or reuse of the cache with long cache. Static resource versioning #712 already started this.
    • responses to be ETag'd with the kiwix-serve/libkiwix version plus long cache. Good start and very easy to implement IMO. Could be a first step.
  • / (homepage) to be ETag'd with kiwix-serve/libkiwix version plus no-cache.
  • content pages (have we settled on changing the API and only serve from /content/?) would be ETag'd with the library ID plus short cache.

⚠️ Important notes

  • --urlRootLocation is not to be worried about. Priority goes to end users. Switching locations is not our problem
  • --nodatealias will not be the hell it could have been thanks to Use iframe instead of introduce taskbar. #394. Pages under a no-date-alias or not would both include an iframe to the period-including URL under /raw/.

@kelson42
Copy link
Collaborator Author

@rgaudin Thx, it clarifies the behaviour with non-static content and seems correct me. Hopefuly for @veloman-yunkan and @mgautierfr as well?

@kelson42 kelson42 modified the milestones: 10.2.0, 10.3.0 Apr 23, 2022
@stale
Copy link

stale bot commented Jul 10, 2022

This issue has been automatically marked as stale because it has not had recent activity. It will be now be reviewed manually. Thank you for your contributions.

@stale stale bot added the stale label Jul 10, 2022
@kelson42
Copy link
Collaborator Author

@veloman-yunkan Now, that the fix to have an iframe based taskbar runs the last revirw ozeration, could we go fotward with this ticket?

@stale stale bot removed the stale label Jul 29, 2022
@veloman-yunkan
Copy link
Collaborator

@veloman-yunkan Now, that the fix to have an iframe based taskbar runs the last revirw ozeration, could we go fotward with this ticket?

@kelson42 Yrs, qe cuolf 😄

@kelson42
Copy link
Collaborator Author

kelson42 commented Oct 6, 2022

@veloman-yunkan Looks like we are now back to this ticket. Do you need to split it in smaller pieces?

@veloman-yunkan
Copy link
Collaborator

@kelson42 I am starting to work on it and will file subtickets as needed.

This was referenced Oct 7, 2022
@kelson42 kelson42 modified the milestones: 12.1.0, 12.0.0 Oct 29, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging a pull request may close this issue.

3 participants