Skip to content

Static resource versioning #712

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 12 commits into from
May 2, 2022
Merged

Static resource versioning #712

merged 12 commits into from
May 2, 2022

Conversation

veloman-yunkan
Copy link
Collaborator

@veloman-yunkan veloman-yunkan commented Feb 14, 2022

Addresses the static resource part of #650

HTML templates are now preprocessed by a new script kiwix-resources that detects links marked with ?KIWIXCACHEID and substitutes cacheid=CACHEID in lieu of the KIWIXCACHEID placeholder. The CACHEID value of a resource is obtained by taking the first 8 characters of its SHA1 hashsum hex digest. Inter-resource dependency is correctly handled by computing the cacheid for the preprocessed (rather than the source) version of a resource.

@codecov
Copy link

codecov bot commented Feb 14, 2022

Codecov Report

Merging #712 (0225a7e) into master (f90cc39) will decrease coverage by 0.08%.
The diff coverage is 100.00%.

❗ Current head 0225a7e differs from pull request most recent head 1b81ccc. Consider uploading reports for the commit 1b81ccc to get more accurate results

@@            Coverage Diff             @@
##           master     #712      +/-   ##
==========================================
- Coverage   61.25%   61.16%   -0.09%     
==========================================
  Files          58       58              
  Lines        3802     3796       -6     
  Branches     2112     2112              
==========================================
- Hits         2329     2322       -7     
- Misses       1472     1473       +1     
  Partials        1        1              
Impacted Files Coverage Δ
src/server/internalServer.cpp 82.80% <100.00%> (+0.25%) ⬆️
src/server/response.cpp 91.22% <0.00%> (-0.89%) ⬇️

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update f90cc39...1b81ccc. Read the comment docs.

@kelson42 kelson42 changed the title {WIP] Static resource versioning [WIP] Static resource versioning Feb 14, 2022
@mgautierfr mgautierfr force-pushed the unittests_for_404_html_page branch 2 times, most recently from 80ecc77 to 34d069e Compare February 16, 2022 13:25
Base automatically changed from unittests_for_404_html_page to master February 16, 2022 13:38
@veloman-yunkan veloman-yunkan force-pushed the static_resource_versioning branch from ddabc4d to 93bfd2a Compare February 16, 2022 15:27
@veloman-yunkan veloman-yunkan marked this pull request as ready for review February 16, 2022 15:29
@veloman-yunkan
Copy link
Collaborator Author

@mgautierfr @kelson42

Before merging this PR I would like to functionally cover with unit-tests entire code that produces responses from mustache templates so that we have a guarantee that nothing breaks. However, since that can be a lot of work, I think that this PR should be reviewed now because if the taken approach must be changed then an early decision can be helpful.

@kelson42
Copy link
Collaborator

@veloman-yunkan Fine to me. Appreciate your approach to improve tests before merging features.

Copy link
Member

@mgautierfr mgautierfr left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree with the global idea.

I've few comments:

  • static/skin/index.js also have link to resources we should cache. We must not only process the templates but also other resources
  • Instead of /skin/foo.png&<md5id>, I would prefer a /skin/foo.png&cacheid=<md5id>. It is a bit more verbose and convey correctly the intent of the "random numbers" at the end.
  • I wonder if we should also store the md5id somewhere in the cpp code if we have to generate links to resource on the cpp side. I don't know if it is really useful, just questioning.

@veloman-yunkan veloman-yunkan force-pushed the static_resource_versioning branch from 93bfd2a to 44df851 Compare February 26, 2022 19:55
@veloman-yunkan
Copy link
Collaborator Author

@mgautierfr

  • static/skin/index.js also have link to resources we should cache. We must not only process the templates but also other resources

I introduced an explicit KIWIXCACHEID placeholder in URLs where the cacheid must be injected. It seemed to be the easiest way to safely inject the cacheid in any type of a resource. The downside is that the URLs must now be manually marked (when adding in files under `static/' new URLs to static resources).

  • I wonder if we should also store the md5id somewhere in the cpp code if we have to generate links to resource on the cpp side. I don't know if it is really useful, just questioning.

If we ever need that then we can generate the link via a mustache template; hence no cacheid will need to be stored in the source code (excluding the generated file kiwixlib-resources.cpp).

Copy link
Member

@mgautierfr mgautierfr left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've few comments, mainly (early) optimization and simpler python code.
But it's working and agree with this PR

Comment on lines 65 to 72
with open(srcpath, 'r') as source:
for line in source:
ppline = preprocess_line(line)
if ppline != line:
modified_line_count += 1
preprocessed_lines.append(ppline)
return "".join(preprocessed_lines), modified_line_count
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why process each line separately ?
It should be possible to pass all the content to re.sub(and re.subn returns the number of occurrences replaced)

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done in a very slightly different way. But I doubt that the result is a net optimization (if that was the point of this suggestion).

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It was not a optimization. More a simplification of the code. This code is not really a critical path, I prefer to have simpler code

@mgautierfr
Copy link
Member

I introduced an explicit KIWIXCACHEID placeholder in URLs where the cacheid must be injected. It seemed to be the easiest way to safely inject the cacheid in any type of a resource. The downside is that the URLs must now be manually marked (when adding in files under `static/' new URLs to static resources).

Yes. There is a downside but we can handle it.

@stale
Copy link

stale bot commented Mar 30, 2022

This pull request has been automatically marked as stale because it has not had recent activity. It will be now be reviewed manually. Thank you for your contributions.

@stale stale bot added the stale label Mar 30, 2022
@mgautierfr
Copy link
Member

@veloman-yunkan do you plan to fix my comment ?
As I said, they are not blocker. If you don't plan to work on this anymore I will rebase on master and merge.

@veloman-yunkan
Copy link
Collaborator Author

@veloman-yunkan do you plan to fix my comment ? As I said, they are not blocker. If you don't plan to work on this anymore I will rebase on master and merge.

@mgautierfr As I have stated earlier in this PR I expected more work to be done related to testing. Some of the sought coverage improvement has already been merged into master by a few PRs extracted from #679 but there are still some pages that are not tested. If you feel comfortable having this PR merged before achieving 100% coverage of the functionally affected areas, then I can clean it up for final review.

@stale stale bot removed the stale label Mar 30, 2022
@kelson42
Copy link
Collaborator

@veloman-yunkan I'm supportive to 100% code coverage for this feature... Even if this takes a bit longer. That said, 1 month without update is a really long time. I understand the current situation is complex and probably exceptional (many fronts, many big PRs, etc.), that said, this is the kind of situation to avoid. Preferable are few PRs which move quickly forward to their respective goals.

@veloman-yunkan veloman-yunkan force-pushed the static_resource_versioning branch 2 times, most recently from 611e31a to e41b422 Compare April 3, 2022 19:33
@veloman-yunkan
Copy link
Collaborator Author

veloman-yunkan commented Apr 3, 2022

The current implementation contains a bug: it doesn't take into account the dependence of static resources on each other. If resource A refers to B and B refers to C, then a change in C would result in its cache id being updated in the preprocessed version of B. However the cache id of B won't change since the cache id is derived from the source rather than from the preprocessed output. Will fix that issue next week.

@veloman-yunkan
Copy link
Collaborator Author

  • I wonder if we should also store the md5id somewhere in the cpp code if we have to generate links to resource on the cpp side. I don't know if it is really useful, just questioning.

@mgautierfr In #744 I ran against a situation where this must be addressed (see #744 (comment))

@veloman-yunkan veloman-yunkan force-pushed the static_resource_versioning branch from 6b52abc to d60a1a4 Compare April 6, 2022 17:22
@veloman-yunkan veloman-yunkan changed the title [WIP] Static resource versioning Static resource versioning Apr 6, 2022
@veloman-yunkan veloman-yunkan force-pushed the static_resource_versioning branch from d60a1a4 to 46f0996 Compare April 6, 2022 17:31
@kelson42 kelson42 modified the milestones: 10.3.0, 10.2.0 Apr 23, 2022
Copy link
Member

@mgautierfr mgautierfr left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm globally ok with the code of this PR.
The questioned commit is acceptable (at least, while we have very few use case for search problem. If we have more need to generate link, we will need a proper solution)

But, I recently read this page https://medium.com/@hutyxxx/how-does-the-browser-know-what-to-cache-edfd5d63c239. At cache busting section it says that using a query-string is generally discouraged (Mainly because it seems that some proxies deactivate cache for resource with a query string). As we use a lot kiwix-serve behind a proxy, it would be nice to be sure we are ok
@rgaudin can you confirm/test/check this ?
If we don't go with a query string, we will have to change the url foo.js?cacheid=abcdef to foo-abcdef.js and store a mapping cacheid_url -> generic_url to allow the server to answer the request (and it may help us to fix the questioned commit)

@rgaudin
Copy link
Member

rgaudin commented Apr 25, 2022

But, I recently read this page https://medium.com/@hutyxxx/how-does-the-browser-know-what-to-cache-edfd5d63c239. At cache busting section it says that using a query-string is generally discouraged (Mainly because it seems that some proxies deactivate cache for resource with a query string). As we use a lot kiwix-serve behind a proxy, it would be nice to be sure we are ok @rgaudin can you confirm/test/check this ? If we don't go with a query string, we will have to change the url foo.js?cacheid=abcdef to foo-abcdef.js and store a mapping cacheid_url -> generic_url to allow the server to answer the request (and it may help us to fix the questioned commit)

I think this might be outdated information. Haven't tested with Squid which reportedly changed that back in 2008 but I've tested with varnish (the one we use) we use and query-stringed paths are cached properly. Otherwise would have been a shame considering our illustrations are only available though query strings.

I prefer query string as it leaves paths properly named and permanent-friendly (although resources don't have same expectations as content).

@mgautierfr mgautierfr force-pushed the static_resource_versioning branch from fa804b8 to 97aff7d Compare April 26, 2022 09:47
@mgautierfr
Copy link
Member

mgautierfr commented Apr 26, 2022

Then we can merge
I've rebased on master (and remove the [???] in the commit message.

Edit :

I will consider making them if the feedback provided on the undrafted version of this PR doesn't lead to changes that cancel the opportunity for such improvements.

Or we wait for @veloman-yunkan to do the change in the python code (or argue about them :) )

return line
def preprocess_text(s):
if 'KIWIXCACHEID' in s:
s = re.sub(RESOURCE_WITH_CACHEID_URL_PATTERN, set_cacheid, s)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

re.subn also return the number of substitution made, it would avoid us to compare the processed_content with the original content.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I also approached this change as a simplification. Returning a single value instead of a pair of values that must be consistent with each other is inline with that goal. A simple comparison of the source content and the result of transforming it makes preprocess_text() easier to enhance should such a need arise.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes. And this part is not critical at all. Perfect is the enemy of good. Let's move on.

@veloman-yunkan
Copy link
Collaborator Author

During rebase & fixup the commit message of the commit currently titled "An optimization (?) of resource preprocessing" must be rewritten.

Copy link
Member

@mgautierfr mgautierfr left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

During rebase & fixup the commit message of the commit currently titled "An optimization (?) of resource preprocessing" must be rewritten.

Please rebase-fixup and rewrite the message. Then we can merge.

Introduced a new unit-test which will ensure that static resources of
kiwix-serve have the cache ids applied to them in the links embedded into
the HTML code.

At this point there are no cache ids. The new unit-test will help to
visualize how they come into existence.
In template resources (found under static/templates), strings of the
form "PATH/TO/STATIC/RESOURCE?KIWIXCACHEID" are expanded into
"PATH/TO/STATIC/RESOURCE?cacheid=CACHEIDVAL" where CACHEIDVAL is a
8-digit hexadecimal hash digest of the file at
static/PATH/TO/STATIC/RESOURCE.
The story of search_results.css

static/skin/search_results.css was extracted from
static/templates/no_search_result.html before the latter was dropped.

static/templates/no_search_result.html in turn seems to be a copied and
edited version of static/templates/search_result.html.

In the context of exploratory work on the internationalization of
kiwix-serve (PR #679) I noticed duplication of inline CSS across those
two templates and intended to eliminated it. That goal was not fully
accomplished (static/templates/search_result.html remained untouched)
because by that time PR #679 grew too big and the efforts were diverted
into splitting it into smaller ones. Thus search_results.css slipped
into one of those small PRs, without making much sense because nothing
really justifies preserving custom CSS in the "Fulltext search unavailable"
error page.

At the same time, it served as the only case where a link to a cacheable
resource is generated in C++ code (rather than found in a template).
This poses certain problems to the handling of cache-ids. A workaround
is to expel the URL into a template so that it is processed by
`kiwix-resources`. This commit merely demonstrates that solution. But
whether it should be preserved (or rather the "Fulltext search
unavailable" page should be deprived of CSS) is questionable.
kiwix-resources preprocesses all resources rather than only templates. At
this point this doesn't change anything since only (some) template resources
contain KIWIXCACHEID placeholders. But this enhancement opens the door
to the preprocessing of static/skin/index.js (after preprocessing is
able to handle relative links, which comes in the next commit).
... but only if they contain "/skin/" as a substring.
If during an earlier build a resource was symlinked in the build
directory (because it wasn't modified by preprocessing) and later
changes are made to the resource that result in its preprocessing no
longer being a no-op, then the preprocessing is performed (in place) on
the original resource directly (via the symlink). Therefore any symlinks
must be removed before preprocessing a resource.
Formatted string literals appeared in Python 3.6. Some CI platforms
still use older versions of Python.
The current implementation of resource preprocessing contains a bug
(with respect to the problem that it tries to solve): it doesn't take
into account the dependence of static resources on each other. If
resource A refers to B and B refers to C, then a change in C would
result in its cache id being updated in the preprocessed version of B.
However the cache id of B won't change since the cache id is derived
from the source rather than from the preprocessed output.

This commit is the first step towards addressing the described issue.

Now cache-id of a resource is computed on demand rather than precomputed
for all resources. The only thing remaining is to compute the cache-id
from the preprocessed content.
The cache-id of resources now includes dependency information. This commit
illustrates that property with the changed cache-id of skin/index.js which
depends on skin/{download,hash,magnet,bittorent}.png.

The implementation is not fool-proof - cyclic dependency between
resources is not detected and will lead to infinite recursion.
Now the whole content of a resource is preprocessed with a single
invocation of `re.sub()` rather than line-by-line.

Also, the function `get_preprocessed_resource()` returns a single value
rather than a (preprocessed_content, modification_count) pair; the
situation when the preprocessed resource is identical to the source
version is signalled by a return value of None.
@veloman-yunkan veloman-yunkan force-pushed the static_resource_versioning branch from 0225a7e to 1b81ccc Compare May 2, 2022 16:50
@veloman-yunkan
Copy link
Collaborator Author

Please rebase-fixup and rewrite the message. Then we can merge.

Done

@kelson42 kelson42 merged commit 26eccb5 into master May 2, 2022
@kelson42 kelson42 deleted the static_resource_versioning branch May 2, 2022 21:49
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants