-
-
Notifications
You must be signed in to change notification settings - Fork 62
Static resource versioning #712
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
Codecov Report
@@ Coverage Diff @@
## master #712 +/- ##
==========================================
- Coverage 61.25% 61.16% -0.09%
==========================================
Files 58 58
Lines 3802 3796 -6
Branches 2112 2112
==========================================
- Hits 2329 2322 -7
- Misses 1472 1473 +1
Partials 1 1
Continue to review full report at Codecov.
|
80ecc77
to
34d069e
Compare
ddabc4d
to
93bfd2a
Compare
Before merging this PR I would like to functionally cover with unit-tests entire code that produces responses from mustache templates so that we have a guarantee that nothing breaks. However, since that can be a lot of work, I think that this PR should be reviewed now because if the taken approach must be changed then an early decision can be helpful. |
@veloman-yunkan Fine to me. Appreciate your approach to improve tests before merging features. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I agree with the global idea.
I've few comments:
static/skin/index.js
also have link to resources we should cache. We must not only process the templates but also other resources- Instead of
/skin/foo.png&<md5id>
, I would prefer a/skin/foo.png&cacheid=<md5id>
. It is a bit more verbose and convey correctly the intent of the "random numbers" at the end. - I wonder if we should also store the md5id somewhere in the cpp code if we have to generate links to resource on the cpp side. I don't know if it is really useful, just questioning.
93bfd2a
to
44df851
Compare
I introduced an explicit
If we ever need that then we can generate the link via a mustache template; hence no cacheid will need to be stored in the source code (excluding the generated file |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I've few comments, mainly (early) optimization and simpler python code.
But it's working and agree with this PR
scripts/kiwix-resources
Outdated
with open(srcpath, 'r') as source: | ||
for line in source: | ||
ppline = preprocess_line(line) | ||
if ppline != line: | ||
modified_line_count += 1 | ||
preprocessed_lines.append(ppline) | ||
return "".join(preprocessed_lines), modified_line_count |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why process each line separately ?
It should be possible to pass all the content to re.sub
(and re.subn
returns the number of occurrences replaced)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done in a very slightly different way. But I doubt that the result is a net optimization (if that was the point of this suggestion).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It was not a optimization. More a simplification of the code. This code is not really a critical path, I prefer to have simpler code
Yes. There is a downside but we can handle it. |
This pull request has been automatically marked as stale because it has not had recent activity. It will be now be reviewed manually. Thank you for your contributions. |
@veloman-yunkan do you plan to fix my comment ? |
@mgautierfr As I have stated earlier in this PR I expected more work to be done related to testing. Some of the sought coverage improvement has already been merged into master by a few PRs extracted from #679 but there are still some pages that are not tested. If you feel comfortable having this PR merged before achieving 100% coverage of the functionally affected areas, then I can clean it up for final review. |
@veloman-yunkan I'm supportive to 100% code coverage for this feature... Even if this takes a bit longer. That said, 1 month without update is a really long time. I understand the current situation is complex and probably exceptional (many fronts, many big PRs, etc.), that said, this is the kind of situation to avoid. Preferable are few PRs which move quickly forward to their respective goals. |
611e31a
to
e41b422
Compare
The current implementation contains a bug: it doesn't take into account the dependence of static resources on each other. If resource A refers to B and B refers to C, then a change in C would result in its cache id being updated in the preprocessed version of B. However the cache id of B won't change since the cache id is derived from the source rather than from the preprocessed output. Will fix that issue next week. |
e41b422
to
d150264
Compare
@mgautierfr In #744 I ran against a situation where this must be addressed (see #744 (comment)) |
6b52abc
to
d60a1a4
Compare
d60a1a4
to
46f0996
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm globally ok with the code of this PR.
The questioned commit is acceptable (at least, while we have very few use case for search problem. If we have more need to generate link, we will need a proper solution)
But, I recently read this page https://medium.com/@hutyxxx/how-does-the-browser-know-what-to-cache-edfd5d63c239. At cache busting section it says that using a query-string is generally discouraged (Mainly because it seems that some proxies deactivate cache for resource with a query string). As we use a lot kiwix-serve behind a proxy, it would be nice to be sure we are ok
@rgaudin can you confirm/test/check this ?
If we don't go with a query string, we will have to change the url foo.js?cacheid=abcdef
to foo-abcdef.js
and store a mapping cacheid_url -> generic_url
to allow the server to answer the request (and it may help us to fix the questioned commit)
I think this might be outdated information. Haven't tested with Squid which reportedly changed that back in 2008 but I've tested with varnish (the one we use) we use and query-stringed paths are cached properly. Otherwise would have been a shame considering our illustrations are only available though query strings. I prefer query string as it leaves paths properly named and permanent-friendly (although resources don't have same expectations as content). |
fa804b8
to
97aff7d
Compare
Edit :
Or we wait for @veloman-yunkan to do the change in the python code (or argue about them :) ) |
return line | ||
def preprocess_text(s): | ||
if 'KIWIXCACHEID' in s: | ||
s = re.sub(RESOURCE_WITH_CACHEID_URL_PATTERN, set_cacheid, s) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
re.subn
also return the number of substitution made, it would avoid us to compare the processed_content with the original content.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I also approached this change as a simplification. Returning a single value instead of a pair of values that must be consistent with each other is inline with that goal. A simple comparison of the source content and the result of transforming it makes preprocess_text()
easier to enhance should such a need arise.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes. And this part is not critical at all. Perfect is the enemy of good. Let's move on.
During rebase & fixup the commit message of the commit currently titled "An optimization (?) of resource preprocessing" must be rewritten. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM
During rebase & fixup the commit message of the commit currently titled "An optimization (?) of resource preprocessing" must be rewritten.
Please rebase-fixup and rewrite the message. Then we can merge.
Introduced a new unit-test which will ensure that static resources of kiwix-serve have the cache ids applied to them in the links embedded into the HTML code. At this point there are no cache ids. The new unit-test will help to visualize how they come into existence.
In template resources (found under static/templates), strings of the form "PATH/TO/STATIC/RESOURCE?KIWIXCACHEID" are expanded into "PATH/TO/STATIC/RESOURCE?cacheid=CACHEIDVAL" where CACHEIDVAL is a 8-digit hexadecimal hash digest of the file at static/PATH/TO/STATIC/RESOURCE.
The story of search_results.css static/skin/search_results.css was extracted from static/templates/no_search_result.html before the latter was dropped. static/templates/no_search_result.html in turn seems to be a copied and edited version of static/templates/search_result.html. In the context of exploratory work on the internationalization of kiwix-serve (PR #679) I noticed duplication of inline CSS across those two templates and intended to eliminated it. That goal was not fully accomplished (static/templates/search_result.html remained untouched) because by that time PR #679 grew too big and the efforts were diverted into splitting it into smaller ones. Thus search_results.css slipped into one of those small PRs, without making much sense because nothing really justifies preserving custom CSS in the "Fulltext search unavailable" error page. At the same time, it served as the only case where a link to a cacheable resource is generated in C++ code (rather than found in a template). This poses certain problems to the handling of cache-ids. A workaround is to expel the URL into a template so that it is processed by `kiwix-resources`. This commit merely demonstrates that solution. But whether it should be preserved (or rather the "Fulltext search unavailable" page should be deprived of CSS) is questionable.
kiwix-resources preprocesses all resources rather than only templates. At this point this doesn't change anything since only (some) template resources contain KIWIXCACHEID placeholders. But this enhancement opens the door to the preprocessing of static/skin/index.js (after preprocessing is able to handle relative links, which comes in the next commit).
... but only if they contain "/skin/" as a substring.
If during an earlier build a resource was symlinked in the build directory (because it wasn't modified by preprocessing) and later changes are made to the resource that result in its preprocessing no longer being a no-op, then the preprocessing is performed (in place) on the original resource directly (via the symlink). Therefore any symlinks must be removed before preprocessing a resource.
Formatted string literals appeared in Python 3.6. Some CI platforms still use older versions of Python.
The current implementation of resource preprocessing contains a bug (with respect to the problem that it tries to solve): it doesn't take into account the dependence of static resources on each other. If resource A refers to B and B refers to C, then a change in C would result in its cache id being updated in the preprocessed version of B. However the cache id of B won't change since the cache id is derived from the source rather than from the preprocessed output. This commit is the first step towards addressing the described issue. Now cache-id of a resource is computed on demand rather than precomputed for all resources. The only thing remaining is to compute the cache-id from the preprocessed content.
The cache-id of resources now includes dependency information. This commit illustrates that property with the changed cache-id of skin/index.js which depends on skin/{download,hash,magnet,bittorent}.png. The implementation is not fool-proof - cyclic dependency between resources is not detected and will lead to infinite recursion.
Now the whole content of a resource is preprocessed with a single invocation of `re.sub()` rather than line-by-line. Also, the function `get_preprocessed_resource()` returns a single value rather than a (preprocessed_content, modification_count) pair; the situation when the preprocessed resource is identical to the source version is signalled by a return value of None.
0225a7e
to
1b81ccc
Compare
Done |
Addresses the static resource part of #650
HTML templates are now preprocessed by a new script
kiwix-resources
that detects links marked with?KIWIXCACHEID
and substitutescacheid=CACHEID
in lieu of theKIWIXCACHEID
placeholder. TheCACHEID
value of a resource is obtained by taking the first 8 characters of its SHA1 hashsum hex digest. Inter-resource dependency is correctly handled by computing the cacheid for the preprocessed (rather than the source) version of a resource.