-
Notifications
You must be signed in to change notification settings - Fork 39
Enhanced Non UTF8 HTML Support #261
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Copilot reviewed 3 out of 4 changed files in this pull request and generated 1 comment.
Files not reviewed (1)
- go.mod: Language not supported
Comments suppressed due to low confidence (2)
pkg/models/url.go:376
- The removal of the idna.ToASCII conversion for hostnames (present in the original code) may lead to issues with non-ASCII domain names. Consider reintroducing the IDNA conversion to ensure proper handling of internationalized hostnames.
switch urlCopy.Host {
internal/pkg/postprocessor/extractor/html.go:121
- Filtering out any URL containing a '%' may inadvertently exclude valid percent-encoded URLs. Consider revising this filter to avoid excluding legitimate URLs.
if strings.Contains(url, "%") ||
if err == nil { | ||
addRawAsset(escapedLink) | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
[nitpick] When URL unquoting fails in the script tag extraction block, the error is silently ignored. Consider logging or handling this error to aid debugging without suppressing potentially important information.
if err == nil { | |
addRawAsset(escapedLink) | |
} | |
if err != nil { | |
log.Warnf("Failed to unquote script link %q: %v", scriptLink, err) | |
continue | |
} | |
addRawAsset(escapedLink) |
Copilot uses AI. Check for mistakes.
_, err := u.body.Seek(0, io.SeekStart) | ||
if err != nil { | ||
panic(err) | ||
slog.Warn("failed to rewind body", "error", err) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
An error here shouldn't happen, right? Why removing the panic?
Thank you! Continue in #370. :) |
Copied from #253:
Follows the recommended tips and trick to provide better support for different character encodings and character sets in HTML pages.
This adds additional test cases in
html_test.go
, makes changes tohtml.go
to improve maintainability and robustness where applicable and adds methods tomodels/url.go
to handle different character encodings. Test cases have also been changed inhtml_test.go
to consider the assets that are extracted rather than purely the number of them that are extracted.I'm open to any feedback
I also tried to take into consideration the comments mentioned on that PR to not manually use
idna
to convert the URLs and hostnames to ASCII. I've also tried to include the changes I noticed on commit 93f6658 and e2b245b tohtml.go
which would have caused a merge conflict on my original PR.Attempts to close #169.