Enhanced Non UTF8 HTML Support #261

Closed
wants to merge 2 commits into from

akshithio
Copied from #253:

Follows the recommended tips and tricks to provide better support for different character encodings and character sets in HTML pages.

This adds test cases to html_test.go, changes html.go to improve maintainability and robustness where applicable, and adds methods to models/url.go to handle different character encodings. The existing test cases in html_test.go now also check which assets are extracted, rather than only how many.

I'm open to any feedback.

I also took into account the comments on that PR about not manually using idna to convert URLs and hostnames to ASCII. I've also tried to incorporate the changes to html.go from commits 93f6658 and e2b245b, which would have caused a merge conflict on my original PR.

Attempts to close #169.
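
For readers unfamiliar with the problem, here is a minimal sketch of what handling a non-UTF-8 page can involve. It is not the code from this PR: `toUTF8FromLatin1` is a hypothetical helper, and it assumes the only legacy encoding is ISO-8859-1, where every byte maps directly to the Unicode code point of the same value. Real pages need charset detection (for example via the `golang.org/x/net/html/charset` package), which this stdlib-only sketch leaves out.

```go
package main

import (
	"fmt"
	"unicode/utf8"
)

// toUTF8FromLatin1 is a hypothetical helper (not part of the PR).
// It returns the input unchanged when it is already valid UTF-8 and
// otherwise decodes it as ISO-8859-1, where byte b is code point U+00b.
func toUTF8FromLatin1(raw []byte) string {
	if utf8.Valid(raw) {
		return string(raw)
	}
	runes := make([]rune, len(raw))
	for i, b := range raw {
		runes[i] = rune(b)
	}
	return string(runes)
}

func main() {
	// "café" encoded as ISO-8859-1: the single byte 0xE9 is 'é',
	// which is not a valid UTF-8 sequence on its own.
	latin1 := []byte{'c', 'a', 'f', 0xE9}
	fmt.Println(toUTF8FromLatin1(latin1))
}
```

The UTF-8 validity check comes first so that pages that are already UTF-8 pass through untouched; only invalid input is reinterpreted.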

Contributor

@Copilot Copilot AI left a comment

Copilot reviewed 3 out of 4 changed files in this pull request and generated 1 comment.

Files not reviewed (1)
  • go.mod: Language not supported
Comments suppressed due to low confidence (2)

pkg/models/url.go:376

  • The removal of the idna.ToASCII conversion for hostnames (present in the original code) may lead to issues with non-ASCII domain names. Consider reintroducing the IDNA conversion to ensure proper handling of internationalized hostnames.
switch urlCopy.Host {

internal/pkg/postprocessor/extractor/html.go:121

  • Filtering out any URL containing a '%' may inadvertently exclude valid percent-encoded URLs. Consider revising this filter to avoid excluding legitimate URLs.
if strings.Contains(url, "%") ||
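
One way to act on the second suppressed comment is to reject only strings whose percent escapes are malformed, instead of every string containing '%'. The following is a sketch under that assumption; `hasInvalidPercentEncoding` is a hypothetical name and is not taken from html.go. It relies on `url.PathUnescape` from the standard library, which returns an error when a '%' is not followed by two hexadecimal digits.

```go
package main

import (
	"fmt"
	"net/url"
	"strings"
)

// hasInvalidPercentEncoding is a hypothetical filter (not from the PR).
// It reports true only when s contains a '%' that does not begin a
// well-formed escape such as %C3 or %20.
func hasInvalidPercentEncoding(s string) bool {
	if !strings.Contains(s, "%") {
		return false
	}
	_, err := url.PathUnescape(s)
	return err != nil
}

func main() {
	candidates := []string{
		"https://example.com/caf%C3%A9.png", // valid percent-encoding
		"https://example.com/100%.png",      // stray '%', malformed
		"https://example.com/plain.png",     // no '%' at all
	}
	for _, c := range candidates {
		fmt.Println(c, hasInvalidPercentEncoding(c))
	}
}
```

With this check, legitimately percent-encoded asset URLs are kept and only malformed ones are dropped.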

Comment on lines +283 to 285
if err == nil {
	addRawAsset(escapedLink)
}
Copilot AI Apr 10, 2025

[nitpick] When URL unquoting fails in the script tag extraction block, the error is silently ignored. Consider logging or handling this error to aid debugging without suppressing potentially important information.

Suggested change
- if err == nil {
- 	addRawAsset(escapedLink)
- }
+ if err != nil {
+ 	log.Warnf("Failed to unquote script link %q: %v", scriptLink, err)
+ 	continue
+ }
+ addRawAsset(escapedLink)
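
The pattern in the suggestion above (log the failure, skip the item, keep going) can be sketched in isolation. This is an illustration under assumptions, not the extractor's code: `unquoteLinks` is a hypothetical function, unquoting is assumed to be `strconv.Unquote`, and logging uses `log/slog` rather than the `log.Warnf` call shown in the suggestion.

```go
package main

import (
	"fmt"
	"log/slog"
	"strconv"
)

// unquoteLinks is a hypothetical stand-in for the script-tag loop.
// Candidates that fail to unquote are logged and skipped, so one bad
// string literal does not hide the failure or stop the extraction.
func unquoteLinks(candidates []string) []string {
	var assets []string
	for _, c := range candidates {
		link, err := strconv.Unquote(c)
		if err != nil {
			slog.Warn("failed to unquote script link", "link", c, "error", err)
			continue
		}
		assets = append(assets, link)
	}
	return assets
}

func main() {
	got := unquoteLinks([]string{
		`"https://example.com/a.js"`, // well-formed quoted string
		`"unterminated`,              // missing closing quote
	})
	fmt.Println(got)
}
```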

  _, err := u.body.Seek(0, io.SeekStart)
  if err != nil {
- 	panic(err)
+ 	slog.Warn("failed to rewind body", "error", err)
Collaborator

An error here shouldn't happen, right? Why remove the panic?

@CorentinB CorentinB requested a review from NGTmeaty April 10, 2025 13:36
@akshithio akshithio mentioned this pull request Apr 21, 2025
yzqzss added a commit that referenced this pull request May 19, 2025
some code comes from #261
@yzqzss yzqzss closed this Jun 27, 2025
@yzqzss
Collaborator

yzqzss commented Jun 27, 2025

Thank you!

Continue in #370. :)

Development

Successfully merging this pull request may close these issues.

Handle non-UTF8 HTML pages
3 participants