Enhanced Non UTF8 HTML Support #261

akshithio · 2025-04-07T19:29:12Z

Copied from #253:

Follows the recommended tips and tricks to provide better support for different character encodings and character sets in HTML pages.

This adds test cases to html_test.go, changes html.go to improve maintainability and robustness where applicable, and adds methods to models/url.go to handle different character encodings. The test cases in html_test.go now also check which assets are extracted, rather than only how many.
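For readers unfamiliar with the problem, here is a minimal sketch of what handling a non-UTF-8 page involves. It is not the code in this PR: `charsetFromContentType` and `latin1ToUTF8` are hypothetical helpers, and it only covers ISO-8859-1 declared in a Content-Type header. A real implementation would more likely rely on a package such as golang.org/x/net/html/charset, which also inspects meta tags and the content itself.

```go
package main

import (
	"fmt"
	"mime"
	"strings"
)

// charsetFromContentType (hypothetical helper) returns the lowercase charset
// parameter of a Content-Type header value, or "" if the header is invalid or
// declares no charset.
func charsetFromContentType(contentType string) string {
	_, params, err := mime.ParseMediaType(contentType)
	if err != nil {
		return ""
	}
	return strings.ToLower(params["charset"])
}

// latin1ToUTF8 (hypothetical helper) converts ISO-8859-1 bytes to a UTF-8
// string. Every Latin-1 byte maps to the Unicode code point of the same value.
func latin1ToUTF8(b []byte) string {
	runes := make([]rune, len(b))
	for i, c := range b {
		runes[i] = rune(c)
	}
	return string(runes)
}

func main() {
	// "café" encoded as ISO-8859-1: the é is the single byte 0xE9,
	// which is not valid UTF-8 on its own.
	body := []byte{'c', 'a', 'f', 0xE9}

	if charsetFromContentType("text/html; charset=ISO-8859-1") == "iso-8859-1" {
		fmt.Println(latin1ToUTF8(body))
	}
}
```

Without the conversion step, a link or attribute containing the 0xE9 byte would be read as invalid UTF-8, which is how extracted asset URLs end up mangled.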

I'm open to any feedback.

I also tried to take into account the comments on that PR about not manually using idna to convert URLs and hostnames to ASCII. I've also tried to include the changes I noticed in commits 93f6658 and e2b245b to html.go, which would have caused a merge conflict on my original PR.

Attempts to close #169.

Copilot

Copilot reviewed 3 out of 4 changed files in this pull request and generated 1 comment.

Files not reviewed (1)

go.mod: Language not supported

Comments suppressed due to low confidence (2)

pkg/models/url.go:376

The removal of the idna.ToASCII conversion for hostnames (present in the original code) may lead to issues with non-ASCII domain names. Consider reintroducing the IDNA conversion to ensure proper handling of internationalized hostnames.

switch urlCopy.Host {

internal/pkg/postprocessor/extractor/html.go:121

Filtering out any URL containing a '%' may inadvertently exclude valid percent-encoded URLs. Consider revising this filter to avoid excluding legitimate URLs.

if strings.Contains(url, "%") ||
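One possible alternative to a blanket '%' filter, sketched under assumptions: rather than rejecting every URL that contains '%', reject only those whose escapes are malformed. `hasValidPercentEncoding` is a hypothetical helper, not something in html.go, and the rest of the original condition is not shown in this excerpt, so this only illustrates the idea.

```go
package main

import (
	"fmt"
	"net/url"
)

// hasValidPercentEncoding (hypothetical helper) reports whether every '%' in s
// starts a well-formed %XX escape. It relies on url.PathUnescape returning an
// error for malformed escapes.
func hasValidPercentEncoding(s string) bool {
	_, err := url.PathUnescape(s)
	return err == nil
}

func main() {
	candidates := []string{
		"https://example.com/caf%C3%A9.png", // well-formed escape
		"https://example.com/100%",          // dangling '%'
		"https://example.com/%zz.js",        // non-hex digits after '%'
	}
	for _, candidate := range candidates {
		fmt.Printf("%-36s valid=%t\n", candidate, hasValidPercentEncoding(candidate))
	}
}
```

With a check like this, legitimately percent-encoded asset URLs would pass through while template fragments such as a trailing `100%` would still be dropped.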

Copilot · 2025-04-10T13:23:37Z

internal/pkg/postprocessor/extractor/html.go

+						if err == nil {
+							addRawAsset(escapedLink)
 						}


[nitpick] When URL unquoting fails in the script tag extraction block, the error is silently ignored. Consider logging or handling this error to aid debugging without suppressing potentially important information.

Suggested change

-						if err == nil {
-							addRawAsset(escapedLink)
-						}
+						if err != nil {
+							log.Warnf("Failed to unquote script link %q: %v", scriptLink, err)
+							continue
+						}
+						addRawAsset(escapedLink)
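
A self-contained sketch of the same pattern, for trying it outside the extractor. It assumes the unquoting is done with strconv.Unquote and logs through log/slog; the surrounding loop, `unquoteScriptLinks`, and the sample links are hypothetical and stand in for the script-tag block in html.go.

```go
package main

import (
	"log/slog"
	"strconv"
)

// unquoteScriptLinks (hypothetical) unquotes each quoted link found in a
// script tag. Links that fail to unquote are logged and skipped instead of
// being dropped silently.
func unquoteScriptLinks(scriptLinks []string) []string {
	var assets []string
	for _, scriptLink := range scriptLinks {
		escapedLink, err := strconv.Unquote(scriptLink)
		if err != nil {
			slog.Warn("failed to unquote script link", "link", scriptLink, "error", err)
			continue
		}
		assets = append(assets, escapedLink)
	}
	return assets
}

func main() {
	links := []string{
		`"https://example.com/a\u002Fb.js"`, // escaped slash, unquotes cleanly
		`"unterminated`,                     // missing closing quote, logged and skipped
	}
	for _, asset := range unquoteScriptLinks(links) {
		slog.Info("extracted asset", "url", asset)
	}
}
```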

CorentinB · 2025-04-10T13:25:50Z

pkg/models/url.go

 	_, err := u.body.Seek(0, io.SeekStart)
 	if err != nil {
-		panic(err)
+		slog.Warn("failed to rewind body", "error", err)


An error here shouldn't happen, right? Why remove the panic?
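
A third option besides panicking or only logging, as a sketch: return the error to the caller. Whether Seek can actually fail depends on the concrete type behind `u.body`, which this excerpt does not show. `rewind` and `failingSeeker` are hypothetical names used only for illustration.

```go
package main

import (
	"bytes"
	"errors"
	"fmt"
	"io"
)

// rewind (hypothetical helper) moves body back to its start and returns the
// Seek error to the caller instead of panicking or only logging it.
func rewind(body io.Seeker) error {
	if _, err := body.Seek(0, io.SeekStart); err != nil {
		return fmt.Errorf("rewind body: %w", err)
	}
	return nil
}

// failingSeeker is a test double whose Seek always fails, standing in for a
// body that can no longer be rewound (for example, a closed file).
type failingSeeker struct{}

func (failingSeeker) Seek(offset int64, whence int) (int64, error) {
	return 0, errors.New("seek not possible")
}

func main() {
	body := bytes.NewReader([]byte("<html></html>"))
	// Consume the body once, as an extractor would.
	if _, err := io.Copy(io.Discard, body); err != nil {
		fmt.Println("copy:", err)
		return
	}
	fmt.Println("in-memory rewind error:", rewind(body))
	fmt.Println("failing rewind error:", rewind(failingSeeker{}))
}
```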

yzqzss · 2025-06-27T16:01:32Z

Thank you!

Continue in #370. :)

akshithio added 2 commits April 7, 2025 15:21

feat: non utf-8 html support

abc1ecd

fix: keep changes from dev/v2 merge

54ba22e

CorentinB requested a review from Copilot April 10, 2025 13:22

CorentinB assigned akshithio Apr 10, 2025

Copilot AI reviewed Apr 10, 2025

CorentinB reviewed Apr 10, 2025

CorentinB requested a review from NGTmeaty April 10, 2025 13:36

akshithio mentioned this pull request Apr 21, 2025

EPUB Support #260

Open

yzqzss added a commit that referenced this pull request May 19, 2025

WIP commit

d515056

some code comes from #261

yzqzss closed this Jun 27, 2025
