fix leak token when input url as raw html #1179
Walkthrough
A conditional check was added after fetching page content to identify URLs starting with "raw:" or "raw://". Such URLs are internally replaced with "RAW html" and influence the [...]
@@ -360,7 +363,7 @@ async def arun(
     pdf_data=pdf_data,
     verbose=config.verbose,
     is_raw_html=True if url.startswith("raw:") else False,
🛠️ Refactor suggestion
Use the preserved flag when calling aprocess_html
The is_raw_html parameter should use the flag set above, not re-inspect the mutated URL. Update this line to:
- is_raw_html=True if url.startswith("raw:") else False,
+ is_raw_html=is_raw_html,
🧰 Tools
🪛 Ruff (0.11.9)
365-365: Use bool(...) instead of True if ... else False. Replace with bool(...). (SIM210)
🪛 Pylint (3.3.7)
[refactor] 365-365: The if expression can be replaced with 'bool(test)' (R1719)
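Both lint findings point at the same simplification: str.startswith already returns a bool, so wrapping it in a conditional expression adds nothing. A minimal sketch of the equivalence follows; the two helper names are hypothetical and exist only for this illustration, they are not part of crawl4ai.

```python
def is_raw_verbose(url: str) -> bool:
    # Form flagged by Ruff SIM210 / Pylint R1719: redundant conditional expression.
    return True if url.startswith("raw:") else False


def is_raw_simple(url: str) -> bool:
    # str.startswith already returns a bool, so it can be returned directly.
    return url.startswith("raw:")


# Both forms agree on raw and non-raw inputs.
samples = ["raw:<p>hi</p>", "raw://<p>hi</p>", "https://example.com", ""]
results = [(is_raw_verbose(s), is_raw_simple(s)) for s in samples]
```

Note that this only removes the redundancy. It does not address the separate problem of checking the prefix after the URL has been overwritten.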
🤖 Prompt for AI Agents
In crawl4ai/async_webcrawler.py at line 365, the is_raw_html parameter is
incorrectly set by re-inspecting the mutated URL instead of using the preserved
flag defined earlier. Modify the is_raw_html argument to use the existing
preserved flag variable directly rather than checking the URL prefix again.
if url.startswith("raw:") or url.startswith("raw://"):
    url = "RAW html"
🛠️ Refactor suggestion
Preserve raw HTML flag before mutating URL
The current logic replaces the url without capturing whether it originated from raw HTML, and then you recompute is_raw_html against the mutated URL (always false). Instead, compute a flag first and then set the display url.
Apply this diff:
- if url.startswith("raw:") or url.startswith("raw://"):
- url = "RAW html"
+ # Detect raw HTML input and prepare display URL
+ is_raw_html = url.startswith("raw:") or url.startswith("raw://")
+ url = "Raw HTML" if is_raw_html else url
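The ordering problem described in this comment can be reproduced in isolation. The sketch below is an assumption-laden illustration, not the crawl4ai code: both function names are hypothetical, and each returns a (display URL, is_raw_html) pair so the difference is easy to inspect. Because "raw://" itself starts with "raw:", a single prefix check covers both forms.

```python
def prepare_buggy(url: str) -> tuple[str, bool]:
    # Mirrors the ordering under review: the URL is overwritten first...
    if url.startswith("raw:") or url.startswith("raw://"):
        url = "RAW html"
    # ...so this later check inspects the placeholder and can never match.
    is_raw_html = True if url.startswith("raw:") else False
    return url, is_raw_html


def prepare_fixed(url: str) -> tuple[str, bool]:
    # Capture the flag from the original URL before replacing it.
    is_raw_html = url.startswith("raw:")  # also true for "raw://..."
    display_url = "Raw HTML" if is_raw_html else url
    return display_url, is_raw_html


raw_input_url = "raw:<html><body>secret-token</body></html>"
buggy = prepare_buggy(raw_input_url)
fixed = prepare_fixed(raw_input_url)
```

In both versions the raw payload no longer appears in the display URL, but only the second keeps the information that the input was raw HTML.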
🤖 Prompt for AI Agents
In crawl4ai/async_webcrawler.py around lines 338 to 340, the code replaces the
url string before checking if it is raw HTML, causing the raw HTML flag to be
lost. To fix this, first compute a boolean flag indicating if the original url
starts with "raw:" or "raw://", then assign the display url to "RAW html" if
needed, preserving the raw HTML flag for later use.