Fix TypeError when keep_data_attributes=False by ensuring list concat… #1234
Walkthrough
The changes focus on standardizing string quoting to double quotes, improving code formatting, and enhancing defensive programming. The main functional update ensures that the
Changes
Sequence Diagram(s)
sequenceDiagram
participant User
participant ContentScrapingStrategy
participant remove_unwanted_attributes
User->>ContentScrapingStrategy: scrap(html, keep_attrs, keep_data_attributes)
ContentScrapingStrategy->>ContentScrapingStrategy: _process_element(element, keep_attrs, keep_data_attributes)
alt keep_attrs or keep_data_attributes is bool
ContentScrapingStrategy->>ContentScrapingStrategy: Convert to list if not already
end
ContentScrapingStrategy->>remove_unwanted_attributes: remove_unwanted_attributes(element, keep_attrs + keep_data_attributes)
remove_unwanted_attributes-->>ContentScrapingStrategy: Processed element
ContentScrapingStrategy-->>User: Return processed data
Assessment against linked issues
Out-of-scope changes: No out-of-scope changes were found.
Actionable comments posted: 0
🧹 Nitpick comments (1)
crawl4ai/content_scraping_strategy.py (1)
277-351: Consider refactoring to reduce complexity
While the quote consistency changes are good, this method has high complexity as noted by static analysis (21 local variables, 13 branches). Consider breaking this into smaller helper methods.
For example, you could extract header processing and row processing into separate methods:
+    def _extract_table_headers(self, table: Tag) -> list:
+        """Extract headers from table with colspan handling."""
+        headers = []
+        thead_rows = table.select("thead tr")
+        if thead_rows:
+            header_cells = thead_rows[0].select("th")
+            for cell in header_cells:
+                text = cell.get_text().strip()
+                colspan = int(cell.get("colspan", 1))
+                headers.extend([text] * colspan)
+        else:
+            first_row = table.select("tr:first-child")
+            if first_row:
+                for cell in first_row[0].select("th, td"):
+                    text = cell.get_text().strip()
+                    colspan = int(cell.get("colspan", 1))
+                    headers.extend([text] * colspan)
+        return headers
+
+    def _extract_table_rows(self, table: Tag, headers: list) -> list:
+        """Extract rows from table with colspan handling."""
+        # Implementation here...
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (1)
crawl4ai/content_scraping_strategy.py (16 hunks)
🧰 Additional context used
🪛 Pylint (3.3.7)
crawl4ai/content_scraping_strategy.py
[refactor] 277-277: Too many local variables (21/15) (R0914)
[refactor] 277-277: Too many branches (13/12) (R0912)
🔇 Additional comments (11)
crawl4ai/content_scraping_strategy.py (11)
52-52: LGTM: Quote consistency improvement
Good practice to standardize on double quotes for string literals.
160-160: LGTM: Trailing comma addition
Adding the trailing comma improves maintainability and reduces diff noise when adding new items to the list.
213-276: LGTM: CSS selector quote consistency
The standardization to double quotes in CSS selectors and string literals improves code consistency. The logic for table data detection remains unchanged and correct.
793-814: Excellent bug fix for type safety
This change properly addresses the TypeError when keep_attrs or keep_data_attributes are passed as boolean values instead of lists. The implementation correctly handles all edge cases:
- False → empty list []
- True or other truthy single values → wrapped in list [value]
- Already lists → unchanged
- None or other falsy values → empty list []
This ensures safe concatenation with IMPORTANT_ATTRS + keep_attrs.
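To make the cases above concrete, here is a minimal sketch of that kind of normalization. It is not the code from this PR: the helper names normalize_to_list and merged_attrs are hypothetical, and the contents of IMPORTANT_ATTRS here are a stand-in, since the real list is not shown in this review.

```python
from typing import Any, List

# Stand-in values only; the real IMPORTANT_ATTRS contents are an assumption.
IMPORTANT_ATTRS: List[str] = ["src", "href", "alt"]


def normalize_to_list(value: Any) -> List[Any]:
    """Hypothetical helper that coerces a config value into a list."""
    if isinstance(value, list):
        return value   # already a list: unchanged
    if not value:
        return []      # False, None, and other falsy values: empty list
    return [value]     # True or another truthy single value: wrapped


def merged_attrs(keep_attrs: Any) -> List[Any]:
    """Concatenate safely, whatever type the caller passed in."""
    return IMPORTANT_ATTRS + normalize_to_list(keep_attrs)


# Without normalization, list + bool raises the TypeError described above.
try:
    IMPORTANT_ATTRS + False  # type: ignore[operator]
    raised = False
except TypeError:
    raised = True
```

With this shape, merged_attrs(False) and merged_attrs(None) both return a list equal to IMPORTANT_ATTRS, while merged_attrs("title") appends the single value. Whether True should be wrapped as [True] or treated as "keep everything" is a design decision this sketch does not settle.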
899-901: LGTM: Error message formatting improvement
The multi-line formatting improves readability of the error message.
950-952: LGTM: Consistent error logging format
The improved formatting for error logging maintains consistency with the project's logging patterns.
998-998: LGTM: Quote consistency in table exclusion check
Consistent with the overall double-quote standardization throughout the file.
1428-1428: LGTM: Mathematical calculation formatting
The spacing around the exponentiation operator improves readability.
1439-1443: LGTM: Multi-line formatting for complex expressions
Breaking the complex sum expression across multiple lines significantly improves readability.
1582-1592: LGTM: Improved error logging formatting
The multi-line formatting for error messages improves code readability and maintainability.
1712-1712: LGTM: Dictionary formatting consistency
Single-line dictionary formatting is appropriate for simple structures and maintains consistency.
Summary
Fixes #1226: This PR resolves a TypeError that occurs when keep_data_attributes=False is passed in the CrawlerRunConfig. The issue was due to a list concatenation with a bool, which is now safely handled by converting False to an empty list before concatenation.
List of Files Changed and Why
path/to/affected_file.py – Added type-check and logic to ensure keep_attrs and keep_data_attributes are always treated as lists, preventing runtime errors during attribute removal.
test_keep_attrs_fix.py – Added a test script to verify that the fix works for edge cases like False, True, None, and missing keys.
How Has This Been Tested?
keep_attrs and keep_data_attributes.
Checklist
Summary by CodeRabbit
Bug Fixes
Style