update RulesRegistry to allow multiple rules in a single Page Object #101
Conversation
@@ -248,14 +238,6 @@ def search(self, **kwargs) -> List[ApplyRule]:
        """
        # Short-circuit operation if "use" is the only search param used, since
I've checked scrapy-poet code, and it seems that currently fast lookups in the registry are not needed, as matching is implemented in a different way. So, it seems that's fine to drop this optimization. We still should mention this change in the changelog though.
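To make the trade-off concrete, here is a hypothetical, simplified sketch (the names `ApplyRule`, `MyPage`, and `search` mirror web-poet's API but this is not the actual web-poet code) contrasting the dropped O(1) lookup with the O(N) scan that replaces it:

```python
# Hypothetical sketch of the complexity change discussed above; not actual
# web-poet code.
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class ApplyRule:
    """Simplified stand-in for web_poet.ApplyRule."""
    for_patterns: str
    use: type


class MyPage: ...
class OtherPage: ...


rules = [ApplyRule("site1.com", MyPage), ApplyRule("site2.com", OtherPage)]

# Before this PR: a dict keyed by `use` made search(use=...) an O(1) lookup,
# which only works while each page object appears in at most one rule.
rules_by_use = {rule.use: rule for rule in rules}

# After this PR: one page object may appear in several rules, so search()
# must do an O(N) scan over all rules and return a list.
def search(use: type) -> List[ApplyRule]:
    return [rule for rule in rules if rule.use is use]


assert rules_by_use[MyPage].for_patterns == "site1.com"  # O(1), single rule
assert search(MyPage) == [rules[0]]                      # O(N), may be many
```

The scan is what allows `search(use=MyPage)` to return multiple rules, at the cost of the constant-time lookup.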
Because of the algorithmic complexity change, I think it makes sense to consider it a blocker for 0.6.0
(i.e. bundle all changes like this in 0.6.0)
I've checked scrapy-poet code, and it seems that currently fast lookups in the registry are not needed,
I think one optimization we need is fast lookups for returning the rules for a given request (particularly its URL). Currently, it's looking through all the rules (ref):
def overrides_for(self, request: Request) -> Mapping[Callable, Callable]:
    overrides: Dict[Callable, Callable] = {}
    for instead_of, matcher in self.matcher.items():
        rule_id = matcher.match(request.url)
        if rule_id is not None:
            overrides[instead_of] = self.rules[rule_id].use
    return overrides
Although I'm optimizing this in the scrapinghub/scrapy-poet#88 PR, since it needs a refactor anyway for the new to_return functionality.
Because of the algorithmic complexity change, I think it makes sense to consider it a blocker for 0.6.0
Does this mean that we'll need to consider bundling this change into 0.6.0?
Currently, it's looking through all the rules
I think it's not that bad. self.matcher (not a great name, by the way) is a dictionary of {generic_cls: URLMatcher}. So, the loop happens not against all rules, but against all generic (overridable) classes. For example, if you have GenericProductPage and GenericProductListPage, and there are page objects for 1000 websites, the loop would do only 2 iterations, not 1000 or 2000.
Picking the right rule is handled by url-matcher, and it should be pretty well optimized; it has a domain index, so only the rules defined for that particular domain are iterated internally. In most cases, there would be very few of them.
So, based on this, I'm not concerned about impact on scrapy-poet.
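A toy illustration of this point (hypothetical names; the real matching is delegated to the url-matcher library, which `ToyMatcher` merely stands in for): the outer loop scales with the number of overridable classes, not with the number of rules.

```python
# Toy sketch, not real web-poet/scrapy-poet code: why overrides_for() loops
# over overridable classes, not over all rules.
class ToyMatcher:
    """Maps rule ids to domain substrings; stand-in for url_matcher.URLMatcher."""
    def __init__(self):
        self.patterns = {}  # rule_id -> domain substring

    def add(self, rule_id, domain):
        self.patterns[rule_id] = domain

    def match(self, url):
        # Naive substring matching; the real library uses a domain index.
        for rule_id, domain in self.patterns.items():
            if domain in url:
                return rule_id
        return None


class GenericProductPage: ...
class GenericProductListPage: ...


# 1000 websites, but only 2 overridable classes -> `matchers` has 2 entries.
matchers = {GenericProductPage: ToyMatcher(), GenericProductListPage: ToyMatcher()}
for i in range(1000):
    matchers[GenericProductPage].add(i, f"site{i}.com")


def overrides_for(url):
    overrides = {}
    for instead_of, matcher in matchers.items():  # 2 iterations, not 1000
        rule_id = matcher.match(url)
        if rule_id is not None:
            overrides[instead_of] = rule_id
    return overrides


assert overrides_for("http://site42.com/p") == {GenericProductPage: 42}
```

Each matcher then narrows the candidate rules internally, so the per-request cost stays small even with thousands of registered rules.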
But I do think we need to mention this algorithmic change in the web-poet changelog, just in case, as it's another backwards-incompatible change in RulesRegistry: O(1) lookups using registry.search(use=MyPage) are no longer supported; they're O(N) now.
Does this mean that we'll need to consider bundling this change into 0.6.0?
Yes, because of the above - to avoid making another backwards incompatible change in RulesRegistry in 0.7.
By the way, it looks like the code for URL matching which is currently in scrapy-poet could be a better fit in web-poet. Finding a rule in registry based on URL and instead_of / to_return argument is not specific to Scrapy.
Updated the changelog in 9558701
self.matcher (not a great name, by the way) is a dictionary of {generic_cls: URLMatcher}. So, the loop happens not against all rules, but against all generic (overridable) classes.
Right, I missed the fact that it's pooling all of the URL patterns that a given PO supports.
By the way, it looks like the code for URL matching which is currently in scrapy-poet could be a better fit in web-poet. Finding a rule in registry based on URL and instead_of / to_return argument is not specific to Scrapy.
I agree. I had to subclass web_poet.RulesRegistry in scrapy-poet's registry in scrapinghub/scrapy-poet#88 to avoid re-inventing some of its features, like search. We can move the matching functionality to web-poet later on.
Co-authored-by: Mikhail Korobov <[email protected]>
Codecov Report
@@ Coverage Diff @@
##            master     #101      +/-   ##
===========================================
- Coverage    99.82%   99.82%   -0.01%
===========================================
  Files           18       18
  Lines          586      579       -7
===========================================
- Hits           585      578       -7
  Misses           1        1
Force-pushed from 6a82ade to 9558701.
Thanks @BurnzZ!
Resolves #86.
Related to the prior work in #96.
This unblocks the WIP in scrapinghub/scrapy-poet#88 when considering use cases that need multiple rules in a single PO.
It's not necessary to include this in the 0.6.0 release.