update RulesRegistry to allow multiple rules in a single Page Object #101

BurnzZ · 2022-11-02T07:00:04Z

Resolves #86.

Related to the prior work in #96.

This unblocks the WIP in scrapinghub/scrapy-poet#88 when considering use cases that needs multiple rules in a single PO.

It's not necessary to include this in the 0.6.0 release.

kmike · 2022-11-02T13:35:56Z

web_poet/rules.py

@@ -248,14 +238,6 @@ def search(self, **kwargs) -> List[ApplyRule]:

        """

-        # Short-circuit operation if "use" is the only search param used, since


I've checked scrapy-poet code, and it seems that currently fast lookups in the registry are not needed, as matching is implemented in a different way. So, it seems that's fine to drop this optimization. We still should mention this change in the changelog though.

Because of the algorithmic complexity change, I think it makes sense to consider it a blocker for 0.6.0

(i.e. bundle all changes like this in 0.6.0)

I've checked scrapy-poet code, and it seems that currently fast lookups in the registry are not needed,

I think one optimization we need is fast lookups for returning the rules for a given request (particularly its URL). Currently, it's looking through all the rules (ref):

def overrides_for(self, request: Request) -> Mapping[Callable, Callable]: overrides: Dict[Callable, Callable] = {} for instead_of, matcher in self.matcher.items(): rule_id = matcher.match(request.url) if rule_id is not None: overrides[instead_of] = self.rules[rule_id].use return

Although I'm optimizing this in the scrapinghub/scrapy-poet#88 PR since it needs a refactor anyways for the new to_return functionality.

Because of the algorithmic complexity change, I think it makes sense to consider it a blocker for 0.6.0

Does this mean that we'll need to consider bundling this change into 0.6.0?

Currently, it's looking through all the rules

I think it's not that bad. self.matcher (not a great name, by the way) is dictionary of {generic_cls: URLMatcher}. So, the loop happens not against all rules, but against all generic (overridable) classes. For example, if you have GenericProductPage and GenericProductListPage, and there are page objects for 1000 websites, the loop would do only 2 iterations, not 1000 or 2000.

Picking a right rule is handled by url-matcher, and it should be optimized pretty well; it has a domain index, so only rules which are defined for this particular domain are iterated internally. In most cases, there would be very few of them.

So, based on this, I'm not concerned about impact on scrapy-poet.

But I do think we need to mention this algorithmic change in the web-poet changelog, just in case, as it's another backwards incompatible change in RulesRegistry: O(1) lookups using registry.search(use=MyPage) are no longer supported, they're O(N) now.

Does this mean that we'll need to consider bundling this change into 0.6.0?

Yes, because of the above - to avoid making another backwards incompatible change in RulesRegistry in 0.7.

By the way, it looks like the code for URL matching which is currently in scrapy-poet could be a better fit in web-poet. Finding a rule in registry based on URL and instead_of / to_return argument is not specific to Scrapy.

Updated the changelog in 9558701

self.matcher (not a great name, by the way) is dictionary of {generic_cls: URLMatcher}. So, the loop happens not against all rules, but against all generic (overridable) classes.

Right, I missed out on the fact that it's pooling all of the URL patterns that a given PO supports.

By the way, it looks like the code for URL matching which is currently in scrapy-poet could be a better fit in web-poet. Finding a rule in registry based on URL and instead_of / to_return argument is not specific to Scrapy.

I agree. I had to subclass web_poet.RulesRegistry in scrapy-poet's registry in scrapinghub/scrapy-poet#88 to avoid re-inventing some of its features like search. We can move the matching functionality web-poet later on.

web_poet/rules.py

Co-authored-by: Mikhail Korobov <[email protected]>

codecov · 2022-11-02T15:26:37Z

Codecov Report

Merging #101 (afd2581) into master (57cc01a) will decrease coverage by 0.00%.
The diff coverage is 100.00%.

❗ Current head afd2581 differs from pull request most recent head 9558701. Consider uploading reports for the commit 9558701 to get more accurate results

Additional details and impacted files

@@            Coverage Diff             @@
##           master     #101      +/-   ##
==========================================
- Coverage   99.82%   99.82%   -0.01%     
==========================================
  Files          18       18              
  Lines         586      579       -7     
==========================================
- Hits          585      578       -7     
  Misses          1        1

Impacted Files	Coverage Δ
web_poet/rules.py	`100.00% <100.00%> (ø)`

…ity change

kmike · 2022-11-07T18:49:49Z

Thanks @BurnzZ!

update RulesRegistry to allow multiple rules in a single Page Object

4040370

BurnzZ requested review from kmike and Gallaecio November 2, 2022 07:00

Gallaecio approved these changes Nov 2, 2022

View reviewed changes

BurnzZ mentioned this pull request Nov 2, 2022

Supporting to_return in web-poet rules scrapinghub/scrapy-poet#88

Merged

7 tasks

kmike reviewed Nov 2, 2022

View reviewed changes

web_poet/rules.py Outdated Show resolved Hide resolved

Update RulesRegistry.get_rules() to return copy of list of rules

8b216fe

Co-authored-by: Mikhail Korobov <[email protected]>

kmike approved these changes Nov 2, 2022

View reviewed changes

update CHANGELOG to mention change in '.search()' method time complex…

9558701

…ity change

BurnzZ force-pushed the multiple-rules-po branch from 6a82ade to 9558701 Compare November 3, 2022 01:53

kmike merged commit 878029d into master Nov 7, 2022

BurnzZ deleted the multiple-rules-po branch November 21, 2022 02:54

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

update RulesRegistry to allow multiple rules in a single Page Object #101

update RulesRegistry to allow multiple rules in a single Page Object #101

Uh oh!

BurnzZ commented Nov 2, 2022 •

edited by Gallaecio

Loading

Uh oh!

kmike Nov 2, 2022

Uh oh!

kmike Nov 2, 2022

Uh oh!

kmike Nov 2, 2022

Uh oh!

BurnzZ Nov 2, 2022

Uh oh!

BurnzZ Nov 2, 2022

Uh oh!

kmike Nov 2, 2022 •

edited

Loading

Uh oh!

kmike Nov 2, 2022

Uh oh!

BurnzZ Nov 3, 2022

Uh oh!

Uh oh!

codecov bot commented Nov 2, 2022 •

edited

Loading

Uh oh!

kmike commented Nov 7, 2022

Uh oh!

Uh oh!

		@@ -248,14 +238,6 @@ def search(self, **kwargs) -> List[ApplyRule]:

		"""

		# Short-circuit operation if "use" is the only search param used, since

update RulesRegistry to allow multiple rules in a single Page Object #101

update RulesRegistry to allow multiple rules in a single Page Object #101

Uh oh!

Conversation

BurnzZ commented Nov 2, 2022 • edited by Gallaecio Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

kmike Nov 2, 2022

Choose a reason for hiding this comment

Uh oh!

kmike Nov 2, 2022

Choose a reason for hiding this comment

Uh oh!

kmike Nov 2, 2022

Choose a reason for hiding this comment

Uh oh!

BurnzZ Nov 2, 2022

Choose a reason for hiding this comment

Uh oh!

BurnzZ Nov 2, 2022

Choose a reason for hiding this comment

Uh oh!

kmike Nov 2, 2022 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

kmike Nov 2, 2022

Choose a reason for hiding this comment

Uh oh!

BurnzZ Nov 3, 2022

Choose a reason for hiding this comment

Uh oh!

Uh oh!

codecov bot commented Nov 2, 2022 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Codecov Report

Uh oh!

kmike commented Nov 7, 2022

Uh oh!

Uh oh!

BurnzZ commented Nov 2, 2022 •

edited by Gallaecio

Loading

kmike Nov 2, 2022 •

edited

Loading

codecov bot commented Nov 2, 2022 •

edited

Loading