Skip to content

update RulesRegistry to allow multiple rules in a single Page Object #101

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 3 commits into from
Nov 7, 2022

Conversation

BurnzZ
Copy link
Contributor

@BurnzZ BurnzZ commented Nov 2, 2022

Resolves #86.

Related to the prior work in #96.

This unblocks the WIP in scrapinghub/scrapy-poet#88 when considering use cases that needs multiple rules in a single PO.

It's not necessary to include this in the 0.6.0 release.

@BurnzZ BurnzZ requested review from kmike and Gallaecio November 2, 2022 07:00
@@ -248,14 +238,6 @@ def search(self, **kwargs) -> List[ApplyRule]:

"""

# Short-circuit operation if "use" is the only search param used, since
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've checked scrapy-poet code, and it seems that currently fast lookups in the registry are not needed, as matching is implemented in a different way. So, it seems that's fine to drop this optimization. We still should mention this change in the changelog though.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Because of the algorithmic complexity change, I think it makes sense to consider it a blocker for 0.6.0

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

(i.e. bundle all changes like this in 0.6.0)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've checked scrapy-poet code, and it seems that currently fast lookups in the registry are not needed,

I think one optimization we need is fast lookups for returning the rules for a given request (particularly its URL). Currently, it's looking through all the rules (ref):

    def overrides_for(self, request: Request) -> Mapping[Callable, Callable]:
        overrides: Dict[Callable, Callable] = {}
        for instead_of, matcher in self.matcher.items():
            rule_id = matcher.match(request.url)
            if rule_id is not None:
                overrides[instead_of] = self.rules[rule_id].use
        return 

Although I'm optimizing this in the scrapinghub/scrapy-poet#88 PR since it needs a refactor anyways for the new to_return functionality.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Because of the algorithmic complexity change, I think it makes sense to consider it a blocker for 0.6.0

Does this mean that we'll need to consider bundling this change into 0.6.0?

Copy link
Member

@kmike kmike Nov 2, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Currently, it's looking through all the rules

I think it's not that bad. self.matcher (not a great name, by the way) is dictionary of {generic_cls: URLMatcher}. So, the loop happens not against all rules, but against all generic (overridable) classes. For example, if you have GenericProductPage and GenericProductListPage, and there are page objects for 1000 websites, the loop would do only 2 iterations, not 1000 or 2000.

Picking a right rule is handled by url-matcher, and it should be optimized pretty well; it has a domain index, so only rules which are defined for this particular domain are iterated internally. In most cases, there would be very few of them.

So, based on this, I'm not concerned about impact on scrapy-poet.

But I do think we need to mention this algorithmic change in the web-poet changelog, just in case, as it's another backwards incompatible change in RulesRegistry: O(1) lookups using registry.search(use=MyPage) are no longer supported, they're O(N) now.

Does this mean that we'll need to consider bundling this change into 0.6.0?

Yes, because of the above - to avoid making another backwards incompatible change in RulesRegistry in 0.7.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

By the way, it looks like the code for URL matching which is currently in scrapy-poet could be a better fit in web-poet. Finding a rule in registry based on URL and instead_of / to_return argument is not specific to Scrapy.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Updated the changelog in 9558701

self.matcher (not a great name, by the way) is dictionary of {generic_cls: URLMatcher}. So, the loop happens not against all rules, but against all generic (overridable) classes.

Right, I missed out on the fact that it's pooling all of the URL patterns that a given PO supports.

By the way, it looks like the code for URL matching which is currently in scrapy-poet could be a better fit in web-poet. Finding a rule in registry based on URL and instead_of / to_return argument is not specific to Scrapy.

I agree. I had to subclass web_poet.RulesRegistry in scrapy-poet's registry in scrapinghub/scrapy-poet#88 to avoid re-inventing some of its features like search. We can move the matching functionality web-poet later on.

@codecov
Copy link

codecov bot commented Nov 2, 2022

Codecov Report

Merging #101 (afd2581) into master (57cc01a) will decrease coverage by 0.00%.
The diff coverage is 100.00%.

❗ Current head afd2581 differs from pull request most recent head 9558701. Consider uploading reports for the commit 9558701 to get more accurate results

Additional details and impacted files
@@            Coverage Diff             @@
##           master     #101      +/-   ##
==========================================
- Coverage   99.82%   99.82%   -0.01%     
==========================================
  Files          18       18              
  Lines         586      579       -7     
==========================================
- Hits          585      578       -7     
  Misses          1        1              
Impacted Files Coverage Δ
web_poet/rules.py 100.00% <100.00%> (ø)

@BurnzZ BurnzZ force-pushed the multiple-rules-po branch from 6a82ade to 9558701 Compare November 3, 2022 01:53
@kmike kmike merged commit 878029d into master Nov 7, 2022
@kmike
Copy link
Member

kmike commented Nov 7, 2022

Thanks @BurnzZ!

@BurnzZ BurnzZ deleted the multiple-rules-po branch November 21, 2022 02:54
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Consider removing the limitation of only having a single PO in the registry
3 participants