Description
As a Scrapy user, I often want to extract the text content of an element. The default options in parsel are the ::text pseudo-element and XPath text(). Both have the downside that they return the element's direct text nodes as separate items. When the element contains child elements, the text is split at each child and the text inside the children is left out, which is rarely what is wanted. E.g.:
<html>
<body>
<h2>This is the <em>new</em> trend!</h2>
<p class="post_info">Published by newbie<br>on Sept 17</p>
</body>
</html>
>>> response.css('h2::text').extract()
['This is the ', ' trend!']
>>> response.css('.post_info::text').extract()
['Published by newbie', 'on Sept 17']
With a basic understanding of XML and XPath, this behavior is expected. But working around it takes extra effort, and it often frustrates new users. There is a series of questions on Stack Overflow as well as on the Scrapy issue tracker:
- https://stackoverflow.com/questions/33088402/extracting-text-within-em-tag-in-scrapy
- https://stackoverflow.com/questions/23156780/how-can-i-get-all-the-plain-text-from-a-website-with-scrapy
- https://stackoverflow.com/questions/39511122/extract-nested-tags-with-other-text-data-as-string-in-scrapy
- Line break is splitting the result of text() scrapy#3488
lxml.html has the convenience method .text_content(), which collects all of the text content of an element. Something similar could be added to the Selector and SelectorList classes. I can imagine two ways to approach the required API:
- Either add dedicated .extract_text() / .get_text() methods. This seems clean and easy to use, but could lead to convoluted method names like .extract_first_text() (or .extract_text_first()?).
- Or add a parameter to .extract*() / .get(), similar to the proposal in Add format_as to extract() methods #101. This could be .extract(format_as='text'). This is less intrusive, but maybe harder to discover.
Would such an addition be welcome? I could prepare a patch.