Add option to retrieve text content

As a Scrapy user, I often want to extract the text content of an element. The default options in parsel are the `::text` pseudo-element and XPath `text()`. Both have the downside that they return each of the element's direct text nodes as a separate item, and text inside child elements is left out. When the element contains child elements, this produces unwanted results. For example:

```html
<html>
<body>
<h2>This is the <em>new</em> trend!</h2>
<p class="post_info">Published by newbie<br>on Sept 17</p>
</body>
</html>
```

```python
>>> response.css('h2::text').extract()
['This is the ', ' trend!']
>>> response.css('.post_info::text').extract()
['Published by newbie', 'on Sept 17']
```

With a basic understanding of XML and XPath, this behavior is expected. But working around it takes extra effort, and it often frustrates new users. There is a series of related questions on Stack Overflow as well as on the Scrapy issue tracker:

* <https://stackoverflow.com/questions/33088402/extracting-text-within-em-tag-in-scrapy>
* <https://stackoverflow.com/questions/23156780/how-can-i-get-all-the-plain-text-from-a-website-with-scrapy>
* <https://stackoverflow.com/questions/39511122/extract-nested-tags-with-other-text-data-as-string-in-scrapy>
* <https://github.com/scrapy/scrapy/issues/3488>

`lxml.html` has the convenience method `.text_content()`, which collects all of the text content of an element. Something similar could be added to the `Selector` and `SelectorList` classes. I can imagine two ways to approach the required API:

* Either add dedicated `.extract_text()`/`.get_text()` methods. This seems clean and easy to use, but could lead to convoluted method names like `.extract_first_text()` (or `.extract_text_first()`?).
* Or add a parameter to `.extract*()`/`.get()`, similar to the proposal in #101. This could be `.extract(format_as='text')`. This is less intrusive, but maybe harder to discover.

Would such an addition be welcome? I could prepare a patch.

Add option to retrieve text content #128

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

Add option to retrieve text content #128

Description

Metadata

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

Issue actions