
GetInnerText() performance #55

Open
@GeneThomas

Description

Bug Report

I am writing what I would think is a fairly simple use of AngleSharp[.Css]: I am extracting an HTML table of COVID-19 cases etc. by country. The headers (or other cells) can contain an HTML <br>. INode.Text() [an extension] and INode.TextContent remove the <br>, returning values like “TotalCases”. My implementation parses the 3000ish cells in 4.6 seconds. Using AngleSharp.Css’s ElementExtensions method string GetInnerText(this IElement element) takes over 8 minutes, making it unusable.

I assume you must implement CSS’s display:none and visibility:hidden. I do not require that functionality, just as I do not require an implementation of JavaScript. If GetInnerText() cannot be sped up, a reasonable solution would be to use something like my code together with your implementation of HTML entities such as © etc.

The attached project’s interesting code is in AngleSharpCssSpeedFault.cs.
AngleSharpCssSpeedFault.zip

The last method InnerText(IElement) has a #if to switch between the two implementations of InnerText().

Prerequisites

Run the attached solution.

Description

see above

Steps to Reproduce

  1. Run the solution
  2. Change the #if in the last method InnerText()
  3. Run the solution again.

Possible Solution

Use my InnerText(), but add expansion of all HTML entities (the &-prefixed character references), as that is missing.
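
For illustration, a minimal sketch of a CSS-free traversal along these lines. It assumes AngleSharp 0.10 or later (for HtmlParser.ParseDocument). The names FastInnerText and InnerTextNoCss are hypothetical: this is neither the code from the attached project nor an AngleSharp API. It walks the DOM, emits a newline for each <br>, and ignores display:none and visibility:hidden entirely.

```csharp
using System;
using System.Text;
using AngleSharp.Dom;
using AngleSharp.Html.Parser;

// Hypothetical helper: collects text without consulting any CSS.
public static class FastInnerText
{
    public static string InnerTextNoCss(INode node)
    {
        var sb = new StringBuilder();
        Append(node, sb);
        return sb.ToString().Trim();
    }

    private static void Append(INode node, StringBuilder sb)
    {
        foreach (var child in node.ChildNodes)
        {
            if (child.NodeType == NodeType.Text)
            {
                // Text nodes hold already-decoded characters.
                sb.Append(child.TextContent);
            }
            else if (child is IElement element)
            {
                if (element.LocalName == "br")
                {
                    // Keep the line break that TextContent drops.
                    sb.Append('\n');
                }
                else
                {
                    Append(element, sb);
                }
            }
            // Comments and other node types are skipped.
        }
    }
}

public static class Program
{
    public static void Main()
    {
        var parser = new HtmlParser();
        var document = parser.ParseDocument(
            "<table><tr><th>Total<br>Cases &amp; Deaths</th></tr></table>");
        var header = document.QuerySelector("th");
        Console.WriteLine(FastInnerText.InnerTextNoCss(header));
    }
}
```

Because the HTML parser decodes character references while tokenizing, text nodes read through TextContent should already contain "&" rather than "&amp;". If so, a traversal over parsed nodes may need no separate entity-expansion step; one would only be required for code that works on raw markup such as InnerHtml.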
