Stack overflow exception in trimAdjacentBlankLines() method due to inefficient regex in ExtractedTextFormatter

**Bug description**
When using **TikaDocumentReader**, I often encountered the following error:  
```java
Caused by: java.lang.StackOverflowError: null
    at java.base/java.util.regex.Pattern$Caret.match(Pattern.java:3896)
    at java.base/java.util.regex.Pattern$Curly.match1(Pattern.java:4597)
    at java.base/java.util.regex.Pattern$Curly.match(Pattern.java:4546)
    at java.base/java.util.regex.Pattern$Dollar.match(Pattern.java:3996)
    at java.base/java.util.regex.Pattern$Caret.match(Pattern.java:3906)
    at java.base/java.util.regex.Pattern$GroupHead.match(Pattern.java:4969)
```
After debugging, I found that the cause is the unoptimized regular expression in the trimAdjacentBlankLines() method of the ExtractedTextFormatter class. In my case, the error occasionally occurred even for files with a small number of empty lines (~150), running with the default JVM thread stack size and -Xmx8192m (note that -Xmx sets the heap size, not the stack size).

**Steps to reproduce**
To reproduce the error, I created an **XLSX file** with a large number of empty rows. I am attaching it to this issue.
[Stack_overflow_exception_test.xlsx](https://github.com/user-attachments/files/18801719/Stack_overflow_exception_test.xlsx)
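
**Possible direction for a fix (sketch)**
One way to avoid the recursion is to collapse adjacent blank lines without a backtracking regex. The class below is only a minimal sketch of that idea, not the actual Spring AI code: the name `BlankLineCollapser` and the method `collapse` are hypothetical, and I am assuming that "blank" means empty or whitespace-only and that every run of blank lines should become a single blank line. It requires Java 11+ for `String.lines()` and `String.isBlank()`.

```java
import java.util.Iterator;

// Hypothetical helper, not part of Spring AI: collapses runs of blank lines
// with a single linear pass, so stack depth does not depend on input size.
public final class BlankLineCollapser {

    private BlankLineCollapser() {
    }

    public static String collapse(String text) {
        StringBuilder out = new StringBuilder(text.length());
        boolean previousBlank = false;
        // String.lines() splits on \n, \r and \r\n without using java.util.regex
        Iterator<String> lines = text.lines().iterator();
        while (lines.hasNext()) {
            String line = lines.next();
            boolean blank = line.isBlank();
            if (blank && previousBlank) {
                continue; // drop every blank line after the first one in a run
            }
            if (!blank) {
                out.append(line);
            }
            // Assumption: line endings are normalized to \n and the result
            // always ends with a newline
            out.append('\n');
            previousBlank = blank;
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // 100,000 consecutive newlines between two lines of text
        String input = "first" + "\n".repeat(100_000) + "second";
        System.out.print(collapse(input));
    }
}
```

Because the loop keeps only one boolean of state, the number of consecutive empty rows in the extracted text should not affect stack usage. Whether normalizing line endings and whitespace-only lines this way matches the formatter's intended output would need to be checked against its existing tests.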

