feat: render math equations in .docx documents #1160

Merged
merged 8 commits into from
Mar 28, 2025

Conversation

sathinduga
Contributor

This PR addresses issue #289: Add support for mathematical formulas in DOCX conversion.

Before this change, converting a .docx file that contains math equations rendered the equations as blank spaces. This is a serious problem when working with engineering, mathematics, and other scientific documents.

[Screenshot, 2025-03-27: conversion output before this change]

With this update, the OMML math equations in a .docx document are converted to LaTeX and wrapped in $ (inline) or $$ (block) delimiters, so that both kinds of equation are represented in Markdown format.

[Screenshot, 2025-03-27: conversion output after this change]
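As a rough illustration of the inline/block distinction (not the code in this PR), the sketch below assumes that an `m:oMath` element nested in `m:oMathPara` is a block equation and any other `m:oMath` is inline. The math namespace URI is the real OOXML one; `wrap_latex`, `plain_text`, and `equations_to_markdown` are hypothetical names, and `plain_text` only joins text runs as a stand-in for the actual OMML-to-LaTeX conversion.

```python
import xml.etree.ElementTree as ET

# Real OOXML math namespace; every function name below is illustrative only.
M_NS = "http://schemas.openxmlformats.org/officeDocument/2006/math"
OMATH = f"{{{M_NS}}}oMath"
OMATH_PARA = f"{{{M_NS}}}oMathPara"


def wrap_latex(latex: str, block: bool) -> str:
    """Wrap a LaTeX string in $...$ (inline) or $$...$$ (block)."""
    delimiter = "$$" if block else "$"
    return f"{delimiter}{latex}{delimiter}"


def plain_text(omath: ET.Element) -> str:
    """Stand-in for the real OMML-to-LaTeX step: joins <m:t> text runs only."""
    return "".join(t.text or "" for t in omath.iter(f"{{{M_NS}}}t"))


def equations_to_markdown(xml: str) -> list[str]:
    """Return one Markdown math string per <m:oMath>, in document order."""
    root = ET.fromstring(xml)
    # ElementTree has no parent pointers, so build a child -> parent map.
    parents = {child: parent for parent in root.iter() for child in parent}
    results = []
    for omath in root.iter(OMATH):
        parent = parents.get(omath)
        # Assumption: an equation inside <m:oMathPara> is a block equation.
        is_block = parent is not None and parent.tag == OMATH_PARA
        results.append(wrap_latex(plain_text(omath), is_block))
    return results


SAMPLE = (
    f'<doc xmlns:m="{M_NS}">'
    "<m:oMath><m:r><m:t>x</m:t></m:r></m:oMath>"
    "<m:oMathPara><m:oMath><m:r><m:t>y=2</m:t></m:r></m:oMath></m:oMathPara>"
    "</doc>"
)

print(equations_to_markdown(SAMPLE))
```

A real converter would replace `plain_text` with a full OMML-to-LaTeX translation that handles fractions, superscripts, radicals, and so on.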

This is done by pre-processing the .docx document before passing it to mammoth to produce the HTML. The pre-processing is structured so that other steps can be added in the future if needed.

A test case that validates equations is also included.
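One way such an extensible pre-processing stage could look is sketched below, using only the standard library. This is an assumption-laden illustration rather than the PR's implementation: `pre_process_docx`, the `Step` type, and the choice to rewrite only `word/document.xml` are all hypothetical.

```python
import io
import zipfile
from typing import BinaryIO, Callable, Iterable

# Hypothetical step signature: takes the XML bytes of one part, returns new XML bytes.
Step = Callable[[bytes], bytes]

# Assumption for this sketch: only the main document part is rewritten.
PARTS_TO_PROCESS = {"word/document.xml"}


def pre_process_docx(source: BinaryIO, steps: Iterable[Step]) -> io.BytesIO:
    """Copy a .docx (zip) archive into memory, running each step over selected parts."""
    steps = list(steps)
    output = io.BytesIO()
    with zipfile.ZipFile(source) as zin, zipfile.ZipFile(
        output, "w", zipfile.ZIP_DEFLATED
    ) as zout:
        for item in zin.infolist():
            data = zin.read(item.filename)
            if item.filename in PARTS_TO_PROCESS:
                # Steps run in order, so new ones can be appended later.
                for step in steps:
                    data = step(data)
            zout.writestr(item.filename, data)
    output.seek(0)  # Rewind so the caller can read the archive from the start.
    return output
```

The returned stream could then be handed to mammoth, for example with `mammoth.convert_to_html(stream)`. An OMML-to-LaTeX pass would be one entry in `steps`, and later steps could be added without touching the archive handling.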

Special acknowledgment to xiilei/dwml for the initial work on OMML rendering to LaTeX.

@sathinduga
Contributor Author

@microsoft-github-policy-service agree

@sathinduga
Contributor Author

@afourney FYI: Since issue #289 has been unresolved for a long time, I went ahead and worked on this.

There is another PR created for this purpose, but it has been inactive for a long period. Also, the markitdown library went through a major restructuring between the version that PR targets and 0.1.0, so it would need big changes to resolve merge conflicts.

I took a different path to convert math equations (an all-Python approach), which is used by a couple of major Python document transformation libraries.

@afourney
Member

Thanks for this! That is correct. There was a big refactor from 0.0.x to 0.1.x. I will kick off the CI tests tonight, and try to test/review tomorrow. This is an important feature.

Let's also invite/credit the original PR author to review. It's not their fault I haven't merged it yet... it's just a matter of timing.

@sathinduga
Contributor Author

Sure, happy to have more eyes on this to make sure everything looks good. FYI @marromlam

But just to be clear, this approach uses Python for rendering, whereas the other PR for this issue used XSL files, which is a different approach.

@afourney
Member

afourney commented Mar 28, 2025

First of all, this is super cool. And I'm inclined to merge it ASAP.

But the adapted code is under the Apache 2 license:

https://github.com/xiilei/dwml/blob/master/LICENSE

I need to figure out what I need to do to distribute a modification here (until now, we've not hosted any 3rd party-derived content directly). Probably I need to add another acknowledgments file or something. Let me look into it.

@afourney
Member

@sathinduga Can you please fill in the TODO in this file: https://gist.github.com/afourney/4ae6af3d5b3aaf329705d04c6cf182b4 and add it to the root of the MarkItDown repo? I tried, but do not have permissions to push to your fork.

Once that's in place, I think we're good to go.

@sathinduga
Contributor Author

@afourney Added it as a .md file. Let me know if you would prefer a .txt file.

@afourney
Member

Looks good to me. I want to do a little more testing before merge -- as soon as I get a chance -- but I think it's basically all good otherwise.

@sathinduga
Contributor Author

Reformatted the following files with black, because the pre-commit check failed earlier.

reformatted packages/markitdown/src/markitdown/converter_utils/docx/math/latex_dict.py
reformatted packages/markitdown/src/markitdown/converters/_docx_converter.py
reformatted packages/markitdown/src/markitdown/converter_utils/docx/math/omml.py
reformatted packages/markitdown/tests/test_module_misc.py

Added a note about the reformatting modifications to ThirdPartyNotices.

afourney merged commit 3fcd48c into microsoft:main on Mar 28, 2025
3 checks passed
@afourney
Member

Merged! Thanks for the contribution.

@marromlam

marromlam commented Apr 2, 2025

Nice approach @sathinduga! However, it would take a lot of effort to get it to convert as many formulae as mine does. For example, with this docx:

  • This PR : Test eq $A→πr^{2}$
  • Mine : Test eq $A\stackrel{yields}{\to }\pi {r}^{2}$

@marromlam

equation.docx
