Is there any way to extract plain text from PDFs without including tables (i.e. the text in tables)?

Is there any way, using Python, to extract plain text from PDFs without including tables (i.e. the text inside tables)?

Below is a sample page:

[Image: sample page from the PDF]

And below is the required text after processing:

[Image: required text after processing]

I have tried converting the PDF to plain text and extracting the tables from the PDF with tabula, but I'm unable to remove those tables from the text file using regex due to formatting issues.

[Image: comparison between converted plain text and extracted table]


Below is my idea for using regular expressions to distinguish tables from text, assuming the table structure has also been extracted from the PDF:

Regex example:
Table in pdf:
| A | B |
| C | D |

Table as txt:
A B
C D

Regex built from the extracted table (created dynamically):
r'A\\ B\s*C\\ D\s*'
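
To make the idea concrete, here is a minimal sketch of how such a pattern could be built dynamically. It assumes the extracted table is already available as a list of rows of cell strings (for example, converted from tabula's output) and that the plain text differs from the table only in whitespace. `build_table_pattern` and `remove_table` are hypothetical helper names, not part of any library.

```python
import re


def build_table_pattern(rows):
    """Build a regex that matches the table rows on consecutive lines.

    Assumption: `rows` is a list of rows, each a list of cell values,
    and the text dump differs from the table only in whitespace.
    """
    row_patterns = []
    for row in rows:
        # Escape each non-empty cell so characters like '.' or '(' match literally.
        cells = [re.escape(str(cell).strip()) for cell in row if str(cell).strip()]
        if not cells:
            continue  # skip rows that contain no text at all
        # A row must fill a whole line; cells are separated by spaces or tabs.
        row_patterns.append(r"^[ \t]*" + r"[ \t]+".join(cells) + r"[ \t]*$")
    # Rows are separated by line breaks (blank lines in between are tolerated);
    # the final optional newline removes the line break after the last row.
    return re.compile(r"\s+".join(row_patterns) + r"\n?", re.MULTILINE)


def remove_table(text, rows):
    """Remove the first occurrence of the table from the text, if it is found."""
    return build_table_pattern(rows).sub("", text, count=1)


table = [["A", "B"], ["C", "D"]]
page_text = "Intro paragraph.\nA B\nC D\nClosing paragraph.\n"
print(remove_table(page_text, table))
```

Anchoring each row to a full line keeps a short cell such as `A` from matching inside ordinary words. The sketch would still fail when the text converter wraps a cell across lines or reorders columns, which may be the kind of formatting issue described above.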


Read more here: https://stackoverflow.com/questions/64416449/is-there-any-way-to-extract-plain-text-from-pdfs-without-including-tables-i-e

Content Attribution

This content was originally published by user14437741 at Recent Questions - Stack Overflow, and is syndicated here via their RSS feed. You can read the original post over there.
