Is there any way to extract plain text from PDFs without including tables (i.e., the text inside tables) using Python?
Below is a sample page
And below is the text after processing
I have tried converting the PDF to plain text and extracting the tables from the PDF using
tabula, but I'm unable to remove those tables from the txt file using
regex because the formatting differs between the two outputs.
Comparison between converted plain text and extracted table
Below is an idea for using regular expressions to distinguish tables from text, assuming the table structure is also extracted from the PDF:
Regex example:
Table in pdf:
| A | B |
| C | D |
Table as txt:
A B
C D
Regex from extracted table (created dynamically): r'A\\ B\s*C\\ D\s*'
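To make the idea concrete, here is a minimal sketch of building such a pattern dynamically and stripping the match from the converted text. It assumes each table is already available as a list of rows of cell strings (for example, taken from whatever tabula returns), and that the cells appear in the plain text in the same reading order. The names `table_to_pattern` and `remove_tables` are hypothetical helpers, not part of any library.

```python
import re

def table_to_pattern(rows):
    """Build a regex that matches the table's cells in reading order.

    Whitespace between and inside cells is matched loosely, since the
    text converter may use spaces, tabs, or newlines where the table
    had cell borders.
    """
    cells = [str(cell).strip() for row in rows for cell in row]
    tokens = [tok for cell in cells for tok in cell.split()]
    if not tokens:
        return None  # nothing to match for an empty table
    body = r"\s+".join(re.escape(tok) for tok in tokens)
    # Lookarounds keep the match from starting or ending mid-word.
    return re.compile(r"(?<!\S)" + body + r"(?!\S)")

def remove_tables(text, tables):
    """Remove the first occurrence of each table's text from `text`."""
    for rows in tables:
        pattern = table_to_pattern(rows)
        if pattern is not None:
            text = pattern.sub("", text, count=1)
    return text

# Toy data mirroring the example above (assumed, not real PDF output).
text = "Intro paragraph.\nA B\nC D\nClosing paragraph."
tables = [[["A", "B"], ["C", "D"]]]
cleaned = remove_tables(text, tables)
print(cleaned)
```

This only works when the converter emits the cells contiguously and in the same order as the extracted table. If cell text is wrapped, reordered, or interleaved with surrounding prose, the pattern will not match and the table text stays in the output.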