This is pretty sick (and accurate): parse your PDFs (tables included) using tabula-py, the Python wrapper for Tabula!
https://pypi.org/project/tabula-py/
I would first split the PDF into its individual pages, then run tabula on just the pages you need (it can parse multiple tables per page too!).
To split your PDF into individual pages:
from PyPDF2 import PdfFileWriter, PdfFileReader  # camelCase API, needs PyPDF2 < 3.0

# This outputs each selected page as a separate PDF file, with the page number as a prefix.
dir_path = 'Desktop/abc/'
filename = 'abc.pdf'
start_page, end_page = 1, 4  # example values: 1-based pages 1 to 3 (end_page is exclusive)

with open(dir_path + filename, 'rb') as source:
    pdf_reader = PdfFileReader(source)
    for i in range(start_page, end_page):
        output = PdfFileWriter()
        output.addPage(pdf_reader.getPage(i - 1))  # getPage is 0-indexed
        output_filepath = dir_path + str(i) + '_' + filename
        with open(output_filepath, 'wb') as outputStream:
            output.write(outputStream)
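On PyPDF2 3.x the camelCase names used above (PdfFileReader, getPage, addPage) were removed, so here is a rough sketch of the same split with the renamed API. The names `page_output_path` and `split_pdf_pages` are made up for this example, and page numbers are treated as 1-based with an exclusive end, same as the loop above.

```python
import os


def page_output_path(dir_path, filename, page_number):
    """Build '<dir>/<page>_<filename>', matching the naming used above."""
    return os.path.join(dir_path, f"{page_number}_{filename}")


def split_pdf_pages(dir_path, filename, start_page, end_page):
    """Write pages start_page..end_page-1 (1-based) as separate PDF files.

    Hypothetical helper; assumes PyPDF2 >= 3.0 (PdfReader / PdfWriter).
    """
    # Imported lazily so the path helper works without PyPDF2 installed.
    from PyPDF2 import PdfReader, PdfWriter

    written = []
    with open(os.path.join(dir_path, filename), "rb") as source:
        reader = PdfReader(source)
        for page_number in range(start_page, end_page):
            writer = PdfWriter()
            writer.add_page(reader.pages[page_number - 1])  # pages are 0-indexed
            out_path = page_output_path(dir_path, filename, page_number)
            with open(out_path, "wb") as target:
                writer.write(target)
            written.append(out_path)
    return written
```

Calling `split_pdf_pages('Desktop/abc/', 'abc.pdf', 1, 4)` should then give you 1_abc.pdf, 2_abc.pdf and 3_abc.pdf next to the original.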
To extract text from a single PDF page:
from PyPDF2 import PdfFileReader  # camelCase API, needs PyPDF2 < 3.0

filename = 'abc.pdf'
with open(filename, 'rb') as f:
    pdf_reader = PdfFileReader(f)
    page_obj = pdf_reader.getPage(0)  # the first (here: only) page
    text = page_obj.extractText()
# strip newlines and tabs from the extracted text
text = text.replace('\n', '').replace('\t', '')
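One thing to watch with the replace call above: deleting newlines outright can glue together words that were only separated by a line break. A gentler option, just a sketch with a made-up helper name (`normalize_whitespace`), is to collapse every run of whitespace into a single space instead:

```python
import re


def normalize_whitespace(text):
    """Collapse every run of whitespace (newlines, tabs, spaces) to one space."""
    return re.sub(r"\s+", " ", text).strip()


raw = "Total\tamount\ndue:  42"
print(normalize_whitespace(raw))  # prints: Total amount due: 42
```

Whether you want this or the plain delete depends on how your PDF breaks its lines, so try both on a sample page.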
## If the above doesn't work, try pdfminer.six (for Python 3+).
## The example below is from https://www.blog.pythonlibrary.org/2018/05/03/exporting-data-from-pdfs-with-python/
import io
from pdfminer.converter import TextConverter
from pdfminer.pdfinterp import PDFPageInterpreter
from pdfminer.pdfinterp import PDFResourceManager
from pdfminer.pdfpage import PDFPage

def extract_text_from_pdf(pdf_path):
    resource_manager = PDFResourceManager()
    fake_file_handle = io.StringIO()
    converter = TextConverter(resource_manager, fake_file_handle)
    page_interpreter = PDFPageInterpreter(resource_manager, converter)

    with open(pdf_path, 'rb') as fh:
        for page in PDFPage.get_pages(fh,
                                      caching=True,
                                      check_extractable=True):
            page_interpreter.process_page(page)

        text = fake_file_handle.getvalue()

    # close open handles
    converter.close()
    fake_file_handle.close()

    if text:
        return text

if __name__ == '__main__':
    print(extract_text_from_pdf('abc.pdf'))
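The "if the above doesn't work, try pdfminer.six" advice can be wired up as a small fallback chain. This is only a sketch: `first_nonempty_text`, `extract_with_pypdf2` and `extract_with_pdfminer` are names invented here, the PyPDF2 wrapper assumes the old (< 3.0) API used earlier, and the pdfminer wrapper assumes the `pdfminer.high_level.extract_text` shortcut that ships with pdfminer.six.

```python
def first_nonempty_text(pdf_path, extractors):
    """Return the first non-blank result from a list of extractor callables.

    Each extractor takes a path and returns a string (or None). Extractors
    that raise or return blank text are skipped. Hypothetical helper.
    """
    for extract in extractors:
        try:
            text = extract(pdf_path)
        except Exception:  # one backend choking should not stop the others
            continue
        if text and text.strip():
            return text
    return None


def extract_with_pypdf2(pdf_path):
    """All pages via the old PyPDF2 (< 3.0) API, as in the snippet above."""
    from PyPDF2 import PdfFileReader
    with open(pdf_path, 'rb') as fh:
        reader = PdfFileReader(fh)
        return ''.join(reader.getPage(i).extractText()
                       for i in range(reader.getNumPages()))


def extract_with_pdfminer(pdf_path):
    """All pages via pdfminer.six's high-level shortcut."""
    from pdfminer.high_level import extract_text
    return extract_text(pdf_path)
```

Usage would be something like `first_nonempty_text('abc.pdf', [extract_with_pypdf2, extract_with_pdfminer])`, which returns None if every backend comes back empty.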
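To extract a single table from a single PDF page:
import tabula
pdf_filepath = '1_abc.pdf'
# note: recent tabula-py releases (2.0+) return a list of DataFrames by default, so the table is df[0]
df = tabula.read_pdf(pdf_filepath)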
To extract multiple tables from a single PDF page:
import tabula
pdf_filepath = '1_abc.pdf'
# returns a list of DataFrames, one per detected table
dfs = tabula.read_pdf(pdf_filepath, multiple_tables=True)
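Once you have the list of tables, you will probably want them on disk. Here is a rough sketch that writes each one to its own CSV; `save_tables_as_csv` and `extract_tables_to_csv` are hypothetical names, the output naming scheme is just one choice, and the tabula call assumes tabula-py is installed along with the Java runtime it needs.

```python
import os


def save_tables_as_csv(tables, out_dir, stem):
    """Write each table to '<out_dir>/<stem>_table<N>.csv' and return the paths.

    Hypothetical helper; only relies on each item having a pandas-style to_csv().
    """
    os.makedirs(out_dir, exist_ok=True)
    paths = []
    for number, table in enumerate(tables, start=1):
        path = os.path.join(out_dir, f"{stem}_table{number}.csv")
        table.to_csv(path, index=False)  # index=False drops the row-number column
        paths.append(path)
    return paths


def extract_tables_to_csv(pdf_filepath, out_dir):
    """Read every table tabula finds in the PDF and save each as a CSV."""
    import tabula  # tabula-py; imported lazily because it needs Java
    tables = tabula.read_pdf(pdf_filepath, pages='all', multiple_tables=True)
    stem = os.path.splitext(os.path.basename(pdf_filepath))[0]
    return save_tables_as_csv(tables, out_dir, stem)
```

For example, `extract_tables_to_csv('1_abc.pdf', 'tables')` would write tables/1_abc_table1.csv, tables/1_abc_table2.csv and so on, one file per detected table.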
Too amazing: tabula parses tables almost perfectly!
Read here for other sick features: https://blog.chezo.uno/tabula-py-now-able-to-extract-remote-pdf-and-multiple-tables-at-once-6108e24ac07c
BONUS
Camelot sounds really promising - might give it a try!
https://hackernoon.com/announcing-camelot-a-python-library-to-extract-tabular-data-from-pdfs-605f8e63c2d5
Some other links for reference, if you really want…
https://www.blog.pythonlibrary.org/2018/05/03/exporting-data-from-pdfs-with-python/
https://automatetheboringstuff.com/chapter13/