Support Portal

for ProcessRobot and WinAutomation


Extract Text From PDF

Hello,

I am trying to extract text from a PDF using the Extract Text From PDF action in WinAutomation. The PDF is readable, not scanned. It is composed of tables (address, header, etc.) and ordinary text in paragraphs (really nothing unusual). But the result is very jumbled text: the words are in a completely different order than in the original PDF :(

Do you have any idea how to do this another way that gives good results? Thank you. Jiri



Yes, same question.

ADMIN

Hello Jiří Mráz,


Please note that the Extract Text From PDF action only extracts the text line by line. It does not retain the structure of the data, for example in tables.


Paragraphs, however, should be extracted as is without any trouble. You can verify this with any 'lorem ipsum' text in a PDF; it should be extracted properly.


Alternatively, you could try using the OCR-based actions with Regex to extract the fields of your choice.


If possible, please share the PDF or some screenshots so that we can analyze the behavior better.
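
As an illustration of the Regex suggestion above, here is a minimal Python sketch of pulling named fields out of text that was extracted line by line. The sample text, the field labels ("Invoice No", "Date", "Total") and the function name `extract_fields` are all hypothetical; real patterns would have to be written against the actual PDF content.

```python
import re

# Hypothetical sample of line-by-line extracted text (not from the real PDF).
SAMPLE = """Invoice No: 2021-0042
Date: 15.03.2021
Total: 1 250,00 CZK
"""

# One pattern per field; the labels are assumptions made for this sketch.
FIELD_PATTERNS = {
    "invoice_no": re.compile(r"Invoice No:\s*(\S+)"),
    "date": re.compile(r"Date:\s*(\d{2}\.\d{2}\.\d{4})"),
    "total": re.compile(r"Total:\s*([\d ]+,\d{2})"),
}


def extract_fields(text):
    """Return a dict of field name -> captured value (None if not found)."""
    result = {}
    for name, pattern in FIELD_PATTERNS.items():
        match = pattern.search(text)
        result[name] = match.group(1) if match else None
    return result


if __name__ == "__main__":
    print(extract_fields(SAMPLE))
```

Because each pattern searches the whole text independently, this approach does not depend on the order in which the lines were extracted, only on each label and its value staying on the same line.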

Hello Shravan Ventra,

Thank you for your answer, and I am sorry for coming back late.

Here is an example, probably a bit difficult because it is in Czech.

A part of the PDF.

image

A part of extracted text.

image

I found a workaround: I open the PDF in Acrobat Reader, Select All, Copy, and save the result as a txt file. It gives better results; see the example below.

image

I do not want to use extraction with OCR. I need to read a lot of files, and they are not scanned, so they should be readable.

Another issue I faced is that the extracted text for most of the PDF files has altered Czech characters; see below.

image

Both methods do this: Extract Text From PDF and copy-paste.


If anybody has ideas on how to avoid these issues, please let me know. Thank you.
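
The screenshots are not available here, so the exact cause of the altered Czech characters is unknown. One way to narrow it down is to list the code points of the non-ASCII characters in the extracted text. The Python sketch below does that with the standard `unicodedata` module; the helper names `describe_chars` and `recompose` and the sample string are made up for this example. It only helps in one possible case, namely when the diacritics were extracted as separate combining marks, which NFC normalization can recompose. If the output instead shows replacement characters or entirely different letters, normalization will not fix it.

```python
import unicodedata


def describe_chars(text):
    """Return (char, code point, Unicode name) for every non-ASCII character."""
    return [
        (ch, f"U+{ord(ch):04X}", unicodedata.name(ch, "UNKNOWN"))
        for ch in text
        if ord(ch) > 127
    ]


def recompose(text):
    """Merge base letters and combining marks into precomposed characters (NFC)."""
    return unicodedata.normalize("NFC", text)


if __name__ == "__main__":
    # Hypothetical sample: 'r' + combining caron, 'i' + combining acute accent.
    decomposed = "Jir\u030ci\u0301"
    for entry in describe_chars(decomposed):
        print(entry)
    print(recompose(decomposed))
```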

Hi Jiří Mráz


Try using this tool: https://blog.alivate.com.au/poppler-windows/


With its pdftotext command, you can preserve the layout of your document (headers, footers, paging, etc.) from the original PDF file in the converted text file by using the “-layout” flag.


Example:


pdftotext -layout /home/lori/Documents/Sample.pdf /home/lori/Documents/Sample.txt
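
Since the question mentions a large number of files, the same command can be run in a loop. Below is a minimal Python sketch, assuming `pdftotext` is installed and on the PATH (otherwise pass the full path to the executable as `exe`). The function names `build_command` and `convert_folder` and the folder name in the usage line are hypothetical. The `-enc UTF-8` option is passed explicitly to request UTF-8 output; whether that helps with the Czech characters depends on the PDFs themselves.

```python
import subprocess
from pathlib import Path


def build_command(pdf_path, txt_path, exe="pdftotext"):
    """Build the pdftotext argument list for one file, keeping the page layout."""
    return [exe, "-layout", "-enc", "UTF-8", str(pdf_path), str(txt_path)]


def convert_folder(folder, exe="pdftotext"):
    """Convert every *.pdf in `folder` to a .txt file next to it."""
    outputs = []
    for pdf in sorted(Path(folder).glob("*.pdf")):
        txt = pdf.with_suffix(".txt")
        # check=True raises CalledProcessError if pdftotext reports a failure.
        subprocess.run(build_command(pdf, txt, exe), check=True)
        outputs.append(txt)
    return outputs


if __name__ == "__main__":
    # Hypothetical folder name, used only for illustration.
    for out in convert_folder("pdf_input"):
        print("written:", out)
```

In WinAutomation, the resulting .txt files could then be read with the usual file actions instead of Extract Text From PDF.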

