Support Portal

for ProcessRobot and WinAutomation

Answered

data extraction

How can I extract data from a scanned PDF? If possible, please give me a live demo.

pdf

Best Answer

Hello Korra,


In order to extract data from scanned documents, you need to use OCR technology.

WinAutomation uses third-party OCR software built into its actions. Tesseract and MODI are free; CaptureFast and Ancora require licensed software.

This technology will not provide 100% accurate results, and you should not rely on it for critical data extraction. The accuracy depends on:

- The quality of the printed document

- The quality of the characters on the document

- The configuration used

- The technology of the OCR Vendor


See below an example of reading your PDF using the Tesseract OCR engine with 3x3 multipliers to extract data from the 5th page of the PDF.

Actions used:

1) Create Tesseract OCR Engine

2) Extract Text from PDF using OCR
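As a rough illustration of what "3x3 multipliers" means, the sketch below enlarges a tiny pixel grid three times in width and three times in height before it would be handed to OCR. This is an assumption-based sketch: the thread does not say how WinAutomation resamples the image internally, so nearest-neighbour scaling is used here only because it is the simplest option, and the function name `upscale` is hypothetical.

```python
def upscale(pixels, width_multiplier=3, height_multiplier=3):
    """Nearest-neighbour upscale of a 2D grid of pixel values.

    Illustration only: the resampling WinAutomation actually applies
    is not documented in this thread.
    """
    scaled = []
    for row in pixels:
        # Repeat every pixel horizontally...
        wide_row = [value for value in row for _ in range(width_multiplier)]
        # ...then repeat the widened row vertically.
        for _ in range(height_multiplier):
            scaled.append(list(wide_row))
    return scaled


if __name__ == "__main__":
    glyph = [[0, 1],
             [1, 0]]
    big = upscale(glyph)
    print(len(big), "rows x", len(big[0]), "columns")
```

Larger multipliers give the OCR engine more pixels per character, which often helps with small print, but they also make processing slower.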




I was reading the Tesseract documentation on GitHub. May I use some parameters from it inside WA in order to improve its quality? Something like this:


tesseract imagename outputbase digits


Also, could you tell me which Tesseract version you are using with WA: version 2, 2.3, 3, or higher? Is there any upgrade for the language files? If I download a newer version of, for example, "por.traineddata", do I need to download another file too, such as pdf.ttf or pdf.ttx?

ADMIN

Hello Roberto,


WinAutomation currently uses Tesseract v4.


You cannot configure anything that is not exposed in the actions (for example, outputbase digits).
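For reference, the command quoted earlier in the thread belongs to the standalone Tesseract command-line tool, not to WinAutomation. A minimal sketch of how it is assembled follows, assuming Tesseract is installed separately and on the PATH. The file names `page5.png` and `page5_text` are placeholders, and `build_tesseract_command` is a hypothetical helper, not part of either product.

```python
import os
import shutil
import subprocess


def build_tesseract_command(image_path, output_base, language=None, configs=()):
    """Return the argument list for:
    tesseract IMAGE OUTPUTBASE [-l LANG] [CONFIGFILE...]
    """
    command = ["tesseract", image_path, output_base]
    if language:
        command += ["-l", language]
    # Config names such as "digits" go last on the command line.
    command += list(configs)
    return command


if __name__ == "__main__":
    # Placeholder file names, used only for illustration.
    command = build_tesseract_command("page5.png", "page5_text", configs=["digits"])
    print(" ".join(command))
    # Run only when the tool and the input image are actually present.
    if shutil.which("tesseract") and os.path.exists("page5.png"):
        subprocess.run(command, check=True)
```

Here `digits` is the name of a config file that restricts recognition to numeric characters, and the recognized text is written to `page5_text.txt`.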

You can use language files from here:  https://github.com/tesseract-ocr/tessdata 
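Fetching a language file from that repository could be scripted roughly as below. This is a sketch under assumptions: the branch name `main` in the download URL and the `target_dir` argument are not confirmed anywhere in this thread, and the folder WinAutomation reads language files from is not stated here, so check both before relying on it.

```python
import os
import urllib.request

# Assumed raw-download base for the tessdata repository (branch name not confirmed here).
TESSDATA_RAW = "https://github.com/tesseract-ocr/tessdata/raw/main"


def traineddata_url(language_code):
    """Build the download URL for a language file, e.g. 'por'."""
    return f"{TESSDATA_RAW}/{language_code}.traineddata"


def download_traineddata(language_code, target_dir):
    """Download <language_code>.traineddata into target_dir and return its path."""
    os.makedirs(target_dir, exist_ok=True)
    destination = os.path.join(target_dir, f"{language_code}.traineddata")
    urllib.request.urlretrieve(traineddata_url(language_code), destination)
    return destination


if __name__ == "__main__":
    # Only prints the URL; call download_traineddata("por", some_folder) to fetch it.
    print(traineddata_url("por"))
```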
