Support Portal

for ProcessRobot and WinAutomation

Answered

Regular Expression - Extracting Street Addresses

Hey Guys,


I'm desperately trying to figure out why regex isn't parsing street addresses out of a datatable. I previously OCR'd some PDFs and built a table in Excel from the results. After that, I tried using the Parse Text action with a regular expression to find the street addresses, lot numbers, block numbers, and unit numbers, but it does not seem to work.


Has anyone had any success with this?


I've attached an Excel file with an example of the output I'm looking for. Any help would be appreciated.

xlsx
(34.3 KB)
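For readers without the attachment, here is a minimal sketch of the kind of pattern matching described above, written in Python rather than in WinAutomation's Parse Text action (whose regex flavor may differ slightly). The sample line, the field labels ("Lot", "Block", "Unit") and the list of street suffixes are assumptions for illustration; the real OCR output in the attached file may be laid out differently.

```python
import re
from typing import Dict, Optional

# Assumed shape: house number, a few name words, then a street-type suffix.
STREET_RE = re.compile(
    r"""
    \b\d{1,6}\s+                 # house number
    (?:\d*[A-Za-z.]+\s+){0,4}?   # up to four name words, e.g. "N. Maple" or "5th"
    (?:Street|St|Avenue|Ave|Road|Rd|Drive|Dr|Lane|Ln|
       Boulevard|Blvd|Court|Ct|Way|Place|Pl)\b\.?   # street-type suffix
    """,
    re.IGNORECASE | re.VERBOSE,
)

# Assumed labels; each pattern captures the token that follows the label.
LABELED_RES = {
    "lot": re.compile(r"\bLot\b\s*#?\s*(\w+)", re.IGNORECASE),
    "block": re.compile(r"\bBlock\b\s*#?\s*(\w+)", re.IGNORECASE),
    "unit": re.compile(r"\b(?:Unit|Apt|Suite|Ste)\b\.?\s*#?\s*(\w+)", re.IGNORECASE),
}


def parse_address_fields(text: str) -> Dict[str, Optional[str]]:
    """Return the first street, lot, block and unit found in text (None if absent)."""
    street = STREET_RE.search(text)
    fields: Dict[str, Optional[str]] = {"street": street.group(0) if street else None}
    for name, pattern in LABELED_RES.items():
        match = pattern.search(text)
        fields[name] = match.group(1) if match else None
    return fields


# Hypothetical sample line, not taken from the attached spreadsheet.
print(parse_address_fields("1420 N. Maple Ave Unit 4B, Lot 12, Block 3"))
```

OCR text is often noisy (stray characters, broken spacing), so patterns like these usually need tuning against the real rows. Street names that themselves begin with a suffix word, such as "St Marks Place", will be cut short by this sketch.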

Best Answer

It's hard to know what your exact problem is, but I could make it work. Check the file.

waj

That worked perfectly, thank you! After OCR'ing more documents with Tesseract, the process keeps crashing somewhere between documents 15 and 18. I have tried changing the order of the PDFs and removing the ones it crashes on, which are all approximately the same size (1-2 pages). Do you know why this might happen?


I was thinking it could be the heartbeat check every 5 seconds, RAM, etc. I am going as far as creating a new engine each time to see if it's a virtual memory issue.

zip
ocr crash.png
(93.4 KB)
waj

You're welcome. 


Try running eventvwr on the computer and check for errors logged just before the process crashed.



Faulting application name: WinAutomation.Process.exe, version: 9.0.0.5481, time stamp: 0x5d47617c

Faulting module name: libtesseract400.dll, version: 0.0.0.0, time stamp: 0x5b7d4fc4

Exception code: 0xc0000005

Fault offset: 0x00000000000a342e

Faulting process id: 0x3ae0

Faulting application start time: 0x01d62d3500561dc4

Faulting application path: C:\Program Files\WinAutomation\WinAutomation.Process.exe

Faulting module path: C:\Program Files\WinAutomation\x64\libtesseract400.dll

Report Id: 007cb5be-1817-426b-93ef-3e063491f237

Faulting package full name:

Faulting package-relative application ID: 

Application: WinAutomation.Process.exe

Framework Version: v4.0.30319

Description: The process was terminated due to an unhandled exception.

Exception Info: System.AccessViolationException

   at InteropRuntimeImplementer.TessApiSignaturesInstance.TessApiSignaturesImplementation.BaseApiRecognize(System.Runtime.InteropServices.HandleRef, System.Runtime.InteropServices.HandleRef)

   at Tesseract.Page.Recognize()

   at Tesseract.Page.GetText()

   at WinAutomation.Actions.Runtime.TesseractOCREngineFacadeToVariant.GetText(System.Collections.Generic.List`1<System.String>)

   at WinAutomation.Actions.Runtime.OCRActions.ExtractTextWithOCRFromPDFImages(WinAutomation.Shared.Runtime.Variants.OCREngineVariant, System.Collections.Generic.List`1<System.String>)

   at WinAutomation.Actions.Runtime.PDFActions.ExtractTextFromPDFWithOCR2(WinAutomation.Shared.Runtime.Variants.Variant, WinAutomation.Shared.Runtime.Variants.Variant, WinAutomation.Shared.Runtime.Variants.Variant, WinAutomation.Shared.Runtime.Variants.Variant, WinAutomation.Shared.Runtime.Variants.Variant, WinAutomation.Shared.Runtime.Variants.Variant, WinAutomation.Shared.Runtime.Variants.Variant ByRef, Boolean, System.String, Int32, Int32)

   at WinAutomation.Actions.Runtime.ActualCompiledJob+<>c__DisplayClassa.<Execute>b__8(Boolean)

   at WinAutomation.Actions.Runtime.ActualCompiledJob.Execute()

   at WinAutomation.Robot.Runner.(Int32)

   at System.Threading.ExecutionContext.RunInternal(System.Threading.ExecutionContext, System.Threading.ContextCallback, System.Object, Boolean)

   at System.Threading.ExecutionContext.Run(System.Threading.ExecutionContext, System.Threading.ContextCallback, System.Object, Boolean)

   at System.Threading.ExecutionContext.Run(System.Threading.ExecutionContext, System.Threading.ContextCallback, System.Object)

   at System.Threading.ThreadHelper.ThreadStart()
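Two of the hexadecimal fields in the event record above can be decoded to make the log easier to read. Exception code 0xc0000005 is the Windows status for an access violation, which matches the System.AccessViolationException in the stack trace. The "Faulting application start time" looks like a Windows FILETIME (100-nanosecond ticks since 1601-01-01 UTC); that interpretation is an assumption here, and the helper names below are made up for illustration. A small Python sketch:

```python
from datetime import datetime, timedelta, timezone

# Windows FILETIME counts 100-nanosecond ticks since 1601-01-01 00:00 UTC.
FILETIME_EPOCH = datetime(1601, 1, 1, tzinfo=timezone.utc)

# Partial, hand-picked table: only the code seen in this thread is listed.
NTSTATUS_NAMES = {0xC0000005: "STATUS_ACCESS_VIOLATION"}


def filetime_to_utc(filetime: int) -> datetime:
    """Convert a FILETIME tick count to a timezone-aware UTC datetime."""
    # Integer division by 10 turns 100 ns ticks into whole microseconds.
    return FILETIME_EPOCH + timedelta(microseconds=filetime // 10)


def describe_exception(code: int) -> str:
    """Return a readable name for a known exception code, else the raw hex."""
    return NTSTATUS_NAMES.get(code, f"unknown (0x{code:08x})")


# Values copied from the event record above.
print(filetime_to_utc(int("01d62d3500561dc4", 16)).isoformat())
print(describe_exception(0xC0000005))
```

Comparing the decoded start time with the time the error was logged gives a rough idea of how long the process ran before libtesseract400.dll faulted.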

 


I just opened your WAJ file.


My guess is that you need to put your Create OCR action before the loop.

I've done it both ways, with the same result. No idea what to try next.
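One generic thing to try when native code crashes partway through a batch is to run each document in its own operating-system process, so that one crash is recorded against that file instead of ending the whole run. The sketch below is plain Python, not a WinAutomation feature; `run_isolated` and `build_cmd` are hypothetical names, and the demo command is a stand-in for whatever OCR command line would actually be used.

```python
import subprocess
import sys
from typing import Callable, Dict, Iterable, List


def run_isolated(
    items: Iterable[str],
    build_cmd: Callable[[str], List[str]],
    timeout_s: float = 120.0,
) -> Dict[str, str]:
    """Run one child process per item and record how each one ended."""
    results: Dict[str, str] = {}
    for item in items:
        try:
            proc = subprocess.run(build_cmd(item), capture_output=True, timeout=timeout_s)
        except subprocess.TimeoutExpired:
            # The child hung; subprocess.run kills it before raising.
            results[item] = "timeout"
            continue
        # A crash in the child shows up here as a nonzero exit code.
        results[item] = "ok" if proc.returncode == 0 else f"failed (exit code {proc.returncode})"
    return results


def demo_cmd(path: str) -> List[str]:
    # Stand-in for a real OCR command: exits with 3 for names starting with "bad".
    script = "import sys; sys.exit(3 if sys.argv[1].startswith('bad') else 0)"
    return [sys.executable, "-c", script, path]


if __name__ == "__main__":
    print(run_isolated(["a.pdf", "bad.pdf", "c.pdf"], demo_cmd))
```

This does not fix the underlying fault, but it shows which files trigger it and lets the remaining files finish.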

How else would you suggest I fix this? Can I pay someone for assistance? I'm in a bind and I don't expect anything for free.

Thank you

+ System 

  - Provider 

   [ Name]  Application Error 
 
  - EventID 1000 

   [ Qualifiers]  0 
 
   Level 2 
 
   Task 100 
 
   Keywords 0x80000000000000 
 
  - TimeCreated 

   [ SystemTime]  2020-05-20T10:27:59.121655600Z 
 
   EventRecordID 20440 
 
   Channel Application 
 
   Computer DESKTOP-SL6BFC9 
 
   Security 
 

- EventData 

   WinAutomation.Process.exe 
   9.0.0.5481 
   5d47617c 
   libtesseract400.dll 
   0.0.0.0 
   5b7d4fc4 
   c0000005 
   00000000000a342e 
   304c 
   01d62e906ee0d807 
   C:\Program Files\WinAutomation\WinAutomation.Process.exe 
   C:\Program Files\WinAutomation\x64\libtesseract400.dll 
   a1c884d2-25da-4375-9c3a-c6e1aedec93e 
    
    

Auto-cleanup of Processes with no lifesigns Additional Data: The following Processes got terminated due to not having given any lifesigns for a certain period of time:

 

JUST OCR ALL PDFS (instance id: 657ce95e-3109-4278-a24f-e0294597360c)

 




and



The following Processes got terminated due to not having given any lifesigns for a certain period of time:

 

- 'JUST OCR ALL PDFS'



Can anyone help me? Can I pay WinAutomation to fix this problem? I don't expect anything for free, but I've been a fish out of water for two days and these are desperate times. Any help will be compensated.


PLEASE

Does anyone know if this message about Softomotive merging with Microsoft means we should seek support on the Power Automate support page?


My process crashes as well, on pdf number 23...


You need to email Softomotive support to figure out what's happening with the OCR engine.
