The Tesseract OCR Engine supports multiple languages.


To recognize characters from a specific language, that language must be specified when the OCR Engine is created.



English, German, Spanish, French and Italian come embedded with the action, so they do not require additional parameters.

For any other language, the 'Other' option needs to be selected. This option requires two language-specific parameters: the Language Abbreviation and the Language Data Path.



The Language Abbreviation tells the OCR engine which language to look for during OCR, and the Language Data Path should contain the data file for the corresponding language. This data file holds the trained model data that the OCR engine uses to recognize that language.


The Language Abbreviations and Data Files for Tesseract OCR can be found at either of the following links:

https://github.com/tesseract-ocr/tessdata/

https://tesseract-ocr.github.io/tessdoc/Data-Files


The "language".traineddata data file can be downloaded from either of the links above to any location on the PC.

The path of this data file needs to be passed to the Language Data Path parameter along with the corresponding language code for the Language Abbreviation parameter.
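
Before filling in the two parameters, it can help to confirm which language codes actually have a data file in the chosen location. The following Python sketch is only an illustration and is not part of the action: the function name available_language_codes is hypothetical, and it assumes the data files keep their original "language".traineddata names and sit directly inside one folder.

```python
from pathlib import Path


def available_language_codes(data_dir: str) -> list[str]:
    """List the language codes that have a .traineddata file directly inside data_dir.

    Hypothetical helper: assumes each file is still named <code>.traineddata,
    as downloaded, so the file name without its extension is the language code.
    """
    folder = Path(data_dir)
    return sorted(p.stem for p in folder.glob("*.traineddata") if p.is_file())
```

Any code returned by this helper should be usable as the Language Abbreviation, provided the same folder is the one referenced by the Language Data Path.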



To perform OCR on multilingual documents, the data files for all the required languages can be placed in the same folder (so that they can be found under the same path), and their corresponding language codes can be passed together, joined with a '+' sign (for example, hin+jpn for Hindi and Japanese).

This would ensure that OCR is carried out for characters in all the languages passed, as shown below.
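
The multi-language value can also be assembled programmatically. The Python sketch below is a minimal illustration under stated assumptions, not the action's own logic: build_language_argument is a hypothetical name, and the check assumes every data file is named "language".traineddata and lives in the single folder passed as data_dir.

```python
from pathlib import Path


def build_language_argument(data_dir: str, lang_codes: list[str]) -> str:
    """Join language codes with '+' after checking each has a data file in data_dir.

    Hypothetical helper: raises FileNotFoundError if any <code>.traineddata
    file is missing, so a bad combination is caught before OCR is attempted.
    """
    folder = Path(data_dir)
    missing = [c for c in lang_codes if not (folder / f"{c}.traineddata").is_file()]
    if missing:
        raise FileNotFoundError(
            f"Missing data files in {folder}: "
            + ", ".join(f"{c}.traineddata" for c in missing)
        )
    # Codes are combined in the order given, separated by '+'.
    return "+".join(lang_codes)
```

For a folder holding hin.traineddata and jpn.traineddata, calling build_language_argument with the codes hin and jpn yields the string hin+jpn, which is the form expected for the Language Abbreviation.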