Metadata-Version: 2.1
Name: multilingual-pdf2text
Version: 1.0.1
Summary: A python library for extracting text from PDFs without losing the formatting of the PDF content.
Home-page: https://github.com/shahrukhx01/multilingual-pdf2text
Author: Shahrukh Khan
Author-email: sk28671@gmail.com
License: UNKNOWN
Platform: UNKNOWN
Classifier: Intended Audience :: Developers
Classifier: Programming Language :: Python
Classifier: Programming Language :: Python :: 3.6
Classifier: License :: OSI Approved :: MIT License
Classifier: Operating System :: OS Independent
Requires-Python: >=3.6
Description-Content-Type: text/markdown
Requires-Dist: pydantic
Requires-Dist: pytesseract
Requires-Dist: pdf2image
Requires-Dist: pillow

# Multilingual PDF to Text

## Install Package from PyPI
1. Install it using pip.
```bash
pip install multilingual-pdf2text
```
The library uses Tesseract for OCR. Tesseract is not installed by pip and must be installed separately by following the official instructions:

[Tesseract Installation](https://tesseract-ocr.github.io/tessdoc/Installation.html)
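
Because Tesseract is installed separately from the Python package, it can help to confirm that the executable is reachable before running an extraction. The following is a minimal sketch using only the standard library; the helper name `tesseract_available` is hypothetical and not part of this package, and it assumes the executable is named `tesseract` and is on your `PATH`.

```python
import shutil


def tesseract_available(binary: str = "tesseract") -> bool:
    """Return True if the given executable can be found on PATH.

    Hypothetical helper, not part of multilingual-pdf2text.
    """
    return shutil.which(binary) is not None


if __name__ == "__main__":
    if tesseract_available():
        print("Tesseract found on PATH")
    else:
        print("Tesseract not found; install it before running the extraction")
```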

## Example Usage
2. Use it in your code.
```python
from multilingual_pdf2text.pdf2text import PDF2Text
from multilingual_pdf2text.models.document_model.document import Document
import logging
logging.basicConfig(level=logging.INFO)

def main():
    # Create the document to extract, along with its configuration
    pdf_document = Document(
        document_path='/Users/shahrukh/Desktop/multilingual-pdf2text/example/example.pdf',
        language='spa'
        )
    pdf2text = PDF2Text(document=pdf_document)
    content = pdf2text.extract()
    print(content)

if __name__ == "__main__":
    main()
```
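
The example above hard-codes an absolute path from one machine. A small check before building the `Document` can make a wrong path easier to diagnose. This is only a sketch: `resolve_pdf_path` is a hypothetical helper, not part of this package, and it assumes that `document_path` accepts an ordinary file path string, as in the example above.

```python
from pathlib import Path


def resolve_pdf_path(path: str) -> str:
    """Return an absolute path to an existing PDF file, or raise.

    Hypothetical helper, not part of multilingual-pdf2text.
    """
    pdf_path = Path(path).expanduser().resolve()
    if not pdf_path.is_file():
        raise FileNotFoundError(f"PDF not found: {pdf_path}")
    if pdf_path.suffix.lower() != ".pdf":
        raise ValueError(f"Expected a .pdf file, got: {pdf_path.name}")
    return str(pdf_path)
```

The returned string could then be passed as `document_path` when creating the `Document`.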

Tesseract supports the following languages. Each entry gives the language code, which is the value to pass as `language`, followed by the language name:
* afr	Afrikaans
* amh	Amharic
* ara	Arabic
* asm	Assamese
* aze	Azerbaijani
* aze_cyrl	Azerbaijani - Cyrillic
* bel	Belarusian
* ben	Bengali
* bod	Tibetan
* bos	Bosnian
* bul	Bulgarian
* cat	Catalan; Valencian
* ceb	Cebuano
* ces	Czech
* chi_sim	Chinese - Simplified
* chi_tra	Chinese - Traditional
* chr	Cherokee
* cym	Welsh
* dan	Danish
* deu	German
* dzo	Dzongkha
* ell	Greek, Modern (1453-)
* eng	English
* enm	English, Middle (1100-1500)
* epo	Esperanto
* est	Estonian
* eus	Basque
* fas	Persian
* fin	Finnish
* fra	French
* frk	German Fraktur
* frm	French, Middle (ca. 1400-1600)
* gle	Irish
* glg	Galician
* grc	Greek, Ancient (-1453)
* guj	Gujarati
* hat	Haitian; Haitian Creole
* heb	Hebrew
* hin	Hindi
* hrv	Croatian
* hun	Hungarian
* iku	Inuktitut
* ind	Indonesian
* isl	Icelandic
* ita	Italian
* ita_old	Italian - Old
* jav	Javanese
* jpn	Japanese
* kan	Kannada
* kat	Georgian
* kat_old	Georgian - Old
* kaz	Kazakh
* khm	Central Khmer
* kir	Kirghiz; Kyrgyz
* kor	Korean
* kur	Kurdish
* lao	Lao
* lat	Latin
* lav	Latvian
* lit	Lithuanian
* mal	Malayalam
* mar	Marathi
* mkd	Macedonian
* mlt	Maltese
* msa	Malay
* mya	Burmese
* nep	Nepali
* nld	Dutch; Flemish
* nor	Norwegian
* ori	Oriya
* pan	Panjabi; Punjabi
* pol	Polish
* por	Portuguese
* pus	Pushto; Pashto
* ron	Romanian; Moldavian; Moldovan
* rus	Russian
* san	Sanskrit
* sin	Sinhala; Sinhalese
* slk	Slovak
* slv	Slovenian
* spa	Spanish; Castilian
* spa_old	Spanish; Castilian - Old
* sqi	Albanian
* srp	Serbian
* srp_latn	Serbian - Latin
* swa	Swahili
* swe	Swedish
* syr	Syriac
* tam	Tamil
* tel	Telugu
* tgk	Tajik
* tgl	Tagalog
* tha	Thai
* tir	Tigrinya
* tur	Turkish
* uig	Uighur; Uyghur
* ukr	Ukrainian
* urd	Urdu
* uzb	Uzbek
* uzb_cyrl	Uzbek - Cyrillic
* vie	Vietnamese
* yid	Yiddish
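
A language in the list above is only usable if its trained data is installed locally. As a rough sketch, the installed codes can be read from the Tesseract command `tesseract --list-langs`. The helper names below are hypothetical and not part of this package, and the parsing assumes the output is a header line starting with "List of" followed by one code per line, which may differ between Tesseract versions.

```python
import shutil
import subprocess
from typing import List


def parse_list_langs(output: str) -> List[str]:
    """Extract language codes from the text printed by `tesseract --list-langs`."""
    lines = [line.strip() for line in output.splitlines() if line.strip()]
    # Skip the header line, e.g. "List of available languages (2):"
    return [line for line in lines if not line.lower().startswith("list of")]


def installed_languages(binary: str = "tesseract") -> List[str]:
    """Return the locally installed language codes, or [] if Tesseract is missing."""
    if shutil.which(binary) is None:
        return []
    result = subprocess.run(
        [binary, "--list-langs"],
        stdout=subprocess.PIPE,
        stderr=subprocess.PIPE,
        universal_newlines=True,
    )
    # Assumption: some older releases print the list on stderr instead of stdout.
    return parse_list_langs(result.stdout or result.stderr)


if __name__ == "__main__":
    langs = installed_languages()
    print("spa installed:", "spa" in langs)
```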


