Metadata-Version: 2.1
Name: multilingual-pdf2text
Version: 1.0.0
Summary: A Python library for extracting text from PDFs without losing the formatting of the PDF content.
Home-page: https://github.com/shahrukhx01/multilingual-pdf2text
Author: Shahrukh Khan
Author-email: sk28671@gmail.com
License: UNKNOWN
Platform: UNKNOWN
Classifier: Intended Audience :: Developers
Classifier: Programming Language :: Python
Classifier: Programming Language :: Python :: 3.6
Classifier: License :: OSI Approved :: MIT License
Classifier: Operating System :: OS Independent
Requires-Python: >=3.6
Description-Content-Type: text/markdown
Requires-Dist: pydantic
Requires-Dist: pytesseract
Requires-Dist: pdf2image
Requires-Dist: pillow

# Multilingual PDF to Text

## Install Package from PyPI
1. Install it using pip.
```bash
pip install multilingual-pdf2text
```
The library depends on Tesseract, which must be installed separately by following these instructions:

[Tesseract Installation](https://tesseract-ocr.github.io/tessdoc/Installation.html)
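
Before running an extraction, it can help to confirm that the `tesseract` executable is on your `PATH` and to see which language data files are installed. The following is a minimal sketch using only the standard library. The helper names `parse_language_list` and `installed_tesseract_languages` are hypothetical and not part of this package, and the parsing assumes the usual `tesseract --list-langs` output of one header line followed by one language code per line.

```python
import re
import shutil
import subprocess
from typing import List

# Language entries look like "eng", "chi_sim" or "script/Latin".
_CODE_PATTERN = re.compile(r"[\w/\-]+")


def parse_language_list(output: str) -> List[str]:
    """Parse the text printed by `tesseract --list-langs` into language codes."""
    codes = []
    for line in output.splitlines():
        line = line.strip()
        # Skip blank lines and the "List of available languages ..." header.
        if not line or line.lower().startswith("list of"):
            continue
        # Ignore anything that does not look like a single language code,
        # such as warning messages.
        if _CODE_PATTERN.fullmatch(line):
            codes.append(line)
    return codes


def installed_tesseract_languages() -> List[str]:
    """Return installed language codes, or an empty list if Tesseract is missing."""
    executable = shutil.which("tesseract")
    if executable is None:
        return []
    result = subprocess.run(
        [executable, "--list-langs"],
        stdout=subprocess.PIPE,
        stderr=subprocess.PIPE,
        universal_newlines=True,
        check=False,
    )
    # Assumption: the list may arrive on either stream, so read both.
    return parse_language_list(result.stdout + "\n" + result.stderr)


if __name__ == "__main__":
    languages = installed_tesseract_languages()
    if languages:
        print("Installed Tesseract languages:", ", ".join(languages))
    else:
        print("Tesseract was not found on PATH, or no language data is installed.")
```

If the language you plan to use is missing from the printed list, install the corresponding Tesseract language data before running the extraction.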

## Example Usage
2. Use it in your code.
```python
from multilingual_pdf2text.pdf2text import PDF2Text
from multilingual_pdf2text.models.document_model.document import Document
import logging
logging.basicConfig(level=logging.INFO)

def main():
    # Create the document to extract, along with its configuration
    pdf_document = Document(
        document_path='/Users/shahrukh/Desktop/multilingual-pdf2text/example/example.pdf',
        language='spa'
        )
    # Run the extraction on the document and print the extracted content
    pdf2text = PDF2Text(document=pdf_document)
    content = pdf2text.extract()
    print(content)

if __name__ == "__main__":
    main()
```

Tesseract supports the following languages. Each entry gives the language code followed by the language name:

* afr	Afrikaans	
* amh	Amharic	
* ara	Arabic	
* asm	Assamese	
* aze	Azerbaijani	
* aze_cyrl	Azerbaijani - Cyrillic
* bel	Belarusian	
* ben	Bengali	
* bod	Tibetan	
* bos	Bosnian	
* bul	Bulgarian	
* cat	Catalan; Valencian	
* ceb	Cebuano	
* ces	Czech	
* chi_sim	Chinese - Simplified
* chi_tra	Chinese - Traditional
* chr	Cherokee	
* cym	Welsh	
* dan	Danish	
* deu	German	
* dzo	Dzongkha	
* ell	Greek, Modern (1453-)	
* eng	English	
* enm	English, Middle (1100-1500)	
* epo	Esperanto	
* est	Estonian	
* eus	Basque	
* fas	Persian	
* fin	Finnish	
* fra	French	
* frk	German Fraktur	
* frm	French, Middle (ca. 1400-1600)	
* gle	Irish	
* glg	Galician	
* grc	Greek, Ancient (-1453)	
* guj	Gujarati	
* hat	Haitian; Haitian Creole	
* heb	Hebrew	
* hin	Hindi	
* hrv	Croatian	
* hun	Hungarian	
* iku	Inuktitut	
* ind	Indonesian	
* isl	Icelandic	
* ita	Italian	
* ita_old	Italian - Old
* jav	Javanese	
* jpn	Japanese	
* kan	Kannada	
* kat	Georgian	
* kat_old	Georgian - Old
* kaz	Kazakh	
* khm	Central Khmer	
* kir	Kirghiz; Kyrgyz	
* kor	Korean	
* kur	Kurdish	
* lao	Lao	
* lat	Latin	
* lav	Latvian	
* lit	Lithuanian	
* mal	Malayalam	
* mar	Marathi	
* mkd	Macedonian	
* mlt	Maltese	
* msa	Malay	
* mya	Burmese	
* nep	Nepali	
* nld	Dutch; Flemish	
* nor	Norwegian	
* ori	Oriya	
* pan	Panjabi; Punjabi	
* pol	Polish	
* por	Portuguese	
* pus	Pushto; Pashto	
* ron	Romanian; Moldavian; Moldovan	
* rus	Russian	
* san	Sanskrit	
* sin	Sinhala; Sinhalese	
* slk	Slovak	
* slv	Slovenian	
* spa	Spanish; Castilian	
* spa_old	Spanish; Castilian - Old
* sqi	Albanian	
* srp	Serbian	
* srp_latn	Serbian - Latin
* swa	Swahili	
* swe	Swedish	
* syr	Syriac	
* tam	Tamil	
* tel	Telugu	
* tgk	Tajik	
* tgl	Tagalog	
* tha	Thai	
* tir	Tigrinya	
* tur	Turkish	
* uig	Uighur; Uyghur	
* ukr	Ukrainian	
* urd	Urdu	
* uzb	Uzbek	
* uzb_cyrl	Uzbek - Cyrillic
* vie	Vietnamese	
* yid	Yiddish
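
Because the language is passed as a short code such as `spa`, a typo or a full language name is an easy mistake to make. The sketch below shows one way to check a code before building a `Document`. The helper `normalize_language` and the constant `KNOWN_LANGUAGE_CODES` are hypothetical and not part of this package, and the set holds only a small subset of the codes listed above for brevity. Whether the library performs its own validation is not stated here.

```python
# Hypothetical helper, not part of multilingual-pdf2text.
# Only a small subset of the codes from the list above; extend as needed.
KNOWN_LANGUAGE_CODES = {
    "ara",
    "chi_sim",
    "chi_tra",
    "deu",
    "eng",
    "fra",
    "spa",
    "spa_old",
    "urd",
}


def normalize_language(code: str) -> str:
    """Return the cleaned language code, or raise ValueError if it is unknown."""
    cleaned = code.strip().lower()
    if cleaned not in KNOWN_LANGUAGE_CODES:
        raise ValueError(
            "Unknown Tesseract language code: {!r}. "
            "Use a code from the list above, for example 'spa'.".format(code)
        )
    return cleaned


if __name__ == "__main__":
    print(normalize_language(" SPA "))
```

The returned value can then be passed as the `language` argument of `Document`, as in the usage example.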

