Metadata-Version: 2.1
Name: wiktionary-de-parser
Version: 0.10.1
Summary: Extracts data from German Wiktionary dump files.
Home-page: https://github.com/gambolputty/wiktionary-de-parser
License: MIT
Keywords: wiktionary,xml,parser,data-extraction,german,nlp
Author: Gregor Weichbrodt
Author-email: gregorweichbrodt@gmail.com
Requires-Python: >=3.11,<4.0
Classifier: License :: OSI Approved :: MIT License
Classifier: Natural Language :: German
Classifier: Operating System :: OS Independent
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3.11
Classifier: Programming Language :: Python :: 3.12
Classifier: Topic :: Text Processing :: Markup :: XML
Requires-Dist: black (>=24.1.1,<25.0.0)
Requires-Dist: lxml (>=5.1.0,<6.0.0)
Requires-Dist: mwparserfromhell (>=0.6.6,<0.7.0)
Project-URL: Bug Tracker, https://github.com/gambolputty/wiktionary-de-parser/issues
Project-URL: Repository, https://github.com/gambolputty/wiktionary-de-parser
Description-Content-Type: text/markdown

# wiktionary-de-parser

A Python module to extract data from German Wiktionary XML files (for Python 3.11+).

## Features

- Extracts inflection (flexion) tables, IPA transcriptions, language, grammatical gender (genus), lemma, basic part-of-speech information, and syllables for each word.
- Yields one record per entry, not per page. A page can contain several entries because a word can have different meanings.

## Installation

`pip install wiktionary-de-parser`

Or with [Poetry](https://python-poetry.org/):

`poetry add wiktionary-de-parser`

## Usage

```python
from bz2 import BZ2File
from wiktionary_de_parser import Parser

bzfile_path = '/tmp/dewiktionary-latest-pages-articles-multistream.xml.bz2'
bz_file = BZ2File(bzfile_path)

for record in Parser(bz_file):
    # Skip entries that are not German
    if record.lang_code != 'de':
        continue
    # do stuff with 'record'
```

Note: This example loads a compressed Wiktionary dump file, which can be [downloaded from the Wikimedia dumps site](https://dumps.wikimedia.org/dewiktionary/latest).
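
As a rough illustration of what "do stuff with 'record'" could look like, the sketch below collects IPA transcriptions per German lemma. The helper name `german_ipa_index` is hypothetical and not part of this package. It only assumes that each record exposes the `lemma`, `lang_code` and `ipa` attributes shown in the example output in this README, and that `ipa` may be `None` when no transcription is available. Stand-in objects are used here so the snippet runs without a dump file.

```python
from collections import defaultdict
from types import SimpleNamespace


def german_ipa_index(records):
    """Map each German lemma to the distinct IPA transcriptions of its entries.

    Hypothetical helper: works on any iterable of objects that have
    `lemma`, `lang_code` and `ipa` attributes.
    """
    index = defaultdict(list)
    for record in records:
        if record.lang_code != 'de':
            continue
        # Assumption: `ipa` can be None when an entry has no transcription
        for transcription in record.ipa or []:
            if transcription not in index[record.lemma]:
                index[record.lemma].append(transcription)
    return dict(index)


# Stand-in records for illustration only; in real use, pass Parser(bz_file)
sample_records = [
    SimpleNamespace(lemma='Abend', lang_code='de', ipa=['ˈaːbn̩t', 'ˈaːbm̩t']),
    SimpleNamespace(lemma='Abend', lang_code='de', ipa=['ˈaːbn̩t']),
    SimpleNamespace(lemma='evening', lang_code='en', ipa=None),
]

ipa_index = german_ipa_index(sample_records)
print(ipa_index)
```

With a real dump, the same function could be called as `german_ipa_index(Parser(bz_file))`, since it only iterates over the records once.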


## Output
Example output for the page "Abend", which yields three records (one per entry):
```python
Record(lemma='Abend',
       inflected=False,
       syllables=['Abend'],
       ipa=['ˈaːbn̩t', 'ˈaːbm̩t'],
       rhymes=['aːbn̩t'],
       pos={'Substantiv': []},
       lang='Deutsch',
       lang_code='de',
       flexion={'Akkusativ Plural': 'Abende',
                'Akkusativ Singular': 'Abend',
                'Dativ Plural': 'Abenden',
                'Dativ Singular': 'Abend',
                'Genitiv Plural': 'Abende',
                'Genitiv Singular': 'Abends',
                'Genus': 'm',
                'Nominativ Plural': 'Abende',
                'Nominativ Singular': 'Abend'},
       page_id=5719,
       index=0,
       title='Abend',
       wikitext=None)

Record(lemma='Abend',
       inflected=False,
       syllables=['Abend'],
       ipa=['ˈaːbn̩t'],
       rhymes=['aːbn̩t'],
       pos={'Substantiv': ['Nachname']},
       lang='Deutsch',
       lang_code='de',
       flexion=None,
       page_id=5719,
       index=1,
       title='Abend',
       wikitext=None)

Record(lemma='Abend',
       inflected=False,
       syllables=['Abend'],
       ipa=['ˈaːbn̩t', 'ˈaːbm̩t'],
       rhymes=['aːbn̩t'],
       pos={'Substantiv': ['Toponym']},
       lang='Deutsch',
       lang_code='de',
       flexion=None,
       page_id=5719,
       index=2,
       title='Abend',
       wikitext=None)
```
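
Because one page can yield several records, it can be useful to group them by `page_id` and to check whether an entry has a flexion table at all. The following is a minimal sketch under those assumptions. The helpers `entries_by_page` and `noun_summary` are hypothetical names, not functions provided by this package, and the sample objects are abbreviated stand-ins for the three "Abend" records above.

```python
from types import SimpleNamespace


def entries_by_page(records):
    """Group records by page_id, ordering each group by the entry index."""
    pages = {}
    for record in records:
        pages.setdefault(record.page_id, []).append(record)
    for entries in pages.values():
        entries.sort(key=lambda r: r.index)
    return pages


def noun_summary(record):
    """Return (genus, nominative plural), or None if there is no flexion table."""
    if not record.flexion:
        return None
    return record.flexion.get('Genus'), record.flexion.get('Nominativ Plural')


# Abbreviated stand-ins for the three "Abend" records shown above
sample_records = [
    SimpleNamespace(page_id=5719, index=1, flexion=None),
    SimpleNamespace(page_id=5719, index=0,
                    flexion={'Genus': 'm', 'Nominativ Plural': 'Abende'}),
    SimpleNamespace(page_id=5719, index=2, flexion=None),
]

pages = entries_by_page(sample_records)
summaries = [noun_summary(r) for r in pages[5719]]
print(summaries)
```

Only the first entry of the page carries a flexion table, so the surname and toponym entries produce `None` in the summary list.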

## Development
This project uses [Poetry](https://python-poetry.org/).

1. Install [Poetry](https://python-poetry.org/).
2. Clone this repository.
3. Run `poetry install` inside the project folder to install the dependencies.
4. Adapt `wiktionary_de_parser/run.py` to your needs.
5. Run `poetry run python wiktionary_de_parser/run.py` to run the parser, or `poetry run pytest` to run the tests.

## License

[MIT](https://github.com/gambolputty/wiktionary-de-parser/blob/master/LICENSE.md) © Gregor Weichbrodt

