Metadata-Version: 2.1
Name: tei2neo
Version: 0.5.0
Summary: TEI (Text Encoding Initiative) parser to extract information and store it in Neo4j database
Home-page: https://sissource.ethz.ch/sis/semper-tei
Author: Swen Vermeul • ID SIS • ETH Zürich
Author-email: swen@ethz.ch
License: BSD
Classifier: Programming Language :: Python :: 3
Classifier: License :: OSI Approved :: Apache Software License
Classifier: Operating System :: OS Independent
Requires-Python: >=3.5
Description-Content-Type: text/markdown
Requires-Dist: pytest
Requires-Dist: py2neo>=2021.0.1
Requires-Dist: bs4
Requires-Dist: lxml
Requires-Dist: spacy
Requires-Dist: requests
Requires-Dist: gitpython

# TEI parser

This is a parser written in Python 3 that takes TEI-XML documents as input and writes them to a [Neo4j Graph Database](https://neo4j.com).

It makes use of the following existing libraries:

- [Beautiful Soup 4](https://beautiful-soup-4.readthedocs.io/en/latest/), an easy-to-use XML parser.
- [spaCy](https://spacy.io), used to parse the text. Currently we use the German language package `de_core_news_sm`.
- [Py2neo v4](https://py2neo.org/v4/), a library for working with the Neo4j database.

## Installation

```
$ pip install tei2neo
$ pip install "de_core_news_sm @ https://github.com/explosion/spacy-models/releases/download/de_core_news_sm-3.6.0/de_core_news_sm-3.6.0.tar.gz"
```

## Synopsis

```
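from py2neo import Graph
from tei2neo import parse, GraphUtils

# connect to the Neo4j database
graph = Graph(host="localhost", user="neo4j", password="password")

# `file` is the path to the TEI-XML document to parse
doc, status, soup = parse(
    filename=file,
    start_with_tag='TEI',
    idno='20-MS-221'
)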
tx = graph.begin()
doc.save(tx)
tx.commit()

ut = GraphUtils(graph)
paras = ut.paragraphs_for_filename('20_MS_221_1.xml')

# create unhyphenated tokens
for para in paras:
    tokens = ut.tokens_in_paragraph(para)
    ut.create_unhyphenated(tokens)

# show hyphenated text
for token in ut.tokens_in_paragraph(paras[5], concatenated=0):
    if 'lb' in token.labels:
        print(' | ', end='')
    print(token.get('string',''), end='')
    print(token.get('whitespace', ''), end='')

# show concatenated (unhyphenated) version of the text
for token in ut.tokens_in_paragraph(paras[5], concatenated=1):
    if 'lb' in token.labels:
        print(' ', end='')

    print(token.get('string',''), end='')
    print(token.get('whitespace', ''), end='')
```

# How the parser works

A TEI document can be constructed in various ways, and many elements work very similarly. This parser therefore expects certain elements and treats them in a specific manner.

## Elements that affect all following elements

### handShift

A `handShift` element **affects all elements that follow it**, until another `handShift` element is encountered.

**Example**

From this point on, everything is written in «Latein» script with a pencil (medium=Blei):

```
<handShift new="#hWH" medium="Blei" script="Latein"/>
```

Now we switch to «Kurrent» script and use black ink (STinte):

```
<handShift new="#hGS" medium="STinte" script="Kurrent"/>
```

**Appearance in Neo4j**

As we have seen, a `handShift` element contains three attributes:

- new="#hWH"
- medium="Blei"
- script="Latein"

These attributes are passed to all Token elements that follow a `handShift`. Previous attributes are not deleted: if only the medium changes from «Blei» to «STinte», all other attributes stay the same.
The `handShift` element will _not_ appear as a node in Neo4j.
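
The propagation rule can be illustrated with a minimal sketch. This is not the code used by tei2neo: it relies only on the standard library's `xml.etree.ElementTree`, and the sample fragment, the `<w>` elements, and the function name `propagate_hand_shift` are assumptions made for illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical TEI fragment: the second handShift changes only the medium.
SAMPLE = """<p>
  <handShift new="#hWH" medium="Blei" script="Latein"/>
  <w>erstes</w>
  <handShift medium="STinte"/>
  <w>zweites</w>
</p>"""


def propagate_hand_shift(root):
    """Return (text, attributes) pairs for every text-bearing element."""
    current = {}  # attributes of the most recent handShift elements
    result = []
    for elem in root.iter():  # iterates in document order
        if elem.tag == "handShift":
            # update, do not replace: unchanged attributes stay in effect
            current.update(elem.attrib)
        elif elem.text and elem.text.strip():
            result.append((elem.text.strip(), dict(current)))
    return result


tokens = propagate_hand_shift(ET.fromstring(SAMPLE))
for text, attrs in tokens:
    print(text, attrs)
```

In this sketch the second token keeps `new` and `script` from the first `handShift` and only its `medium` changes to «STinte».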

### delSpan

A `delSpan` element works much like a `handShift` element, as it alters the appearance of all the following text until it reaches its `spanTo` target:

```
<delSpan spanTo="#A20_MS_215_12_3"/>
... (a lot of XML code here)
<anchor xml:id="A20_MS_215_12_3"/>
```

**Appearance in Neo4j**

- Both the `delSpan` and the `anchor` appear as additional nodes.
- All elements between the `delSpan` and the `anchor` element receive an additional `delSpan` label.
- A `delSpan` attribute is added to each of these elements; its value is equal to the `xml:id` attribute of the anchor.
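
As a rough sketch of this span logic (again not the actual tei2neo implementation), the following standard-library example marks every text-bearing element between a `delSpan` and its matching `anchor`. The sample fragment, the `<w>` elements, and the function name `mark_del_span` are hypothetical.

```python
import xml.etree.ElementTree as ET

# ElementTree exposes xml:id under the predefined XML namespace.
XML_ID = "{http://www.w3.org/XML/1998/namespace}id"

# Hypothetical TEI fragment with one deleted span.
SAMPLE = """<p>
  <w>vor</w>
  <delSpan spanTo="#A20_MS_215_12_3"/>
  <w>gestrichen</w>
  <w>auch</w>
  <anchor xml:id="A20_MS_215_12_3"/>
  <w>nach</w>
</p>"""


def mark_del_span(root):
    """Return (text, delSpan value) pairs; the value is None outside a span."""
    active = None  # xml:id of the anchor that closes the open delSpan
    marked = []
    for elem in root.iter():  # iterates in document order
        if elem.tag == "delSpan":
            active = elem.attrib["spanTo"].lstrip("#")
        elif elem.tag == "anchor" and active is not None and elem.get(XML_ID) == active:
            active = None  # matching anchor reached, the span ends here
        elif elem.text and elem.text.strip():
            marked.append((elem.text.strip(), active))
    return marked


marked = mark_del_span(ET.fromstring(SAMPLE))
for text, del_span in marked:
    print(text, del_span)
```

Only the two elements between the `delSpan` and the `anchor` carry the anchor's id; the elements before and after the span are left unmarked.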
