Metadata-Version: 2.4
Name: saf-datasets
Version: 0.7.2
Summary: Data set loading and annotation facilities for the Simple Annotation Framework
Home-page: 
Author: Danilo S. Carvalho
Author-email: "Danilo S. Carvalho" <danilo.carvalho@manchester.ac.uk>
License-Expression: GPL-3.0-or-later
Project-URL: Homepage, https://github.com/neuro-symbolic-ai/saf_datasets
Project-URL: Issues, https://github.com/neuro-symbolic-ai/saf_datasets/issues
Keywords: datasets,annotated,nlp
Classifier: Programming Language :: Python :: 3
Classifier: Operating System :: OS Independent
Requires-Python: >=3.9
Description-Content-Type: text/markdown
License-File: LICENSE
Requires-Dist: saf-nlp>=0.6.1
Requires-Dist: spacy
Requires-Dist: gdown
Requires-Dist: tqdm
Requires-Dist: torch
Requires-Dist: jsonlines
Requires-Dist: transformers
Requires-Dist: sentencepiece
Requires-Dist: protobuf
Requires-Dist: pyarrow
Dynamic: author
Dynamic: license-file

# SAF-Datasets
### Dataset loading and annotation facilities for the Simple Annotation Framework

The *saf-datasets* library provides easy access to Natural Language Processing (NLP) datasets, along with tools that facilitate annotation at the document, sentence and token levels.

It is being developed to address a need for flexibility in manipulating NLP annotations that is not entirely covered by popular dataset libraries, such as HuggingFace Datasets and torch Datasets, namely:

- Including and modifying annotations on existing datasets.
- Standardized API.
- Support for complex and multi-level annotations.

*saf-datasets* is built upon the [Simple Annotation Framework (SAF)](https://github.com/dscarvalho/saf) library, which provides its data model and API.

It also provides annotator classes to automatically label existing and new datasets.


## Installation

To install, you can use pip:

```bash
pip install saf-datasets
```

## Usage
### Loading datasets

```python
from saf_datasets import STSBDataSet

dataset = STSBDataSet()
print(len(dataset))  # Size of the dataset
# 17256
print(dataset[0].surface)  # First sentence in the dataset
# A plane is taking off.
print([token.surface for token in dataset[0].tokens])  # Tokens (spaCy) of the first sentence
# ['A', 'plane', 'is', 'taking', 'off', '.']
print(dataset[0].annotations)  # Annotations for the first sentence
# {'split': 'train', 'genre': 'main-captions', 'dataset': 'MSRvid', 'year': '2012test', 'sid': '0001', 'score': '5.000', 'id': 0}

# There are no token annotations in this dataset
print([(tok.surface, tok.annotations) for tok in dataset[0].tokens])
# [('A', {}), ('plane', {}), ('is', {}), ('taking', {}), ('off', {}), ('.', {})]
```

**Available datasets:** AllNLI, CODWOE, CPAE, EntailmentBank, STSB, Wiktionary, WordNet (Filtered).
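
Because each sentence exposes its annotations as a plain dictionary (as in the output above), selecting a subset can be sketched with ordinary Python. The snippet below is a minimal illustration, not part of the library API: `filter_by_annotation` is a hypothetical helper, and `SimpleNamespace` objects with illustrative values stand in for loaded sentences so that it runs without downloading any data. It assumes only the `.surface` and `.annotations` attributes shown above.

```python
from types import SimpleNamespace


def filter_by_annotation(sentences, key, value):
    """Return the sentences whose annotation `key` equals `value`."""
    return [s for s in sentences if s.annotations.get(key) == value]


# Stand-ins with illustrative values; with the library installed, the items
# would instead come from a loaded dataset such as STSBDataSet().
sentences = [
    SimpleNamespace(surface="A plane is taking off.", annotations={"split": "train"}),
    SimpleNamespace(surface="A man is playing a flute.", annotations={"split": "train"}),
    SimpleNamespace(surface="A girl is styling her hair.", annotations={"split": "dev"}),
]

train = filter_by_annotation(sentences, "split", "train")
print([s.surface for s in train])
# ['A plane is taking off.', 'A man is playing a flute.']
```

The helper only iterates over its input, so it should work on any iterable of objects exposing an `annotations` dictionary.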

### Annotating datasets

```python
from saf_datasets import STSBDataSet
from saf_datasets.annotators import SpacyAnnotator

dataset = STSBDataSet()
annotator = SpacyAnnotator()  # Requires spaCy and its en_core_web_sm model to be installed.
annotator.annotate(dataset)

# Now tokens are annotated
for tok in dataset[0].tokens:
    print(tok.surface, tok.annotations)

# A {'pos': 'DET', 'lemma': 'a', 'dep': 'det', 'ctag': 'DT'}
# plane {'pos': 'NOUN', 'lemma': 'plane', 'dep': 'nsubj', 'ctag': 'NN'}
# is {'pos': 'AUX', 'lemma': 'be', 'dep': 'aux', 'ctag': 'VBZ'}
# taking {'pos': 'VERB', 'lemma': 'take', 'dep': 'ROOT', 'ctag': 'VBG'}
# off {'pos': 'ADP', 'lemma': 'off', 'dep': 'prt', 'ctag': 'RP'}
# . {'pos': 'PUNCT', 'lemma': '.', 'dep': 'punct', 'ctag': '.'}
```
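
Once tokens carry annotations, they can be aggregated like any other dictionary data. As a rough sketch, the following counts part-of-speech tags across sentences. `pos_counts` is a hypothetical helper, not a library function, and the stand-in sentence below mirrors the annotated output above so that the example runs without spaCy or the dataset.

```python
from collections import Counter
from types import SimpleNamespace


def pos_counts(sentences):
    """Count the 'pos' annotation over all tokens, skipping tokens without one."""
    counts = Counter()
    for sent in sentences:
        for tok in sent.tokens:
            pos = tok.annotations.get("pos")
            if pos is not None:
                counts[pos] += 1
    return counts


# Stand-in sentence reproducing the (surface, pos) pairs from the output above.
tokens = [
    SimpleNamespace(surface=surface, annotations={"pos": pos})
    for surface, pos in [("A", "DET"), ("plane", "NOUN"), ("is", "AUX"),
                         ("taking", "VERB"), ("off", "ADP"), (".", "PUNCT")]
]
sentence = SimpleNamespace(tokens=tokens)

print(pos_counts([sentence]))
```

Using `.get("pos")` keeps the helper usable on datasets that have not been annotated yet: tokens with empty annotation dictionaries are simply skipped.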

### Using with other libraries

*saf-datasets* provides wrappers for using its datasets with libraries that expect HuggingFace or PyTorch datasets:

```python
from saf_datasets import CPAEDataSet
from saf_datasets.wrappers.torch import TokenizedDataSet
from transformers import AutoTokenizer

dataset = CPAEDataSet()
tokenizer = AutoTokenizer.from_pretrained("gpt2", padding_side="left", add_prefix_space=True)
tok_ds = TokenizedDataSet(dataset, tokenizer, max_len=128, one_hot=False)
print(tok_ds[:10])
# tensor([[50256, 50256, 50256,  ...,  2263,   572,    13],
#         [50256, 50256, 50256,  ...,  2263,   572,    13],
#         [50256, 50256, 50256,  ...,   781,  1133,    13],
#         ...,
#         [50256, 50256, 50256,  ...,  2712, 19780,    13],
#         [50256, 50256, 50256,  ...,  2685,    78,    13],
#         [50256, 50256, 50256,  ...,  2685,    78,    13]])

print(tok_ds[:10].shape)
# torch.Size([10, 128])
```
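
The runs of `50256` at the start of each row in the output above are consistent with left padding: `50256` is GPT-2's end-of-text token id, and the tokenizer was created with `padding_side="left"`. The pure-Python sketch below illustrates the idea only. `left_pad` is a hypothetical helper, and both the reuse of the end-of-text id as padding and the truncation strategy are assumptions about what the wrapper does internally.

```python
def left_pad(token_ids, max_len, pad_id):
    """Truncate to max_len, then pad on the left with pad_id up to max_len."""
    ids = list(token_ids)[:max_len]
    return [pad_id] * (max_len - len(ids)) + ids


PAD_ID = 50256  # GPT-2's end-of-text id, assumed here to double as the padding id

row = left_pad([2263, 572, 13], max_len=8, pad_id=PAD_ID)
print(row)
# [50256, 50256, 50256, 50256, 50256, 2263, 572, 13]
```

Left padding keeps the real tokens at the end of each row, which is the usual choice for decoder-only models such as GPT-2 when generating continuations.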
