Metadata-Version: 2.1
Name: rnc
Version: 0.10.0
Summary: API for Russian National Corpus
Home-page: https://kunansy.github.io/RNC/
License: MIT
Keywords: api,corpus,ruscorpora,linguistics,russian-national-corpus
Author: kunansy
Author-email: kolobov.kirill@list.ru
Requires-Python: >=3.7,<4.0
Classifier: Development Status :: 5 - Production/Stable
Classifier: Intended Audience :: Education
Classifier: Intended Audience :: Science/Research
Classifier: License :: OSI Approved :: MIT License
Classifier: Natural Language :: Bulgarian
Classifier: Natural Language :: Chinese (Simplified)
Classifier: Natural Language :: Chinese (Traditional)
Classifier: Natural Language :: Czech
Classifier: Natural Language :: English
Classifier: Natural Language :: Finnish
Classifier: Natural Language :: French
Classifier: Natural Language :: German
Classifier: Natural Language :: Italian
Classifier: Natural Language :: Latvian
Classifier: Natural Language :: Lithuanian
Classifier: Natural Language :: Polish
Classifier: Natural Language :: Spanish
Classifier: Natural Language :: Swedish
Classifier: Natural Language :: Ukrainian
Classifier: Operating System :: OS Independent
Classifier: Programming Language :: Python
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3 :: Only
Classifier: Programming Language :: Python :: 3.10
Classifier: Programming Language :: Python :: 3.7
Classifier: Programming Language :: Python :: 3.8
Classifier: Programming Language :: Python :: 3.9
Classifier: Topic :: Education
Classifier: Topic :: Scientific/Engineering
Requires-Dist: aiofiles (>=0.8.0,<0.9.0)
Requires-Dist: aiohttp (>=3.8.1,<3.9.0)
Requires-Dist: beautifulsoup4 (>=4.11.1,<4.12.0)
Requires-Dist: bs4 (>=0.0.1,<0.1.0)
Requires-Dist: lxml (>=4.9.1,<4.10.0)
Requires-Dist: types-aiofiles (>=0.8.4,<0.9.0)
Requires-Dist: types-ujson (>=5.4.0,<5.5.0)
Requires-Dist: ujson (>=5.4.0,<5.5.0)
Project-URL: Bug Tracker, https://github.com/kunansy/RNC/issues
Project-URL: Changelog, https://github.com/kunansy/RNC/blob/master/CHANGELOG.md
Project-URL: Documentation, https://kunansy.github.io/RNC/
Project-URL: Funding, https://paypal.me/ansth
Project-URL: Repository, https://github.com/kunansy/RNC/
Description-Content-Type: text/markdown

# API for [Russian National Corpus](http://ruscorpora.ru) 
![Downloads](https://pepy.tech/badge/rnc)
![Tests status](https://github.com/kunansy/rnc/actions/workflows/python-app.yml/badge.svg)
![MIT licensed](https://img.shields.io/badge/license-MIT-blue.svg)
![Latest Release](https://img.shields.io/github/v/release/kunansy/rnc?color=%233D9970)
![PyPi status](https://img.shields.io/pypi/status/rnc.svg?style=flat-square)
![Supported python versions](https://img.shields.io/pypi/pyversions/rnc.svg?style=flat-square)

## Installation
```bash
pip install rnc
```

## Structure
A Corpus object contains a list of the obtained examples.
There are two types of examples:
![](https://github.com/kunansy/RNC/blob/master/docs/Two_ex_types.png?raw=true) <br> 
* If `out` is `normal`, the API uses a normal example, whose name matches the Corpus class name:

```python
ru = rnc.MainCorpus(...)
ru.request_examples()

print(type(ru[0]).__name__)  # MainExample
```
* If `out` is `kwic`, the API uses `KwicExample`.

[Fields](https://github.com/kunansy/RNC/blob/master/docs/Examples.md) of the example objects.


## Usage
```python
import rnc

ru = rnc.MainCorpus(
    query='корпус', 
    p_count=5,
    file='filename.csv',
    marker=str.upper,
    **kwargs
)

ru.request_examples()
```
* `query` – a str or a dict with tags: the words to find, given in their vocabulary (dictionary) form.
* `p_count` – number of **PAGES** to request.
* `file` – path to a local csv file, optional. Example: `file='data\\filename.csv'`.
* `marker` – function used to mark the found wordforms, optional.
* `kwargs` – additional params.
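
Any callable that takes a found wordform as a string and returns a string should be usable as a `marker`, in the same way as `str.upper`. A minimal sketch, assuming that calling convention (the name `bold_marker` is hypothetical and not part of `rnc`):

```python
def bold_marker(wordform: str) -> str:
    """Wrap a found wordform in Markdown bold markers."""
    return f"**{wordform}**"


# The marker is passed as a function object and called later for each match,
# so it must accept one string argument and return a string.
print(bold_marker("корпус"))  # **корпус**
print(str.upper("корпус"))    # КОРПУС
```

It would then be passed as `marker=bold_marker`, without parentheses.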

[Corpora](https://github.com/kunansy/RNC/blob/master/docs/Corpora.md) you can use.


### Full query form
```python
query = {
    'word1': {
        'gramm': 'acc', # grammar tags for lexgramm search
        'flags': 'bdot' # additional tags for lexgramm search
    },
    # a value may be a single string or a dict of params;
    # dict keys may have any name, values are tag names (see the link below)
    'word2': {
        'gramm': {
            # the NAMES of these keys may be arbitrary
            'pos (any name)': ['S', 'A'],  # one value ('S') or a list of values
            'case (any name)': ['acc', 'nom'],  # likewise: 'acc' or a list
        },
        'flags': {},  # same format as 'gramm'
        # distance between first and second words
        'min': 1,  
        'max': 3
    },  
}

corp = rnc.MainCorpus(
    query, 5, file='filename.csv', marker=str.upper, **kwargs)
corp.request_examples()
```
[Lexgramm search params](https://github.com/kunansy/RNC/tree/master/docs/Lexgram%20search%20params)
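
The dict form can also be assembled programmatically. The sketch below uses a hypothetical helper, `word_params`, which is not part of `rnc`; it only builds plain dicts with the `gramm`, `flags`, `min` and `max` keys described above and leaves out the ones that are not set.

```python
from typing import Any, Dict, Optional


def word_params(gramm: Any = None, flags: Any = None,
                min_dist: Optional[int] = None,
                max_dist: Optional[int] = None) -> Dict[str, Any]:
    """Build the params dict for one word, skipping unset params."""
    candidates = {
        "gramm": gramm,
        "flags": flags,
        "min": min_dist,  # distance to the previous word
        "max": max_dist,
    }
    return {key: value for key, value in candidates.items()
            if value is not None}


query = {
    "word1": word_params(gramm="acc", flags="bdot"),
    "word2": word_params(gramm={"pos": ["S", "A"], "case": "acc"},
                         min_dist=1, max_dist=3),
}
print(query)
```

The resulting `query` is an ordinary dict, so it can be passed to a Corpus class exactly like a hand-written one.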


### String as a query
You can also pass a string with the **vocabulary forms** of the 
words, separated by spaces, as a query: `query = 'get down'` or `query = 'я получить'`. 
The default distance between the words is used.


### Additional request params
These params are optional. The values shown here are the defaults.
```python
corp = rnc.ParallelCorpus(
    query=query, 
    p_count=5,
    file='filename.csv',
    marker=str.upper,
    
    dpp=5, # documents per page
    spd=10, # sentences per document (at most spd)
    text='lexgramm', # way to search: 'lexgramm' or 'lexform'
    out='normal', # output format: 'normal' or 'kwic'
    kwsz=5, # if out=kwic, count of words in context
    sort='i_grtagging', # way to sort the results, see HOWTO section below
    mycorp='', # see HOWTO section below
    lang=rnc.Languages.en,
    accent=0, # with accentology (1) or without (0), if it is available
)
```
[Sort keys](https://github.com/kunansy/RNC/blob/master/docs/HTTP%20params.md)


### API can work with a local file too
```python
ru = rnc.SpokenCorpus(file='local_database.csv') # it must exist
print(ru)
```
If the file exists, the API works with it. If the data list is not empty, you 
cannot request new examples. <br>

When working with a file, you do not need to pass any argument to the Corpus 
except the file name (`file=...`).


### Working with corpora
```python
corp = rnc.corpus_name(...) 
```
* `corp.request_examples()` – request examples. 
An exception is raised if:
    * Data already exist. 
    * No results are found.
    * A requested page does not exist (for example, the RNC has 10 pages, but 
      you requested more than 10).
    * There is a mistake in the request.
    * You have no access to the Internet.
    * There is a problem accessing the RNC.
    * Other problems occur.
* `corp.data` – list of examples (only getter)
* `corp.query` – query (only getter).
* `corp.forms_in_query` – requested wordforms (only getter).
* `corp.p_count` – requested count of pages (only getter). 
* `corp.file` – path to the local csv file (only getter).
* `corp.marker` – marker (only getter).
* `corp.params` – dict, HTTP tags (only getter). 
* `corp.found_wordforms` – dict with found wordforms and their frequency (only getter).
* `corp.ex_type` – type of example (only getter).
* `corp.amount_of_docs` – amount of docs where the query was found.
* `corp.amount_of_contexts` – amount of contexts where the query was found.
* `corp.graphic_link` – link to the graph of the distribution of query occurrences by year.
* `corp.dump()` – write two files: csv file with all data and json file with config.
* `corp.copy()` – create a copy.
* `corp.shuffle()` – shuffle data list.
* `corp.sort_data(key=, reverse=)` – sort the list of examples. HTTP sort keys do not work here;
the key is applied to `Example` objects.  
* `corp.pop(index)` – remove and return the example at the index.
* `corp.clear()` – empty the data list.
* `corp.filter(key)` – filter the data list, remove some examples using the key. 
Key is applied to the `Example` objects.
* `corp.url` – URL of the first RNC page (only getter).
* `corp.findall(pattern, args)` – get all examples where the pattern is found, 
  together with the match.
* `corp.finditer(pattern, args)` – get all examples where the pattern is found, 
  together with the match.
* `async corp.request_examples_async()` – make request in the running event loop.
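
The `key` arguments of `sort_data` and `filter` are ordinary callables that receive an example object. A minimal sketch with a stand-in class instead of real `rnc` examples: `StubExample` and its sample values are hypothetical, and the `left` and `src` attributes follow the KWIC fields used elsewhere in this README. The built-in `sorted` and `filter` stand in for the corpus methods, so the sketch does not show whether `corp.filter` keeps or removes the matching examples.

```python
from dataclasses import dataclass


@dataclass
class StubExample:
    """Hypothetical stand-in for an example object; not part of rnc."""
    left: str
    src: str


# Made-up sample data, only for illustration.
examples = [
    StubExample(left="the national", src="Tolstoy"),
    StubExample(left="a large", src="Chekhov"),
    StubExample(left="in the", src="Pushkin"),
]


def by_source(example: StubExample) -> str:
    """Sort key: order examples alphabetically by source."""
    return example.src


def mentions_the(example: StubExample) -> bool:
    """Filter key: true if the left context contains the word 'the'."""
    return "the" in example.left.split()


sorted_examples = sorted(examples, key=by_source)
matching = list(filter(mentions_the, examples))

print([ex.src for ex in sorted_examples])  # ['Chekhov', 'Pushkin', 'Tolstoy']
print([ex.src for ex in matching])         # ['Tolstoy', 'Pushkin']
```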

Magic methods: 
* `corp.dpp` or another request param (only getter).
* `corp()` – same as `request_examples()`.
* `str(corp)` or `print(corp)` – string with info about the Corpus and its enumerated examples.
By default, the Corpus shows the first 50 examples, but you can change this limit 
or turn it off. 

    Info about Corpus:
    ```
    Russian National Corpus (https://ruscorpora.ru)
    Class: CorpusName, len = amount of examples 
    Pages: n of 'words' requested
    ```
* `len(corp)` – count of examples.
* `bool(corp)` – whether data exist.
* `corp[index or slice]` – get element at the index or create a new object 
  with sliced data:
```python
from_2_to_10 = corp[2:10:2]
```
* `del corp[10]` or `del corp[:10]` – remove some examples from the data list.

* You can also iterate over the Corpus with a `for` loop. For example, to see only 
  the left context (`out=kwic`) and the source:
```python
corp = rnc.ParallelCorpus(
    'corpus', 5, 
    out='kwic', kwsz=7, 
    lang=rnc.Languages.en
)
corp.request_examples()

for r in corp:
    print(r.left)
    print(r.src)
```

Set default values for all objects you will create:
* `corpus_name.set_dpp(value)` – change default `documents per page` value.
* `corpus_name.set_spd(value)` – change default `sentences per document` value.
* `corpus_name.set_text(value)` – change default way to search.
* `corpus_name.set_sort(value)` – change default sort key.
* `corpus_name.set_min(value)` – change default min distance between words.
* `corpus_name.set_max(value)` – change default max distance between words.
* `corpus_name.set_restrict_show(value)` – change default number of examples shown when printing. 
If it is equal to `False`, the Corpus shows all examples. 
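
The effect of these class-level setters can be pictured with a stand-in class. This is only a sketch of the observable behaviour under an assumed mechanism (a class attribute read at construction time). `StubCorpus` is hypothetical and does not reproduce the `rnc` implementation.

```python
class StubCorpus:
    """Hypothetical stand-in for a corpus class; not the rnc implementation."""

    _default_dpp = 5  # class-level default shared by all new objects

    def __init__(self, dpp=None):
        # Fall back to the class-level default when no value is passed.
        self.dpp = self._default_dpp if dpp is None else dpp

    @classmethod
    def set_dpp(cls, value):
        """Change the default used by objects created from now on."""
        cls._default_dpp = value


before = StubCorpus()
StubCorpus.set_dpp(10)
after = StubCorpus()
explicit = StubCorpus(dpp=3)

print(before.dpp, after.dpp, explicit.dpp)  # 5 10 3
```

In this sketch an explicitly passed value still wins over the default, and objects created before the call keep their old value.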


### Corpora features
#### ParallelCorpus
* The query may be in either the original language or the language of 
  translation. 

#### MultilingualParaCorpus
* Working with files is not supported.
* Param `mycorp` is not required by default, but it may be passed, see the 
  **HOWTO** section below.

#### MultimodalCorpus
* `corp.download_all()` – download all media files. **It is recommended** to use 
this method instead of `expl.download_file()`.
* `async corp.download_all_async()` – download all media files using the running event loop.


## Logger
* See all log messages
```python
rnc.set_stream_handler_level('debug')
```
* See fewer messages
```python
rnc.set_stream_handler_level('info')
```
* Turn the logger off
```python
rnc.set_logger_level('critical')
```
* Turn off all messages in the stream, but dump logs to file
```python
rnc.set_stream_handler_level('critical')
```
* Turn off dumping logs to file
```python
rnc.set_file_handler_level('critical')
```
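
The difference between a logger level and a handler level can be illustrated with the standard `logging` module. This sketch assumes a logger with one stream handler and one file handler, which is what the function names above suggest; the logger name and the in-memory buffers are hypothetical stand-ins, and nothing here calls `rnc`.

```python
import io
import logging

logger = logging.getLogger("rnc_logging_sketch")  # hypothetical logger name
logger.setLevel(logging.DEBUG)
logger.propagate = False  # keep the sketch isolated from the root logger

console = io.StringIO()   # stands in for the terminal stream
log_file = io.StringIO()  # stands in for the log file

stream_handler = logging.StreamHandler(console)
file_handler = logging.StreamHandler(log_file)

# Silence the stream, but keep writing everything to the "file".
stream_handler.setLevel(logging.CRITICAL)
file_handler.setLevel(logging.DEBUG)

logger.addHandler(stream_handler)
logger.addHandler(file_handler)

logger.info("requesting page 1")

print(repr(console.getvalue()))   # ''
print(repr(log_file.getvalue()))  # 'requesting page 1\n'
```

Raising the logger's own level instead would drop the message before it reaches either handler.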


## ATTENTION
* Do not forget to call this method
```python
corp.request_examples()
```
* If you request more than 10 pages, RNC returns a 429 error 
  (Too Many Requests).
  For example, when requesting 100 pages, expect to wait about 3 minutes: 
![100 pages](https://github.com/kunansy/RNC/blob/master/docs/100_pages.png?raw=true)
* **Do not call** the marker you pass; pass the function itself

**RIGHT:**
```python
ru = rnc.MainCorpus(...,  marker=str.upper)
```
**WRONG:**
```python
ru = rnc.MainCorpus(..., marker=str.upper())
```
* Pass an empty string as a word's params if you do not want to set any
```python
query = {
    'word1': '',
    'word2': {'min': 2, 'max': 5}
}
```
* If `accent=1`, marker does not work.
* Do not run `corp.request_examples()` in a running event loop; use `await corp.request_examples_async()` instead.
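
The last point can be sketched with a stand-in class, since a real request needs network access. `StubCorpus` and its fake data are hypothetical; only the method name `request_examples_async` comes from the library.

```python
import asyncio


class StubCorpus:
    """Hypothetical stand-in for a corpus object; makes no network requests."""

    def __init__(self):
        self.data = []

    async def request_examples_async(self):
        await asyncio.sleep(0)  # placeholder for the real network I/O
        self.data = ["example 1", "example 2"]


async def main() -> int:
    corp = StubCorpus()
    # Inside a coroutine the event loop is already running,
    # so the async variant is awaited here.
    await corp.request_examples_async()
    return len(corp.data)


if __name__ == "__main__":
    print(asyncio.run(main()))  # 2
```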

---

## HOWTO
You can ask any question you want [here](https://github.com/kunansy/RNC/discussions).

### How to set sort?
There are some sort keys:
1. `i_grtagging` – by default.
2. `random` – randomly.
3. `i_grauthor` – by author.
4. `i_grcreated_inv` – by creation date.
5. `i_grcreated` – by creation date in reversed order.
6. `i_grbirthday_inv` – by author's birth date.
7. `i_grbirthday` – by author's birth date in reversed order.

[Some of HTTP params](https://github.com/kunansy/RNC/blob/master/docs/HTTP%20params.md).
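
A mistyped sort key would only show up once the request is sent, so it can be convenient to check it first. The helper below is a hypothetical sketch and not part of `rnc`; it knows only the keys listed above, so it would also reject any valid key missing from that list.

```python
# Sort keys and descriptions copied from the list above.
SORT_KEYS = {
    "i_grtagging": "default",
    "random": "randomly",
    "i_grauthor": "by author",
    "i_grcreated_inv": "by creation date",
    "i_grcreated": "by creation date in reversed order",
    "i_grbirthday_inv": "by author's birth date",
    "i_grbirthday": "by author's birth date in reversed order",
}


def check_sort_key(key: str) -> str:
    """Return the key unchanged if it is listed, otherwise raise ValueError."""
    if key not in SORT_KEYS:
        raise ValueError(
            f"unknown sort key {key!r}; expected one of {sorted(SORT_KEYS)}"
        )
    return key


# The checked key can then go into the additional request params.
request_params = {"sort": check_sort_key("random")}
print(request_params)  # {'sort': 'random'}
```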


### How to set language in ParallelCorpus?
```python
en = rnc.ParallelCorpus('get', 5, lang=rnc.Languages.en)
```
Languages the corpus supports:
1. Armenian
1. Bashkir
1. Belarusian
1. Bulgarian
1. Buryatian
1. Chinese
1. Czech
1. English
1. Estonian
1. Finnish
1. French
1. German
1. Italian
1. Latvian
1. Lithuanian
1. Polish
1. Spanish
1. Swedish
1. Ukrainian

To search in several languages at once, choose them and set up the 
`mycorp` on the site, then pass this param to the Corpus. 


### How to set subcorpus?
A subcorpus specifies the sample of texts in which the query is searched. <br>

There are default keys in `rnc.mycorp` (checked to work in 
**MainCorpus**) for Russian writers and poets: 
* Pushkin
* Dostoyevsky
* TolstoyLN
* Chekhov
* Gogol
* Turgenev

Example:
```python
ru = rnc.MainCorpus('нету', 1, mycorp=rnc.mycorp['Pushkin'])
```

**OR**

```python
ru = rnc.MainCorpus('нету', 1, mycorp=rnc.mycorp.Pushkin)
```

**OR**
 
![1](https://raw.githubusercontent.com/kunansy/RNC/master/docs/How%20to%20set%20subcorpus/1.png)
![2](https://raw.githubusercontent.com/kunansy/RNC/master/docs/How%20to%20set%20subcorpus/2.png)
![3](https://raw.githubusercontent.com/kunansy/RNC/master/docs/How%20to%20set%20subcorpus/3.png)
![4](https://raw.githubusercontent.com/kunansy/RNC/master/docs/How%20to%20set%20subcorpus/4.png)


## Links
* [Russian National Corpus](https://ruscorpora.ru)
* [Docs](https://github.com/kunansy/RNC/tree/master/docs)
* [Fields](https://github.com/kunansy/RNC/blob/master/docs/Examples.md) of the example objects
* [Corpora](https://github.com/kunansy/RNC/blob/master/docs/Corpora.md) you can use.
* [Lexgramm search params](https://github.com/kunansy/RNC/tree/master/docs/Lexgram%20search%20params)
* [Sort keys](https://github.com/kunansy/RNC/blob/master/docs/HTTP%20params.md)
---


## Requirements
* Python >= 3.7


## Licence
`rnc` is offered under the MIT licence.


## Source code
The project is hosted on [GitHub](https://github.com/kunansy/RNC).

---

Please file an issue in the [bug tracker](https://github.com/kunansy/RNC/issues) 
if you have found a bug or have suggestions for improving the library.

