Metadata-Version: 2.1
Name: wimbd
Version: 0.1.0
Summary: An Elasticsearch wrapper for querying ES indices
Home-page: https://wimbd.apps.allenai.org/
Author: Yanai Elazar, Akshita Bhagia, Ian Magnusson, Abhilasha Ravichander, Dustin Schwenk, Alane Suhr, Pete Walsh, Dirk Groeneveld, Luca Soldaini, Sameer Singh, Hanna Hajishirzi, Noah A. Smith, Jesse Dodge
Author-email: yanaiela@gmail.com
Classifier: Programming Language :: Python :: 3
Classifier: License :: OSI Approved :: Apache Software License
Classifier: Operating System :: OS Independent
Requires-Python: >=3.9
Description-Content-Type: text/markdown
License-File: LICENSE

Useful functions wrapping around Elasticsearch
==============================================

Connect to the server with a read-only account
----------------------------------------------

### Get access to the indices
* Dolma index: https://forms.gle/gQN4nP4HHYGwXAis9
* Other indices: https://forms.gle/yMz7uTFhd1dKNYTk7


```Python
from wimbd.es import es_init
es = es_init()
```

Find out which indices exist (with other information about the index)
---------------------------------------------------------------------
```Python
from wimbd.es import get_indices

# This returns all indices, along with their total document counts.
print(get_indices())

# This also returns the Elasticsearch mapping information for each index.
print(get_indices(return_mapping=True))
```

Note that `get_indices` won't work with the access key we provide,
since that key only grants access to specific indices.
However, you can find the names of the relevant indices below.

With unrestricted access, the function currently returns the following indices:
```Python
{'re_pile': {'docs.count': '211036967'},
 're_laion2b_multi': {'docs.count': '2248498161'},
 'openwebtext': {'docs.count': '8013769'},
 're_laion2b-en-1': {'docs.count': '1161075864'},
 're_laion2b-en-2': {'docs.count': '1161076588'},
 'c4': {'docs.count': '1074273501'},
 're_laion2b_nolang': {'docs.count': '1271703630'},
 're_oscar': {'docs.count': '431992659'}}
```
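
The `docs.count` values in this listing are strings, not integers, so they need converting before any arithmetic. A minimal sketch of totaling them, assuming the dictionary shape shown above (the `total_docs` helper is hypothetical and not part of `wimbd`):

```python
# Document counts copied from the listing above; note that the values are strings.
indices = {
    "re_laion2b-en-1": {"docs.count": "1161075864"},
    "re_laion2b-en-2": {"docs.count": "1161076588"},
    "openwebtext": {"docs.count": "8013769"},
}


def total_docs(indices, prefix=""):
    """Sum docs.count over the indices whose name starts with `prefix`."""
    return sum(
        int(info["docs.count"])
        for name, info in indices.items()
        if name.startswith(prefix)
    )


# Total number of documents across the two English LAION shards.
laion_en_total = total_docs(indices, prefix="re_laion2b-en-")
print(laion_en_total)
```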

Different Indices
-----------------
We have three different indices that we can make publicly available. Each contains different corpora:
* The Pile, OpenWebText, C4 and Oscar (`re_pile`, `openwebtext`, `c4`, and `re_oscar`)
* RedPajama v1 (`redpajama-split`)
* Dolma (`docs_v1.5_2023-11-02`)

Indices Mapping
---------------
```json
{
    "mappings": {
        "dynamic": "false",
        "properties": {
            "date": {
                "type": "date"
            },
            "subset": {
                "type": "keyword",
                "ignore_above": 256
            },
            "text": {
                "type": "text"
            },
            "url": {
                "type": "text"
            }
        }
    }
}
```
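
The mapping determines how each field can be queried: `text` and `url` are analyzed `text` fields, so they support full-text clauses such as `match_phrase`, while `subset` is a `keyword` field and matches only on the exact stored value. A rough sketch in standard Elasticsearch query DSL follows. The `build_query` helper and the subset value `"example-subset"` are hypothetical, and this is not necessarily how `wimbd` builds its queries internally.

```python
def build_query(phrase, subset=None):
    """Build a query body: phrase match on `text`, optional exact filter on `subset`."""
    bool_query = {"must": [{"match_phrase": {"text": phrase}}]}
    if subset is not None:
        # `subset` is a keyword field, so a `term` clause compares the whole value.
        bool_query["filter"] = [{"term": {"subset": subset}}]
    return {"query": {"bool": bool_query}}


# "example-subset" is a placeholder, not a real subset name from any index.
query = build_query("terms of use", subset="example-subset")
print(query)
```

A body of this shape is what the Elasticsearch `_search` endpoint accepts; the functions described in the rest of this README hide that detail.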
 
Search over one index
---------------------

Search for one or more terms, or sequences of terms (phrases). When you search for
a sequence of terms, their exact order is matched. 

```Python
from wimbd.es import count_documents_containing_phrases

# Count the number of documents containing the term "legal".
count_documents_containing_phrases("test-index", "legal")  # single term

# Count the number of documents containing the term "legal" OR the term "license".
count_documents_containing_phrases("test-index", ["legal", "license"])  # list of terms

# Count the number of documents containing the phrase "terms of use" OR "legally binding".
count_documents_containing_phrases("test-index", ["terms of use", "legally binding"])  # list of word sequences

# Count the number of documents containing both `winter` AND `spring` in the text.
count_documents_containing_phrases("test-index", ["winter", "spring"], all_phrases=True)
```
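
To make the matching semantics concrete, here is a toy in-memory model of the behaviour described above: phrases match only when their words appear consecutively and in order, and `all_phrases` switches from OR to AND. It is an illustration only. It splits on whitespace, whereas Elasticsearch tokenizes text with an analyzer, and the helpers `contains_phrase` and `count_matching` are hypothetical and not part of `wimbd`.

```python
def contains_phrase(text, phrase):
    """True if the words of `phrase` appear consecutively, in order, in `text`."""
    words = text.lower().split()
    target = phrase.lower().split()
    n = len(target)
    return any(words[i:i + n] == target for i in range(len(words) - n + 1))


def count_matching(docs, phrases, all_phrases=False):
    """Count documents matching any phrase (default) or every phrase."""
    combine = all if all_phrases else any
    return sum(combine(contains_phrase(doc, p) for p in phrases) for doc in docs)


docs = [
    "winter follows autumn",
    "spring follows winter",
    "these terms of use are legally binding",
    "use of terms",
]

print(count_matching(docs, ["winter", "spring"]))                    # OR: 2 documents
print(count_matching(docs, ["winter", "spring"], all_phrases=True))  # AND: 1 document
print(count_matching(docs, ["terms of use"]))                        # order matters: 1 document
```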

To inspect the matching documents themselves, use `get_documents_containing_phrases` with the same queries as above.

```Python
from wimbd.es import get_documents_containing_phrases

# Get documents containing the term "legal".
get_documents_containing_phrases("test-index", "legal")  # single term

# Specify the number of documents to return using `num_documents`. Default is 10.
# Get documents containing the term "legal" OR the term "license".
get_documents_containing_phrases("test-index", ["legal", "license"], num_documents=50)  # list of terms

# Get documents containing the phrase "terms of use" OR "legally binding".
get_documents_containing_phrases("test-index", ["terms of use", "legally binding"])  # list of word sequences

# Get documents containing both `winter` AND `spring` in the text.
get_documents_containing_phrases("test-index", ["winter", "spring"], all_phrases=True)
```

Get total number of a term's occurrences (as opposed to document counts)
------------------------------------------------------------------------
```Python
from wimbd.es import count_total_occurrences_of_unigrams

count_total_occurrences_of_unigrams("test-index", ["legal", "license"])
```
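
The difference between the two kinds of count can be seen on a toy corpus. This sketch is a local illustration only: it splits on whitespace rather than using an Elasticsearch analyzer, and the helpers `document_count` and `total_occurrences` are hypothetical and not part of `wimbd`.

```python
from collections import Counter

docs = [
    "legal notice and legal terms",
    "license and legal text",
    "no match here",
]


def document_count(docs, term):
    """Number of documents in which `term` appears at least once."""
    return sum(term in doc.split() for doc in docs)


def total_occurrences(docs, term):
    """Number of times `term` appears, summed over all documents."""
    return sum(Counter(doc.split())[term] for doc in docs)


print(document_count(docs, "legal"))     # 2 documents contain "legal"
print(total_occurrences(docs, "legal"))  # "legal" appears 3 times in total
```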

Search over multiple indices
----------------------------

Because LAION has more documents than can fit into one Elasticsearch index, it is split over multiple indices.
Fortunately, you can query more than one index at a time.

```Python
from wimbd.es import count_documents_containing_phrases

count_documents_containing_phrases("re_laion2b-en-*", "the woman")
```
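
The `*` in the index name is a wildcard. As a rough local check of which of the index names listed earlier such a pattern would cover, the standard-library `fnmatch` module can stand in for Elasticsearch's own wildcard resolution. This is an approximation: `fnmatch` also supports `?` and `[...]`, but for a plain trailing `*` the behaviour is the same.

```python
from fnmatch import fnmatchcase

# Index names copied from the listing earlier in this README.
index_names = [
    "re_pile",
    "re_laion2b_multi",
    "openwebtext",
    "re_laion2b-en-1",
    "re_laion2b-en-2",
    "c4",
    "re_laion2b_nolang",
    "re_oscar",
]

pattern = "re_laion2b-en-*"
matched = [name for name in index_names if fnmatchcase(name, pattern)]
print(matched)  # ['re_laion2b-en-1', 're_laion2b-en-2']
```

Note that `re_laion2b_multi` and `re_laion2b_nolang` are not covered by this pattern, because their names use an underscore where the pattern has `-en-`.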
