Metadata-Version: 2.1
Name: elasticmagic-qf-attrs
Version: 0.1.2
Summary: 
Home-page: https://github.com/anti-social/elasticmagic-qf-attrs
License: Apache-2.0
Author: Alexander Koval
Author-email: kovalidis@gmail.com
Maintainer: Alexander Koval
Maintainer-email: kovalidis@gmail.com
Requires-Python: >=3.7,<4.0
Classifier: Development Status :: 3 - Alpha
Classifier: Intended Audience :: Developers
Classifier: License :: OSI Approved :: Apache Software License
Classifier: Operating System :: OS Independent
Classifier: Programming Language :: Python
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3.7
Classifier: Programming Language :: Python :: 3.8
Classifier: Programming Language :: Python :: 3.9
Classifier: Topic :: Internet :: WWW/HTTP :: Indexing/Search
Classifier: Topic :: Software Development :: Libraries :: Python Modules
Requires-Dist: elasticmagic (>=0.1.0-alpha.12,<0.2.0)
Project-URL: Repository, https://github.com/anti-social/elasticmagic-qf-attrs
Description-Content-Type: text/markdown

Elasticmagic query filters for attributes
=========================================

Library to store, filter and build facets for custom attributes

The problem
-----------

Each attribute pair can be stored in the index as a nested document.
We can use following mapping for that:

```yaml
attrs:
  type: nested
  properties:
    attr_id:
      type: integer
    # usually only one of the next fields should be populated
    value_int:
      type: integer
    value_bool:
      type: boolean
    value_float:
      type: float
```

This makes it possible to filter documents by an attribute id and its value
(for example we want to find all the documents with `attr_id = 1234` and `value = 5678`):

```yaml
query:
  bool:
    filter:
    - nested:
        path: attrs
        query:
          bool:
            must:
            - term:
                attrs.attr_id: 1234
            - term:
                attrs.value_int: 5678
```

It is also possible to build a facets for all attributes at once:

```yaml
aggs:
  attrs_nested:
    nested:
      path: attrs
    aggs:
      attrs:
        terms:
          field: attrs.attr_id
        aggs:
          values:
            field: attrs.value_int
```

or for a single attribute:

```yaml
aggs:
  attrs_nested:
    nested:
      path: attrs
    aggs:
      attr_1234:
        filter:
          term:
            attrs.attr_id: 1234
        aggs:
          values:
            field: attrs.value_int
```

But nested documents have some drawbacks. Every nested document is stored
in the index as different document. For instance, next document will be stored
as 5 lucene documents:

```yaml
name: "I'm a document with nested attributes"
attrs:
- attr_id: 1
  value_int: 42
- attr_id: 2
  value_int: 43
- attr_id: 3
  value_bool: true
- attr_id: 4
  value_float: 99.9
```

Nested queries are slow by itself:
   
> In particular, joins should be avoided. nested can make queries several times
slower and parent-child relations can make queries hundreds of times slower.
> - https://www.elastic.co/guide/en/elasticsearch/reference/master/tune-for-search-speed.html#_document_modeling

But what is worse regular queries are also slower when there are nested documents
in the index. It is because of all the fields of main documents becomes sparse.
This in turn degrades performance of all filters and accesses to doc_values.

The solution
------------

The idea is to encode pair of an attribute id and a corresponding value into
a single value. If our attribute ids are 32-bit integers and all value types
also fit into 32 bits we can store them as a single 64-bit value.

So our mapping can be:

```yaml
attrs:
  type: object
  properties:
    int:
      type: long
    bool:
      type: long
    float:
      type: long
```

Document with encoded attributes:

```yaml
name: "I'm a document with packed attributes"
attrs:
# (1 << 32) | 42
- int: 0x1_0000002a
# (2 << 32) | 43
- int: 0x2_0000002b
# (3 << 1) | 1
- bool: 0x7
# (4 << 32) | {integer representation of 99.9}
# (4 << 32) | struct.unpack('=I', struct.pack('=f', 99.9))[0]
- float: 0x4_42c7cccd
```

Now with a bit of bit magic we can emulate nested queries.

Filtering by attribute id `1234` with value `5678`:

```yaml
query:
  bool:
    filter:
    - term:
        attrs.int: 0x4d2_0000162e
```

Building facet for all attribute values:

```yaml
aggs:
  attrs_int:
    terms:
      field: attrs.int
      # specify big enough aggregation size
      # so all flat attrite values should fit
      size: 10000
```

One more step that client should do is to decode and group values by
attribute id.

How to use it in python
-----------------------

```python
from elasticsearch import Elasticsearch

from elasticmagic import Cluster, Document, Field
from elasticmagic.types import List, Long
from elasticmagic.ext.queryfilter import QueryFilter

from elasticmagic_qf_attrs import AttrBoolFacetFilter
from elasticmagic_qf_attrs import AttrIntFacetFilter
from elasticmagic_qf_attrs import AttrRangeFacetFilter
from elasticmagic_qf_attrs.util import merge_attr_value_bool
from elasticmagic_qf_attrs.util import merge_attr_value_float
from elasticmagic_qf_attrs.util import merge_attr_value_int

# Specify document
class AttrsDocument(Document):
    __doc_type__ = 'attrs'

    ints = Field(List(Long))
    bools = Field(List(Long))
    floats = Field(List(Long))

# Create an index
index_name = 'test-attrs'
client = Elasticsearch()
client.indices.create(index=index_name)
cluster = Cluster(client)
index = cluster.get_index(index_name)
index.put_mapping(AttrsDocument)

# Index example document
index.add([
    AttrsDocument(
        ints=[
            merge_attr_value_int(1, 42),
            merge_attr_value_int(2, 43),
        ],
        bools=[merge_attr_value_bool(3, True)],
        floats=[merge_attr_value_float(4, 99.9)],
    ),
], refresh=True)

# Define a query filter
class AttrsQueryFilter(QueryFilter):
    ints = AttrIntFacetFilter(AttrsDocument.ints, alias='a')
    bools = AttrBoolFacetFilter(AttrsDocument.bools, alias='a')
    ranges = AttrRangeFacetFilter(AttrsDocument.floats, alias='a')

# Now we can build facets
qf = AttrsQueryFilter()
sq = index.search_query()
sq = qf.apply(sq, {})
res = sq.get_result()
assert res.total == 1
qf_res = qf.process_result(res)

# And finally lets print results
for attr_id, facet in qf_res.ints.facets.items():
    print(f'> {attr_id}:')
    for facet_value in facet.all_values:
        print(f'  {facet_value.value}: ({facet_value.count_text})')

for attr_id, facet in qf_res.bools.facets.items():
    print(f'> {attr_id}:')
    for facet_value in facet.all_values:
        print(f'  {facet_value.value}: ({facet_value.count_text})')

for attr_id, facet in qf_res.ranges.facets.items():
    print(f'> {attr_id}: ({facet.count})')

# Also we can filter documents:
qf = AttrsQueryFilter()
sq = index.search_query()
sq = qf.apply(
    sq,
    {
        'a1': '42',
        'a3': 'true',
        'a4__lte': '100',
    }
)
res = sq.get_result()
assert res.total == 1

qf = AttrsQueryFilter()
sq = index.search_query()
sq = qf.apply(
    sq,
    {
        'a4__gte': '100',
    }
)
res = sq.get_result()
assert res.total == 0
```

This script should print:

```
> 1:
  42: (1)
> 2:
  43: (1)
> 3:
  True: (1)
> 4: (1)
```

