Metadata-Version: 2.4
Name: owid-catalog
Version: 0.4.4
Summary: Core data types used by OWID for managing data.
Project-URL: Homepage, https://github.com/owid/etl/tree/master/lib/catalog
Project-URL: Repository, https://github.com/owid/etl.git
Author-email: Our World in Data <tech@ourworldindata.org>
License-Expression: MIT
License-File: LICENSE
Requires-Python: <4.0,>=3.10
Requires-Dist: dataclasses-json>=0.6.7
Requires-Dist: dynamic-yaml>=1.3.5
Requires-Dist: ipdb>=0.13.9
Requires-Dist: jinja2>=3.1.6
Requires-Dist: jsonschema>=3.2.0
Requires-Dist: mistune>=3.0.1
Requires-Dist: owid-datautils
Requires-Dist: owid-repack
Requires-Dist: pandas>=2.2.3
Requires-Dist: pyarrow>=10.0.1
Requires-Dist: pyreadr>=0.5.0
Requires-Dist: pyyaml>=6.0.1
Requires-Dist: rapidfuzz>=3.14.3
Requires-Dist: rdata>=0.11.2
Requires-Dist: requests>=2.26.0
Requires-Dist: structlog>=21.5.0
Requires-Dist: typing-extensions>=4.7.1
Requires-Dist: unidecode>=1.3.4
Description-Content-Type: text/markdown

[![Build status](https://badge.buildkite.com/66cc67fc572120ca97b9ffff288d5d73cb33e019dd70323053.svg)](https://buildkite.com/our-world-in-data/owid-catalog-unit-tests)
[![PyPI version](https://badge.fury.io/py/owid-catalog.svg)](https://badge.fury.io/py/owid-catalog)
![](https://img.shields.io/badge/python-3.10%20%7C%203.11%20%7C%203.12%20%7C%203.13-blue.svg)

# owid-catalog

_A Pythonic API for working with OWID's data catalog._

Status: experimental, APIs likely to change

## Overview

Our World in Data is building a new data catalog, with the goal of making our datasets reproducible and transparent to the general public. That project is our [etl](https://github.com/owid/etl), which going forward will contain the recipes for all the datasets we republish.

This library allows you to query our data catalog programmatically, and get back data in the form of Pandas data frames, perfect for data pipelines or Jupyter notebook explorations.

```mermaid
graph TB

etl -->|reads| snapshot[upstream datasets]
etl -->|generates| s3[data catalog]
catalog[owid-catalog-py] -->|queries| s3
```

We would love feedback on how we can make this library and the overall data catalog better. Feel free to send us an email at info@ourworldindata.org, or start a [discussion](https://github.com/owid/etl/discussions) on GitHub.

## Quickstart

Install with `pip install owid-catalog`. Then you can get data in two different ways.

### Charts catalog

This API attempts to give you exactly the data you see in a chart on our site.

```python
from owid.catalog import charts

# get the data for one chart by URL
df = charts.get_data('https://ourworldindata.org/grapher/life-expectancy')
```

Notice that the last part of the URL is the chart's slug, its identifier, in this case `life-expectancy`. Using the slug alone also works.

```python
df = charts.get_data('life-expectancy')
```
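
If you collect chart links from several places, it can help to normalize them to slugs before requesting data. The following is a minimal sketch using only the standard library. `chart_slug` is a hypothetical helper, not part of `owid.catalog`, and it assumes chart URLs follow the `/grapher/<slug>` pattern shown above.

```python
from urllib.parse import urlparse


def chart_slug(url_or_slug: str) -> str:
    """Return the chart slug for a grapher URL, or the input if it is already a slug.

    Hypothetical helper for illustration; not part of owid.catalog.
    """
    # urlparse separates the path from any query string or fragment,
    # so something like "?tab=map" does not end up in the slug.
    path = urlparse(url_or_slug).path
    # The slug is the last non-empty path segment.
    return path.rstrip("/").split("/")[-1]


slug = chart_slug("https://ourworldindata.org/grapher/life-expectancy?tab=map")
print(slug)
```

The resulting string can then be passed to `charts.get_data` in the same way as the bare slug above.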


### Data science API

We also curate much more data than is available on our site. To access that in efficient binary (Feather) format, use our data science API.

This API is designed for use in Jupyter notebooks.

```python
from owid import catalog

# look for Covid-19 data, return a data frame of matches
catalog.find('covid')

# load Covid-19 data from the Our World in Data namespace as a data frame
df = catalog.find('covid', namespace='owid').load()

# search is case-insensitive and supports regex by default
catalog.find(table='gdp.*capita')

# use fuzzy search for typo-tolerant matching (sorted by relevance)
catalog.find(table='forest area', fuzzy=True)
catalog.find(dataset='wrld bank', fuzzy=True, threshold=60)
```
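
To build intuition for the two search modes, here is a rough standard-library sketch of what case-insensitive regex filtering and threshold-based fuzzy filtering do. It is not the library's implementation: `owid-catalog` depends on `rapidfuzz` for fuzzy matching, while this sketch substitutes `difflib`, so the actual scores will differ. The table and dataset names, and the helpers `regex_filter` and `fuzzy_filter`, are made up for illustration.

```python
import re
from difflib import SequenceMatcher

# Hypothetical table and dataset names, for illustration only.
TABLES = ["gdp_per_capita", "GDP_Per_Capita_PPP", "gdp_total", "forest_area"]
DATASETS = ["World Bank", "World Health Organization", "Penn World Table"]


def regex_filter(names, pattern):
    """Keep names where `pattern` matches anywhere, ignoring case."""
    rx = re.compile(pattern, flags=re.IGNORECASE)
    return [name for name in names if rx.search(name)]


def fuzzy_filter(names, query, threshold):
    """Score each name against `query` on a 0-100 scale, keep names scoring
    at or above `threshold`, and return them best match first."""
    scored = [
        (name, 100 * SequenceMatcher(None, query.lower(), name.lower()).ratio())
        for name in names
    ]
    kept = [(name, score) for name, score in scored if score >= threshold]
    return sorted(kept, key=lambda pair: pair[1], reverse=True)


regex_hits = regex_filter(TABLES, "gdp.*capita")
fuzzy_hits = fuzzy_filter(DATASETS, "wrld bank", threshold=60)
print(regex_hits)
print(fuzzy_hits)
```

In this sketch, raising the threshold keeps only closer matches, while lowering it admits more distant ones at the cost of more noise.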

There may be multiple versions of the same dataset in a catalog, each with a unique path. To load the same dataset again later, record its path and load it this way:

```python
from owid import catalog

path = 'garden/ihme_gbd/2023-05-15/gbd_mental_health_prevalence_rate/gbd_mental_health_prevalence_rate'

rc = catalog.RemoteCatalog()
df = rc[path]
```
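
When recording paths, it can be useful to split them into their components, for example to log which version of a dataset an analysis used. The sketch below assumes the path layout is `channel/namespace/version/dataset/table`, which matches the example above but is an assumption rather than a documented guarantee. `CatalogPath` and `parse_catalog_path` are hypothetical names, not part of `owid.catalog`.

```python
from typing import NamedTuple


class CatalogPath(NamedTuple):
    """Components of a catalog path (assumed layout, for illustration)."""

    channel: str
    namespace: str
    version: str
    dataset: str
    table: str


def parse_catalog_path(path: str) -> CatalogPath:
    """Split 'channel/namespace/version/dataset/table' into named parts."""
    parts = path.strip("/").split("/")
    if len(parts) != 5:
        raise ValueError(f"expected 5 path components, got {len(parts)}: {path!r}")
    return CatalogPath(*parts)


path = "garden/ihme_gbd/2023-05-15/gbd_mental_health_prevalence_rate/gbd_mental_health_prevalence_rate"
parsed = parse_catalog_path(path)
print(parsed.channel, parsed.namespace, parsed.version)
```

Joining the fields back together with `"/".join(parsed)` reproduces the original path, so the parsed form can be stored without losing information.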

## Development

You need Python 3.10+, `uv` and `make` installed. Clone the repo, then you can simply run:

```
# run all unit tests and CI checks
make test

# watch for changes, then run all checks
make watch
```

## Changelog

<details>
<summary>Click to expand changelog</summary>

- `v0.4.4`
    - Enhanced `find()` with better search capabilities:
      - Case-insensitive search by default (use `case=True` for case-sensitive)
      - Regex support enabled by default for `table` and `dataset` parameters
      - New fuzzy search with `fuzzy=True` - typo-tolerant matching sorted by relevance
      - Configurable fuzzy threshold (0-100) to control match strictness
    - New dependency: `rapidfuzz` for fuzzy string matching
- `v0.4.3`
    - Fixed minor bugs
- `v0.4.0`
    - **Highlights**
      - Support for Python 3.10-3.13 (was 3.11-3.13)
      - Drop support for Python 3.9 (breaking change)
    - **Others**
      - Deprecate Walden.
      - Dependencies: Change `rdata` for `pyreadr`.
      - Support: indicator dimensions.
      - Support: MDIMs.
      - Switched from Poetry to UV package manager.
      - New decorator `@keep_metadata` to propagate metadata in pandas functions.
    - Fixes: `Table.apply`, `groupby.apply`, metadata propagation, type hinting, etc.
- `v0.3.11`
    - Add support for Python 3.12 in `pyproject.toml`
- `v0.3.10`
    - Add experimental chart data API in `owid.catalog.charts`
- `v0.3.9`
    - Switch from isort & black & flake8 to ruff
- `v0.3.8`
    - Pin dataclasses-json==0.5.8 to fix error with python3.9
- `v0.3.7`
    - Fix bugs.
    - Improve metadata propagation.
    - Improve metadata YAML file handling, to have common definitions.
    - Remove `DatasetMeta.origins`.
- `v0.3.6`
    - Fixed tons of bugs
    - `processing.py` module with pandas-like functions that propagate metadata
    - Support for Dynamic YAML files
    - Support for R2 alongside S3
- `v0.3.5`
    - Remove `catalog.frames`; use `owid-repack` package instead
    - Relax dependency constraints
    - Add optional `channel` argument to `DatasetMeta`
    - Stop supporting metadata in Parquet format, load JSON sidecar instead
    - Fix errors when creating new Table columns
- `v0.3.4`
    - Bump `pyarrow` dependency to enable Python 3.11 support
- `v0.3.3`
    - Add more arguments to `Table.__init__` that are often used in ETL
    - Add `Dataset.update_metadata` function for updating metadata from YAML file
    - Python 3.11 support via update of `pyarrow` dependency
- `v0.3.2`
    - Fix a bug in `Catalog.__getitem__()`
    - Replace `mypy` type checker by `pyright`
- `v0.3.1`
    - Sort imports with `isort`
    - Change black line length to 120
    - Add `grapher` channel
    - Support path-based indexing into catalogs
- `v0.3.0`
    - Update `OWID_CATALOG_VERSION` to 3
    - Support multiple formats per table
    - Support reading and writing `parquet` files with embedded metadata
    - Optional `repack` argument when adding tables to dataset
    - Underscore `|`
    - Get `version` field from `DatasetMeta` init
    - Resolve collisions of `underscore_table` function
    - Convert `version` to `str` and load json `dimensions`
- `v0.2.9`
    - Allow multiple channels in `catalog.find` function
- `v0.2.8`
    - Update `OWID_CATALOG_VERSION` to 2
- `v0.2.7`
    - Split datasets into channels (`garden`, `meadow`, `open_numbers`, ...) and make garden default one
    - Add `.find_latest` method to Catalog
- `v0.2.6`
    - Add flag `is_public` for public/private datasets
    - Enforce snake_case for table, dataset and variable short names
    - Add fields `published_by` and `published_at` to Source
    - Added a list of supported and unsupported operations on columns
    - Updated `pyarrow`
- `v0.2.5`
    - Fix ability to load remote CSV tables
- `v0.2.4`
    - Update the default catalog URL to use a CDN
- `v0.2.3`
    - Fix methods for finding and loading data from a `LocalCatalog`
- `v0.2.2`
    - Repack frames to compact dtypes on `Table.to_feather()`
- `v0.2.1`
    - Fix key typo used in version check
- `v0.2.0`
    - Copy dataset metadata into tables, to make tables more traceable
    - Add API versioning, and a requirement to update if your version of this library is too old
- `v0.1.1`
    - Add support for Python 3.8
- `v0.1.0`
    - Initial release, including searching and fetching data from a remote catalog

</details>
