Metadata-Version: 2.1
Name: eredesscraper
Version: 1.0.1
Summary: Web scraper to extract data from E-REDES website and load it into database storage.
Home-page: https://github.com/rf-santos/eredes-scraper
License: MIT
Keywords: web,scraper,influxdb,database,electricity,energy,e-redes
Author: Ricardo Filipe dos Santos
Author-email: ricardofilipecdsantos@gmail.com
Requires-Python: >=3.11,<3.13
Classifier: License :: OSI Approved :: MIT License
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3.11
Classifier: Programming Language :: Python :: 3.12
Requires-Dist: duckdb (>=0.10.2,<0.11.0)
Requires-Dist: fastapi (>=0.110.0,<0.111.0)
Requires-Dist: influxdb-client (>=1.38.0,<2.0.0)
Requires-Dist: openpyxl (>=3.1.2,<4.0.0)
Requires-Dist: pandas (>=2.1.1,<3.0.0)
Requires-Dist: playwright (>=1.44.0,<2.0.0)
Requires-Dist: playwright-stealth (>=1.0.6,<2.0.0)
Requires-Dist: pykwalify (>=1.8.0,<2.0.0)
Requires-Dist: python-multipart (>=0.0.9,<0.0.10)
Requires-Dist: pyyaml (>=6.0.1,<7.0.0)
Requires-Dist: screeninfo (>=0.8.1,<0.9.0)
Requires-Dist: typer[all] (>=0.9.0,<0.10.0)
Requires-Dist: uvicorn (>=0.28.0,<0.29.0)
Project-URL: Documentation, https://github.com/rf-santos/eredes-scraper
Project-URL: Repository, https://github.com/rf-santos/eredes-scraper
Description-Content-Type: text/markdown

# E-REDES Scraper
## Description
This is a web scraper that collects data from the E-REDES website and can upload it to a database.
Since E-REDES exposes no programmatic interface to the data, this web scraper was developed as an approach to collect it.
A high-level overview of the process:
1. The scraper collects the data from the E-REDES website.
2. A file with the energy consumption readings is downloaded.
3. [ Optional ] The file is parsed and the data is uploaded to the selected database. 
4. [ Optional ] Insertion can be restricted to "deltas" only.
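
The README does not define how "deltas" are computed. As a minimal sketch, assuming a delta means "readings newer than the latest one already stored", the idea could look like the following. The `select_deltas` helper and the `(timestamp, value)` tuple layout are hypothetical and are not part of the package:

```python
from datetime import datetime


def select_deltas(readings, last_stored):
    """Keep only readings newer than the latest stored timestamp.

    Hypothetical helper: `readings` is a list of (timestamp, value) tuples,
    and `last_stored` is the newest timestamp already in the database
    (or None when the database is empty).
    """
    if last_stored is None:
        # Nothing stored yet, so every reading is new.
        return list(readings)
    return [(ts, value) for ts, value in readings if ts > last_stored]


# Illustrative data only, not real E-REDES readings.
readings = [
    (datetime(2023, 5, 1, 0, 15), 0.12),
    (datetime(2023, 5, 1, 0, 30), 0.10),
    (datetime(2023, 5, 1, 0, 45), 0.14),
]
new_rows = select_deltas(readings, last_stored=datetime(2023, 5, 1, 0, 30))
```

With this interpretation, re-running a workflow for a month that is already partially loaded would only write the rows that are missing.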

> This package supports the E-REDES website as available at the time of writing (14/06/2023).  
> The entry point for the scraper is the page https://balcaodigital.e-redes.pt/consumptions/history.

## Installation
The package can be installed using pip:
```bash
pip install eredesscraper
```

## Configuration
Usage is based on a YAML configuration file.  
`config.yml` holds the credentials for the E-REDES website and 
the database connection. Currently, **only InfluxDB is supported** as a database sink.  

### Template `config.yml`:
```yaml
eredes:
  # eredes credentials
  nif: <my-eredes-nif>
  pwd: <my-eredes-password>
  # CPE to monitor. e.g. PT00############04TW (where # is a digit). CPE can be found in your bill details
  cpe: PT00############04TW


influxdb:
  # url to InfluxDB.  e.g. http://localhost or https://influxdb.my-domain.com
  host: http://localhost
  # default port is 8086
  port: 8086
  bucket: <my-influx-bucket>
  org: <my-influx-org>
  # access token with write access
  token: <token>
```
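
Before loading a configuration, it can be useful to check that no field from the template was left out. The sketch below is a hypothetical pre-flight check, not the package's own validation. It assumes every key shown in the template is required, and it works on the plain dictionary that a YAML parser (for example `yaml.safe_load` from PyYAML) would return:

```python
# Keys taken from the template above; treating all of them as required
# is an assumption made for this sketch.
REQUIRED_KEYS = {
    "eredes": ("nif", "pwd", "cpe"),
    "influxdb": ("host", "port", "bucket", "org", "token"),
}


def missing_config_keys(config):
    """Return dotted names of required keys that are absent or empty."""
    missing = []
    for section, keys in REQUIRED_KEYS.items():
        values = config.get(section) or {}
        for key in keys:
            if values.get(key) in (None, ""):
                missing.append(f"{section}.{key}")
    return missing


# Example: a configuration with an incomplete `influxdb` section.
config = {
    "eredes": {"nif": "<my-eredes-nif>", "pwd": "<my-eredes-password>",
               "cpe": "PT00############04TW"},
    "influxdb": {"host": "http://localhost", "port": 8086},
}
problems = missing_config_keys(config)
```

An empty list means every key from the template is present.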

## Usage
### CLI:
```bash
ers config load "/path/to/config.yml"

# get current month readings
ers run -d influxdb

# get only deltas from last month readings 
ers run -w previous -d influxdb --delta

# get readings from May 2023
ers run -w select -d influxdb -m 5 -y 2023

# start an API server
ers server -H "localhost" -p 8778 --reload -S <path/to/database>
```

### API:

For more details, refer to the OpenAPI documentation or the UI endpoints available at `http://<host>:<port>/docs` and `http://<host>:<port>/redoc`.

```bash
# main methods:

# load an ers configuration 
# different options to load available:
# - directly in the request body,
# - download remote file,
# - upload local file
curl -X 'POST' \
  'http://localhost:8778/config/upload' \
  -H 'Content-Type: multipart/form-data' \
  -F 'file=@my-config.yml'


# run sync workflow
curl -X 'POST' \
  'http://localhost:8778/run' \
  -H 'Content-Type: application/json' \
  -d '{
  "workflow": "current"
}'

# run async workflow
curl -X 'POST' \
  'http://localhost:8778/run_async' \
  -H 'Content-Type: application/json' \
  -d '{
  "workflow": "select",
  "db": [
    "influxdb"
  ],
  "month": 5,
  "year": 2023,
  "delta": true,
  "download": true
}'

# get task status (`task_id` returned in /run_async response body)
curl -X 'GET' \
  'http://localhost:8778/status/<task_id>'

# download the file retrieved by the workflow
curl -X 'GET' \
  'http://localhost:8778/download/<task_id>'
```
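
The same `/run_async` call can be prepared from Python using only the standard library. This is a minimal sketch, assuming the server from the CLI example is listening on `localhost:8778` and that it answers with a JSON body. The helper names `build_post` and `send` are hypothetical and not part of the package:

```python
import json
from urllib import request

BASE_URL = "http://localhost:8778"  # assumed host and port, as in the curl examples


def build_post(path, payload, base_url=BASE_URL):
    """Build a JSON POST request for the given endpoint path."""
    return request.Request(
        url=f"{base_url}{path}",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def send(req):
    """Send the request and decode the response, assuming it is JSON."""
    with request.urlopen(req) as response:
        return json.loads(response.read().decode("utf-8"))


# Same payload as the `run_async` curl example above.
req = build_post("/run_async", {
    "workflow": "select",
    "db": ["influxdb"],
    "month": 5,
    "year": 2023,
    "delta": True,
    "download": True,
})
# result = send(req)  # requires a running server; the body should contain `task_id`
```

Building the request separately from sending it makes it possible to inspect the URL and payload without a running server.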

### Python:

```python
from eredesscraper.workflows import switchboard
from pathlib import Path

# get deltas from current month readings
# note: `db` takes a list of database names, e.g. ["influxdb"]
switchboard(config_path=Path("./config.yml"),
            name="current",
            db=["influxdb"],
            delta=True,
            keep=True)

# get readings from May 2023
switchboard(config_path=Path("./config.yml"),
            name="select",
            db=["influxdb"],
            month=5,
            year=2023)
```

## Features
### Available workflows:
- `current`: Collects the consumption data for the current month.
- `previous`: Collects the consumption data for the previous month.
- `select`: Collects the consumption data for an arbitrary month specified by the user.
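
To make the difference between the three workflows concrete, the sketch below maps each workflow name to the `(month, year)` pair it targets. `resolve_period` is a hypothetical helper written for illustration. It is not taken from the package, whose internal logic may differ:

```python
from datetime import date


def resolve_period(workflow, today, month=None, year=None):
    """Return the (month, year) a workflow name refers to (illustrative only)."""
    if workflow == "current":
        return today.month, today.year
    if workflow == "previous":
        # January rolls back to December of the previous year.
        if today.month == 1:
            return 12, today.year - 1
        return today.month - 1, today.year
    if workflow == "select":
        if month is None or year is None:
            raise ValueError("'select' requires both month and year")
        if not 1 <= month <= 12:
            raise ValueError("month must be between 1 and 12")
        return month, year
    raise ValueError(f"unknown workflow: {workflow!r}")


period = resolve_period("previous", today=date(2024, 1, 15))
```

Here `period` is December 2023, since the previous month of January 2024 falls in the prior year.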

### Available databases:
- `influxdb`: Loads the data in an InfluxDB database. (https://docs.influxdata.com/influxdb/v2/get-started/)

## Roadmap
- [X] ~~Add workflow for retrieving previous month data.~~
- [X] ~~Add workflow for retrieving data from an arbitrary month.~~
- [X] ~~Build CLI~~.
- [X] ~~Build API~~
- [ ] ~~Containerize app~~.
- [ ] Documentation.
- [X] ~~Add CI/CD~~.
- [ ] Add logging.
- [X] ~~Add tests~~ (limited coverage).
- [ ] Add runtime support for multiple CPEs.

## Contributing
Pull requests are welcome. For major changes, please open an issue first to discuss what you would like to change.

## License
See [LICENSE](LICENSE) file.

