Metadata-Version: 2.1
Name: domsdatabasen
Version: 0.1.4
Summary: Scraper and PDF text processor for domsdatabasen.dk
License: MIT
Author: Oliver Kinch
Author-email: oliver.kinch@alexandra.dk
Requires-Python: >=3.10,<3.13
Classifier: License :: OSI Approved :: MIT License
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3.10
Classifier: Programming Language :: Python :: 3.11
Classifier: Programming Language :: Python :: 3.12
Requires-Dist: datasets (>=2.17.1,<3.0.0)
Requires-Dist: easyocr (>=1.7.1,<2.0.0)
Requires-Dist: hydra-core (>=1.3.2,<2.0.0)
Requires-Dist: img2table (>=1.2.11,<2.0.0)
Requires-Dist: jsonlines (>=4.0.0,<5.0.0)
Requires-Dist: numpy (>=1.26.4,<2.0.0)
Requires-Dist: omegaconf (>=2.3.0,<3.0.0)
Requires-Dist: pdf2image (>=1.17.0,<2.0.0)
Requires-Dist: poppler-utils (>=0.1.0,<0.2.0)
Requires-Dist: pypdf (>=4.0.2,<5.0.0)
Requires-Dist: scikit-image (>=0.22.0,<0.23.0)
Requires-Dist: selenium (>=4.18.1,<5.0.0)
Requires-Dist: tabulate (>=0.9.0,<0.10.0)
Requires-Dist: tika (>=2.6.0,<3.0.0)
Requires-Dist: webdriver-manager (>=4.0.1,<5.0.0)
Description-Content-Type: text/markdown

<a href="https://github.com/alexandrainst/domsdatabasen"><img src="gfx/alexandra_logo.png" width="239" height="175" align="right" /></a>
# Domsdatabasen

Scraping og processering af sager fra [Domsdatabasen](https://domsdatabasen.dk/#).

Hver enkelt dom kan tilgås via https://domsdatabasen.dk/#sag/\<nummer\>, hvor \<nummer\> er mellem 1 og 3821 (pr. 11-10-2023).

Datasættet ligger i en processeret udgave på [Huggingface](https://huggingface.co/datasets/alexandrainst/domsdatabasen).

## Scraping af sager
Se `src/scripts/scrape.py`.

## Processing af scraped data
Se `src/scripts/process.py`.

## Byg datasæt
Se `src/scripts/finalize.py`.

______________________________________________________________________
[![Documentation](https://img.shields.io/badge/docs-passing-green)](https://alexandrainst.github.io/domsdatabasen/domsdatabasen.html)
[![License](https://img.shields.io/github/license/alexandrainst/domsdatabasen)](https://github.com/alexandrainst/domsdatabasen/blob/main/LICENSE)
[![LastCommit](https://img.shields.io/github/last-commit/alexandrainst/domsdatabasen)](https://github.com/alexandrainst/domsdatabasen/commits/main)
[![Code Coverage](https://img.shields.io/badge/Coverage-100%25-brightgreen.svg)](https://github.com/alexandrainst/domsdatabasen/tree/main/tests)
[![Contributor Covenant](https://img.shields.io/badge/Contributor%20Covenant-2.0-4baaaa.svg)](https://github.com/alexandrainst/domsdatabasen/blob/main/CODE_OF_CONDUCT.md)


Developers:

- Oliver Kinch (oliver.kinch@alexandra.dk)
- Dan Saattrup Nielsen (dan.nielsen@alexandra.dk)


## Setup

### Installation

1. Run `pip install -r requirements.txt` to install the required packages.


## A Word on Modules and Scripts
In the `src` directory there are two subdirectories, `domsdatabasen`
and `scripts`. This is a brief explanation of the differences between the two.

### Modules
All Python files in the `domsdatabasen` directory are _modules_
internal to the project package. Examples here could be a general data loading script,
a definition of a model, or a training function. Think of modules as all the building
blocks of a project.

When a module is importing functions/classes from other modules we use the _relative
import_ notation - here's an example:

```
from .other_module import some_function
```

### Scripts
Python files in the `scripts` folder are scripts, which are short code snippets that
are _external_ to the project package, and which is meant to actually run the code. As
such, _only_ scripts will be called from the terminal. An analogy here is that the
internal `numpy` code are all modules, but the Python code you write where you import
some `numpy` functions and actually run them, that a script.

When importing module functions/classes when you're in a script, you do it like you
would normally import from any other package:

```
from domsdatabasen import some_function
```

Note that this is also how we import functions/classes in tests, since each test Python
file is also a Python script, rather than a module.


## Features

### Automatic Test Coverage Calculation

Run `make test` to test your code, which also updates the "coverage badge" in the
README, showing you how much of your code base that is currently being tested.

### Continuous Integration

Github CI pipelines are included in the repo, running all the tests in the `tests`
directory, as well as building online documentation, if Github Pages has been enabled
for the repository (can be enabled on Github in the repository settings).

### Code Spaces

Code Spaces is a new feature on Github, that allows you to develop on a project
completely in the cloud, without having to do any local setup at all. This repo comes
included with a configuration file for running code spaces on Github. When hosted on
`alexandrainst/domsdatabasen` then simply press the `<> Code` button
and add a code space to get started, which will open a VSCode window directly in your
browser.


## Project structure
```
.
├── .devcontainer
│   └── devcontainer.json
├── .github
│   └── workflows
│       ├── ci.yaml
│       └── docs.yaml
├── .gitignore
├── .name_and_email
├── .pre-commit-config.yaml
├── CODE_OF_CONDUCT.md
├── CONTRIBUTING.md
├── Dockerfile
├── LICENSE
├── README.md
├── config
│   ├── __init__.py
│   ├── config.yaml
│   └── hydra
│       └── job_logging
│           └── custom.yaml
├── data
├── gfx
│   └── alexandra_logo.png
├── makefile
├── models
├── notebooks
├── poetry.toml
├── pyproject.toml
├── src
│   ├── scripts
│   │   ├── fix_dot_env_file.py
│   │   └── your_script.py
│   └── domsdatabasen
│       ├── __init__.py
│       └── your_module.py
└── tests
    ├── __init__.py
    └── test_dummy.py
```

