Metadata-Version: 2.4
Name: wikicorpus
Version: 0.1.1
Summary: Retrieve and filter Wikipedia articles to build a corpus for knowledge extraction
Project-URL: Homepage, https://github.com/aschimmenti/wikicorpus
Project-URL: Repository, https://github.com/aschimmenti/wikicorpus
Author-email: Andrea Schimmenti <andschimmenti@gmail.com>
License: MIT
Keywords: corpus,knowledge-extraction,nlp,text-mining,wikidata,wikipedia
Classifier: Development Status :: 3 - Alpha
Classifier: Intended Audience :: Science/Research
Classifier: License :: OSI Approved :: MIT License
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3.10
Classifier: Programming Language :: Python :: 3.11
Classifier: Programming Language :: Python :: 3.12
Classifier: Topic :: Scientific/Engineering :: Information Analysis
Classifier: Topic :: Text Processing :: Linguistic
Requires-Python: >=3.10
Requires-Dist: nltk
Requires-Dist: requests
Description-Content-Type: text/markdown

# wikicorpus

## Abstract

wikicorpus is a Python command-line tool for constructing text corpora from
Wikipedia categories, designed to support knowledge extraction and natural
language processing research. Given one or more Wikipedia category names, the
tool retrieves all member articles, aligns each article with its Wikidata
entity, and applies a configurable filtering pipeline. Filtering is entirely
optional and operates at three independent levels: Wikidata entity and property
constraints, section-header keyword matching, and verb-lemma matching within
sentences. All filter parameters, including the verb lexicon and the
section-keyword list, are configurable at runtime via CLI flags or JSON files,
with defaults suited to knowledge-extraction tasks. Researchers can
supply domain-specific verb lists and section keywords without modifying source
code, making the tool applicable across different domains and corpora. Output is
written to a structured directory tree with per-article files and a root index,
suitable for downstream parsing, annotation, or training pipelines.

---

## Installation

```bash
pip install wikicorpus
```

Python 3.10 or later is required. The only runtime dependencies are
`requests` and `nltk`; NLTK corpora are downloaded automatically on first
use of the sentence-filtering step.

---

## Quick-start examples

### Retrieve all articles in a single category with no filtering

```bash
wikicorpus --categories "Painting forgeries" --output ./corpus
```

### Retrieve two categories, keep only articles with a Wikidata entity, and require the creator property (P170)

```bash
wikicorpus \
  --categories "Painting forgeries | Document forgeries" \
  --output ./corpus \
  --filter-wikidata \
  --wikidata-properties "P170!" \
  --verbose
```

### Full pipeline with section and sentence filtering, plus custom keywords and verbs

```bash
wikicorpus \
  --categories-file categories.txt \
  --output ./corpus \
  --filter-wikidata \
  --wikidata-properties "P170! | P571?" \
  --filter-sections \
  --add-sections "analysis | methodology" \
  --filter-sentences \
  --add-verbs "posit | allege | maintain" \
  --verbose
```

---

## CLI reference

All filter flags are **off by default**. With no filter flag set, every
retrieved article is saved; enabling a filter flag drops the articles that
fail that filter.

| Flag | Type / default | Description |
|------|----------------|-------------|
| `--categories` | string / — | Quoted string of Wikipedia category names separated by ` \| `. At least one of `--categories` or `--categories-file` is required. |
| `--categories-file` | path / — | Path to a plain-text file with one category name per line. Lines beginning with `#` and blank lines are ignored. |
| `--output` | path / `./corpus` | Root output directory. Created if it does not exist. |
| `--filter-wikidata` | flag / off | Skip articles with no matching Wikidata entity. When `--wikidata-properties` is also given, additionally enforce the property constraints. |
| `--wikidata-properties` | string / — | Pipe-separated Wikidata property IDs. Suffix each ID with `!` (mandatory: all must be present) or `?` (optional: at least one must be present). Bare IDs are treated as mandatory. Example: `"P170! \| P571?"`. Has no filtering effect unless `--filter-wikidata` is also set. |
| `--filter-sections` | flag / off | Skip articles that contain no sections whose header matches the active keyword list. The keyword list is configurable via `--sections-file` and `--add-sections`. |
| `--filter-sentences` | flag / off | Skip articles that contain no sentences matching the active verb list. The verb list is fully configurable via `--verbs-file` and `--add-verbs`. |
| `--verbs-file` | path / — | Path to a JSON file mapping category names to lists of verb lemmas. Merged into the active verb list: existing categories receive extra verbs; new category names are created. Pass a file with entirely new categories to replace the default vocabulary for a different domain. See format specification below. |
| `--add-verbs` | string / — | Pipe-separated verb lemmas to add to the `custom` verb category on top of whatever `--verbs-file` provides. Example: `"allege \| posit \| maintain"`. |
| `--sections-file` | path / — | Path to a JSON file containing an array of section-header keywords. Appended to the active keyword list; duplicates are ignored. See format specification below. |
| `--add-sections` | string / — | Pipe-separated section-header keywords to add on top of `--sections-file` and the built-in list. Example: `"analysis \| interpretation"`. |
| `--verbose` | flag / off | Enable INFO-level progress messages written to stderr. |

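The `--wikidata-properties` semantics described in the table can be illustrated with a short sketch. This is not the tool's actual implementation; the function names `parse_property_spec` and `satisfies` are hypothetical, and the entity's properties are assumed to be available as a set of property IDs.

```python
def parse_property_spec(spec: str) -> tuple[set[str], set[str]]:
    """Split a spec such as "P170! | P571?" into (mandatory, optional) ID sets."""
    mandatory: set[str] = set()
    optional: set[str] = set()
    for token in spec.split("|"):
        token = token.strip()
        if not token:
            continue  # tolerate stray separators
        if token.endswith("?"):
            optional.add(token[:-1].strip())
        else:
            # A "!" suffix and a bare ID are both treated as mandatory.
            mandatory.add(token.rstrip("!").strip())
    return mandatory, optional


def satisfies(entity_properties: set[str], mandatory: set[str], optional: set[str]) -> bool:
    """All mandatory IDs must be present; if optional IDs exist, at least one must be."""
    if not mandatory <= entity_properties:
        return False
    return not optional or bool(optional & entity_properties)


mandatory, optional = parse_property_spec("P170! | P571?")
print(satisfies({"P170", "P571", "P31"}, mandatory, optional))
```
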
---

## File format specifications

### `--verbs-file` (JSON object)

A JSON object mapping category names (strings) to lists of verb lemmas
(strings). Category names that already exist in the active list receive the
extra verbs appended with duplicates removed. New category names are created.
To replace the default vocabulary entirely for a different domain, supply a
file whose categories do not overlap with the built-in ones.

```json
{
  "revision": ["posit", "maintain"],
  "custom": ["allege", "impute"]
}
```
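
The merge behaviour described above (extra verbs appended to existing categories with duplicates removed, new categories created) could look roughly like the following. The helper name `merge_verbs` is hypothetical and the sketch makes no claim about how wikicorpus implements the merge internally.

```python
import json


def merge_verbs(active: dict[str, list[str]], extra: dict[str, list[str]]) -> dict[str, list[str]]:
    """Return a new mapping with `extra` merged into `active`, keeping order and dropping duplicates."""
    merged = {category: list(lemmas) for category, lemmas in active.items()}
    for category, lemmas in extra.items():
        bucket = merged.setdefault(category, [])  # creates new categories on demand
        for lemma in lemmas:
            if lemma not in bucket:
                bucket.append(lemma)
    return merged


# One built-in category, copied from the defaults table in this README.
active = {"revision": ["revise", "reassign", "reattribute", "reconsider",
                       "overturn", "correct", "update"]}
extra = json.loads('{"revision": ["posit", "maintain", "revise"], "custom": ["allege", "impute"]}')
merged = merge_verbs(active, extra)
print(sorted(merged))
```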

The built-in verb categories are defaults suited for knowledge-extraction
tasks. They can be extended or replaced entirely at runtime:

| Category | Default lemmas |
|----------|---------------|
| `argumentation` | argue, dispute, contend, refute, contest, challenge, oppose |
| `assertion` | claim, state, declare, assert, report, attribute, ascribe |
| `epistemic_uncertainty` | believe, think, suppose, assume, suspect, doubt, question, suggest |
| `inference` | conclude, deduce, infer, imply, indicate, derive, propose |
| `revision` | revise, reassign, reattribute, reconsider, overturn, correct, update |

### `--sections-file` (JSON array)

A JSON array of lowercase keyword strings. A section is selected when any
keyword appears as a substring of its header (case-insensitive). Keywords are
appended to the active list; duplicates are ignored.

```json
["analysis", "interpretation", "methodology", "significance"]
```
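
As a minimal sketch of the matching rule stated above (case-insensitive substring match against the section header), assuming sections are dictionaries with a `"header"` key as in `sections.json`; the function name `matching_sections` is hypothetical:

```python
def matching_sections(sections: list[dict[str, str]], keywords: list[str]) -> list[dict[str, str]]:
    """Keep sections whose header contains any keyword, ignoring case."""
    lowered = [keyword.lower() for keyword in keywords]
    return [
        section for section in sections
        if any(keyword in section["header"].lower() for keyword in lowered)
    ]


sections = [
    {"header": "Provenance and dating", "text": "..."},
    {"header": "Description", "text": "..."},
    {"header": "Updating the catalogue", "text": "..."},
]
hits = matching_sections(sections, ["provenance", "dating"])
print([section["header"] for section in hits])
```

Because the rule is a plain substring test, a keyword such as `dating` also selects a header like "Updating the catalogue"; choose keywords with that in mind.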

The built-in section keywords are defaults suited for knowledge-extraction
tasks and can be extended at runtime: `attribution`, `provenance`, `dating`,
`controversy`, `authenticity`, `authorship`, `historiography`, `reception`,
`debate`, `forgery`, `misattribution`, `reattribution`.

---

## Output structure

```
<output>/
├── index.json
└── <Category_Name>_<hash>/
    └── <Article_Title>_<hash>/
        ├── full_text.txt
        ├── sections.json
        ├── wikidata.json
        └── candidates.json
```

Directory names are derived from the category and article title by replacing
spaces with underscores, removing characters outside `[a-zA-Z0-9_-]`, and
appending an 8-character MD5 hash of the original name to guarantee
uniqueness.
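
The naming rule can be sketched as follows. Two details are assumptions not stated in this README: that the hash is taken over the UTF-8 encoding of the original name, and that the 8 characters are the first 8 hexadecimal digits of the digest. The function name `safe_dir_name` is hypothetical.

```python
import hashlib
import re


def safe_dir_name(name: str) -> str:
    """Build a filesystem-safe directory name with a short hash suffix."""
    # Replace spaces with underscores, then drop anything outside [a-zA-Z0-9_-].
    cleaned = re.sub(r"[^a-zA-Z0-9_-]", "", name.replace(" ", "_"))
    # Assumption: first 8 hex digits of the MD5 of the UTF-8 encoded original name.
    digest = hashlib.md5(name.encode("utf-8")).hexdigest()[:8]
    return f"{cleaned}_{digest}"


print(safe_dir_name("Painting forgeries"))
```

The hash suffix is what keeps titles such as `A/B` and `AB` apart, since both reduce to the same cleaned stem.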

### Per-article files

| File | Content |
|------|---------|
| `full_text.txt` | Plain-text extract of the full article as returned by the Wikipedia API. |
| `sections.json` | JSON array of section objects, each with `"header"` (string) and `"text"` (string) keys. |
| `wikidata.json` | JSON object with `"qid"`, `"labels"`, and `"properties"` keys, or `null` if no Wikidata entity was found. |
| `candidates.json` | JSON array of candidate-sentence objects (see below). Always written; used for filtering only when `--filter-sentences` is active. |

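To make the `sections.json` layout concrete, here is a sketch of how a plain-text extract containing wiki-markup headers (`== Header ==`) could be split into `"header"`/`"text"` objects. The function name `split_sections` and the `"Introduction"` label for text before the first header are assumptions for illustration, not the tool's documented behaviour.

```python
import re

# Matches "== Header ==" through "====== Header ======" on a line of its own.
HEADER_RE = re.compile(r"^(={2,6})\s*(.+?)\s*\1\s*$")


def split_sections(text: str, lead_header: str = "Introduction") -> list[dict[str, str]]:
    """Split an extract into section objects shaped like the entries of sections.json."""
    sections: list[dict[str, str]] = []
    header, buffer = lead_header, []
    for line in text.splitlines():
        match = HEADER_RE.match(line)
        if match:
            # Close the section collected so far and start a new one.
            sections.append({"header": header, "text": "\n".join(buffer).strip()})
            header, buffer = match.group(2), []
        else:
            buffer.append(line)
    sections.append({"header": header, "text": "\n".join(buffer).strip()})
    return sections


sample = "Lead paragraph.\n== Provenance ==\nOwned by X.\n=== Later history ===\nSold.\n"
parsed = split_sections(sample)
print([section["header"] for section in parsed])
```
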
### `index.json` fields

The root `index.json` contains one object per saved article:

| Field | Type | Description |
|-------|------|-------------|
| `title` | string | Wikipedia article title. |
| `url` | string | Full URL of the Wikipedia article. |
| `category` | string | Category label as supplied on the command line. |
| `qid` | string or null | Wikidata item identifier (e.g. `"Q12418"`), or `null`. |
| `has_target_properties` | boolean | Whether the entity satisfies the property filter. Always `true` when no property spec is given. |
| `candidate_sentence_count` | integer | Number of sentences matching the active verb list. |
| `interpretive_section_count` | integer | Number of sections matching the active keyword list. |

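For downstream use, the index can be loaded and filtered on the fields listed above. The sketch below assumes that `index.json` is a top-level JSON array of such objects, which this README does not state explicitly; `load_index` and `rich_articles` are hypothetical helper names.

```python
import json
from pathlib import Path


def load_index(output_dir: str) -> list[dict]:
    """Read <output>/index.json (assumed to be a JSON array of article objects)."""
    with open(Path(output_dir) / "index.json", encoding="utf-8") as handle:
        return json.load(handle)


def rich_articles(entries: list[dict], min_candidates: int = 1) -> list[dict]:
    """Keep entries that have a Wikidata QID and enough candidate sentences."""
    return [
        entry for entry in entries
        if entry["qid"] is not None
        and entry["candidate_sentence_count"] >= min_candidates
    ]


# Usage (path is illustrative): rich_articles(load_index("./corpus"), min_candidates=5)
```
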
### `candidates.json` fields

Each entry in `candidates.json` describes one sentence that matched the active
verb list:

| Field | Type | Description |
|-------|------|-------------|
| `sentence` | string | The full sentence text. |
| `section_header` | string | Header of the section containing the sentence. |
| `verb_category` | string | Name of the first matched verb category. |
| `matched_verbs` | array of strings | All matched verb lemmas found in the sentence. |
| `negated` | boolean | `true` if a negation word (`not`, `never`, `no`) immediately precedes any matched verb. |

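The `negated` rule can be read as a check on the token directly before each matched verb. The sketch below assumes the sentence is already tokenised and that the positions of the matched verbs are known; `is_negated` is a hypothetical name. Under this reading, a sentence such as "did not clearly state" is not flagged, because "clearly" sits between the negation word and the verb.

```python
NEGATION_WORDS = {"not", "never", "no"}


def is_negated(tokens: list[str], verb_positions: list[int]) -> bool:
    """True if a negation word immediately precedes any matched verb."""
    return any(
        position > 0 and tokens[position - 1].lower() in NEGATION_WORDS
        for position in verb_positions
    )


tokens = "Scholars do not attribute the panel to Vermeer".split()
print(is_negated(tokens, [3]))
```
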
---

## Pipeline description

The pipeline has four stages:

1. **Retrieve** — `get_category_articles` queries the Wikipedia
   `categorymembers` API to list all articles in each requested category.
   `get_article_content` fetches the plain-text extract and internal links
   for each article. Section boundaries are detected by wiki-markup headers
   (`== Header ==`) and returned as a structured list.

2. **Align** — `get_wikidata_entity` queries the Wikidata API for the entity
   linked to each Wikipedia article. If `--filter-wikidata` is active,
   articles with no entity (or whose entity does not satisfy the property
   spec) are dropped. The entity's QID, labels, and property values are
   stored alongside the article data.

3. **Filter** — Two independent, optional filters are applied. First,
   `get_interpretive_sections` selects sections whose headers contain at
   least one keyword from the active keyword list; if `--filter-sections` is
   active, articles with no matching sections are dropped. Second,
   `get_candidate_sentences` tokenises, POS-tags, and lemmatises sentences,
   then retains those containing at least one verb lemma from the active verb
   list; if `--filter-sentences` is active, articles with no matching
   sentences are dropped. Both the keyword list and the verb list are
   fully configurable at runtime. When neither filter flag is set, all
   articles are saved and the filter outputs are still computed and written
   to disk for downstream use.

4. **Write** — `save_article` creates the per-article directory and writes
   the four output files. `save_index` writes the root `index.json`
   summarising every saved article.
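
The Retrieve stage relies on the MediaWiki `categorymembers` list, which returns results in pages. The sketch below shows one way to follow the API's `continue` tokens; it is not the body of `get_category_articles`, whose internals are not shown here. The English Wikipedia endpoint, the restriction to the main namespace, and the names `iter_category_members` and `fetch_with_requests` are assumptions. The HTTP call is passed in as a function so the pagination logic can be exercised offline.

```python
from typing import Callable, Iterator

API_URL = "https://en.wikipedia.org/w/api.php"  # assumed language edition


def iter_category_members(category: str, fetch: Callable[[dict], dict]) -> Iterator[str]:
    """Yield article titles in a category, following API continuation tokens."""
    params = {
        "action": "query",
        "list": "categorymembers",
        "cmtitle": f"Category:{category}",
        "cmnamespace": "0",  # main namespace only (assumption)
        "cmlimit": "500",
        "format": "json",
    }
    while True:
        data = fetch(params)
        for member in data["query"]["categorymembers"]:
            yield member["title"]
        continuation = data.get("continue")
        if not continuation:
            break
        # Merge the continuation tokens into the next request.
        params = {**params, **continuation}


def fetch_with_requests(params: dict) -> dict:
    """Real HTTP fetcher; imported lazily so the sketch runs without network access."""
    import requests

    response = requests.get(API_URL, params=params, timeout=30)
    response.raise_for_status()
    return response.json()


# Usage: list(iter_category_members("Painting forgeries", fetch_with_requests))
```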

---

## Reproducibility

All filtering parameters are runtime-configurable through CLI flags and JSON
configuration files. The tool contains no hard-coded domain assumptions: the
built-in keyword and verb lists are defaults that can be extended or replaced
without modifying source code. To reproduce a corpus exactly, record:

- The exact `--categories` or `--categories-file` input.
- The `--wikidata-properties` spec string.
- The contents of any `--verbs-file` or `--sections-file` used.
- The exact set of `--add-verbs` and `--add-sections` tokens.
- The retrieval date. The Wikipedia API returns the current live version of
  each article, so consider archiving `full_text.txt` for long-term
  reproducibility.

The filtering logic is deterministic given fixed inputs: no randomness is
introduced at any stage.

---

## Summary output

After the pipeline completes, a summary is printed to stdout:

```
Categories processed : 2
Articles retrieved   : 147
After Wikidata filter: 132
Wikidata properties  : P170! | P571?
After section filter : 89
Section keywords     : 12 active (+2 custom: analysis, interpretation)
After sentence filter: 61
Verb categories      : argumentation(7), assertion(7), epistemic_uncertainty(8), inference(7), revision(7)
Candidate sentences  : 430
Output directory     : /absolute/path/to/corpus
```

| Field | Description |
|-------|-------------|
| Categories processed | Number of category names processed. |
| Articles retrieved | Total articles found across all categories. |
| After Wikidata filter | Articles remaining after the Wikidata filter (equals retrieved count if `--filter-wikidata` is off). |
| Wikidata properties | Active property spec, or `none (all entities retained)`. |
| After section filter | Articles remaining after the section filter (equals Wikidata count if `--filter-sections` is off). |
| Section keywords | Count of active keywords and any custom additions. |
| After sentence filter | Articles remaining after the sentence filter (equals section count if `--filter-sentences` is off). |
| Verb categories | Each active category name and the number of lemmas it contains. |
| Candidate sentences | Total sentences matching the active verb list across all saved articles. |
| Output directory | Absolute path to the output root. |
