Metadata-Version: 2.4
Name: wikicorpus
Version: 0.1.0
Summary: Retrieve and filter Wikipedia articles to build a corpus for knowledge extraction
Author-email: Andrea Schimmenti <andschimmenti@gmail.com>
License: MIT
Keywords: corpus,knowledge-extraction,nlp,text-mining,wikidata,wikipedia
Classifier: Development Status :: 3 - Alpha
Classifier: Intended Audience :: Science/Research
Classifier: License :: OSI Approved :: MIT License
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3.10
Classifier: Programming Language :: Python :: 3.11
Classifier: Programming Language :: Python :: 3.12
Classifier: Topic :: Scientific/Engineering :: Information Analysis
Classifier: Topic :: Text Processing :: Linguistic
Requires-Python: >=3.10
Requires-Dist: nltk
Requires-Dist: requests
Description-Content-Type: text/markdown

# wikicorpus

## Abstract

wikicorpus is a Python command-line tool for constructing text corpora from
Wikipedia categories, designed to support knowledge extraction and natural
language processing research. Given one or more Wikipedia category names, the
tool retrieves all member articles, aligns each article with its Wikidata
entity, filters sections by configurable interpretive-content keywords, and
extracts candidate sentences containing epistemic verbs (verbs that signal
belief, attribution, argumentation, or uncertainty). All filtering parameters
are runtime-configurable: researchers can supply custom verb lexicons and
section-keyword lists without modifying source code, making the tool
reproducible across different domains and corpora. Output is written to a
structured directory tree with per-article files and a root index, suitable
for downstream parsing, annotation, or training pipelines.

---

## Installation

```bash
pip install -e .
```

Python 3.10 or later is required. The only runtime dependencies are
`requests` and `nltk`; NLTK corpora are downloaded automatically on first
use of the sentence-filtering step.

---

## Quick-start examples

### Retrieve all articles in a single category with no filtering

```bash
wikicorpus --categories "Painting forgeries" --output ./corpus
```

### Retrieve two categories, keep only articles with a Wikidata entity, and require that the entity has the creator property (P170)

```bash
wikicorpus \
  --categories "Painting forgeries | Document forgeries" \
  --output ./corpus \
  --filter-wikidata \
  --wikidata-properties "P170!" \
  --verbose
```
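
The `!` suffix in `"P170!"` marks the property as mandatory. As the CLI reference describes, a `?` suffix marks a property as optional (at least one optional property must be present), and a bare ID is treated as mandatory. The following is a minimal sketch of how such a spec could be parsed and checked. The function names `parse_property_spec` and `satisfies_spec` are hypothetical and are not taken from the wikicorpus source.

```python
def parse_property_spec(spec: str) -> tuple[set[str], set[str]]:
    """Split a spec such as "P170! | P571?" into (mandatory, optional) ID sets."""
    mandatory: set[str] = set()
    optional: set[str] = set()
    for token in spec.split("|"):
        token = token.strip()
        if not token:
            continue  # tolerate stray separators
        if token.endswith("?"):
            optional.add(token[:-1].strip())
        else:
            # A "!" suffix and a bare ID both mean "mandatory".
            mandatory.add(token.rstrip("!").strip())
    return mandatory, optional


def satisfies_spec(entity_props: set[str], mandatory: set[str], optional: set[str]) -> bool:
    """All mandatory IDs must be present; if any optional IDs exist, at least one must be."""
    has_all_mandatory = mandatory <= entity_props
    has_any_optional = not optional or bool(optional & entity_props)
    return has_all_mandatory and has_any_optional


# Illustrative check against a made-up set of entity properties.
mandatory, optional = parse_property_spec("P170! | P571?")
print(satisfies_spec({"P170", "P571", "P31"}, mandatory, optional))  # True
print(satisfies_spec({"P170"}, mandatory, optional))                 # False
```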

### Full pipeline with interpretive-section and sentence filtering, plus custom section keywords and additional verb lemmas

```bash
wikicorpus \
  --categories-file categories.txt \
  --output ./corpus \
  --filter-wikidata \
  --wikidata-properties "P170! | P571?" \
  --filter-sections \
  --add-sections "analysis | methodology" \
  --filter-sentences \
  --add-verbs "posit | allege | maintain" \
  --verbose
```
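
The `categories.txt` file in the example above is a plain-text list with one category per line; blank lines and lines beginning with `#` are ignored. A rough sketch of that parsing rule follows. The helper name `parse_categories` is hypothetical, and stripping surrounding whitespace before the `#` check is an assumption of this sketch.

```python
def parse_categories(text: str) -> list[str]:
    """Return category names from file contents, skipping blanks and # comments."""
    categories = []
    for line in text.splitlines():
        line = line.strip()  # assumption: surrounding whitespace is not significant
        if not line or line.startswith("#"):
            continue
        categories.append(line)
    return categories


example = """# art forgery corpus
Painting forgeries

Document forgeries
"""
print(parse_categories(example))
# ['Painting forgeries', 'Document forgeries']
```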

---

## CLI reference

| Flag | Type / default | Description |
|------|----------------|-------------|
| `--categories` | string / — | Quoted string of Wikipedia category names separated by ` \| `. At least one of `--categories` or `--categories-file` is required. |
| `--categories-file` | path / — | Path to a plain-text file with one category name per line. Lines beginning with `#` and blank lines are ignored. |
| `--output` | path / `./corpus` | Root output directory. Created if it does not exist. |
| `--filter-wikidata` | flag / off | Skip articles with no matching Wikidata entity. When `--wikidata-properties` is also given, additionally enforce the property constraints. |
| `--wikidata-properties` | string / — | Pipe-separated Wikidata property IDs. Suffix each ID with `!` (mandatory: all must be present) or `?` (optional: at least one must be present). Bare IDs are treated as mandatory. Example: `"P170! \| P571?"`. Has no filtering effect unless `--filter-wikidata` is also set. |
| `--filter-sections` | flag / off | Skip articles that contain no interpretive sections (as determined by the active section-keyword list). |
| `--filter-sentences` | flag / off | Skip articles that contain no candidate epistemic sentences. |
| `--verbs-file` | path / — | Path to a JSON file mapping verb-category names to lists of verb lemmas. Merged into the built-in categories: existing categories receive extra verbs; new category names are created. See format specification below. |
| `--add-verbs` | string / — | Pipe-separated verb lemmas to add to the `custom` verb category on top of whatever `--verbs-file` provides. Example: `"allege \| posit \| maintain"`. |
| `--sections-file` | path / — | Path to a JSON file containing an array of section-header keywords. Appended to the built-in list; duplicates are ignored. See format specification below. |
| `--add-sections` | string / — | Pipe-separated section-header keywords to add on top of `--sections-file` and the built-in list. Example: `"analysis \| interpretation"`. |
| `--verbose` | flag / off | Enable INFO-level progress messages written to stderr. |

---

## File format specifications

### `--verbs-file` (JSON object)

A JSON object mapping verb-category names (strings) to lists of verb lemmas
(strings). Category names that already exist in the built-in list receive the
extra verbs appended with duplicates removed. New category names are created.

```json
{
  "revision": ["posit", "maintain"],
  "custom": ["allege", "impute"]
}
```
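
The merge behaviour described above (append to existing categories, drop duplicates, create new categories) could be sketched as follows. This is an illustration only: `merge_verb_categories` is a hypothetical name, and the built-in dictionary is truncated to a single category with two lemmas for brevity.

```python
import json


def merge_verb_categories(builtin: dict[str, list[str]],
                          extra: dict[str, list[str]]) -> dict[str, list[str]]:
    """Merge user-supplied verb categories into the built-in ones, keeping order."""
    merged = {name: list(lemmas) for name, lemmas in builtin.items()}  # copy, do not mutate
    for name, lemmas in extra.items():
        target = merged.setdefault(name, [])  # new category names are created here
        for lemma in lemmas:
            if lemma not in target:           # duplicates are skipped
                target.append(lemma)
    return merged


builtin = {"revision": ["revise", "reassign"]}  # truncated for the example
extra = json.loads('{"revision": ["posit", "revise"], "custom": ["allege", "impute"]}')
print(merge_verb_categories(builtin, extra))
# {'revision': ['revise', 'reassign', 'posit'], 'custom': ['allege', 'impute']}
```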

Built-in verb categories and their default lemmas:

| Category | Default lemmas |
|----------|---------------|
| `argumentation` | argue, dispute, contend, refute, contest, challenge, oppose |
| `assertion` | claim, state, declare, assert, report, attribute, ascribe |
| `epistemic_uncertainty` | believe, think, suppose, assume, suspect, doubt, question, suggest |
| `inference` | conclude, deduce, infer, imply, indicate, derive, propose |
| `revision` | revise, reassign, reattribute, reconsider, overturn, correct, update |

### `--sections-file` (JSON array)

A JSON array of lowercase keyword strings. A section is considered
interpretive if any keyword appears as a substring of its header
(case-insensitive). Keywords are appended to the built-in list; duplicates
are ignored.

```json
["analysis", "interpretation", "methodology", "significance"]
```
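
The substring rule above can be illustrated with a short sketch. The names `merge_keywords` and `is_interpretive` are hypothetical, and the built-in list is shortened here.

```python
def merge_keywords(builtin: list[str], extra: list[str]) -> list[str]:
    """Append user keywords to the built-in list, ignoring duplicates."""
    merged = list(builtin)
    for keyword in extra:
        keyword = keyword.lower()
        if keyword not in merged:
            merged.append(keyword)
    return merged


def is_interpretive(header: str, keywords: list[str]) -> bool:
    """True if any keyword occurs as a case-insensitive substring of the header."""
    header_lower = header.lower()
    return any(keyword.lower() in header_lower for keyword in keywords)


keywords = merge_keywords(["attribution", "provenance"], ["analysis", "provenance"])
print(keywords)                                             # ['attribution', 'provenance', 'analysis']
print(is_interpretive("Attribution and dating", keywords))  # True
print(is_interpretive("Early life", keywords))              # False
```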

Built-in interpretive keywords: `attribution`, `provenance`, `dating`,
`controversy`, `authenticity`, `authorship`, `historiography`, `reception`,
`debate`, `forgery`, `misattribution`, `reattribution`.

---

## Output structure

```
<output>/
├── index.json
└── <Category_Name>_<hash>/
    └── <Article_Title>_<hash>/
        ├── full_text.txt
        ├── sections.json
        ├── wikidata.json
        └── candidates.json
```

Directory names are derived from the category or article title by replacing
spaces with underscores, removing characters outside `[a-zA-Z0-9_-]`, and
appending an 8-character MD5 hash of the original name, so that titles which
sanitise to the same string still map to different directories (barring a
hash collision).
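
The naming rule can be sketched as below. The function name `safe_dirname` is hypothetical, and two details are assumptions of this sketch rather than facts from the source: that the hash is taken over the UTF-8 encoding of the original name, and that the 8 characters are the first 8 hex digits of the MD5 digest.

```python
import hashlib
import re


def safe_dirname(name: str) -> str:
    """Build a filesystem-friendly directory name with a short hash suffix."""
    # Replace spaces, then drop anything outside [a-zA-Z0-9_-].
    slug = re.sub(r"[^a-zA-Z0-9_-]", "", name.replace(" ", "_"))
    # Assumption: first 8 hex digits of the MD5 of the UTF-8 encoded original name.
    digest = hashlib.md5(name.encode("utf-8")).hexdigest()[:8]
    return f"{slug}_{digest}"


# "A/B" and "AB" sanitise to the same slug but hash differently.
print(safe_dirname("A/B"))
print(safe_dirname("AB"))
```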

### Per-article files

| File | Content |
|------|---------|
| `full_text.txt` | Plain-text extract of the full article as returned by the Wikipedia API. |
| `sections.json` | JSON array of section objects, each with `"header"` (string) and `"text"` (string) keys. |
| `wikidata.json` | JSON object with `"qid"`, `"labels"`, and `"properties"` keys, or `null` if no Wikidata entity was found. |
| `candidates.json` | JSON array of candidate-sentence objects (see below). |

### `index.json` fields

The root `index.json` contains one object per saved article:

| Field | Type | Description |
|-------|------|-------------|
| `title` | string | Wikipedia article title. |
| `url` | string | Full URL of the Wikipedia article. |
| `category` | string | Category label as supplied on the command line. |
| `qid` | string or null | Wikidata item identifier (e.g. `"Q12418"`), or `null`. |
| `has_target_properties` | boolean | Whether the entity satisfies the property filter. Always `true` when no property spec is given. |
| `candidate_sentence_count` | integer | Number of candidate sentences extracted from the article. |
| `interpretive_section_count` | integer | Number of interpretive sections found. |

### `candidates.json` fields

Each entry in `candidates.json` describes one candidate sentence:

| Field | Type | Description |
|-------|------|-------------|
| `sentence` | string | The full sentence text. |
| `section_header` | string | Header of the section containing the sentence. |
| `verb_category` | string | Name of the first matched verb category (e.g. `"assertion"`). |
| `matched_verbs` | array of strings | All matched verb lemmas found in the sentence. |
| `negated` | boolean | `true` if a negation word (`not`, `never`, `no`) immediately precedes any matched verb. |

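The `negated` flag can be illustrated with a small sketch. It assumes the sentence has already been tokenised and that the positions of the matched verbs are known; `is_negated` is a hypothetical name, and the whitespace split below stands in for the tool's NLTK tokenisation. Only the three negation words listed in the table are covered.

```python
NEGATION_WORDS = {"not", "never", "no"}


def is_negated(tokens: list[str], verb_positions: list[int]) -> bool:
    """True if a negation word sits directly before any matched verb token."""
    return any(
        pos > 0 and tokens[pos - 1].lower() in NEGATION_WORDS
        for pos in verb_positions
    )


tokens = "Scholars do not believe the attribution".split()
print(is_negated(tokens, [3]))  # True: "not" directly precedes "believe"

tokens = "Scholars believe the painting is not authentic".split()
print(is_negated(tokens, [1]))  # False: "not" does not precede the matched verb
```
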
---

## Pipeline description

The pipeline has four stages:

1. **Retrieve** — `get_category_articles` queries the Wikipedia
   `categorymembers` API to list all articles in each requested category.
   `get_article_content` fetches the plain-text extract and internal links
   for each article. Section boundaries are detected by wiki-markup headers
   (`== Header ==`) and returned as a structured list.

2. **Align** — `get_wikidata_entity` queries the Wikidata API for the entity
   linked to each Wikipedia article. If `--filter-wikidata` is active,
   articles with no entity (or whose entity does not satisfy the property
   spec) are dropped. The entity's QID, labels, and property values are
   stored alongside the article data.

3. **Filter** — `get_interpretive_sections` selects sections whose headers
   match at least one keyword from the active keyword list. If
   `--filter-sections` is active, articles with no matching sections are
   dropped. `get_candidate_sentences` tokenises, POS-tags, and lemmatises
   sentences within interpretive sections (falling back to all sections if
   none are found), then retains only those containing at least one
   epistemic verb from the active verb-category dictionary. If
   `--filter-sentences` is active, articles with no candidate sentences are
   dropped.

4. **Write** — `save_article` creates the per-article directory and writes
   the four output files. `save_index` writes the root `index.json`
   summarising every saved article.
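
As an illustration of the section detection in stage 1, the sketch below splits a plain-text extract on `== Header ==` lines and returns objects with the `"header"` and `"text"` keys used in `sections.json`. The name `split_sections`, the `"Introduction"` label for text before the first header, and the choice to drop sections with empty bodies are all assumptions of this sketch, not behaviour confirmed by the source.

```python
import re

# Matches "== Header ==", "=== Subheader ===", etc. on a line of their own.
HEADER_RE = re.compile(r"^(={2,})[ \t]*(.+?)[ \t]*\1[ \t]*$", re.MULTILINE)


def split_sections(text: str, lead_header: str = "Introduction") -> list[dict]:
    """Split an extract into [{"header": ..., "text": ...}] on wiki-markup headers."""
    sections = []
    header, start = lead_header, 0
    for match in HEADER_RE.finditer(text):
        body = text[start:match.start()].strip()
        if body:  # assumption: sections with no text are dropped
            sections.append({"header": header, "text": body})
        header, start = match.group(2), match.end()
    tail = text[start:].strip()
    if tail:
        sections.append({"header": header, "text": tail})
    return sections


extract = (
    "Lead paragraph.\n\n"
    "== Provenance ==\nOwned by X.\n\n"
    "=== Early owners ===\nOwned by Y.\n"
)
for section in split_sections(extract):
    print(section["header"], "->", section["text"])
```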

---

## Reproducibility

All filtering parameters are runtime-configurable through CLI flags and JSON
configuration files. The built-in keyword and verb lists are defaults that
can be extended for other domains without modifying source code. To
reproduce a corpus exactly, record:

- The exact `--categories` or `--categories-file` input.
- The `--wikidata-properties` spec string.
- The contents of any `--verbs-file` or `--sections-file` used.
- The exact set of `--add-verbs` and `--add-sections` tokens.
- The Wikipedia API snapshot date (the API returns the current live version
  of articles; consider archiving `full_text.txt` for long-term
  reproducibility).

The filtering logic is deterministic given fixed inputs: no randomness is
introduced at any stage.
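
One way to capture the items listed above is a small run manifest stored next to the corpus. The sketch below is a suggestion, not a feature of wikicorpus: the `build_manifest` helper and the `run_manifest.json` file name are hypothetical. It stores a SHA-256 fingerprint of each configuration file; keeping copies of the files themselves is still needed to recover their contents.

```python
import hashlib
import json
from datetime import datetime, timezone
from pathlib import Path


def build_manifest(arguments: dict, config_files: list[str]) -> dict:
    """Collect CLI arguments, config-file fingerprints, and a retrieval timestamp."""
    return {
        # Approximates the "API snapshot date": when the run was started.
        "retrieved_at": datetime.now(timezone.utc).isoformat(),
        "arguments": arguments,
        "config_files": {
            path: hashlib.sha256(Path(path).read_bytes()).hexdigest()
            for path in config_files
        },
    }


manifest = build_manifest(
    {
        "categories": "Painting forgeries | Document forgeries",
        "wikidata_properties": "P170! | P571?",
        "add_verbs": "posit | allege | maintain",
        "add_sections": "analysis | methodology",
    },
    config_files=[],  # e.g. paths passed to --verbs-file / --sections-file
)
# Hypothetical destination: Path("corpus/run_manifest.json").write_text(...)
print(json.dumps(manifest, indent=2))
```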

---

## Summary output

After the pipeline completes, a summary is printed to stdout:

```
Categories processed : 2
Articles retrieved   : 147
After Wikidata filter: 132
Wikidata properties  : P170! | P571?
After section filter : 89
Section keywords     : 12 active (+2 custom: analysis, interpretation)
After sentence filter: 61
Verb categories      : argumentation(7), assertion(7), epistemic_uncertainty(8), inference(7), revision(7)
Candidate sentences  : 430
Output directory     : /absolute/path/to/corpus
```

| Field | Description |
|-------|-------------|
| Categories processed | Number of category names processed. |
| Articles retrieved | Total articles found across all categories. |
| After Wikidata filter | Articles remaining after the Wikidata filter (equals retrieved count if `--filter-wikidata` is off). |
| Wikidata properties | Active property spec, or `none (all entities retained)`. |
| After section filter | Articles remaining after the section filter. |
| Section keywords | Count of active keywords and any custom additions. |
| After sentence filter | Articles remaining after the sentence filter. |
| Verb categories | Each active category name and the number of lemmas it contains. |
| Candidate sentences | Total candidate sentences across all saved articles. |
| Output directory | Absolute path to the output root. |
