Metadata-Version: 2.1
Name: goodwiki
Version: 1.0.1
Summary: Utility that converts Wikipedia pages into GitHub-flavored Markdown.
Home-page: https://github.com/euirim/goodwiki
License: MIT
Keywords: wikipedia,markdown,dataset,wikitext,wikicode
Author: Euirim Choi
Author-email: euirim@gmail.com
Requires-Python: >=3.11,<4.0
Classifier: License :: OSI Approved :: MIT License
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3.11
Classifier: Topic :: Text Processing
Classifier: Topic :: Text Processing :: Filters
Classifier: Topic :: Text Processing :: Markup :: Markdown
Requires-Dist: click (>=8.1.6,<9.0.0)
Requires-Dist: httpx (>=0.24.1,<0.25.0)
Requires-Dist: multiprocess (>=0.70.15,<0.71.0)
Requires-Dist: mwparserfromhell (>=0.6.4,<0.7.0)
Requires-Dist: pyarrow (>=12.0.1,<13.0.0)
Requires-Dist: pypandoc (>=1.11,<2.0)
Requires-Dist: tqdm (>=4.66.1,<5.0.0)
Requires-Dist: wikipedia-api (>=0.6.0,<0.7.0)
Project-URL: Repository, https://github.com/euirim/goodwiki
Description-Content-Type: text/markdown

# GoodWiki

GoodWiki is a Python package that carefully converts Wikipedia pages into GitHub-flavored Markdown. Converted pages preserve layout features like lists, code blocks, math, and block quotes.

This package is used to generate the [GoodWiki Dataset](https://github.com/euirim/goodwiki).

## Installation

This package supports Python 3.11+.

1. Install via pip.

```bash
pip install goodwiki
```

2. Install pandoc v2.19.2 by following the [official installation instructions](https://pandoc.org/installing.html).
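
To confirm which pandoc version is on your `PATH` before converting pages, a small check like the one below may help. This is a minimal sketch that is not part of goodwiki: the helper names `parse_pandoc_version` and `installed_pandoc_version` are hypothetical, and it assumes the first line of `pandoc --version` has the usual `pandoc X.Y.Z` form.

```python
import re
import shutil
import subprocess
from typing import Optional, Tuple


def parse_pandoc_version(version_output: str) -> Optional[Tuple[int, ...]]:
    """Extract the version tuple from the output of `pandoc --version`.

    Hypothetical helper (not part of goodwiki). Assumes the output starts
    with "pandoc X.Y.Z" (or "pandoc.exe X.Y.Z" on some Windows builds).
    """
    match = re.match(r"pandoc(?:\.exe)?\s+(\d+(?:\.\d+)*)", version_output)
    if match is None:
        return None
    return tuple(int(part) for part in match.group(1).split("."))


def installed_pandoc_version() -> Optional[Tuple[int, ...]]:
    """Return the version of the pandoc found on PATH, or None if missing."""
    executable = shutil.which("pandoc")
    if executable is None:
        return None
    result = subprocess.run(
        [executable, "--version"], capture_output=True, text=True, check=True
    )
    return parse_pandoc_version(result.stdout)
```

Calling `installed_pandoc_version()` returns `None` when pandoc is not installed, so you can compare the result against `(2, 19, 2)` before running conversions.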

## Usage

### Initializing Client

```python
import asyncio
from goodwiki import GoodwikiClient

client = GoodwikiClient()
```

You can optionally provide your own user agent (the default is `goodwiki/1.0 (https://euirim.org)`):

```python
client = GoodwikiClient("goodwiki/1.0 (bob@gmail.com)")
```

### Getting Single Page

```python
page = asyncio.run(client.get_page("Usain Bolt"))
```

You can optionally keep styling syntax, such as bold text, in the final Markdown:

```python
page = asyncio.run(client.get_page("Usain Bolt", with_styling=True))
```

You can access the resulting data via properties. For example:

```python
print(page.markdown)
```
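
Because `get_page` is awaited through `asyncio.run` above, several pages can in principle be fetched concurrently. The following is a minimal sketch under that assumption; `fetch_pages` and its `max_concurrency` parameter are hypothetical names, not part of goodwiki, and the helper accepts any async callable that takes a title.

```python
import asyncio
from typing import Any, Awaitable, Callable, Dict, Iterable


async def fetch_pages(
    get_page: Callable[[str], Awaitable[Any]],
    titles: Iterable[str],
    max_concurrency: int = 4,
) -> Dict[str, Any]:
    """Fetch several titles concurrently, at most `max_concurrency` at once.

    Hypothetical helper (not part of goodwiki). `get_page` is any coroutine
    function taking a title, for example a client's `get_page` method.
    """
    # The semaphore caps how many requests are in flight at the same time.
    semaphore = asyncio.Semaphore(max_concurrency)

    async def fetch_one(title: str) -> Any:
        async with semaphore:
            return await get_page(title)

    titles = list(titles)
    # gather preserves input order, so results line up with titles.
    pages = await asyncio.gather(*(fetch_one(title) for title in titles))
    return dict(zip(titles, pages))
```

With the client from above, this could be called as `asyncio.run(fetch_pages(client.get_page, ["Usain Bolt", "Carl Lewis"]))`. Keeping `max_concurrency` small is a reasonable courtesy toward Wikipedia's servers.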

### Getting Category Pages

To get a list of page titles associated with a Wikipedia category, run the following:

```python
client.get_category_pages("Category:Good_articles")
```

### Converting Existing Raw Wikitext

If you've already downloaded raw wikitext from Wikipedia, you can convert it to Markdown by running:

```python
client.get_page_from_wikitext(
    raw_wikitext="RAW_WIKITEXT",
    # The remaining fields populate the final WikiPage object
    title="Usain Bolt",
    pageid=123,
    revid=123,
)
```

## Methodology

Full details are available in this package's [GitHub repo README](https://github.com/euirim/goodwiki).

## External Links

* [Changelog](https://github.com/euirim/goodwiki/releases)
* [GitHub](https://github.com/euirim/goodwiki)
* [Dataset](https://huggingface.co/datasets/euirim/goodwiki)

