Metadata-Version: 2.1
Name: wikipedia_tools
Version: 2.0.0
Summary: This is a Wikipedia Tool to fetch revisions based on a period of time.
Keywords: wikipedia,wikipedia revisions,wikipedia stats
Author-email: Roxanne El Baff <roxanne.elbaff@dlr.de>
Requires-Python: >=3.7
Description-Content-Type: text/markdown
Classifier: License :: OSI Approved :: MIT License
Classifier: Programming Language :: Python
Classifier: Programming Language :: Python :: 3
Classifier: Intended Audience :: Science/Research
Requires-Dist: pandas>=1.0.1
Requires-Dist: matplotlib>=3.2.1
Requires-Dist: pymediawiki==0.7.2
Requires-Dist: IPy>=1.01
Requires-Dist: seaborn>=0.11.2
Requires-Dist: nlpaf
Requires-Dist: tqdm==4.43.0
Requires-Dist: dataclasses==0.6
Requires-Dist: beautifulsoup4
Requires-Dist: requests>=2.0.0,<3.0.0
Requires-Dist: pip-tools ; extra == "dev"
Project-URL: Homepage, https://github.com/DLR-SC/wikipedia-periodic-revisions
Provides-Extra: dev

# Wikipedia Periodic Revisions

## Background
This package is built on top of the [Wikipedia API](https://github.com/goldsmith/Wikipedia), whose code is forked into the `base` subpackage.
It also includes a fork of [ajoer/WikiRevParser](https://github.com/ajoer/WikiRevParser), modified to accept *from* and *to* datetimes so that revisions can be fetched for a specific period; the modified code is in `wikipedia_tools.scraper.wikirevparser_with_time.py`.

## Installation
Install manually by cloning the repository and then running

``` 
pip install -e wikipedia_tools
```

or by running

``` 
pip install git+https://github.com/DLR-SC/wikipedia-periodic-revisions.git
```
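
After either command, it can be useful to confirm that the package is importable from the active environment. The snippet below is a minimal sketch using only the standard library; the helper name `is_installed` is hypothetical and not part of this package.

```python
import importlib.util


def is_installed(module_name):
    """Return True if `module_name` can be located on the current import path."""
    # find_spec returns None when a top-level module cannot be found.
    return importlib.util.find_spec(module_name) is not None


# `is_installed` is an illustrative helper, not a function of wikipedia_tools.
print(is_installed("wikipedia_tools"))
```

If this prints `False`, the install most likely went into a different Python environment than the one running the check.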

## wikipedia_tools package

This package is responsible for:
- fetching the revisions of Wikipedia pages for a given period of time,
- loading them into parquet, and
- providing basic analysis.

It contains three main subpackages, plus the *utils* package, which contains a few helper functions:

### Scraper [[wikipedia_tools.scraper](wikipedia_tools/wikipedia_tools/scraper.py)]
This subpackage is responsible for downloading Wikipedia revisions from the web.

The code below shows how to download all the revisions:
  - of pages belonging to the *Climate_change* category,
  - made between the beginning of the month 8 months ago (1.1.2022) and now (29.9.2022). The *get_last_month* function, called with 8, returns the datetime of the beginning of that month.
  
    ```python 
    from wikipedia_tools.utils import utils 
    utils.get_last_month(8)
    ```
  - if `save_each_page=True`, each page is fetched and saved immediately under the folder **data/periodic_wiki_batches/{*categories_names*}/from{month-year}_to{month-year}**. Otherwise, the revisions of all pages are fetched first and then saved into a single jsonl file.
  


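```python
from datetime import datetime

from wikipedia_tools.scraper import downloader
from wikipedia_tools.utils import utils

# Revisions of all pages in the Climate_change category,
# from the start of the month 8 months ago until now.
wikirevs = downloader.WikiPagesRevision(
    categories=["Climate_change"],
    revisions_from=utils.get_last_month(8),
    revisions_to=datetime.now(),
    save_each_page=True,  # save each page as soon as it is fetched
)
```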
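
To make the date window concrete, the sketch below shows one way to compute "the beginning of the month N months ago" with the standard library only. It is not the implementation of `utils.get_last_month`; the function name `start_of_month_n_months_ago` and its `now` parameter are assumptions made for illustration.

```python
from datetime import datetime


def start_of_month_n_months_ago(n, now=None):
    """Return midnight on the first day of the month `n` months before `now`."""
    now = now or datetime.now()
    # Put year and month on a single month axis so that divmod
    # handles the rollover across year boundaries.
    total_months = now.year * 12 + (now.month - 1) - n
    year, month_index = divmod(total_months, 12)
    return datetime(year, month_index + 1, 1)


# Illustrative helper only; it mirrors the example dates used above.
print(start_of_month_n_months_ago(8, now=datetime(2022, 9, 29)))
# → 2022-01-01 00:00:00
```

With `now` fixed to 29.9.2022 and `n=8`, the result is 1.1.2022, which matches the period described above.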



