Metadata-Version: 2.1
Name: wikirevparser
Version: 0.0.6
Summary: Wikipedia revision history parser for Python
Home-page: https://github.com/ajoer/WikiRevParser
Author: Anna Jørgensen
Author-email: anka.jorgensen@gmail.com
License: MIT
Keywords: python wikipedia API
Platform: UNKNOWN
Classifier: Programming Language :: Python
Classifier: Programming Language :: Python :: 3
Classifier: License :: OSI Approved :: MIT License
Classifier: Operating System :: OS Independent
Requires-Python: >=3.6
Description-Content-Type: text/markdown
Requires-Dist: beautifulsoup4
Requires-Dist: nltk
Requires-Dist: requests (<3.0.0,>=2.0.0)

# WikiRevParser

**WikiRevParser** is a Python library that parses Wikipedia revision histories. It allows you to analyse the development of pages on Wikipedia over time and across language versions.

The library takes a language code and Wikipedia page title as input, extracts the revision history with our [Wikipedia API wrapper](https://github.com/ajoer/Wikipedia), and parses the noisy, unstructured content into clean, accessible data for each timestamp in the revision history. 
You can use this library to trace how the references of a page develop, analyse the content or images over time, compare the tables of contents across languages, create editor networks, and much more.

## Get Started

Besides WikiRevParser, you'll need our [Wikipedia API wrapper](https://github.com/ajoer/Wikipedia) (forked from [Wikipedia](https://github.com/goldsmith/Wikipedia) by @goldsmith), which extracts and returns the entire revision history of a Wikipedia page. Note that Python 3.6 or later is required.

	$ pip3 install wikirevparser
	$ git clone https://github.com/ajoer/Wikipedia.git

## Example

To get the revision history for the page on [Marie Curie](https://en.wikipedia.org/wiki/Marie_Curie) on the English Wikipedia, run:

	>>> from wikirevparser import wikirevparser
	>>> parser_instance = wikirevparser.ProcessRevisions("en", "Marie Curie") 
	>>> parser_instance.wikipedia_page()
	>>> data = parser_instance.parse_revisions()

Now you have the revisions of the [Marie Curie](https://en.wikipedia.org/wiki/Marie_Curie) page in a structured dictionary format, and you can start exploring the data.
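
Before digging in, it can help to check the shape of what came back. The sketch below uses a small hypothetical stand-in for `data`, and assumes the dictionary is keyed by revision timestamp and ordered from newest to oldest; only the `"links"` field is taken from the example in this README, and the real output holds more fields and many more revisions.

```python
# Hypothetical stand-in for the output of parse_revisions(); the real
# dictionary is much larger and each revision carries more fields.
data = {
    "2020-05-01T10:00:00Z": {"links": ["radium", "polonium", "radium"]},
    "2010-03-15T08:30:00Z": {"links": ["radium", "pierre and marie curie"]},
    "2002-01-20T12:00:00Z": {"links": ["pierre and marie curie"]},
}

timestamps = list(data)          # assumed order: newest first
newest, oldest = timestamps[0], timestamps[-1]
fields = sorted(data[newest])    # names of the fields stored for a revision

print("%d revisions, from %s to %s" % (len(data), oldest, newest))
print("Fields per revision:", fields)
```

Running the same lines on the real `data` should show you which fields are available for each revision.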

Let's look at the use of **links**.
I want to know whether the links on the page are the same now as when the page was first created.

	>>> edits = list(data.items())
	>>> first_links = edits[-1][1]["links"]
	>>> latest_links = edits[0][1]["links"]
	>>> present_now = first_links[0] in latest_links 
	>>> print("The only link in the first version was '%s'. \nThat link is still present in the current version: %s." % (first_links[0], present_now))

	The only link in the first version was 'pierre and marie curie'.
	That link is still present in the current version: False.
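
If the first version has more than one link, checking them one by one gets tedious. A possible alternative is to compare the two versions as sets. This is a minimal sketch with hypothetical link lists standing in for `first_links` and `latest_links`; it assumes both are plain lists of strings, as in the example above.

```python
# Hypothetical link lists standing in for first_links and latest_links.
first_links = ["pierre and marie curie"]
latest_links = ["polonium", "radium", "polonium", "university of paris"]

first_set, latest_set = set(first_links), set(latest_links)

removed = first_set - latest_set   # in the first version, gone now
added = latest_set - first_set     # not in the first version, present now
kept = first_set & latest_set      # present in both versions

print("Removed:", sorted(removed))
print("Added:", sorted(added))
print("Kept:", sorted(kept))
```

Note that sets drop duplicates, so this tells you which links changed, not how often each one is used.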

Okay, but what are the most frequent links on the page now?

	>>> from collections import Counter
	>>> links = Counter()
	>>> for link in latest_links:
	...	links[link] += 1
	...
	>>> print(links)
	Counter({'polonium': 5, 'radium': 5, 'university of paris': 5, 'russian empire': 4, 'gabriel lippmann': 4, 'nobel prize in physics': 4, 'nobel prize in chemistry': 4, ... })
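
The standard library's `Counter` can also do the counting and ranking in one go. The sketch below uses a short hypothetical list in place of `latest_links`; `Counter(...)` and `most_common(n)` are standard `collections` calls, not part of WikiRevParser.

```python
from collections import Counter

# Hypothetical list standing in for latest_links.
latest_links = ["polonium", "radium", "polonium", "radium", "polonium", "russian empire"]

links = Counter(latest_links)      # counts every item in a single pass
top_two = links.most_common(2)     # the two most frequent links with their counts

print(top_two)
```

`most_common()` without an argument returns every link, sorted from most to least frequent.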

Using the revision history parsed by the WikiRevParser, you could also answer questions like:
* When was the 'pierre and marie curie' link deleted?
* Who made that edit?
* Did that editor also edit the [Afrikaans page](https://af.wikipedia.org/wiki/Marie_Curie) on Marie Curie?
* What are the most referenced sources on the page?
* Which references are used on both the English page and the [Arabic one](https://ar.wikipedia.org/wiki/%D9%85%D8%A7%D8%B1%D9%8A_%D9%83%D9%88%D8%B1%D9%8A)?
* How many Wikipedians have edited the English page? And the [Dutch page](https://nl.wikipedia.org/wiki/Marie_Curie)?
* Do all language versions use the same image of Marie Curie as the top image?
* Where are the Wikipedians located?
* How frequently is the page edited? 
* Has the English page developed consistently or did editing intensify at one point?
* How does the editing pattern of the English page match that of the [Korean page](https://ko.wikipedia.org/wiki/%EB%A7%88%EB%A6%AC_%ED%80%B4%EB%A6%AC)?
* ... and many other questions
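
As an illustration, the first question could be approached roughly like this. It is a sketch, not part of the library: `first_revision_without` is a hypothetical helper, the `data` dictionary is made up, and the code assumes revisions are keyed by timestamp and ordered from newest to oldest, with a `"links"` list per revision.

```python
def first_revision_without(data, link):
    """Return the timestamp of the earliest revision in which `link` is gone
    after having been present, or None if it was never removed."""
    seen = False
    # Walk from oldest to newest, assuming `data` is ordered newest first.
    for timestamp, revision in reversed(list(data.items())):
        present = link in revision["links"]
        if seen and not present:
            return timestamp
        seen = seen or present
    return None


# Hypothetical stand-in for the output of parse_revisions().
data = {
    "2020-05-01T10:00:00Z": {"links": ["radium", "polonium"]},
    "2010-03-15T08:30:00Z": {"links": ["radium", "pierre and marie curie"]},
    "2002-01-20T12:00:00Z": {"links": ["pierre and marie curie"]},
}

deleted_at = first_revision_without(data, "pierre and marie curie")
print("Link first missing in revision:", deleted_at)
```

If a link was removed and later restored, this returns the first removal only.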

See the examples in the [documentation](https://wikirevparser.readthedocs.io/en/latest/) for more inspiration and functionality, and check the FAQ or file a bug on the [issue tracker](https://github.com/ajoer/WikiRevParser/issues) if you run into problems!

## Documentation

Read the docs at [wikirevparser.readthedocs.io](https://wikirevparser.readthedocs.io/en/latest/) for more details and use case examples.

## License

This work is MIT licensed. See the [LICENSE file](https://github.com/ajoer/WikiRevParser/LICENSE) for full details.

## Credits

- @goldsmith for the Python Wikipedia API wrapper [Wikipedia](https://github.com/goldsmith/Wikipedia).
- The [Wikimedia Foundation](http://wikimediafoundation.org/wiki/Home) and all Wikipedians for creating and maintaining the data.
- This work has received funding from the European Union’s Horizon 2020 research and innovation programme under the Marie Skłodowska-Curie grant agreement No. 812997.



