Metadata-Version: 2.4
Name: corpress
Version: 1.1.2
Summary: Create a text corpus from a WordPress site using the WordPress API.
Home-page: https://github.com/polsci/corpress
Author: polsci
Author-email: polsci@users.noreply.github.com
License: MIT License
Keywords: nbdev jupyter notebook python
Classifier: Development Status :: 4 - Beta
Classifier: Intended Audience :: Developers
Classifier: Natural Language :: English
Classifier: Programming Language :: Python :: 3.8
Classifier: Programming Language :: Python :: 3.9
Classifier: Programming Language :: Python :: 3.10
Classifier: License :: OSI Approved :: MIT License
Requires-Python: >=3.8
Description-Content-Type: text/markdown
License-File: LICENSE
Requires-Dist: requests
Requires-Dist: beautifulsoup4
Requires-Dist: lxml
Requires-Dist: pandas
Requires-Dist: python-slugify
Provides-Extra: dev
Requires-Dist: jupyterlab; extra == "dev"
Requires-Dist: nbdev; extra == "dev"
Requires-Dist: jupyterlab-quarto; extra == "dev"
Requires-Dist: twine; extra == "dev"
Dynamic: author
Dynamic: author-email
Dynamic: classifier
Dynamic: description
Dynamic: description-content-type
Dynamic: home-page
Dynamic: keywords
Dynamic: license
Dynamic: license-file
Dynamic: provides-extra
Dynamic: requires-dist
Dynamic: requires-python
Dynamic: summary

# CorPress



Geoff Ford  
<https://geoffford.nz/corpress-release>

![GitHub
Release](https://img.shields.io/github/v/release/polsci/corpress.png)
[![DOI](https://zenodo.org/badge/844819254.svg)](https://zenodo.org/doi/10.5281/zenodo.13364642)

[CorPress documentation](https://geoffford.nz/corpress)

CorPress (Cor from Corpus, Press from WordPress) provides a simple
approach to retrieve posts or pages from a WordPress site’s [REST
API](https://developer.wordpress.org/rest-api/) and create a corpus
(i.e. a data-set of texts). CorPress provides an efficient and
standardized way to collect text data from WordPress sites, avoiding the
need for customized scrapers. Not all WordPress sites provide access to
the REST API, but many do.

I’m a political scientist who applies corpus linguistics and digital
methods in [my research](https://geoffford.nz). I’m releasing CorPress
with academic researchers in mind. This tool is intended for academic
research. Please cite CorPress if you use it in your research.

CorPress attempts to detect a REST API endpoint from a website URL for
[posts](https://developer.wordpress.org/rest-api/reference/posts/#list-posts)
(default) and
[pages](https://developer.wordpress.org/rest-api/reference/pages/#list-pages),
then downloads JSON from the API, and then processes the JSON to create
a corpus. You can create a corpus in:

1.  ‘txt’ format: texts are saved in separate .txt files, compatible
    with common corpus linguistics tools, like AntConc. An optional
    meta-data file can be output with the link to each file, title, and
    date; or  
2.  ‘csv’ format: meta-data and text is saved in a single CSV file.

I’ve used [nbdev](https://nbdev.fast.ai/) to develop this library, which
uses Jupyter notebooks to develop code,
[documentation](https://geoffford.nz/corpress), code examples and tests.
If you want to contribute, you will need to clone the [GitHub
repo](https://github.com/polsci/corpress) and [set up
nbdev](https://nbdev.fast.ai/getting_started.html).

## Acknowledgements

This library was developed through my research on these projects:

- [Mapping LAWS project: Issue Mapping and Analysing the Lethal
  Autonomous Weapons Debate](https://mappinglaws.net/) (Funded by Royal
  Society of New Zealand’s Marsden Fund, Grant 19-UOC-068)  
- [Into the Deep: Analysing the Actors and Controversies Driving the
  Adoption of the World’s First Deep Sea Mining
  Governance](https://miningthesea.net/) (Funded by Royal Society of New
  Zealand’s Marsden Fund, Grant 22-UOC-059)

## Install

``` sh
pip install corpress
```

## Before using

- There are good reasons not to collect and/or distribute corpora and it
  is the end-user’s responsibility to use this software in an ethical
  way.  
- Depending on the nature of the texts collected, what you are doing
  when analyzing the texts, and how you disseminate your research, it
  may be appropriate to process the texts further (e.g. to remove
  personally identifying information).  
- Not all WordPress sites make the REST API accessible. See [example
  output when there is no REST API available](#no-rest-api-available).  
- It is possible the API data may differ from what is visible online.
  You should check the texts in your corpus to make sure you have what
  you expect!  
- CorPress will exit with appropriate logging information if an API
  endpoint is not found, not accessible or returns unexpected data. Just
  read what it returns.  
- Collecting data uses energy and server resources. It is your
  responsibility to set the seconds between requests to the API so that
  your use of this tool is respectful.  
- It is your responsibility to [set an appropriate User
  Agent](#set-an-appropriate-user-agent).

## How to use

The [CorPress function](https://geoffford.nz/corpress/core#corpress) is
the intended way to invoke CorPress and create a corpus. Other functions
are relevant if you just want to get the API endpoint or download the
JSON data. If you want the data in a different format, you could just
generate the CSV and then convert that to whatever format you need.

CorPress is intentionally verbose in terms of log output. This is
helpful to record and understand the process of collecting the data.

Most WordPress sites have no more than hundreds or thousands of posts. Using
a Jupyter Notebook could be helpful to view and capture the log data
from running CorPress and scope the corpus.

Here’s a step-by-step description, with discussion of the key
functionality.

First import the
[`corpress`](https://geoffford.nz/corpress/core.html#corpress) function.

``` python
from corpress.core import corpress
```

You are going to need to set a few arguments. The [corpress function is
documented in full](https://geoffford.nz/corpress/core#corpress). Here
I’m breaking it down and showing an example.

- `url`: Set the URL of the WordPress website. CorPress will try to
  determine the endpoint from this.  
- `endpoint_type`: Do you want ‘posts’ or ‘pages’? If you want both, see
  the note on [collecting both posts and
  pages](#collecting-both-posts-and-pages).  
- `corpus_format`: How do you want your corpus saved? ‘txt’ is a
  directory of txt files, ‘csv’ is a single CSV with meta-data and text.

``` python
url = 'https://www.adho.org/'
endpoint_type = 'posts'
corpus_format = 'txt'
```

Set up where and how to save the data. CorPress will try to create
directory paths if they don’t exist.

- `json_save_path` (required): Specify the directory where CorPress will
  save the JSON data. Note: you should set a new path for every new
  WordPress site you collect.  
- `corpus_save_path`: Required for ‘txt’ corpus format, this is where
  the .txt files will be saved. Set as `None` or omit if using ‘csv’
  format.  
- `csv_save_file`:
  - For ‘txt’ corpus format this is optional. This provides a way to
    export meta-data (date, title, link to text etc) for each text in
    the corpus.  
  - For ‘csv’ corpus format this is required. This specifies the file
    where the meta-data and text will be saved.  
- `include_title_in_text`: Depending on the data you are collecting and
  what you want to do with it, you can save the title of the post/page
  as part of the text or not. This is set to `True` by default.  
- `encoding`: The encoding of the exported files. This is set to `utf-8`
  by default.

``` python
json_save_path = '../test_data/example/json/'
corpus_save_path = '../test_data/example/txt/'
csv_save_file = '../test_data/example/metadata.csv'
include_title_in_text = True
encoding = 'utf-8'
```

Set how you query the API:

- `seconds_between_requests`: By default this is set to one request
  every 5 seconds. You can’t specify less than 1 second. If you are
  collecting lots of texts, it may be appropriate to specify a larger
  number of seconds between requests.  
- `headers`: CorPress uses the
  [Requests](https://requests.readthedocs.io/en/latest/) Python Library
  for HTTP requests. You can pass headers you want in HTTP requests
  directly as a `dict`. [See documentation
  here](https://requests.readthedocs.io/en/latest/user/quickstart/#custom-headers).
  The most relevant one is to set a [User-Agent
  header](https://en.wikipedia.org/wiki/User-Agent_header). See the note
  below about how to [set an appropriate
  User-Agent](#set-an-appropriate-user-agent).  
- `params`: The
  [posts](https://developer.wordpress.org/rest-api/reference/posts/#list-posts)
  and
  [pages](https://developer.wordpress.org/rest-api/reference/pages/#list-pages)
  endpoints support a number of parameters. This includes parameters to
  specify a search term, restrict dates and set the way results are
  ordered. Set additional parameters as a `dict`. See the Requests
  library documentation on [passing parameters in
  URLs](https://requests.readthedocs.io/en/latest/user/quickstart/#passing-parameters-in-urls)
  to understand this.  
- `max_pages`: By default CorPress will collect *all* posts (or pages).
  That might not be necessary. Interpret max_pages as the maximum number
  of successful API requests. The REST API normally returns 10
  posts/pages per request, so if you want 100 posts you would set
  max_pages to 10.
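
The `max_pages` arithmetic can be sketched as a small helper. This is only
an illustration: `pages_for` is a hypothetical name, not part of CorPress,
and it assumes the site returns the usual 10 posts/pages per request.

``` python
import math

def pages_for(target_items, items_per_request=10):
    """Return the max_pages value needed to collect at least target_items.

    Hypothetical helper, not part of CorPress. Assumes the API returns
    items_per_request posts/pages per successful request (normally 10).
    """
    if target_items < 1:
        raise ValueError('target_items must be at least 1')
    # Round up so a partly filled final request is still counted.
    return math.ceil(target_items / items_per_request)

suggested_max_pages = pages_for(100)  # 100 posts at 10 per request
print(suggested_max_pages)
```

Because the result is rounded up, asking for 101 posts would suggest 11
requests rather than 10.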

#### Set an appropriate User-Agent

Here’s a suggested format:
`Your Research Project (https://university.edu/webpage)`. See how to set
this below.

``` python
seconds_between_requests = 5
headers = {'User-Agent': 'Your Research Project (https://university.edu/webpage)'}
params = {'search': 'common'} # the search term is arbitrary; set params = None to collect every post
max_pages = None # collect all available data; set to an integer if you want less
```
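
The `params` dict can carry more than a search term. The sketch below
assumes the `after`, `orderby` and `order` parameters described in the
WordPress posts endpoint reference linked above, with an arbitrary date.
The name `date_params` is hypothetical, chosen so it does not overwrite
the `params` used in this walkthrough. The standard library is used here
only to preview the query string; Requests does roughly the equivalent
encoding when the API is actually queried.

``` python
from urllib.parse import urlencode

# Assumed example values: the parameter names come from the WordPress
# REST API reference for the posts endpoint; the date is arbitrary.
date_params = {
    'search': 'common',              # only posts matching this term
    'after': '2020-01-01T00:00:00',  # only posts published after this ISO 8601 date
    'orderby': 'date',               # sort by publication date
    'order': 'asc',                  # oldest first
}

# Preview the query string these parameters would produce.
query = urlencode(date_params)
print(query)
```

To use these, you would pass `params=date_params` in place of
`params=params` when calling the function.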

Now you can call the
[`corpress`](https://geoffford.nz/corpress/core.html#corpress) function
and create a corpus. There will be lots of information logged about
collecting and processing the texts. When completed it will output a
table with a summary of the process and texts collected. This is the
same data returned by the
[`corpress`](https://geoffford.nz/corpress/core.html#corpress) function.

``` python
result = corpress(url=url, 
                  endpoint_type=endpoint_type, 
                  corpus_format=corpus_format, 
                  json_save_path=json_save_path, 
                  corpus_save_path=corpus_save_path, 
                  csv_save_file=csv_save_file, 
                  include_title_in_text=include_title_in_text, 
                  seconds_between_requests=seconds_between_requests, 
                  headers=headers, 
                  params=params, 
                  max_pages=max_pages,
                  encoding=encoding)
```

    2025-05-01 14:45:03 - INFO - Found REST API endpoint link
    2025-05-01 14:45:03 - INFO - Setting posts route https://adho.org/wp-json/wp/v2/posts
    2025-05-01 14:45:03 - INFO - Using JSON save path: ../test_data/example/json/
    2025-05-01 14:45:05 - INFO - Downloading https://adho.org/wp-json/wp/v2/posts?search=common&page=1
    2025-05-01 14:45:05 - INFO - Total pages to retrieve is 3
    2025-05-01 14:45:12 - INFO - Downloading https://adho.org/wp-json/wp/v2/posts?search=common&page=2
    2025-05-01 14:45:18 - INFO - Downloading https://adho.org/wp-json/wp/v2/posts?search=common&page=3
    2025-05-01 14:45:23 - INFO - Creating corpus in txt format
    2025-05-01 14:45:23 - INFO - Using corpus save path: ../test_data/example/txt/
    2025-05-01 14:45:23 - INFO - Creating CSV file for metadata: ../test_data/example/metadata.csv
    2025-05-01 14:45:23 - INFO - Processing JSON: posts-3.json
    2025-05-01 14:45:23 - INFO - Processing JSON: posts-2.json
    2025-05-01 14:45:23 - INFO - Processing JSON: posts-1.json
    2025-05-01 14:45:23 - INFO - CSV file for metadata created: ../test_data/example/metadata.csv


|     | Key                | Value                                             |
|-----|--------------------|---------------------------------------------------|
| 0   | url                | https://www.adho.org/                             |
| 1   | endpoint_url       | https://adho.org/wp-json/wp/v2/posts              |
| 2   | headers            | {'User-Agent': 'Your Research Project (https:/... |
| 3   | params             | {'search': 'common'}                              |
| 4   | get_api_url        | True                                              |
| 5   | get_json           | True                                              |
| 6   | create_corpus      | True                                              |
| 7   | corpus_format      | txt                                               |
| 8   | corpus_save_path   | ../test_data/example/txt/                         |
| 9   | csv_save_file      | ../test_data/example/metadata.csv                 |
| 10  | corpus_texts_count | 29                                                |


You can now preview the data you’ve collected.

``` python
import pandas as pd
pd.set_option('display.max_colwidth', None) # to display full text in pandas dataframe
metadata = pd.read_csv(csv_save_file)
metadata = metadata.sort_values('date')
metadata[['date', 'link', 'title', 'filename']].head(5) # display first 5 rows of metadata, this is not all the fields available
```

|  | date | link | title | filename |
|----|----|----|----|----|
| 8 | 2012-12-06 | https://adho.org/2012/12/06/adho-adopts-creative-commons-license-for-its-web-site/ | ADHO Adopts Creative Commons License for Its Web Site | 2012-12-06-post-382-adho-adopts-creative-commons-license-for-its-web-site.txt |
| 7 | 2013-03-28 | https://adho.org/2013/03/28/apply-to-be-adhos-publications-liaison/ | Apply to be ADHO’s Publications Liaison | 2013-03-28-post-366-apply-to-be-adhos-publications-liaison.txt |
| 6 | 2013-06-23 | https://adho.org/2013/06/23/adho-calls-for-proposals-for-new-special-interest-groups/ | ADHO Calls for Proposals for New Special Interest Groups | 2013-06-23-post-338-adho-calls-for-proposals-for-new-special-interest-groups.txt |
| 5 | 2013-07-09 | https://adho.org/2013/07/09/participate-in-the-joint-adho-and-centernet-agm-at-digital-humanities-2013/ | Participate in the Joint ADHO and centerNet AGM at Digital Humanities 2013 | 2013-07-09-post-408-participate-in-the-joint-adho-and-centernet-agm-at-digital-humanities-2013.txt |
| 4 | 2013-07-14 | https://adho.org/2013/07/14/digital-humanities-2015-to-be-held-in-sydney-australia/ | Digital Humanities 2015 to be held in Sydney, Australia | 2013-07-14-post-288-digital-humanities-2015-to-be-held-in-sydney-australia.txt |



You can view a specific text file (if you used the ‘txt’ format) like
this:

``` python
import os
filename = '2012-12-06-post-382-adho-adopts-creative-commons-license-for-its-web-site.txt'
with open(os.path.join(corpus_save_path, filename), 'r', encoding = 'utf-8') as file:
    text = file.read()   
    print(text)
```

    ADHO Adopts Creative Commons License for Its Web Site

    The Alliance of Digital Humanities Organizations (ADHO) is pleased to announce that all content on its web site is now available under a Creative Commons Attribution (CC-BY) license. This means that individuals and organizations are welcome to re-use and adapt ADHO’s documents and resources, so long as ADHO is cited as the source. Neil Fraistat, Chair of ADHO’s Steering Committee, notes that “this is one of an ongoing series of actions this year that are being designed to make ADHO resources more open and available to the larger community.”
     
    ADHO’s decision to adopt the CC-BY license was prompted by the recognition that through explicitly sharing its work it can have a greater impact, contribute to best practices, and demonstrate its support for open access. Recently the Program Committee for the 2013 Digital Humanities conference  revamped ADHO’s Guidelines for Proposal Authors & Reviewers, making them more inclusive, concrete, and transparent. PC chair Bethany Nowviskie received a request from the organizers of another conference to re-use these guidelines. Prompted by Nowviskie's suggestion, the ADHO Steering Committee determined that not only should the conference guidelines be made freely available, but its entire web site.
     
    In adopting a Creative Commons license for its website, ADHO follows suit with several of its existing publications, including Digital Studies/Le Champ Numerique, Digital Humanities Quarterly, and DH Answers.
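
To get a quick sense of the whole corpus rather than one file, something
like the following could be used. It is a rough sketch:
`summarise_corpus` is a hypothetical helper, not part of CorPress, and
splitting on whitespace gives only a crude token count.

``` python
from pathlib import Path

def summarise_corpus(corpus_dir, encoding='utf-8'):
    """Return (number of .txt files, total whitespace-separated tokens)."""
    file_count = 0
    token_count = 0
    for path in sorted(Path(corpus_dir).glob('*.txt')):
        text = path.read_text(encoding=encoding)
        file_count += 1
        token_count += len(text.split())
    return file_count, token_count

# Usage, assuming corpus_save_path is set as in the example above:
# files, tokens = summarise_corpus(corpus_save_path)
# print(f'{files} texts, {tokens} tokens')
```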

## Collecting both posts and pages

If you want to collect both posts and pages, just invoke the
[`corpress`](https://geoffford.nz/corpress/core.html#corpress) function
twice: once with `endpoint_type` set to `posts` and then with it set to
`pages`.
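
The two calls could be organized along these lines. This is a sketch
under assumptions: `build_runs` is a hypothetical helper, the directory
layout is illustrative, and using a separate JSON directory per endpoint
type is a cautious choice rather than a documented requirement. Only
keyword arguments described in this README are used.

``` python
def build_runs(url, base_dir):
    """Return one set of corpress() keyword arguments per endpoint type."""
    runs = []
    for endpoint_type in ('posts', 'pages'):
        runs.append({
            'url': url,
            'endpoint_type': endpoint_type,
            'corpus_format': 'txt',
            # Separate JSON directories per endpoint type (cautious assumption).
            'json_save_path': f'{base_dir}/json-{endpoint_type}/',
            'corpus_save_path': f'{base_dir}/txt/',
            # A different metadata file for posts and for pages.
            'csv_save_file': f'{base_dir}/metadata-{endpoint_type}.csv',
        })
    return runs

# Usage (makes real requests, so it is left commented out here):
# from corpress.core import corpress
# for kwargs in build_runs('https://www.adho.org/', '../test_data/example'):
#     result = corpress(**kwargs)
```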

If you are outputting in the ‘txt’ corpus format without a metadata file
(i.e. `csv_save_file` set to `None` or omitted from the function call),
you won’t have a problem. The filenames for posts/pages won’t conflict.

If you are specifying a `csv_save_file` - either because you are
outputting in the ‘csv’ corpus format or in the ‘txt’ format and wanting
the meta-data - make sure you use a different `csv_save_file` for
‘posts’ and ‘pages’. You will get two separate files. Combining these
with a library like [Pandas](https://pandas.pydata.org/), which is
installed with CorPress, is straightforward. I will leave it to you to
look up how to merge two CSV files into one using Pandas.
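
As a starting point, one possible approach is sketched below. The
function name, the file names in the usage comment and the added
`endpoint_type` column are all hypothetical; the sketch assumes both CSV
files were written with the same columns.

``` python
import pandas as pd

def combine_metadata(posts_csv, pages_csv, combined_csv, encoding='utf-8'):
    """Concatenate two CSV files from separate runs into a single file."""
    posts = pd.read_csv(posts_csv, encoding=encoding)
    pages = pd.read_csv(pages_csv, encoding=encoding)
    # Record where each row came from (an added column, not a CorPress field).
    posts['endpoint_type'] = 'posts'
    pages['endpoint_type'] = 'pages'
    combined = pd.concat([posts, pages], ignore_index=True)
    combined.to_csv(combined_csv, index=False, encoding=encoding)
    return combined

# Usage with hypothetical file names:
# combined = combine_metadata('metadata-posts.csv', 'metadata-pages.csv',
#                             'metadata-all.csv')
```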

## No REST API available

Here’s an example showing what you will see if no REST API is
accessible.

``` python
# test a site that has no endpoint
result = corpress(url = 'https://www.whitehouse.gov/', 
                endpoint_type='posts',
                corpus_format='txt',
                json_save_path = '../test_data/json/', 
                corpus_save_path = '../test_data/corpus/', 
                max_pages=2)
```

    2025-05-01 14:45:59 - INFO - No REST API endpoint link in markup
    2025-05-01 14:45:59 - INFO - Guessing posts route based on URL https://www.whitehouse.gov/wp-json/wp/v2/posts
    2025-05-01 14:45:59 - INFO - Using JSON save path: ../test_data/json/
    2025-05-01 14:45:59 - INFO - Max pages to retrieve from API is set: 2
    2025-05-01 14:45:59 - INFO - Downloading https://www.whitehouse.gov/wp-json/wp/v2/posts?page=1
    2025-05-01 14:45:59 - ERROR - Error downloading page 1 from https://www.whitehouse.gov/wp-json/wp/v2/posts
    2025-05-01 14:45:59 - ERROR - Status code: 403
    2025-05-01 14:45:59 - ERROR - It appears that this website does not provide access to the REST API. Exiting.
    2025-05-01 14:45:59 - ERROR - Error downloading data. Exiting.


|     | Key                | Value                                          |
|-----|--------------------|------------------------------------------------|
| 0   | url                | https://www.whitehouse.gov/                    |
| 1   | endpoint_url       | https://www.whitehouse.gov/wp-json/wp/v2/posts |
| 2   | headers            | None                                           |
| 3   | params             | None                                           |
| 4   | get_api_url        | True                                           |
| 5   | get_json           | False                                          |
| 6   | create_corpus      | False                                          |
| 7   | corpus_format      | txt                                            |
| 8   | corpus_save_path   | ../test_data/corpus/                           |
| 9   | csv_save_file      | None                                           |
| 10  | corpus_texts_count | 0                                              |
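
The two log messages about finding an endpoint link or guessing the route
can be illustrated with a short sketch. This is not CorPress’s own
implementation: `ApiLinkParser` and `find_route` are hypothetical names,
the page HTML is passed in as a string rather than downloaded, and the
sketch assumes the API root is a plain `/wp-json/` style URL. It relies on
the WordPress convention of advertising the API root in a
`<link rel="https://api.w.org/">` tag.

``` python
from html.parser import HTMLParser
from urllib.parse import urljoin

class ApiLinkParser(HTMLParser):
    """Record the href of the first <link rel="https://api.w.org/"> tag."""

    def __init__(self):
        super().__init__()
        self.api_root = None

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if (tag == 'link' and attrs.get('rel') == 'https://api.w.org/'
                and self.api_root is None):
            self.api_root = attrs.get('href')

def find_route(site_url, html, endpoint_type='posts'):
    """Return (route, found_in_markup) for the posts or pages route."""
    parser = ApiLinkParser()
    parser.feed(html)
    if parser.api_root:
        # Use the API root advertised in the page markup.
        root = parser.api_root
        if not root.endswith('/'):
            root += '/'
        return urljoin(root, f'wp/v2/{endpoint_type}'), True
    # No link found: guess the conventional /wp-json/ location.
    return urljoin(site_url, f'/wp-json/wp/v2/{endpoint_type}'), False

html = '<head><link rel="https://api.w.org/" href="https://adho.org/wp-json/" /></head>'
print(find_route('https://www.adho.org/', html))
print(find_route('https://www.whitehouse.gov/', '<head></head>'))
```

A guessed route is only a candidate. As the log above shows, the site can
still refuse the request.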

