Metadata-Version: 2.4
Name: web-observatory
Version: 1.2.3
Summary: Python package for collecting and analyzing webpages
Author-email: Eric Nost <enost@uoguelph.ca>
Requires-Python: >=3.8
Description-Content-Type: text/markdown
Classifier: Programming Language :: Python :: 3
Classifier: License :: OSI Approved :: GNU General Public License v3 (GPLv3)
Classifier: Operating System :: OS Independent
License-File: LICENSE
Requires-Dist: google_api_python_client==2.44.0
Requires-Dist: items==0.6.5
Requires-Dist: pandas>=1.3.4
Requires-Dist: psycopg2==2.8.6
Requires-Dist: Requests==2.31.0
Requires-Dist: Scrapy==2.6.1
Requires-Dist: tldextract==3.2.0
Requires-Dist: wayback>=0.3.2
Requires-Dist: pyopenssl==22.0.0
Requires-Dist: cryptography<38
Requires-Dist: Twisted==22.10.0
Project-URL: Homepage, https://github.com/ericnost/observatory
Project-URL: Issues, https://github.com/ericnost/observatory/issues

# web-observatory
[![Download Latest Version from PyPI](https://img.shields.io/pypi/v/web-observatory.svg)](https://pypi.python.org/pypi/web-observatory)

*web-observatory* is a Python package for collecting and analyzing webpages.

See [here](https://github.com/ericnost/digital_conservation) for extended examples of `web-observatory` in use.

Modules
--------------------------
### `start_project`
Initializes a project directory.

### `search_google`
Searches Google for terms. Google Custom Search Engine credentials required.

### `google_process`
Compiles results from multiple Google searches.

### `get_domains`
Extracts domain-level information from the URLs returned by Google searches (e.g. 'google' in www.google.com).
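
As a rough illustration of what "domain-level information" means, here is a minimal sketch that uses only the standard library. It is not the package's implementation: `naive_domain` is a hypothetical name, and the sketch assumes a single-label suffix such as `.com`. The package lists `tldextract` as a dependency, which handles multi-part suffixes such as `.co.uk` properly.

```python
from urllib.parse import urlparse


def naive_domain(url):
    """Return the label just before the final suffix ('google' for www.google.com).

    Assumption: the suffix is a single label (.com, .org). Multi-part
    suffixes such as .co.uk need a public-suffix-aware library.
    """
    # urlparse only fills in the host when a scheme or '//' is present
    if "//" not in url:
        url = "//" + url
    host = urlparse(url).hostname or ""
    labels = host.split(".")
    return labels[-2] if len(labels) >= 2 else host


print(naive_domain("https://www.google.com/search?q=conservation"))
print(naive_domain("www.google.com"))
```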

### `initialize_crawl`
Initializes a Scrapy crawl on a set of domains. Returns a JSON file of URLs found through the crawl.

### `crawl_process`
Processes the JSON output of a crawl into a pandas DataFrame.

### `crawl`
Not yet implemented as a module, but a crawl can be run from a notebook with a command like `!scrapy crawl digcon_crawler -O output.json --nolog`.

### `search_merge`
Merges Google searches and crawl results.

### `get_versions`
~Gets historical versions of Twitter-searched urls using the Internet Archive's Wayback Machine. Attempts to find the version of the page archived closest in time to when it was tweeted.~ \
Uses the `requests` package to request each URL and get its "full" address rather than a redirect (e.g. bit.ly/12312). This helps in scraping.

### `initialize_scrape`
Initializes the files needed to scrape URLs for their HTML.

### `scrape`
Conducts the scrape of pages' HTML. Stores body text in a PostgreSQL database.

### `query`
A set of methods for searching the PostgreSQL database of site text, including filtering empty results and counting specified search terms.
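
The two operations named above, filtering empty results and counting terms, can be sketched roughly as follows. This is an illustration under stated assumptions, not the package's code: `count_terms` is a hypothetical helper, the `pages` list stands in for rows read from the database, and matching is assumed to be case-insensitive and whole-word (so "Drones" does not count as "drone").

```python
import re


def count_terms(text, terms):
    """Count case-insensitive whole-word occurrences of each term in text."""
    counts = {}
    for term in terms:
        pattern = r"\b" + re.escape(term) + r"\b"
        counts[term] = len(re.findall(pattern, text, flags=re.IGNORECASE))
    return counts


# Hypothetical rows standing in for (url, body_text) records from the database
pages = [
    ("https://example.org/a", "Drones map forests. Drone data helps conservation."),
    ("https://example.org/b", ""),
]
terms = ["drone", "conservation"]

# Drop pages with no stored text, then count terms in the rest
results = {url: count_terms(text, terms) for url, text in pages if text.strip()}
print(results)
```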

### `ground_truth`
Produces a sample of pages for verifying counts of terms.

### `analyze_orgs`
Calculates and visualizes averages and frequencies for each search term in the site text and summarizes by organization (domain).
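
A minimal sketch of a per-domain summary, leaving out the visualization. It assumes "average" means the mean count of a term per page and "frequency" means the share of pages where the term appears at least once; the package may define these differently. The data and variable names are hypothetical.

```python
from collections import defaultdict

# Hypothetical per-page counts of one search term: (domain, count)
page_counts = [
    ("example.org", 3),
    ("example.org", 0),
    ("example.com", 1),
    ("example.com", 1),
]

# Group the per-page counts by domain
by_domain = defaultdict(list)
for domain, count in page_counts:
    by_domain[domain].append(count)

summary = {}
for domain, counts in by_domain.items():
    summary[domain] = {
        # Mean number of occurrences per page
        "average": sum(counts) / len(counts),
        # Share of pages that mention the term at least once
        "frequency": sum(1 for c in counts if c > 0) / len(counts),
    }
print(summary)
```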

### `analyze_currentuse`
Calculates current averages and frequencies. Useful when dealing with historical page versions.

### `analzye_term_correlations`
Calculates and visualizes covariance metrics for specified search terms in the site text.
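
For reference, the sample covariance of two terms' per-page counts can be computed as in the sketch below. Which metrics the package actually reports is not stated here, so treat this as one plausible example. `sample_covariance` and the count lists are hypothetical.

```python
def sample_covariance(xs, ys):
    """Sample covariance (n - 1 denominator) of two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length, at least 2 items")
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    return sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / (n - 1)


# Hypothetical counts of two terms on the same four pages
drone_counts = [3, 0, 1, 4]
satellite_counts = [2, 0, 1, 5]

print(sample_covariance(drone_counts, satellite_counts))
```

A positive value suggests the two terms tend to appear on the same pages; a value near zero suggests no linear relationship.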

### `analyze_association`
Measures associations between terms as the percentage of pages they share.
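
One way to read "percentage of shared pages" is sketched below: of the pages that contain either term, the share that contain both. That denominator is an assumption, and the package may use a different one (for example, all pages in the database). `shared_page_percent` and the page identifiers are hypothetical.

```python
def shared_page_percent(pages_a, pages_b):
    """Percent of pages containing either term that contain both terms."""
    a, b = set(pages_a), set(pages_b)
    either = a | b
    if not either:
        return 0.0
    return 100 * len(a & b) / len(either)


# Hypothetical sets of page identifiers in which each term appears
drone_pages = {"p1", "p2", "p3"}
satellite_pages = {"p2", "p3", "p4"}

print(shared_page_percent(drone_pages, satellite_pages))
```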

### `co_occurrence`
Returns the specific pages that use two or more of the specified search terms.
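
The idea can be sketched as a filter over per-page term counts. This is an illustration only: `co_occurring_pages`, its `minimum` parameter, and the sample data are hypothetical, not the package's API.

```python
def co_occurring_pages(term_counts, terms, minimum=2):
    """Return URLs whose text contains at least `minimum` of the given terms."""
    matches = []
    for url, counts in term_counts.items():
        present = [t for t in terms if counts.get(t, 0) > 0]
        if len(present) >= minimum:
            matches.append(url)
    return matches


# Hypothetical per-page counts keyed by URL
term_counts = {
    "https://example.org/a": {"drone": 2, "satellite": 1, "lidar": 0},
    "https://example.org/b": {"drone": 1, "satellite": 0, "lidar": 0},
    "https://example.org/c": {"drone": 1, "satellite": 4, "lidar": 2},
}

print(co_occurring_pages(term_counts, ["drone", "satellite", "lidar"]))
```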

Issues and Development
--------------------------
See: [web-observatory project](https://github.com/users/ericnost/projects/3/views/1)

