Metadata-Version: 2.0
Name: tns
Version: 1.0.0
Summary: Top News Selectors (tns): Top news parsing from select websites
Home-page: https://github.com/oduwsdl/top-news-selectors
Author: Grant Atkins
Author-email: gatkins@cs.odu.edu
License: MIT
Download-URL: https://github.com/oduwsdl/top-news-selectors
Description-Content-Type: UNKNOWN
Keywords: html web tns odu memento
Platform: UNKNOWN
Classifier: Development Status :: 4 - Beta
Classifier: Programming Language :: Python :: 3
Classifier: License :: OSI Approved :: MIT License
Classifier: Intended Audience :: Developers
Classifier: Intended Audience :: Information Technology
Classifier: Intended Audience :: Science/Research
Classifier: Topic :: Text Processing :: Markup :: HTML
Provides: tns
Requires-Dist: beautifulsoup4

# Top-news-selectors (tns)

A static HTML parser that extracts the top story titles and URIs from the following websites:

- https://www.washingtonpost.com/
- http://www.foxnews.com
- http://abcnews.go.com/
- https://www.nytimes.com/
- https://www.usatoday.com/
- https://www.cbsnews.com/
- http://www.chicagotribune.com/
- https://www.nbcnews.com/
- http://www.latimes.com/
- https://www.npr.org/
- https://www.wsj.com/

This parser is built on [Beautiful Soup 4](https://www.crummy.com/software/BeautifulSoup/bs4/doc/) and relies on CSS selectors found in each site's HTML documents.
The selectors were chosen from those present on each site during the 30 days of **November 2016**.

This package has not been tested outside November 2016 (11/01/2016 - 11/30/2016).
Use at your own risk outside that range.

## Install

This package is available via pip:

```
$ pip install tns
```

## Usage

```python
>>> from tns import SiteParser
>>> html = open('npr.html').read()
>>> parser = SiteParser(html)
>>> parser.npr()
{
    'hero_text': 'foo',
    'hero_link': 'http://bar.com',
    'headlines': [{'splash_title': 'baz', 'link': 'http://qux.com'}]
}
>>> parser.washingtonpost()
{'hero_text': '', 'hero_link': '', 'headlines': []}
```

The keys `hero_text` and `hero_link` refer to the topmost news story, usually identified by an enlarged picture or text across the top of the page.
The `headlines` list holds the headlines that follow the hero article. Each `splash_title` is the title shown on the site's home page, which is not necessarily the actual title of the article.

The second call, `parser.washingtonpost()`, returns empty values because the document passed to `SiteParser` was not a Washington Post page.
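Because every parser method returns the same dictionary shape, a result can be checked and flattened with ordinary dictionary access. The sketch below assumes only the three keys shown in the usage example. The helper names `is_empty_result` and `iter_stories` are illustrative and are not part of `tns`.

```python
def is_empty_result(result):
    """Return True if a parser result has no hero story and no headlines."""
    return not (
        result.get('hero_text')
        or result.get('hero_link')
        or result.get('headlines')
    )


def iter_stories(result):
    """Yield (title, link) pairs: the hero story first, then each headline."""
    if result.get('hero_text') or result.get('hero_link'):
        yield result.get('hero_text', ''), result.get('hero_link', '')
    for headline in result.get('headlines', []):
        yield headline.get('splash_title', ''), headline.get('link', '')


# Sample data in the shape shown in the usage example (not real output).
sample = {
    'hero_text': 'foo',
    'hero_link': 'http://bar.com',
    'headlines': [{'splash_title': 'baz', 'link': 'http://qux.com'}],
}
stories = list(iter_stories(sample))
```

With a real parser, `sample` would be replaced by the return value of a call such as `parser.npr()`.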

Each site is paired with a parser method:

```python
"http://abcnews.go.com/":          parser.abcnews()
"https://www.cbsnews.com/":        parser.cbsnews()
"https://www.nbcnews.com/":        parser.nbcnews()
"https://www.washingtonpost.com/": parser.washingtonpost()
"https://www.npr.org/":            parser.npr()
"http://www.latimes.com/":         parser.latimes()
"https://www.usatoday.com/":       parser.usatoday()
"https://www.wsj.com/":            parser.wsj()
"https://www.nytimes.com/":        parser.nytimes()
"http://www.foxnews.com":          parser.foxnews()
"http://www.chicagotribune.com/":  parser.chicagotribune()
```
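When the source URI of a page is known, the pairing above can be used to pick the method automatically. The following is a minimal sketch, not part of `tns`: the `PARSER_METHODS` table and the `parse_site` helper are hypothetical names, and the lookup assumes the hostname matches one of the listed sites exactly.

```python
from urllib.parse import urlparse

# Hostname -> SiteParser method name, copied from the pairing above.
PARSER_METHODS = {
    'abcnews.go.com': 'abcnews',
    'www.cbsnews.com': 'cbsnews',
    'www.nbcnews.com': 'nbcnews',
    'www.washingtonpost.com': 'washingtonpost',
    'www.npr.org': 'npr',
    'www.latimes.com': 'latimes',
    'www.usatoday.com': 'usatoday',
    'www.wsj.com': 'wsj',
    'www.nytimes.com': 'nytimes',
    'www.foxnews.com': 'foxnews',
    'www.chicagotribune.com': 'chicagotribune',
}


def parse_site(parser, uri):
    """Call the parser method paired with the hostname of `uri`.

    `parser` is expected to be a SiteParser built from that page's HTML.
    Raises ValueError if the hostname is not one of the supported sites.
    """
    host = urlparse(uri).netloc.lower()
    if host not in PARSER_METHODS:
        raise ValueError('no parser method known for %r' % uri)
    return getattr(parser, PARSER_METHODS[host])()
```

For example, `parse_site(SiteParser(html), 'https://www.npr.org/')` would be equivalent to calling `SiteParser(html).npr()`.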

## Debugging

When the parser fails to find specific headlines or returns no headlines at all, the cause may be one of the following:

- Iframes loading content dynamically
- Headlines being injected via JavaScript
- The wrong parser being used for the specified site
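
One way to check for the last cause is to try every parser method on the same document and see which ones return anything. This is a rough sketch, not part of `tns`: `matching_parsers` is a hypothetical helper, and it assumes that a method called on a non-matching document returns the empty structure shown in the Usage section rather than raising.

```python
# Method names taken from the site/method pairing in the Usage section.
PARSER_NAMES = (
    'abcnews', 'cbsnews', 'nbcnews', 'washingtonpost', 'npr', 'latimes',
    'usatoday', 'wsj', 'nytimes', 'foxnews', 'chicagotribune',
)


def matching_parsers(parser):
    """Return the names of parser methods that found a hero story or headlines."""
    matches = []
    for name in PARSER_NAMES:
        method = getattr(parser, name, None)
        if method is None:
            continue  # Skip names the given object does not provide.
        result = method()
        if (result.get('hero_text') or result.get('hero_link')
                or result.get('headlines')):
            matches.append(name)
    return matches
```

If a different method than expected matches, the wrong parser was probably used. If no method matches, the headlines are more likely loaded through an iframe or JavaScript, or the page dates from outside the tested range.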


