Metadata-Version: 2.1
Name: fundus
Version: 0.1.0
Summary: A very simple news crawler
Author-email: Max Dallabetta <max.dallabetta@googlemail.com>
License: MIT
Project-URL: Repository, https://github.com/flairNLP/fundus
Keywords: web scraping, web crawling
Classifier: Programming Language :: Python :: 3
Classifier: License :: OSI Approved :: MIT License
Classifier: Operating System :: OS Independent
Classifier: Intended Audience :: Developers
Classifier: Intended Audience :: Education
Classifier: Intended Audience :: Science/Research
Requires-Python: >=3.8
Description-Content-Type: text/markdown
License-File: LICENSE
Requires-Dist: python-dateutil~=2.8.2
Requires-Dist: lxml~=4.9.1
Requires-Dist: more-itertools~=9.1.0
Requires-Dist: cssselect~=1.1.0
Requires-Dist: feedparser~=6.0.10
Requires-Dist: colorama~=0.4.4
Requires-Dist: typing-extensions<5.0,>=4.0
Requires-Dist: langdetect~=1.0.9
Requires-Dist: aiohttp~=3.8.4
Requires-Dist: aioitertools~=0.11.0
Requires-Dist: validators~=0.20.0
Provides-Extra: dev
Requires-Dist: pytest~=7.2.2; extra == "dev"
Requires-Dist: mypy~=1.1.1; extra == "dev"
Requires-Dist: isort==5.12.0; extra == "dev"
Requires-Dist: black==23.1.0; extra == "dev"
Requires-Dist: types-lxml~=2023.2.11; extra == "dev"
Requires-Dist: types-python-dateutil~=2.8.19.10; extra == "dev"
Requires-Dist: types-requests~=2.28.11.15; extra == "dev"
Requires-Dist: types-colorama~=0.4.15.8; extra == "dev"

<img alt="alt text" src="resources/fundus_logo.png" width="180"/>

<p align="center">A very simple <b>news crawler</b> in Python.
Developed at <a href="https://www.informatik.hu-berlin.de/en/forschung-en/gebiete/ml-en/">Humboldt University of Berlin</a>.
</p>
<p align="center">
<img alt="version" src="https://img.shields.io/badge/version-0.1-green">
<img alt="python" src="https://img.shields.io/badge/python-3.8-blue">
<img alt="Static Badge" src="https://img.shields.io/badge/license-MIT-green">
</p>
<div align="center">
<hr>

[Quick Start](#quick-start) |  [Tutorials](#tutorials)  | [News Sources](/docs/supported_publishers.md)

</div>


---

Fundus is:

* **A static news crawler.** 
  Fundus lets you crawl online news articles with only a few lines of Python code!

* **An open-source Python package.**
  Fundus is built on the idea of building something together. We welcome your
  contribution to help Fundus [grow](docs/how_to_contribute.md)!

<hr>

## Quick Start

To install Fundus with pip, run:

```
pip install fundus
```

Fundus requires Python 3.8+.
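
If you are unsure whether your interpreter meets this requirement, a quick check along the following lines may help. This is only a minimal sketch: `MIN_PYTHON` and `python_is_supported` are hypothetical names chosen for illustration and are not part of Fundus.

```python
import sys

# Minimum version taken from the requirement stated above (Python 3.8+).
MIN_PYTHON = (3, 8)


def python_is_supported(version_info=sys.version_info) -> bool:
    """Return True if the given (major, minor, ...) version is at least MIN_PYTHON."""
    return tuple(version_info[:2]) >= MIN_PYTHON


if __name__ == "__main__":
    if python_is_supported():
        print("This interpreter is recent enough for Fundus.")
    else:
        print("Fundus needs Python 3.8 or newer.")
```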


## Example 1: Crawl a bunch of English-language news articles

Let's use Fundus to crawl 2 articles from publishers based in the US.

```python
from fundus import PublisherCollection, Crawler

# initialize the crawler for news publishers based in the US
crawler = Crawler(PublisherCollection.us)

# crawl 2 articles and print
for article in crawler.crawl(max_articles=2):
    print(article)
```

That's it!

If you run this code, it should print out something like this:

```console
Fundus-Article:
- Title: "Feinstein's Return Not Enough for Confirmation of Controversial New [...]"
- Text:  "Democrats jammed three of President Joe Biden's controversial court nominees
          through committee votes on Thursday thanks to a last-minute [...]"
- URL:    https://freebeacon.com/politics/feinsteins-return-not-enough-for-confirmation-of-controversial-new-hampshire-judicial-nominee/
- From:   FreeBeacon (2023-05-11 18:41)

Fundus-Article:
- Title: "Northwestern student government freezes College Republicans funding over [...]"
- Text:  "Student government at Northwestern University in Illinois "indefinitely" froze
          the funds of the university's chapter of College Republicans [...]"
- URL:    https://www.foxnews.com/us/northwestern-student-government-freezes-college-republicans-funding-poster-critical-lgbtq-community
- From:   FoxNews (2023-05-09 14:37)
```

This printout tells you that you successfully crawled two articles!

For each article, the printout details:
- the "Title" of the article, i.e. its headline 
- the "Text", i.e. the main article body text
- the "URL" from which it was crawled
- the news source it is "From"


## Example 2: Crawl a specific news source

Maybe you want to crawl a specific news source instead. Let's crawl news articles from The Washington Times only:

```python
from fundus import PublisherCollection, Crawler

# initialize the crawler for Washington Times
crawler = Crawler(PublisherCollection.us.WashingtonTimes)

# crawl 2 articles and print
for article in crawler.crawl(max_articles=2):
    print(article)
```
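
Both examples follow the same pattern: build a `Crawler`, call `crawl(max_articles=...)`, and iterate over the result. If you would rather keep the articles than print them, a small helper like the one below could collect them into a list. This is a sketch under stated assumptions, not part of the Fundus API: `collect_articles` and `FakeCrawler` are hypothetical names, and `FakeCrawler` only mimics the `crawl(max_articles=...)` call shown above so that the snippet runs without network access.

```python
from typing import Any, Iterator, List


class FakeCrawler:
    """Hypothetical stand-in that mimics `crawler.crawl(max_articles=...)`.

    It yields plain strings instead of real articles so the sketch runs offline.
    """

    def __init__(self, headlines: List[str]) -> None:
        self.headlines = headlines

    def crawl(self, max_articles: int) -> Iterator[str]:
        # Yield at most `max_articles` items, like the loop in the examples.
        yield from self.headlines[:max_articles]


def collect_articles(crawler: Any, max_articles: int) -> List[Any]:
    """Hypothetical helper: gather crawled articles into a list."""
    return list(crawler.crawl(max_articles=max_articles))


if __name__ == "__main__":
    demo = FakeCrawler(["first headline", "second headline", "third headline"])
    print(collect_articles(demo, max_articles=2))
```

With a real crawler, you would pass `Crawler(PublisherCollection.us)` in place of the stand-in; that variant needs network access and returns whatever article objects `crawl` yields.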

## Tutorials

We provide **quick tutorials** to get you started with the library:

1. [**Tutorial 1: How to crawl news with Fundus**](docs/1_getting_started.md)
2. [**Tutorial 2: The Article Class**](docs/2_the_article_class.md)
3. [**Tutorial 3: How to filter articles**](docs/3_how_to_filter_articles.md)
4. [**Tutorial 4: How to search for publishers**](docs/4_how_to_search_for_publishers.md)

If you wish to contribute, check out these tutorials:
1. [**How to contribute**](docs/how_to_contribute.md)
2. [**How to add a publisher**](docs/how_to_add_a_publisher.md)

## Currently Supported News Sources

You can find the publishers currently supported [**here**](/docs/supported_publishers.md).

Also: **Adding a new publisher is easy - consider contributing to the project!**

## Contact

Please email your questions or comments to [**Max Dallabetta**](mailto:max.dallabetta@googlemail.com?subject=[GitHub]%20Fundus).

## Contributing

Thanks for your interest in contributing! There are many ways to get involved;
start with our [contributor guidelines](docs/how_to_contribute.md) and then
check these [open issues](https://github.com/flairNLP/fundus/issues) for specific tasks.

## License

[MIT](LICENSE)
