Metadata-Version: 2.1
Name: pyaiocrawler
Version: 0.2.2
Summary: Asynchronous web crawler built on asyncio
Home-page: https://github.com/tapanpandita/aiocrawler
Author: Tapan Pandita
Author-email: tapan.pandita@gmail.com
License: MIT
Platform: UNKNOWN
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3.7
Classifier: License :: OSI Approved :: MIT License
Classifier: Operating System :: OS Independent
Classifier: Intended Audience :: Developers
Classifier: Framework :: AsyncIO
Description-Content-Type: text/markdown
Requires-Dist: aiohttp
Requires-Dist: beautifulsoup4
Requires-Dist: cchardet
Requires-Dist: aiodns

AIOCrawler
==========
[![Build Status](https://travis-ci.org/tapanpandita/aiocrawler.svg?branch=master)](https://travis-ci.org/tapanpandita/aiocrawler)
[![Coverage Status](https://coveralls.io/repos/github/tapanpandita/aiocrawler/badge.svg?branch=master)](https://coveralls.io/github/tapanpandita/aiocrawler?branch=master)
[![Codacy Badge](https://api.codacy.com/project/badge/Grade/eab04685503c490082f1c6a545c4016e)](https://www.codacy.com/app/tapanpandita/aiocrawler?utm_source=github.com&amp;utm_medium=referral&amp;utm_content=tapanpandita/aiocrawler&amp;utm_campaign=Badge_Grade)
[![PyPI version](https://badge.fury.io/py/pyaiocrawler.svg)](https://badge.fury.io/py/pyaiocrawler)

Asynchronous web crawler built on [asyncio](https://docs.python.org/3/library/asyncio.html)

Installation
------------
```shell
pip install pyaiocrawler
```
Usage
-----
### Generating sitemap
```python
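import asyncio
from aiocrawler import SitemapCrawler

async def main():
    crawler = SitemapCrawler('https://www.google.com', depth=3)
    return await crawler.get_results()

sitemap = asyncio.run(main())  # `await` only works inside a coroutine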
```
### Configuring the crawler
```python
from aiocrawler import SitemapCrawler

crawler = SitemapCrawler(
    init_url='https://www.google.com', # The base URL to start crawling from
    depth=3,                           # Maximum link depth to crawl
    concurrency=300,                   # Maximum number of concurrent requests
    user_agent='My Amazing Crawler',   # User-Agent header sent with each request
)
```
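
To give a feel for what a `concurrency` cap means in practice, here is a minimal sketch of one common asyncio pattern: a semaphore that limits how many requests are in flight at once. This is an illustration only; this README does not say how the library enforces the limit, and `fetch_stub` and `crawl_stub` are hypothetical names that sleep instead of making HTTP requests.

```python
import asyncio
from typing import Dict, List


async def fetch_stub(url: str, limiter: asyncio.Semaphore, stats: Dict[str, int]) -> str:
    # Hypothetical stand-in for an HTTP request: it only sleeps briefly.
    async with limiter:
        stats['active'] += 1
        stats['peak'] = max(stats['peak'], stats['active'])
        await asyncio.sleep(0.01)
        stats['active'] -= 1
        return url


async def crawl_stub(urls: List[str], concurrency: int) -> int:
    # At most `concurrency` calls to fetch_stub run inside the semaphore at once.
    limiter = asyncio.Semaphore(concurrency)
    stats = {'active': 0, 'peak': 0}
    await asyncio.gather(*(fetch_stub(url, limiter, stats) for url in urls))
    return stats['peak']


urls = [f'https://example.com/{i}' for i in range(10)]
peak = asyncio.run(crawl_stub(urls, concurrency=3))
print(peak)  # highest number of simultaneous "requests"; never more than 3
```

A lower cap is gentler on the target site, while a higher cap finishes large crawls sooner.
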
### Extending the crawler
To create your own crawler, subclass `AIOCrawler` and implement the `parse` method. `parse` is called for every crawled page with the URL of the page, the links found on that page, and the HTML of the page. Each return value of `parse` is appended to a list, which is available by calling the `get_results` method. The example crawler below extracts the title from each page.
```python
import asyncio
from typing import Optional, Set, Tuple

from aiocrawler import AIOCrawler
from bs4 import BeautifulSoup          # BeautifulSoup extracts the title from the HTML


class TitleScraper(AIOCrawler):
    '''
    Subclasses AIOCrawler to extract the titles of the pages on the given domain
    '''

    def parse(self, url: str, links: Set[str], html: bytes) -> Tuple[str, Optional[str]]:
        '''
        Returns the url of the page and its title (None if the page has no title)
        '''
        soup = BeautifulSoup(html, 'html.parser')
        title_tag = soup.find('title')  # None when the page has no <title> tag
        title = title_tag.string if title_tag is not None else None
        return url, title


async def main():
    crawler = TitleScraper('https://www.google.com', 3)
    return await crawler.get_results()


titles = asyncio.run(main())  # `await` only works inside a coroutine
```
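
Since the results are a list of whatever `parse` returned, they can be post-processed with ordinary Python. The sketch below assumes the results of `TitleScraper` are `(url, title)` tuples, as described above, and turns them into a lookup table. `titles_by_url` and the sample data are hypothetical and stand in for the output of a real crawl.

```python
from typing import Dict, List, Optional, Tuple


def titles_by_url(results: List[Tuple[str, Optional[str]]]) -> Dict[str, str]:
    '''
    Maps each crawled url to its stripped title, skipping pages without one
    '''
    lookup: Dict[str, str] = {}
    for url, title in results:
        if title and title.strip():
            lookup[url] = title.strip()
    return lookup


# Hypothetical sample standing in for the list returned by get_results()
sample_results = [
    ('https://example.com/', ' Example Domain '),
    ('https://example.com/about', None),
]
print(titles_by_url(sample_results))  # {'https://example.com/': 'Example Domain'}
```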


