Metadata-Version: 2.4
Name: amina-page-scraper
Version: 1.2.0
Summary: Lightweight web page crawler, extractor, and pipeline
Author-email: Emad Ataollahi <aminagr.3938@gmail.com>
License: MIT
Project-URL: Homepage, https://amina-group.com
Requires-Python: >=3.12
Description-Content-Type: text/markdown
Requires-Dist: beautifulsoup4>=4.14.3
Requires-Dist: certifi>=2026.1.4
Requires-Dist: charset-normalizer>=3.4.4
Requires-Dist: idna>=3.11
Requires-Dist: requests>=2.32.5
Requires-Dist: soupsieve>=2.8.3
Requires-Dist: typing_extensions>=4.15.0
Requires-Dist: urllib3>=2.6.3

# Amina Group Page Scraper
This package helps you extract SEO information from a web page.
It collects the following types of information:
## JSON-LD schema info
JSON-LD schema is a graph that describes what a web page is and what you can find on it.
We use detectors to find this information on the page and categorize it.
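
As a rough illustration of what a detector has to walk through, the sketch below flattens a JSON-LD document into individual nodes. It is not the package's own code: `iter_nodes` is a hypothetical helper, and it assumes nodes sit either at the top level, in a list, or under `@graph`.

```python
import json
from typing import Any, Iterator


def iter_nodes(document: Any) -> Iterator[dict]:
    """Yield every JSON-LD node found in a parsed document (hypothetical helper)."""
    if isinstance(document, list):
        # A script tag may hold a list of separate nodes.
        for item in document:
            yield from iter_nodes(item)
    elif isinstance(document, dict):
        if "@graph" in document:
            # A graph wrapper holds the real nodes under "@graph".
            yield from iter_nodes(document["@graph"])
        else:
            yield document


# Sample data made up for this sketch.
sample = json.loads(
    '{"@context": "https://schema.org", "@graph": ['
    '{"@type": "Organization", "name": "Example Org"}, {"@type": "WebPage"}]}'
)
node_types = [node.get("@type") for node in iter_nodes(sample)]
```

Each yielded node could then be handed to a detector that decides its category.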
### Article
The following types found on a page are identified as article:
 - "Article"
 - "NewsArticle"
 - "TechArticle"
 - "ScholarlyArticle"
 - "BlogPosting"
 - "Report"
 - "Blog"
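
The check behind this list can be sketched as a simple set lookup. This is only an assumption about how such a detector might work, not the actual `ArticleDetector`; `ARTICLE_TYPES` and `is_article` are hypothetical names.

```python
# Hypothetical sketch, not the package's ArticleDetector.
ARTICLE_TYPES = {
    "Article",
    "NewsArticle",
    "TechArticle",
    "ScholarlyArticle",
    "BlogPosting",
    "Report",
    "Blog",
}


def is_article(node: dict) -> bool:
    """Return True if the JSON-LD node's @type is one of the article types."""
    node_type = node.get("@type")
    # JSON-LD allows @type to be a single string or a list of strings.
    types = node_type if isinstance(node_type, list) else [node_type]
    return any(t in ARTICLE_TYPES for t in types)
```

For instance, `is_article({"@type": "BlogPosting"})` is true, while a node typed only as `"Person"` is not an article.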

### Contact and About

For **_contact and about pages_** we detect:
 - "ContactPage"
 - "AboutPage"

### Domain and Organization

For pages at the root of the domain, we collect domain and organization info, which specify the brand of the page.

For example, the **_name_** of an organization that has a url is identified as the brand name of the website.

### Other Types

We also identify these other JSON-LD schema types:

 - navigation
 - person
 - product
 - service
 - category

All of these are gathered as **ENTITY**.

### Entity

Entity is a data type we provide, with these fields:

 - id: (string)
 - name: (string) **can be null**
 - type: (string) *this field is set by the type detectors explained above*
 - alternate_name: (string) **can be null**
 - url: (string) **can be null**
 - description: (string) **can be null**
 - raw: (dictionary)
 - date_published: (date time) **can be null**
 - date_modified: (date time) **can be null**
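
To make the shape of these fields concrete, here is a minimal sketch as a dataclass. It is not the package's real `Entity` class: the name `EntitySketch`, the defaults, and the field order are assumptions made only for illustration.

```python
from dataclasses import dataclass, field
from datetime import datetime
from typing import Any, Dict, Optional


@dataclass
class EntitySketch:
    """Hypothetical stand-in that mirrors the fields listed above."""

    id: str
    type: str  # set by a type detector, e.g. "article"
    name: Optional[str] = None
    alternate_name: Optional[str] = None
    url: Optional[str] = None
    description: Optional[str] = None
    # Assumption: raw holds the original JSON-LD node; empty dict by default.
    raw: Dict[str, Any] = field(default_factory=dict)
    date_published: Optional[datetime] = None
    date_modified: Optional[datetime] = None


# Sample values made up for this sketch.
entity = EntitySketch(
    id="https://example.com/post#article",
    type="article",
    name="Example post",
    date_published=datetime.fromisoformat("2024-01-15T10:00:00+00:00"),
)
```

Fields marked as nullable default to `None`, so only `id` and `type` are required in this sketch.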

### usage

```python
# import the entity pipeline
from page_scraper import EntityPipeline

# import the detectors you need
from page_scraper import ArticleDetector, DomainDetector

# import the HTTP fetcher and parser to fetch the page and parse its HTML
from page_scraper import HttpFetcher, PageParser

# import the helper that creates a page instance
from page_scraper import build_page_context

# define url
url = "https://amina-group.com/"

# get and parse the page to create a Page instance
fetcher = HttpFetcher()
parser = PageParser()
fetched_url = fetcher.get(url)
parsed_info = parser.parse(fetched_url)
# create page instance using build_page_context
page = build_page_context(parsed_info)

# create pipeline with detectors you want
pipeline = EntityPipeline(
    [
        ArticleDetector(),
        DomainDetector(),
    ]
)

# run the pipeline using page
pipeline.run(page)
```
## Link scrape
We let you classify the links on a page, but before using this you need to add a BeautifulSoup object as the soup property of the page.
### BeautifulSoup instance


```python
from bs4 import BeautifulSoup

page.soup = BeautifulSoup(fetched_url.content, "html.parser")
```
### links of the page
Every url on the page (*except the links in JSON-LD*) is stored as the **UrlContext** data type.

The main url types are:
 - canonical (*also stored as the canonical property of the page*)
 - navigation
 - internal

You can use the *_scraper pipeline_* to get each or all of these urls.
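
As an illustration of what "internal" can mean, the sketch below resolves a link against the page url and compares hosts. This is an assumption, not the logic of `InternalScraper`; `is_internal` is a hypothetical helper, and it treats subdomains as external.

```python
from urllib.parse import urljoin, urlparse


def is_internal(page_url: str, href: str) -> bool:
    """Return True if href points to the same host as page_url (hypothetical helper)."""
    # Resolve relative links such as "/about" against the page url first.
    absolute = urljoin(page_url, href)
    return urlparse(absolute).netloc == urlparse(page_url).netloc
```

With the page url `https://amina-group.com/`, a link like `/about` counts as internal, while `https://example.com/` does not.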

### usage
```python
# import pipeline
from page_scraper import ScraperPipeline

# import scrapers
from page_scraper import GetNavLinks, InternalScraper, CanonicalScraper

# create pipeline
scrape_pipeline = ScraperPipeline([
    GetNavLinks(),
    InternalScraper(),
    CanonicalScraper(),
])

# run the pipeline on the page; remember that the page soup property must be set
scrape_pipeline.run(page)
```

## Content Scrape
From the content we collect the headings (_*h1, h2, h3, h4, h5, h6*_) and the media links, typed as image, video, or audio.

### headings
Headings have their own data type, *_PageHeading_*; after scraping, each heading is stored as a **PageHeading** instance.

The page has a property named headings (**page.headings**), and the heading scraper stores the headings of the page in this property.
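
The idea of heading collection can be sketched with the standard library alone. This is not the package's `HeadingScraper` (which works on the BeautifulSoup object); `HeadingSketch`, `HeadingCollector`, and `collect_headings` are hypothetical names, and the `level` and `text` fields are assumptions about what a heading record might hold.

```python
from dataclasses import dataclass
from html.parser import HTMLParser
from typing import List, Optional

HEADING_TAGS = {"h1", "h2", "h3", "h4", "h5", "h6"}


@dataclass
class HeadingSketch:
    """Hypothetical stand-in for a heading record."""

    level: int
    text: str


class HeadingCollector(HTMLParser):
    """Collects the text of every h1..h6 element, in document order."""

    def __init__(self) -> None:
        super().__init__()
        self.headings: List[HeadingSketch] = []
        self._level: Optional[int] = None
        self._parts: List[str] = []

    def handle_starttag(self, tag, attrs):
        if tag in HEADING_TAGS:
            self._level = int(tag[1])
            self._parts = []

    def handle_data(self, data):
        # Only keep text while we are inside a heading element.
        if self._level is not None:
            self._parts.append(data)

    def handle_endtag(self, tag):
        if tag in HEADING_TAGS and self._level is not None:
            text = "".join(self._parts).strip()
            self.headings.append(HeadingSketch(self._level, text))
            self._level = None


def collect_headings(html: str) -> List[HeadingSketch]:
    collector = HeadingCollector()
    collector.feed(html)
    collector.close()
    return collector.headings
```

Calling `collect_headings(html)` returns one record per heading, with nested inline tags such as `<b>` flattened into the heading text.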

### media tags
Media tags are also urls and are treated as **UrlContext** as well. The media types are:
- image
- video
- audio

Each media tag is defined in the url context as {media_name}_TAG, for example **IMAGE_TAG** for an image.

You can get the videos of a page in 3 different ways:
1. video tag
2. iframe
3. embed or object tag

*NOTE: you may not get iframes with a normal url fetcher, because many iframes are only loaded when JavaScript is available.*

### usage
```python
# import pipeline
from page_scraper import ContentPipeline

# import scrapers
from page_scraper import (
    HeadingScraper, ImageScraper, VideoTagScraper, AudioTagScraper,
    VideoIframeScraper, VideoEmbedScraper,
)

# create pipeline
content_pipeline = ContentPipeline([
    HeadingScraper(),
    ImageScraper(),
    VideoTagScraper(),
    AudioTagScraper(),
    VideoIframeScraper(),
    VideoEmbedScraper()
])

# run the pipeline on the page; remember that the page soup property must be set
content_pipeline.run(page)
```

## Meta Scrape

In meta scrape we get the required page meta, such as the robots status of the page (*index and follow*), the *meta title and meta description*, the social tags (*og:*), the page type, and the site name.

Some scrapers of this pipeline directly affect properties of PageContext, like _is_title_, which indicates that a title tag was found on the page.

First, let's explain each scraper.
### Title Scraper
This scraper tries to get the page title from the title tag.

If a title tag is found on the page, the is_title property of the page is set to true (*it's false by default*).

### Description scraper
This scraper tries to find the meta tag with the name "description".

If a meta description is found on the page, the is_description property of the page is set to true (*it's false by default*).

### Robots scraper
Sets the is_index and is_follow properties of the page.
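
How a robots meta value maps to these two flags can be sketched as follows. This is not the package's `RobotScraper`; `parse_robots` is a hypothetical helper that assumes the usual convention that a page is indexable and followable unless `noindex`, `nofollow`, or `none` says otherwise.

```python
from typing import Optional, Tuple


def parse_robots(content: Optional[str]) -> Tuple[bool, bool]:
    """Return (is_index, is_follow) for a robots meta content value (hypothetical helper)."""
    directives = {part.strip().lower() for part in (content or "").split(",")}
    # "none" is shorthand for "noindex, nofollow".
    is_index = "noindex" not in directives and "none" not in directives
    is_follow = "nofollow" not in directives and "none" not in directives
    return is_index, is_follow
```

A missing robots meta tag is treated here the same as an empty value, so both flags stay true.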

### Social scraper (OG)
The social scrapers define the social *title, description, site name, modified time, and page type*.
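
As a rough picture of where these values come from, the sketch below collects every `og:` meta tag of a page into a dictionary using only the standard library. It is not how the package's Og scrapers are implemented; `OgCollector` and `collect_og_tags` are hypothetical names, and keeping only the first occurrence of each property is an assumption.

```python
from html.parser import HTMLParser
from typing import Dict


class OgCollector(HTMLParser):
    """Collects <meta property="og:..." content="..."> pairs (hypothetical helper)."""

    def __init__(self) -> None:
        super().__init__()
        self.tags: Dict[str, str] = {}

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attributes = dict(attrs)
        prop = attributes.get("property") or ""
        content = attributes.get("content")
        if prop.startswith("og:") and content is not None:
            # Keep the first value if a property appears more than once.
            self.tags.setdefault(prop, content)


def collect_og_tags(html: str) -> Dict[str, str]:
    collector = OgCollector()
    collector.feed(html)
    collector.close()
    return collector.tags
```

The resulting dictionary can then be read by key, for example `tags.get("og:title")` or `tags.get("og:site_name")`.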

### usage
```python
# import pipeline
from page_scraper import ContentPipeline

# import scrapers
from page_scraper import (
    TitleScraper, RobotScraper, DescriptionScraper,
    OgTitleScraper, OgDescriptionScraper, OgTypeScraper,
    OgSiteNameScraper, ModifiedTimeScraper,
)

# create pipeline
content_pipeline = ContentPipeline([
    TitleScraper(),
    RobotScraper(),
    DescriptionScraper(),
    OgTitleScraper(),
    OgDescriptionScraper(),
    OgTypeScraper(),
    OgSiteNameScraper(),
    ModifiedTimeScraper()
])

# run the pipeline on the page; remember that the page soup property must be set
content_pipeline.run(page)
```

