Metadata-Version: 2.4
Name: langchain-advertools
Version: 0.0.2
Summary: LangChain integration for advertools
Author-email: Elias Dabbas <eliasdabbas@gmail.com>
Requires-Python: >=3.10
Requires-Dist: advertools>=0.14.0
Requires-Dist: langchain>=0.3.0
Description-Content-Type: text/markdown

# LangChain integration with advertools

This package integrates advertools into the LangChain ecosystem.

It currently provides one class, `WebsiteLoader`, which is a document loader.

## Installation

```bash
python3 -m pip install langchain-advertools
```

## Typical workflow

### Crawl a website

```python
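import advertools as adv
import pandas as pd

# Crawl the site, following links found on its pages; output is written as JSON Lines
adv.crawl("https://www.langchain.com/", "langchain.jsonl", follow_links=True)
# Read the same output file (one JSON object per line) into a DataFrame
crawldf = pd.read_json("langchain.jsonl", lines=True)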
```

The full website has now been crawled and read into the DataFrame `crawldf`. The first five rows, with long values truncated, look like this:


|    | url                                 | title                      | meta_desc                           | viewport                            | charset   | h1                                  | h2                                  | h3                            | canonical                           | og:title                   | og:description                      | og:image                            | og:type   | twitter:card        | body_text                           |   size |   download_timeout | download_slot     |   download_latency |   depth |   status | links_url                           | links_text             | links_nofollow                      | nav_links_url                       | nav_links_text                      | nav_links_nofollow                  | header_links_url                    | header_links_text                   | header_links_nofollow               | footer_links_url                    | footer_links_text                   | footer_links_nofollow               | img_src                             | img_loading                         | img_width                           | img_alt                             | img_sizes                           | img_srcset                          | img_height                          | ip_address   | crawl_time          | resp_headers_Date             | resp_headers_Content-Type   | resp_headers_Cf-Ray   | resp_headers_Cf-Cache-Status   |   resp_headers_Age | resp_headers_Last-Modified    | resp_headers_Content-Security-Policy   | resp_headers_Surrogate-Control   | resp_headers_Surrogate-Key          | resp_headers_X-Frame-Options   | resp_headers_X-Lambda-Id            | resp_headers_Vary   | resp_headers_Set-Cookie             | resp_headers_Alt-Svc   | resp_headers_X-Cluster-Name   | request_headers_Accept              | request_headers_Accept-Language   | request_headers_User-Agent   | request_headers_Accept-Encoding   | request_headers_Referer    |   h6 |   h4 |   h5 |
|---:|:------------------------------------|:---------------------------|:------------------------------------|:------------------------------------|:----------|:------------------------------------|:------------------------------------|:------------------------------|:------------------------------------|:---------------------------|:------------------------------------|:------------------------------------|:----------|:--------------------|:------------------------------------|-------:|-------------------:|:------------------|-------------------:|--------:|---------:|:------------------------------------|:-----------------------|:------------------------------------|:------------------------------------|:------------------------------------|:------------------------------------|:------------------------------------|:------------------------------------|:------------------------------------|:------------------------------------|:------------------------------------|:------------------------------------|:------------------------------------|:------------------------------------|:------------------------------------|:------------------------------------|:------------------------------------|:------------------------------------|:------------------------------------|:-------------|:--------------------|:------------------------------|:----------------------------|:----------------------|:-------------------------------|-------------------:|:------------------------------|:---------------------------------------|:---------------------------------|:------------------------------------|:-------------------------------|:------------------------------------|:--------------------|:------------------------------------|:-----------------------|:------------------------------|:------------------------------------|:----------------------------------|:-----------------------------|:----------------------------------|:---------------------------|-----:|-----:|-----:|
|  0 | https://www.langchain.com/          | LangChain                  | LangChain’s suite of products suppo | width=device-width, initial-scale=1 | utf-8     | Applications that can reason. Power | From startups to global enterprises | Hear from our happy customers | https://www.langchain.com/          | LangChain                  | LangChain’s suite of products suppo | https://cdn.prod.website-files.com/ | website   | summary_large_image | LangChain’s suite of products suppo | 105173 |                180 | www.langchain.com |          0.0991158 |       0 |      200 | https://www.langchain.com/@@https:/ | @@LangGraph@@LangSmith | False@@False@@False@@False@@False@@ | https://www.langchain.com/langgraph | LangGraph@@LangSmith@@LangChain@@Re | False@@False@@False@@False@@False@@ | https://www.langchain.com/contact-s | Get a demo@@See customer stories    | False@@False@@False@@False@@False   | https://www.langchain.com/langchain | LangChain@@LangSmith@@LangGraph@@Ag | False@@False@@False@@False@@False@@ | https://cdn.prod.website-files.com/ | lazy@@lazy@@lazy@@lazy@@lazy@@lazy@ | 35.5@@@@84@@@@@@66@@38@@@@72.5@@86@ | @@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@ | @@@@@@@@@@(max-width: 1279px) 65.99 | @@@@@@@@@@https://cdn.prod.website- | @@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@Aut | 54.243.86.28 | 2025-05-13 03:17:37 | Tue, 13 May 2025 03:17:37 GMT | text/html                   | 93ef00fbdbcac998-IAD  | HIT                            |             247862 | Sat, 10 May 2025 06:26:35 GMT | frame-ancestors 'self'                 | max-age=432000                   | www.langchain.com 65b8cd72835ceeacd | SAMEORIGIN                     | 59418863-7ac9-4380-a760-0e3817a430c | Accept-Encoding     | _cfuvid=l5QDJcN0tziza860Y8K9y2SRXZo | h3=":443"; ma=86400    | us-east-1-prod-hosting-red    | text/html,application/xhtml+xml,app | en                                | advertools/0.16.6            | gzip, deflate, zstd               | nan                        |  nan |  nan |  nan |
|  1 | https://www.langchain.com/contact-s | Talk to our team           | You can expect a conversation with  | width=device-width, initial-scale=1 | utf-8     | Talk to our team                    | Trusted by the best teams building  | nan                           | https://www.langchain.com/contact-s | Talk to our team           | You can expect a conversation with  | https://cdn.prod.website-files.com/ | website   | summary_large_image | Trusted by the best teams building  |  40694 |                180 | www.langchain.com |          0.053534  |       1 |      200 | https://www.langchain.com/@@https:/ | @@LangGraph@@LangSmith | False@@False@@False@@False@@False@@ | https://www.langchain.com/langgraph | LangGraph@@LangSmith@@LangChain@@Re | False@@False@@False@@False@@False@@ | nan                                 | nan                                 | nan                                 | nan                                 | nan                                 | nan                                 | https://cdn.prod.website-files.com/ | lazy@@lazy@@lazy@@lazy@@lazy@@lazy@ | @@@@@@@@@@@@@@@@@@1                 | @@@@@@@@@@@@@@@@@@                  | @@@@@@@@@@@@@@@@(max-width: 1919px) | @@@@@@@@@@@@@@@@https://cdn.prod.we | @@@@@@@@@@@@@@@@@@1                 | 54.243.86.28 | 2025-05-13 03:17:37 | Tue, 13 May 2025 03:17:37 GMT | text/html                   | 93ef00fd9ca9f28b-IAD  | HIT                            |             798472 | Sat, 03 May 2025 21:29:45 GMT | frame-ancestors 'self'                 | max-age=2147483647               | www.langchain.com 65b8cd72835ceeacd | SAMEORIGIN                     | 9b369e26-9fb7-4b62-af02-a4b8bcea45d | Accept-Encoding     | _cfuvid=MO5ivQuIZ0V.alvdhTofYmEnlk6 | h3=":443"; ma=86400    | us-east-1-prod-hosting-red    | text/html,application/xhtml+xml,app | en                                | advertools/0.16.6            | gzip, deflate, zstd               | https://www.langchain.com/ |  nan |  nan |  nan |
|  2 | https://www.langchain.com/resources | Resources                  | Curated content for the AI engineer | width=device-width, initial-scale=1 | utf-8     | Resources                           | Built with LangGraph@@Built with La | nan                           | https://www.langchain.com/resources | Resources                  | Curated content for the AI engineer | https://cdn.prod.website-files.com/ | website   | summary_large_image | Resources                           |  62532 |                180 | www.langchain.com |          0.031961  |       1 |      200 | https://www.langchain.com/@@https:/ | @@LangGraph@@LangSmith | False@@False@@False@@False@@False@@ | https://www.langchain.com/langgraph | LangGraph@@LangSmith@@LangChain@@Re | False@@False@@False@@False@@False@@ | https://www.langchain.com/built-wit | Use cases & inspirationUpcomingBuil | False@@False@@False@@False@@False@@ | https://www.langchain.com/langchain | LangChain@@LangSmith@@LangGraph@@Ag | False@@False@@False@@False@@False@@ | https://cdn.prod.website-files.com/ | lazy@@lazy@@lazy@@lazy@@lazy@@lazy@ | @@@@@@@@@@@@@@@@@@@@@@@@@@1         | Built with LangGraph@@Built with La | 100vw@@100vw@@100vw@@100vw@@100vw@@ | https://cdn.prod.website-files.com/ | @@@@@@@@@@@@@@@@@@@@@@@@@@1         | 54.243.86.28 | 2025-05-13 03:17:37 | Tue, 13 May 2025 03:17:37 GMT | text/html                   | 93ef00fdde1ac957-IAD  | HIT                            |              73040 | Mon, 12 May 2025 07:00:17 GMT | frame-ancestors 'self'                 | max-age=86383                    | www.langchain.com 65b8cd72835ceeacd | SAMEORIGIN                     | a5c89e0e-9196-4be5-b41c-041765aab03 | Accept-Encoding     | _cfuvid=TdQBy1PjEDzsHLwBgiCjlNdsC_n | h3=":443"; ma=86400    | us-east-1-prod-hosting-red    | text/html,application/xhtml+xml,app | en                                | advertools/0.16.6            | gzip, deflate, zstd               | https://www.langchain.com/ |  nan |  nan |  nan |
|  3 | https://www.langchain.com/pricing-l | LangGraph Platform Pricing | LangGraph Platform plans for teams  | width=device-width, initial-scale=1 | utf-8     | LangGraph Platform plansfor teams o | LangSmith for Startups and Educatio | nan                           | https://www.langchain.com/pricing-l | LangGraph Platform Pricing | LangGraph Platform plans for teams  | https://cdn.prod.website-files.com/ | website   | summary_large_image | LangGraph Platform plans for teams  |  91367 |                180 | www.langchain.com |          0.1007    |       1 |      200 | https://www.langchain.com/@@https:/ | @@LangGraph@@LangSmith | False@@False@@False@@False@@False@@ | https://www.langchain.com/langgraph | LangGraph@@LangSmith@@LangChain@@Re | False@@False@@False@@False@@False@@ | https://langchain-ai.github.io/lang | Get started@@Get started@@Contact u | False@@False@@False@@False@@False@@ | https://www.langchain.com/langchain | LangChain@@LangSmith@@LangGraph@@Ag | False@@False@@False@@False@@False@@ | https://cdn.prod.website-files.com/ | lazy@@                              | @@1                                 | @@                                  | nan                                 | nan                                 | @@1                                 | 54.243.86.28 | 2025-05-13 03:17:37 | Tue, 13 May 2025 03:17:37 GMT | text/html                   | 93ef00fdea022d0f-IAD  | HIT                            |              86950 | Mon, 12 May 2025 03:08:27 GMT | frame-ancestors 'self'                 | max-age=432000                   | www.langchain.com 65b8cd72835ceeacd | SAMEORIGIN                     | 75d9c6c2-17b1-44f5-9090-efe59cf4db1 | Accept-Encoding     | _cfuvid=BM46MCuUt4XXJJ5ZJu3TBt3DjQq | h3=":443"; ma=86400    | us-east-1-prod-hosting-red    | text/html,application/xhtml+xml,app | en                                | advertools/0.16.6            | gzip, deflate, zstd               | https://www.langchain.com/ |  nan |  nan |  nan |
|  4 | https://www.langchain.com/langchain | LangChain                  | An all-in-one developer platform fo | width=device-width, initial-scale=1 | utf-8     | The largest community building the  | A complete set of interoperable bui | Why choose LangChain?         | https://www.langchain.com/langchain | LangChain                  | An all-in-one developer platform fo | https://cdn.prod.website-files.com/ | website   | summary_large_image | The largest community building the  |  65327 |                180 | www.langchain.com |          0.0322678 |       1 |      200 | https://www.langchain.com/@@https:/ | @@LangGraph@@LangSmith | False@@False@@False@@False@@False@@ | https://www.langchain.com/langgraph | LangGraph@@LangSmith@@LangChain@@Re | False@@False@@False@@False@@False@@ | https://python.langchain.com/docs/t | Get started with Python@@Get starte | False@@False                        | https://www.langchain.com/langchain | LangChain@@LangSmith@@LangGraph@@Ag | False@@False@@False@@False@@False@@ | https://cdn.prod.website-files.com/ | lazy@@lazy@@lazy@@lazy@@lazy@@lazy@ | 656@@Auto@@Auto@@Auto@@Auto@@Auto@@ | @@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@ | (max-width: 767px) 100vw, 656px@@@@ | https://cdn.prod.website-files.com/ | @@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@ | 54.243.86.28 | 2025-05-13 03:17:37 | Tue, 13 May 2025 03:17:37 GMT | text/html                   | 93ef00fe0eb8c957-IAD  | HIT                            |              86944 | Mon, 12 May 2025 03:08:33 GMT | frame-ancestors 'self'                 | max-age=432000                   | www.langchain.com 65b8cd72835ceeacd | SAMEORIGIN                     | 8a40c6f8-f244-4511-8ac3-b9b553a12cc | Accept-Encoding     | _cfuvid=CuzspoSNIU_ALUsqpea_wSMI2rR | h3=":443"; ma=86400    | us-east-1-prod-hosting-red    | text/html,application/xhtml+xml,app | en                                | advertools/0.16.6            | gzip, deflate, zstd               | https://www.langchain.com/ |  nan |  nan |  nan |



The `WebsiteLoader` class is a thin wrapper that exposes this rich representation as LangChain `Document` objects. Documents are read lazily, and all the available data for each page is stored under the `metadata` attribute:

```python
>>> from langchain_advertools import WebsiteLoader
>>> loader = WebsiteLoader("langchain.jsonl")  # note that the crawling process is a separate one, and has already happened
>>> lazy = loader.lazy_load()
>>> home = next(lazy)
>>> home.id
'https://www.langchain.com/'

>>> home.page_content[:800]
LangChain’s suite of products supports developers along each step of the LLM application lifecycle. Applications that can reason. Powered by LangChain. Get a demo Sign up to be the first to access recordings from  Interrupt, The AI Agent Conference ! Learn More From startups to global enterprises,  ambitious builders choose  LangChain products. Build LangChain is a composable framework to build with LLMs. LangGraph is the orchestration framework for controllable agentic workflows. Run Deploy your LLM applications at scale with LangGraph Platform, our infrastructure purpose-built for agents. Manage LangSmith is a unified agent observability and evals platform to optimize the performance of your AI agents - whether they're built with a LangChain framework or not.  Build your app with LangChain ...

>>> home.metadata.keys()
dict_keys(['title', 'meta_desc', 'viewport', 'charset', 'h1', 'h2', 'h3', 'canonical', 'og:title', 'og:description', 'og:image', 'og:type', 'twitter:card', 'size', 'download_timeout', 'download_slot', 'download_latency', 'depth', 'status', 'links_url', 'links_text', 'links_nofollow', 'nav_links_url', 'nav_links_text', 'nav_links_nofollow', 'header_links_url', 'header_links_text', 'header_links_nofollow', 'footer_links_url', 'footer_links_text', 'footer_links_nofollow', 'img_src', 'img_loading', 'img_width', 'img_alt', 'img_sizes', 'img_srcset', 'img_height', 'ip_address', 'crawl_time', 'resp_headers_Date', 'resp_headers_Content-Type', 'resp_headers_Cf-Ray', 'resp_headers_Cf-Cache-Status', 'resp_headers_Age', 'resp_headers_Last-Modified', 'resp_headers_Content-Security-Policy', 'resp_headers_Surrogate-Control', 'resp_headers_Surrogate-Key', 'resp_headers_X-Frame-Options', 'resp_headers_X-Lambda-Id', 'resp_headers_Vary', 'resp_headers_Set-Cookie', 'resp_headers_Alt-Svc', 'resp_headers_X-Cluster-Name', 'request_headers_Accept', 'request_headers_Accept-Language', 'request_headers_User-Agent', 'request_headers_Accept-Encoding'])
```
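
To make the "thin wrapper, lazily read" idea concrete, here is a minimal sketch of how such a loader could work using only the standard library. It is not the package's actual implementation: `SimpleDoc` is a hypothetical stand-in for LangChain's `Document`, `lazy_load_jsonl` is a made-up name, and the choice to use `url` as the ID and `body_text` as the content is an assumption based on the output shown above.

```python
import json
import os
import tempfile
from dataclasses import dataclass, field
from typing import Iterator


@dataclass
class SimpleDoc:
    """Hypothetical stand-in for a LangChain Document."""
    id: str
    page_content: str
    metadata: dict = field(default_factory=dict)


def lazy_load_jsonl(path: str) -> Iterator[SimpleDoc]:
    """Yield one document per line of a JSON Lines crawl file."""
    with open(path, encoding="utf-8") as f:
        for line in f:
            if not line.strip():
                continue  # skip blank lines
            row = json.loads(line)
            # Assumption: the URL is the ID and body_text is the content.
            url = row.pop("url", "")
            body = row.pop("body_text", "")
            # Whatever remains in the row becomes the metadata.
            yield SimpleDoc(id=url, page_content=body, metadata=row)


# Tiny made-up crawl file with two rows, just for the demonstration.
rows = [
    {"url": "https://example.com/", "title": "Home", "body_text": "Welcome"},
    {"url": "https://example.com/about", "title": "About", "body_text": "About us"},
]
with tempfile.NamedTemporaryFile(
    "w", suffix=".jsonl", delete=False, encoding="utf-8"
) as tmp:
    for r in rows:
        tmp.write(json.dumps(r) + "\n")

docs = lazy_load_jsonl(tmp.name)
first = next(docs)      # only the first line has been parsed at this point
remaining = list(docs)  # consuming the generator reads the rest and closes the file
os.remove(tmp.name)
```

Because the function is a generator, a large crawl file is never held in memory all at once; each page is parsed only when the consumer asks for the next document.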

We can now explore the rich metadata, which tells us a lot about the crawled webpage:

```python
>>> home.metadata['title']
'LangChain'
>>> home.metadata['h1']
'Applications that can reason. Powered by LangChain.'
>>> home.metadata['h2'].split('@@')  # multiple elements on the same page are delimited with "@@"
['From startups to global enterprises, ambitious builders choose LangChain products.', 'Build your app with LangChain', 'Run at scale with LangGraph\xa0Platform', 'Manage agent observability & performance with\xa0LangSmith', 'The reference architecture enterprises adopt for success.', 'The biggest developer community in GenAI', "Get started with LangChain's suite of products.", 'Get inspired by companies who have done it.', 'Ready to start shipping \u2028reliable GenAI apps faster?']

>>> home.metadata['links_url'].split('@@')[:10]
['https://www.langchain.com/', 'https://www.langchain.com/langgraph', 'https://www.langchain.com/langsmith', 'https://www.langchain.com/langchain', 'https://www.langchain.com/resources', 'https://blog.langchain.dev/', 'https://www.langchain.com/customers', 'https://academy.langchain.com/', 'https://www.langchain.com/community', 'https://www.langchain.com/experts']
>>> home.metadata['links_text'].split('@@')[:10]
['\n\n\n\n\n\n\n\n\n\n\n\n\n', 'LangGraph', 'LangSmith', 'LangChain', 'Resources Hub', 'Blog', 'Customer Stories', 'LangChain Academy', 'Community', 'Experts']
```
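
Since many metadata values pack several elements into one `@@`-delimited string, a small helper can make them easier to work with. The sketch below is illustrative only: `split_multi` is a hypothetical helper (not part of this package), the metadata dictionary is made up, and it assumes that the `links_*` fields line up position by position and hold strings such as `"False"`, as the output above suggests.

```python
def split_multi(value, sep="@@"):
    """Split an '@@'-delimited string into a list; non-strings (e.g. NaN) give []."""
    if not isinstance(value, str):
        return []
    return value.split(sep)


# Made-up metadata in the same shape as the output above.
metadata = {
    "links_url": "https://example.com/@@https://example.com/blog",
    "links_text": "Home@@Blog",
    "links_nofollow": "False@@True",
    "h3": float("nan"),  # a missing value, as pandas would represent it
}

# Pair each link URL with its anchor text and nofollow flag (assumed aligned).
links = list(zip(
    split_multi(metadata["links_url"]),
    split_multi(metadata["links_text"]),
    split_multi(metadata["links_nofollow"]),
))

# Keep only the links that are not marked nofollow.
followed = [url for url, _text, nofollow in links if nofollow == "False"]
```

Returning an empty list for non-string values keeps downstream code simple when a page has no elements of a given type.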