Metadata-Version: 2.4
Name: langchain-olostep
Version: 0.2.2
Summary: The most reliable and cost-effective web search, scraping and crawling API for AI. Build intelligent agents that can search, scrape, analyze, and structure data from any website.
Home-page: https://github.com/olostep/langchain-olostep
Author: Olostep
Author-email: Olostep <info@olostep.com>
License: MIT
Project-URL: Homepage, https://github.com/olostep/langchain-olostep
Project-URL: Repository, https://github.com/olostep/langchain-olostep
Project-URL: Documentation, https://docs.olostep.com/integrations/langchain
Project-URL: Issues, https://github.com/olostep/langchain-olostep/issues
Keywords: langchain,langgraph,web-scraping,web-search,crawling,olostep,ai,llm,agents
Classifier: Development Status :: 4 - Beta
Classifier: Intended Audience :: Developers
Classifier: License :: OSI Approved :: MIT License
Classifier: Operating System :: OS Independent
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3.9
Classifier: Programming Language :: Python :: 3.10
Classifier: Programming Language :: Python :: 3.11
Classifier: Programming Language :: Python :: 3.12
Classifier: Topic :: Software Development :: Libraries :: Python Modules
Classifier: Topic :: Internet :: WWW/HTTP :: Dynamic Content
Classifier: Topic :: Scientific/Engineering :: Artificial Intelligence
Requires-Python: >=3.9
Description-Content-Type: text/markdown
License-File: LICENSE
Requires-Dist: langchain-core>=0.3.0
Requires-Dist: requests>=2.31.0
Requires-Dist: pydantic>=2.0.0
Provides-Extra: dev
Requires-Dist: pytest>=7.0.0; extra == "dev"
Requires-Dist: pytest-asyncio>=0.21.0; extra == "dev"
Requires-Dist: pytest-mock>=3.10.0; extra == "dev"
Requires-Dist: black>=23.0.0; extra == "dev"
Requires-Dist: isort>=5.12.0; extra == "dev"
Requires-Dist: flake8>=6.0.0; extra == "dev"
Requires-Dist: mypy>=1.0.0; extra == "dev"
Provides-Extra: test
Requires-Dist: pytest>=7.0.0; extra == "test"
Requires-Dist: pytest-asyncio>=0.21.0; extra == "test"
Requires-Dist: pytest-mock>=3.10.0; extra == "test"
Requires-Dist: httpx>=0.24.0; extra == "test"
Dynamic: author
Dynamic: home-page
Dynamic: license-file
Dynamic: requires-python

# LangChain Olostep Integration

**The most reliable and cost-effective web search, scraping and crawling API for AI.**

Build intelligent agents that can search, scrape, analyze, and structure data from any website. Perfect for LangChain and LangGraph applications.

## Features

### Web Search
Search the web with natural-language queries and get back AI-powered answers and data in the JSON shape you want. Ground your products in real-world data and sources.

### Web Scraping
Extract content from any website with JavaScript rendering support. Handles anti-scraping measures and dynamic content automatically.

### Web Crawling
Crawl entire websites with customizable depth and filters. Perfect for building comprehensive datasets.

### Core Capabilities
- **Batch Processing**: Scrape up to 100,000 URLs in parallel
- **AI-Powered Q&A**: Ask questions about websites and get intelligent answers
- **Data Extraction**: Extract specific fields using AI-powered mapping
- **Multiple Formats**: Support for Markdown, HTML, JSON, and plain text
- **Specialized Parsers**: Use custom parsers for specific websites (e.g., Amazon, LinkedIn)
- **Location-Specific**: Scrape with country-specific settings
- **LangGraph Ready**: Perfect for building complex AI agent workflows
- **Cost-Effective**: Pay only for what you use with competitive pricing

## Installation

```bash
pip install langchain-olostep
```

## Setup

Set your Olostep API key:

```bash
export OLOSTEP_API_KEY="your_olostep_api_key_here"
```

Get your API key from https://olostep.com/dashboard

## Quick Start

### Basic Web Scraping

```python
from langchain_olostep import scrape_website
import asyncio

# Scrape a website
content = asyncio.run(scrape_website.ainvoke({
    "url": "https://example.com",
    "format": "markdown"
}))

print(content)
```

### With LangChain Agent

```python
from langchain.agents import initialize_agent, AgentType
from langchain_openai import ChatOpenAI
from langchain_olostep import scrape_website, scrape_with_answer

# Create agent with Olostep tools
tools = [scrape_website, scrape_with_answer]
llm = ChatOpenAI(model="gpt-4o-mini")

agent = initialize_agent(
    tools=tools,
    llm=llm,
    agent=AgentType.ZERO_SHOT_REACT_DESCRIPTION,
    verbose=True
)

# Use the agent
result = agent.run("""
Scrape https://example.com and tell me:
1. What is the main content about?
2. Extract any contact information
""")

print(result)
```

### With LangGraph

```python
from langgraph.graph import StateGraph
from langchain_olostep import scrape_website, scrape_batch
from langchain_openai import ChatOpenAI

# Build a research agent workflow
workflow = StateGraph(dict)

def scrape_node(state):
    urls = state["urls"]
    result = scrape_batch.invoke({"urls": urls})
    return {"scraped_data": result}

workflow.add_node("scrape", scrape_node)
# ... add more nodes
```

## Available Tools

### 1. scrape_website

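Scrape content from any website. The snippets in this section use `await`, so run them inside a coroutine or a notebook; in a plain script, wrap the call in `asyncio.run(...)` as in the Quick Start.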

```python
from langchain_olostep import scrape_website

result = await scrape_website.ainvoke({
    "url": "https://example.com",
    "format": "markdown",  # markdown, html, json, or text
    "country": "US",  # Optional: country code for location-specific content
    "wait_before_scraping": 2000,  # Optional: wait time in ms for JS rendering
    "parser": "@olostep/amazon-product"  # Optional: specialized parser
})
```

**Perfect for:**
- Extracting article content
- Scraping dynamic websites
- Bypassing anti-scraping measures
- Getting clean, formatted content
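
Because the tool input is a plain dictionary, it can help to validate it before making a call. The sketch below is illustrative only: `build_scrape_payload` and `ALLOWED_FORMATS` are hypothetical names that are not part of `langchain-olostep`, and the allowed formats are simply the four listed in this README.

```python
from typing import Any, Dict, Optional

# Formats listed in this README for scrape_website (assumed to be the full set).
ALLOWED_FORMATS = {"markdown", "html", "json", "text"}


def build_scrape_payload(
    url: str,
    fmt: str = "markdown",
    country: Optional[str] = None,
    wait_before_scraping: Optional[int] = None,
    parser: Optional[str] = None,
) -> Dict[str, Any]:
    """Hypothetical helper: build and validate the input dict for scrape_website."""
    if fmt not in ALLOWED_FORMATS:
        raise ValueError(
            f"Unsupported format {fmt!r}; expected one of {sorted(ALLOWED_FORMATS)}"
        )
    if wait_before_scraping is not None and wait_before_scraping < 0:
        raise ValueError("wait_before_scraping must be a non-negative number of ms")

    payload: Dict[str, Any] = {"url": url, "format": fmt}
    # Include optional keys only when the caller provided them.
    if country is not None:
        payload["country"] = country
    if wait_before_scraping is not None:
        payload["wait_before_scraping"] = wait_before_scraping
    if parser is not None:
        payload["parser"] = parser
    return payload


payload = build_scrape_payload("https://example.com", fmt="html", country="GB")
print(payload)
```

The resulting dictionary could then be passed to `scrape_website.ainvoke(payload)`.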

### 2. scrape_batch

Scrape multiple URLs in parallel.

```python
from langchain_olostep import scrape_batch

urls = [
    "https://example1.com",
    "https://example2.com",
    "https://example3.com"
]

result = await scrape_batch.ainvoke({
    "urls": urls,
    "format": "markdown"
})
```

**Perfect for:**
- Competitive analysis
- Large-scale data collection
- Building datasets
- Monitoring multiple sources
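
When a URL list is assembled from several sources, it may be worth deduplicating it and splitting it into smaller groups before handing each group to `scrape_batch`. The following is a minimal sketch under that assumption; `dedupe_urls`, `chunk_urls`, and the group size of 2 are hypothetical and chosen only for illustration.

```python
from typing import Iterator, List, Sequence


def dedupe_urls(urls: Sequence[str]) -> List[str]:
    """Hypothetical helper: drop repeated URLs while keeping first-seen order."""
    return list(dict.fromkeys(urls))


def chunk_urls(urls: Sequence[str], size: int) -> Iterator[List[str]]:
    """Hypothetical helper: yield consecutive groups of at most `size` URLs."""
    if size <= 0:
        raise ValueError("size must be a positive integer")
    for start in range(0, len(urls), size):
        yield list(urls[start:start + size])


raw_urls = [
    "https://example1.com",
    "https://example2.com",
    "https://example1.com",  # duplicate, removed by dedupe_urls
    "https://example3.com",
]

groups = list(chunk_urls(dedupe_urls(raw_urls), size=2))
for group in groups:
    # Each group could be sent as {"urls": group} to scrape_batch.
    print(group)
```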

### 3. scrape_with_answer

Ask questions about website content and get AI-powered answers.

```python
from langchain_olostep import scrape_with_answer

result = await scrape_with_answer.ainvoke({
    "url": "https://company.com",
    "question": "What is the company's main product and its pricing?"
})
```

**Perfect for:**
- Research and information extraction
- Competitive intelligence
- Lead generation
- Content analysis

### 4. scrape_with_map

Extract specific fields using AI-powered mapping.

```python
from langchain_olostep import scrape_with_map

result = await scrape_with_map.ainvoke({
    "url": "https://store.com/product/123",
    "fields": ["product_name", "price", "rating", "description"]
})
```

**Perfect for:**
- Structured data extraction
- Product information gathering
- Contact details extraction
- E-commerce data collection

## Examples

### Example 1: Research Agent

```python
from langchain_olostep import scrape_website, scrape_with_answer
from langchain.agents import initialize_agent, AgentType
from langchain_openai import ChatOpenAI

tools = [scrape_website, scrape_with_answer]
llm = ChatOpenAI(model="gpt-4o-mini")

agent = initialize_agent(
    tools=tools,
    llm=llm,
    agent=AgentType.ZERO_SHOT_REACT_DESCRIPTION
)

# Research a topic
result = agent.run("""
Research the latest developments in AI by:
1. Scraping https://openai.com/blog
2. Extracting key announcements
3. Summarizing the findings
""")
```

### Example 2: Competitive Analysis

```python
from langchain_olostep import scrape_batch, scrape_with_map

# Scrape competitor websites
competitors = [
    "https://competitor1.com/pricing",
    "https://competitor2.com/pricing",
    "https://competitor3.com/pricing"
]

batch_result = await scrape_batch.ainvoke({"urls": competitors})

# Extract pricing information
for url in competitors:
    pricing = await scrape_with_map.ainvoke({
        "url": url,
        "fields": ["pricing_tiers", "features", "prices"]
    })
    print(f"Competitor: {url}")
    print(f"Pricing: {pricing}")
```
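
The loop above awaits one extraction at a time. If the calls are independent, they could instead run concurrently with a cap on how many are in flight. The sketch below shows that pattern with the standard library only; `fake_extract` is a stand-in so the example runs offline, and `extract_all` and `max_concurrency` are hypothetical names. In real use, `scrape_with_map.ainvoke` would presumably take the place of `fake_extract`.

```python
import asyncio
from typing import Any, Awaitable, Callable, Dict, List

Extractor = Callable[[Dict[str, Any]], Awaitable[Dict[str, Any]]]


async def fake_extract(payload: Dict[str, Any]) -> Dict[str, Any]:
    """Stand-in for an async tool call; it makes no network request."""
    await asyncio.sleep(0)
    return {"url": payload["url"], "fields": payload["fields"]}


async def extract_all(
    urls: List[str],
    fields: List[str],
    extract: Extractor,
    max_concurrency: int = 2,
) -> List[Dict[str, Any]]:
    """Hypothetical helper: run `extract` for every URL with bounded concurrency."""
    semaphore = asyncio.Semaphore(max_concurrency)

    async def extract_one(url: str) -> Dict[str, Any]:
        async with semaphore:  # limit how many calls run at once
            return await extract({"url": url, "fields": fields})

    # gather returns results in the same order as the input URLs.
    return await asyncio.gather(*(extract_one(url) for url in urls))


competitor_urls = [
    "https://competitor1.com/pricing",
    "https://competitor2.com/pricing",
    "https://competitor3.com/pricing",
]
results = asyncio.run(extract_all(competitor_urls, ["prices"], fake_extract))
for item in results:
    print(item["url"], item["fields"])
```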

### Example 3: Content Monitoring

```python
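import asyncio
import time

import schedule
from langchain_olostep import scrape_website

def monitor_website():
    # schedule calls a plain function, so run the async tool call to completion here
    content = asyncio.run(scrape_website.ainvoke({
        "url": "https://important-site.com",
        "format": "markdown"
    }))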
    
    # Check for changes, send alerts, etc.
    # ... your logic here

# Run every hour
schedule.every().hour.do(monitor_website)

while True:
    schedule.run_pending()
    time.sleep(1)
```
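
One simple way to fill in the "check for changes" step is to compare a hash of the current content with the hash from the previous run. This is a sketch, not part of the package: `content_fingerprint` and `ChangeDetector` are hypothetical names, and it assumes the scraped content is available as a string.

```python
import hashlib
from typing import Optional


def content_fingerprint(content: str) -> str:
    """Return a SHA-256 hex digest of the content."""
    return hashlib.sha256(content.encode("utf-8")).hexdigest()


class ChangeDetector:
    """Hypothetical helper: remember the last fingerprint and report changes."""

    def __init__(self) -> None:
        self._last: Optional[str] = None

    def has_changed(self, content: str) -> bool:
        fingerprint = content_fingerprint(content)
        # The first observation only sets the baseline, so it is not a change.
        changed = self._last is not None and fingerprint != self._last
        self._last = fingerprint
        return changed


detector = ChangeDetector()
print(detector.has_changed("# Pricing\nBasic: $10"))  # baseline
print(detector.has_changed("# Pricing\nBasic: $10"))  # same content
print(detector.has_changed("# Pricing\nBasic: $12"))  # content differs
```

Inside `monitor_website`, the scraped `content` could be passed to `detector.has_changed(...)` to decide whether to send an alert. The fingerprint is kept in memory only, so it resets when the process restarts.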

### Example 4: LangGraph Research Workflow

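See the complete example in the examples directory. The snippet here is an outline: `plan_research`, `scrape_content`, `analyze_data`, and `generate_report` are node functions you define yourself.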

```python
from langgraph.graph import StateGraph, END
from langchain_olostep import scrape_website, scrape_with_answer

# Define your research workflow
workflow = StateGraph(dict)

# Add nodes for different stages
workflow.add_node("plan", plan_research)
workflow.add_node("scrape", scrape_content)
workflow.add_node("analyze", analyze_data)
workflow.add_node("report", generate_report)

# Connect the nodes
workflow.set_entry_point("plan")
workflow.add_edge("plan", "scrape")
workflow.add_edge("scrape", "analyze")
workflow.add_edge("analyze", "report")
workflow.add_edge("report", END)

# Compile and run
agent = workflow.compile()
result = agent.invoke({"query": "Research AI developments"})
```
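
To make the shape of those node functions concrete, here is a minimal offline sketch. The function names match the outline above, but their bodies are placeholders invented for illustration: nothing here calls Olostep, an LLM, or LangGraph, and the nodes are simply run in sequence by hand. How a compiled graph merges each node's return value into its state depends on the state schema you choose, so treat the merge logic as an assumption.

```python
from typing import Any, Dict

State = Dict[str, Any]


def plan_research(state: State) -> State:
    # Placeholder: a real node might ask an LLM which URLs to visit.
    return {**state, "urls": ["https://example.com"]}


def scrape_content(state: State) -> State:
    # Placeholder: a real node might call scrape_website for each URL.
    pages = {url: f"(content of {url})" for url in state["urls"]}
    return {**state, "pages": pages}


def analyze_data(state: State) -> State:
    # Placeholder analysis: record the size of each page.
    findings = [f"{url}: {len(text)} characters" for url, text in state["pages"].items()]
    return {**state, "findings": findings}


def generate_report(state: State) -> State:
    lines = [f"Report for {state['query']!r}"] + state["findings"]
    return {**state, "report": "\n".join(lines)}


# Run the four stages in order without LangGraph, mirroring the edges above.
state: State = {"query": "Research AI developments"}
for node in (plan_research, scrape_content, analyze_data, generate_report):
    state = node(state)

print(state["report"])
```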

## Advanced Features

### JavaScript Rendering

Handle dynamic websites that load content via JavaScript:

```python
result = await scrape_website.ainvoke({
    "url": "https://dynamic-site.com",
    "wait_before_scraping": 3000  # Wait 3 seconds
})
```

### Location-Specific Scraping

Get content as it appears in different countries:

```python
result = await scrape_website.ainvoke({
    "url": "https://example.com",
    "country": "GB"  # Scrape as viewed from UK
})
```

### Specialized Parsers

Use pre-built parsers for specific websites:

```python
# Amazon product parser
product = await scrape_website.ainvoke({
    "url": "https://amazon.com/product/xyz",
    "parser": "@olostep/amazon-product"
})

# LinkedIn profile parser
profile = await scrape_website.ainvoke({
    "url": "https://linkedin.com/in/username",
    "parser": "@olostep/linkedin-profile"
})
```

### Multiple Output Formats

Get content in different formats:

```python
# Get markdown for readability
markdown = await scrape_website.ainvoke({
    "url": "https://example.com",
    "format": "markdown"
})

# Get JSON for structured data
json_data = await scrape_website.ainvoke({
    "url": "https://example.com",
    "format": "json"
})

# Get HTML for full page structure
html = await scrape_website.ainvoke({
    "url": "https://example.com",
    "format": "html"
})
```

## Configuration

### Environment Variables

- `OLOSTEP_API_KEY`: Your Olostep API key (required)

### Tool Parameters

All tools accept an optional `api_key` parameter:

```python
result = await scrape_website.ainvoke({
    "url": "https://example.com",
    "api_key": "your_api_key_here"  # Override environment variable
})
```
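
The comment above indicates that an explicit `api_key` takes precedence over `OLOSTEP_API_KEY`. An application that wants to fail early, before any tool call, could apply the same precedence itself. The sketch below is an assumption about how one might do that; `resolve_api_key` is a hypothetical helper and does not describe how the package resolves the key internally.

```python
import os
from typing import Mapping, Optional


def resolve_api_key(
    explicit: Optional[str] = None,
    env: Optional[Mapping[str, str]] = None,
) -> str:
    """Hypothetical helper: prefer an explicit key, else fall back to the environment."""
    source = os.environ if env is None else env
    key = explicit or source.get("OLOSTEP_API_KEY")
    if not key:
        raise RuntimeError(
            "No Olostep API key found: pass api_key or set OLOSTEP_API_KEY"
        )
    return key


# A dict stands in for os.environ so the example does not depend on the shell.
fake_env = {"OLOSTEP_API_KEY": "key_from_env"}
print(resolve_api_key(env=fake_env))
print(resolve_api_key("key_from_argument", env=fake_env))
```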

## Use Cases

### Research & Analysis
- Market research
- Competitive intelligence
- Academic research
- News monitoring

### Data Collection
- Building datasets
- Product information gathering
- Price monitoring
- Contact information extraction

### AI Agents
- Research assistants
- Data extraction bots
- Content analyzers
- Web automation agents

### Business Intelligence
- Competitor tracking
- Lead generation
- Market analysis
- Trend monitoring

## Getting Started

1. **Install the package**
   ```bash
   pip install langchain-olostep
   ```

2. **Get your API key**
   - Sign up at olostep.com
   - Get your API key from the dashboard

3. **Set your API key**
   ```bash
   export OLOSTEP_API_KEY="your_key_here"
   ```

4. **Try the examples**
   Check out the examples in the repository

## Documentation

- Olostep API Documentation: https://docs.olostep.com
- LangChain Documentation: https://python.langchain.com
- LangGraph Documentation: https://langchain-ai.github.io/langgraph/

## Contributing

Contributions are welcome! Please feel free to submit a Pull Request.

## License

MIT License - see LICENSE file for details.

## Support

- **Documentation**: docs.olostep.com
- **Issues**: GitHub Issues
- **Email**: info@olostep.com

## Why Olostep?

- **Reliable**: Handle JavaScript rendering, anti-scraping measures, and dynamic content
- **Fast**: Parallel processing for batch operations
- **Accurate**: AI-powered extraction for precise data gathering
- **Flexible**: Multiple formats, parsers, and configuration options
- **Scalable**: From single URLs to 100,000+ URLs in batch

## Changelog

### 0.2.0
- Complete redesign focusing on Olostep's core features
- Added scrape_with_answer for AI-powered Q&A
- Added scrape_with_map for structured data extraction
- Removed confusing document loader terminology
- Improved tool descriptions and examples
- Added comprehensive LangGraph example

### 0.1.0
- Initial release
