Metadata-Version: 2.3
Name: ytfetcher
Version: 0.4.1
Summary: YTFetcher lets you fetch YouTube transcripts in bulk with metadata like titles, publish dates, and thumbnails. Great for ML, NLP, and dataset generation.
License: MIT License
         
         Copyright (c) 2025 Ahmet Kaya
         
         Permission is hereby granted, free of charge, to any person obtaining a copy
         of this software and associated documentation files (the "Software"), to deal
         in the Software without restriction, including without limitation the rights
         to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
         copies of the Software, and to permit persons to whom the Software is
         furnished to do so, subject to the following conditions:
         
         The above copyright notice and this permission notice shall be included in
         all copies or substantial portions of the Software.
         
         THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
         IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
         FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
         AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
         LIABILITY...
Keywords: yt,transcript,youtube,cli,dataset,scraping,python
Author: Ahmet Kaya
Author-email: kaya70875@gmail.com
Requires-Python: >=3.11,<3.14
Classifier: Programming Language :: Python :: 3
Classifier: License :: OSI Approved :: MIT License
Classifier: Operating System :: OS Independent
Classifier: Topic :: Software Development :: Libraries
Classifier: Topic :: Utilities
Requires-Dist: fake-useragent (>=2.2.0,<3.0.0)
Requires-Dist: httpx (>=0.28.1,<0.29.0)
Requires-Dist: pydantic (>=2.11.7,<3.0.0)
Requires-Dist: requests (>=2.32.4,<3.0.0)
Requires-Dist: tqdm (>=4.67.1,<5.0.0)
Requires-Dist: types-requests (>=2.32.4.20250611,<3.0.0.0)
Requires-Dist: youtube-transcript-api (>=1.1.1,<2.0.0)
Project-URL: Documentation, https://github.com/kaya70875/ytfetcher#readme
Project-URL: Homepage, https://github.com/kaya70875/ytfetcher
Project-URL: Repository, https://github.com/kaya70875/ytfetcher
Description-Content-Type: text/markdown

# YTFetcher
[![codecov](https://codecov.io/gh/kaya70875/ytfetcher/branch/main/graph/badge.svg)](https://codecov.io/gh/kaya70875/ytfetcher)
[![PyPI version](https://img.shields.io/pypi/v/ytfetcher)](https://pypi.org/project/ytfetcher/)
[![License: MIT](https://img.shields.io/badge/License-MIT-yellow.svg)](https://opensource.org/licenses/MIT)

> ⚡ Turn hours of YouTube videos into clean, structured text in minutes.

**YTFetcher** is a Python tool for fetching YouTube video transcripts in bulk, along with rich metadata like titles, publish dates, and descriptions. Ideal for building NLP datasets, search indexes, or powering content analysis apps.

---

## 📚 Table of Contents

- [Features](#features)
- [Quick Start](#quick-start)
- [Installation](#installation)
- [Basic Usage](#basic-usage)
- [Fetching With Custom Video IDs](#fetching-with-custom-video-ids)
- [Exporting](#exporting)
- [Other Methods](#other-methods)
- [Proxy Configuration](#proxy-configuration)
- [Advanced HTTP Configuration](#advanced-http-configuration-optional)
- [CLI](#cli)
- [Contributing](#contributing)
- [Running Tests](#running-tests)
- [Related Projects](#related-projects)
- [License](#license)

---

## Features

- Fetch full transcripts from a YouTube channel.
- Get video metadata: title, description, thumbnails, published date.
- Async support for high performance.
- Export fetched data as txt, csv or json.
- CLI support.

---

## Quick Start

```bash
pip install ytfetcher
```

Fetch transcripts and metadata for 50 videos from a channel and save them as JSON:
```bash
ytfetcher from_channel --api-key YOUR_API_KEY -c TheOffice -m 50 -f json
```

You can get your API key from the [Google Cloud Console](https://console.cloud.google.com/apis/api/youtube.googleapis.com).

## Installation

The recommended way to install this package is with pip:

```bash
pip install ytfetcher
```

## Basic Usage

**Note:** When specifying the channel, provide the exact **channel handle** without the `@` symbol. Do not pass the channel URL or the channel display name.  
For example, use `TheOffice` instead of `@TheOffice` or `https://www.youtube.com/c/TheOffice`.

### How to find the channel handle for a YouTube channel

1. Go to the YouTube channel page.
2. Look at the URL in your browser's address bar.
3. The handle is the part that comes right after `https://www.youtube.com/@`  
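
If you start from a copied URL or an `@`-prefixed handle, a small helper can reduce it to the bare handle before you pass it to YTFetcher. The sketch below is only an illustration: `normalize_handle` is a hypothetical function, not part of ytfetcher, and it assumes the URL uses the `https://www.youtube.com/@handle` form described in the steps above.

```python
from urllib.parse import urlparse


def normalize_handle(value: str) -> str:
    """Return the bare channel handle from '@Handle', a handle URL, or a bare handle.

    Hypothetical helper for illustration; ytfetcher does not provide it.
    """
    value = value.strip()
    if "://" in value:
        # Use the first path segment, e.g. '/@TheOffice/videos' -> '@TheOffice'.
        segments = [s for s in urlparse(value).path.split("/") if s]
        if not segments or not segments[0].startswith("@"):
            raise ValueError(f"URL does not contain an @handle: {value!r}")
        value = segments[0]
    # Drop the leading '@' if one is present.
    return value.lstrip("@")


print(normalize_handle("https://www.youtube.com/@TheOffice"))  # TheOffice
print(normalize_handle("@TheOffice"))  # TheOffice
```

URLs without an `@handle` segment, such as `https://www.youtube.com/c/TheOffice`, raise a `ValueError` instead of guessing.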

YTFetcher uses the **YouTube Data API v3** to get channel details and video IDs, so you need to create an API key in the [Google Cloud Console](https://console.cloud.google.com/apis/api/youtube.googleapis.com).

Keep in mind that the **YouTube Data API v3** enforces a quota limit, although for basic usage the quota is generally not a concern.

Here is how you can get transcripts and metadata like channel name, description, and publish date from a single channel with the `from_channel` method:

```python
from ytfetcher import YTFetcher
from ytfetcher import ChannelData  # Or: from ytfetcher.models import ChannelData
import asyncio

fetcher = YTFetcher.from_channel(
    api_key='your-youtubev3-api-key', 
    channel_handle="TheOffice", 
    max_results=2)

async def get_channel_data() -> list[ChannelData]:
    channel_data = await fetcher.fetch_youtube_data()
    return channel_data

if __name__ == '__main__':
    data = asyncio.run(get_channel_data())
    print(data)
```

---

This returns a list of `ChannelData` objects. Here is what it looks like:

```python
[
ChannelData(
    video_id='video1',
    transcripts=[
        Transcript(
            text="Hey there",
            start=0.0,
            duration=1.54
        ),
        Transcript(
            text="Happy coding!",
            start=1.56,
            duration=4.46
        )
    ],
    metadata=Snippet(
        title='VideoTitle',
        description='VideoDescription',
        publishedAt='02.04.2025',
        channelId='id123',
        thumbnails=Thumbnails(
            default=Thumbnail(
                url='thumbnail_url',
                width=124,
                height=124
            )
        )
    )
),
# Other ChannelData objects...
]
```
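
As a rough sketch of how this structure can be consumed, the example below joins the transcript segments of one video into plain text and computes when the last segment ends. The `Transcript` and `ChannelData` dataclasses here are stand-ins that mirror only the fields shown above so the snippet runs on its own; in real code the objects would come from `fetch_youtube_data()`. The helper names `transcript_text` and `end_time` are hypothetical.

```python
from dataclasses import dataclass


# Stand-in classes mirroring only the fields shown above (assumption:
# the real ytfetcher models expose the same attribute names).
@dataclass
class Transcript:
    text: str
    start: float
    duration: float


@dataclass
class ChannelData:
    video_id: str
    transcripts: list[Transcript]


def transcript_text(item: ChannelData) -> str:
    """Join all transcript segments of one video into a single string."""
    return " ".join(segment.text for segment in item.transcripts)


def end_time(item: ChannelData) -> float:
    """Return when the last segment ends (start + duration), or 0.0 if empty."""
    return max((s.start + s.duration for s in item.transcripts), default=0.0)


sample = ChannelData(
    video_id="video1",
    transcripts=[
        Transcript(text="Hey there", start=0.0, duration=1.54),
        Transcript(text="Happy coding!", start=1.56, duration=4.46),
    ],
)
print(transcript_text(sample))  # Hey there Happy coding!
```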

## Fetching With Custom Video IDs

You can also initialize `ytfetcher` with specific video IDs using the `from_video_ids` method.

```python
from ytfetcher import YTFetcher
import asyncio

fetcher = YTFetcher.from_video_ids(
    api_key='your-youtubev3-api-key', 
    video_ids=['video1', 'video2', 'video3'])  # Initialize ytfetcher from explicit video IDs.

# The rest is the same as in Basic Usage ...
```
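
If you have full video URLs rather than IDs, you can extract the IDs first. The following is a minimal sketch using only the standard library; `extract_video_id` is a hypothetical helper, not part of ytfetcher, and it assumes only the common `watch?v=<id>` and `youtu.be/<id>` URL forms.

```python
from urllib.parse import parse_qs, urlparse


def extract_video_id(url: str) -> str:
    """Return the video ID from a watch or youtu.be URL (hypothetical helper)."""
    parsed = urlparse(url)
    host = parsed.netloc.lower()
    if host.endswith("youtu.be"):
        # Short links carry the ID as the path: https://youtu.be/<id>
        video_id = parsed.path.lstrip("/")
    else:
        # Watch links carry the ID in the 'v' query parameter.
        video_id = parse_qs(parsed.query).get("v", [""])[0]
    if not video_id:
        raise ValueError(f"No video ID found in {url!r}")
    return video_id


urls = [
    "https://www.youtube.com/watch?v=video1",
    "https://youtu.be/video2",
]
video_ids = [extract_video_id(u) for u in urls]
print(video_ids)  # ['video1', 'video2']
```

The resulting `video_ids` list can then be passed as the `video_ids` argument shown above.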

## Exporting

To export data, use the `Exporter` class. It exports `ChannelData` in **csv**, **json**, or **txt** format.

```python
import asyncio
from ytfetcher.services import Exporter

async def export_channel_data() -> None:
    # `fetcher` is the YTFetcher instance created in Basic Usage.
    channel_data = await fetcher.fetch_youtube_data()

    exporter = Exporter(
        channel_data=channel_data,
        allowed_metadata_list=['title', 'publishedAt'],   # You can customize this
        timing=True,                                      # Include transcript start/duration
        filename='my_export',                             # Base filename
        output_dir='./exports'                            # Optional export directory
    )

    exporter.export_as_json()  # or .export_as_txt(), .export_as_csv()

if __name__ == '__main__':
    asyncio.run(export_channel_data())
```

## Other Methods

You can also fetch only the transcripts or only the metadata for your videos using the `fetch_transcripts` and `fetch_snippets` methods.

### Fetch Transcripts

```python
import asyncio
from ytfetcher import YTFetcher, VideoTranscript

fetcher = YTFetcher.from_channel(
    api_key='your-youtubev3-api-key', 
    channel_handle="TheOffice", 
    max_results=2)

async def get_transcript_data() -> list[VideoTranscript]:
    transcript_data = await fetcher.fetch_transcripts()
    return transcript_data

if __name__ == '__main__':
    data = asyncio.run(get_transcript_data())
    print(data)

```

### Fetch Snippets

```python
from ytfetcher import VideoMetadata

# Initialize ytfetcher as shown above ...

def get_metadata() -> list[VideoMetadata]:
    metadata = fetcher.fetch_snippets()
    return metadata

if __name__ == '__main__':
    print(get_metadata())

```

## Proxy Configuration

`YTFetcher` supports proxy usage for fetching YouTube transcripts by leveraging the built-in proxy configuration support from [youtube-transcript-api](https://pypi.org/project/youtube-transcript-api/).

To configure proxies, pass a proxy config object from `ytfetcher.config` directly to `YTFetcher`:

```python
from ytfetcher import YTFetcher
from ytfetcher.config import GenericProxyConfig, WebshareProxyConfig

fetcher = YTFetcher.from_channel(
    api_key="your-api-key",
    channel_handle="TheOffice",
    max_results=3,
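    # Pass exactly one config object; the arguments follow youtube-transcript-api's proxy classes.
    proxy_config=GenericProxyConfig(http_url="http://user:pass@host:port")
    # or: proxy_config=WebshareProxyConfig(proxy_username="<USERNAME>", proxy_password="<PASSWORD>")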
)
```

For more information about proxy configuration, see the official `youtube-transcript-api` documentation.

## Advanced HTTP Configuration (Optional)

You can pass a custom timeout or headers (e.g., user-agent) to `YTFetcher` using `HTTPConfig`:

```python
from ytfetcher import YTFetcher
from ytfetcher.config import HTTPConfig

custom_config = HTTPConfig(
    timeout=4.0,
    headers={"User-Agent": "ytfetcher/1.0"}  # Changing this is not recommended unless you have a well-tested set of headers.
)

fetcher = YTFetcher.from_channel(
    api_key="your-key",
    channel_handle="TheOffice",
    max_results=10,
    http_config=custom_config
)
```

## CLI

### Basic Usage

```bash
ytfetcher from_channel --api-key <API_KEY> -c <CHANNEL_HANDLE> -m <MAX_RESULTS> -f <FORMAT>
```

Basic usage example:

```bash
ytfetcher from_channel --api-key <API_KEY> -c <CHANNEL_HANDLE> -m 20 -f json
```

### Fetching by Video IDs

```bash
ytfetcher from_video_ids --api-key <API_KEY> -v video_id1 video_id2 ... -f json
```

### Output Example

```json
[
  {
    "video_id": "abc123",
    "metadata": {
      "title": "Video Title",
      "description": "Video Description",
      "publishedAt": "2023-07-01T12:00:00Z"
    },
    "transcripts": [
      {"text": "Welcome!", "start": 0.0, "duration": 1.2}
    ]
  }
]
```
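
A JSON export in this shape can be post-processed with the standard library alone. The sketch below builds one plain-text document per video. It embeds sample data in the shape shown above so that it runs on its own; a real script would read the exported file instead. The second transcript segment is added purely for illustration.

```python
import json

# Sample data in the shape shown above; a real run would read the exported file.
exported = """
[
  {
    "video_id": "abc123",
    "metadata": {"title": "Video Title", "publishedAt": "2023-07-01T12:00:00Z"},
    "transcripts": [
      {"text": "Welcome!", "start": 0.0, "duration": 1.2},
      {"text": "Let's begin.", "start": 1.2, "duration": 2.0}
    ]
  }
]
"""

videos = json.loads(exported)

# Build one plain-text document per video, keyed by video ID.
documents = {
    video["video_id"]: " ".join(seg["text"] for seg in video["transcripts"])
    for video in videos
}

for video in videos:
    print(video["metadata"]["title"], "->", documents[video["video_id"]])
```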

### Setting API Key Globally In CLI

You can save your API key once with the `ytfetcher config` command and then use it globally, without passing it every time you run the CLI.

```bash
ytfetcher config <YOUR_API_KEY>
```

Now you can run ytfetcher **without passing the API key** argument.

```bash
ytfetcher from_channel -c <CHANNEL_HANDLE>
```

### Using Webshare Proxy

```bash
ytfetcher from_channel --api-key <API_KEY> -c <CHANNEL_HANDLE> -f json \
  --webshare-proxy-username "<USERNAME>" \
  --webshare-proxy-password "<PASSWORD>"

```

### Using Custom Proxy

```bash
ytfetcher from_channel --api-key <API_KEY> -c <CHANNEL_HANDLE> -f json \
  --http-proxy "http://user:pass@host:port" \
  --https-proxy "https://user:pass@host:port"

```

### Using Custom HTTP Config
```bash
ytfetcher from_channel --api-key <API_KEY> -c <CHANNEL_HANDLE> \
  --http-timeout 4.2 \
  --http-headers "{'key': 'value'}" # Wrap the whole value in double quotes and use single quotes inside it.
```
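
The quoting rule is easier to remember once you see what the string looks like after the shell strips the outer double quotes. The sketch below parses such a string with `ast.literal_eval`, which accepts Python literal syntax. Whether ytfetcher parses the flag this way internally is an assumption. The example only shows why a single-quoted dictionary is a valid Python literal but not valid JSON.

```python
import ast
import json

# What the program receives after the shell removes the outer double quotes.
raw = "{'key': 'value'}"

# ast.literal_eval accepts Python literal syntax, including single-quoted strings.
headers = ast.literal_eval(raw)
print(headers)  # {'key': 'value'}

# The same text is not valid JSON, because JSON requires double-quoted strings.
try:
    json.loads(raw)
    is_valid_json = True
except json.JSONDecodeError:
    is_valid_json = False
print(is_valid_json)  # False
```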

---

## Contributing

To install this project locally:

```bash
git clone https://github.com/kaya70875/ytfetcher.git
cd ytfetcher
poetry install
```

## Running Tests

```bash
poetry run pytest
```

## Related Projects

- [youtube-transcript-api](https://github.com/jdepoix/youtube-transcript-api)

## License

This project is licensed under the MIT License — see the [LICENSE](./LICENSE) file for details.

