Metadata-Version: 2.4
Name: keytake-datareader
Version: 0.1.6
Summary: A package for extracting and formatting content from various sources
Home-page: https://github.com/vankhoa21991/DataReader
Author: Keytake Team
Author-email: vankhoa21991@gmail.com
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3.7
Classifier: Programming Language :: Python :: 3.8
Classifier: Programming Language :: Python :: 3.9
Classifier: License :: OSI Approved :: MIT License
Classifier: Operating System :: OS Independent
Classifier: Development Status :: 4 - Beta
Classifier: Intended Audience :: Developers
Classifier: Topic :: Text Processing :: Markup :: Markdown
Classifier: Topic :: Text Processing :: Linguistic
Requires-Python: >=3.7
Description-Content-Type: text/markdown
License-File: LICENSE
Requires-Dist: pypdf>=3.0.0
Requires-Dist: beautifulsoup4>=4.9.3
Requires-Dist: requests>=2.25.1
Requires-Dist: markdown>=3.3.4
Requires-Dist: SpeechRecognition>=3.8.1
Requires-Dist: pymupdf>=1.19.0
Requires-Dist: pydub>=0.25.1
Requires-Dist: yt-dlp>=2023.3.4
Requires-Dist: youtube-transcript-api>=0.6.0
Requires-Dist: moviepy>=1.0.3
Requires-Dist: markitdown
Requires-Dist: pytest>=6.0.0
Requires-Dist: black>=21.5b2
Requires-Dist: flake8>=3.9.0
Requires-Dist: isort>=5.9.0
Requires-Dist: ffmpeg-python>=0.2.0
Dynamic: license-file

# DataReader

A Python package for extracting and formatting content from various sources including PDFs, URLs, videos, and audio files.

## Features

- PDF processing with multiple backends (PyMuPDF, pypdf, markitdown)
- Web page content extraction
- Video content processing
- Audio transcription
- Markdown formatting

## Installation

```bash
pip install keytake-datareader
```

`pip` installs every Python dependency by default, so no optional extras are needed. Audio and video processing may additionally require the FFmpeg executable, which `pydub` and `ffmpeg-python` call but do not bundle.
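
Two of the listed dependencies, `pydub` and `ffmpeg-python`, are wrappers around an external `ffmpeg` executable. The following is a minimal sketch, using only the standard library, for checking that such an executable is on `PATH` before processing audio or video. The helper name `ffmpeg_available` is hypothetical and not part of this package.

```python
import shutil


def ffmpeg_available(executable: str = "ffmpeg") -> bool:
    """Return True if the named executable can be found on PATH."""
    # shutil.which returns the full path to the executable, or None if absent.
    return shutil.which(executable) is not None


if __name__ == "__main__":
    if ffmpeg_available():
        print("ffmpeg found")
    else:
        print("ffmpeg not found; audio and video processing may fail")
```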

## Quick Start

```python
from datareader import DataReader

# Process a PDF file
text = DataReader.read_pdf("document.pdf")

# Process a URL
web_content = DataReader.read_url("https://example.com")

# Process a video file
transcript = DataReader.read_video("video.mp4")

# Process an audio file
audio_text = DataReader.read_audio("audio.mp3")

# Save as markdown
DataReader.save_markdown(text, "output.md")
```
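
When the input type is not known in advance, the four readers above can be selected by inspecting the source string. The sketch below is one possible approach, not part of the package: the helper `reader_method_for` and the extension sets are assumptions made for illustration, and only the method names `read_pdf`, `read_url`, `read_video`, and `read_audio` come from the Quick Start.

```python
from pathlib import Path
from urllib.parse import urlparse

# Hypothetical extension sets; adjust them to the formats you actually use.
VIDEO_EXTS = {".mp4", ".mkv", ".mov", ".avi", ".webm"}
AUDIO_EXTS = {".mp3", ".wav", ".m4a", ".flac", ".ogg"}


def reader_method_for(source: str) -> str:
    """Return the name of the Quick Start method suited to `source`."""
    # Anything with an http(s) scheme is treated as a web page.
    if urlparse(source).scheme in ("http", "https"):
        return "read_url"
    # Otherwise decide by file extension, ignoring case.
    suffix = Path(source).suffix.lower()
    if suffix == ".pdf":
        return "read_pdf"
    if suffix in VIDEO_EXTS:
        return "read_video"
    if suffix in AUDIO_EXTS:
        return "read_audio"
    raise ValueError(f"Unsupported source: {source!r}")


# Assumed usage, relying on the call style shown in the Quick Start:
#   from datareader import DataReader
#   text = getattr(DataReader, reader_method_for("document.pdf"))("document.pdf")
```

Returning a method name rather than calling the reader keeps the helper free of any dependency on the package, so it can be tested in isolation.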

## Command Line Usage

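Shell scripts are provided for batch processing. The `./` prefix runs them from the current directory, so invoke them from the directory that contains the scripts: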

```bash
# Process PDFs
./run_pdf.sh

# Process URLs
./run_url.sh
```
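
The contents of these scripts are not shown here. As a rough Python equivalent of batch PDF processing, the sketch below converts every matching file in a folder to Markdown. The function `batch_convert` and its parameters are hypothetical; the reader and writer are passed in as callables, on the assumption that `DataReader.read_pdf` and `DataReader.save_markdown` behave as in the Quick Start.

```python
from pathlib import Path
from typing import Callable, List


def batch_convert(
    input_dir,
    output_dir,
    read: Callable[[str], str],
    save: Callable[[str, str], None],
    pattern: str = "*.pdf",
) -> List[Path]:
    """Convert each file matching `pattern` in `input_dir` to a .md file."""
    out = Path(output_dir)
    out.mkdir(parents=True, exist_ok=True)
    written = []
    # Sort for a deterministic processing order.
    for src in sorted(Path(input_dir).glob(pattern)):
        target = out / (src.stem + ".md")
        save(read(str(src)), str(target))
        written.append(target)
    return written


# Assumed usage with the Quick Start methods:
#   from datareader import DataReader
#   batch_convert("pdfs", "markdown", DataReader.read_pdf, DataReader.save_markdown)
```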

## Contributing

Contributions are welcome! Please feel free to submit a Pull Request.

## License

This project is licensed under the MIT License - see the LICENSE file for details.
