Metadata-Version: 2.1
Name: sl_sources
Version: 0.0.3
Summary: Code for Society Library Sources
Home-page: https://github.com/SocietyLibrary/Sources
Author: SocietyLibrary
Author-email: info@societylibrary.org
License: Proprietary
Keywords: data sources
Classifier: Development Status :: 3 - Alpha
Classifier: License :: Other/Proprietary License
Classifier: Natural Language :: English
Classifier: Operating System :: OS Independent
Classifier: Programming Language :: Python
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3 :: Only
Classifier: Programming Language :: Python :: 3.9
Classifier: Programming Language :: Python :: 3.10
Classifier: Programming Language :: Python :: 3.11
Classifier: Programming Language :: Python :: 3.12
Classifier: Intended Audience :: Developers
Classifier: Intended Audience :: Science/Research
Classifier: Topic :: Scientific/Engineering
Requires-Python: >=3.9
Description-Content-Type: text/markdown
Requires-Dist: requests>=2.27.1
Requires-Dist: aiohttp==3.9.5
Requires-Dist: PyPDF2==3.0.1
Requires-Dist: Requests==2.32.3
Requires-Dist: tqdm==4.66.4
Requires-Dist: pytest==8.2.2
Requires-Dist: pybtex==0.24.0
Requires-Dist: twikit==2.1.0
Requires-Dist: pyalex==0.14
Requires-Dist: youtube-search2==2.1.7
Requires-Dist: python-dotenv==1.0.1
Requires-Dist: yt-dlp==2024.7.25
Requires-Dist: playwright==1.45.1
Requires-Dist: transformers==4.43.3
Requires-Dist: huggingface_hub==0.24.5
Requires-Dist: librosa==0.10.2
Requires-Dist: torch==2.4.0

# Society Library Sources

This repo contains the source document collectors for the Society Library.

## Use the package directly
```bash
# Note: you will need to set your environment variables for this to work, see .env.template
export GOOGLE_API_KEY=<your api key>
python -m sl_sources search google_scholar "artificial intelligence" --num_results 5
export SEMANTIC_SCHOLAR_API_KEY=<your api key>
python -m sl_sources download semantic_scholar <paper id>
python -m sl_sources search youtube "machine learning tutorial" --num_results 3 --output results.json
```

## Library
```bash
pip install sl_sources
```

Then import the functions you need in Python, for example `from sl_sources import search_google_scholar, download_from_google_scholar`. You will need to set your environment variables for this to work; see `.env.template`.

You can build and publish an updated version of the library with the following commands:
```
# remove the dist folder if it exists
rm -rf dist

# build the library
python setup.py sdist bdist_wheel

# upload to pypi
twine upload dist/*
```

## Worker
The Media Worker wraps the search and download functions for all sources and is especially useful for scraping websites and downloading videos.

### Setup
You will need a Google account with "Google Cloud Functions" and "Google Cloud Build" enabled. If you initialize gcloud with your Google account and log in, both can be enabled for you automatically when the worker is deployed.

Download and install the gcloud CLI:
```
brew install --cask google-cloud-sdk # mac

# Linux
curl -O https://dl.google.com/dl/cloudsdk/channels/rapid/downloads/google-cloud-cli-linux-x86_64.tar.gz
tar -xf google-cloud-cli-linux-x86_64.tar.gz
./google-cloud-sdk/install.sh
```

Then initialize gcloud and authenticate:
```
gcloud init
gcloud auth login
```

### Local development
You can run the worker locally using functions-framework:
```bash
pip install functions-framework
functions-framework --target handle_request --debug
```

Make sure you have set `CLOUD_FUNCTION_URL=http://127.0.0.1:8080` and `CLOUD_FUNCTION_ENABLED=true` in your .env file.

You can now call the function using curl:
```bash
# search
curl -X POST http://127.0.0.1:8080 -H "Content-Type: application/json" -d '{"source": "google", "query": "artificial intelligence in neuroscience", "type": "search", "num_results": 10}'

# download
curl -X POST http://127.0.0.1:8080 -H "Content-Type: application/json" -d '{"search_result": {"url": "https://www.google.com", "title": "Google", "source": "google"}, "type": "download"}'
# note that the search_result object is the result of the search function
```
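
The same two requests can also be sent from Python. The sketch below is not part of sl_sources: the helper names `build_search_payload`, `build_download_payload`, and `post_json` are hypothetical, the payload fields are copied from the curl commands above, and it assumes the worker replies with JSON.

```python
import json
import urllib.request

# Local functions-framework address used in the curl examples above.
WORKER_URL = "http://127.0.0.1:8080"


def build_search_payload(source, query, num_results=10):
    """Build the JSON body for a search request (fields as in the curl example)."""
    return {
        "source": source,
        "query": query,
        "type": "search",
        "num_results": num_results,
    }


def build_download_payload(search_result):
    """Build the JSON body for a download request from one search result."""
    return {"search_result": search_result, "type": "download"}


def post_json(url, payload, timeout=60):
    """POST a JSON payload and decode the response (assumed to be JSON)."""
    request = urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))
```

With the worker running locally, `post_json(WORKER_URL, build_search_payload("google", "artificial intelligence in neuroscience"))` sends the search request, and passing one of its results to `build_download_payload` produces the body for the download request.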

### Deploy the worker
```bash
bash deploy_worker.sh
```

The worker is deployed with the environment variables in your .env file, so check that they hold the values you want before deploying.

You will need to update your .env and set `CLOUD_FUNCTION_ENABLED` to "true" and `CLOUD_FUNCTION_URL` to your deployed worker URL, which will be shown at deployment time. It should look like this:
```bash
CLOUD_FUNCTION_ENABLED=true
CLOUD_FUNCTION_URL=https://us-<region>-<project>.cloudfunctions.net/media_worker
```
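
If you want to check these two settings from your own code, something like the following could work. This is only a sketch: `worker_settings` is a hypothetical helper, and the README does not show how sl_sources itself parses these variables, so the case-insensitive comparison with "true" is an assumption.

```python
import os


def worker_settings(env=None):
    """Return (enabled, url) read from CLOUD_FUNCTION_ENABLED and CLOUD_FUNCTION_URL.

    Assumption: the worker counts as enabled only when the flag equals "true"
    (case-insensitive); sl_sources may interpret the flag differently.
    """
    env = os.environ if env is None else env
    enabled = env.get("CLOUD_FUNCTION_ENABLED", "").strip().lower() == "true"
    url = env.get("CLOUD_FUNCTION_URL", "").strip()
    if enabled and not url:
        raise ValueError("CLOUD_FUNCTION_ENABLED is true but CLOUD_FUNCTION_URL is empty")
    return enabled, url
```

Calling `worker_settings()` with no arguments reads the current process environment, so load your .env first (for example with python-dotenv, which is already a dependency).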

You can deploy and run many workers simultaneously. The one limitation is that a cloud function can run for at most 9 minutes (540 seconds), so split your work into chunks that each finish well within that limit.
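
One way to stay under the 540-second limit is to batch work items by estimated runtime. The sketch below is illustrative only: `split_into_batches` is a hypothetical helper, the per-item time estimate is something you would measure yourself, and the 0.5 safety factor is an arbitrary assumption.

```python
def split_into_batches(items, seconds_per_item, limit_seconds=540, safety_factor=0.5):
    """Split items into batches whose estimated runtime stays under the limit.

    seconds_per_item is your own estimate; safety_factor leaves headroom
    for slow downloads and cold starts.
    """
    if seconds_per_item <= 0:
        raise ValueError("seconds_per_item must be positive")
    budget = limit_seconds * safety_factor
    # Always keep at least one item per batch, even if one item exceeds the budget.
    batch_size = max(1, int(budget // seconds_per_item))
    return [items[i:i + batch_size] for i in range(0, len(items), batch_size)]
```

Each batch can then be sent to a separate worker invocation. An item whose estimate exceeds the budget still gets a batch of its own, so very long downloads need to be handled separately.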

### Testing worker locally
You can test the worker locally with functions-framework, using the same commands as in "Local development":
```bash
pip install functions-framework
functions-framework --target handle_request --debug
```
