Metadata-Version: 2.4
Name: worai
Version: 6.2.2
Summary: AI-powered CLI for WordLift knowledge graph and SEO workflows.
Requires-Python: >=3.10
Description-Content-Type: text/markdown
Requires-Dist: copier<10.0.0,>=9.7.1
Requires-Dist: jinja2>=3.1.0
Requires-Dist: morph-kgc>=2.7.0
Requires-Dist: playwright>=1.48.0
Requires-Dist: python-dotenv>=1.0.0
Requires-Dist: pyshacl>=0.26.0
Requires-Dist: typer>=0.12.5
Requires-Dist: wordlift-sdk<7.0.0,>=6.0.0
Provides-Extra: dev
Requires-Dist: pytest>=8.3.4; extra == "dev"
Requires-Dist: pytest-asyncio>=0.23.0; extra == "dev"

# worai

Command-line toolkit for WordLift operations and SEO checks.
Pronunciation: "waw-RYE"

Docs: https://docs.wordlift.io/worai/

## Install

- `pipx install worai`
- `pip install worai`

Runtime dependency note:
- `wordlift-sdk>=6.0.0,<7.0.0` (installed automatically by pip)
- `copier` (required by `worai graph sync create`, installed automatically by pip)

If you plan to run `seocheck`, install Playwright browsers:
- `playwright install chromium`

## Quick Start

- `worai --help`
- `worai seocheck https://example.com/sitemap.xml`
- `worai google-search-console --site sc-domain:example.com --client-secrets ./client_secrets.json`
- `worai <command> --help`

## Configuration

Config file (TOML) discovery order:
- `--config`
- `WORAI_CONFIG`
- nearest `worai.toml` from current directory upward (for example `./worai.toml`, `../worai.toml`, `../../worai.toml`)
- `~/.config/worai/config.toml`
- `~/.worai.toml`
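
The discovery order above can be sketched in a few lines of Python. This is an illustration of the documented precedence only, not worai's actual implementation; the function name `discover_config` and its parameters are hypothetical.

```python
from pathlib import Path
from typing import Mapping, Optional


def discover_config(
    cli_config: Optional[str],
    env: Mapping[str, str],
    cwd: Path,
    home: Path,
) -> Optional[Path]:
    """Return the first config path in the documented order, or None."""
    # 1. An explicit --config value wins.
    if cli_config:
        return Path(cli_config)
    # 2. Then the WORAI_CONFIG environment variable.
    if env.get("WORAI_CONFIG"):
        return Path(env["WORAI_CONFIG"])
    # 3. Then the nearest worai.toml, walking from cwd up to the filesystem root.
    for directory in [cwd, *cwd.parents]:
        candidate = directory / "worai.toml"
        if candidate.is_file():
            return candidate
    # 4. Finally the two per-user locations, in the order listed above.
    for candidate in (home / ".config" / "worai" / "config.toml", home / ".worai.toml"):
        if candidate.is_file():
            return candidate
    return None
```

For example, `discover_config(None, os.environ, Path.cwd(), Path.home())` would return the file a plain `worai` invocation is expected to pick up under these rules.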

Profiles:
- `[profiles.<name>]` with `--profile` or `WORAI_PROFILE`

Common keys:
- `profiles.<name>.api_key`
- `profiles.<name>.mapping` (SDK profile contract)
- exactly one source per profile (`urls`, `sitemap_url`, or `sheets_url` + `sheets_name` + `sheets_service_account`); the SDK requires this for a profile to be valid
- `postprocessor_runtime` (graph sync runtime: `oneshot` or `persistent`; profile override supported)
- `ingest.source` (`auto|urls|sitemap|sheets|local`)
- `ingest.loader` (`auto|simple|proxy|playwright|premium_scraper|web_scrape_api|passthrough`)
- `ingest.passthrough_when_html` (default: `true`)
- command-specific OAuth/GSC/GA options should be passed via CLI flags or environment variables.
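
The "one source per profile" rule can be checked before a run with a small helper. The sketch below is an assumption about how such a check might look, not the SDK's validation code; `detect_source` is a hypothetical name, and the returned labels are chosen to match the `ingest.source` values listed above.

```python
from typing import Any, Mapping


def detect_source(profile: Mapping[str, Any]) -> str:
    """Return the single configured source kind, or raise ValueError."""
    configured = []
    if profile.get("urls"):
        configured.append("urls")
    if profile.get("sitemap_url"):
        configured.append("sitemap")
    sheets_keys = ("sheets_url", "sheets_name", "sheets_service_account")
    if any(profile.get(key) for key in sheets_keys):
        # The Sheets source is only complete when all three keys are present.
        missing = [key for key in sheets_keys if not profile.get(key)]
        if missing:
            raise ValueError(f"incomplete sheets source, missing: {', '.join(missing)}")
        configured.append("sheets")
    if len(configured) != 1:
        raise ValueError(f"expected exactly one source, found: {configured or 'none'}")
    return configured[0]
```

A profile that sets both `urls` and `sitemap_url` raises `ValueError` here, which mirrors the constraint described in the list.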

Supported environment variables:
- `WORAI_CONFIG` — path to a config TOML file (overrides discovery order).
- `WORAI_PROFILE` — profile name under `[profiles.<name>]`.
- `WORAI_LOG_LEVEL` — default log level (`debug|info|warning|error`).
- `WORAI_LOG_FORMAT` — default log format (`text|json`).
- `WORDLIFT_API_KEY` — WordLift API key for entity operations.
- `GSC_CLIENT_SECRETS` — path to OAuth client secrets JSON for GSC.
- `GSC_ID` — GSC property URL.
- `OAUTH_TOKEN` — path to store the shared OAuth token (GSC + GA).
- `GSC_OUTPUT` — default output CSV path for GSC export.
- `GA_ID` — GA4 property ID for Analytics sections.
- `GA_CLIENT_SECRETS` — path to OAuth client secrets JSON for GA4.
- `GSC_TOKEN` / `GA_TOKEN` — legacy aliases for `OAUTH_TOKEN` (must point to the same file if used).
- `WORAI_DISABLE_UPDATE_CHECK` — set to `1|true|yes|on` to disable startup update checks.
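
Two of the rules above lend themselves to a short illustration: the legacy token aliases must agree with `OAUTH_TOKEN`, and the update check is disabled by a small set of truthy strings. The helpers below are hypothetical and only model the documented behavior; comparing paths as plain strings and matching values case-insensitively are assumptions of this sketch.

```python
from typing import Mapping, Optional


def resolve_token_path(env: Mapping[str, str]) -> Optional[str]:
    """Pick the shared OAuth token path, honouring the legacy aliases."""
    names = ("OAUTH_TOKEN", "GSC_TOKEN", "GA_TOKEN")
    values = {name: env[name] for name in names if env.get(name)}
    # All variables that are set must name the same file.
    if len(set(values.values())) > 1:
        raise ValueError(f"token variables must point to the same file: {values}")
    return next(iter(values.values()), None)


def update_check_disabled(env: Mapping[str, str]) -> bool:
    """True when WORAI_DISABLE_UPDATE_CHECK holds one of the documented values."""
    # Case-insensitive matching is an assumption of this sketch.
    raw = env.get("WORAI_DISABLE_UPDATE_CHECK", "")
    return raw.strip().lower() in {"1", "true", "yes", "on"}
```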

`.env` support:
- `worai` loads `.env` at startup, looking in the current working directory first and then in its parent directories.
- values from `.env` are treated as environment variables.
- existing environment variables take precedence over `.env` values.
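
The precedence rule can be illustrated with a standard-library sketch. `python-dotenv` is listed as a dependency, and its `load_dotenv` defaults to `override=False`, which gives the same "existing variables win" behavior; whether worai calls it exactly this way is an assumption. The parser below is deliberately simplified (no `export` prefix, no multi-line values) and the function names are hypothetical.

```python
from typing import Dict, Mapping


def parse_dotenv(text: str) -> Dict[str, str]:
    """Parse simple KEY=VALUE lines; blank lines and # comments are skipped."""
    values: Dict[str, str] = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        # Drop surrounding whitespace and optional quotes around the value.
        values[key.strip()] = value.strip().strip("\"'")
    return values


def effective_env(process_env: Mapping[str, str], dotenv_text: str) -> Dict[str, str]:
    """Merge .env values under the process environment (existing variables win)."""
    merged = parse_dotenv(dotenv_text)
    merged.update(process_env)
    return merged
```

With `WORAI_PROFILE=prod` already exported and `WORAI_PROFILE=dev` in `.env`, the merged environment keeps `prod`.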

Example environment setup:
```sh
export WORDLIFT_API_KEY="wl_..."
# Use $HOME rather than ~ here: the shell does not expand ~ inside double quotes.
export WORAI_CONFIG="$HOME/worai.toml"
export WORAI_PROFILE="dev"
export GSC_CLIENT_SECRETS="$HOME/client_secrets.json"
export OAUTH_TOKEN="$HOME/oauth_token.json"
```

Example `worai.toml`:
```toml
[profiles.default]
api_key = "${WORDLIFT_API_KEY}"
mapping = "default.yarrrml"
sitemap_url = "https://example.com/sitemap.xml"
ingest_loader = "web_scrape_api"
```

Ingestion profile examples:
```toml
[profiles.inventory_local]
api_key = "${WORDLIFT_API_KEY}"
mapping = "default.yarrrml"
urls = ["https://example.com/page"]
ingest_source = "local"
ingest_loader = "passthrough"

[profiles.inventory_remote]
api_key = "${WORDLIFT_API_KEY}"
mapping = "default.yarrrml"
sitemap_url = "https://example.com/sitemap.xml"
ingest_source = "sitemap"
ingest_loader = "web_scrape_api"

[profiles.graph_sync_proxy]
api_key = "${WORDLIFT_API_KEY}"
mapping = "default.yarrrml"
urls = ["https://example.com/a", "https://example.com/b"]
ingest_source = "urls"
ingest_loader = "proxy"
web_page_import_timeout = 60
```
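
The profile examples reference the API key as `${WORDLIFT_API_KEY}`. Assuming such placeholders are replaced with environment values when the config is loaded, the substitution could look like the sketch below. The `expand_placeholders` helper is hypothetical, and leaving unknown names untouched is a choice made for this example rather than documented behavior.

```python
import re
from typing import Any, Mapping

# Matches ${NAME} where NAME is a conventional environment variable name.
_PLACEHOLDER = re.compile(r"\$\{([A-Za-z_][A-Za-z0-9_]*)\}")


def expand_placeholders(value: Any, env: Mapping[str, str]) -> Any:
    """Recursively replace ${NAME} in strings; unknown names are left untouched."""
    if isinstance(value, str):
        return _PLACEHOLDER.sub(lambda m: env.get(m.group(1), m.group(0)), value)
    if isinstance(value, list):
        return [expand_placeholders(item, env) for item in value]
    if isinstance(value, dict):
        return {key: expand_placeholders(item, env) for key, item in value.items()}
    # Numbers, booleans and other TOML scalars pass through unchanged.
    return value
```

Applied to a parsed `[profiles.default]` table with `os.environ`, this would turn `api_key = "${WORDLIFT_API_KEY}"` into the exported key while leaving URLs and other values as they are.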

## Commands

- `seocheck` — run SEO checks for sitemap URLs and URL lists.
- `google-search-console` — export GSC page metrics as CSV.
- `dedupe` — deduplicate WordLift entities by schema:url.
- `canonicalize-duplicate-pages` — select canonical URLs using GSC KPIs.
- `delete-entities-from-csv` — delete entities listed in a CSV.
- `find-faq-page-wrong-type` — find and patch FAQPage typing issues.
- `find-missing-names` — find entities missing schema:name/headline.
- `find-url-by-type` — list schema:url values by type from RDF.
- `graph` — run graph-specific workflows.
- `link-groups` — build or apply LinkGroup data from CSV.
- `patch` — patch entities from RDF.
- `structured-data` — generate JSON-LD/YARRRML mappings or materialize RDF from YARRRML.
- `validate` — validate JSON-LD with SHACL shapes (use `structured-data validate page` for webpage URLs).
- `self update` — check for new worai versions and optionally run the upgrade command.
- `upload-entities-from-turtle` — upload .ttl files with resume.
- `dil-import` - upload DILs from a CSV file.

Command help:
- `worai <command> --help`

Autocompletion:
- `worai --install-completion`
- `worai --show-completion`

Updates:
- `worai` checks for new versions periodically and prints a non-blocking notice when an update is available.
- run `worai self update` to check manually and see/apply the suggested upgrade command.

## Examples

seocheck
- `worai seocheck https://example.com/sitemap.xml`
- `worai seocheck https://example.com/sitemap.xml --output-dir ./seocheck-report --save-html`
- `worai seocheck https://example.com/sitemap.xml --output-dir ./seocheck-report --no-open-report`
- `worai seocheck https://example.com/sitemap.xml --user-agent "Mozilla/5.0 ..."`
- `worai seocheck https://example.com/sitemap.xml --sitemap-fetch-mode browser`
- `worai seocheck https://example.com/sitemap.xml --no-report-ui`
- `worai seocheck https://example.com/sitemap.xml --recheck-failed --recheck-from ./seocheck-report`

google-search-console
- `worai google-search-console --site sc-domain:example.com --client-secrets ./client_secrets.json`
  - Uses OAuth redirect port 8080 by default.

seoreport (with Analytics)
- `worai seoreport --site sc-domain:example.com --ga-id 123456789 --format html`

canonicalize-duplicate-pages
- `worai canonicalize-duplicate-pages --input gsc_pages.csv --output canonical_targets.csv --kpi-window 28d --kpi-metric clicks`
- `worai canonicalize-duplicate-pages --input gsc_pages.csv --entity-type Product`

dedupe
- `worai dedupe --dry-run`

find-faq-page-wrong-type
- `worai find-faq-page-wrong-type ./data.ttl --dry-run --replace-type`
- `worai find-faq-page-wrong-type ./data.ttl --patch --replace-type`

find-missing-names
- `worai find-missing-names ./data.ttl`

find-url-by-type
- `worai find-url-by-type ./data.ttl schema:Service schema:Product`

link-groups
- `worai link-groups ./links.csv --format turtle`
- `worai link-groups ./links.csv --apply --dry-run --concurrency 4`

graph
- `worai --config ./worai.toml graph sync run --profile acme`
- `worai graph sync run --profile acme --debug`
- `worai graph sync create ./acme-graph`
- `worai graph sync create ./acme-graph --template ./graph-sync-template --defaults`
- `worai graph sync create ./acme-graph --data-file ./answers.yml --non-interactive`
- `worai graph sync create ./acme-graph --vcs-ref v1.2.3`
- `worai graph export`
- `worai graph export --profile acme`
- `worai graph export ./acme-export.jsonld --profile acme`
- `worai graph export ./acme-export.ttl --validate`
- `worai graph property delete seovoc:html --dry-run`
- `worai graph property delete https://w3id.org/seovoc/html --yes --workers 4`
  - `graph export` reads API key from `worai.toml` profile (`--profile`, default `default`) and calls `/dataset/export`.
  - `graph export` output format is inferred from extension: `.ttl`, `.nt`, `.nq`, `.rdf`/`.xml`, `.jsonld`/`.json`.
  - `graph export` default filename: `export_<profile>_<yyyyMMdd>_<seq>.ttl` (sequence starts at `1`).
  - `graph export --validate` runs SHACL validation on the exported file and fails on SHACL errors/warnings.
  - `graph property delete` sends `X-include-Private: true` by default for both GraphQL match discovery and entity PATCH requests.
  - `graph sync create` runs Copier in trusted mode by default so template `_tasks` execute.
  - Mapping docs (for `[profiles.<name>]`): `docs/graph-sync-mappings-reference.md`, `docs/graph-sync-mappings-guide.md`, `docs/graph-sync-mappings-examples.md`
  - Internal template-agent workflow docs: `specs/graph-sync/AGENTS.md`, `specs/graph-sync/INDEX.md`, `specs/graph-sync/developer-agent-workflow.md`
  - Profile loading standard for non-sync commands: `specs/profile-loading-standard.md`
  - Configure exactly one source mode per run: `urls`, `sitemap_url` (+ optional pattern), or `sheets_url` + `sheets_name`.
  - `web_page_import_timeout` is configured in seconds in `worai.toml` (`60` -> `60000` ms in SDK).
  - SDK 6 defaults to persistent postprocessor runtime.
  - set `postprocessor_runtime = "oneshot"` in `worai.toml` to keep old one-process-per-callback behavior.
  - SDK `wordlift-sdk` 5.1.1+ postprocessor context migration:
    - `context.settings` -> `context.profile` (for example `context.profile["settings"]["api_url"]`)
    - `context.account.key` -> `context.account_key`
    - `context.account` remains the clean `/me` account object
  - SDK 6 ingestion uses explicit keys:
    - `INGEST_SOURCE` (`urls|sitemap|sheets|local|auto`)
    - `INGEST_LOADER` (`web_scrape_api|proxy|premium_scraper|playwright|simple|passthrough|auto`)
    - `INGEST_TIMEOUT_MS` (milliseconds)
  - The SDK 6 migration deprecates `WEB_PAGE_IMPORT_MODE` and `WEB_PAGE_IMPORT_TIMEOUT` for integrations; use the `INGEST_*` keys above instead.
  - `graph sync run` uses `run_cloud_workflow` and emits per-graph progress and final KPI summaries through CLI logs (`on_info`, `on_progress`, `on_kpi`).
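
Several of the `graph export` and timeout notes above are simple mappings, sketched here in Python. This is not worai's code: the function names are hypothetical, the format labels are placeholders, and picking the first unused sequence number is an assumption about how `<seq>` advances.

```python
from datetime import date
from pathlib import Path
from typing import Iterable

# Extension -> format label. The extensions come from the notes above;
# the label strings are placeholders chosen for this sketch.
FORMATS = {
    ".ttl": "turtle",
    ".nt": "n-triples",
    ".nq": "n-quads",
    ".rdf": "rdf-xml",
    ".xml": "rdf-xml",
    ".jsonld": "json-ld",
    ".json": "json-ld",
}


def infer_format(path: str) -> str:
    """Infer the export format from the output file extension."""
    suffix = Path(path).suffix.lower()
    if suffix not in FORMATS:
        raise ValueError(f"unsupported export extension: {suffix or '(none)'}")
    return FORMATS[suffix]


def default_export_name(profile: str, today: date, existing: Iterable[str]) -> str:
    """Build export_<profile>_<yyyyMMdd>_<seq>.ttl with the first free sequence number."""
    taken = set(existing)
    seq = 1  # the documented sequence starts at 1
    while True:
        name = f"export_{profile}_{today:%Y%m%d}_{seq}.ttl"
        if name not in taken:
            return name
        seq += 1


def timeout_seconds_to_ms(seconds: int) -> int:
    """Convert the worai.toml timeout (seconds) to the SDK value (milliseconds)."""
    return seconds * 1000
```

For instance, `infer_format("./acme-export.jsonld")` selects the JSON-LD label, and `timeout_seconds_to_ms(60)` gives the `60000` ms mentioned in the notes.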

patch
- `worai patch ./data.ttl --dry-run --add-types`

structured-data
- `worai structured-data create https://example.com/article Review --output-dir ./structured-data`
- `worai structured-data create https://example.com/article --type Review --output-dir ./structured-data`
- `worai structured-data create https://example.com/article --type Review --debug`
- `worai structured-data create https://example.com/article --type Review --max-xhtml-chars 40000 --max-nesting-depth 2`
- `worai structured-data generate https://example.com/sitemap.xml --yarrrml ./mapping.yarrrml --output-dir ./out`
- `worai structured-data generate https://example.com/page --yarrrml ./mapping.yarrrml --format jsonld`
- `worai structured-data inventory https://example.com/sitemap.xml --output ./structured-data-inventory.csv`
- `worai structured-data inventory ./urls.txt --output ./structured-data-inventory.csv`
- `worai structured-data inventory https://docs.google.com/spreadsheets/d/<id>/edit --sheet-name URLs_US --output ./structured-data-inventory.csv`
- `worai structured-data inventory https://example.com/sitemap.xml --destination-sheet-id <spreadsheet_id> --destination-sheet-name Inventory`
- `worai structured-data inventory https://example.com/sitemap.xml --output ./structured-data-inventory.csv --concurrency auto`
- `worai structured-data inventory /path/to/debug_cloud/us --source-type debug-cloud --output ./structured-data-inventory.csv`
- `worai structured-data inventory /path/to/debug_cloud/us --ingest-source local --ingest-loader passthrough --output ./structured-data-inventory.csv`
- `worai structured-data inventory https://example.com/sitemap.xml --ingest-loader web_scrape_api --output ./structured-data-inventory.csv`

validate
- `worai validate jsonld --shape review-snippet --shape schema-review ./data.jsonld`
- `worai validate jsonld --format raw https://api.wordlift.io/data/example.jsonld`
- `worai structured-data validate page https://example.com/article --shape review-snippet`

self update
- `worai self update --check-only`
- `worai self update --yes`

upload-entities-from-turtle
- `worai upload-entities-from-turtle ./entities --recursive --limit 50`

dil-import
- `worai dil-import <wordlift_key> <path_to_csv_file>`

## Troubleshooting

- Playwright missing browsers:
  - `playwright install chromium`
- YARRRML conversion:
  - `npm install -g @rmlio/yarrrml-parser`
- RML execution:
  - `morph-kgc` is included in project dependencies
- Dependency notes:
  - Common runtime libs (e.g., `requests`, `rdflib`, `tqdm`, `advertools`, Google auth helpers) are provided transitively by `wordlift-sdk`.
- OAuth token issues:
  - Remove the token file and re-run `worai google-search-console`.
  - If you are prompted to re-auth every run, delete the token file to force a new consent flow that includes a refresh token.
