Metadata-Version: 2.4
Name: clawsqlite
Version: 0.1.7
Summary: SQLite toolbox and knowledge base CLI for OpenClaw
Author: Ernest Yu
License: 
                                         Apache License
                                   Version 2.0, January 2004
                                http://www.apache.org/licenses/
        
           TERMS AND CONDITIONS FOR USE, REPRODUCTION, AND DISTRIBUTION
        
           1. Definitions.
        
              "License" shall mean the terms and conditions for use, reproduction,
              and distribution as defined by Sections 1 through 9 of this document.
        
              "Licensor" shall mean the copyright owner or entity authorized by
              the copyright owner that is granting the License.
        
              "Legal Entity" shall mean the union of the acting entity and all
              other entities that control, are controlled by, or are under common
              control with that entity. For the purposes of this definition,
              "control" means (i) the power, direct or indirect, to cause the
              direction or management of such entity, whether by contract or
              otherwise, or (ii) ownership of fifty percent (50%) or more of the
              outstanding shares, or (iii) beneficial ownership of such entity.
        
              "You" (or "Your") shall mean an individual or Legal Entity
              exercising permissions granted by this License.
        
              "Source" form shall mean the preferred form for making modifications,
              including but not limited to software source code, documentation
              source, and configuration files.
        
              "Object" form shall mean any form resulting from mechanical
              transformation or translation of a Source form, including but
              not limited to compiled object code, generated documentation,
              and conversions to other media types.
        
              "Work" shall mean the work of authorship, whether in Source or
              Object form, made available under the License, as indicated by a
              copyright notice that is included in or attached to the work
              (an example is provided in the Appendix below).
        
              "Derivative Works" shall mean any work, whether in Source or Object
              form, that is based on (or derived from) the Work and for which the
              editorial revisions, annotations, elaborations, or other modifications
              represent, as a whole, an original work of authorship. For the purposes
              of this License, Derivative Works shall not include works that remain
              separable from, or merely link (or bind by name) to the interfaces of,
              the Work and Derivative Works thereof.
        
              "Contribution" shall mean any work of authorship, including
              the original version of the Work and any modifications or additions
              to that Work or Derivative Works thereof, that is intentionally
              submitted to Licensor for inclusion in the Work by the copyright owner
              or by an individual or Legal Entity authorized to submit on behalf of
              the copyright owner. For the purposes of this definition, "submitted"
              means any form of electronic, verbal, or written communication sent
              to the Licensor or its representatives, including but not limited to
              communication on electronic mailing lists, source code control systems,
              and issue tracking systems that are managed by, or on behalf of, the
              Licensor for the purpose of discussing and improving the Work, but
              excluding communication that is conspicuously marked or otherwise
              designated in writing by the copyright owner as "Not a Contribution."
        
              "Contributor" shall mean Licensor and any individual or Legal Entity
              on behalf of whom a Contribution has been received by Licensor and
              subsequently incorporated within the Work.
        
           2. Grant of Copyright License. Subject to the terms and conditions of
              this License, each Contributor hereby grants to You a perpetual,
              worldwide, non-exclusive, no-charge, royalty-free, irrevocable
              copyright license to reproduce, prepare Derivative Works of,
              publicly display, publicly perform, sublicense, and distribute the
              Work and such Derivative Works in Source or Object form.
        
           3. Grant of Patent License. Subject to the terms and conditions of
              this License, each Contributor hereby grants to You a perpetual,
              worldwide, non-exclusive, no-charge, royalty-free, irrevocable
              (except as stated in this section) patent license to make, have made,
              use, offer to sell, sell, import, and otherwise transfer the Work,
              where such license applies only to those patent claims licensable
              by such Contributor that are necessarily infringed by their
              Contribution(s) alone or by combination of their Contribution(s)
              with the Work to which such Contribution(s) was submitted. If You
              institute patent litigation against any entity (including a
              cross-claim or counterclaim in a lawsuit) alleging that the Work
              or a Contribution incorporated within the Work constitutes direct
              or contributory patent infringement, then any patent licenses
              granted to You under this License for that Work shall terminate
              as of the date such litigation is filed.
        
           4. Redistribution. You may reproduce and distribute copies of the
              Work or Derivative Works thereof in any medium, with or without
              modifications, and in Source or Object form, provided that You
              meet the following conditions:
        
              (a) You must give any other recipients of the Work or
                  Derivative Works a copy of this License; and
        
              (b) You must cause any modified files to carry prominent notices
                  stating that You changed the files; and
        
              (c) You must retain, in the Source form of any Derivative Works
                  that You distribute, all copyright, patent, trademark, and
                  attribution notices from the Source form of the Work,
                  excluding those notices that do not pertain to any part of
                  the Derivative Works; and
        
              (d) If the Work includes a "NOTICE" text file as part of its
                  distribution, then any Derivative Works that You distribute must
                  include a readable copy of the attribution notices contained
                  within such NOTICE file, excluding those notices that do not
                  pertain to any part of the Derivative Works, in at least one
                  of the following places: within a NOTICE text file distributed
                  as part of the Derivative Works; within the Source form or
                  documentation, if provided along with the Derivative Works; or,
                  within a display generated by the Derivative Works, if and
                  wherever such third-party notices normally appear. The contents
                  of the NOTICE file are for informational purposes only and
                  do not modify the License. You may add Your own attribution
                  notices within Derivative Works that You distribute, alongside
                  or as an addendum to the NOTICE text from the Work, provided
                  that such additional attribution notices cannot be construed
                  as modifying the License.
        
              You may add Your own copyright statement to Your modifications and
              may provide additional or different license terms and conditions
              for use, reproduction, or distribution of Your modifications, or
              for any such Derivative Works as a whole, provided Your use,
              reproduction, and distribution of the Work otherwise complies with
              the conditions stated in this License.
        
           5. Submission of Contributions. Unless You explicitly state otherwise,
              any Contribution intentionally submitted for inclusion in the Work
              by You to the Licensor shall be under the terms and conditions of
              this License, without any additional terms or conditions.
              Notwithstanding the above, nothing herein shall supersede or modify
              the terms of any separate license agreement you may have executed
              with Licensor regarding such Contributions.
        
           6. Trademarks. This License does not grant permission to use the trade
              names, trademarks, service marks, or product names of the Licensor,
              except as required for reasonable and customary use in describing the
              origin of the Work and reproducing the content of the NOTICE file.
        
           7. Disclaimer of Warranty. Unless required by applicable law or
              agreed to in writing, Licensor provides the Work (and each
              Contributor provides its Contributions) on an "AS IS" BASIS,
              WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or
              implied, including, without limitation, any warranties or conditions
              of TITLE, NON-INFRINGEMENT, MERCHANTABILITY, or FITNESS FOR A
              PARTICULAR PURPOSE. You are solely responsible for determining the
              appropriateness of using or redistributing the Work and assume any
              risks associated with Your exercise of permissions under this License.
        
           8. Limitation of Liability. In no event and under no legal theory,
              whether in tort (including negligence), contract, or otherwise,
              unless required by applicable law (such as deliberate and grossly
              negligent acts) or agreed to in writing, shall any Contributor be
              liable to You for damages, including any direct, indirect, special,
              incidental, or consequential damages of any character arising as a
              result of this License or out of the use or inability to use the
              Work (including but not limited to damages for loss of goodwill,
              work stoppage, computer failure or malfunction, or any and all
              other commercial damages or losses), even if such Contributor
              has been advised of the possibility of such damages.
        
           9. Accepting Warranty or Additional Liability. While redistributing
              the Work or Derivative Works thereof, You may choose to offer,
              and charge a fee for, acceptance of support, warranty, indemnity,
              or other liability obligations and/or rights consistent with this
              License. However, in accepting such obligations, You may act only
              on Your own behalf and on Your sole responsibility, not on behalf
              of any other Contributor, and only if You agree to indemnify,
              defend, and hold each Contributor harmless for any liability
              incurred by, or claims asserted against, such Contributor by reason
              of your accepting any such warranty or additional liability.
        
           END OF TERMS AND CONDITIONS
        
           APPENDIX: How to apply the Apache License to your work.
        
              To apply the Apache License to your work, attach the following
              boilerplate notice, with the fields enclosed by brackets "[]"
              replaced with your own identifying information. (Don't include
              the brackets!)  The text should be enclosed in the appropriate
              comment syntax for the file format. We also recommend that a
              file or class name and description of purpose be included on the
              same "printed page" as the copyright notice for easier
              identification within third-party archives.
        
           Copyright [2026] [Ernest Yu]
        
           Licensed under the Apache License, Version 2.0 (the "License");
           you may not use this file except in compliance with the License.
           You may obtain a copy of the License at
        
               http://www.apache.org/licenses/LICENSE-2.0
        
           Unless required by applicable law or agreed to in writing, software
           distributed under the License is distributed on an "AS IS" BASIS,
           WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
           See the License for the specific language governing permissions and
           limitations under the License.
Project-URL: Homepage, https://github.com/ernestyu/clawsqlite
Project-URL: Source, https://github.com/ernestyu/clawsqlite
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3 :: Only
Classifier: License :: OSI Approved :: MIT License
Classifier: Intended Audience :: Developers
Classifier: Environment :: Console
Classifier: Topic :: Database
Classifier: Topic :: Text Processing
Requires-Python: >=3.10
Description-Content-Type: text/markdown
License-File: LICENSE
Requires-Dist: httpx>=0.23
Requires-Dist: jieba>=0.42
Requires-Dist: pypinyin>=0.55
Dynamic: license-file

# clawsqlite (knowledge)

**Languages:** English | [中文说明](README_zh.md)

`clawsqlite` is a CLI toolbox for SQLite‑based applications in
[OpenClaw](https://github.com/openclaw/openclaw). The first built‑in
application is a local Markdown + SQLite knowledge base.

This repo currently focuses on the **knowledge** app:

- commands are exposed under `clawsqlite knowledge ...` for users/skills.

A local Markdown + SQLite knowledge base for OpenClaw, designed for both
humans and agents.

The knowledge app helps you:

- Ingest URLs or raw text as Markdown files + SQLite records
- Run fast full‑text search over your notes and scraped articles
- Optionally enable vector search via an external embedding service
- Regenerate titles/summaries via heuristics; optionally use a small LLM for tags
- Keep the KB healthy with explicit maintenance commands (reindex/check/fix + maintenance prune/gc)

> **Status**: already used in real OpenClaw setups. The schema and CLI are kept small and stable on purpose.

---

## 1. Features

- **Pure SQLite backend**
  - `articles` table as source of truth
  - `articles_fts` FTS5 table for full‑text search
  - `articles_vec` vec0 table for vector search (optional)
- **Markdown storage**
  - Each article is stored as `articles/<id>__<slug>.md`
  - Markdown files include a small METADATA header + MARKDOWN body section
- **Configurable root**
  - All data lives under a single root directory
  - DB and articles dir default to `<root>/knowledge.sqlite3` and `<root>/articles` (see env overrides below)
- **Embeddings + LLM (optional)**
  - Embeddings: OpenAI‑compatible `/v1/embeddings` API
  - Small LLM: OpenAI‑compatible `/v1/chat/completions` API
- **Tag generation & search ranking**
  - Tags are generated from the long summary via a small LLM (when configured)
    or a jieba-based heuristic extractor
  - If the LLM is not configured, tag generation degrades to jieba and emits
    a `NEXT` hint
  - Search ranking uses tag/query matching as an additional signal on top
    of FTS and vector similarity. Internally, the scorer:
    - embeds both the article summary and the tag string into vec0 tables
      (`articles_vec` and `articles_tag_vec`)
    - normalizes vector distances via a **logistic sigmoid** over a
      1/(1+d) transform, so that truly close neighbors get much higher
      scores than merely okay matches
    - splits the tag channel into **semantic (vector) tag score** and
      **lexical tag match score**, with the split controlled by
      `CLAWSQLITE_TAG_VEC_FRACTION` (default 0.7)
    - applies an optional **log compression** to the lexical tag score
      (`ln(1 + alpha*x) / ln(1 + alpha)`, `alpha` from
      `CLAWSQLITE_TAG_FTS_LOG_ALPHA`, default 5.0) so that many partial
      tag hits don’t overpower the semantic channels
- **CLI first**
  - Simple subcommands: `ingest`, `search`, `show`, `export`, `update`, `delete`, `reindex`

---

## 2. Requirements

The knowledge app expects an environment similar to the OpenClaw container:

- Python 3.10+ with `sqlite3` and FTS5 enabled
- Python dependencies:
  - `jieba` (optional but strongly recommended for Chinese tag extraction)
  - `pypinyin` (optional; used to generate pinyin slugs for CJK titles)
- sqlite extensions (optional but recommended):
  - `libsimple.so` (tokenizer `simple`) for better CJK tokenization
  - `vec0.so` from [sqlite-vec](https://github.com/asg017/sqlite-vec)
- Network access to your embedding / small LLM HTTP endpoints (if you enable those features)

The repo assumes these paths by default (you can override them):

- Tokenizer extension: `/usr/local/lib/libsimple.so`
- vec0 extension: auto-discovered under `/app/node_modules/**/vec0.so` or system lib dirs

In a fresh environment you typically need to:

- Install `jieba` via pip:

  ```bash
  pip install jieba
  ```

- For `libsimple.so` and `vec0.so`:
  - In the OpenClaw container these are preinstalled.
  - On a custom system you can either:
    - Use distro packages if available (check your Linux distribution), or
    - Build from source following the upstream docs:
      - sqlite-vec: <https://github.com/asg017/sqlite-vec>
      - simple tokenizer: see the OpenClaw docs for building `libsimple.so`.
  - If these extensions are missing, the knowledge app will automatically degrade to:
    - SQLite built‑in tokenizer for FTS
    - If `jieba` is available, optionally pre-segment CJK text in Python for better
      Chinese recall (controlled by `CLAWSQLITE_FTS_JIEBA=auto|on|off`)
    - FTS‑only mode when vec0 is unavailable.

---

## 3. Installation

### 3.1 From PyPI (recommended for general use)

Once published, the simplest way to use `clawsqlite` is via PyPI:

```bash
pip install clawsqlite

# Then
clawsqlite knowledge --help
```

This installs the `clawsqlite` console script so you can call the CLI from
anywhere in your environment.

### 3.2 From source (development / OpenClaw workspace)

Clone the repo:

```bash
git clone git@github.com:ernestyu/clawsqlite.git
cd clawsqlite
```

(Inside OpenClaw’s workspace this repo may already be present at
`/home/node/.openclaw/workspace/clawsqlite`.)

You can run the knowledge app via the main shell entrypoint:

```bash
# From the repo root
./bin/clawsqlite knowledge --help

# Or explicitly choose a Python binary (e.g. your venv)
CLAWSQLITE_PYTHON=/opt/venv/bin/python ./bin/clawsqlite knowledge --help
```

The recommended CLI entrypoint for skills/users is:

```bash
clawsqlite knowledge ...
```

Use:

```bash
clawsqlite knowledge ...
```

---

## 4. Configuration

### 4.1 Root & paths

The knowledge app determines its root + DB + articles directory via CLI
flags + env + defaults.

Priority for root:

1. CLI: `--root`
2. Env: `CLAWSQLITE_ROOT`
3. Default: `<current working dir>/knowledge_data`

DB path:

- `--db` > `CLAWSQLITE_DB` > `<root>/knowledge.sqlite3`

Articles dir:

- `--articles-dir` > `CLAWSQLITE_ARTICLES_DIR` > `<root>/articles`

### 4.2 Project `.env`

A **project‑level `.env` file** is the primary configuration source. Start from the example:

```bash
cp ENV.example .env
# then edit .env
```

`ENV.example` contains fields like:

```env
# Embedding service (required for vector search)
EMBEDDING_BASE_URL=https://embed.example.com/v1
EMBEDDING_MODEL=your-embedding-model
EMBEDDING_API_KEY=sk-your-embedding-key
CLAWSQLITE_VEC_DIM=1024

# Small LLM (optional)
SMALL_LLM_BASE_URL=https://llm.example.com/v1
SMALL_LLM_MODEL=your-small-llm
SMALL_LLM_API_KEY=sk-your-small-llm-key

# Root override (optional)
# CLAWSQLITE_ROOT=/path/to/knowledge_root
# CLAWSQLITE_DB=/path/to/knowledge.sqlite3
# CLAWSQLITE_ARTICLES_DIR=/path/to/articles
```

At runtime, `clawsqlite knowledge ...` (and `python -m clawsqlite_cli`) will
auto‑load a project‑level `.env` from the current working directory.
Existing environment variables are **not** overridden. In OpenClaw containers
this is usually done via the agent’s env config instead of editing `.env`.

### 4.3 Embedding configuration

Embeddings are used for vector search (`articles_vec`) and can be disabled if you only want FTS.

Required env (typically via `.env`):

- `EMBEDDING_MODEL` – model name used by your embedding endpoint
- `EMBEDDING_BASE_URL` – base URL, e.g. `https://embed.example.com/v1`
- `EMBEDDING_API_KEY` – bearer token
- `CLAWSQLITE_VEC_DIM` – embedding dimension (e.g. `1024` for BAAI/bge-m3)

The knowledge app will:

- Use these env vars in `clawsqlite_knowledge.embed.get_embedding()` (via httpx POST to `/v1/embeddings`)
- Use `CLAWSQLITE_VEC_DIM` to define `embedding float[DIM]` in `articles_vec`

If any of these are missing, **vector features are treated as disabled**:

- `embedding_enabled()` returns `False`
- `ingest` will not call the embedding API
- `search` in `mode=hybrid` will auto‑downgrade to FTS‑only and print a `NEXT` hint

### 4.4 Tag generation (LLM + jieba fallback)

By default the knowledge app builds a **long summary** from the first ~1200
word‑units (ending at a paragraph boundary) plus the final paragraph. Tags
are generated from this long summary.

- If `SMALL_LLM_*` is configured and you use `--gen-provider llm`, tags are
  generated by the small LLM.
- Otherwise tags fall back to the jieba‑based v6 heuristic extractor (and
  emit a `NEXT` hint). If `jieba` is missing, we fall back to a lightweight
  keyword extractor.

Search ranking also uses tags as a small but important signal:

- When `jieba` is available (tags are ordered by importance), we compute a
  continuous tag match score in [0,1] based on how many query keywords
  exactly match the top tags and how early they appear.
- When `jieba` is not available, we fall back to a simple 0/1 bonus for
  any exact tag match, to avoid over‑interpreting a noisy tag order.

For FTS queries we also extract a small set of keywords from the
natural‑language query using the v4 heuristic extractor (jieba‑based, no
embeddings). When `jieba` is missing, query keyword extraction falls back
to a lightweight heuristic extractor.

For vector search, we embed the original natural‑language query **plus**
the extracted keywords appended as a short tail, so the embedding is less
sensitive to filler words while retaining context.

The final hybrid score is a weighted blend of signals::

    score = w_vec * vec_score + w_fts * fts_score
            + w_tag * tag_score + w_priority * priority_bonus
            + w_recency * recency_bonus

By default we use::

    CLAWSQLITE_SCORE_WEIGHTS=vec=0.55,fts=0.25,tag=0.15,priority=0.03,recency=0.02

which roughly means:

- ~55% vector similarity for deep semantic anchoring (summary vectors)
- ~25% BM25 keywords for textual sanity checks (FTS over title/tags/summary)
- ~15% tag channel (split between semantic tag vectors and lexical tag match)
- ~3% priority as a manual pinning mechanism
- ~2% recency to keep new knowledge slightly favored without dominating

You can override these weights via `CLAWSQLITE_SCORE_WEIGHTS` (see
`ENV.example` for details).

For **mixed Chinese/English knowledge bases** you may want to bias more
strongly towards tag semantics. A common production shape is::

    CLAWSQLITE_SCORE_WEIGHTS=vec=0.30,fts=0.10,tag=0.55,priority=0.03,recency=0.02
    CLAWSQLITE_TAG_VEC_FRACTION=0.82

which yields approximately:

- ~45% tag semantic (vector) score
- ~30% summary semantic (vector) score
- ~10% lexical tag match
- ~10% full-text FTS
- ~5% priority/recency

And you can tune the lexical tag compression via::

    CLAWSQLITE_TAG_FTS_LOG_ALPHA=5.0  # larger = stronger compression, 0 = disable

### 4.4 Small LLM configuration (optional)

For better tags (and optional keyword expansion) you can configure a small LLM endpoint:

```env
SMALL_LLM_BASE_URL=https://llm.example.com/v1
SMALL_LLM_MODEL=your-small-llm
SMALL_LLM_API_KEY=sk-your-small-llm-key
```

Then use `--gen-provider llm` when ingesting or updating records. The
knowledge app will call an OpenAI‑compatible chat completions API to
generate `tags` from the long summary.

If these env vars are not set, `provider=openclaw` uses heuristics only (no network calls).

### 4.5 Scraper configuration

The knowledge app does **not** implement web scraping itself. For `--url`
ingest it runs an external scraper command, configured via:

- CLI: `--scrape-cmd`
- Env: `CLAWSQLITE_SCRAPE_CMD`

Recommended usage:

```env
CLAWSQLITE_SCRAPE_CMD="node /path/to/scrape.js --some-flag"
```

The knowledge app will:

- Load this value from `.env` (stripping outer quotes)
- Use `shlex.split()` to build argv (no `shell=True` by default)
- Append the URL as the last argument if you don’t use `{url}`

Scraper output formats:

- **New format (recommended):**

  ```text
  --- METADATA ---
  Title: Some article title
  Author: Someone
  ...

  --- MARKDOWN ---
  # Markdown heading
  Body...
  ```

- **Old format (still supported):**

  ```text
  Title: Some article title
  # Markdown heading
  Body...
  ```

The knowledge app will parse these into `title` and markdown body.

---

## 5. Quickstart

### 5.1 Minimal setup

1. **Clone & cd**

   ```bash
   git clone git@github.com:ernestyu/clawsqlite.git
   cd clawsqlite
   ```

2. （可选）在你的外层环境中配置好根目录和 Embedding/LLM 相关 env，或
   直接在运行时通过 `--root/--db/--articles-dir` 传入。这里不再强依赖
   项目内置 `.env` 加载逻辑。

3. **First ingest (text)** – this also creates the DB and basic tables:

   ```bash
   clawsqlite knowledge ingest \
     --text "Hello clawsqlite" \
     --title "First note" \
     --category dev \
     --tags test \
     --gen-provider off \
     --json
   ```

   This will:

   - Create `<root>/knowledge.sqlite3`
   - Create `<root>/articles/000001__first-note.md`
   - Index the record in FTS (and vec if embedding is configured)

4. **Search it back**:

   ```bash
   clawsqlite knowledge search "Hello" --mode fts --json
   ```

   You should see the record you just created.

### 5.2 Ingest a URL

Assuming you have a scraper command set in `.env`:

```bash
clawsqlite knowledge ingest \
  --url "https://example.com/article" \
  --category web \
  --tags example \
  --gen-provider openclaw \
  --json
```

This will:

- Call your scraper
- Extract title + markdown body
- Optionally generate `summary`/`tags`
- Store everything in DB + markdown

### 5.3 Updating an existing URL (`--update-existing`)

If you know a URL’s content has changed and you want to refresh the existing record:

```bash
clawsqlite knowledge ingest \
  --url "https://example.com/article" \
  --gen-provider openclaw \
  --update-existing \
  --json
```

Semantics:

- If a record with this `source_url` exists (and is not deleted), and `--update-existing` is set:
  - The knowledge app updates that record’s `title` / `summary` / `tags` / `category` / `priority`
  - Keeps the same `id`
  - Rewrites the markdown file
  - Updates FTS and vec indexes
- If no such record exists, it behaves like a normal ingest.

Note: `source_url` has a UNIQUE index for non‑empty, non‑`Local` values, so each URL maps to at
most one active record.

---

## 6. CLI Overview

All commands share common flags (`--root`, `--db`, `--articles-dir`, `--json`, `--verbose`).
Run `clawsqlite knowledge <command> --help` for full details.

### 6.1 ingest

```bash
clawsqlite knowledge ingest --url URL [options]
clawsqlite knowledge ingest --text TEXT [options]
```

Key options:

- `--url` / `--text`
- `--title`, `--summary`, `--tags`, `--category`, `--priority`
- `--gen-provider {openclaw,llm,off}`
- `--scrape-cmd` (or env `CLAWSQLITE_SCRAPE_CMD`)
- `--update-existing` (for URL mode)

### 6.2 search

```bash
clawsqlite knowledge search "query" --mode hybrid --topk 20 --json
```

Modes:

- `hybrid` – combine vec + FTS
- `fts` – full‑text only
- `vec` – vector only (requires embedding enabled)

Other flags:

- `--candidates` – candidate pool before re‑ranking
- `--llm-keywords {auto,on,off}` – FTS query expansion
- `--gen-provider` – used only if `llm-keywords=auto` and FTS hits are too few
- Filters: `--category`, `--tag`, `--since`, `--priority`, `--include-deleted`

### 6.3 show / export / update / delete

- `show` – dump one record (optionally with full markdown content)
- `export` – write a record to a `.md` or `.json` file
- `update` – patch fields or regenerate via generator (id/source_url/created_at are treated as read-only)
- `delete` – soft delete by default (sets `deleted_at`); `--hard` for permanent removal

All of these commands now **check that the DB file exists** before opening it:

- If `--root/--db` or `.env` point to a non‑existent DB path, they will report:

  ```text
  ERROR: db not found at /path/to/db. Check --root/--db or .env configuration.
  ```

  instead of silently creating an empty DB and then failing with `id not found`.

### 6.4 reindex

Maintenance operations:

- `reindex --check` – report missing fields/indexes
- `reindex --fix-missing` – regen fields/indexes using current generator
- `reindex --rebuild --fts` – rebuild FTS index (via `clawsqlite index rebuild`)
- `reindex --rebuild --vec` – clear vec index **only** (no embedding); use
  `clawsqlite knowledge embed-from-summary` to refill embeddings.

The check output includes flags like `vec_available` and `embedding_enabled` to help you
understand whether vec features are actually usable for the current DB.

---

### 6.5 maintenance

```bash
clawsqlite knowledge maintenance prune --days 3 --dry-run
```

This scans for orphaned files, old `.bak_*` backups, and broken DB paths.
Use `--dry-run` to preview deletions; `maintenance gc` is an alias of `prune`.

---

## 7. Notes on File Naming & Titles

Markdown files are named:

```text
<id:06d>__<slugified-title>.md
```

- `id` comes from the `articles` table
- `slugified-title` is derived from the title:
  - If `pypinyin` is available, CJK tokens are converted to pinyin;
  - ASCII letters/digits are preserved; other symbols become `-`;
  - Repeated `-` are collapsed; empty results fall back to `untitled`.

For CJK titles, the filename is typically a pinyin‑based slug. If `pypinyin`
is not installed, it may fall back to `untitled`. This does not affect functionality:

- The real title is stored in the DB
- Search operates on DB fields and FTS, not filenames

We may refine this strategy in future (e.g. supporting CJK slugs or ID‑only filenames), but the
current format is stable and works well with existing tools.

---

## 8. 中文说明（简要）

`clawsqlite` 提供的知识库应用是一个基于 SQLite + FTS5 + sqlite-vec
的本地知识库 CLI，主要特性：

- 文章元数据存到 `articles` 表，全文索引用 FTS5，向量检索用 sqlite-vec
- 每篇文章同时会写成一个 markdown 文件，包含 `--- METADATA ---` 和 `--- MARKDOWN ---`
- 支持：
  - `ingest`：从 URL 或纯文本入库
  - `search`：FTS / 向量 / 混合模式
  - `show` / `export` / `update` / `delete` / `reindex` / `maintenance`

### 快速开始

1. 克隆仓库：

   ```bash
   git clone git@github.com:ernestyu/clawsqlite.git
   cd clawsqlite
   ```

2. 在运行环境中配置好 `CLAWSQLITE_ROOT` / `CLAWSQLITE_DB` 等路径以及
   EMBEDDING_* / SMALL_LLM_* 等变量，或直接在命令行通过
   `--root/--db/--articles-dir` 传入。这一步通常在 OpenClaw 的 agent
   配置中完成。

3. 第一次入库（文本）：

   ```bash
   clawsqlite knowledge ingest \
     --text "你好，clawsqlite" \
     --title "第一次笔记" \
     --category test \
     --tags demo \
     --gen-provider off
   ```

4. 搜索：

   ```bash
   clawsqlite knowledge search "clawsqlite" --mode fts
   ```

### URL 重抓 & 刷新

当你知道某篇文章更新了，并且之前已经用 URL 入过库，可以用：

```bash
clawsqlite knowledge ingest \
  --url "https://example.com/article" \
  --update-existing
```

行为：

- `source_url` 已存在时：更新同一条记录（title/summary/tags/category/priority），并重写 markdown + 索引
- `source_url` 不存在时：正常新增一条记录

`articles.source_url` 对非空、非 `Local` 的 URL 有唯一约束，保证一个 URL 对应最多一条记录。

---

如果你在使用过程中遇到问题或有改进建议，欢迎在仓库里开 issue / PR。

---

## License

MIT © Ernest Yu
