Metadata-Version: 2.4
Name: syl
Version: 0.1.1
Summary: A CLI tool for managing code-search semantic LLM tool projects and dependencies
Author-email: Andy Evans <syl@moog.sh>
License: MIT
Project-URL: Repository, https://gitlab.com/ohtz/syl
Classifier: License :: OSI Approved :: MIT License
Classifier: Programming Language :: Python :: 3.10
Classifier: Programming Language :: Python :: 3.11
Classifier: Programming Language :: Python :: 3.12
Requires-Python: >=3.10
Description-Content-Type: text/markdown
Requires-Dist: tabulate==0.9.0
Requires-Dist: requests<2.33.0
Requires-Dist: loguru<0.8.0
Requires-Dist: pydantic<2.12.0
Requires-Dist: pydantic-settings<2.12.0
Requires-Dist: pydantic_core<2.34.0
Requires-Dist: docker==7.1.0

## Syl
Syl provides semantic code search for local LLMs via an MCP proxy server. You can work with any codebase, or its large dependencies, without updating your LLM application's MCP configuration.  

It supports pgvector datastores; S3 Vector bucket and ChromaDB support is partial and still in the works-ish.  

**FAQ**  
  - *Why create this?*
    - I needed a faster way to search across random, large projects (mostly dependencies of my own projects).
  - *Why run a proxy container?*  
    - I don't want to update my Claude/OpenWebUI tool configurations every time I want to search a new codebase.  
  - *Why Docker?*
    - When I first had the idea, Docker seemed like a simple solution. Now I'm questioning my own sanity.  
  - *Isn't this overkill?*
    - Probably. But it was fun to build.
  - *Can I use a custom embedding model?*  
    - Yes. Pass `--embed_model $YOUR_MODEL`  
  - *Can I use a local embedding model?*  
    - Yes. Pass `--embed_model $YOUR_MODEL` (and `--hf-cache-dir $DIR` if your HF cache dir is not the default)
  - *Does this support OpenAI-format tool servers?*  
    - Yes. It handles both MCP over HTTP stream (for clients like Claude Code) and OpenAI-format HTTP requests (for clients like OpenWebUI, etc.).  
  - *Is this local-only?*  
    - That's what it's meant for; there's **no authentication**, **no TLS**, etc.  
  - *Does it support private repos?*
    - No. Clone the repo to your machine and `--mount-local-code-dir $DIR` it instead of providing git details
  - *What's with the name?*
    - It's a [Brando Sando](https://stormlightarchive.fandom.com/wiki/Sylphrena) reference.

## Bugs/Limitations/TODO
So, so many.  I quickly scraped this together, paying little attention to things like backend table structure, indexes, Pydantic models, etc. once I got it functioning.  
Expect bugs and inefficiencies.  

That being said, some known items of import:
  - Far too much data is returned from certain MCP tool calls
    - This should be reduced, with optional args to include additional fields in responses
  - Errors encountered while indexing aren't propagated back to the server, so it's useful to either tail the indexing container logs, or run it in the foreground via `--no-daemon`
    - TODO It wouldn't be terribly difficult to send those back to the server and show the node as a yellow `warning` or similar
  - `ack -li todo`... uh oh.
  - The CLI args are a little unidiomatic after pivoting so many times; they probably warrant a do-over
  - I've only added tree-sitter AST parsers for _Go_, _Python_, and _JavaScript_
    - Other files are still embedded for semantic search (if they have a [supported extension]), you just won't get function metadata
  - Error handling/logging/consistency is a bit of a mess (sorry future self)
  - The layout of the tools across the different HTTP servers is annoying; there's probably a more Pythonic way to handle this..
  - I haven't built a Python package in nearly a decade so I just threw `__init__.py` files everywhere and prayed to the snake gods
  - The complexity and maintainability metrics are generated by [radon](https://radon.readthedocs.io/en/latest/) for Python and by simplified manual calculations for Go/JS/TS, which may or may not be useful
  - The Docker base image is fairly large (~400MB), mostly due to the Torch/ML dependencies; I squashed it as best I could
  - Need to add `register` command to register remote datastores that already have the expected table structure/data
    - More useful if the Syl server crashes/is restarted so you don't have to index again
  - File watcher paths need to be normalized to match initial index paths
  - Directory exclusions would be useful..
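
To give a feel for what a "simplified manual calculation" of complexity (mentioned in the list above) can look like, here is a minimal sketch that counts branching keywords and boolean operators. It is an illustration only, not Syl's actual implementation: the keyword set and the `estimate_complexity` name are assumptions, and keywords inside comments or string literals are counted too.

```python
import re

# Hypothetical set of decision points shared by Go/JS/TS-style syntax.
DECISION_PATTERN = re.compile(r"\b(?:if|for|while|case|catch)\b|&&|\|\|")


def estimate_complexity(source: str) -> int:
    """Rough cyclomatic estimate: 1 + number of decision points found."""
    return 1 + len(DECISION_PATTERN.findall(source))


GO_SNIPPET = """
func classify(n int) string {
    if n < 0 || n > 100 {
        return "out of range"
    }
    for i := 0; i < n; i++ {
        if i%2 == 0 && i > 10 {
            return "even"
        }
    }
    return "ok"
}
"""

# Decision points: if, ||, for, if, && (five in total).
print(estimate_complexity(GO_SNIPPET))  # 6
```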

## Requirements
  - Docker
  - Python 3.10+ -- Only tested on 3.12

## Getting Started
```bash
pip install syl
syl create server  # Optional; server is auto-started on first datastore creation; skip with --no-server
syl create datastore pgvector my-project \
  ...args
```

Once the datastore has indexed and the container is `available` (via `syl status`), you can begin querying the data within the datastore.   

## Connecting LLMs
There are two ports available on the Syl server that LLMs can connect to:
  - `8000` -- The OpenAI-format HTTP API (served via FastAPI)
  - `8001` -- The MCP HTTP stream API (served via FastMCP)
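
A minimal sketch of how a client script might derive the two base URLs from a host name. The `SylEndpoints` class is hypothetical, `localhost` is an assumed default, and the `/mcp` path is taken from the Claude Code and Crush examples in this README.

```python
from dataclasses import dataclass

OPENAI_TOOLS_PORT = 8000  # OpenAI-format HTTP API (FastAPI)
MCP_PORT = 8001           # MCP HTTP stream API (FastMCP)


@dataclass(frozen=True)
class SylEndpoints:
    """Hypothetical helper that builds the URLs an LLM client connects to."""

    host: str = "localhost"  # assumption: the server is reachable on this host

    @property
    def openai_tools_url(self) -> str:
        return f"http://{self.host}:{OPENAI_TOOLS_PORT}"

    @property
    def mcp_url(self) -> str:
        return f"http://{self.host}:{MCP_PORT}/mcp"


endpoints = SylEndpoints()
print(endpoints.openai_tools_url)
print(endpoints.mcp_url)
```

From inside another Docker container, `SylEndpoints("host.docker.internal")` would give the URL used in the OpenWebUI example.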

### Connect (Docker) OpenWebUI to the OpenAI-format HTTP API (8000)
1. Navigate to `Settings` -> `Tools`
2. Add tool URL: `http://host.docker.internal:8000`

I haven't tested with the non-docker version of OpenWebUI, but it _should_ function the same with `http://localhost:8000`.  

### Connect Claude Code to MCP (8001)
```bash
claude mcp add --transport http code-search http://localhost:8001/mcp
```

### Connect Crush to MCP (8001)
```json
{
  "$schema": "https://charm.land/crush.json",
  "providers": {
    ...providers
  },
  "mcp": {
    "code_search": {
      "type": "http",
      "url": "http://localhost:8001/mcp"
    }
  }
}
```

## Syl Commands
`syl --help` is your friend.  
```bash
$ syl --help
usage: syl [-h] {status,s,logs,query,start,stop,restart,remove,rm,create} ...

Manage Syl containers and datastores

positional arguments:
  {status,s,logs,query,start,stop,restart,remove,rm,create}
                        Available actions
    status (s)          Show status of Syl resources
    logs                Show/tail logs for a Syl container
    query               Perform a query against a datastore
    start               Start a stopped Syl container
    stop                Stop a running Syl container
    restart             Restart a running Syl container
    remove (rm)         Remove a Syl container
    create              Create the Syl server, a new datastore, or file watcher

options:
  -h, --help            show this help message and exit
```

## Datastores
The datastore is what stores the vectors that are generated from the code that's ingested, typically a `pgvector` database.  
In addition to the code embeddings themselves, we also store metadata around the code files that are indexed.

Additionally _additionally_, we store the complete file source which LLMs can request.  
In my testing this has been very useful in general, but it's also needed for code classes, statements, etc. that exist outside of functions (`for`, `if` in JS/Python).  
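
To illustrate why function-level metadata alone misses some code, the sketch below separates a module's top-level functions from everything else. It uses Python's built-in `ast` module rather than the tree-sitter parsers Syl uses, and `split_top_level` is a hypothetical name; treat it as a demonstration of the idea, not the project's code.

```python
import ast

SOURCE = '''\
import os

DEBUG = os.environ.get("DEBUG") == "1"

def greet(name):
    return f"hello {name}"

class Greeter:
    def run(self):
        return greet("world")

if DEBUG:
    print(greet("debug"))

for n in ("a", "b"):
    print(greet(n))
'''


def split_top_level(source: str) -> tuple[list[str], list[str]]:
    """Return (function names, node types of all other top-level statements)."""
    functions: list[str] = []
    other: list[str] = []
    for node in ast.parse(source).body:
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef)):
            functions.append(node.name)
        else:
            other.append(type(node).__name__)
    return functions, other


functions, other = split_top_level(SOURCE)
print(functions)  # ['greet']
print(other)      # ['Import', 'Assign', 'ClassDef', 'If', 'For']
```

Everything in the second list is invisible to a function-only index, which is where the stored file source comes in.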

Most of this data is acquired via AST parsers for each supported language.  

### pgvector Databases
pgvector databases are the primary way of using Syl at this time. All features are supported when using a pgvector datastore.  
You can choose to run a local pgvector instance via Docker, or connect to a remote pgvector database.  


<details>
<summary>pgvector Args</summary>

```bash
$ syl create datastore pgvector --help

usage: syl create datastore pgvector [-h] [--log-level LOG_LEVEL] [--verbose]
                                   [--description DESCRIPTION]
                                   [--embed_model EMBED_MODEL] [--no-daemon]
                                   [--vector-size VECTOR_SIZE]
                                   [--git-repo-url GIT_REPO_URL]
                                   [--git-branch GIT_BRANCH]
                                   [--mount-local-code-dir MOUNT_LOCAL_CODE_DIR]
                                   [--no-server] [--num-workers NUM_WORKERS]
                                   [--skip-embed] [--no-register]
                                   [--watch-dir [WATCH_DIR ...]]
                                   [--exclude-ext [EXCLUDE_EXT ...]] [--local]
                                   [--pg-func-table PG_FUNC_TABLE]
                                   [--pg-file-table PG_FILE_TABLE]
                                   [--pg-host PG_HOST] [--pg-port PG_PORT]
                                   [--pg-database PG_DATABASE]
                                   [--pg-user PG_USER]
                                   [--pg-password PG_PASSWORD]
                                   name

positional arguments:
  name                  datastore name

options:
  -h, --help            show this help message and exit
  --log-level LOG_LEVEL
                        Log level
  --verbose, -v         Verbose logging
  --description DESCRIPTION
                        Datastore description; useful for LLMs when listing
                        available datastores
  --embed_model EMBED_MODEL, -e EMBED_MODEL
                        Embedding model
  --no-daemon           Don't run in daemon mode
  --vector-size VECTOR_SIZE
                        Vector size; this MUST match your index/table for
                        remote data stores
  --git-repo-url GIT_REPO_URL
                        Git repository URL to clone if this is a remote
                        project
  --git-branch GIT_BRANCH
                        Git repository branch to checkout after cloning
  --mount-local-code-dir MOUNT_LOCAL_CODE_DIR
                        Container mount path for the code directory to index
                        if this is a local project
  --no-server           Don't start the server if it's not running
  --num-workers NUM_WORKERS
                        Number of workers to use for embeddings
  --skip-embed          Skip embedding/indexing if the datastore already
                        contains data
  --no-register         Do not register this datastore with the Syl server
  --watch-dir [WATCH_DIR ...]
                        After indexing a datastore, create a directory file
                        watcher to automatically update embeddings for
                        created/modified files
  --exclude-ext [EXCLUDE_EXT ...]
                        Exclude certain file extensions
  --local               Create a local pgvector container to store the
                        embeddings
  --pg-func-table PG_FUNC_TABLE
                        Custom function embeddings table name
  --pg-file-table PG_FILE_TABLE
                        Custom file embeddings table name
  --pg-host PG_HOST     Remote postgres host
  --pg-port PG_PORT     Remote postgres port
  --pg-database PG_DATABASE
                        Remote postgres database name
  --pg-user PG_USER     Remote postgres user
  --pg-password PG_PASSWORD
                        Remote postgres password
```
</details>


### S3 Vector Buckets (deprecated before this even started?)
S3 Vector buckets being released by AWS is what spawned the idea for this project.  
Ironically, I quickly ended up moving to pgvector due to limitations with the S3 Vector bucket metadata, as well as the inability to store source code directly alongside the embeddings.

S3 Vector buckets are limited to 10 filterable metadata items with a combined size < 2KB, so some metadata items are not included.  
Additionally, if the supported metadata items are larger than 2KB, we start deleting them until we're under this size limit.  
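
A minimal sketch of that kind of trimming, under stated assumptions: size is measured as compact UTF-8 JSON (how AWS actually measures it may differ), the drop order is chosen by the caller, and the field names are made up for the example. This is not Syl's actual code.

```python
import json

MAX_FILTERABLE_KEYS = 10   # limit described above
MAX_METADATA_BYTES = 2048  # "combined size < 2KB"


def metadata_size(metadata: dict) -> int:
    """Size of the metadata serialized as compact UTF-8 JSON (an assumption)."""
    return len(json.dumps(metadata, separators=(",", ":")).encode("utf-8"))


def trim_metadata(metadata: dict, drop_order: list[str]) -> dict:
    """Drop keys until both limits hold.

    Keys listed in `drop_order` go first (least important first); after that,
    the largest remaining key is dropped on each pass.
    """
    trimmed = dict(metadata)
    candidates = [key for key in drop_order if key in trimmed]
    while trimmed and (
        len(trimmed) > MAX_FILTERABLE_KEYS
        or metadata_size(trimmed) >= MAX_METADATA_BYTES
    ):
        if candidates:
            key = candidates.pop(0)
        else:
            key = max(trimmed, key=lambda k: metadata_size({k: trimmed[k]}))
        del trimmed[key]
    return trimmed


# Hypothetical record: the oversized docstring pushes it past the 2KB limit.
record = {
    "file_path": "pkg/parser.py",
    "function_name": "parse",
    "docstring": "x" * 3000,
    "complexity": 4,
}
trimmed = trim_metadata(record, drop_order=["docstring"])
print(sorted(trimmed))  # ['complexity', 'file_path', 'function_name']
```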

Due to the above limitations, S3 Vector buckets are **not** fully supported.  
You will **not** get full MCP tool support when using an S3 Vector bucket.  
Full source code files will **not** be available when using an S3 Vector bucket.  

In fact, you probably shouldn't use this at all.  There's a good chance it will be removed.  

<details>
<summary>S3 Vector Args</summary>

```bash
$ syl create datastore s3vector --help

usage: syl create datastore s3vector [-h] [--log-level LOG_LEVEL] [--verbose]
                                   [--description DESCRIPTION]
                                   [--embed_model EMBED_MODEL] [--no-daemon]
                                   [--vector-size VECTOR_SIZE]
                                   [--git-repo-url GIT_REPO_URL]
                                   [--git-branch GIT_BRANCH]
                                   [--mount-local-code-dir MOUNT_LOCAL_CODE_DIR]
                                   [--no-server] [--num-workers NUM_WORKERS]
                                   [--skip-embed] [--no-register]
                                   [--watch-dir [WATCH_DIR ...]]
                                   [--exclude-ext [EXCLUDE_EXT ...]]
                                   [--s3-bucket S3_BUCKET]
                                   [--s3-region S3_REGION]
                                   [--s3-index S3_INDEX]
                                   [--aws-access-key AWS_ACCESS_KEY]
                                   [--aws-secret-key AWS_SECRET_KEY]
                                   [--aws-profile AWS_PROFILE]
                                   [--aws-mount-config-dir AWS_MOUNT_CONFIG_DIR]
                                   name

positional arguments:
  name                  datastore name

options:
  -h, --help            show this help message and exit
  --log-level LOG_LEVEL
                        Log level
  --verbose, -v         Verbose logging
  --description DESCRIPTION
                        Datastore description; useful for LLMs when listing
                        available datastores
  --embed_model EMBED_MODEL, -e EMBED_MODEL
                        Embedding model
  --no-daemon           Don't run in daemon mode
  --vector-size VECTOR_SIZE
                        Vector size; this MUST match your index/table for
                        remote data stores
  --git-repo-url GIT_REPO_URL
                        Git repository URL to clone if this is a remote
                        project
  --git-branch GIT_BRANCH
                        Git repository branch to checkout after cloning
  --mount-local-code-dir MOUNT_LOCAL_CODE_DIR
                        Container mount path for the code directory to index
                        if this is a local project
  --no-server           Don't start the server if it's not running
  --num-workers NUM_WORKERS
                        Number of workers to use for embeddings
  --skip-embed          Skip embedding/indexing if the datastore already
                        contains data
  --no-register         Do not register this datastore with the Syl server
  --watch-dir [WATCH_DIR ...]
                        After indexing a datastore, create a directory file
                        watcher to automatically update embeddings for
                        created/modified files
  --exclude-ext [EXCLUDE_EXT ...]
                        Exclude certain file extensions
  --s3-bucket S3_BUCKET
                        S3 bucket name
  --s3-region S3_REGION
                        S3 region
  --s3-index S3_INDEX   S3 Vector bucket index
  --aws-access-key AWS_ACCESS_KEY
                        AWS access key
  --aws-secret-key AWS_SECRET_KEY
                        AWS secret key
  --aws-profile AWS_PROFILE
                        AWS profile name to use if config dir is mounted
  --aws-mount-config-dir AWS_MOUNT_CONFIG_DIR
                        Mount aws config directory to the container (read-
                        only)
```
</details>


### ChromaDB Databases
I began adding support for ChromaDB, but haven't finished/tested it much so it likely doesn't work yet.  


<details>
<summary>ChromaDB Args</summary>

```bash
$ syl create datastore chromadb --help

usage: syl create datastore chromadb [-h] [--log-level LOG_LEVEL] [--verbose]
                                   [--description DESCRIPTION]
                                   [--embed_model EMBED_MODEL] [--no-daemon]
                                   [--vector-size VECTOR_SIZE]
                                   [--git-repo-url GIT_REPO_URL]
                                   [--git-branch GIT_BRANCH]
                                   [--mount-local-code-dir MOUNT_LOCAL_CODE_DIR]
                                   [--no-server] [--num-workers NUM_WORKERS]
                                   [--skip-embed] [--no-register]
                                   [--watch-dir [WATCH_DIR ...]]
                                   [--exclude-ext [EXCLUDE_EXT ...]] [--local]
                                   [--persistent-chroma PERSISTENT_CHROMA]
                                   [--persistent-chroma-dir PERSISTENT_CHROMA_DIR]
                                   name

positional arguments:
  name                  datastore name

options:
  -h, --help            show this help message and exit
  --log-level LOG_LEVEL
                        Log level
  --verbose, -v         Verbose logging
  --description DESCRIPTION
                        Datastore description; useful for LLMs when listing
                        available datastores
  --embed_model EMBED_MODEL, -e EMBED_MODEL
                        Embedding model
  --no-daemon           Don't run in daemon mode
  --vector-size VECTOR_SIZE
                        Vector size; this MUST match your index/table for
                        remote data stores
  --git-repo-url GIT_REPO_URL
                        Git repository URL to clone if this is a remote
                        project
  --git-branch GIT_BRANCH
                        Git repository branch to checkout after cloning
  --mount-local-code-dir MOUNT_LOCAL_CODE_DIR
                        Container mount path for the code directory to index
                        if this is a local project
  --no-server           Don't start the server if it's not running
  --num-workers NUM_WORKERS
                        Number of workers to use for embeddings
  --skip-embed          Skip embedding/indexing if the datastore already
                        contains data
  --no-register         Do not register this datastore with the Syl server
  --watch-dir [WATCH_DIR ...]
                        After indexing a datastore, create a directory file
                        watcher to automatically update embeddings for
                        created/modified files
  --exclude-ext [EXCLUDE_EXT ...]
                        Exclude certain file extensions
  --local               Create a local ChromaDB container to store the
                        embeddings
  --persistent-chroma PERSISTENT_CHROMA
                        Run ChromaDB in persistent mode (default is in-memory)
  --persistent-chroma-dir PERSISTENT_CHROMA_DIR
                        ChromaDB persistent data directory; only applicable
                        with --persistent-chroma
```
</details>

## Components
Syl is made up of several components:
  - `syl` -- The main CLI script used for navigating the application stack
  - `syl-network` -- Docker network
  - `syl-server` -- FastAPI Docker container that handles LLM tool calls, datastore registrations, and proxies data requests to the appropriate datastore
  - `syl-index-$NAME` -- Temporary containers created to index codebases and load the embeddings into datastores
  - `syl-$DATASTORE-$NAME` -- A local datastore container, if you've chosen to run one
  - `syl-watcher-$NAME-$ID` -- Directory filewatcher to trigger reindexing of new/modified files


### syl
This CLI script is used to bootstrap and manage Syl components.  

### syl-network
A Docker network is created named `syl-network` that all containers run under.  

### syl-server
The `syl-server` container runs an OpenAI-compatible FastAPI server as a tool server.  
It exposes every supported tool with an additional parameter for `datastore_name`. This datastore name is used to route the request to the appropriate datastore.  
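
The routing idea can be sketched as a small registry keyed by datastore name. The `DatastoreRouter` class, its methods, and the `query` argument are hypothetical; only the `search_code_semantic` tool name and the `datastore_name` parameter come from this README.

```python
from typing import Any, Callable

ToolHandler = Callable[..., Any]


class DatastoreRouter:
    """Maps registered datastore names to backend-specific tool handlers."""

    def __init__(self) -> None:
        self._backends: dict[str, dict[str, ToolHandler]] = {}

    def register(self, datastore_name: str, tools: dict[str, ToolHandler]) -> None:
        self._backends[datastore_name] = tools

    def list_datastores(self) -> list[str]:
        return sorted(self._backends)

    def call(self, tool: str, datastore_name: str, **kwargs: Any) -> Any:
        """Forward a tool call to the handler registered for the datastore."""
        try:
            handler = self._backends[datastore_name][tool]
        except KeyError:
            raise LookupError(
                f"no tool {tool!r} for datastore {datastore_name!r}"
            ) from None
        return handler(**kwargs)


router = DatastoreRouter()
# Stand-in handlers; a real backend would query pgvector, S3, etc.
router.register("my-project", {"search_code_semantic": lambda query: [f"my-project hit for {query!r}"]})
router.register("vllm", {"search_code_semantic": lambda query: [f"vllm hit for {query!r}"]})

print(router.call("search_code_semantic", "vllm", query="attention kernel"))
```

Because the client only ever talks to the router, adding a datastore is a `register` call on the server side rather than a change to the LLM application's tool configuration.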

### syl-index-$NAME
The indexing process involves:
  - Pulling down the requested embedding model
  - Cloning/mounting the repository containing the code to index
  - Generating embeddings for the codebase
  - Offloading the embeddings to the selected datastore
  - Registering itself with the `syl-server`

All in all, the bootstrap process for a 150k line project using the default model takes ~5 minutes on my M3 Max.  
Results will vary.  

### syl-watcher-$NAME-$ID
Optional file watchers to trigger reindexing of new/modified files in a local directory.  

### Diagram
<details>
<summary>Click to expand</summary>

![SYL Diagram](./README_SRC/syl_diagram.png)

</details>

## Connecting to a Local pgvector Datastore
You can connect to the local Postgres pgvector datastore with the username `postgres` and the _datastore name_ as the password.  
Very secure.
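
A small sketch that builds a connection URI from those credentials. The host would be the container's IP address from `syl status`; the port (`5432`, the Postgres default) and the database name (`postgres`) are assumptions that may not match your setup, and `local_pgvector_dsn` is a hypothetical helper.

```python
from urllib.parse import quote


def local_pgvector_dsn(
    datastore_name: str,
    host: str,
    port: int = 5432,            # assumption: default Postgres port
    database: str = "postgres",  # assumption: default database name
) -> str:
    """Build a postgresql:// URI with user `postgres` and the datastore name as password."""
    password = quote(datastore_name, safe="")  # escape characters that are unsafe in a URI
    return f"postgresql://postgres:{password}@{host}:{port}/{database}"


print(local_pgvector_dsn("megaparse", "172.20.0.3"))
```

The resulting string can be passed to `psql` or any Postgres client library that accepts connection URIs.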

## Querying Data
If you enjoy looking at large amounts of JSON, you can query datastores manually using the `syl query` command.  
This is more-or-less what an LLM has access to via the MCP tool servers.  

```bash
$ syl query --help

usage: syl query [-h] [--docker-network-cidr DOCKER_NETWORK_CIDR]
                 [--log-level LOG_LEVEL] [--verbose] --query QUERY
                 [--distance-metric {cosine,euclidean}]
                 [--max-results MAX_RESULTS] [--file-filter FILE_FILTER]
                 [--function-filter FUNCTION_FILTER]
                 [--min-complexity MIN_COMPLEXITY] [--is-async] [--is-method]
                 datastore

positional arguments:
  datastore             Datastore name

options:
  -h, --help            show this help message and exit
  --docker-network-cidr DOCKER_NETWORK_CIDR
                        Use a custom Docker network CIDR when starting the Syl
                        server
  --log-level LOG_LEVEL
                        Log level
  --verbose, -v         Verbose logging
  --query QUERY, -q QUERY
                        Query keywords
  --distance-metric, -d {cosine,euclidean}
                        Search method; not applicable to S3 (it uses the
                        metric created on the bucket index)
  --max-results MAX_RESULTS, -m MAX_RESULTS
                        Maximum number of results to return
  --file-filter FILE_FILTER
                        Filter results by file name
  --function-filter FUNCTION_FILTER
                        Filter results by function name
  --min-complexity MIN_COMPLEXITY, -c MIN_COMPLEXITY
                        Minimum complexity to return
  --is-async, -a        Filter results by async functions
  --is-method           Filter results by methods

```

```bash
syl query my-project --query "parse string using regex" | jq
```

## Testing Your Tools
You can test the OpenAI-format HTTP tool endpoint using your favorite HTTP tool (cURL, requests, etc.).  
```bash
$ curl -s http://172.20.0.2:8000/openapi.json | jq
{
  "openapi": "3.1.0",
  "info": {
    "title": "OpenAI-Compatible Tools Server",
    "version": "0.1.0"
  },
  "paths": {
    "/tools/list_datastores": {
      "post": {
        "summary": "List Datastores Openai",
        "operationId": "list_datastores_openai_tools_list_datastores_post",
        "responses": {
          "200": {
            "description": "Successful Response",
            "content": {
              "application/json": {
                "schema": {
                  "$ref": "#/components/schemas/ToolResponse"
                }
              }
            }
          }
        }
      }
    },
    "/tools/search_code_semantic": {
      ...


$ curl -s -X POST http://172.20.0.2:8000/tools/list_datastores | jq
{
  "result": {
    "projects": [
      {
        "name": "my-cool-project",
        "description": "Testing things and stuff"
      },
      {
        "name": "another",
        "description": "Another really rad project that does even _cooler_ things and stuff"
      },
      ...
    ]
  },
  "error": null
}
```

For the MCP HTTP stream endpoint, I prefer using [MCP Inspector](https://github.com/modelcontextprotocol/inspector).  

## Examples
#### Status of Syl resources
You can retrieve the status of existing Syl resources via `syl status`:
```bash
$ syl status
┌────────────────────────┬────────────┬──────────────┬────────────┬────────────┬────────────┬────────────────────────┬───────────────┐
│     CONTAINER NAME     │  PROJECT   │ CONTAINER ID │ REGISTERED │   STATUS   │ CONTAINER  │         IMAGE          │   IP ADDRESS  │
├────────────────────────┼────────────┼──────────────┼────────────┼────────────┼────────────┼────────────────────────┼───────────────┤
│syl-network             │syl         │N/A           │N/A         │✓ available │✓ N/A       │docker-network          │172.20.0.0/16  │
│syl-server              │syl         │332c5a8fde3a  │N/A         │✓ available │✓ running   │syl-server:latest       │172.20.0.2     │
│syl-pgvector-megaparse  │megaparse   │41b465c78a78  │True        │✓ available │✓ running   │pgvector/pgvector:pg17  │172.20.0.3     │
│syl-index-vllm          │vllm        │dc254d4f1d76  │False       │~ indexing  │✓ running   │syl-index:latest        │172.20.0.5     │
│syl-pgvector-vllm       │vllm        │6a72873ff3b8  │False       │~ indexing  │✓ running   │pgvector/pgvector:pg17  │172.20.0.4     │
└────────────────────────┴────────────┴──────────────┴────────────┴────────────┴────────────┴────────────────────────┴───────────────┘
```

Outdated images will show as yellow. Use `syl pull` to pull the latest images which will be used for the next created project.  


#### A local repo with a Syl Postgres pgvector container
This example mounts a local directory containing code files, generates embeddings using the default model, and offloads to a local Postgres pgvector database running in the Syl Docker network.  
The project is then made available to the proxy service under the name `my-local-project`.  
```bash
syl create datastore pgvector my-local-project \
  --local \
  --mount-local-code-dir "/Users/ohtz/my_super_cool_project"
```

#### A remote repo with S3 Vector bucket datastore
This example clones the [vLLM](https://github.com/vllm-project/vllm) GitHub repository, generates embeddings using the default model, and offloads them to an S3 Vector bucket using my AWS credentials from `~/.aws/`.  
The project is then made available to the proxy service under the name `vllm`.  
```bash
syl create datastore s3vector vllm \
  --git-repo-url "https://github.com/vllm-project/vllm.git" \
  --git-branch "0.9.2" \
  --s3-bucket "ohtz-test-bucket-123" \
  --s3-region "us-east-1" \
  --s3-index "vllm-index" \
  --aws-mount-config-dir ~/.aws
```
