Metadata-Version: 2.1
Name: llm-explorer
Version: 0.0.1
Summary: A Lakehouse LLM Explorer. Wrapper for spark, databricks and langchain processes
Home-page: https://github.com/Occlusion-Solutions/occlussion_llm_explorer.git
Author: Carlos D. Escobar-Valbuena
Author-email: carlosdavidescobar@gmail.com
Classifier: Programming Language :: Python :: 3
Classifier: License :: OSI Approved :: MIT License
Classifier: Operating System :: OS Independent
Requires-Python: >=3.6
Description-Content-Type: text/markdown
License-File: LICENSE
Requires-Dist: streamlit
Requires-Dist: streamlit-chat
Requires-Dist: chromadb
Requires-Dist: pyautogui
Requires-Dist: databricks-sql-connector
Requires-Dist: watchdog
Requires-Dist: psycopg2-binary
Requires-Dist: py4j
Requires-Dist: pyspark
Requires-Dist: build
Requires-Dist: setuptools
Requires-Dist: twine
Requires-Dist: paramiko
Requires-Dist: pydeequ
Requires-Dist: databricks-feature-store
Requires-Dist: Flask
Requires-Dist: Werkzeug
Requires-Dist: pytest
Requires-Dist: responses
Requires-Dist: azure-storage-blob
Requires-Dist: jupyter
Requires-Dist: aiohttp
Requires-Dist: aiosignal
Requires-Dist: altair
Requires-Dist: anyio
Requires-Dist: argilla
Requires-Dist: asn1crypto
Requires-Dist: astroid
Requires-Dist: async-timeout
Requires-Dist: attrs
Requires-Dist: autopep8
Requires-Dist: backoff
Requires-Dist: blinker
Requires-Dist: cachetools
Requires-Dist: cffi
Requires-Dist: charset-normalizer
Requires-Dist: click
Requires-Dist: commonmark
Requires-Dist: cryptography
Requires-Dist: cycler
Requires-Dist: dataclasses-json
Requires-Dist: decorator
Requires-Dist: Deprecated
Requires-Dist: docopt
Requires-Dist: docstring-to-markdown
Requires-Dist: entrypoints
Requires-Dist: et-xmlfile
Requires-Dist: faiss-cpu
Requires-Dist: filelock
Requires-Dist: flake8
Requires-Dist: fonttools
Requires-Dist: frozenlist
Requires-Dist: gitdb
Requires-Dist: GitPython
Requires-Dist: h11
Requires-Dist: httpcore
Requires-Dist: httpx
Requires-Dist: idna
Requires-Dist: importlib-metadata
Requires-Dist: importlib-resources
Requires-Dist: isort
Requires-Dist: Jinja2
Requires-Dist: joblib
Requires-Dist: jsonschema
Requires-Dist: kiwisolver
Requires-Dist: langchain
Requires-Dist: lxml
Requires-Dist: Markdown
Requires-Dist: markdown-it-py
Requires-Dist: MarkupSafe
Requires-Dist: marshmallow
Requires-Dist: marshmallow-enum
Requires-Dist: matplotlib
Requires-Dist: mccabe
Requires-Dist: mdurl
Requires-Dist: monotonic
Requires-Dist: msg-parser
Requires-Dist: multidict
Requires-Dist: mypy-extensions
Requires-Dist: nltk
Requires-Dist: numpy
Requires-Dist: olefile
Requires-Dist: openai
Requires-Dist: openapi-schema-pydantic
Requires-Dist: openpyxl
Requires-Dist: oscrypto
Requires-Dist: pandas
Requires-Dist: Pillow
Requires-Dist: protobuf
Requires-Dist: pyarrow
Requires-Dist: pycodestyle
Requires-Dist: pycparser
Requires-Dist: pycryptodomex
Requires-Dist: pydantic
Requires-Dist: pydeck
Requires-Dist: pydocstyle
Requires-Dist: pyflakes
Requires-Dist: Pygments
Requires-Dist: PyJWT
Requires-Dist: pylint
Requires-Dist: Pympler
Requires-Dist: pyOpenSSL
Requires-Dist: pypandoc
Requires-Dist: pyparsing
Requires-Dist: pyrsistent
Requires-Dist: python-dateutil
Requires-Dist: python-docx
Requires-Dist: python-dotenv
Requires-Dist: python-lsp-jsonrpc
Requires-Dist: python-lsp-server
Requires-Dist: python-magic
Requires-Dist: python-pptx
Requires-Dist: pytoolconfig
Requires-Dist: pytz
Requires-Dist: pytz-deprecation-shim
Requires-Dist: PyYAML
Requires-Dist: regex
Requires-Dist: requests
Requires-Dist: rfc3986
Requires-Dist: rich
Requires-Dist: rope
Requires-Dist: scikit-learn
Requires-Dist: scipy
Requires-Dist: six
Requires-Dist: smmap
Requires-Dist: sniffio
Requires-Dist: snowballstemmer
Requires-Dist: SQLAlchemy
Requires-Dist: sqlparse
Requires-Dist: tenacity
Requires-Dist: threadpoolctl
Requires-Dist: tiktoken
Requires-Dist: toml
Requires-Dist: tomlkit
Requires-Dist: toolz
Requires-Dist: tornado
Requires-Dist: tqdm
Requires-Dist: typing-inspect
Requires-Dist: typing-extensions
Requires-Dist: tzdata
Requires-Dist: tzlocal
Requires-Dist: ujson
Requires-Dist: unstructured
Requires-Dist: urllib3
Requires-Dist: validators
Requires-Dist: whatthepatch
Requires-Dist: wrapt
Requires-Dist: XlsxWriter
Requires-Dist: yapf
Requires-Dist: yarg
Requires-Dist: yarl
Requires-Dist: zipp
Requires-Dist: packaging
Requires-Dist: diffusers
Requires-Dist: accelerate
Requires-Dist: datasets
Requires-Dist: torch
Requires-Dist: soundfile
Requires-Dist: sentencepiece
Requires-Dist: opencv-python
Requires-Dist: sqlalchemy-databricks
Requires-Dist: transformers
Requires-Dist: huggingface-hub
Requires-Dist: streamlit-authenticator
Requires-Dist: tabulate

# Occlusion LLM Explorer

**Lakehouse Analytics & Advanced ML**
![llm_explorer_sample.png](/docs/.attachments/llm_explorer_sample.png)

## Setup

Create and activate a virtual environment:

```shell
conda create -n occlusion python=3.10
conda activate occlusion
```

Install the requirements:

```shell
pip install -r requirements.txt
```

Run the `main.py` script with Streamlit:

```shell
python -m streamlit run main.py
```

## Usage

Log in with the user `demo@occlusion.solutions` and the password `DEMO@occlusion`.

The deployment requires a `secrets.toml` file under `.streamlit/`:

```shell
# Create the directory first; touch fails if .streamlit/ does not exist
mkdir -p .streamlit
touch .streamlit/secrets.toml
```

It should have a schema like this:

```toml
[connections.openai]
api_key="sk-..." # OpenAI API Key

[connections.huggingface]
api_key="hf_..." # HuggingFace API Key

[connections.databricks]
server_hostname="your databricks host"
http_path="http path under cluster JDBC/ODBC connectivity"
access_token="your databricks access token"
```
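
Before connecting, it can help to check that the secrets are complete. The following is a minimal sketch, not code from this project: the names `REQUIRED_SECRETS`, `missing_secrets` and `databricks_connect_kwargs` are hypothetical. It validates any mapping shaped like the schema above (for example `st.secrets`) and collects the Databricks settings as keyword arguments.

```python
from typing import Any, Dict, List, Mapping

# Sections and keys taken from the secrets.toml schema above.
REQUIRED_SECRETS: Dict[str, List[str]] = {
    "openai": ["api_key"],
    "huggingface": ["api_key"],
    "databricks": ["server_hostname", "http_path", "access_token"],
}


def missing_secrets(secrets: Mapping[str, Any]) -> List[str]:
    """Return dotted names of required entries that are absent or empty."""
    connections = secrets.get("connections", {})
    missing = []
    for section, keys in REQUIRED_SECRETS.items():
        values = connections.get(section, {})
        for key in keys:
            if not values.get(key):
                missing.append(f"connections.{section}.{key}")
    return missing


def databricks_connect_kwargs(secrets: Mapping[str, Any]) -> Dict[str, str]:
    """Collect the Databricks settings, failing early if any are missing."""
    prefix = "connections.databricks."
    missing = [name for name in missing_secrets(secrets) if name.startswith(prefix)]
    if missing:
        raise KeyError(f"Missing Databricks secrets: {', '.join(missing)}")
    section = secrets["connections"]["databricks"]
    return {key: section[key] for key in REQUIRED_SECRETS["databricks"]}
```

In the Streamlit app this could be called as `missing_secrets(st.secrets)`. The returned keyword arguments match the `server_hostname`, `http_path` and `access_token` parameters of `databricks.sql.connect` from `databricks-sql-connector`.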

## Lakehouse Agent Sample

The agent is asked for the top 10 producing wells. It identifies the tables it has access to and determines that the request can be satisfied by the `padalloc` table. It then builds a query that selects the top 10 producing wells and returns the results.

```text

> Entering new AgentExecutor chain...

Observation: microchip_logs, padalloc
Thought: I should look at the schema of the microchip_logs and padalloc tables to see what columns I can use.

Action: schema_sql_db
Action Input: "microchip_logs, padalloc"
Observation: 
CREATE TABLE `microchip_logs` (
        `file_path` STRING, 
        `content` STRING
)


CREATE TABLE `padalloc` (
        `ZONE_CODE` STRING, 
        `ZONE_NAME` STRING, 
        `ZONE_HID` DECIMAL, 
        `WELL_HID` DECIMAL, 
        `WELL_CODE` STRING, 
        `PROD_DATE` TIMESTAMP, 
        `PROD_GAS_VOLUME_MCF` DECIMAL, 
        `PROD_OIL_VOLUME_BBL` DECIMAL, 
        `PROD_WATER_VOLUME_BBL` DECIMAL, 
        `ALLOCATED_FLAG` STRING, 
        `SALE_GAS_VOLUME_MCF` DECIMAL, 
        `SALE_OIL_VOLUME_BBL` DECIMAL, 
        `LGL_VOLUME_MCF` DECIMAL, 
        `OTHER_USES_GAS_MCF` DECIMAL
)

Thought: I should query the padalloc table to get the top 10 producing wells.

Action: query_sql_db
Action Input: "SELECT WELL_CODE, SUM(PROD_GAS_VOLUME_MCF) AS total_gas_volume_mcf FROM padalloc GROUP BY WELL_CODE ORDER BY total_gas_volume_mcf DESC LIMIT 10"
Observation: [('1222344             ', Decimal('8429191.6172')), ('1212560             ', Decimal('8211108.4867')), ('1222345             ', Decimal('8163411.9976')), ('1212503             ', Decimal('6621501.8683')), ('1222335             ', Decimal('4773668.6216')), ('1222340             ', Decimal('4276560.8228')), ('1222338             ', Decimal('4153258.1434')), ('1222367             ', Decimal('4018012.2406')), ('1220189             ', Decimal('3965394.4453')), ('1222352             ', Decimal('3786076.4127'))]
Thought: I now know the top 10 producing wells.

Final Answer: The top 10 producing wells are 1222344, 1212560, 1222345, 1212503, 1222335, 1222340, 1222338, 1222367, 1220189, and 1222352.

> Finished chain.
```
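
The query the agent generates can be tried outside the Lakehouse. The sketch below is an illustration only: it uses an in-memory SQLite database in place of Databricks, a cut-down `padalloc` table, and made-up rows, and the helper name `top_producing_wells` is hypothetical. It runs the same aggregate query and strips the padding that surrounds `WELL_CODE` in the observation above.

```python
import sqlite3
from typing import List, Tuple

# Same aggregate query the agent produced, with the limit as a parameter.
TOP_WELLS_SQL = """
SELECT WELL_CODE, SUM(PROD_GAS_VOLUME_MCF) AS total_gas_volume_mcf
FROM padalloc
GROUP BY WELL_CODE
ORDER BY total_gas_volume_mcf DESC
LIMIT ?
"""


def top_producing_wells(conn: sqlite3.Connection, limit: int = 10) -> List[Tuple[str, float]]:
    """Return (well_code, total_gas_volume_mcf) pairs, highest producer first."""
    rows = conn.execute(TOP_WELLS_SQL, (limit,)).fetchall()
    # WELL_CODE is space-padded in the sample output, so trim it here.
    return [(code.strip(), total) for code, total in rows]


# Stand-in database with only the columns the query needs (assumed test data).
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE padalloc (WELL_CODE TEXT, PROD_DATE TEXT, PROD_GAS_VOLUME_MCF REAL)"
)
conn.executemany(
    "INSERT INTO padalloc VALUES (?, ?, ?)",
    [
        ("W-001   ", "2023-01-01", 100.0),
        ("W-001   ", "2023-01-02", 150.0),
        ("W-002   ", "2023-01-01", 300.0),
        ("W-003   ", "2023-01-01", 50.0),
    ],
)

print(top_producing_wells(conn, limit=2))  # [('W-002', 300.0), ('W-001', 250.0)]
```

Note that the agent's query ranks wells by `PROD_GAS_VOLUME_MCF` only; oil and water volumes do not affect the ordering.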
