Metadata-Version: 2.1
Name: logadu
Version: 0.2.11
Summary: Log Anomaly Detection Ultimate: A package for log parsing, feature representation, and model training.
Home-page: https://github.com/AhmedCoolProjects/logadu-py
Author: Ahmed BARGADY
Author-email: ahmed.bargady@um6p.ma
Classifier: Programming Language :: Python :: 3
Classifier: License :: OSI Approved :: MIT License
Classifier: Operating System :: OS Independent
Classifier: Intended Audience :: Science/Research
Classifier: Topic :: Scientific/Engineering :: Artificial Intelligence
Requires-Python: >=3.8
Description-Content-Type: text/markdown
Requires-Dist: click>=7.0
Requires-Dist: pandas>=1.0
Requires-Dist: tqdm>=4.0
Requires-Dist: regex>=2020.0
Requires-Dist: numpy>=1.0
Requires-Dist: torch>=2.3.1
Requires-Dist: pytorch-lightning>=2.0
Requires-Dist: torchmetrics>=1.0
Requires-Dist: scikit-learn>=1.0
Requires-Dist: gensim>=4.0
Requires-Dist: wandb>=0.15
Requires-Dist: joblib>=1.0
Requires-Dist: transformers>=4.0
Requires-Dist: sentencepiece>=0.1
Provides-Extra: visualization
Requires-Dist: pygraphviz>=1.5; extra == "visualization"

The Python package `logadu` _(Log Anomaly Detection Ultimate)_ supports the analysis and processing of log data for anomaly detection tasks. It provides utilities for cleaning, transforming, and analyzing log data, and for parsing it into structured formats suitable for machine learning models _(three parsers: Drain, Spell, and FT-Tree)_. The package also includes a command-line interface.

## Parsers

Drain takes log data in a user-specified format _(e.g., `<Timestamp> <Content>`, where `Content` is the log message to parse)_, along with a depth parameter, a similarity threshold, and a maximum children count. It outputs a `structured.csv` file containing the original log message, its parsed template, and the template ID, and a `template.csv` file containing each template ID, the template itself, and the occurrence count of that template in the log data.
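To make the role of the log format concrete, here is a minimal sketch of how a format string such as `<Timestamp> <Content>` could be turned into a regular expression that isolates the `Content` field. The function name `format_to_regex` is hypothetical and is not part of the `logadu` API; the package's own implementation may differ.

```python
import re

def format_to_regex(log_format: str) -> "re.Pattern":
    """Build a regex with one named group per <Field> in the format string."""
    # Split on <Field> placeholders; the capturing group keeps them in the result,
    # so even indices hold literal text and odd indices hold placeholders.
    parts = re.split(r"(<[^<>]+>)", log_format)
    pattern = ""
    for i, part in enumerate(parts):
        if i % 2 == 0:
            # Literal text: escape it, and let any whitespace run match \s+.
            pattern += r"\s+".join(re.escape(tok) for tok in re.split(r"\s+", part))
        else:
            # Placeholder: lazy capture, so the last field absorbs the rest of the line.
            pattern += f"(?P<{part[1:-1]}>.*?)"
    return re.compile("^" + pattern + "$")

line_regex = format_to_regex("<Timestamp> <Content>")
match = line_regex.match("2024-01-01 Connection from 10.0.0.1 closed")
timestamp = match.group("Timestamp")
content = match.group("Content")
```

Only `content` would then be handed to the parser; the other fields are kept as metadata.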

Spell takes the same log data as Drain but exposes only a similarity threshold, the single hyperparameter that controls its parsing process. Like Drain, it outputs a `structured.csv` file containing the original log message, its parsed template, and the template ID, and a `template.csv` file containing each template ID, the template itself, and the occurrence count of that template in the log data.
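The relationship between the two output files can be illustrated with a short sketch: the template summary is an aggregation of the structured rows. The column names `Content`, `EventTemplate`, `EventId`, and `Occurrences`, and the `<*>` wildcard, are assumptions made for illustration; check the actual CSV headers produced by `logadu`.

```python
from collections import Counter

def summarize_templates(structured_rows):
    """Aggregate structured rows into one summary row per template."""
    counts = Counter((row["EventId"], row["EventTemplate"]) for row in structured_rows)
    # most_common() orders templates from most to least frequent.
    return [
        {"EventId": event_id, "EventTemplate": template, "Occurrences": n}
        for (event_id, template), n in counts.most_common()
    ]

# Hypothetical structured rows, as they might appear in structured.csv.
rows = [
    {"Content": "Connection from 10.0.0.1 closed",
     "EventTemplate": "Connection from <*> closed", "EventId": "E1"},
    {"Content": "Connection from 10.0.0.2 closed",
     "EventTemplate": "Connection from <*> closed", "EventId": "E1"},
    {"Content": "Disk usage at 91%",
     "EventTemplate": "Disk usage at <*>", "EventId": "E2"},
]
summary = summarize_templates(rows)
```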

FT-Tree takes the same log data as Drain and Spell but requires different parameters: a leaf number and a short threshold. Its output differs from that of the other two parsers: a `.fre` file containing the words found in the log data, and a `.template` file containing the templates generated by the FT-Tree algorithm.
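As a rough illustration of the word statistics FT-Tree relies on, the sketch below counts how many messages each word appears in and reorders the words of a message from most to least frequent, which is the ordering a frequent-template tree is built from. Counting each word once per message and breaking ties alphabetically are assumptions of this sketch, and the exact contents of the `.fre` file may differ.

```python
from collections import Counter

def word_frequencies(messages):
    """Count, for each word, the number of messages that contain it."""
    freq = Counter()
    for message in messages:
        freq.update(set(message.split()))  # set(): count once per message
    return freq

def order_by_frequency(message, freq):
    """Sort a message's distinct words by descending frequency, then alphabetically."""
    return sorted(set(message.split()), key=lambda word: (-freq[word], word))

messages = [
    "session opened for user root",
    "session closed for user root",
    "session opened for user alice",
]
freq = word_frequencies(messages)
ordered = order_by_frequency(messages[1], freq)
```

Frequent words such as `session` end up near the root of the tree and form the template, while rare words such as `alice` fall into the leaves and are treated as parameters.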

### Command

To parse cleaned log data with one of the parsers, run the following command after installing the package:

```bash
logadu parse <input_log_file> --parser <parser_name> --no-parameters
```

Where `<input_log_file>` is the path to the cleaned log data file, and `<parser_name>` is one of the parsers: `drain`, `spell`, or `ft-tree`. The `--no-parameters` flag is optional and can be used to skip storing extracted parameters from the log data.

## Feature Representation

After parsing, raw log messages are converted into structured templates. The next crucial step is Feature Representation, which transforms these textual templates into numerical vectors that machine learning models can process. Different anomaly detection models leverage different representation strategies to capture various aspects of the log data, from simple event occurrence to deep semantic meaning.

The logadu package implements the specific feature representation techniques required by the state-of-the-art models evaluated in this study. This ensures that our comparative analysis is fair and accurately replicates the original methodologies.

### 1 DeepLog: Sequential and Parameter-based Representation

DeepLog \cite{Du_DeepLog_2017} introduced a dual-stream approach to detect anomalies, analyzing not only the sequence of events but also their numerical parameters. This allows it to identify two distinct types of anomalies:

- Execution Path Anomaly: This is detected by analyzing the sequence of log templates. Each unique template is mapped to a unique integer index (e.g., Template ID from the parsing step). A log sequence is thus converted into a sequence of integers, which is fed into an LSTM model to predict the next log event. If the actual event is not among the top predictions, it is flagged as an execution path anomaly. This method effectively captures the system's control flow.
- Performance Anomaly: This is detected by analyzing the parameter values associated with each log template. For each template, DeepLog extracts its corresponding numerical parameters into a vector. It then trains separate models (typically using another LSTM) to learn the normal patterns of these parameter values over time. An anomaly is flagged if a new log's parameter vector deviates significantly from the learned normal patterns (e.g., an unusually high response time or abnormal resource usage).

The logadu package supports generating the sequential indices and extracting the parameter vectors necessary to fully replicate the DeepLog methodology.
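The execution path check can be sketched without a neural network. The example below builds sliding windows over a sequence of template IDs and flags an event when it is not among the top-k candidates for its history. A simple frequency table stands in for DeepLog's LSTM here, so this is an illustration of the windowing and top-k logic only, not the model itself; all function names are hypothetical.

```python
from collections import Counter, defaultdict

def sliding_windows(sequence, window_size):
    """Yield (history, next_event) pairs from a sequence of template IDs."""
    for i in range(len(sequence) - window_size):
        yield tuple(sequence[i:i + window_size]), sequence[i + window_size]

def fit_next_event_counts(normal_sequences, window_size):
    """Count which event follows each history in normal data (LSTM stand-in)."""
    counts = defaultdict(Counter)
    for sequence in normal_sequences:
        for history, next_event in sliding_windows(sequence, window_size):
            counts[history][next_event] += 1
    return counts

def is_path_anomaly(history, actual, counts, top_k):
    """Flag the event if it is not among the top-k predictions for this history."""
    predictions = counts.get(tuple(history), Counter()).most_common(top_k)
    return actual not in [event for event, _ in predictions]

normal = [[1, 2, 3, 4, 1, 2, 3, 4], [1, 2, 3, 5, 1, 2, 3, 4]]
counts = fit_next_event_counts(normal, window_size=3)

strict = is_path_anomaly((1, 2, 3), 5, counts, top_k=1)   # only event 4 is accepted
lenient = is_path_anomaly((1, 2, 3), 5, counts, top_k=2)  # events 4 and 5 are accepted
unseen = is_path_anomaly((1, 2, 3), 9, counts, top_k=2)   # event 9 never followed (1, 2, 3)
```

A larger `top_k` tolerates more variation in the control flow at the cost of missing rarer anomalies, which is why it is exposed as a tunable value.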

### 2 LogAnomaly: Unsupervised Sequential and Quantitative Detection

The logadu package also implements the feature representation for LogAnomaly, an unsupervised model designed to detect both sequential and quantitative anomalies simultaneously. Its methodology is distinct from, and more comprehensive than, that of DeepLog.

- Semantic Template Vectors (template2vec): The core innovation of LogAnomaly \cite{Meng_LogAnomaly_2019} is template2vec, a method designed to learn rich, semantic vector representations for log templates. Inspired by word2vec, template2vec learns a vector for each log template based on its context—that is, the other templates that typically appear alongside it in log sequences. This allows templates that are part of similar program execution flows to have similar vector representations, capturing the sequential logic of the system. To make these vectors robust, template2vec first generates word vectors that consider semantic information, including synonyms and antonyms, before aggregating them into a vector for the entire template. This makes the representation resilient to minor changes in log messages.

- Quantitative Count Vectors: In addition to the semantic representation, LogAnomaly also computes a quantitative count vector for each log sequence. This vector's dimension equals the total number of unique log templates, with each entry representing the frequency of a specific template within that sequence.

By combining these two streams, LogAnomaly can flag an anomaly if either i) the sequence of semantic vectors deviates from normal execution paths (a sequential anomaly), or ii) the event count vector breaks a learned quantitative relationship (a quantitative anomaly).
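The quantitative count vector described above is simple to compute. This minimal sketch assumes that template IDs are zero-based integers and that the total number of templates is known; the function name is hypothetical.

```python
def count_vector(sequence, num_templates):
    """Return a vector whose i-th entry is how often template i occurs in the sequence."""
    vector = [0] * num_templates
    for template_id in sequence:
        vector[template_id] += 1
    return vector

# A sequence of five events over a vocabulary of four templates.
vector = count_vector([0, 2, 2, 1, 2], num_templates=4)
```

Template 3 never occurs in this sequence, so its entry stays at zero; the vector length depends only on the template vocabulary, not on the sequence length.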

### 3 Semantic Vector Representation (for LogRobust, DQNLog & LogBERT)

This category of methods moves beyond simple indices to capture the semantic meaning of the log text, making the models more robust to slight variations in log messages (e.g., "connection failed" vs. "connection error").

- TF-IDF Weighted Word Embeddings (LogRobust): The approach used by LogRobust \cite{Zhang_LogRobust_2019} treats each log template as a document. It uses a pre-trained word embedding model (like FastText \cite{Joulin_FastText_2016}) to get vectors for each word in the template. These word vectors are then aggregated into a single vector for the template by taking a weighted average, where the weights are determined by the Term Frequency-Inverse Document Frequency (TF-IDF) score of each word \cite{Salton_tfidf_1988}. This gives more importance to words that are rare and informative.

- Transformer-based Contextual Embeddings (LogBERT & DQNLog): LogBERT \cite{Guo_LogBERT_2021} and DQNLog \cite{He_DQN_SemiSupervised_2024} utilize powerful Transformer models like BERT and RoBERTa. Instead of just embedding single words, these models process the entire log template (or sequence) and generate contextual embeddings. The output vector for a log event represents its meaning within the context of the words around it. This is typically achieved by taking the hidden state of the [CLS] token or averaging the hidden states of all tokens in the template. This method provides the richest semantic representation.
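The TF-IDF weighted averaging described for LogRobust can be sketched in plain Python. The toy two-dimensional word vectors stand in for a pre-trained embedding model, and the smoothed IDF formula `ln((1 + N) / (1 + df)) + 1` is one common variant chosen for this sketch; LogRobust and `logadu` may weight and normalize differently.

```python
import math
from collections import Counter

def tfidf_template_vector(template_tokens, all_templates, word_vectors):
    """Aggregate word vectors into one template vector, weighted by TF-IDF."""
    n_docs = len(all_templates)
    dim = len(next(iter(word_vectors.values())))
    weighted_sum = [0.0] * dim
    total_weight = 0.0
    for word, count in Counter(template_tokens).items():
        if word not in word_vectors:
            continue  # skip words the embedding model does not know
        tf = count / len(template_tokens)
        df = sum(1 for tokens in all_templates if word in tokens)
        idf = math.log((1 + n_docs) / (1 + df)) + 1  # smoothed IDF (assumption)
        weight = tf * idf
        total_weight += weight
        for i, value in enumerate(word_vectors[word]):
            weighted_sum[i] += weight * value
    if total_weight == 0.0:
        return weighted_sum  # no known words: return the zero vector
    return [value / total_weight for value in weighted_sum]

# Toy embeddings (assumption): a real setup would load pre-trained vectors.
word_vectors = {"connection": [1.0, 0.0], "failed": [0.0, 1.0], "closed": [0.0, -1.0]}
templates = [["connection", "failed"], ["connection", "closed"]]
vector = tfidf_template_vector(templates[0], templates, word_vectors)
```

Because `failed` appears in only one template while `connection` appears in both, the resulting vector leans toward the embedding of `failed`.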

### 4 Pre-trained and Generative Representation (for PreLog & RAGLog)

This category represents the cutting edge, leveraging large-scale, pre-trained models that are fine-tuned or prompted for the anomaly detection task.

- Log-centric Pre-trained Embeddings (PreLog): PreLog \cite{Le_Prelog_2024} is pre-trained specifically on a massive corpus of diverse log data. Its representation is derived from the hidden states of its sequence-to-sequence Transformer architecture. Because it has learned the "language" of logs, its embeddings are highly specialized and effective at capturing log-specific syntax and semantics. The features are not generated by a simple script but by inference through the pre-trained model itself.
- Retrieval-Augmented Dense Vectors (RAGLog): RAGLog \cite{Pan_RAGLog_2024} uses a fundamentally different paradigm. Here, feature representation is designed for retrieval, not direct classification. Normal log templates are encoded into dense vectors using a state-of-the-art embedding model (e.g., from OpenAI, Hugging Face) and stored in a vector database. The "feature" of a new log message is its own vector embedding, which is then used to query the database for semantically similar normal logs. The anomaly decision is made by a Large Language Model (LLM) based on the retrieved results.
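The retrieval step described for RAGLog can be illustrated with a brute-force cosine similarity search. The in-memory list below stands in for a vector database, and the three-dimensional vectors are made-up placeholders for real embeddings; the function names are hypothetical.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def retrieve_similar(query_vector, store, k=3):
    """Return the k stored (score, template) pairs most similar to the query."""
    scored = [(cosine(query_vector, vector), template) for template, vector in store]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:k]

# Stand-in for a vector database of normal templates (toy embeddings).
store = [
    ("session opened for user <*>", [0.9, 0.1, 0.0]),
    ("connection closed by <*>", [0.1, 0.9, 0.0]),
    ("disk usage at <*>", [0.0, 0.1, 0.9]),
]
top = retrieve_similar([0.8, 0.2, 0.0], store, k=2)
```

The retrieved templates and their scores would then be passed to the LLM as context for the anomaly decision.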

## Sequence Construction

# Notes

/home/ahmed.bargady/lustre/data_sec-um6p-st-sccs-6sevvl76uja/IDS/ahmed.bargady/datasets/AITv2/preprocessing/Linux24APT/drain/Linux24APT_10_1_seq_template_vectors.pt

logadu train /home/ahmed.bargady/lustre/data_sec-um6p-st-sccs-6sevvl76uja/IDS/ahmed.bargady/datasets/AITv2/preprocessing/Linux24APT/drain/Linux24APT_10_1_seq_template_vectors.pt --output-dir ./trained_models --wandb-project "lad_in_apts" --wandb-run-name "logrobust_Linux24_seq_10_1" --model logrobust

logadu train /home/ahmed.bargady/lustre/data_sec-um6p-st-sccs-6sevvl76uja/IDS/ahmed.bargady/datasets/AITv2/preprocessing/Linux24APT/drain/Linux24APT_10_1_seq_template_vectors.pt --output-dir ./trained_models --wandb-project "lad_in_apts" --wandb-run-name "Autoencoder_Linux24_seq_10_1" --model autoencoder

logadu train /home/ahmed.bargady/lustre/data_sec-um6p-st-sccs-6sevvl76uja/IDS/ahmed.bargady/datasets/AITv2/preprocessing/Linux24APT/drain/Linux24APT_10_1_seq_raw_vectors_neurallog.pt \
--model neurallog \
--epochs 50 \
--output-dir ./trained_models \
--wandb-project "lad_in_apts" \
--wandb-run-name "neurallog_Linux_run1"

## DEEPLOG - TRAIN

### FOX

logadu train /home/ahmed.bargady/lustre/data_sec-um6p-st-sccs-6sevvl76uja/IDS/ahmed.bargady/datasets/AITv2/implementation/Fox/drain/Fox_10_1_seq_index.csv --model deeplog --epochs 100 --output-dir ./trained_models --wandb-project "lad_in_apts" --wandb-run-name "deeplog_fox_index_10" --dataset-name fox_10

logadu train /home/ahmed.bargady/lustre/data_sec-um6p-st-sccs-6sevvl76uja/IDS/ahmed.bargady/datasets/AITv2/implementation/Fox/drain/Fox_20_1_seq_index.csv --model deeplog --epochs 100 --output-dir ./trained_models --wandb-project "lad_in_apts" --wandb-run-name "deeplog_fox_index_20" --dataset-name fox_20

logadu train /home/ahmed.bargady/lustre/data_sec-um6p-st-sccs-6sevvl76uja/IDS/ahmed.bargady/datasets/AITv2/implementation/Fox/drain/Fox_50_1_seq_index.csv --model deeplog --epochs 100 --output-dir ./trained_models --wandb-project "lad_in_apts" --wandb-run-name "deeplog_fox_index_50" --dataset-name fox_50

logadu train /home/ahmed.bargady/lustre/data_sec-um6p-st-sccs-6sevvl76uja/IDS/ahmed.bargady/datasets/AITv2/implementation/Fox/drain/Fox_100_1_seq_index.csv --model deeplog --epochs 100 --output-dir ./trained_models --wandb-project "lad_in_apts" --wandb-run-name "deeplog_fox_index_100" --dataset-name fox_100

### LINUX24APT

logadu train /home/ahmed.bargady/lustre/data_sec-um6p-st-sccs-6sevvl76uja/IDS/ahmed.bargady/datasets/AITv2/implementation/Linux24APT/drain/Linux24APT_10_1_seq_index.csv --model deeplog --epochs 100 --output-dir ./trained_models --wandb-project "lad_in_apts" --wandb-run-name "deeplog_linux24_index_10" --dataset-name linux24_10

logadu train /home/ahmed.bargady/lustre/data_sec-um6p-st-sccs-6sevvl76uja/IDS/ahmed.bargady/datasets/AITv2/implementation/Linux24APT/drain/Linux24APT_20_1_seq_index.csv --model deeplog --epochs 100 --output-dir ./trained_models --wandb-project "lad_in_apts" --wandb-run-name "deeplog_linux24_index_20" --dataset-name linux24_20

logadu train /home/ahmed.bargady/lustre/data_sec-um6p-st-sccs-6sevvl76uja/IDS/ahmed.bargady/datasets/AITv2/implementation/Linux24APT/drain/Linux24APT_50_1_seq_index.csv --model deeplog --epochs 100 --output-dir ./trained_models --wandb-project "lad_in_apts" --wandb-run-name "deeplog_linux24_index_50" --dataset-name linux24_50

logadu train /home/ahmed.bargady/lustre/data_sec-um6p-st-sccs-6sevvl76uja/IDS/ahmed.bargady/datasets/AITv2/implementation/Linux24APT/drain/Linux24APT_100_1_seq_index.csv --model deeplog --epochs 100 --output-dir ./trained_models --wandb-project "lad_in_apts" --wandb-run-name "deeplog_linux24_index_100" --dataset-name linux24_100

### RUSSELLMITCHELL

logadu train /home/ahmed.bargady/lustre/data_sec-um6p-st-sccs-6sevvl76uja/IDS/ahmed.bargady/datasets/AITv2/implementation/Russellmitchell/drain/Russellmitchell_10_1_seq_index.csv --model deeplog --epochs 100 --output-dir ./trained_models --wandb-project "lad_in_apts" --wandb-run-name "deeplog_russellmitchell_index_10" --dataset-name russellmitchell_10

logadu train /home/ahmed.bargady/lustre/data_sec-um6p-st-sccs-6sevvl76uja/IDS/ahmed.bargady/datasets/AITv2/implementation/Russellmitchell/drain/Russellmitchell_20_1_seq_index.csv --model deeplog --epochs 100 --output-dir ./trained_models --wandb-project "lad_in_apts" --wandb-run-name "deeplog_russellmitchell_index_20" --dataset-name russellmitchell_20

logadu train /home/ahmed.bargady/lustre/data_sec-um6p-st-sccs-6sevvl76uja/IDS/ahmed.bargady/datasets/AITv2/implementation/Russellmitchell/drain/Russellmitchell_50_1_seq_index.csv --model deeplog --epochs 100 --output-dir ./trained_models --wandb-project "lad_in_apts" --wandb-run-name "deeplog_russellmitchell_index_50" --dataset-name russellmitchell_50

logadu train /home/ahmed.bargady/lustre/data_sec-um6p-st-sccs-6sevvl76uja/IDS/ahmed.bargady/datasets/AITv2/implementation/Russellmitchell/drain/Russellmitchell_100_1_seq_index.csv --model deeplog --epochs 100 --output-dir ./trained_models --wandb-project "lad_in_apts" --wandb-run-name "deeplog_russellmitchell_index_100" --dataset-name russellmitchell_100

### WIN25CHAPT

logadu train /home/ahmed.bargady/lustre/data_sec-um6p-st-sccs-6sevvl76uja/IDS/ahmed.bargady/datasets/AITv2/implementation/Win25ChAPT/drain/Win25ChAPT_10_1_seq_index.csv --model deeplog --epochs 100 --output-dir ./trained_models --wandb-project "lad_in_apts" --wandb-run-name "deeplog_win25ch_index_10" --dataset-name win25ch_10

logadu train /home/ahmed.bargady/lustre/data_sec-um6p-st-sccs-6sevvl76uja/IDS/ahmed.bargady/datasets/AITv2/implementation/Win25ChAPT/drain/Win25ChAPT_20_1_seq_index.csv --model deeplog --epochs 100 --output-dir ./trained_models --wandb-project "lad_in_apts" --wandb-run-name "deeplog_win25ch_index_20" --dataset-name win25ch_20

logadu train /home/ahmed.bargady/lustre/data_sec-um6p-st-sccs-6sevvl76uja/IDS/ahmed.bargady/datasets/AITv2/implementation/Win25ChAPT/drain/Win25ChAPT_50_1_seq_index.csv --model deeplog --epochs 100 --output-dir ./trained_models --wandb-project "lad_in_apts" --wandb-run-name "deeplog_win25ch_index_50" --dataset-name win25ch_50

logadu train /home/ahmed.bargady/lustre/data_sec-um6p-st-sccs-6sevvl76uja/IDS/ahmed.bargady/datasets/AITv2/implementation/Win25ChAPT/drain/Win25ChAPT_100_1_seq_index.csv --model deeplog --epochs 100 --output-dir ./trained_models --wandb-project "lad_in_apts" --wandb-run-name "deeplog_win25ch_index_100" --dataset-name win25ch_100

## DEEPLOG - PREDICT

logadu predict /home/ahmed.bargady/lustre/data_sec-um6p-st-sccs-6sevvl76uja/IDS/ahmed.bargady/datasets/AITv2/implementation/Fox/drain/Fox_10_1_seq_index.csv --model-type deeplog --model-checkpoint /home/ahmed.bargady/lustre/nlp_team-um6p-st-sccs-id7fz1zvotk/IDS/ahmed.bargady/data/github/logs-ad-ultimate/logadu-package/trained_models/deeplog-fox_10-best-checkpoint.ckpt --top-k 9

logadu predict /home/ahmed.bargady/lustre/data_sec-um6p-st-sccs-6sevvl76uja/IDS/ahmed.bargady/datasets/AITv2/implementation/Russellmitchell/drain/Russellmitchell_10_1_seq_index.csv --model-type deeplog --model-checkpoint /home/ahmed.bargady/lustre/nlp_team-um6p-st-sccs-id7fz1zvotk/IDS/ahmed.bargady/data/github/logs-ad-ultimate/logadu-package/trained_models/deeplog-fox_10-best-checkpoint.ckpt --top-k 9

logadu predict /home/ahmed.bargady/lustre/data_sec-um6p-st-sccs-6sevvl76uja/IDS/ahmed.bargady/datasets/AITv2/implementation/Linux24APT/drain/Linux24APT_10_1_seq_index.csv --model-type deeplog --model-checkpoint /home/ahmed.bargady/lustre/nlp_team-um6p-st-sccs-id7fz1zvotk/IDS/ahmed.bargady/data/github/logs-ad-ultimate/logadu-package/trained_models/deeplog-linux24_10-best-checkpoint.ckpt --top-k 9

## LOGBERT - TRAIN

logadu train /home/ahmed.bargady/lustre/data_sec-um6p-st-sccs-6sevvl76uja/IDS/ahmed.bargady/datasets/AITv2/implementation/Fox/drain/Fox_10_1_seq_index.csv --model logbert --epochs 100 --output-dir ./trained_models --wandb-project "lad_in_apts" --wandb-run-name "logbert_fox_index_10" --dataset-name fox_10

logadu train /home/ahmed.bargady/lustre/data_sec-um6p-st-sccs-6sevvl76uja/IDS/ahmed.bargady/datasets/AITv2/implementation/Linux24APT/drain/Linux24APT_10_1_seq_index.csv --model logbert --epochs 100 --output-dir ./trained_models --wandb-project "lad_in_apts" --wandb-run-name "logbert_linux24_index_10" --dataset-name linux24_10

## LOGBERT - PREDICT

logadu predict /home/ahmed.bargady/lustre/data_sec-um6p-st-sccs-6sevvl76uja/IDS/ahmed.bargady/datasets/AITv2/implementation/Fox/drain/Fox_10_1_seq_index.csv --model-type logbert --model-checkpoint /home/ahmed.bargady/lustre/nlp_team-um6p-st-sccs-id7fz1zvotk/IDS/ahmed.bargady/data/github/logs-ad-ultimate/logadu-package/trained_models/logbert/logbert-fox_10-best-checkpoint.ckpt --top-k 9 --anomaly-threshold 2

logadu predict /home/ahmed.bargady/lustre/data_sec-um6p-st-sccs-6sevvl76uja/IDS/ahmed.bargady/datasets/AITv2/implementation/Russellmitchell/drain/Russellmitchell_10_1_seq_index.csv --model-type logbert --model-checkpoint /home/ahmed.bargady/lustre/nlp_team-um6p-st-sccs-id7fz1zvotk/IDS/ahmed.bargady/data/github/logs-ad-ultimate/logadu-package/trained_models/logbert-fox_10-best-checkpoint.ckpt --top-k 9 --anomaly-threshold 2

logadu predict /home/ahmed.bargady/lustre/data_sec-um6p-st-sccs-6sevvl76uja/IDS/ahmed.bargady/datasets/AITv2/implementation/Linux24APT/drain/Linux24APT_10_1_seq_index.csv --model-type logbert --model-checkpoint /home/ahmed.bargady/lustre/nlp_team-um6p-st-sccs-id7fz1zvotk/IDS/ahmed.bargady/data/github/logs-ad-ultimate/logadu-package/trained_models/logbert-linux24_10-best-checkpoint.ckpt --top-k 9 --anomaly-threshold 2

## LOGBERT - RESULTS
