ALACTIC AGI Framework

Enterprise AI Dataset Processing Platform
Local Setup & Installation Guide
Version 1.0.0 | Documentation v1.0

🎯 Overview

ALACTIC AGI Framework is an enterprise-grade AI dataset processing platform designed for scalable data acquisition, validation, and structuring. It includes production-ready monitoring and observability.

Key Features

  • Automated Data Acquisition: Web-scale scraping with Scrapy integration
  • Intelligent Data Validation: AI-powered quality scoring and duplicate detection
  • Structured Data Output: Multiple formats (JSON, Parquet, HDF5) for ML pipelines
  • Enterprise Monitoring: Prometheus + Grafana + AlertManager integration
  • Production Scalability: Handles 100M+ sources with distributed processing
  • Windows-First Design: Optimized for enterprise Windows environments

Architecture Overview

Component      | Technology           | Purpose                   | Port
Core Framework | Python 3.10+         | Main processing engine    | 5000
Web Crawler    | Scrapy               | Data acquisition          | -
Search Engine  | Apache Solr 9        | Data indexing & retrieval | 8983
API Server     | Node.js              | REST API interface        | 3000
Monitoring     | Prometheus + Grafana | Observability & alerting  | 9090, 3001

📋 System Requirements

🖥️ Operating System

  • Windows 10/11 (Recommended)
  • Windows Server 2019/2022
  • Linux (Ubuntu 20.04+, CentOS 8+)
  • macOS 10.15+ (Development only)

🔧 Software Dependencies

  • Python 3.10 or higher
  • Node.js 18.0+ (for API server)
  • Java 17+ (for Apache Solr)
  • Git 2.30+ (for source control)

💻 Hardware Specifications

  • Minimum: 8GB RAM, 4 CPU cores
  • Recommended: 16GB RAM, 8 CPU cores
  • Enterprise: 32GB+ RAM, 16+ CPU cores
  • Storage: 100GB+ available space

🌐 Network Requirements

  • Internet connectivity for data scraping
  • Open ports: 3000, 3001, 5000, 8080, 8983, 9090
  • Firewall exceptions for HTTP/HTTPS
  • DNS resolution for external APIs

Enterprise Environments: For production deployments processing 1M+ documents daily, we recommend 32GB+ RAM and SSD storage for optimal performance.

🚀 Installation Guide

Prerequisites Installation

Python 3.10+ Setup

PowerShell
# Download Python 3.10+ from python.org
# Or use Windows Package Manager
winget install Python.Python.3.10

# Verify installation
python --version
# Expected output: Python 3.10.x or higher

Node.js Installation

PowerShell
# Download from nodejs.org or use package manager
winget install OpenJS.NodeJS

# Verify installation
node --version
npm --version

Java 17+ Installation

PowerShell
# Install Java 17 (required for Apache Solr)
winget install EclipseAdoptium.Temurin.17.JDK

# Verify installation
java -version
# Expected: openjdk version "17.x.x" or higher

Framework Download & Setup

Git
# Clone the ALACTIC AGI repository
git clone https://github.com/AlacticAI/alactic-agi.git
cd alactic-agi

# Create and activate virtual environment
python -m venv .venv

# Windows activation
.venv\Scripts\activate

# Linux/macOS activation
source .venv/bin/activate
Note: Always use a virtual environment to avoid dependency conflicts with other Python projects.

Dependencies Installation

PowerShell
# Install Python dependencies
pip install -r requirements.txt

# Install monitoring dependencies (optional)
pip install prometheus_client grafana-api psutil

# Verify core installations
python -c "import scrapy; print('Scrapy installed successfully')"
python -c "import requests; print('Requests installed successfully')"

Key Dependencies Installed:

Package        | Version | Purpose
scrapy         | 2.10+   | Web crawling framework
requests       | 2.31+   | HTTP client library
beautifulsoup4 | 4.12+   | HTML/XML parsing
pandas         | 2.0+    | Data manipulation
flask          | 2.3+    | Web framework for APIs

Apache Solr Setup

PowerShell
# Solr is included in the tools directory
cd tools/solr9

# Start Solr (Windows)
.\bin\solr.cmd start

# Start Solr (Linux/macOS)
./bin/solr start

# Create ALACTIC AGI core (on Linux/macOS use ./bin/solr create -c super_rag)
.\bin\solr.cmd create -c super_rag

# Verify Solr is running
curl http://localhost:8983/solr/admin/info/system
Success! Solr should now be accessible at http://localhost:8983

Node.js API Server Setup

PowerShell
# Navigate to API directory
cd api

# Install Node.js dependencies
npm install

# Verify dependencies
npm list

# Start API server (development)
npm run dev

# Start API server (production)
npm start

⚙️ Configuration

Core Configuration File

The main configuration is stored in config.ini:

INI
[DEFAULT]
# Core Framework Settings
debug = false
log_level = INFO
max_workers = 4

[DATABASE]
# Solr Configuration
solr_url = http://localhost:8983/solr/super_rag
solr_timeout = 30

[SCRAPING]
# Web Crawling Settings
max_pages = 10000
delay = 1.0
respect_robots = true
user_agent = ALACTIC-AGI/1.0

[API]
# API Server Configuration
host = 0.0.0.0
port = 5000
cors_enabled = true

[MONITORING]
# Monitoring & Observability
metrics_enabled = true
prometheus_port = 8080
grafana_enabled = true
alert_email = support@alacticai.com
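
To check that a config.ini parses as expected, the file can be read with Python's standard configparser module. The sketch below is illustrative only: it inlines an abbreviated copy of the settings above so it runs standalone, and the load_settings helper is a hypothetical name, not part of the framework. How ALACTIC AGI itself loads the file is not shown in this guide.

```python
import configparser

# Abbreviated copy of the config.ini shown above, inlined so the sketch runs standalone.
SAMPLE_CONFIG = """
[DEFAULT]
debug = false
log_level = INFO
max_workers = 4

[DATABASE]
solr_url = http://localhost:8983/solr/super_rag
solr_timeout = 30

[SCRAPING]
max_pages = 10000
delay = 1.0
respect_robots = true
"""


def load_settings(text: str) -> dict:
    """Parse INI text and return typed values for a few documented keys.

    load_settings is a hypothetical helper used only for this illustration.
    """
    parser = configparser.ConfigParser()
    parser.read_string(text)
    return {
        "debug": parser["DEFAULT"].getboolean("debug"),
        "max_workers": parser["DEFAULT"].getint("max_workers"),
        "solr_url": parser["DATABASE"]["solr_url"],
        "solr_timeout": parser["DATABASE"].getint("solr_timeout"),
        "delay": parser["SCRAPING"].getfloat("delay"),
        # configparser makes [DEFAULT] values visible from every other section.
        "scraping_log_level": parser["SCRAPING"]["log_level"],
    }


settings = load_settings(SAMPLE_CONFIG)
print(settings)
```

One detail worth knowing: configparser treats [DEFAULT] as a fallback for all sections, so a key such as log_level is readable from [SCRAPING] even though it is only defined once.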

Environment Variables

For production deployments, use environment variables to override configuration:

PowerShell
# Set environment variables (Windows)
$env:ALACTIC_DEBUG = "false"
$env:ALACTIC_SOLR_URL = "http://production-solr:8983/solr/super_rag"
$env:ALACTIC_LOG_LEVEL = "WARNING"
$env:ALACTIC_MAX_WORKERS = "8"

# Linux/macOS
export ALACTIC_DEBUG=false
export ALACTIC_SOLR_URL=http://production-solr:8983/solr/super_rag
export ALACTIC_LOG_LEVEL=WARNING
export ALACTIC_MAX_WORKERS=8
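
The override behaviour can be pictured as a merge in which any ALACTIC_* variable that is set replaces the matching file value. The following is a sketch under assumptions: the variable-to-key mapping and the apply_env_overrides helper are hypothetical, and the framework's real precedence rules may differ.

```python
import os

# Hypothetical mapping from the ALACTIC_* variables above to config keys and converters.
ENV_OVERRIDES = {
    "ALACTIC_DEBUG": ("debug", lambda v: v.strip().lower() in ("1", "true", "yes", "on")),
    "ALACTIC_SOLR_URL": ("solr_url", str),
    "ALACTIC_LOG_LEVEL": ("log_level", str),
    "ALACTIC_MAX_WORKERS": ("max_workers", int),
}


def apply_env_overrides(settings: dict, environ=os.environ) -> dict:
    """Return a copy of settings where any set ALACTIC_* variable wins over the file value."""
    merged = dict(settings)
    for var, (key, convert) in ENV_OVERRIDES.items():
        if var in environ:
            merged[key] = convert(environ[var])
    return merged


file_defaults = {
    "debug": False,
    "solr_url": "http://localhost:8983/solr/super_rag",
    "log_level": "INFO",
    "max_workers": 4,
}

# A plain dict stands in for os.environ so the example does not depend on the shell.
fake_env = {"ALACTIC_LOG_LEVEL": "WARNING", "ALACTIC_MAX_WORKERS": "8"}
effective = apply_env_overrides(file_defaults, fake_env)
print(effective)
```

Environment values always arrive as strings, which is why each entry carries its own converter.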

Solr Configuration

Custom Solr schema for optimal performance:

XML
<!-- solr_config/schema.xml -->
<schema name="alactic-agi" version="1.6">
  <field name="id" type="string" indexed="true" stored="true" required="true" multiValued="false"/>
  <field name="title" type="text_general" indexed="true" stored="true"/>
  <field name="content" type="text_general" indexed="true" stored="true"/>
  <field name="url" type="string" indexed="true" stored="true"/>
  <field name="timestamp" type="pdate" indexed="true" stored="true" default="NOW"/>
  <field name="quality_score" type="pfloat" indexed="true" stored="true"/>
  <uniqueKey>id</uniqueKey>
</schema>
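
A document that fits this schema can be assembled as a plain dictionary and serialized as a JSON array, which is the shape Solr's JSON update handler accepts. The sketch below only builds the payload and does not contact Solr. Deriving id from a SHA-256 hash of the URL is an assumption made for illustration, and make_solr_doc is a hypothetical helper, not a framework function.

```python
import hashlib
import json
from datetime import datetime, timezone


def make_solr_doc(url: str, title: str, content: str, quality_score: float) -> dict:
    """Build a dict whose keys match the schema fields above.

    Using a hash of the URL as the unique id is an assumption for this sketch.
    """
    return {
        "id": hashlib.sha256(url.encode("utf-8")).hexdigest(),
        "title": title,
        "content": content,
        "url": url,
        # Solr date fields expect ISO-8601 timestamps in UTC with a trailing Z.
        "timestamp": datetime.now(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ"),
        "quality_score": float(quality_score),
    }


doc = make_solr_doc(
    "https://example.com/dataset",
    "Example dataset",
    "Body text of the page",
    0.92,
)

# The update handler takes a JSON array of documents.
payload = json.dumps([doc])
print(payload)
```

Because id is the uniqueKey, indexing the same URL twice with this scheme would overwrite the earlier document rather than create a duplicate.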

📊 Enterprise Monitoring Setup

Enterprise Feature: ALACTIC AGI includes comprehensive monitoring with Prometheus, Grafana, and AlertManager for production-grade observability.

Quick Monitoring Setup

Install Monitoring Dependencies

PowerShell
# Install monitoring packages
pip install prometheus_client psutil grafana-api

# Verify installation
python -c "import prometheus_client; print('Monitoring ready!')"

Start Monitoring Demo

PowerShell
# Start the monitoring demonstration
python monitoring_demo.py

# Access monitoring endpoints:
# http://localhost:8080/ - Dashboard
# http://localhost:8080/metrics - Prometheus metrics
# http://localhost:8080/health - Health check

Test Core Monitoring

PowerShell
# Run comprehensive monitoring tests
python test_monitoring_core.py

# Expected output: All tests PASSED
# Metrics rate: 500K+ metrics/second
# Memory usage: Stable under load

Monitoring Metrics

Metric Type      | Examples                                  | Description
Business Metrics | query_total, documents_processed          | Core business KPIs
Performance      | query_duration_seconds, api_response_time | System performance tracking
System Health    | system_cpu_percent, system_memory_percent | Infrastructure monitoring
Error Tracking   | query_error_total, pipeline_failures      | Error rates and types

Integration with Existing Systems

Python
from monitoring import MetricsCollector, timed

# Initialize monitoring
metrics = MetricsCollector()

# Track custom business metrics
@timed("document_processing")
def process_documents(documents):
    results = []
    for doc in documents:
        # Your processing logic (process_single_document is supplied by your application)
        result = process_single_document(doc)
        results.append(result)

        # Record metrics
        metrics.record_counter("documents_processed", 1,
                               {"type": doc.type, "status": "success"})
        # Number of documents still waiting in this batch
        metrics.record_gauge("processing_queue_size", len(documents) - len(results))
    return results

🎯 Usage Guide

Quick Start

Python
from alactic_framework import AlacticAGI

# Initialize the framework
agi = AlacticAGI(config_file='config.ini')

# Process a simple query
result = agi.run_pipeline("artificial intelligence datasets")

# Display results
for item in result:
    print(f"Title: {item['title']}")
    print(f"URL: {item['url']}")
    print(f"Quality Score: {item['quality_score']}")

Advanced Usage Examples

1. Batch Document Processing

Python
import asyncio
from alactic_framework import AlacticAGI

async def batch_process():
    agi = AlacticAGI(retriever_type='async')
    queries = [
        "machine learning research papers",
        "natural language processing datasets",
        "computer vision benchmarks"
    ]

    # Process multiple queries concurrently
    tasks = [agi.run_pipeline(query) for query in queries]
    results = await asyncio.gather(*tasks)
    return results

# Run batch processing
results = asyncio.run(batch_process())
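
With larger query lists it is often useful to cap how many pipelines run at once. The sketch below shows one common way to do that with asyncio.Semaphore. It is self-contained: fake_pipeline is a stand-in for an async run_pipeline call and run_bounded is a hypothetical helper, so nothing here reflects how the framework schedules work internally.

```python
import asyncio


async def fake_pipeline(query: str) -> list:
    """Stand-in for an async run_pipeline call; returns one dummy result."""
    await asyncio.sleep(0.01)
    return [{"title": f"Result for {query}", "quality_score": 0.9}]


async def run_bounded(queries, max_concurrent: int = 2) -> list:
    """Run every query, allowing at most max_concurrent pipelines at a time."""
    semaphore = asyncio.Semaphore(max_concurrent)

    async def run_one(query: str):
        async with semaphore:
            try:
                return await fake_pipeline(query)
            except Exception as exc:
                # Return the error so one failed query does not cancel the batch.
                return exc

    # gather preserves input order, so results line up with queries.
    return await asyncio.gather(*(run_one(q) for q in queries))


queries = [
    "machine learning research papers",
    "natural language processing datasets",
    "computer vision benchmarks",
]
results = asyncio.run(run_bounded(queries))
for query, result in zip(queries, results):
    print(query, "->", result)
```

A limit close to max_workers from config.ini is a reasonable starting point, though the right value depends on how heavy each query is.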

2. Custom Data Pipeline

Python
from alactic_framework import AlacticAGI
from monitoring import PerformanceProfiler

class CustomDataPipeline:
    def __init__(self):
        self.agi = AlacticAGI()
        self.profiler = PerformanceProfiler(self.agi.metrics)

    def process_with_custom_validation(self, query, min_quality=0.8):
        with self.profiler.profile_operation("custom_pipeline"):
            # Get raw results
            raw_results = self.agi.run_pipeline(query)

            # Apply custom filtering
            filtered_results = [
                result for result in raw_results
                if result.get('quality_score', 0) >= min_quality
            ]

            # Record custom metrics (skip the ratio when nothing was returned)
            if raw_results:
                self.agi.metrics.record_gauge(
                    "custom_filter_ratio",
                    len(filtered_results) / len(raw_results))

            return filtered_results

# Use custom pipeline
pipeline = CustomDataPipeline()
high_quality_results = pipeline.process_with_custom_validation(
    "AI research datasets", min_quality=0.9
)

3. API Integration

Python
from flask import Flask, request, jsonify
from alactic_framework import AlacticAGI

app = Flask(__name__)
agi = AlacticAGI()

@app.route('/api/search', methods=['POST'])
def search_endpoint():
    # Tolerate a missing or malformed JSON body
    data = request.get_json(silent=True) or {}
    query = data.get('query')
    limit = data.get('limit', 10)

    if not query:
        return jsonify({'error': 'query is required', 'status': 'error'}), 400

    try:
        results = agi.run_pipeline(query)

        # Format response
        response = {
            'query': query,
            'total_results': len(results),
            'results': results[:limit],
            'status': 'success'
        }
        return jsonify(response)
    except Exception as e:
        return jsonify({'error': str(e), 'status': 'error'}), 500

if __name__ == '__main__':
    # Keep debug mode off when binding to all interfaces
    app.run(host='0.0.0.0', port=5000, debug=False)

Command Line Interface

PowerShell
# Basic query
python alactic_agi.py --query "machine learning datasets"

# Advanced options
python alactic_agi.py --query "AI research" --max-results 100 --format json

# Batch processing from file
python alactic_agi.py --batch-file queries.txt --output results.json

# Start monitoring dashboard
python alactic_agi.py --start-monitoring

# Health check
python alactic_agi.py --health-check

🔧 Troubleshooting

Common Issues & Solutions

Before troubleshooting: Always check the logs in alactic_agi.log for detailed error information.

Installation Issues

Problem                        | Symptoms                         | Solution
Python version incompatibility | ImportError, syntax errors       | Install Python 3.10+ and recreate virtual environment
Scrapy installation fails      | pip install errors on Windows    | Install Visual C++ Build Tools or use conda
Java not found                 | Solr fails to start              | Install Java 17+ and set JAVA_HOME environment variable
Permission denied              | Access errors on file operations | Run as administrator or check folder permissions

Runtime Issues

PowerShell
# Common diagnostic commands

# Check if all services are running
python alactic_agi.py --health-check

# Test Solr connectivity (the ping handler is served per core)
curl http://localhost:8983/solr/super_rag/admin/ping

# Check Python dependencies (on Linux/macOS use: pip list | grep -E "scrapy|requests|flask")
pip list | Select-String -Pattern "scrapy|requests|flask"

# Monitor system resources (interval=1 samples CPU over one second)
python -c "import psutil; print(f'CPU: {psutil.cpu_percent(interval=1)}%, Memory: {psutil.virtual_memory().percent}%')"

# Test monitoring system
python test_monitoring_core.py
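
When the health check itself fails, a quick first step is to see which service ports accept TCP connections at all. The sketch below uses only the standard library. The port numbers come from the Architecture Overview table, while port_is_open and check_services are hypothetical helper names. An open port shows only that something is listening, not that the service is healthy.

```python
import socket

# Ports taken from the Architecture Overview table.
SERVICE_PORTS = {
    "Core Framework": 5000,
    "Apache Solr": 8983,
    "API Server": 3000,
    "Prometheus": 9090,
    "Grafana": 3001,
}


def port_is_open(host: str, port: int, timeout: float = 1.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


def check_services(host: str = "localhost", ports: dict = SERVICE_PORTS) -> dict:
    """Map each service name to whether its port accepts connections."""
    return {name: port_is_open(host, port) for name, port in ports.items()}


status = check_services()
for name, is_up in status.items():
    print(f"{name:15} port {SERVICE_PORTS[name]:<5} {'open' if is_up else 'closed'}")
```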

Performance Issues

  • Slow query responses: Check Solr index size and consider reindexing
  • High memory usage: Reduce max_workers in config.ini
  • Network timeouts: Increase timeout values such as solr_timeout in config.ini
  • Disk space issues: Clear logs and temporary files regularly

Error Code Reference

Error Code | Description                  | Action
AGI-001    | Configuration file not found | Ensure config.ini exists in project root
AGI-002    | Solr connection failed       | Check if Solr is running on port 8983
AGI-003    | Scraping rate limit exceeded | Increase delay in scraping configuration
AGI-004    | Memory allocation error      | Increase system RAM or reduce batch size
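
The table above can also be turned into a small lookup that scans log text for AGI-xxx codes and prints the recommended action. This is a sketch under assumptions: the layout of alactic_agi.log is not documented here, so the sample lines are invented and the code matches only the error code itself. suggest_actions is a hypothetical helper.

```python
import re

# Actions copied from the Error Code Reference table.
ERROR_ACTIONS = {
    "AGI-001": "Ensure config.ini exists in project root",
    "AGI-002": "Check if Solr is running on port 8983",
    "AGI-003": "Increase delay in scraping configuration",
    "AGI-004": "Increase system RAM or reduce batch size",
}

CODE_PATTERN = re.compile(r"\bAGI-\d{3}\b")


def suggest_actions(log_text: str) -> dict:
    """Return {code: action} for each distinct error code found, in first-seen order."""
    found = {}
    for line in log_text.splitlines():
        for code in CODE_PATTERN.findall(line):
            found.setdefault(code, ERROR_ACTIONS.get(code, "Unknown code; check the full log entry"))
    return found


# Invented sample lines; the real log format may differ.
sample_log = """\
2024-01-01 10:00:00 ERROR AGI-002 Solr connection failed
2024-01-01 10:00:05 ERROR AGI-002 Solr connection failed
2024-01-01 10:01:00 WARNING AGI-003 Scraping rate limit exceeded
"""

for code, action in suggest_actions(sample_log).items():
    print(f"{code}: {action}")
```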

📚 API Reference

Core Classes

AlacticAGI Class

Python
from typing import Dict, List

class AlacticAGI:
    """
    Main ALACTIC AGI Framework class for AI dataset processing.

    Args:
        config_file (str): Path to configuration file
        retriever_type (str): 'sync' or 'async' for retrieval mode

    Attributes:
        config: Configuration manager
        metrics: Metrics collection system
        components: Dictionary of initialized components
    """

    def __init__(self, config_file='config.ini', retriever_type='async'):
        # Initialize framework components
        ...

    def run_pipeline(self, query: str) -> List[Dict]:
        """
        Execute the complete data processing pipeline.

        Args:
            query (str): Search query for data acquisition

        Returns:
            List[Dict]: Processed and structured data results

        Raises:
            AGIException: If pipeline execution fails
        """

    def get_health_status(self) -> Dict:
        """
        Get comprehensive health status of all components.

        Returns:
            Dict: Health status with component details
        """

Monitoring Classes

Python
from typing import Dict, Optional

class MetricsCollector:
    """Enterprise-grade metrics collection system"""

    def record_counter(self, name: str, value: int = 1, labels: Dict = None):
        """Record counter metric"""

    def record_gauge(self, name: str, value: float, labels: Dict = None):
        """Record gauge metric"""

    def record_timer(self, name: str, duration: float, labels: Dict = None):
        """Record timing metric"""

class PerformanceProfiler:
    """Performance profiling and analysis"""

    def profile_operation(self, operation_name: str, labels: Dict = None):
        """Context manager for operation profiling"""

def timed(operation_name: Optional[str] = None):
    """Decorator for automatic timing of functions"""
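
To make the interface above concrete, here is a minimal in-memory stand-in that mirrors the documented method names, together with a timing decorator built on it. It is not the framework's implementation: InMemoryMetrics, its storage layout, and the module-level metrics instance are assumptions chosen so the example runs without the monitoring package.

```python
import functools
import time
from collections import defaultdict
from typing import Dict, Optional


class InMemoryMetrics:
    """Hypothetical stand-in mirroring the documented MetricsCollector methods."""

    def __init__(self):
        self.counters = defaultdict(int)
        self.gauges = {}
        self.timers = defaultdict(list)

    @staticmethod
    def _key(name: str, labels: Optional[Dict]):
        # Labels are sorted so the same label set always maps to the same key.
        return (name, tuple(sorted((labels or {}).items())))

    def record_counter(self, name: str, value: int = 1, labels: Dict = None):
        self.counters[self._key(name, labels)] += value

    def record_gauge(self, name: str, value: float, labels: Dict = None):
        self.gauges[self._key(name, labels)] = value

    def record_timer(self, name: str, duration: float, labels: Dict = None):
        self.timers[self._key(name, labels)].append(duration)


metrics = InMemoryMetrics()


def timed(operation_name: Optional[str] = None):
    """Decorator factory that records how long each call takes."""
    def decorator(func):
        name = operation_name or func.__name__

        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            start = time.perf_counter()
            try:
                return func(*args, **kwargs)
            finally:
                # Recorded even when the wrapped function raises.
                metrics.record_timer(name, time.perf_counter() - start)
        return wrapper
    return decorator


@timed("document_processing")
def process(docs):
    for _ in docs:
        metrics.record_counter("documents_processed", 1, {"status": "success"})
    return len(docs)


print(process(["a", "b", "c"]))
print(dict(metrics.counters))
```

Keying each series by name plus sorted labels follows the usual convention in label-based metric systems, where every distinct label combination is tracked as its own series.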

Configuration Parameters

Parameter       | Type    | Default                              | Description
debug           | boolean | false                                | Enable debug logging and verbose output
max_workers     | integer | 4                                    | Maximum concurrent processing threads
solr_url        | string  | http://localhost:8983/solr/super_rag | Apache Solr endpoint URL
metrics_enabled | boolean | true                                 | Enable enterprise monitoring and metrics

REST API Endpoints

Data Processing

HTTP
POST /api/search
Content-Type: application/json

{
  "query": "machine learning datasets",
  "limit": 50,
  "filters": {
    "quality_min": 0.8,
    "date_from": "2023-01-01"
  }
}

Response:
{
  "query": "machine learning datasets",
  "total_results": 1247,
  "results": [...],
  "status": "success",
  "processing_time": 2.34
}
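
A client for this endpoint needs only the standard library. The sketch below builds the POST request and prints it without sending anything, so it runs without a server. The base URL http://localhost:5000 is an assumption taken from the [API] section of config.ini, and build_search_request is a hypothetical helper.

```python
import json
import urllib.request


def build_search_request(base_url: str, query: str, limit: int = 10, quality_min=None):
    """Build (but do not send) a POST request for /api/search."""
    body = {"query": query, "limit": limit}
    if quality_min is not None:
        body["filters"] = {"quality_min": quality_min}
    return urllib.request.Request(
        url=f"{base_url.rstrip('/')}/api/search",
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


# Assumed base URL; adjust host and port to match your deployment.
request = build_search_request(
    "http://localhost:5000", "machine learning datasets", limit=50, quality_min=0.8
)
print(request.get_method(), request.full_url)
print(request.data.decode("utf-8"))

# To send it against a running server:
# with urllib.request.urlopen(request, timeout=30) as response:
#     payload = json.load(response)
#     print(payload["total_results"])
```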

Health & Monitoring

HTTP
GET /api/health

Response:
{
  "status": "healthy",
  "components": {
    "solr": "online",
    "crawler": "operational",
    "monitoring": "enabled"
  },
  "uptime": 86400,
  "version": "1.0.0"
}

GET /api/metrics

Response: Prometheus-formatted metrics

GET /api/stats

Response:
{
  "documents_processed": 1234567,
  "queries_today": 5678,
  "average_response_time": 1.23,
  "system_health": "excellent"
}
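
A monitoring script might reduce the /api/health response to a list of components that need attention. The sketch below treats the three states shown in the sample response as healthy. That set, the "offline" value in the sample, and the failing_components helper are all assumptions, since the full list of possible states is not documented here.

```python
import json

# States seen in the sample response above; assumed here to be the healthy ones.
HEALTHY_STATES = {"online", "operational", "enabled"}


def failing_components(health: dict) -> list:
    """Return the sorted names of components whose state is not in HEALTHY_STATES."""
    components = health.get("components", {})
    return sorted(name for name, state in components.items() if state not in HEALTHY_STATES)


# Sample payload modelled on the response above, with Solr marked "offline" for illustration.
sample = json.loads("""
{
  "status": "degraded",
  "components": {
    "solr": "offline",
    "crawler": "operational",
    "monitoring": "enabled"
  },
  "uptime": 86400,
  "version": "1.0.0"
}
""")

problems = failing_components(sample)
print("Components needing attention:", problems or "none")
```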