🎯 Overview
ALACTIC AGI Framework is an enterprise-grade AI dataset processing platform designed for scalable data acquisition, validation, and structuring. It ships with production-ready monitoring and enterprise observability.
Key Features
- Automated Data Acquisition: Web-scale scraping with Scrapy integration
- Intelligent Data Validation: AI-powered quality scoring and duplicate detection
- Structured Data Output: Multiple formats (JSON, Parquet, HDF5) for ML pipelines
- Enterprise Monitoring: Prometheus + Grafana + AlertManager integration
- Production Scalability: Handles 100M+ sources with distributed processing
- Windows-First Design: Optimized for enterprise Windows environments
Architecture Overview
| Component | Technology | Purpose | Port |
|---|---|---|---|
| Core Framework | Python 3.10+ | Main processing engine | 5000 |
| Web Crawler | Scrapy | Data acquisition | - |
| Search Engine | Apache Solr 9 | Data indexing & retrieval | 8983 |
| API Server | Node.js | REST API interface | 3000 |
| Monitoring | Prometheus + Grafana | Observability & alerting | 9090, 3001 |
📋 System Requirements
🖥️ Operating System
- Windows 10/11 (Recommended)
- Windows Server 2019/2022
- Linux (Ubuntu 20.04+, CentOS 8+)
- macOS 10.15+ (Development only)
🔧 Software Dependencies
- Python 3.10 or higher
- Node.js 18.0+ (for API server)
- Java 17+ (for Apache Solr)
- Git 2.30+ (for source control)
💻 Hardware Specifications
- Minimum: 8GB RAM, 4 CPU cores
- Recommended: 16GB RAM, 8 CPU cores
- Enterprise: 32GB+ RAM, 16+ CPU cores
- Storage: 100GB+ available space
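To see which tier a machine falls into, the thresholds above can be encoded in a few lines. This is a minimal sketch: `classify_hardware` and `TIERS` are illustrative names, not part of the framework, and the RAM figure has to be supplied by the caller (for example from `psutil.virtual_memory().total`).

```python
import os

# Tier thresholds copied from the hardware list above: (name, min RAM in GB, min CPU cores).
TIERS = [
    ("enterprise", 32, 16),
    ("recommended", 16, 8),
    ("minimum", 8, 4),
]

def classify_hardware(ram_gb: float, cpu_cores: int) -> str:
    """Return the highest tier whose RAM and core thresholds are both met."""
    for name, min_ram, min_cores in TIERS:
        if ram_gb >= min_ram and cpu_cores >= min_cores:
            return name
    return "below minimum"

# os.cpu_count() reports logical cores and may return None on unusual platforms.
print("Logical cores on this machine:", os.cpu_count())
```

Note that both thresholds must be met: a machine with 32GB of RAM but only 4 cores is still classified as "minimum".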
🌐 Network Requirements
- Internet connectivity for data scraping
- Open ports: 3000, 3001, 5000, 8080, 8983, 9090
- Firewall exceptions for HTTP/HTTPS
- DNS resolution for external APIs
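Once the services are started, a quick way to confirm that each port is reachable is a plain TCP connection test. The sketch below uses only the standard library; the service labels are inferred from the architecture table and the monitoring configuration in this guide, and the helper names are illustrative. It checks whether something is listening, not whether a firewall rule exists.

```python
import socket

# Ports from the architecture table and the open-ports list; labels are inferred.
REQUIRED_PORTS = {
    3000: "Node.js API server",
    3001: "Grafana",
    5000: "Core framework",
    8080: "Monitoring endpoint",
    8983: "Apache Solr",
    9090: "Prometheus",
}

def is_port_open(host: str, port: int, timeout: float = 0.5) -> bool:
    """Return True if a TCP connection to host:port succeeds within the timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False

def check_ports(host: str = "127.0.0.1") -> dict:
    """Map each required port to whether something is accepting connections on it."""
    return {port: is_port_open(host, port) for port in REQUIRED_PORTS}
```

Calling `check_ports()` before starting the stack should report every port as closed; a port that is already open at that point is probably in use by another application.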
Enterprise Environments: For production deployments processing 1M+ documents daily, we recommend 32GB+ RAM and SSD storage for optimal performance.
🚀 Installation Guide
Prerequisites Installation
Python 3.10+ Setup
PowerShell
# Download Python 3.10+ from python.org
# Or use Windows Package Manager
winget install Python.Python.3.10
# Verify installation
python --version
# Expected output: Python 3.10.x or higher
Node.js Installation
PowerShell
# Download from nodejs.org or use package manager
winget install OpenJS.NodeJS
# Verify installation
node --version
npm --version
Java 17+ Installation
PowerShell
# Install Java 17 (required for Apache Solr)
winget install Eclipse.Temurin.17.JDK
# Verify installation
java -version
# Expected: openjdk version "17.x.x" or higher
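The three version checks above can also be scripted. The following is a sketch under the assumption that `node` and `java` are on the PATH; the function names are illustrative. `java -version` writes to stderr rather than stdout, so both streams are read.

```python
import re
import subprocess
import sys

def parse_version(text: str) -> tuple:
    """Extract the first dotted version number from tool output as a tuple of ints."""
    match = re.search(r"(\d+)(?:\.(\d+))?(?:\.(\d+))?", text)
    if match is None:
        raise ValueError(f"no version number found in {text!r}")
    return tuple(int(part) for part in match.groups() if part is not None)

def tool_version(command: list) -> tuple:
    """Run a version command and parse its output (java -version writes to stderr)."""
    done = subprocess.run(command, capture_output=True, text=True, check=False)
    return parse_version(done.stdout + done.stderr)

def python_ok() -> bool:
    """Check the running interpreter against the 3.10 minimum."""
    return sys.version_info >= (3, 10)
```

For example, `tool_version(["node", "--version"]) >= (18, 0)` and `tool_version(["java", "-version"]) >= (17,)` express the Node.js and Java requirements. A missing executable raises `FileNotFoundError`, which is itself a useful signal.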
Framework Download & Setup
Git
# Clone the ALACTIC AGI repository
git clone https://github.com/AlacticAI/alactic-agi.git
cd alactic-agi
# Create and activate virtual environment
python -m venv .venv
# Windows activation
.venv\Scripts\activate
# Linux/macOS activation
source .venv/bin/activate
Note: Always use a virtual environment to avoid dependency conflicts with other Python projects.
Dependencies Installation
PowerShell
# Install Python dependencies
pip install -r requirements.txt
# Install monitoring dependencies (optional)
pip install prometheus_client grafana-api psutil
# Verify core installations
python -c "import scrapy; print('Scrapy installed successfully')"
python -c "import requests; print('Requests installed successfully')"
Key Dependencies Installed:
| Package | Version | Purpose |
|---|---|---|
| scrapy | 2.10+ | Web crawling framework |
| requests | 2.31+ | HTTP client library |
| beautifulsoup4 | 4.12+ | HTML/XML parsing |
| pandas | 2.0+ | Data manipulation |
| flask | 2.3+ | Web framework for APIs |
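To compare what is actually installed against the minimums in the table, `importlib.metadata` from the standard library can be used. This is a rough sketch: the version comparison only looks at the leading numeric parts, so pre-release suffixes are ignored, and the helper names are illustrative.

```python
from importlib import metadata

# Minimum versions copied from the table above.
MINIMUMS = {"scrapy": "2.10", "requests": "2.31", "beautifulsoup4": "4.12",
            "pandas": "2.0", "flask": "2.3"}

def numeric_prefix(version: str) -> tuple:
    """Turn '2.10.1' into (2, 10, 1), stopping at the first non-numeric part."""
    parts = []
    for piece in version.split("."):
        if not piece.isdigit():
            break
        parts.append(int(piece))
    return tuple(parts)

def check_packages(minimums: dict = MINIMUMS) -> dict:
    """Return {package: 'ok' | 'too old (x)' | 'missing'} for each requirement."""
    report = {}
    for name, minimum in minimums.items():
        try:
            installed = metadata.version(name)
        except metadata.PackageNotFoundError:
            report[name] = "missing"
            continue
        ok = numeric_prefix(installed) >= numeric_prefix(minimum)
        report[name] = "ok" if ok else f"too old ({installed})"
    return report
```

Comparing tuples of integers rather than strings matters here: as strings, "2.9" sorts after "2.10", which would wrongly accept an older Scrapy release.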
Apache Solr Setup
PowerShell
# Solr is included in the tools directory
cd tools/solr9
# Start Solr (Windows)
.\bin\solr.cmd start
# Start Solr (Linux/macOS)
./bin/solr start
# Create the ALACTIC AGI core (Windows)
.\bin\solr.cmd create -c super_rag
# Create the ALACTIC AGI core (Linux/macOS)
./bin/solr create -c super_rag
# Verify Solr is running
curl http://localhost:8983/solr/admin/info/system
Success! Solr should now be accessible at http://localhost:8983
Node.js API Server Setup
PowerShell
# Navigate to the API directory (run from the project root)
cd api
# Install Node.js dependencies
npm install
# Verify dependencies
npm list
# Start API server (development)
npm run dev
# Start API server (production)
npm start
⚙️ Configuration
Core Configuration File
The main configuration is stored in config.ini:
INI
[DEFAULT]
# Core Framework Settings
debug = false
log_level = INFO
max_workers = 4
[DATABASE]
# Solr Configuration
solr_url = http://localhost:8983/solr/super_rag
solr_timeout = 30
[SCRAPING]
# Web Crawling Settings
max_pages = 10000
delay = 1.0
respect_robots = true
user_agent = ALACTIC-AGI/1.0
[API]
# API Server Configuration
host = 0.0.0.0
port = 5000
cors_enabled = true
[MONITORING]
# Monitoring & Observability
metrics_enabled = true
prometheus_port = 8080
grafana_enabled = true
alert_email = support@alacticai.com
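A file in this format can be read with Python's standard `configparser`. The sketch below parses a shortened copy of the configuration from a string; how the framework itself loads `config.ini` is not shown in this guide, so `load_settings` is only an illustration.

```python
import configparser

SAMPLE = """
[DEFAULT]
debug = false
log_level = INFO
max_workers = 4

[DATABASE]
solr_url = http://localhost:8983/solr/super_rag
solr_timeout = 30

[SCRAPING]
delay = 1.0
respect_robots = true
"""

def load_settings(text: str) -> dict:
    """Parse INI text and return a few typed settings."""
    parser = configparser.ConfigParser()
    parser.read_string(text)
    return {
        "debug": parser.getboolean("DEFAULT", "debug"),
        "max_workers": parser.getint("DEFAULT", "max_workers"),
        "solr_url": parser.get("DATABASE", "solr_url"),
        "solr_timeout": parser.getint("DATABASE", "solr_timeout"),
        "delay": parser.getfloat("SCRAPING", "delay"),
        # Keys in [DEFAULT] are visible from every other section.
        "scraping_log_level": parser.get("SCRAPING", "log_level"),
    }

settings = load_settings(SAMPLE)
print(settings)
```

One consequence of using a `[DEFAULT]` section is that its keys act as fallbacks in every other section, so `log_level` can be read from `[SCRAPING]` even though it is not defined there.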
Environment Variables
For production deployments, use environment variables to override configuration:
PowerShell
# Set environment variables (Windows)
$env:ALACTIC_DEBUG = "false"
$env:ALACTIC_SOLR_URL = "http://production-solr:8983/solr/super_rag"
$env:ALACTIC_LOG_LEVEL = "WARNING"
$env:ALACTIC_MAX_WORKERS = "8"
# Linux/macOS
export ALACTIC_DEBUG=false
export ALACTIC_SOLR_URL=http://production-solr:8983/solr/super_rag
export ALACTIC_LOG_LEVEL=WARNING
export ALACTIC_MAX_WORKERS=8
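One way such overrides could be layered on top of `config.ini` is sketched below. The mapping from variable name to section and key is an assumption made for illustration; the framework's actual precedence rules are not documented here.

```python
import configparser
import os

# Hypothetical mapping from environment variable to (section, key) in config.ini.
ENV_MAP = {
    "ALACTIC_DEBUG": ("DEFAULT", "debug"),
    "ALACTIC_LOG_LEVEL": ("DEFAULT", "log_level"),
    "ALACTIC_MAX_WORKERS": ("DEFAULT", "max_workers"),
    "ALACTIC_SOLR_URL": ("DATABASE", "solr_url"),
}

def apply_env_overrides(parser, environ=os.environ):
    """Overwrite config values with any matching environment variables."""
    applied = []
    for variable, (section, key) in ENV_MAP.items():
        if variable in environ:
            if section != "DEFAULT" and not parser.has_section(section):
                parser.add_section(section)
            parser.set(section, key, environ[variable])
            applied.append(variable)
    return applied
```

The function returns the list of variables it applied, which is convenient to log at startup so that operators can see which values came from the environment rather than from the file.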
Solr Configuration
Custom Solr schema for optimal performance:
XML
<!-- solr_config/schema.xml -->
<!-- "version" is the Solr schema syntax version (1.6 is current for Solr 9), not the application version -->
<schema name="alactic-agi" version="1.6">
  <field name="id" type="string" indexed="true" stored="true" required="true" multiValued="false"/>
  <field name="title" type="text_general" indexed="true" stored="true"/>
  <field name="content" type="text_general" indexed="true" stored="true"/>
  <field name="url" type="string" indexed="true" stored="true"/>
  <field name="timestamp" type="pdate" indexed="true" stored="true" default="NOW"/>
  <field name="quality_score" type="pfloat" indexed="true" stored="true"/>
  <uniqueKey>id</uniqueKey>
</schema>
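A document that fits this schema can be assembled as a plain dictionary before it is sent to Solr. The sketch below makes two assumptions that are not stated in this guide: the `id` is derived from a hash of the URL, and `quality_score` lies between 0 and 1.

```python
import hashlib
from datetime import datetime, timezone

def make_solr_doc(url: str, title: str, content: str, quality_score: float) -> dict:
    """Build a dict whose keys match the schema fields above."""
    if not 0.0 <= quality_score <= 1.0:
        raise ValueError("quality_score is assumed to lie in [0, 1]")
    return {
        # Assumption: derive a stable id from the URL so re-crawls overwrite the same document.
        "id": hashlib.sha256(url.encode("utf-8")).hexdigest(),
        "title": title,
        "content": content,
        "url": url,
        # Solr date fields expect UTC in the form YYYY-MM-DDThh:mm:ssZ.
        "timestamp": datetime.now(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ"),
        "quality_score": float(quality_score),
    }
```

Because `id` is the `uniqueKey`, indexing a second document with the same URL would replace the first one instead of creating a duplicate.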
📊 Enterprise Monitoring Setup
Enterprise Feature: ALACTIC AGI includes comprehensive monitoring with Prometheus, Grafana, and AlertManager for production-grade observability.
Quick Monitoring Setup
Install Monitoring Dependencies
PowerShell
# Install monitoring packages
pip install prometheus_client psutil grafana-api
# Verify installation
python -c "import prometheus_client; print('Monitoring ready!')"
Start Monitoring Demo
PowerShell
# Start the monitoring demonstration
python monitoring_demo.py
# Access monitoring endpoints:
# http://localhost:8080/ - Dashboard
# http://localhost:8080/metrics - Prometheus metrics
# http://localhost:8080/health - Health check
Test Core Monitoring
PowerShell
# Run comprehensive monitoring tests
python test_monitoring_core.py
# Expected output: All tests PASSED
# Metrics rate: 500K+ metrics/second
# Memory usage: Stable under load
Monitoring Metrics
| Metric Type | Examples | Description |
|---|---|---|
| Business Metrics | query_total, documents_processed | Core business KPIs |
| Performance | query_duration_seconds, api_response_time | System performance tracking |
| System Health | system_cpu_percent, system_memory_percent | Infrastructure monitoring |
| Error Tracking | query_error_total, pipeline_failures | Error rates and types |
Integration with Existing Systems
Python
from monitoring import MetricsCollector, timed
# Initialize monitoring
metrics = MetricsCollector()
# Track custom business metrics
@timed("document_processing")
def process_documents(documents):
    results = []
    for doc in documents:
        # Your processing logic (process_single_document is your own function)
        result = process_single_document(doc)
        results.append(result)
        # Record one processed document, labelled by type and status
        metrics.record_counter("documents_processed", 1,
                               {"type": doc.type, "status": "success"})
        # Report how many documents are still waiting to be processed
        metrics.record_gauge("processing_queue_size", len(documents) - len(results))
    return results
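For readers curious how a decorator like `timed` might work internally, here is a standard-library sketch. It is not the framework's implementation: `timed_sketch` and the `TIMINGS` store are stand-ins, and the real decorator presumably reports to `MetricsCollector` instead.

```python
import functools
import time
from collections import defaultdict

# Stand-in store; the real MetricsCollector is not shown in this guide.
TIMINGS = defaultdict(list)

def timed_sketch(operation_name=None):
    """Decorator factory that records each call's wall-clock duration in seconds."""
    def decorator(func):
        name = operation_name or func.__name__
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            start = time.perf_counter()
            try:
                return func(*args, **kwargs)
            finally:
                # Record the duration even when the wrapped function raises.
                TIMINGS[name].append(time.perf_counter() - start)
        return wrapper
    return decorator

@timed_sketch("document_processing")
def process_documents(documents):
    return [doc.upper() for doc in documents]
```

Recording in a `finally` block means failed calls are timed too, which keeps latency figures honest when errors are slow.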
🎯 Usage Guide
Quick Start
Python
from alactic_framework import AlacticAGI
# Initialize the framework
agi = AlacticAGI(config_file='config.ini')
# Process a simple query
result = agi.run_pipeline("artificial intelligence datasets")
# Display results
for item in result:
    print(f"Title: {item['title']}")
    print(f"URL: {item['url']}")
    print(f"Quality Score: {item['quality_score']}")
Advanced Usage Examples
1. Batch Document Processing
Python
import asyncio
from alactic_framework import AlacticAGI
async def batch_process():
    # Assumes run_pipeline returns an awaitable when retriever_type='async'
    agi = AlacticAGI(retriever_type='async')
    queries = [
        "machine learning research papers",
        "natural language processing datasets",
        "computer vision benchmarks"
    ]
    # Process multiple queries concurrently
    tasks = [agi.run_pipeline(query) for query in queries]
    results = await asyncio.gather(*tasks)
    return results
# Run batch processing
results = asyncio.run(batch_process())
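With a long query list, launching every query at once can overwhelm the crawler or Solr. A common pattern is to cap the number of queries in flight with a semaphore. The sketch below uses `fake_pipeline`, a hypothetical stand-in for an awaitable pipeline call, so that it runs without the framework installed.

```python
import asyncio

async def fake_pipeline(query: str) -> list:
    """Hypothetical stand-in for an awaitable pipeline call."""
    await asyncio.sleep(0.01)
    return [{"title": f"result for {query}", "quality_score": 0.9}]

async def run_bounded(queries, pipeline, max_concurrent: int = 2) -> dict:
    """Run pipeline(query) for every query with at most max_concurrent in flight."""
    semaphore = asyncio.Semaphore(max_concurrent)
    async def guarded(query):
        async with semaphore:
            return await pipeline(query)
    # return_exceptions=True keeps one failing query from discarding the others.
    outcomes = await asyncio.gather(*(guarded(q) for q in queries), return_exceptions=True)
    return dict(zip(queries, outcomes))

results = asyncio.run(run_bounded(["a", "b", "c"], fake_pipeline))
print(sorted(results))
```

With `return_exceptions=True`, a failed query appears in the result dictionary as an exception object, so callers should check each value with `isinstance(value, Exception)` before using it.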
2. Custom Data Pipeline
Python
from alactic_framework import AlacticAGI
from monitoring import PerformanceProfiler
class CustomDataPipeline:
    def __init__(self):
        self.agi = AlacticAGI()
        self.profiler = PerformanceProfiler(self.agi.metrics)
    def process_with_custom_validation(self, query, min_quality=0.8):
        with self.profiler.profile_operation("custom_pipeline"):
            # Get raw results
            raw_results = self.agi.run_pipeline(query)
            # Apply custom filtering
            filtered_results = [
                result for result in raw_results
                if result.get('quality_score', 0) >= min_quality
            ]
            # Record custom metrics (guard against an empty result set)
            ratio = len(filtered_results) / len(raw_results) if raw_results else 0.0
            self.agi.metrics.record_gauge("custom_filter_ratio", ratio)
            return filtered_results
# Use custom pipeline
pipeline = CustomDataPipeline()
high_quality_results = pipeline.process_with_custom_validation(
    "AI research datasets", min_quality=0.9
)
3. API Integration
Python
from flask import Flask, request, jsonify
from alactic_framework import AlacticAGI
app = Flask(__name__)
agi = AlacticAGI()
@app.route('/api/search', methods=['POST'])
def search_endpoint():
    # silent=True returns None instead of raising on a missing or invalid JSON body
    data = request.get_json(silent=True) or {}
    query = data.get('query')
    limit = data.get('limit', 10)
    if not query:
        return jsonify({'error': 'query is required', 'status': 'error'}), 400
    try:
        results = agi.run_pipeline(query)
        # Format response
        response = {
            'query': query,
            'total_results': len(results),
            'results': results[:limit],
            'status': 'success'
        }
        return jsonify(response)
    except Exception as e:
        return jsonify({'error': str(e), 'status': 'error'}), 500
if __name__ == '__main__':
    # Keep debug off when binding to all interfaces: the debugger allows code execution
    app.run(host='0.0.0.0', port=5000, debug=False)
Command Line Interface
PowerShell
# Basic query
python alactic_agi.py --query "machine learning datasets"
# Advanced options
python alactic_agi.py --query "AI research" --max-results 100 --format json
# Batch processing from file
python alactic_agi.py --batch-file queries.txt --output results.json
# Start monitoring dashboard
python alactic_agi.py --start-monitoring
# Health check
python alactic_agi.py --health-check
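The flags above map naturally onto Python's `argparse`. The sketch below is a hypothetical reconstruction, not the actual source of `alactic_agi.py`: the default for `--max-results` and the accepted `--format` values (taken from the output formats listed under Key Features) are assumptions.

```python
import argparse

def build_parser() -> argparse.ArgumentParser:
    """Hypothetical parser mirroring the flags shown above."""
    parser = argparse.ArgumentParser(prog="alactic_agi.py")
    parser.add_argument("--query", help="single search query")
    parser.add_argument("--max-results", type=int, default=10)  # default is an assumption
    parser.add_argument("--format", choices=["json", "parquet", "hdf5"], default="json")
    parser.add_argument("--batch-file", help="text file with one query per line")
    parser.add_argument("--output", help="path for the results file")
    parser.add_argument("--start-monitoring", action="store_true")
    parser.add_argument("--health-check", action="store_true")
    return parser

args = build_parser().parse_args(
    ["--query", "AI research", "--max-results", "100", "--format", "json"]
)
print(args.query, args.max_results, args.format)
```

`argparse` converts dashes in flag names to underscores, so `--max-results` is read back as `args.max_results`.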
🔧 Troubleshooting
Common Issues & Solutions
Before troubleshooting: Always check the logs in alactic_agi.log for detailed error information.
Installation Issues
| Problem | Symptoms | Solution |
|---|---|---|
| Python version incompatibility | ImportError, syntax errors | Install Python 3.10+ and recreate virtual environment |
| Scrapy installation fails | pip install errors on Windows | Install Visual C++ Build Tools or use conda |
| Java not found | Solr fails to start | Install Java 17+ and set JAVA_HOME environment variable |
| Permission denied | Access errors on file operations | Run as administrator or check folder permissions |
Runtime Issues
PowerShell
# Common diagnostic commands
# Check if all services are running
python alactic_agi.py --health-check
# Test Solr connectivity
curl http://localhost:8983/solr/admin/ping
# Check Python dependencies (on Linux/macOS use: pip list | grep -E "scrapy|requests|flask")
pip list | Select-String -Pattern "scrapy|requests|flask"
# Monitor system resources
python -c "import psutil; print(f'CPU: {psutil.cpu_percent()}%, Memory: {psutil.virtual_memory().percent}%')"
# Test monitoring system
python test_monitoring_core.py
Performance Issues
- Slow query responses: Check Solr index size and consider reindexing
- High memory usage: Reduce max_workers in config.ini
- Network timeouts: Increase timeout values in configuration
- Disk space issues: Clear logs and temporary files regularly
Error Code Reference
| Error Code | Description | Action |
|---|---|---|
| AGI-001 | Configuration file not found | Ensure config.ini exists in project root |
| AGI-002 | Solr connection failed | Check if Solr is running on port 8983 |
| AGI-003 | Scraping rate limit exceeded | Increase delay in scraping configuration |
| AGI-004 | Memory allocation error | Increase system RAM or reduce batch size |
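When scanning a long log file, it can help to pull out the error codes and pair them with the actions from the table. This sketch assumes that codes appear literally in the log text (for example `AGI-002`); the exact log line format is not documented here, and `suggest_actions` is an illustrative helper.

```python
import re

# Actions copied from the error code table above.
ACTIONS = {
    "AGI-001": "Ensure config.ini exists in project root",
    "AGI-002": "Check if Solr is running on port 8983",
    "AGI-003": "Increase delay in scraping configuration",
    "AGI-004": "Increase system RAM or reduce batch size",
}

def suggest_actions(log_text: str) -> dict:
    """Return {code: action} for every known error code found in the log text."""
    found = re.findall(r"\bAGI-\d{3}\b", log_text)
    # dict.fromkeys removes repeats while keeping first-seen order.
    return {code: ACTIONS[code] for code in dict.fromkeys(found) if code in ACTIONS}
```

For example, `suggest_actions(open("alactic_agi.log").read())` would list one suggested action per distinct code in the log.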
Getting Help
Support Channels:
- 📧 Email: support@alacticai.com
- 📖 Documentation: docs.alacticai.com
- 🐛 Bug Reports: GitHub Issues
- 💬 Community: Discord Server
📚 API Reference
Core Classes
AlacticAGI Class
Python
from typing import Dict, List
class AlacticAGI:
    """
    Main ALACTIC AGI Framework class for AI dataset processing.
    Args:
        config_file (str): Path to configuration file
        retriever_type (str): 'sync' or 'async' for retrieval mode
    Attributes:
        config: Configuration manager
        metrics: Metrics collection system
        components: Dictionary of initialized components
    """
    def __init__(self, config_file='config.ini', retriever_type='async'):
        # Initialize framework components
        ...
    def run_pipeline(self, query: str) -> List[Dict]:
        """
        Execute the complete data processing pipeline.
        Args:
            query (str): Search query for data acquisition
        Returns:
            List[Dict]: Processed and structured data results
        Raises:
            AGIException: If pipeline execution fails
        """
    def get_health_status(self) -> Dict:
        """
        Get comprehensive health status of all components.
        Returns:
            Dict: Health status with component details
        """
Monitoring Classes
Python
from typing import Dict, Optional
class MetricsCollector:
    """Enterprise-grade metrics collection system"""
    def record_counter(self, name: str, value: int = 1, labels: Optional[Dict] = None):
        """Record counter metric"""
    def record_gauge(self, name: str, value: float, labels: Optional[Dict] = None):
        """Record gauge metric"""
    def record_timer(self, name: str, duration: float, labels: Optional[Dict] = None):
        """Record timing metric"""
class PerformanceProfiler:
    """Performance profiling and analysis"""
    def profile_operation(self, operation_name: str, labels: Optional[Dict] = None):
        """Context manager for operation profiling"""
def timed(operation_name: Optional[str] = None):
    """Decorator for automatic timing of functions"""
Configuration Parameters
| Parameter | Type | Default | Description |
|---|---|---|---|
| debug | boolean | false | Enable debug logging and verbose output |
| max_workers | integer | 4 | Maximum concurrent processing threads |
| solr_url | string | http://localhost:8983/solr/super_rag | Apache Solr endpoint URL |
| metrics_enabled | boolean | true | Enable enterprise monitoring and metrics |
REST API Endpoints
Data Processing
HTTP
POST /api/search
Content-Type: application/json
{
  "query": "machine learning datasets",
  "limit": 50,
  "filters": {
    "quality_min": 0.8,
    "date_from": "2023-01-01"
  }
}
Response:
{
  "query": "machine learning datasets",
  "total_results": 1247,
  "results": [...],
  "status": "success",
  "processing_time": 2.34
}
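A client for this endpoint needs nothing beyond the standard library. The sketch below only builds the request so that it can be inspected without a running server; the base URL `http://localhost:5000` is an assumption based on the Flask example in the Usage Guide, and `build_search_request` is an illustrative helper.

```python
import json
import urllib.request

def build_search_request(base_url: str, query: str, limit: int = 50,
                         quality_min: float = 0.8) -> urllib.request.Request:
    """Build (but do not send) a POST request for /api/search."""
    payload = {"query": query, "limit": limit, "filters": {"quality_min": quality_min}}
    return urllib.request.Request(
        url=base_url.rstrip("/") + "/api/search",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )

request = build_search_request("http://localhost:5000", "machine learning datasets")
# To send it against a running server:
#   with urllib.request.urlopen(request, timeout=30) as resp:
#       body = json.load(resp)
```

Setting the `Content-Type` header explicitly matters because servers commonly reject or ignore a JSON body that is not declared as `application/json`.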
Health & Monitoring
HTTP
GET /api/health
Response:
{
  "status": "healthy",
  "components": {
    "solr": "online",
    "crawler": "operational",
    "monitoring": "enabled"
  },
  "uptime": 86400,
  "version": "1.0.0"
}
GET /api/metrics
Response: Prometheus-formatted metrics
GET /api/stats
Response:
{
  "documents_processed": 1234567,
  "queries_today": 5678,
  "average_response_time": 1.23,
  "system_health": "excellent"
}
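A monitoring script can turn the `/api/health` payload into a simple list of failing components. The sketch below treats the three states in the sample response as healthy and anything else as a problem; the full set of states the API can return is not documented here, so that rule is an assumption.

```python
import json

# Component states treated as healthy, taken from the sample /api/health response.
HEALTHY_STATES = {"online", "operational", "enabled"}

def unhealthy_components(payload: str) -> list:
    """Return sorted names of components whose state is not a known healthy value."""
    data = json.loads(payload)
    components = data.get("components", {})
    return sorted(name for name, state in components.items()
                  if state not in HEALTHY_STATES)
```

An empty list means every reported component is in a known healthy state; a payload with no `components` key also yields an empty list, so callers may want to check the top-level `status` field as well.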