# Performance Guide
This guide provides recommendations for optimizing Shugur Relay performance in production environments.
## Capacity Planning

### Hardware Recommendations

#### Standalone Deployment
| Load Level | CPU | RAM | Storage | Network |
|---|---|---|---|---|
| Light (< 1K events/day) | 2 cores | 4 GB | 50 GB SSD | 100 Mbps |
| Medium (< 100K events/day) | 4 cores | 8 GB | 200 GB SSD | 1 Gbps |
| Heavy (< 1M events/day) | 8 cores | 16 GB | 500 GB NVMe | 10 Gbps |
| Enterprise (> 1M events/day) | 16+ cores | 32+ GB | 1+ TB NVMe | 10+ Gbps |
#### Distributed Deployment (per node)
| Component | CPU | RAM | Storage | Notes |
|---|---|---|---|---|
| Relay Node | 4-8 cores | 8-16 GB | 50 GB SSD | Stateless, can scale horizontally |
| Database Node | 8-16 cores | 16-64 GB | 500 GB+ NVMe | Storage grows with data retention |
| Load Balancer | 2-4 cores | 4-8 GB | 20 GB SSD | Nginx/HAProxy/Caddy |
### Network Requirements
- Latency: < 10ms between database nodes
- Bandwidth: 100 Mbps minimum per 1000 concurrent connections
- IPv6: Recommended for global accessibility
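The bandwidth guideline above can be turned into a quick sizing check. This is a sketch; `CONNECTIONS` is a hypothetical target figure, not a measured value:

```shell
# Estimate minimum bandwidth from the guideline above
# (100 Mbps per 1000 concurrent connections).
# CONNECTIONS is a hypothetical target; adjust for your deployment.
CONNECTIONS=25000
awk -v c="$CONNECTIONS" \
  'BEGIN { printf "minimum bandwidth: %.1f Gbps\n", c / 1000 * 100 / 1000 }'
# prints: minimum bandwidth: 2.5 Gbps
```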
## Configuration Optimization

### Relay Configuration
```yaml
RELAY:
  EVENT_CACHE_SIZE: 50000       # Increase for better read performance
  SEND_BUFFER_SIZE: 16384       # Larger buffer for high throughput
  WRITE_TIMEOUT: 60s            # Adjust based on network conditions
  IDLE_TIMEOUT: 300s            # Connection timeout
  THROTTLING:
    MAX_CONNECTIONS: 10000      # Scale based on server capacity
    MAX_CONTENT_LENGTH: 4096    # Allow larger events (Time Capsules need ~4KB)
    RATE_LIMIT:
      ENABLED: true
      MAX_EVENTS_PER_SECOND: 50     # Balance spam protection vs. usability
      MAX_REQUESTS_PER_SECOND: 100  # Allow more requests for active users
      BURST_SIZE: 20                # Allow bursts for normal usage patterns
      PROGRESSIVE_BAN: true         # Escalating ban durations
      BAN_DURATION: 5m              # Initial ban duration
      MAX_BAN_DURATION: 24h         # Maximum ban duration

# Enable advanced features
CAPSULES:
  ENABLED: true                 # Time Capsules support
  MAX_WITNESSES: 9              # For compatibility
```

## Database Optimization
### CockroachDB Settings
```sql
-- Increase cache sizes for better performance
SET CLUSTER SETTING kv.snapshot_rebalance.max_rate = '128MiB';
SET CLUSTER SETTING kv.snapshot_recovery.max_rate = '128MiB';
SET CLUSTER SETTING sql.stats.histogram_collection.enabled = true;

-- Optimize for write-heavy workloads
SET CLUSTER SETTING kv.range_merge.queue_interval = '50ms';
SET CLUSTER SETTING kv.raft.command.max_size = '64MiB';
```

### Storage Layout
```sql
-- Optimize table storage for events
ALTER TABLE shugur.events CONFIGURE ZONE USING
  range_min_bytes = 134217728,  -- 128 MB
  range_max_bytes = 536870912,  -- 512 MB
  gc.ttlseconds = 604800;       -- 7 days GC
```
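As a sanity check, the zone-config values above decode as follows (pure arithmetic on the numbers shown, no additional assumptions):

```shell
# Decode the zone-config byte sizes and the GC TTL.
awk 'BEGIN {
  printf "range_min_bytes = %d MiB\n", 134217728 / 1048576
  printf "range_max_bytes = %d MiB\n", 536870912 / 1048576
  printf "gc.ttlseconds   = %d days\n", 604800 / 86400
}'
# prints:
# range_min_bytes = 128 MiB
# range_max_bytes = 512 MiB
# gc.ttlseconds   = 7 days
```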
```sql
-- Separate hot and cold data
CREATE TABLE shugur.events_archive AS
SELECT * FROM shugur.events
WHERE created_at < extract(epoch from now() - interval '30 days');
```

## Operating System Optimization
### Linux Kernel Parameters
```
net.core.somaxconn = 65536
net.core.netdev_max_backlog = 5000
net.ipv4.tcp_max_syn_backlog = 65536
net.ipv4.tcp_fin_timeout = 30
net.ipv4.tcp_keepalive_time = 120
net.ipv4.tcp_keepalive_probes = 3
net.ipv4.tcp_keepalive_intvl = 15
net.ipv4.tcp_rmem = 4096 87380 6291456
net.ipv4.tcp_wmem = 4096 65536 4194304
vm.max_map_count = 262144
```

### System Limits
```
* soft nofile 1000000
* hard nofile 1000000
* soft nproc 1000000
* hard nproc 1000000
```

```
# /etc/systemd/system.conf
DefaultLimitNOFILE=1000000
DefaultLimitNPROC=1000000
```

### CPU Optimization
```bash
# Set CPU governor to performance
echo performance | sudo tee /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor

# Disable CPU frequency scaling
systemctl disable cpufreq

# Disable transparent huge pages (can cause latency spikes)
echo never > /sys/kernel/mm/transparent_hugepage/enabled
```

## Monitoring and Metrics
### Key Metrics to Monitor

#### Relay Metrics
- Connection Count: Current active WebSocket connections
- Event Rate: Events processed per second
- Response Time: Average response time for queries
- Error Rate: HTTP/WebSocket error rates
- Memory Usage: Go heap and system memory usage
- NIP Metrics: Time Capsules creation/unlock rates, Cashu operations
- Feature Usage: COUNT commands, search queries, specialized NIPs
#### Database Metrics
- Query Latency: P50, P95, P99 query response times
- Throughput: Queries per second (QPS)
- Connection Pool: Active/idle database connections
- Disk I/O: Read/write IOPS and throughput
- Replication Lag: For distributed deployments
#### System Metrics
- CPU Usage: Per-core utilization
- Memory Usage: Available vs. used memory
- Disk Usage: Space and I/O utilization
- Network: Bandwidth and packet rates
### Prometheus Configuration
```yaml
scrape_configs:
  - job_name: 'shugur-relay'
    static_configs:
      - targets: ['localhost:8181']  # Updated metrics port
    scrape_interval: 15s
    metrics_path: '/metrics'

  - job_name: 'cockroachdb'
    static_configs:
      - targets: ['localhost:9090']  # CockroachDB admin UI
    scrape_interval: 30s
    metrics_path: '/_status/vars'
```

### Grafana Dashboards
Create dashboards for:
- Relay Overview: Connections, events, performance
- Database Health: Query performance, storage, replication
- System Resources: CPU, memory, disk, network
- Alert Summary: Current alerts and system status
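A minimal Prometheus alerting rule can back the alert summary dashboard. This is a sketch: the metric names `relay_errors_total` and `relay_requests_total` are placeholders, not names the relay is documented to export; substitute the metrics visible at the `/metrics` endpoint.

```yaml
groups:
  - name: shugur-relay
    rules:
      - alert: RelayHighErrorRate
        # Placeholder metric names; substitute the relay's actual exported metrics.
        expr: rate(relay_errors_total[5m]) / rate(relay_requests_total[5m]) > 0.05
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "Relay error rate above 5% for 10 minutes"
```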
## Load Testing

### Test Setup
```bash
# Install bombardier for load testing
go install github.com/codesenberg/bombardier@latest

# Test WebSocket connections
bombardier -c 100 -d 60s -l ws://localhost:8080

# Test HTTP endpoints
bombardier -c 50 -d 30s -l http://localhost:8080/api/stats
```

### Benchmark Results
Based on production deployments and testing with v1.3.x:
| Metric | Standalone | Distributed (3-node) |
|---|---|---|
| Concurrent Connections | 10,000+ | 50,000+ |
| Events/Second | 5,000+ | 25,000+ |
| Query Latency (P95) | < 10ms | < 15ms |
| Memory Usage | ~200MB | ~150MB per node |
| Database Throughput | 2,000 writes/sec | 10,000+ writes/sec |
Real-world performance varies based on:
- Event complexity and size
- NIP features used (Time Capsules, search, etc.)
- Database configuration and hardware
- Network latency and bandwidth
## Scaling Strategies

### Vertical Scaling
- CPU: Add more cores for better concurrent processing
- Memory: Increase RAM for larger caches and buffers
- Storage: Use faster NVMe drives for better I/O performance
### Horizontal Scaling
- Relay Nodes: Add more stateless relay instances behind a load balancer
- Database Nodes: Scale CockroachDB cluster for better distribution
- Geographic Distribution: Deploy relay nodes in multiple regions
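One way to put stateless relay nodes behind a load balancer is an Nginx upstream with WebSocket upgrade headers. This is a sketch, not a shipped configuration: the hostnames, ports, and TLS details are placeholders, and the timeout should track your own `IDLE_TIMEOUT` setting.

```nginx
upstream shugur_relays {
    least_conn;                      # prefer the least-loaded node
    server relay1.internal:8080;     # placeholder hostnames
    server relay2.internal:8080;
    server relay3.internal:8080;
}

server {
    listen 443 ssl;
    server_name relay.example.com;

    location / {
        proxy_pass http://shugur_relays;
        # Required for the WebSocket upgrade handshake
        proxy_http_version 1.1;
        proxy_set_header Upgrade $http_upgrade;
        proxy_set_header Connection "upgrade";
        proxy_set_header Host $host;
        proxy_read_timeout 300s;     # match the relay's IDLE_TIMEOUT
    }
}
```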
### Caching Strategies
- Event Cache: Configure a larger `EVENT_CACHE_SIZE` for frequently accessed events
- CDN: Use a CDN for static assets and NIP-11 documents
- Redis: Consider external Redis for session management in distributed setups
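When sizing the event cache, a rough memory estimate helps. The average event size below is an assumption for illustration, not a measured figure:

```shell
# Rough memory estimate for the event cache: EVENT_CACHE_SIZE entries
# times an assumed ~2 KB average event size (assumption, not measured).
CACHE_SIZE=50000
AVG_EVENT_BYTES=2048
awk -v n="$CACHE_SIZE" -v b="$AVG_EVENT_BYTES" \
  'BEGIN { printf "approx cache memory: %.0f MB\n", n * b / 1048576 }'
# prints: approx cache memory: 98 MB
```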
## Troubleshooting Performance Issues

### High CPU Usage
```bash
# Check process CPU usage
top -p $(pgrep shugur-relay)

# Profile CPU usage
go tool pprof http://localhost:6060/debug/pprof/profile?seconds=30

# Check system load
vmstat 1 10
iostat -x 1 10
```

### High Memory Usage
```bash
# Check memory usage
free -h
ps aux | grep shugur-relay

# Profile memory usage
go tool pprof http://localhost:6060/debug/pprof/heap

# Check for memory leaks
valgrind --tool=memcheck --leak-check=full ./shugur-relay
```

### Slow Database Queries
```sql
-- Check slow queries in CockroachDB
SELECT query, count, avg_latency, max_latency
FROM crdb_internal.statement_statistics
WHERE avg_latency > interval '100ms'
ORDER BY avg_latency DESC;

-- Check table statistics
SHOW STATISTICS FOR TABLE events;

-- Analyze query plans
EXPLAIN (ANALYZE, VERBOSE) SELECT * FROM events WHERE pubkey = $1;
```

### Network Issues
```bash
# Check network connectivity
ping -c 5 database-server
telnet database-server 26257

# Monitor network traffic
netstat -i
iftop -i eth0

# Check DNS resolution
dig relay.example.com
nslookup database-server
```

## Best Practices
### Security
- Rate Limiting: Configure appropriate limits to prevent abuse
- Monitoring: Set up alerts for unusual activity patterns
- Updates: Keep software and dependencies up to date
- Backups: Regular database backups and disaster recovery testing
### Reliability
- Health Checks: Implement comprehensive health monitoring
- Circuit Breakers: Handle database connection failures gracefully
- Graceful Shutdown: Ensure clean shutdown procedures
- Rolling Updates: Deploy updates without downtime
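The health-check bullet above can be sketched as a small probe script. The `/health` endpoint path and port are assumptions; substitute whatever your relay actually exposes:

```shell
# Minimal liveness probe sketch. RELAY_URL is an assumed endpoint;
# adjust the host, port, and path for your deployment.
RELAY_URL="http://127.0.0.1:8080/health"
if curl -fsS --max-time 5 "$RELAY_URL" > /dev/null 2>&1; then
  echo "relay healthy"
else
  echo "relay unhealthy"
fi
```

A script like this slots directly into a systemd watchdog, a load balancer health check, or a cron-driven alert.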
### Operational
- Documentation: Maintain runbooks for common operations
- Automation: Automate routine maintenance tasks
- Testing: Regular performance and disaster recovery testing
- Capacity Planning: Monitor trends and plan for growth
## Next Steps

### For High-Load Deployments
- Implement Caching: Add Redis or Memcached for session/event caching
- Database Sharding: Consider partitioning strategies for very large datasets
- CDN Integration: Use CloudFlare or similar for global content distribution
- Multi-Region: Deploy across multiple geographic regions
### For Enterprise Deployments
- High Availability: Implement full redundancy and automated failover
- Disaster Recovery: Regular backups and cross-region replication
- Compliance: Implement audit logging and data retention policies
- Support: Establish monitoring, alerting, and incident response procedures
## Related Documentation
- Installation Guide: Choose your deployment method
- Architecture Overview: Understand the system design
- Configuration Guide: Configure your relay settings
- Troubleshooting Guide: Resolve performance issues
- API Reference: WebSocket and HTTP endpoint documentation