Performance Guide
This guide provides recommendations for optimizing Shugur Relay performance in production environments.
Capacity Planning
Hardware Recommendations
Standalone Deployment
Load Level | CPU | RAM | Storage | Network |
---|---|---|---|---|
Light (< 1K events/day) | 2 cores | 4 GB | 50 GB SSD | 100 Mbps |
Medium (< 100K events/day) | 4 cores | 8 GB | 200 GB SSD | 1 Gbps |
Heavy (< 1M events/day) | 8 cores | 16 GB | 500 GB NVMe | 10 Gbps |
Enterprise (> 1M events/day) | 16+ cores | 32+ GB | 1+ TB NVMe | 10+ Gbps |
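When provisioning is scripted, the load levels in the table above can be looked up from an expected daily event volume. The following is a minimal sketch; the function name `load_tier` is hypothetical, and treating exactly 1,000,000 events/day as Enterprise is an assumption, since the table leaves that boundary open.

```sh
# Map an expected daily event volume to a load level from the table above.
# Assumption: exactly 1,000,000 events/day counts as Enterprise.
load_tier() {    # usage: load_tier EVENTS_PER_DAY
  if [ "$1" -lt 1000 ]; then
    echo "Light"
  elif [ "$1" -lt 100000 ]; then
    echo "Medium"
  elif [ "$1" -lt 1000000 ]; then
    echo "Heavy"
  else
    echo "Enterprise"
  fi
}

load_tier 250000   # prints "Heavy"
```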
Distributed Deployment (per node)
Component | CPU | RAM | Storage | Notes |
---|---|---|---|---|
Relay Node | 4-8 cores | 8-16 GB | 50 GB SSD | Stateless, can scale horizontally |
Database Node | 8-16 cores | 16-64 GB | 500 GB+ NVMe | Storage grows with data retention |
Load Balancer | 2-4 cores | 4-8 GB | 20 GB SSD | Nginx/HAProxy/Caddy |
Network Requirements
- Latency: < 10ms between database nodes
- Bandwidth: 100 Mbps minimum per 1000 concurrent connections
- IPv6: Recommended for global accessibility
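The bandwidth guideline above can be turned into a quick estimate. This sketch assumes the requirement scales linearly with the number of concurrent connections, which the guideline implies but does not state; `min_bandwidth_mbps` is a hypothetical helper name.

```sh
# Estimate the minimum bandwidth (Mbps) for a number of concurrent connections,
# assuming linear scaling of "100 Mbps per 1000 connections", rounded up.
min_bandwidth_mbps() {   # usage: min_bandwidth_mbps CONCURRENT_CONNECTIONS
  echo $(( ($1 * 100 + 999) / 1000 ))
}

min_bandwidth_mbps 5000   # prints 500
```

Under that assumption, a relay capped at 5000 connections needs at least 500 Mbps, so a 1 Gbps link leaves headroom.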
Configuration Optimization
Relay Configuration
RELAY:
  EVENT_CACHE_SIZE: 50000         # Increase for better read performance
  SEND_BUFFER_SIZE: 16384         # Larger buffer for high-throughput
  WRITE_TIMEOUT: 30s              # Adjust based on network conditions
  THROTTLING:
    MAX_CONNECTIONS: 5000         # Scale based on server capacity
    MAX_CONTENT_LENGTH: 16384     # Allow larger events if needed
    RATE_LIMIT:
      MAX_EVENTS_PER_SECOND: 20   # Balance spam protection vs. usability
      MAX_REQUESTS_PER_SECOND: 50 # Allow more requests for active users
      BURST_SIZE: 10              # Allow bursts for normal usage patterns
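To reason about how MAX_EVENTS_PER_SECOND and BURST_SIZE might interact, it can help to model them as a token bucket. The sketch below is only an illustration of that common model, not the relay's actual limiter, whose behavior may differ. The helper names `tb_init` and `tb_allow` are hypothetical, and timestamps are passed in explicitly so the behavior is easy to follow.

```sh
# Token-bucket sketch in integer "milli-tokens" so plain shell arithmetic is enough.
# Assumption: MAX_EVENTS_PER_SECOND is the refill rate and BURST_SIZE the bucket capacity.
tb_init() {    # usage: tb_init RATE_PER_SECOND BURST NOW_MS
  TB_RATE=$1
  TB_CAP=$(( $2 * 1000 ))   # bucket capacity in milli-tokens
  TB_TOKENS=$TB_CAP         # start with a full bucket
  TB_LAST=$3
}

tb_allow() {   # usage: tb_allow NOW_MS; exit status 0 = accept, 1 = reject
  # RATE tokens per second equals RATE milli-tokens per millisecond.
  TB_TOKENS=$(( TB_TOKENS + ($1 - TB_LAST) * TB_RATE ))
  if [ "$TB_TOKENS" -gt "$TB_CAP" ]; then TB_TOKENS=$TB_CAP; fi
  TB_LAST=$1
  if [ "$TB_TOKENS" -ge 1000 ]; then
    TB_TOKENS=$(( TB_TOKENS - 1000 ))   # one event costs one whole token
    return 0
  fi
  return 1
}

# Example with MAX_EVENTS_PER_SECOND=20 and BURST_SIZE=10.
tb_init 20 10 0
accepted=0
i=0
while [ "$i" -lt 12 ]; do          # 12 events arriving in the same millisecond
  if tb_allow 0; then accepted=$(( accepted + 1 )); fi
  i=$(( i + 1 ))
done
echo "accepted $accepted of 12"    # prints "accepted 10 of 12"
```

Under this model a client can send 10 events at once and then roughly one event every 50 ms. Raising BURST_SIZE tolerates spikier clients, while raising MAX_EVENTS_PER_SECOND raises the sustained rate.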
Database Optimization
CockroachDB Settings
-- Raise snapshot transfer rates so rebalancing and recovery complete faster
SET CLUSTER SETTING kv.snapshot_rebalance.max_rate = '128MiB';
SET CLUSTER SETTING kv.snapshot_recovery.max_rate = '128MiB';
-- Collect histograms so the optimizer has accurate table statistics
SET CLUSTER SETTING sql.stats.histogram_collection.enabled = true;

-- Optimize for write-heavy workloads
SET CLUSTER SETTING kv.range_merge.queue_interval = '50ms';
SET CLUSTER SETTING kv.raft.command.max_size = '64MiB';
Storage Layout
-- Optimize table storage for events
ALTER TABLE shugur.events CONFIGURE ZONE USING
    range_min_bytes = 134217728, -- 128 MiB
    range_max_bytes = 536870912, -- 512 MiB
    gc.ttlseconds = 604800;      -- 7 days GC

-- Separate hot and cold data
CREATE TABLE shugur.events_archive AS
SELECT * FROM shugur.events
WHERE created_at < extract(epoch from now() - interval '30 days');
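The archive query compares created_at against a Unix timestamp 30 days in the past. For scripting or sanity-checking that cutoff outside the database, the same arithmetic can be sketched as below, assuming created_at stores seconds since the epoch (as the comparison above implies). `archive_cutoff` is a hypothetical helper name.

```sh
# Compute the Unix timestamp that lies DAYS days before NOW_EPOCH.
# Assumption: created_at holds seconds since the epoch.
archive_cutoff() {   # usage: archive_cutoff NOW_EPOCH DAYS
  echo $(( $1 - $2 * 86400 ))
}

# Events older than this timestamp are candidates for the archive table.
archive_cutoff "$(date +%s)" 30
```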
Operating System Optimization
Linux Kernel Parameters
net.core.somaxconn = 65536
net.core.netdev_max_backlog = 5000
net.ipv4.tcp_max_syn_backlog = 65536
net.ipv4.tcp_fin_timeout = 30
net.ipv4.tcp_keepalive_time = 120
net.ipv4.tcp_keepalive_probes = 3
net.ipv4.tcp_keepalive_intvl = 15
net.ipv4.tcp_rmem = 4096 87380 6291456
net.ipv4.tcp_wmem = 4096 65536 4194304
vm.max_map_count = 262144
System Limits
* soft nofile 1000000
* hard nofile 1000000
* soft nproc 1000000
* hard nproc 1000000

# /etc/systemd/system.conf
DefaultLimitNOFILE=1000000
DefaultLimitNPROC=1000000
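Each open connection consumes a file descriptor, so the nofile limit needs to cover the configured connection cap plus whatever the process needs for database connections, logs and other files. The check below is a sketch under that assumption. `fd_limit_ok` is a hypothetical helper, and the reserve of 1024 descriptors is an illustrative figure, not a measured one.

```sh
# Succeed when the nofile limit covers MAX_CONNECTIONS plus a reserve for other descriptors.
fd_limit_ok() {   # usage: fd_limit_ok NOFILE_LIMIT MAX_CONNECTIONS RESERVED_FDS
  if [ "$1" = "unlimited" ]; then return 0; fi
  [ "$1" -ge $(( $2 + $3 )) ]
}

# Example: compare the current shell's limit against 5000 connections plus 1024 spare descriptors.
if fd_limit_ok "$(ulimit -n)" 5000 1024; then
  echo "nofile limit is sufficient"
else
  echo "nofile limit is too low; raise it before increasing MAX_CONNECTIONS"
fi
```

Note that `ulimit -n` reports the limit of the current shell; a service started by systemd may run with a different limit.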
CPU Optimization
# Set CPU governor to performance
echo performance | sudo tee /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor

# Disable CPU frequency scaling (the service name varies by distribution)
sudo systemctl disable cpufreq

# Disable transparent huge pages, which can cause latency spikes
echo never | sudo tee /sys/kernel/mm/transparent_hugepage/enabled
Monitoring and Metrics
Key Metrics to Monitor
Relay Metrics
- Connection Count: Current active WebSocket connections
- Event Rate: Events processed per second
- Response Time: Average response time for queries
- Error Rate: HTTP/WebSocket error rates
- Memory Usage: Go heap and system memory usage
Database Metrics
- Query Latency: P50, P95, P99 query response times
- Throughput: Queries per second (QPS)
- Connection Pool: Active/idle database connections
- Disk I/O: Read/write IOPS and throughput
- Replication Lag: For distributed deployments
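When a metrics stack is not yet in place, the P50, P95 and P99 figures can be approximated from raw latency samples. The sketch below uses the nearest-rank method and assumes one numeric sample per line on standard input and an integer percentile. `percentile` is a hypothetical helper name.

```sh
# Nearest-rank percentile over numeric samples read from stdin (one per line).
percentile() {   # usage: percentile P < samples.txt   (P is an integer from 1 to 100)
  sort -n | awk -v p="$1" '
    { v[NR] = $1 }
    END {
      if (NR == 0) exit 1
      rank = int((p * NR + 99) / 100)   # ceil(p / 100 * NR) for integer p
      if (rank < 1) rank = 1
      print v[rank]
    }'
}

# Example: five query latencies in milliseconds.
printf '%s\n' 12 5 9 30 7 | percentile 50   # prints 9
printf '%s\n' 12 5 9 30 7 | percentile 95   # prints 30
```

With few samples the high percentiles collapse onto the maximum, so treat the result as a rough indicator until enough data has been collected.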
System Metrics
- CPU Usage: Per-core utilization
- Memory Usage: Available vs. used memory
- Disk Usage: Space and I/O utilization
- Network: Bandwidth and packet rates
Prometheus Configuration
scrape_configs:
  - job_name: 'shugur-relay'
    static_configs:
      - targets: ['localhost:2112'] # Relay metrics port
    scrape_interval: 15s

  - job_name: 'cockroachdb'
    metrics_path: '/_status/vars'   # CockroachDB serves Prometheus metrics on this path
    static_configs:
      - targets: ['localhost:8080'] # CockroachDB HTTP port
    scrape_interval: 30s
Grafana Dashboards
Create dashboards for:
- Relay Overview: Connections, events, performance
- Database Health: Query performance, storage, replication
- System Resources: CPU, memory, disk, network
- Alert Summary: Current alerts and system status
Load Testing
Test Setup
# Install bombardier for load testing
go install github.com/codesenberg/bombardier@latest
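# Test connection handling on the relay's HTTP listener
# (bombardier speaks HTTP only; use a WebSocket-capable client to load-test ws:// traffic)
bombardier -c 100 -d 60s -l http://localhost:8080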
# Test HTTP endpoints
bombardier -c 50 -d 30s -l http://localhost:8080/api/stats
Benchmark Results
Expected performance on recommended hardware:
Metric | Standalone | Distributed (3-node) |
---|---|---|
Concurrent Connections | 1,000-5,000 | 10,000-50,000 |
Events/Second | 500-2,000 | 5,000-20,000 |
Query Latency (P95) | < 10ms | < 5ms |
Memory Usage | 512MB-2GB | 1GB-4GB per node |
Scaling Strategies
Vertical Scaling
- CPU: Add more cores for better concurrent processing
- Memory: Increase RAM for larger caches and buffers
- Storage: Use faster NVMe drives for better I/O performance
Horizontal Scaling
- Relay Nodes: Add more stateless relay instances behind a load balancer
- Database Nodes: Scale CockroachDB cluster for better distribution
- Geographic Distribution: Deploy relay nodes in multiple regions
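Because relay nodes are stateless, the number of instances behind the load balancer can be estimated from the target connection count. This sketch assumes each node is capped at its configured MAX_CONNECTIONS and that the load balancer spreads connections evenly; `relay_nodes_needed` is a hypothetical helper name.

```sh
# Number of relay nodes needed to hold TARGET connections, rounded up.
# Assumption: every node accepts at most PER_NODE connections (its MAX_CONNECTIONS).
relay_nodes_needed() {   # usage: relay_nodes_needed TARGET_CONNECTIONS PER_NODE_CAPACITY
  echo $(( ($1 + $2 - 1) / $2 ))
}

relay_nodes_needed 12000 5000   # prints 3
```

Adding one node beyond the computed figure leaves room for a node failure or a rolling update.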
Caching Strategies
- Event Cache: Configure a larger EVENT_CACHE_SIZE for frequently accessed events
- CDN: Use a CDN for static assets and NIP-11 documents
- Redis: Consider external Redis for session management in distributed setups
Troubleshooting Performance Issues
High CPU Usage
# Check process CPU usage (-d, joins multiple PIDs with commas, as top expects)
top -p "$(pgrep -d, shugur-relay)"

# Profile CPU usage
go tool pprof "http://localhost:6060/debug/pprof/profile?seconds=30"

# Check system load
vmstat 1 10
iostat -x 1 10
High Memory Usage
# Check memory usage
free -h
ps aux | grep shugur-relay

# Profile memory usage
go tool pprof http://localhost:6060/debug/pprof/heap
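# Check for memory leaks
# (valgrind has limited value for Go binaries; comparing heap profiles over time is usually more informative)
valgrind --tool=memcheck --leak-check=full ./shugur-relay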
Slow Database Queries
-- Check slow queries in CockroachDB
-- (crdb_internal tables are version-dependent; adjust column names to your release)
SELECT query, count, avg_latency, max_latency
FROM crdb_internal.statement_statistics
WHERE avg_latency > interval '100ms'
ORDER BY avg_latency DESC;

-- Check table statistics
SHOW STATISTICS FOR TABLE events;

-- Analyze query plans
EXPLAIN (ANALYZE, VERBOSE) SELECT * FROM events WHERE pubkey = $1;
Network Issues
# Check network connectivity
ping -c 5 database-server
telnet database-server 26257

# Monitor network traffic
netstat -i
iftop -i eth0

# Check DNS resolution
dig relay.example.com
nslookup database-server
Best Practices
Security
- Rate Limiting: Configure appropriate limits to prevent abuse
- Monitoring: Set up alerts for unusual activity patterns
- Updates: Keep software and dependencies up to date
- Backups: Regular database backups and disaster recovery testing
Reliability
- Health Checks: Implement comprehensive health monitoring
- Circuit Breakers: Handle database connection failures gracefully
- Graceful Shutdown: Ensure clean shutdown procedures
- Rolling Updates: Deploy updates without downtime
Operational
- Documentation: Maintain runbooks for common operations
- Automation: Automate routine maintenance tasks
- Testing: Regular performance and disaster recovery testing
- Capacity Planning: Monitor trends and plan for growth
Next Steps
For High-Load Deployments
- Implement Caching: Add Redis or Memcached for session/event caching
- Database Sharding: Consider partitioning strategies for very large datasets
- CDN Integration: Use Cloudflare or similar for global content distribution
- Multi-Region: Deploy across multiple geographic regions
For Enterprise Deployments
- High Availability: Implement full redundancy and automated failover
- Disaster Recovery: Regular backups and cross-region replication
- Compliance: Implement audit logging and data retention policies
- Support: Establish monitoring, alerting, and incident response procedures
Related Documentation
- Installation Guide: Choose your deployment method
- Architecture Overview: Understand the system design
- Configuration Guide: Configure your relay settings
- Troubleshooting Guide: Resolve performance issues
- API Reference: WebSocket and HTTP endpoint documentation