Adrian Lita
Arista Network Monitoring System
Key Highlights
- DMF & CloudVision
- System monitoring
- Fabric management
Overview
Advanced network monitoring and fabric management system for Arista Networks' data center solutions. Implements comprehensive system health monitoring, LLDP-based topology discovery, and integration with CloudVision for centralized management. The system handles thousands of network devices, processing millions of metrics per second while maintaining low latency and high reliability.
Technical Details
The monitoring system is built using a microservices architecture with services written in Java (core monitoring) and Python (data processing and analytics). LLDP implementation follows IEEE 802.1AB specifications with Arista-specific extensions. Data collection uses streaming telemetry with gRPC for low-latency metric delivery. The YANG-based database provides a standardized interface for querying network state. System monitoring agents run on switches and controllers, collecting process stats, memory utilization, RAID status, and network interface metrics. High-frequency metrics (sampled every second) are aggregated and downsampled for long-term storage. The system implements automatic anomaly detection using statistical methods and machine learning.
Challenges Overcome
- • Handling massive metric volumes (millions per second) while maintaining query performance
- • Ensuring monitoring system reliability doesn't impact switch forwarding performance
- • Maintaining consistency in distributed state across thousands of devices
- • Implementing LLDP across diverse hardware platforms with different capabilities
- • Detecting and recovering from network partitions and device failures
- • Optimizing database queries for real-time dashboard updates
- • Minimizing false positive alerts while catching real problems quickly
Outcomes & Impact
- ✓ Enabled proactive maintenance by detecting issues before customer impact
- ✓ Reduced mean time to resolution (MTTR) for network problems by 60%
- ✓ Processed 10M+ metrics/second with sub-second query latency
- ✓ Achieved 99.99% monitoring system uptime across deployments
- ✓ Automated topology discovery reduced network documentation effort by 90%
- ✓ Predictive analytics prevented failures in 70% of at-risk devices