OpenShift CNF Reference Test Harness¶
SIG-GW / IBCF Validation Architecture
Author: Daniel
Platform: OpenShift CNF
1. Architecture Overview¶
```
+--------------------------------+
| Traffic Generator (SIPp)       |
+--------------------------------+
| Chaos Injector (Litmus)        |
+--------------------------------+
| Metrics Stack (Prom + Grafana) |
+--------------------------------+
| SIG-GW Pods (IBCF)             |
+--------------------------------+
| STIR Service + Vault           |
+--------------------------------+
| Media Relay Pods               |
+--------------------------------+
```
1.1 Component Placement¶
- Traffic Generator: separate namespace (`testrig`)
- Chaos Injector: separate namespace (`chaos`)
- Metrics Stack: monitoring namespace (`monitoring`)
- SIG-GW Pods: production namespace (`ims`)
- STIR Service: production namespace (`ims`)
- Media Relay: production namespace (`ims`)
2. Components¶
2.1 Traffic Generation¶
SIPp¶
- Purpose: SIP traffic generation
- Deployment: StatefulSet or DaemonSet
- Configuration: Test scenario scripts
- Metrics: Generated CPS, call success rate
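One way to run SIPp against the gateway from the `testrig` namespace is a batch Job. This is a minimal sketch; the container image, scenario path, and target service name are illustrative assumptions, not fixed by this document.

```yaml
# Sketch of a SIPp load job; image, scenario file, and target service
# are assumed names for illustration.
apiVersion: batch/v1
kind: Job
metadata:
  name: sipp-uac-baseline
  namespace: testrig
spec:
  template:
    spec:
      restartPolicy: Never
      containers:
        - name: sipp
          image: example/sipp:latest          # assumed image
          args:
            - "-sf"
            - "/scenarios/uac-invite.xml"     # assumed scenario script
            - "-r"
            - "100"                           # target rate: 100 CPS
            - "-m"
            - "360000"                        # stop after this many calls
            - "-trace_stat"                   # periodic CSV statistics
            - "ibcf-signaling.ims.svc:5060"   # assumed SIG-GW service
```

The `-trace_stat` output feeds the generated-CPS and call-success-rate metrics listed above.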
Custom Load Injector¶
- Purpose: Advanced load patterns
- Features:
- Ramp-up/ramp-down
- Burst patterns
- Emergency call injection
- STIR-aware traffic
STIR-Aware Traffic Scripts¶
- Purpose: Generate STIR-signed traffic
- Features:
- A/B/C attestation levels
- Certificate rotation scenarios
- Invalid signature testing
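The A/B/C attestation levels travel in the PASSporT payload of the SIP Identity header (RFC 8588). A minimal sketch of building the unsigned `header.payload` portion that a traffic script could vary per call; the field values and `origid` are illustrative, and a real harness would append an ES256 signature:

```python
import base64
import json
import time

def b64url(data: bytes) -> str:
    """Base64url without padding, as used for JWS/PASSporT segments."""
    return base64.urlsafe_b64encode(data).rstrip(b"=").decode()

def build_passport_segments(orig_tn: str, dest_tn: str, attest: str,
                            cert_url: str) -> str:
    """Return the unsigned header.payload portion of a SHAKEN PASSporT.

    attest must be "A", "B", or "C". Invalid-signature tests can append
    a garbage third segment; rotation tests can point cert_url at an
    expired or freshly rotated certificate.
    """
    if attest not in ("A", "B", "C"):
        raise ValueError("attestation must be A, B, or C")
    header = {"alg": "ES256", "ppt": "shaken", "typ": "passport",
              "x5u": cert_url}
    payload = {"attest": attest,
               "dest": {"tn": [dest_tn]},
               "iat": int(time.time()),
               "orig": {"tn": orig_tn},
               "origid": "test-call-0001"}  # illustrative unique call ID
    return (b64url(json.dumps(header, separators=(",", ":")).encode())
            + "."
            + b64url(json.dumps(payload, separators=(",", ":")).encode()))
```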
2.2 Chaos Engineering¶
Litmus Chaos¶
- Purpose: Chaos injection framework
- Experiments:
- Pod kill
- Network partition
- CPU throttling
- Memory pressure
- DNS corruption
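A pod-kill experiment against the signaling pods can be expressed as a Litmus ChaosEngine. Sketch only: the label selector and service-account name are assumptions for this environment.

```yaml
# Litmus pod-delete experiment targeting the IBCF signaling deployment;
# applabel and chaosServiceAccount are assumed names.
apiVersion: litmuschaos.io/v1alpha1
kind: ChaosEngine
metadata:
  name: ibcf-pod-kill
  namespace: chaos
spec:
  appinfo:
    appns: ims
    applabel: app=ibcf-signaling
    appkind: deployment
  chaosServiceAccount: litmus-admin
  engineState: active
  experiments:
    - name: pod-delete
      spec:
        components:
          env:
            - name: TOTAL_CHAOS_DURATION
              value: "60"   # seconds of chaos
            - name: CHAOS_INTERVAL
              value: "10"   # seconds between pod deletions
```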
Custom Chaos Scenarios¶
- Purpose: IMS-specific chaos
- Scenarios:
- STIR service crash
- LI mediation device failure
- Certificate expiration
- State store partition
2.3 Metrics Stack¶
Prometheus¶
- Purpose: Metrics collection
- Scrape Interval: 15 seconds
- Retention: 30 days
- Targets: All IMS components
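A scrape job covering the `ims` namespace could look like the following sketch; it assumes pods carry the conventional `prometheus.io/scrape` annotation, which this document does not mandate.

```yaml
# Illustrative Prometheus scrape job for the IMS components.
scrape_configs:
  - job_name: ims-components
    scrape_interval: 15s
    kubernetes_sd_configs:
      - role: pod
        namespaces:
          names:
            - ims
    relabel_configs:
      # Keep only pods annotated prometheus.io/scrape: "true" (assumed
      # convention, not specified by this document).
      - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_scrape]
        action: keep
        regex: "true"
```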
Grafana Dashboards¶
- Purpose: Visualization
- Dashboards:
- IBCF Signaling
- STIR/SHAKEN
- LI Intercepts
- Emergency Calls
- Infrastructure
Alertmanager¶
- Purpose: Alert routing
- Channels: PagerDuty, Slack, Email
- Rules: KPI threshold violations
3. CI/CD Integration¶
3.1 Automated Regression Pipeline¶
Trigger: Push to main/develop
Stages:
1. Build: container images
2. Deploy: test environment
3. Functional Tests: unit + integration
4. Load Tests: baseline performance
5. Chaos Tests: resilience validation
6. Report: test results
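On OpenShift these stages map naturally onto an OpenShift Pipelines (Tekton) pipeline. A sketch of the stage ordering; the task names are placeholders the project would define, not existing tasks:

```yaml
# Stage ordering as a Tekton pipeline; all taskRef names are assumed.
apiVersion: tekton.dev/v1
kind: Pipeline
metadata:
  name: cnf-regression
spec:
  tasks:
    - name: build
      taskRef:
        name: build-images
    - name: deploy
      taskRef:
        name: deploy-test-env
      runAfter: [build]
    - name: functional-tests
      taskRef:
        name: run-functional-tests
      runAfter: [deploy]
    - name: load-tests
      taskRef:
        name: run-load-tests
      runAfter: [functional-tests]
    - name: chaos-tests
      taskRef:
        name: run-chaos-tests
      runAfter: [load-tests]
    - name: report
      taskRef:
        name: publish-report
      runAfter: [chaos-tests]
```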
3.2 Performance Baseline Comparison¶
Purpose: Detect performance regressions
Comparison:
- Current run vs. baseline
- PDD trends
- CPU/Memory trends
- Latency trends
Action: Alert if > 10% degradation
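The 10% degradation check can be sketched as a small comparison function; the metric names in the usage example are illustrative.

```python
def detect_regressions(baseline, current, threshold=0.10):
    """Compare a current run against a baseline run.

    baseline/current map metric name -> value, where higher is worse
    (e.g. PDD p95 in seconds, CPU %). Returns the metrics whose value
    degraded by more than `threshold` (default 10%), with the relative
    change rounded to 4 places.
    """
    regressions = {}
    for metric, base in baseline.items():
        if metric not in current or base <= 0:
            continue
        change = (current[metric] - base) / base
        if change > threshold:
            regressions[metric] = round(change, 4)
    return regressions
```

For example, a PDD p95 that moves from 0.30 s to 0.36 s is a 20% regression and would be flagged, while a CPU move from 50% to 52% (4%) would not.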
3.3 Canary Testing¶
Purpose: Validate new releases
Process:
1. Deploy canary (10% traffic)
2. Monitor KPIs
3. Compare to baseline
4. Promote or rollback
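One way to drive this process declaratively is Argo Rollouts (an assumption; the document does not name a tool). A sketch of the 10% step with a monitoring pause:

```yaml
# Canary rollout sketch using Argo Rollouts; tool choice and pause
# duration are assumptions.
apiVersion: argoproj.io/v1alpha1
kind: Rollout
metadata:
  name: ibcf-signaling
  namespace: ims
spec:
  strategy:
    canary:
      steps:
        - setWeight: 10             # 10% of traffic to the new release
        - pause: {duration: 30m}    # monitor KPIs against baseline
        - setWeight: 100            # promote; abort instead to roll back
```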
4. Key Metrics Collected¶
4.1 Signaling Metrics¶
- CPS: Calls per second
- PDD: Post-dial delay (p50, p95, p99)
- CSSR: Call setup success rate
- Dialog Count: Active dialogs
- Error Rate: 4xx/5xx responses
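Assuming counters like `ibcf_invites_total` and `ibcf_responses_total` with a `code` label (illustrative names, not confirmed by this document), CSSR and error rate map onto PromQL such as:

```promql
# CSSR: fraction of INVITEs answered 2xx over 5 minutes
sum(rate(ibcf_responses_total{code=~"2.."}[5m]))
  / sum(rate(ibcf_invites_total[5m]))

# Error rate: 4xx/5xx responses per second
sum(rate(ibcf_responses_total{code=~"[45].."}[5m]))
```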
4.2 Infrastructure Metrics¶
- CPU: Per pod, per node
- Memory: Per pod, per node
- Network: Bandwidth, packet loss
- Disk: I/O, space
4.3 STIR/SHAKEN Metrics¶
- Signing Latency: p50, p95, p99
- Verification Latency: p50, p95, p99
- Certificate Cache Hit Rate: %
- OCSP Response Time: p50, p95, p99
- Attestation Distribution: A/B/C counts
4.4 LI Metrics¶
- Intercept Coverage: % of warrants
- Intercept Latency: Overhead
- Media Duplication Rate: %
- Audit Log Completeness: %
4.5 Emergency Metrics¶
- Emergency PDD: p50, p95, p99
- Emergency Drop Rate: %
- Priority Preemption: Success rate
- Location Preservation: %
4.6 TLS Metrics¶
- TLS Handshake Time: p50, p95, p99
- mTLS Success Rate: %
- Certificate Rotation Time: Seconds
- OCSP Validation Time: p50, p95, p99
5. Scaling Strategy¶
5.1 HPA for Signaling Pods¶
Configuration:
```yaml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: ibcf-signaling
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: ibcf-signaling
  minReplicas: 2
  maxReplicas: 10
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 70
    - type: Resource
      resource:
        name: memory
        target:
          type: Utilization
          averageUtilization: 80
```
Triggers:
- CPU > 70%
- Memory > 80%
- Custom metric (CPS)
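The CPS trigger needs a custom-metrics adapter exposing a per-pod rate; assuming a metric named `ibcf_calls_per_second` (illustrative) is available, it would be appended to the HPA `metrics:` list as a Pods-type entry:

```yaml
# Pods-type CPS metric for the HPA; metric name and target value are
# assumptions for the sketch.
- type: Pods
  pods:
    metric:
      name: ibcf_calls_per_second
    target:
      type: AverageValue
      averageValue: "150"
```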
5.2 Dedicated Media Plane¶
Architecture:
- Separate namespace for media
- Dedicated nodes (optional)
- Independent scaling
- SR-IOV enabled
5.3 SR-IOV Enabled Nodes¶
Purpose: Low-latency media handling
Configuration:
- SR-IOV network attachments
- DPDK support (optional)
- CPU pinning
- NUMA awareness
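The network attachment itself can be declared with a NetworkAttachmentDefinition; the resource name, network name, and subnet below are assumptions for illustration.

```yaml
# Illustrative SR-IOV attachment for the media-relay pods.
apiVersion: k8s.cni.cncf.io/v1
kind: NetworkAttachmentDefinition
metadata:
  name: media-sriov
  namespace: ims
  annotations:
    k8s.v1.cni.cncf.io/resourceName: openshift.io/media_vfs  # assumed VF pool
spec:
  config: |
    {
      "cniVersion": "0.3.1",
      "type": "sriov",
      "name": "media-sriov",
      "ipam": {
        "type": "host-local",
        "subnet": "192.0.2.0/24"
      }
    }
```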
5.4 Node Affinity for Latency¶
Purpose: Optimize for low latency
Configuration:
```yaml
affinity:
  nodeAffinity:
    requiredDuringSchedulingIgnoredDuringExecution:
      nodeSelectorTerms:
        - matchExpressions:
            - key: node-role.kubernetes.io/latency-optimized
              operator: In
              values:
                - "true"
```
6. Test Harness Execution Flow¶
6.1 Deploy CNF¶
Steps:
1. Apply Kubernetes manifests
2. Wait for pods ready
3. Verify health checks
4. Confirm metrics collection
6.2 Deploy Test Harness¶
Steps:
1. Deploy traffic generator
2. Deploy chaos injector
3. Deploy metrics stack (if not existing)
4. Configure test scenarios
6.3 Inject Load¶
Steps:
1. Start baseline load (50% rated)
2. Ramp to 100% rated
3. Sustain for test duration
4. Monitor KPIs
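The 50%-to-100% ramp can be sketched as a load profile generator that the traffic driver polls; the step interval is an assumption.

```python
def ramp_profile(rated_cps, ramp_seconds, hold_seconds, step_seconds=10):
    """Yield (elapsed_seconds, target_cps) pairs.

    Starts at 50% of rated CPS, ramps linearly to 100% over
    ramp_seconds, then holds the rated load for hold_seconds.
    """
    start = rated_cps * 0.5
    t = 0
    while t < ramp_seconds:
        frac = t / ramp_seconds
        yield t, start + (rated_cps - start) * frac
        t += step_seconds
    while t < ramp_seconds + hold_seconds:
        yield t, float(rated_cps)
        t += step_seconds
```

For a 200 CPS rated system with a 60 s ramp, the profile starts at 100 CPS, passes 150 CPS at the midpoint, and sustains 200 CPS thereafter.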
6.4 Inject Chaos¶
Steps:
1. Select chaos scenario
2. Execute chaos experiment
3. Monitor system behavior
4. Validate recovery
6.5 Validate KPIs¶
Steps:
1. Check all KPIs within target
2. Verify no degradation
3. Confirm recovery
4. Document results
6.6 Generate Report¶
Steps:
1. Collect metrics
2. Generate dashboards
3. Create test report
4. Archive results
7. Test Scenarios¶
7.1 Baseline Performance¶
Configuration:
- Load: 100% rated CPS
- Duration: 1 hour
- Metrics: all KPIs

Validation:
- All KPIs within target
- Stable performance
- No errors
7.2 Burst Load¶
Configuration:
- Load: 2× rated CPS
- Duration: 60 seconds
- Metrics: CPS, PDD, CPU

Validation:
- No packet loss
- PDD within target
- No crash
7.3 Failover Test¶
Configuration:
- Load: 100% rated CPS
- Event: kill primary pod
- Metrics: failover time, call loss

Validation:
- Failover ≤ 1 second
- Zero call loss
- No emergency call drops
7.4 Chaos Test¶
Configuration:
- Load: 100% rated CPS
- Chaos: pod kill, network partition
- Metrics: recovery time, call handling

Validation:
- Graceful degradation
- Fast recovery
- No data loss
8. Monitoring & Alerting¶
8.1 Prometheus Rules¶
Example:
```yaml
groups:
  - name: ibcf_alerts
    rules:
      - alert: HighPDD
        expr: histogram_quantile(0.95, rate(ibcf_pdd_bucket[5m])) > 0.4
        for: 5m
        annotations:
          summary: "PDD above threshold"
```
8.2 Grafana Dashboards¶
Dashboards:
- Real-time IBCF metrics
- Historical trends
- Component health
- KPI compliance
8.3 Alert Routing¶
Channels:
- Critical: PagerDuty
- Warning: Slack
- Info: Email
9. Test Data Management¶
9.1 Test Scenarios¶
Storage: Git repository
Format: YAML/JSON
Versioning: Git tags
9.2 PIXIT Configurations¶
Storage: testrig/pixit/
Format: YAML
Usage: Test execution parameters
9.3 Test Results¶
Storage: Object storage (S3)
Retention: 90 days
Format: JSON, PCAP, logs
10. Continuous Improvement¶
10.1 Performance Optimization¶
- Identify bottlenecks
- Optimize hot paths
- Reduce latency
- Improve throughput
10.2 Test Coverage¶
- Add new test scenarios
- Cover edge cases
- Validate new features
- Improve chaos scenarios