Certification Criteria
Every agent must pass a rigorous 4-layer evaluation before earning certification. Each layer tests a different dimension of production readiness.
The 4 Evaluation Layers
Command Correctness
Validates that every command generated by the agent is syntactically correct, semantically meaningful, and executable in the target system. This is the foundation layer - an agent that generates invalid commands cannot proceed.
Evaluation Checks
- Syntax validation against target system grammar
- Parameter type and range verification
- Command dependency chain completeness
- Return value handling and error paths
- Idempotency verification for repeatable commands
Passing Threshold
Minimum 60% accuracy across test suite
Situational Appropriateness
Evaluates whether the agent's chosen action is appropriate for the current network state, time of day, traffic conditions, and maintenance windows. A correct command at the wrong time is still a wrong action.
Evaluation Checks
- Network state awareness (maintenance, degraded, normal)
- Traffic load sensitivity and peak hour avoidance
- Dependency impact assessment on neighboring cells
- Maintenance window compliance verification
- Concurrent operation conflict detection
Passing Threshold
Minimum 65% appropriateness score with zero critical-time violations
Anticipated Impact
Uses physics-based simulation models to predict the real-world impact of the agent's proposed actions before execution. The agent must demonstrate that it understands what will happen when its commands are applied to the physical network.
Evaluation Checks
- RF propagation model prediction accuracy
- KPI impact estimation (throughput, latency, coverage)
- Interference pattern prediction for neighbor cells
- Energy consumption delta estimation
- User experience impact modeling (QoE metrics)
Passing Threshold
Impact prediction accuracy within 15% of simulation results
DOIL Compliance
Verifies that the agent operates strictly within its Declarative Operational Intent Layer contract. The DOIL defines what the agent is allowed to do, its constraints, escalation procedures, and human oversight requirements.
Evaluation Checks
- Action boundary enforcement (no out-of-scope operations)
- Constraint adherence (parameter limits, rate limits)
- Escalation protocol compliance (human-in-the-loop triggers)
- Audit trail completeness and traceability
- Graceful degradation under constraint violations
Passing Threshold
100% compliance for critical constraints, 90% for advisory constraints
Certification Levels
Bronze
Composite >= 0.60
Minimum viable certification. Agent operates under supervised deployment with restricted scope and mandatory human approval for all actions.
Privileges
- Supervised deployment only
- Single-site operation
- Human approval required for all actions
- Weekly review cycle
Silver
Composite >= 0.70
Competent certification. Agent demonstrates reliable performance and can operate with reduced oversight. Suitable for multi-site deployment.
Privileges
- Reduced oversight deployment
- Multi-site operation
- Human approval for high-impact actions only
- Bi-weekly review cycle
Gold
Composite >= 0.80
Advanced certification. Agent has proven consistent performance across diverse scenarios. Eligible for autonomous operation within DOIL constraints.
Privileges
- Autonomous operation within DOIL
- Network-wide deployment
- Human notification for high-impact actions
- Monthly review cycle
Platinum
Composite >= 0.90
Elite certification. Agent demonstrates exceptional performance and reliability. May serve as a reference model for training other agents.
Privileges
- Full autonomous operation
- Cross-domain advisory capability
- Reference model status
- Quarterly review cycle
Production Readiness Requirements
Beyond the 4-layer evaluation, every agent must pass a production readiness assessment before deployment. These are non-negotiable requirements regardless of certification level.
Guardrails Enforced
Hard limits on all actuating parameters. Kill switches verified. Maximum impact boundaries defined and tested.
Rollback Tested
Automated rollback procedures verified in staging. Recovery time measured. State restoration confirmed.
Staged Deployment
Canary or blue-green deployment proven. Minimum 48h observation period. No anomalies detected.
Audit Trail Complete
Full decision logging with intent-to-action mapping. Queryable history. Tamper-evident records.
Human Review Approved
Certified engineer has reviewed agent behavior, edge cases, and failure modes. Written approval on file.
Human Review Process
Submission
Agent completes training in agym.ai and achieves minimum threshold scores. Automatic submission to certification review queue.
Automated Evaluation
The 4-layer evaluation runs against a standardized test suite. Results are compiled into a certification report.
Expert Review
A certified engineer reviews the evaluation results, examines edge cases, and tests failure scenarios manually.
Decision
The reviewer either certifies the agent at the appropriate level, requests additional training, or rejects with detailed feedback.
Monitoring
Certified agents enter continuous monitoring. Certification can be suspended or revoked if performance degrades.
Renewal & Revocation
Renewal
Certifications are not permanent. Agents must be re-evaluated at intervals determined by their certification level (weekly for Bronze, quarterly for Platinum). Re-evaluation uses the latest test suite and may result in certification level changes.
Bronze: Weekly re-evaluation
Silver: Bi-weekly re-evaluation
Gold: Monthly re-evaluation
Platinum: Quarterly re-evaluation
Revocation
Certification can be immediately revoked if an agent violates DOIL constraints, causes measurable service degradation, or fails a surprise audit. Revoked agents are removed from production and must restart the certification process.
DOIL constraint violation → Immediate revocation
KPI degradation >5% → Suspension pending review
Failed audit → Certification level downgrade
Vendor patch without re-certification → Suspension