
Jul 03 2025

Agentic Workflows: Balancing Automation with Oversight

Automating observability with agents

As organizations increasingly integrate AI into their operations, observability has shifted from nice-to-have to mission-critical. When models run in production—whether traditional machine learning algorithms or advanced AI systems—seemingly minor changes in data pipelines, infrastructure configurations, or data quality can silently erode performance. Unlike traditional applications, where failures often manifest as obvious errors, AI model degradation tends to be subtler. Predictions gradually become less accurate, confidence scores drift, and/or bias creeps in unnoticed.

Consider an online retailer deploying AI to predict inventory needs. The model might initially perform well, but as seasonal patterns shift, supplier data changes, or customer behavior evolves, predictions become increasingly unreliable. Without proper observability, the retailer might discover these issues only when facing stockouts or excess inventory—problems that directly impact revenue and customer satisfaction.

This is where intelligent observability comes into play. The goal is to create a system that detects issues early, triggers smart automated responses, and brings humans into the loop when their expertise is needed. This approach transforms observability from a reactive debugging mechanism, triggered only when a client or senior leader notices odd behavior and fires off a panicked email, into an early warning system that prevents business disruption.

Sample Workflow

Let’s examine how this retailer’s inventory prediction system might operate in practice, tracing the journey from code commit to intelligent remediation. This workflow is derived from one I built for an online retailer.

Observability workflow for AI agents

I mapped this workflow in Miro, with detailed notes at each juncture providing additional guidance and recommendations, like the one highlighted. While this represents one approach to incorporating AI agents elegantly and responsibly, modern observability’s strength lies in its adaptability. Each component can be replaced with alternatives that better fit your team’s existing tech stack and budget constraints. I’ll suggest alternative tools for each step.

Code is pushed to GitHub

The process begins when developers commit changes to the inventory prediction model or its supporting infrastructure. These changes might include adjustments to the machine learning algorithm itself, updates to data preprocessing pipelines, modifications to model serving infrastructure, or even configuration tweaks that affect how the system interprets inventory data. In AI systems, seemingly small code changes can have cascading effects—a minor adjustment to feature engineering could impact prediction accuracy, while infrastructure changes might affect model inference speed or resource consumption.

While GitHub is the de facto standard for version control in many organizations, GitLab, Bitbucket, and Azure Repos are worthy alternatives. I’ve used Bitbucket, and its integrations with Jira and Confluence are convenient.

CI/CD pipeline is triggered

GitHub Actions automatically initiates the Continuous Integration/Continuous Deployment (CI/CD) pipeline, running unit tests, integration tests, and model validation checks before promoting code to production. AI system regressions can be subtle and go undetected without some kind of automated validation process in place.

The pipeline might also include AI-specific validations, such as data drift detection, model performance benchmarking against historical baselines, and compatibility checks with downstream systems that ingest this inventory prediction data. These automated checks catch issues early in the development cycle, preventing problematic deployments that could degrade prediction accuracy or system reliability.
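To make the idea concrete, here’s a minimal sketch of a validation gate a pipeline step could run before promoting a model. The script name, metric files, and 2-point tolerance are hypothetical placeholders, not a prescription; any non-zero exit code is enough to fail the GitHub Actions job.

```python
# validate_model.py (hypothetical name): fail the CI job if the candidate
# model underperforms the production baseline on a holdout set.
import json
import sys

ACCURACY_TOLERANCE = 0.02  # placeholder: allow at most a 2-point accuracy drop

def main() -> int:
    with open("candidate_metrics.json") as f:   # produced earlier in the pipeline
        candidate = json.load(f)
    with open("baseline_metrics.json") as f:    # metrics from the current prod model
        baseline = json.load(f)

    drop = baseline["accuracy"] - candidate["accuracy"]
    if drop > ACCURACY_TOLERANCE:
        print(f"FAIL: accuracy dropped {drop:.3f} vs. baseline; blocking deployment")
        return 1  # non-zero exit code fails the pipeline step

    print("PASS: candidate model is within tolerance of the baseline")
    return 0

if __name__ == "__main__":
    sys.exit(main())
```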

Alternatives to GitHub Actions include GitLab CI, CircleCI, Jenkins, and Azure Pipelines.

Monitoring platform checks for system changes

Once deployed, New Relic begins collecting telemetry data, tracking model inference times, prediction accuracy metrics, data pipeline health, and infrastructure performance. AI systems have unique observability requirements that go beyond traditional application monitoring. The platform captures not only standard metrics like CPU usage and memory consumption but also AI-specific indicators, such as prediction confidence scores, feature drift detection, model version performance comparisons, and data quality metrics from upstream sources.

Image source: New Relic website

As shown in the diagram above, New Relic’s telemetry data platform collects four key types of observability data:

  • metrics from monitoring tools
  • events from application workflows
  • logs from system operations
  • traces from distributed services

This comprehensive data collection flows through various infrastructure components—from cloud platforms like AWS to container orchestration systems like Kubernetes—before being centralized in New Relic’s platform for analysis.

For an inventory prediction system, this might include monitoring the freshness of sales data, tracking seasonal pattern recognition accuracy, measuring the time between prediction requests and responses, and alerting on unusual patterns in supplier data that could skew predictions.
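If your prediction service is written in Python, New Relic’s Python agent can carry these AI-specific numbers alongside the standard telemetry. The sketch below is illustrative only; the metric names, config path, and values are placeholders, so check the agent docs against your own setup.

```python
# Illustrative only: attach AI-specific metrics to New Relic telemetry from a
# Python prediction service. Metric names and config path are placeholders.
import newrelic.agent

newrelic.agent.initialize("newrelic.ini")              # agent config file (placeholder path)
app = newrelic.agent.register_application(timeout=10)  # block until the agent connects

def record_prediction_telemetry(confidence: float, latency_ms: float, data_age_min: float) -> None:
    # Custom/* metrics surface in New Relic alongside standard infrastructure data.
    newrelic.agent.record_custom_metric("Custom/Inventory/PredictionConfidence", confidence, application=app)
    newrelic.agent.record_custom_metric("Custom/Inventory/InferenceLatencyMs", latency_ms, application=app)
    newrelic.agent.record_custom_metric("Custom/Inventory/SalesDataAgeMinutes", data_age_min, application=app)

# Example: called after each prediction request
record_prediction_telemetry(confidence=0.87, latency_ms=42.0, data_age_min=15.0)
```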

Alternative monitoring solutions include Datadog, Prometheus, Elastic Observability, and Splunk.

AI-powered anomaly detection identifies issues

Dynatrace Davis AI continuously analyzes this telemetry stream, learning normal patterns and flagging deviations that might be invisible to traditional rule-based monitoring. Unlike static thresholds that trigger alerts when metrics cross predefined limits, AI-powered anomaly detection adapts to the natural rhythms of your system, understanding that inventory prediction accuracy might naturally dip during holiday seasons or that inference times typically spike during end-of-month reporting periods.

Image source: Dynatrace website

Davis AI automatically identifies problems and traces them to their root cause through intelligent correlation. The system also displays timelines of performance degradation while mapping the complete dependency chain from infrastructure to individual services.

The system might notice subtle patterns like prediction confidence scores gradually dropping over several days, correlation changes between different data sources, or unusual clustering in prediction outputs that suggests model drift. For the inventory prediction system, this could mean detecting that the model is becoming less confident about fast-moving consumer goods, or that seasonal adjustments aren’t performing as expected compared to historical patterns. This intelligent analysis goes beyond simple threshold violations to identify complex, multi-dimensional anomalies that often precede more serious system failures.
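Davis AI learns these baselines for you, but the underlying idea is easy to see in miniature. The toy check below flags a sustained drop in prediction confidence against a historical baseline; it’s a stand-in for what an adaptive detector does, not Dynatrace’s actual algorithm.

```python
# Toy anomaly check: flag when recent prediction confidence drifts well below
# its historical baseline. A stand-in for adaptive detection, not Davis AI.
from statistics import mean, stdev

def confidence_drift_alert(history: list[float], recent: list[float], z_threshold: float = 3.0) -> bool:
    """Return True if the recent average confidence is an outlier vs. history."""
    baseline_mean = mean(history)
    baseline_std = stdev(history)
    recent_mean = mean(recent)
    z_score = (baseline_mean - recent_mean) / baseline_std  # positive when confidence drops
    return z_score > z_threshold

# Example: 30 days of healthy scores vs. the last 3 days trending down
history = [0.91, 0.90, 0.92, 0.89, 0.91, 0.90, 0.92, 0.90, 0.91, 0.89] * 3
recent = [0.84, 0.82, 0.80]
if confidence_drift_alert(history, recent):
    print("Confidence drift detected — route to the data science on-call")
```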

Alternative intelligent monitoring solutions include Datadog Watchdog, Anodot, Zebrium, IBM Instana, and Moogsoft.

Alerting system notifies teams

When issues are detected, PagerDuty routes contextual alerts to the appropriate team members, ensuring that the right expertise addresses each type of problem. The system intelligently categorizes incidents based on their nature and severity:

  • Data scientists receive alerts about model performance degradation, feature drift, or accuracy drops.
  • DevOps teams get notified about infrastructure bottlenecks, service failures, or deployment issues.
  • Business stakeholders are alerted to critical prediction failures that could impact inventory decisions or customer experience.

AI system failures often require domain-specific knowledge to resolve effectively. PagerDuty’s contextual alerting includes relevant telemetry data, recent deployments, and suggested runbooks (e.g., steps to take when data quality drops below x%), enabling faster diagnosis and resolution. The platform also manages alert fatigue by correlating related incidents and suppressing redundant notifications during widespread outages.
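If you’re wiring this up yourself, PagerDuty’s Events API v2 accepts exactly this kind of contextual payload. The sketch below is a rough illustration; the routing key, summary, and custom_details values are placeholders for whatever context your team needs at 2 a.m.

```python
# Send a contextual alert via PagerDuty's Events API v2. The routing key and
# custom_details values are placeholders for illustration.
import requests

EVENTS_URL = "https://events.pagerduty.com/v2/enqueue"

def trigger_model_alert(routing_key: str) -> None:
    payload = {
        "routing_key": routing_key,          # integration key for the data science service
        "event_action": "trigger",
        "payload": {
            "summary": "Inventory model confidence dropping over the last 72 hours",
            "source": "inventory-prediction-service",
            "severity": "warning",
            "custom_details": {               # context that speeds up diagnosis
                "model_version": "2025.06.28-a",
                "recent_deploys": ["feature-engineering tweak"],
                "runbook": "https://wiki.example.com/runbooks/model-drift",
            },
        },
    }
    response = requests.post(EVENTS_URL, json=payload, timeout=10)
    response.raise_for_status()

trigger_model_alert(routing_key="YOUR_INTEGRATION_KEY")
```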

Alternative alerting platforms include Opsgenie, Splunk On-Call, and Datadog Incident Management.

Automated remediation is triggered

For known issues, Shoreline executes pre-approved remediation scripts that have been tested and validated by the operations team. Many common failures follow predictable patterns that can be resolved without human intervention. The system might automatically restart a failing prediction service, scale up compute resources when inference queues grow too long, adjust memory allocation for data processing pipelines, or roll back to a previous model version if the current deployment shows performance degradation.

What makes this approach powerful is its potential integration with version control. For example, Shoreline can automatically create a pull request, generate a new Git branch (following your team’s established taxonomy), and document the remediation actions that were taken in the commit message. It can then notify authorized personnel who can review and approve the fixes for production deployment, effectively preserving an audit trail and ensuring emergency fixes don’t bypass standard development workflows.
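You don’t need Shoreline specifically to get that audit trail; the branch-commit-PR pattern can be scripted with plain git and the GitHub CLI. The sketch below is a simplified illustration, with a hypothetical branch name and remediation step, and it assumes gh is installed and authenticated.

```python
# General pattern for auditable auto-remediation: record the fix on a branch
# and open a PR for human review. Branch name, messages, and the remediation
# step itself are placeholders.
import subprocess

def run(*cmd: str) -> None:
    subprocess.run(cmd, check=True)

def open_remediation_pr(branch: str, summary: str) -> None:
    run("git", "checkout", "-b", branch)
    # ...apply the pre-approved fix here, e.g. pin the previous model version...
    run("git", "commit", "-am", f"auto-remediation: {summary}")
    run("git", "push", "-u", "origin", branch)
    # Requires the GitHub CLI (`gh`) to be installed and authenticated.
    run("gh", "pr", "create", "--title", f"[auto-remediation] {summary}",
        "--body", "Automated rollback; please review before merging.", "--base", "main")

open_remediation_pr(
    branch="remediation/rollback-model-2025-06-28",
    summary="roll back inventory model after confidence drop",
)
```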

Alternative automation platforms include Rundeck, StackStorm, Ansible AWX, and AWS Systems Manager Automation.

Logs are sent to a centralized store

All incident data, performance metrics, and remediation actions then flow into a data lake or warehouse like BigQuery for long-term storage and analysis, creating a comprehensive historical record. This centralized data gathering enables sophisticated post-incident analysis that can reveal subtle patterns that may not surface in real-time monitoring.

Teams can correlate model performance degradation with specific data sources, identify seasonal patterns in system behavior, track the effectiveness of different remediation strategies, and understand how infrastructure changes impact prediction accuracy. The data lake/warehouse also supports advanced analytics like cohort analysis of incidents, trend analysis of model performance over time, and predictive modeling to anticipate future operational challenges. Many factors can impact model accuracy; for the inventory prediction system, these could include fulfillment center capacity data or demand fluctuations around peak shopping events like Black Friday and Mother’s Day.
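Getting those records into the warehouse can be as simple as a streaming insert at the end of the incident workflow. Here’s a minimal sketch using the BigQuery Python client; the project, dataset, table, and field names are placeholders.

```python
# Minimal sketch: stream an incident record into BigQuery for later analysis.
# Project, dataset, table, and field names are placeholders.
from google.cloud import bigquery

client = bigquery.Client()
table_id = "my-project.observability.incidents"   # placeholder fully-qualified table

rows = [{
    "detected_at": "2025-06-28T14:05:00Z",
    "signal": "prediction_confidence_drift",
    "model_version": "2025.06.28-a",
    "remediation": "rolled back to 2025.06.21-c",
    "resolved_by": "auto",
}]

errors = client.insert_rows_json(table_id, rows)   # streaming insert
if errors:
    raise RuntimeError(f"BigQuery insert failed: {errors}")
```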

Alternative data warehousing solutions include Snowflake and Amazon Redshift.

Dashboards are updated for stakeholder visibility

Finally, for this particular workflow, Tableau dashboards automatically refresh with the latest incident data, resolution metrics, and model performance trends, providing stakeholders with real-time visibility into both operational health and business impact. These dashboards serve different audiences with tailored views: executives may want high-level metrics like system uptime and business impact, while data science teams monitor model performance and drift indicators.

These dashboards could also include forward-looking analytics, such as capacity planning projections, seasonal performance forecasts, and risk assessments based on current system trends. Interactive features allow users to drill down from summary metrics to detailed views practitioners can take action on.
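If the dashboards read from an extract rather than a live connection, the refresh itself can be triggered from the same pipeline. A rough sketch with the tableauserverclient library follows; the server URL, token, and datasource ID are placeholders, and the exact calls may vary by server version.

```python
# Rough sketch: trigger a Tableau extract refresh after new incident data lands.
# Server URL, credentials, and datasource ID are placeholders.
import tableauserverclient as TSC

auth = TSC.PersonalAccessTokenAuth("ci-token", "TOKEN_VALUE", site_id="analytics")
server = TSC.Server("https://tableau.example.com", use_server_version=True)

with server.auth.sign_in(auth):
    datasource = server.datasources.get_by_id("DATASOURCE_ID")
    job = server.datasources.refresh(datasource)   # queues an extract refresh job
    print(f"Refresh job queued: {job.id}")
```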

Alternative visualization platforms include Power BI, Looker, Metabase, Redash, and Apache Superset.

Why Automate AI Observability?

The reality is that many AI workflows still rely on isolated monitoring or manual alerting. But as model drift, pipeline failures, and infrastructure bottlenecks become more common, proactive observability will become more mainstream. Keeping humans in the loop is one of the best ways to protect your organization from becoming a cautionary tale like the FDA or Klarna (though props to Klarna for their transparency in sharing their regrets about going too far with AI agents). Pre-approved actions, versioned through Git, provide a safe way to automate remediation without removing team oversight. Engineers remain in the loop but no longer have to respond to every alert manually.

Final Thoughts

As AI adoption scales across industries, observability becomes a critical foundation for resilience. Whether you use enterprise platforms or open-source tools, investing in a well-structured observability pipeline helps your team deploy faster, recover faster, maintain confidence in the systems they build, and steer clear of my ‘AI fail’ tag. 👀

Image credit: Patrick Federi

Written by Annie Cushing · Categorized: AI
