If you run infrastructure for a company that isn't a large enterprise, you've probably had some version of this conversation: you know you need better observability, you've looked at the leading platforms, and then you've done the math. Datadog at $23 per host per month — before the per-GB ingestion fees, before APM, before log management as a separate SKU — adds up faster than it should. A 30-server environment with reasonable log volume can land you north of $3,000 a month before you've turned on half the features you actually wanted. Splunk is worse. New Relic has been repricing aggressively and isn't meaningfully different at the end of the calculation.

The alternative most teams land on is some combination of open-source tooling: Prometheus and Grafana for metrics, the ELK stack or Loki for logs, maybe Alertmanager wired up to PagerDuty. That combination can work well, but it's genuinely complex to operate. The ELK stack in particular is a significant investment to maintain — Elasticsearch cluster management is a specialty unto itself. You end up with a team that spends meaningful time running the monitoring infrastructure rather than using it.

Neither option is right for a 5-to-50 engineer organization that needs real visibility without a dedicated platform team. That gap is where Arca came from.

The design decision that shaped everything else

When I started building Arca, I made a deliberate choice that has influenced every subsequent architectural decision: the AI wouldn't be a feature — it would be the primary interface for investigation.

This sounds like a marketing claim, so let me be specific about what it means in practice and why it matters.

Most platforms that advertise AI capabilities have added a chat interface on top of an existing architecture. You can ask questions about your data, and the AI queries the platform on your behalf. The context it has is whatever it can retrieve at query time — typically some aggregate statistics and maybe recent events. That's useful, but it's fundamentally a search interface dressed up as a conversation.

Arca's architecture is different in a specific way: every collection is structured so that the AI has complete, standing context at the start of any conversation. Not search results — the actual state of that collection. The schema with all field definitions. The last 30 days of documents. Every configured alert and event rule with its parameters. Every active anomaly and what triggered it. The AI isn't fetching context when you ask a question. It already has it.

"The difference between a retrieval-augmented chatbot and an analyst who already read the file is the difference between getting an answer and getting a useful answer. We built Arca so the AI has already read the file."

This has practical consequences. When you ask "why did this alert fire at 3am?" you don't get a generic explanation of what threshold alerts do. You get an answer specific to your collection, your rule parameters, the exact document that triggered it, and a comparison against the recent trend that preceded it. The AI can tell you whether this looks like an isolated event or part of a pattern — because it's looking at the same data you are, not a summary of it.

The collection architecture: what makes the AI context possible

To understand why Arca's AI integration works the way it does, it helps to understand how collections are structured.

Every collection in Arca has a defined schema with two field types: partition fields (up to five — high-cardinality identifiers like hostname, environment, or service name) and index fields (up to fifty — the searchable attributes of each document). Documents are stored as JSON on disk with the indexed fields written to PostgreSQL, which means full-text search, time-range queries, and field-specific filters all run against a real database rather than a document store that needs to be tuned and scaled separately.

The schema design isn't just a data structure decision — it's the thing that makes per-collection AI context tractable. Because every document in a collection shares the same indexed fields, the AI can reason about your data in aggregate. It knows what "normal" CPU utilization looks like for this collection because it has the distribution of that field across 30 days of documents, not just the value in the document that triggered an alert. It can tell you whether the authentication failures you're seeing are concentrated on a specific source IP or spread across multiple — because that's a simple group-by on a partition field it already has context about.
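To make the two field types concrete, here is a minimal sketch of how such a schema might be declared. The structure, the field names, and the `validate` helper are hypothetical illustrations of the constraints described above, not Arca's actual API:

```python
# Illustrative collection schema: up to five partition fields
# (high-cardinality identifiers) and up to fifty index fields
# (searchable attributes, each numeric or text).
# All names and the dict shape are hypothetical, not Arca's API.
schema = {
    "collection": "linux-system",
    "partition_fields": ["hostname", "environment"],   # max 5
    "index_fields": {                                  # max 50
        "cpu_percent": "numeric",
        "memory_percent": "numeric",
        "disk_percent": "numeric",
        "top_process": "text",
    },
}

def validate(schema):
    """Enforce the field-count limits described in the text."""
    assert len(schema["partition_fields"]) <= 5
    assert len(schema["index_fields"]) <= 50
    assert all(kind in ("numeric", "text")
               for kind in schema["index_fields"].values())
    return True
```

Because every document in the collection shares these indexed fields, aggregate questions ("group authentication failures by `source_ip`") reduce to ordinary queries against the PostgreSQL-backed index.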

The detection engine: six rule types that cover the patterns that matter

Anomaly detection in Arca operates on two tracks. Threshold alerts evaluate synchronously when a document is written — if a field value crosses a configured threshold, the alert fires immediately and a notification goes out. These handle the obvious cases: CPU over 90%, error count above N, response time exceeding a latency budget.
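The synchronous path is simple enough to sketch directly. Assuming an alert shape like `{name, field, threshold}` (a hypothetical structure for illustration, not Arca's API), write-time evaluation looks roughly like:

```python
# Minimal sketch of synchronous threshold evaluation: when a document
# is written, every configured threshold alert for the collection is
# checked immediately. Rule shape and names are hypothetical.
def evaluate_thresholds(document, alerts):
    fired = []
    for alert in alerts:
        value = document.get(alert["field"])
        if value is not None and value > alert["threshold"]:
            fired.append({"alert": alert["name"], "value": value})
    return fired

alerts = [{"name": "cpu-high", "field": "cpu_percent", "threshold": 90}]
doc = {"hostname": "web-01", "cpu_percent": 97.2}
evaluate_thresholds(doc, alerts)   # → [{'alert': 'cpu-high', 'value': 97.2}]
```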

The more interesting track is the event rule engine, which evaluates six rule types every 60 seconds against recent data. These cover the detection patterns that threshold alerts miss:

  • Frequency — more than N occurrences of a specific value within M minutes. The canonical use case is brute-force detection: more than 10 failed SSH authentication attempts from a single source IP in five minutes. But the same rule type catches error bursts, scan activity, and any pattern where rate-of-occurrence is the signal.
  • Absence — no data from a known source within M minutes. This is heartbeat monitoring done properly: if a host stops sending data, you know immediately rather than noticing at the next manual review. For security teams, this also catches log source tampering — an attacker who disables logging doesn't disappear, they become conspicuously absent.
  • New Value — a value appeared that hasn't been seen before in this collection. New source IPs connecting to your infrastructure, a process name that wasn't in yesterday's inventory, an unknown user account. This rule type operationalizes the "know your normal" principle without requiring you to maintain and update an allow-list manually.
  • Spike — the average of a metric changed by more than N% between two consecutive time windows. This catches rate-of-change anomalies that threshold rules miss: a CPU that's at 75% is below most alert thresholds, but a CPU that jumped from 12% to 75% in five minutes is a different story.
  • Baseline — a value deviates from its rolling mean by more than N standard deviations. Statistical outlier detection without requiring you to define what normal is in advance — the platform learns it from your data. This is particularly useful for metrics with natural diurnal patterns where a fixed threshold would produce constant noise.
  • Compound — multiple conditions combined with AND/OR logic within a time window. Correlated multi-signal detection: a new source IP that also generated more than five authentication failures in ten minutes is a materially more interesting finding than either signal in isolation.
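
Three of the rule types above (frequency, spike, baseline) can be sketched in a few lines each. These are illustrative evaluations under assumed data shapes, not Arca's implementation:

```python
from collections import Counter
from statistics import mean, stdev

# Hedged sketches of three of the six rule types. Event and window
# shapes are assumptions made for this example.

def frequency_rule(events, field, limit):
    """Fire for any value of `field` seen more than `limit` times in the window."""
    counts = Counter(e[field] for e in events)
    return [value for value, n in counts.items() if n > limit]

def spike_rule(prev_window, curr_window, pct):
    """Fire when the window average changes by more than `pct` percent."""
    before, after = mean(prev_window), mean(curr_window)
    if before == 0:                      # avoid dividing by a zero baseline
        return False
    return abs(after - before) / before * 100 > pct

def baseline_rule(history, value, n_sigma):
    """Fire when `value` deviates from the rolling mean by > n_sigma stdevs."""
    mu, sigma = mean(history), stdev(history)
    return abs(value - mu) > n_sigma * sigma
```

The brute-force case from the frequency bullet, the 12%-to-75% CPU jump from the spike bullet, and the diurnal-metric case from the baseline bullet all fall out of these three evaluations with the right parameters.
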

On the detection engine and AI

Each of these rule types produces an anomaly record that carries full context: the triggering document, the rule parameters, the time window, and the specific values that caused the evaluation to fire. When you open the AI assistant and ask about an active anomaly, it has that record. The conversation starts from a place of shared, specific context — not from "tell me about this alert ID."
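As a rough illustration of that record shape (every field name below is invented for this sketch, not taken from Arca):

```python
# Hypothetical anomaly record carrying the full context described
# above: the rule parameters, the evaluation window, the triggering
# document, and the observed values that caused the rule to fire.
anomaly = {
    "rule_type": "frequency",
    "rule_params": {"field": "source_ip", "limit": 10, "window_minutes": 5},
    "window": {"start": "2025-06-01T03:00:00Z", "end": "2025-06-01T03:05:00Z"},
    "triggering_document": {"source_ip": "203.0.113.7",
                            "event": "ssh_auth_failure"},
    "observed": {"count": 23},   # 23 occurrences against a limit of 10
}
```

A record like this is what the AI starts from, which is why the conversation can open with specifics rather than "tell me about this alert ID."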

What self-hosted actually gives you

Self-hosted is often framed as a cost story, and the cost story is real: no per-GB ingestion fees, no per-host licensing, no surprise bills when a runaway process floods your logs for an hour. A reasonably sized Arca deployment on a $40/month cloud VM handles the monitoring load for most small engineering teams without material infrastructure cost.

But for security teams specifically, self-hosted has a dimension that goes beyond cost: your security logs don't leave your environment. This matters more than it's usually acknowledged. The logs that Arca collects — authentication events, firewall activity, process execution records, network connections — are exactly the data that an attacker would most benefit from accessing. Sending that data to a third-party SaaS platform introduces a dependency you can't fully audit and a data residency question that compliance frameworks increasingly care about. When Arca runs on your hardware, that data stays on your hardware.

The AI component uses your own Anthropic API key, which means the data sent to the model for analysis flows through your account under your agreement with Anthropic — not through a shared infrastructure layer operated by the monitoring vendor. For organizations with strict data handling requirements, that distinction is meaningful.

The capabilities that have surprised people

A few capabilities that users consistently find more useful than they expected when they first look at the platform:

Schema inference

The classic friction point when adopting a new monitoring platform is the schema design conversation. What fields do you need? What should be partitioned? What's indexed? For teams instrumenting new data sources, that conversation delays deployment. Arca's inference mode eliminates it: send ten sample documents from your source, and Arca discovers the fields, classifies each as numeric or text, and suggests a schema for your review. You confirm, adjust if needed, and the buffer of sample documents becomes your first real records. New data sources go from "I should set this up" to actually monitored in minutes.
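Under an assumed document shape, the numeric-versus-text classification at the heart of inference can be sketched as follows. This illustrates the idea, not Arca's actual inference code:

```python
# Sketch of schema inference from sample documents: discover the
# fields and classify each as numeric or text. A field that is ever
# non-numeric falls back to text. Shapes are hypothetical.
def infer_schema(samples):
    fields = {}
    for doc in samples:
        for key, value in doc.items():
            kind = "numeric" if isinstance(value, (int, float)) else "text"
            fields[key] = "text" if fields.get(key) == "text" else kind
    return fields

samples = [
    {"hostname": "web-01", "cpu_percent": 42.5, "status": "ok"},
    {"hostname": "web-02", "cpu_percent": 17.0, "status": "ok"},
]
infer_schema(samples)
# → {'hostname': 'text', 'cpu_percent': 'numeric', 'status': 'text'}
```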

Synthetic collections

Correlation across data sources is where meaningful security analysis happens. An authentication failure in isolation is noise. An authentication failure from an IP that also shows up in your firewall logs as a port scanner earlier the same day is a finding. Synthetic collections let you combine fields from two or more source collections on a shared time bucket — without writing any custom code. The correlated collection works exactly like any other: dashboards, anomaly detection, AI analysis, all of it.
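A time-bucket correlation of that kind can be sketched as follows. The collection contents, field names, and five-minute bucket size are assumptions made for this example:

```python
from collections import defaultdict

# Sketch of a synthetic collection: combine fields from two source
# collections on a shared time bucket, keeping only buckets where
# both sources contributed. Illustrative, not Arca's implementation.
def correlate(auth_events, firewall_events, bucket_seconds=300):
    buckets = defaultdict(dict)
    for e in auth_events:
        b = e["ts"] // bucket_seconds
        buckets[b]["auth_failures"] = buckets[b].get("auth_failures", 0) + 1
        buckets[b]["source_ip"] = e["source_ip"]
    for e in firewall_events:
        b = e["ts"] // bucket_seconds
        buckets[b]["ports_scanned"] = buckets[b].get("ports_scanned", 0) + 1
    # An auth failure alone is noise; an auth failure plus scan
    # activity in the same window is a finding.
    return {b: v for b, v in buckets.items()
            if "auth_failures" in v and "ports_scanned" in v}
```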

One-click recommended rules

Each of Arca's twelve pre-built agent types ships with a curated set of alert and event recommendations — typically 12 or more per agent — tuned to the data that agent produces. The Linux system agent gets CPU, memory, and disk threshold recommendations with sensible defaults, plus heartbeat monitoring and spike detection on CPU utilization. The auth log agent gets brute-force frequency rules, new source IP detection, and absence monitoring. Creating all of them takes a single click from the collections page. For teams that don't have a security analyst to design their detection coverage from scratch, this provides a solid baseline immediately.
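As an illustration of what such a recommendation set might look like as data (every rule name and default below is hypothetical, and real sets are larger):

```python
# Hypothetical excerpt of a per-agent recommendation set: curated
# rules with sensible defaults, created in bulk against a collection.
LINUX_AGENT_RECOMMENDATIONS = [
    {"kind": "threshold", "field": "cpu_percent", "threshold": 90},
    {"kind": "threshold", "field": "memory_percent", "threshold": 90},
    {"kind": "threshold", "field": "disk_percent", "threshold": 85},
    {"kind": "absence", "window_minutes": 10},              # heartbeat
    {"kind": "spike", "field": "cpu_percent", "pct": 200},  # sudden jumps
]

def create_all(collection, recommendations):
    """One-click bulk creation: tag each rule with its target collection."""
    return [dict(rule, collection=collection) for rule in recommendations]
```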

Honest about what it isn't yet

Arca is not yet a SIEM. I want to be direct about that, because SIEM is a specific term with specific connotations — correlated event analysis across sources, threat intelligence integration, compliance-focused retention policies, case management and investigation workflows. Those capabilities are on the roadmap and are the next major development priority. They're not there today.

What Arca is today is a capable, self-hosted infrastructure monitoring and anomaly detection platform with an AI investigation layer that is genuinely differentiated at this price point. For the teams it's designed for — DevOps and platform engineers who need real visibility, and security teams who need anomaly detection without a six-figure annual contract — it covers the ground that matters most right now.

"The future of infrastructure monitoring isn't a better dashboard. It's a better conversation. Arca is built around that belief — and we're just getting started."

Where this goes

The SIEM roadmap is the natural extension of what Arca already does. The collection and indexing architecture, the cross-collection correlation through synthetic collections, the anomaly detection engine, and the AI integration layer are all foundations that SIEM capability builds on rather than replaces. The path from "anomaly detection platform with AI" to "AI-native SIEM" is shorter than building a SIEM from scratch — and the resulting product is meaningfully different from traditional SIEMs that are adding AI as an afterthought rather than designing around it.

The broader thesis behind Arca is that the gap between what enterprise security teams can afford and what small-to-mid engineering teams can access is closing — not because the enterprise tools are getting cheaper, but because AI changes the economics of what a small, well-designed platform can do. You don't need a team of Elasticsearch engineers to run Arca. You don't need a dedicated analyst to tune your detection rules. You need infrastructure to deploy it on and a willingness to have a conversation with your data.

That's what we built.


About the Author
Matthew Hogan
Founder & CTO, Twin Tech Labs

Matt is a technologist and engineering leader with 20+ years of experience across space systems, IoT, big data, and cybersecurity. He founded Twin Tech Labs to build Arca — an AI-first infrastructure monitoring platform — and to deliver senior-level security services to organizations that don't have enterprise-scale security budgets. Previously CTO of LifeRaft, acquired by Securitas in 2026.