Milton Hultgren

Reconciliation in Elastic Streams: A Robust Architecture Deep Dive

Learn how Elastic's engineering team refactored Streams using a reconciliation model inspired by Kubernetes & React to build a robust, extensible, and debuggable system.

Streams is a new, unified approach to data management in the Elastic Stack. It wraps a set of existing Elasticsearch building blocks—data streams, index templates, ingest pipelines, retention policies—into a single, coherent primitive: the Stream. Instead of configuring these parts individually and in the right order, users can now rely on Streams to orchestrate them safely and automatically. With a unified UI in Kibana and a simplified API, Streams reduces cognitive load, lowers the risk of misconfiguration, and supports more flexible workflows like late binding—where users can ingest data first and decide how to process and route it later.

But behind that clean user experience lies a fast-moving, evolving codebase. In this post, we’ll explore how we rethought its architecture to keep up with product demands—while laying the groundwork for future flexibility and scale.

Rapid experimentation often leads to messy code—but before shipping to customers, we have to ask: If this succeeds, can we continue evolving it? That question puts code health front and center. To move fast in the long term, we need a foundation that supports iteration.

When I joined the Streams team about six months ago, the project was moving fast through uncharted territory amid high uncertainty. This combination of speed and uncertainty created the perfect conditions for, well, spaghetti code—crafted by some of our most senior engineers, doing their best with a recipe missing a few ingredients.

The code was pragmatic and effective: it did exactly what it needed to do. But it was becoming increasingly difficult to understand and extend. Related logic was scattered across many files, with little separation of concerns, making it difficult to safely identify where and how to introduce changes. And the project still had a long road ahead.

Recently, we undertook a refactor of the underlying architecture—not just to bring greater clarity and structure to the codebase, but to establish clear phases that make it easier to debug and evolve. Our primary goal was to build a foundation that would let us continue moving quickly and confidently. As a secondary goal, we aimed to enable new capabilities like bulk updates, dry runs, and system diagnostics.

In this post, we’ll briefly explore the challenges that prompted a new approach, share the architectural patterns that inspired us, explain how the new design works under the hood, and highlight what it enables for the future.

The Challenges We Faced

Streams aims to be a declarative model for data management. Users describe how data should flow: where it should go, what processing should happen along the way, and which mappings should apply. Behind the scenes, each API request results in one or more Elasticsearch resources being changed.
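
To make the declarative model concrete, here is a rough sketch of what a stream description could look like. The shape and field names below are invented for illustration and are not the actual Streams API schema.

```typescript
// A hypothetical, simplified stream definition. Field names are
// illustrative only; the real Streams API schema differs.
interface StreamSketch {
  name: string;
  // Where matching data should be routed.
  routing: Array<{ destination: string; condition: string }>;
  // What processing should happen along the way.
  processing: Array<{ type: string; config: Record<string, unknown> }>;
  // Which field mappings should apply.
  mappings: Record<string, { type: string }>;
}

const logs: StreamSketch = {
  name: 'logs',
  routing: [{ destination: 'logs.nginx', condition: "service.name == 'nginx'" }],
  processing: [{ type: 'grok', config: { field: 'message' } }],
  mappings: { 'service.name': { type: 'keyword' } },
};
```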

Before the refactor, the underlying code was increasingly difficult to reason about. There was no clear lifecycle that each request followed. Data was loaded only when it happened to be needed, validation was scattered across different functions, and cascading changes—like child streams reacting to parent updates—were applied recursively and implicitly. Elasticsearch operations could happen at any point while a request was being handled.

This led to several key challenges:

  • No clear place for validation
    Without a single, centralized validation step, engineers weren’t sure where to add new checks—or whether existing ones would even run reliably. Some validations happened early, others late.

  • No clear picture of the overall system state
    Because there was no way to manage the system state as a whole, it was hard to reason about or validate it. We couldn’t easily check whether a change was valid in the context of all other existing streams or dependencies.

  • Unpredictable side effects
    Since Elasticsearch operations could occur at different points in the flow, failures were harder to handle or roll back. We didn’t have a clear “commit point” where the changes were executed.

  • Tangled stream logic
    Logic for different types of streams was mixed together in shared code paths, often guarded by conditionals. This made it hard to isolate behavior, test individual types, or add new ones without risking unintended consequences.

These challenges made it clear: we needed a more structured foundation, one capable of supporting both the current complexity and future growth.

What We Needed to Move Forward

To keep moving fast with confidence, we needed a foundation that could evolve gracefully, make behavior easier to reason about, and reduce the likelihood of unexpected side effects.

We aligned around a few key goals:

  • A clear request lifecycle
    Each request should move through clear, well-defined phases: loading the current state, applying changes, validating the resulting state, determining the Elasticsearch actions, and executing the actions. This structure would help engineers understand where things happen—and why.

  • A unified state model
    We wanted a clear model of desired vs. current state—a single place to reason about the outcome of a change. This would enable safer validation, more efficient updates, and easier debugging by allowing us to compute the difference between the two states.

  • A single commit point
    All Elasticsearch changes should happen in one place, after everything’s validated and we know exactly what needs to change. This would reduce side effects, make failures easier to manage, and unlock support for dry runs.

  • Isolated stream logic
    We needed clearer separation between stream types so each could be developed and tested in isolation. This would simplify adding new types, reduce unintended side effects, and clarify whether changes belong to a stream type or the state management layer.

  • Bulk operations and system introspection
    Finally, we wanted to support features like bulk updates, dry runs, and health diagnostics—capabilities that were difficult or impossible with the old design. A more explicit and inspectable model of system state would make this possible.

These goals became our north star as we explored new architectural patterns, with a strong focus on comparing the current state with the desired state.

Where We Drew Inspiration From

Our new design drew inspiration from two well-known open source projects: Kubernetes and React. Though very different, both share a central concept: reconciliation.

Reconciliation means comparing two states, calculating their differences, and taking the necessary actions to move the system from its current state to its desired state.

  • In Kubernetes, you declare the desired state of your resources, and the controller continuously works to align the cluster with that state.

  • In React, each component declares what it should render, and React diffs the virtual DOM against its previous version to efficiently update the real DOM to match.

We were also inspired by the Plan/Execute pattern, which separates decision making from execution. This was exactly what we needed in order to perform all validations before committing to any actions—ensuring we could reason about and inspect the system's intent ahead of time.
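
To make that concrete, here is a minimal, generic sketch of reconciliation with a Plan/Execute split: a pure planning step diffs the current and desired states into actions, and a separate step executes them. Everything here—the types, the function names, the string-based diff—is simplified for illustration and is not the Streams implementation.

```typescript
// Reconciliation with a Plan/Execute split, in miniature.
type ResourceMap = Map<string, string>; // resource name -> serialized config

interface PlannedAction {
  type: 'create' | 'update' | 'delete';
  name: string;
}

// Plan: pure decision making, safe to inspect or discard.
function plan(current: ResourceMap, desired: ResourceMap): PlannedAction[] {
  const actions: PlannedAction[] = [];
  for (const [name, config] of desired) {
    if (!current.has(name)) actions.push({ type: 'create', name });
    else if (current.get(name) !== config) actions.push({ type: 'update', name });
  }
  for (const name of current.keys()) {
    if (!desired.has(name)) actions.push({ type: 'delete', name });
  }
  return actions;
}

// Execute: the only place side effects happen.
async function execute(actions: PlannedAction[]): Promise<void> {
  for (const action of actions) {
    console.log(`${action.type} ${action.name}`); // stand-in for real API calls
  }
}
```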

These concepts resonated with what we needed. They made it clear that we required two key pieces:

  1. A model representing system state, responsible for comparing states and driving the overall workflow (like the Kubernetes controller loop).

  2. A representation of individual streams that make up that state, handling the specific logic for each stream type (like React components).

Each Stream is defined and stored in Elasticsearch. We recognized a disconnect between data management and state changes in our existing code, so we designed each Stream to manage both. This fits naturally with the Active Record pattern, where a class encapsulates both domain logic and persistence.

To make the system easier to extend and the state model’s interface simpler, we implemented an abstract Active Record class using the Template Method pattern, clearly defining the interface new stream types must follow.
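
As a rough illustration of that combination, the sketch below shows an abstract base class that owns persistence (Active Record) and a fixed workflow that delegates to subclass hooks (Template Method). All names here, including persistToElasticsearch, are hypothetical.

```typescript
// A minimal sketch of Active Record + Template Method, assuming a
// hypothetical persistence helper. Not the actual StreamActiveRecord.
declare function persistToElasticsearch(name: string, doc: object): Promise<void>;

abstract class StreamRecordSketch {
  constructor(protected name: string) {}

  // Active Record: the class knows how to persist itself.
  async save(): Promise<void> {
    await persistToElasticsearch(this.name, this.toDocument());
  }

  // Template Method: a fixed workflow that routes to subclass hooks.
  applyChange(change: { type: 'upsert' | 'delete' }): void {
    if (change.type === 'upsert') {
      this.applyUpsert();
    } else {
      this.applyDeletion();
    }
  }

  // Hooks that each concrete stream type must implement.
  protected abstract toDocument(): object;
  protected abstract applyUpsert(): void;
  protected abstract applyDeletion(): void;
}
```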

We did have some concerns that adopting these more advanced patterns—reconciliation, Active Record, and Template Method—might make it harder for new or less experienced engineers to get up to speed. While the code would be cleaner and more straightforward for those familiar with the patterns, we worried it could create a barrier for newcomers.

In practice, however, we found the opposite: the code became easier to follow because the patterns provided a clear, consistent structure. More importantly, the architectural choices helped keep the focus on the domain itself, rather than on complex implementation details, making it more approachable for the whole team. The patterns are there, but the code doesn’t talk about them; it talks about the domain.

How We Structured the System

When a request hits one of our API endpoints in Kibana, the handler performs basic request validation, then passes the request to the Streams Client. The client’s job is to translate the request into one or more Change objects. Each Change represents the creation, modification, or deletion of a Stream.
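
A Change could be modeled as a discriminated union along these lines; the exact shape in the codebase may differ, so treat this as a sketch:

```typescript
// A sketch of Change objects as a discriminated union: an upsert covers
// creation and modification, a delete covers removal. Illustrative only.
type Change =
  | { type: 'upsert'; definition: { name: string; [key: string]: unknown } }
  | { type: 'delete'; name: string };

// For example, one request might translate into two changes:
const changes: Change[] = [
  { type: 'upsert', definition: { name: 'logs.nginx' } },
  { type: 'delete', name: 'logs.legacy' },
];
```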

These Change objects are then passed to a central class we introduced called State, which plays two key roles:

  • It holds the set of Stream instances that make up the current version of the system.

  • It orchestrates the pipeline that applies changes and transitions from one state to another.

Let’s walk through the key phases the State class manages when applying a change.
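
As an overview, the whole pipeline could be sketched like this, with each numbered step matching a phase below. The names are illustrative stand-ins, not the actual implementation:

```typescript
// A high-level sketch of the phases the State class orchestrates.
type Change = { type: 'upsert' | 'delete'; name: string };

interface ElasticsearchAction {
  description: string;
}

interface StateModel {
  clone(): StateModel;
  applyChanges(changes: Change[]): void; // includes cascading changes
  validate(startingState: StateModel): void; // throws if the result is invalid
  determineActions(startingState: StateModel): ElasticsearchAction[];
}

declare function loadCurrentState(): Promise<StateModel>;
declare function executePlan(actions: ElasticsearchAction[]): Promise<void>;

async function applyRequest(changes: Change[], dryRun: boolean) {
  const startingState = await loadCurrentState(); // 1. load the starting state
  const desiredState = startingState.clone();
  desiredState.applyChanges(changes); // 2. apply requested + cascading changes
  desiredState.validate(startingState); // 3. validate the desired state
  const actions = desiredState.determineActions(startingState); // 4. determine actions
  if (dryRun) {
    return { plannedActions: actions }; // dry run: stop before any side effects
  }
  await executePlan(actions); // 5. the single commit point
  return { plannedActions: actions };
}
```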

Loading the Starting State

First, the State class loads the current system state by reading the stored Stream definitions from Elasticsearch. This becomes our reference point for all subsequent comparisons—used during validation, diffing, and action planning.

Applying Changes

We begin by cloning the starting state. Each Stream is responsible for cloning itself. Then we process each incoming Change:

  • The change is presented to all Streams in the current state (creating a new one if needed).

  • Each Stream can react by updating itself and optionally emitting cascading changes—additional changes that ripple through related Streams.

  • Cascading changes are processed in a loop until no more are generated (or until we hit a safety threshold).

We then move on to the next requested Change. If any requested or cascading Change cannot be applied safely, the system aborts the entire request to prevent partial updates.
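
A simplified version of that cascading loop might look like the following; the queue-based processing and the safety threshold value are illustrative assumptions:

```typescript
// A sketch of the cascading-change loop: every change is offered to all
// streams, streams may emit follow-up changes, and processing continues
// until the queue drains or a safety threshold trips. Illustrative only.
interface ChangeLike { type: 'upsert' | 'delete'; name: string }
interface StreamLike {
  // Reacts to a change and returns any cascading changes it emits.
  applyChange(change: ChangeLike): ChangeLike[];
}

const MAX_CHANGES = 1000; // hypothetical safety threshold

function processChanges(streams: StreamLike[], requested: ChangeLike[]): void {
  const queue = [...requested];
  let processed = 0;
  while (queue.length > 0) {
    if (++processed > MAX_CHANGES) {
      throw new Error('Exceeded cascading change safety threshold; aborting');
    }
    const change = queue.shift()!;
    for (const stream of streams) {
      queue.push(...stream.applyChange(change));
    }
  }
}
```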

Validating the Desired State

Once we’ve applied all Changes and cascading effects, we run validations to ensure the resulting configuration is safe and consistent.

Each Stream is asked to validate itself in the context of the full desired state and the original starting state. This allows for both localized checks (within a Stream) and broader coordination (between related Streams). If any validation fails, we abort the request.
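
In sketch form, that validation phase could look like this, with each stream handed both states so it can run local and cross-stream checks. The names are illustrative:

```typescript
// A sketch of the validation phase: every stream validates itself with
// access to both the full desired state and the original starting state.
interface ValidationResult { isValid: boolean; errors: string[] }
interface ValidatableStream {
  name: string;
  validate(desired: ValidatableStream[], starting: ValidatableStream[]): ValidationResult;
}

function validateDesiredState(
  desired: ValidatableStream[],
  starting: ValidatableStream[]
): void {
  const errors = desired
    .map((stream) => stream.validate(desired, starting))
    .flatMap((result) => result.errors);
  if (errors.length > 0) {
    // Abort the request before any Elasticsearch changes are made.
    throw new Error(`Desired state is invalid: ${errors.join('; ')}`);
  }
}
```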

Determining Actions

Next, each Stream is asked to determine what Elasticsearch actions are needed to move from the starting state to the desired state. This is the first point where the system needs to consider which Elasticsearch resources back an individual Stream.

If the request is a dry run, we stop here and return a summary of what would happen. If it’s meant to be executed, we move to the next phase.

Planning and Execution

The list of Elasticsearch actions is handed off to a dedicated class called ExecutionPlan. This class handles:

  • Resolving cross-stream dependencies that individual Streams cannot address alone.

  • Organizing the actions into the correct order to ensure safe application (e.g. to avoid data loss when routing rules change).

  • Maximizing parallelism wherever possible within those ordering constraints.

If the plan executes successfully, we return a success response from the API.
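
One way to picture the ExecutionPlan's job is the layered execution sketch below: actions whose dependencies are satisfied run in parallel, layer by layer. The dependency model here is invented for illustration and is not the actual implementation:

```typescript
// A sketch of layered plan execution: within each layer, independent
// actions run in parallel; layers run in dependency order.
interface PlannedAction {
  id: string;
  dependsOn: string[]; // ids of actions that must complete first
  execute(): Promise<void>;
}

async function executeInLayers(actions: PlannedAction[]): Promise<void> {
  const done = new Set<string>();
  let remaining = [...actions];
  while (remaining.length > 0) {
    // Everything whose dependencies are already satisfied can run now.
    const ready = remaining.filter((a) => a.dependsOn.every((id) => done.has(id)));
    if (ready.length === 0) {
      throw new Error('Execution plan contains an unresolvable dependency');
    }
    await Promise.all(ready.map((a) => a.execute())); // maximize parallelism
    for (const a of ready) done.add(a.id);
    remaining = remaining.filter((a) => !done.has(a.id));
  }
}
```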

Handling Failures

If the plan fails during execution, the State class attempts a rollback—it computes a new plan that should return the system to its starting state (by planning the transition from the desired state back to the starting state) and tries to execute it.

If the rollback also fails, we have a fallback mechanism: a “reset” operation that re-applies the known-good state stored in Elasticsearch, skipping diffing entirely.
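
Put together, the recovery flow might be sketched like this; all names are hypothetical stand-ins:

```typescript
// A sketch of the recovery flow: try the commit; on failure, plan the
// reverse transition (desired -> starting); if that also fails, reset
// from the stored known-good state.
interface PlannedAction { description: string }
interface StateModel {
  determineActions(from: StateModel): PlannedAction[];
}

declare function executePlan(actions: PlannedAction[]): Promise<void>;
declare function resetFromStoredState(): Promise<void>;

async function commitWithRecovery(starting: StateModel, desired: StateModel) {
  try {
    await executePlan(desired.determineActions(starting)); // normal commit
  } catch {
    try {
      // Roll back by planning the opposite transition.
      await executePlan(starting.determineActions(desired));
    } catch {
      // Last resort: re-apply the stored state, skipping diffing entirely.
      await resetFromStoredState();
    }
  }
}
```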

A Closer Look at the Stream Active Record Classes

All Streams in the State are subclasses of an abstract class called StreamActiveRecord. This class is responsible for:

  • Tracking the change status of the Stream

  • Routing change application, validation, and action determination, based on the change status, to specialized template method hooks implemented by its concrete subclasses.

These hooks are as follows:

  • Apply upsert / Apply deletion

  • Validate upsert / Validate deletion

  • Determine actions for creation / change / deletion
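
To illustrate the hook surface, here is a sketch of a simplified base class and a concrete stream type implementing it. The hook signatures and the LoggingStreamSketch class are invented for this example:

```typescript
// A sketch of the hook surface. The base class routes each phase to
// these hooks; signatures and the concrete class are invented here.
interface StreamChange { type: 'upsert' | 'delete'; name: string }
interface ValidationResult { isValid: boolean; errors: string[] }
interface Action { description: string }

abstract class StreamActiveRecordSketch {
  abstract applyUpsert(change: StreamChange): StreamChange[]; // may cascade
  abstract applyDeletion(change: StreamChange): StreamChange[]; // may cascade
  abstract validateUpsert(): ValidationResult;
  abstract validateDeletion(): ValidationResult;
  abstract actionsForCreation(): Action[];
  abstract actionsForChange(): Action[];
  abstract actionsForDeletion(): Action[];
}

class LoggingStreamSketch extends StreamActiveRecordSketch {
  applyUpsert(change: StreamChange): StreamChange[] {
    // React to the change; return cascading changes for related streams.
    return [];
  }
  applyDeletion(change: StreamChange): StreamChange[] {
    return [];
  }
  validateUpsert(): ValidationResult {
    return { isValid: true, errors: [] };
  }
  validateDeletion(): ValidationResult {
    return { isValid: true, errors: [] };
  }
  actionsForCreation(): Action[] {
    return [{ description: 'create data stream' }];
  }
  actionsForChange(): Action[] {
    return [];
  }
  actionsForDeletion(): Action[] {
    return [{ description: 'delete data stream' }];
  }
}
```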

With this architecture in place, we’ve created a clear, phased, and declarative flow from input to action—one that’s modular, testable, and resilient to failure. It cleanly separates generic stream lifecycle logic (like change tracking and orchestration) from stream-specific behaviors (such as what “upsert” means for a given Stream type), enabling a highly extensible system. This structure allows us to isolate side effects, validate with confidence, and reason more clearly about system-wide behavior—all while supporting dry runs and bulk operations.

Now that we’ve covered how it works, let’s explore what this unlocks—the capabilities, safety guarantees, and new workflows this design makes possible.

What This Unlocks

The reconciliation-based design we landed on isn’t just easier to reason about—it directly addresses many of the core limitations we faced in the earlier version of the system.

Bulk operations and dry runs, by design

One of our key goals was to support bulk configuration changes across many Streams in a single request. The previous codebase made this difficult because the side effects were interleaved with decision-making logic, making it risky to apply multiple changes at once.

Now, bulk changes are the default. The State class handles any number of changes, tracks cascading effects automatically, and validates the end result as a whole. Whether you're updating one Stream or fifty, the pipeline handles it consistently.

Dry runs were another desired feature. Because actions are now computed in a side-effect-free step—before anything is sent to Elasticsearch—we can generate a full preview of what would happen. This includes both which Streams would change and what specific Elasticsearch operations would be performed. That visibility helps users and developers make confident, informed decisions.
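
A dry-run preview might be shaped roughly like this; the field names are hypothetical, not the actual API response:

```typescript
// A hypothetical dry-run result shape (not the actual API response):
// the pipeline runs fully but stops before the commit point.
interface DryRunResult {
  affectedStreams: string[]; // which Streams would change
  plannedActions: Array<{ description: string }>; // the Elasticsearch operations
}

const preview: DryRunResult = {
  affectedStreams: ['logs', 'logs.nginx'],
  plannedActions: [
    { description: 'update ingest pipeline for logs' },
    { description: 'create data stream logs.nginx' },
  ],
};
```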

Easier debugging, better diagnostics

In the old system, debugging required reconstructing the execution context and piecing together side effects. Now, every phase of the pipeline is explicit and can be tested in isolation, and problems can be traced by following the phases.

Because validation and Elasticsearch actions are now tied directly to the Stream definition and lifecycle, any inconsistencies or errors are easier to trace to their source.

Validated planning before execution

Because we now validate and plan before making any changes, the risk of leaving the system in an inconsistent or partially-updated state has been greatly reduced. All actions are determined in advance, and only executed once we’re confident the entire set of changes is valid and coherent.

And if something does go wrong during execution, we can lean on the fact that both the starting and desired states are fully modeled in memory. This allows us to generate a rollback plan automatically and, when that’s not possible, fall back to a complete reset from the stored state. In short: safety is now built in, not bolted on.

Extensible by default

Adding a new type of Stream used to mean editing logic scattered across multiple files. Now, it’s a focused, well-defined task. You subclass StreamActiveRecord and implement the handful of lifecycle hooks.

That’s it. The orchestration, tracking, and dependency handling are already wired up. That also means it’s easier to onboard new developers or experiment with new Stream types without fear of breaking unrelated parts of the system.

Easier to test

Because each Stream is now encapsulated and has clear, isolated responsibilities, testing is much simpler. You can test individual Stream classes by simulating specific inputs and asserting the resulting cascading changes, validation results, or Elasticsearch actions. There's no need to spin up a full end-to-end environment just to test a single validation.
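
For instance, reusing the hypothetical LoggingStreamSketch from the earlier example, a unit test could exercise a single stream in isolation. This is sketched with Jest (which Kibana uses), but the test shape itself is illustrative:

```typescript
// An isolated unit test for a single stream type: no Elasticsearch,
// no end-to-end environment, just inputs and asserted outputs.
import { describe, expect, it } from '@jest/globals';

describe('LoggingStreamSketch', () => {
  it('emits no cascading changes for a simple upsert', () => {
    const stream = new LoggingStreamSketch();
    const cascades = stream.applyUpsert({ type: 'upsert', name: 'logs.nginx' });
    expect(cascades).toEqual([]); // assert on the emitted cascading changes
  });

  it('plans a creation action for its backing data stream', () => {
    const stream = new LoggingStreamSketch();
    expect(stream.actionsForCreation()).toEqual([{ description: 'create data stream' }]);
  });
});
```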

What’s Next

At Elastic, we live by our Source Code, which states “Progress, SIMPLE Perfection”—a reminder to favor steady, incremental improvement over chasing perfection.

This new system is a solid foundation—but it’s only the beginning. Our focus so far has been on clarity, safety, and extensibility, and while we’ve addressed some long-standing pain points, there’s still plenty of room to evolve.

Continuous improvement ahead

We intentionally shipped this work with a sharp scope and have already identified several enhancements that we will be adding in the coming weeks:

  • Introduce a locking layer
    To safely handle concurrent updates, we plan to introduce a locking mechanism that prevents race conditions during parallel modifications.

  • Expose bulk and dry-run features via our APIs
    The State class already supports them—now it’s time to make those capabilities available to users.

  • Improve debugging output
    Now that state transitions are modeled explicitly, we can expose clearer diagnostics to help both users and developers reason about changes.

  • Avoid redundant Elasticsearch requests
    Currently, we make multiple redundant requests during validation. Introducing a lightweight in-memory cache would let us avoid reloading the same resource more than once.

  • Improve access controls
    Currently, we rely on Elasticsearch to enforce access control. Because a single change can touch many different resources, it’s difficult to determine up front which privileges are required. We plan to extend our action definitions with privilege metadata, enabling us to validate the full set of required permissions before executing any actions. This will let us detect and report missing privileges early—before the plan runs.

  • Add APM instrumentation
    With the system structured in distinct, well-defined phases, we’re now in a great position to add performance instrumentation. This will help us identify bottlenecks and improve responsiveness over time.

Revisiting responsibilities

As our orchestration becomes more robust, we’re also re-evaluating where it should live. Large-scale bulk operations, for example, might eventually be better handled closer to Elasticsearch itself, where we can benefit from greater atomicity and tighter performance guarantees. That kind of deep integration would have been premature earlier on—when we were still figuring out the right abstractions and phases for the system. But now that the design has stabilized, we’re in a much better position to start that conversation.

Built to evolve

We designed this system with adaptability in mind. Whether improvements come in the form of internal refactors, better developer experience, or deeper collaboration with Elasticsearch, we’re in a strong position to keep evolving. The architecture is modular by design—and that gives us both the stability to rely on and the flexibility to grow.

Wrapping Up

Building robust, maintainable systems is never just about code—it’s about aligning architecture with the evolving needs and direction of the product. Our journey refactoring Streams reaffirmed that a thoughtful, phased approach not only improves technical clarity but also empowers teams to move faster and innovate more confidently.

If you’re working on complex systems facing similar challenges—whether tangled logic, unpredictable side effects, or the need for extensibility—you’re not alone. We hope our story offers some useful insights and inspiration as you shape your own path forward.

We welcome feedback and collaboration from the community—whether it’s in the form of questions, ideas, or code.

To learn more about Streams, explore:

  • Read about Reimagining streams

  • Look at the Streams website

  • Read the Streams documentation

  • Check out the pull request on GitHub to dive into the code or join the conversation.
