You can’t just prompt your way to great AI features
Inside the custom framework and tooling that power real-time, AI-native features at Attio.

Jamie Davies on June 09, 2025
9 min read
We recently shipped Call Intelligence - a real-time, AI-native call recording feature built directly into Attio. It transcribes as you talk, surfaces insights as they happen, and turns conversations into structured, actionable CRM data.
After launch, I broke down how we built it: from untangling fragmented calendar data to orchestrating real-time insight generation throughout a conversation. However, that thread barely scratched the surface of the technical groundwork that made it all possible - so in this post, I’m pulling back the curtain on the framework, the tooling, and everything we’ve built to make AI feel native in Attio.
Call Intelligence isn’t powered by a single clever prompt or a spaghetti of OpenAI calls. It runs on a custom framework we built to make AI features effortless to develop, reliable to scale, and native to the product. For engineers, shipping intelligent features feels just like building anything else in Attio.
Build vs yarn install?
We started where most teams do - with open-source SDKs, a few prompts, and some enthusiastic copy-and-paste wiring.
That approach got us surprisingly far. We even built and shipped our first real AI feature - the Research Agent in Workflows - using those tools. But as we started pushing them harder, things began to break. The abstractions weren’t holding. The implementations varied wildly. Testing was non-existent. And coordinating logic across multiple models and tools quickly turned into tech debt.
So instead of papering over the cracks with more wrappers and one-off fixes, we stepped back and asked: What would it look like to treat AI as a first-class concept in our codebase, just like we do for controllers, services, or stories?
That thinking led to Thread Agent: a framework designed to make building AI-native features feel as natural, structured, and scalable as the rest of our stack.
Principles we built it around:
Consistent patterns across every feature
Just like we had patterns for how a controller, service, or story was structured, we wanted the same for AI features. Whether it’s generating a workflow name or summarizing a call, these features should be built in the same way - how they’re defined, how they’re configured, and how they’re run. That consistency reduces cognitive overhead for engineers - and makes calling an LLM feel no scarier than opening a database transaction.
Model flexibility without lock-in
At the core of the framework is an extensible provider pattern that abstracts over the intricacies of OpenAI, Anthropic, or any other model provider. This frees us up to pick the right model for the job, A/B test different models and prompts, or gracefully fall back when something fails - all without that complexity leaking into feature code. Coordination happens at the framework level, not inside feature code. And while we still use provider libraries for things like streaming and retries, everything from orchestration to rate limiting is handled by the framework. (We’ll sketch roughly what that provider contract looks like just after this list of principles.)
Visibility into usage and cost
We wanted a clear understanding of how AI was being used - and what it was costing us. Every LLM and tool call in Thread Agent is tracked and attributed to the workspace, the feature, and the request that triggered it. This gives us full visibility into which customers are using which features, how those features perform, and how usage maps to spend.
Type safety, as far as it goes
We use TypeScript everywhere - from async task payloads to controller responses - and we wanted to hold AI code to the same standard. Tool arguments and injected context are typed. Agents that return structured output are typed. And when LLMs misbehave, the framework handles it (with room for future handling like retries, fallback models, or even using LLMs to fix schema violations, all without touching feature code). Like everything we build at Attio, the goal isn’t to force engineers to fight the type system - it’s to lean on it. The framework does the heavy lifting so features can stay ergonomic, predictable, and frictionless to work with.
Scalability and fairness built in
Last week alone, we processed ~600k LLM completions, ran ~40k tool executions, and sent over a billion tokens to our model providers. Thread Agent is built to handle that kind of load without compromising on fairness or reliability. All LLM and tool executions are round-robin’d across workspaces to prevent noisy features - or entire workspaces - from dominating shared resources. We also treat requests differently based on context: a Call Intelligence insight can take priority over a batch of AI attributes that aren’t currently visible on-screen.
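To make the provider pattern concrete, here’s a deliberately simplified sketch of the kind of contract it implies. The names (ModelProvider, runCompletion, CompletionRequest) are illustrative stand-ins, not the actual Thread Agent types:

import {z} from "zod"

// Illustrative only - the real interfaces differ, but the idea is the same:
// each provider adapts its own SDK to one shared contract, and the framework
// owns everything above it (orchestration, rate limiting, usage tracking).
type CompletionRequest = {
  instructions: string
  messages: Array<{role: "user" | "assistant"; content: string}>
  outputSchema?: z.ZodTypeAny
}

type CompletionResult = {
  rawOutput: string
  usage: {inputTokens: number; outputTokens: number; costMicroUsd: number}
}

interface ModelProvider {
  name: string
  runCompletion(request: CompletionRequest): Promise<CompletionResult>
}

Because every variant ultimately resolves to something with this shape, A/B testing a different model or falling back to another provider becomes a matter of picking a different variant - not rewriting feature code.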
Okay I’m sold, show me the code
At its core, an agent is just a POJO. We provide helpers to define them in a type-safe, provider-aware way.
1. Defining an Agent
Here’s a (hopefully not too over-)simplified example. Let’s imagine we want to build a feature that takes some data we already know about a person and tries to find their job title.
Our first step would be to define a JobTitleAgent. It takes a name and a company, uses tools to navigate the web, and returns either an extracted job title - or null if it can’t find one.
// JobTitleAgent is a strongly typed Agent definition that can be imported
// and directly referenced in feature code
export const JobTitleAgent = defineAgent({
  // Variants allow us to test and tweak anything from model, hyperparameters,
  // instructions, output schema or tools
  variants: {
    openai: defineOpenAiVariant({
      model: "gpt-4o",
      instructions: "Use the tools to determine the person's job title. If you're unsure, return null.",
      tools: [SearchTool, CurlTool],
      outputSchema: z.object({
        job_title: z.string().nullable(),
      }),
    }),
    gemini: defineVertexVariant({
      model: "gemini-1.5-flash-002",
      instructions: "Find the person's current job title using the tools provided. If unavailable, return null.",
      tools: [SearchTool, CurlTool],
      outputSchema: z.object({
        job_title: z.string().nullable(),
      }),
    }),
  },
})
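The SearchTool and CurlTool referenced above are ordinary typed tool definitions. We won’t reproduce the real ones here, but the shape is roughly this - the defineTool helper and the tool body are illustrative stand-ins, not the actual framework API:

import {z} from "zod"

// Illustrative sketch only - not the real framework code. A tool pairs a zod
// schema (which constrains what the model may pass) with an execute function
// whose arguments are typed from that same schema.
const defineTool = <Args extends z.ZodTypeAny>(tool: {
  name: string
  description: string
  argsSchema: Args
  execute: (args: z.infer<Args>) => Promise<string>
}) => tool

export const SearchTool = defineTool({
  name: "search",
  description: "Search the web and return the most relevant results as text.",
  argsSchema: z.object({
    query: z.string(),
  }),
  execute: async ({query}) => {
    // `query` is typed as string via z.infer of the schema above;
    // a real tool would call an internal search service here
    return `results for: ${query}`
  },
})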
2. Calling an Agent
Next, in our feature code, we can simply reference this agent like any other constant in our codebase:
await createThread({
  // Populate the thread with some content
  messages: [
    {
      role: "user",
      content: JSON.stringify({
        name: "Jamie Davies",
        company: "Attio",
      }),
    },
  ],
  run: {
    // Run our agent on this thread
    agent: JobTitleAgent,
    // Pick which variant to run.
    // We usually inject a feature flag here - but you could also switch with
    // classic logic, e.g. use a more powerful model variant based on keywords
    variant: "openai",
    // Map events in the agent lifecycle to standard async task processors in
    // the codebase - the framework will automatically inject relevant generated IDs
    callbacks: [
      {on: "complete", task: {...}},
      {on: "failure", task: {...}},
      {on: "tool-call", task: {...}},
    ],
    // Tell the framework how important this run is, allowing it to prioritize
    // and defer work.
    //
    // (It also lets us abstract over things like batch APIs - so feature code
    // looks identical whether a run takes 4 seconds or 48 hours.)
    priority: "immediate",
  },
})
3. Using the output
Once the Agent has completed, you can extract its output in a way that’s fully typed and fully predictable based on the shared definition of the Agent:
const output = await extractStructuredOutput({
  agentThreadId,
  agentThreadRunId,
  agent: JobTitleAgent,
})

output.job_title // string | null
Because the JobTitleAgent definition includes a structured output schema, the return value here is automatically typed - no extra validation or transformation needed. You just get what you expect.
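That typing falls out of the schema itself, via zod’s type inference. As a rough illustration (not the real implementation - the real helper looks the output up from the stored thread run), the return type of a helper like extractStructuredOutput can be derived directly from the agent’s outputSchema:

import {z} from "zod"

// Illustrative sketch - not the real implementation. The agent's zod output
// schema drives the return type via z.infer, so callers get string | null for
// job_title with no manual casts or validation.
type AgentWithOutputSchema<Schema extends z.ZodTypeAny> = {
  variants: Record<string, {outputSchema: Schema}>
}

async function extractStructuredOutputSketch<Schema extends z.ZodTypeAny>(args: {
  agent: AgentWithOutputSchema<Schema>
  variant: string
  // The real framework fetches this from the stored run; it's a parameter here
  // only to keep the sketch self-contained.
  rawModelOutput: string
}): Promise<z.infer<Schema>> {
  const schema = args.agent.variants[args.variant].outputSchema
  // Validate the model's JSON output against the declared schema before returning it
  return schema.parse(JSON.parse(args.rawModelOutput))
}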
Testing, debugging, and peace of mind
We realized that defining agents and running them was only half the battle - making sure we could trust them was a separate challenge. So to give engineers confidence when shipping AI-powered features, we invested in internal tooling that helps us test behavior, debug issues, and continuously improve performance.
Testing Agents like real code
All Attio engineers are already very comfortable writing tests in JavaScript - so, just like with agent execution, we tried to make testing agents feel as close to “normal code” as possible.
We built a lightweight, Jest-style testing framework that lets us test agents like any other JavaScript function - let’s take a look at what a test file might look like for our JobTitleAgent:
defineAgentTestSuite(JobTitleAgent, ({it, defineScorer}) => {
  it(
    "should extract a known job title",
    {
      value: defineScorer.exactMatch("Engineering Lead"),
    },
    async ({createUserMessage, runAgent, scorers}) => {
      createUserMessage(JSON.stringify({
        name: "Jamie Davies",
        company: "Attio",
      }))

      const output = await runAgent()

      scorers.value.assert(output.job_title)
    }
  )

  it(
    "should return null for unknown person",
    {
      value: defineScorer.exactMatch(null),
    },
    async ({createUserMessage, runAgent, scorers}) => {
      createUserMessage(JSON.stringify({
        name: "Fake Person no Job",
        company: "Attio",
      }))

      const output = await runAgent()

      scorers.value.assert(output.job_title)
    }
  )
})
As you can see, these agent test files look almost identical to normal tests you’d expect to find in any JavaScript codebase. Each suite is tied directly to an agent definition, and cases are defined with a list of scorers - which could do anything from comparing an exact value to using an LLM to score and rank responses.
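To give a feel for what a scorer is, here’s a purely illustrative, standalone version in the spirit of defineScorer.exactMatch - the real scorer interface may differ, and an LLM-judged scorer would simply swap in a fuzzier score function:

// Purely illustrative - not the real scorer API. A scorer wraps an expected
// outcome and exposes an assert for hard pass/fail plus a score for ranking.
type Scorer<T> = {
  assert: (actual: T) => void
  score: (actual: T) => number // 0..1 - useful for LLM-judged or fuzzy scorers
}

const exactMatch = <T>(expected: T): Scorer<T> => ({
  assert: (actual) => {
    if (actual !== expected) {
      throw new Error(
        `Expected ${JSON.stringify(expected)} but got ${JSON.stringify(actual)}`
      )
    }
  },
  score: (actual) => (actual === expected ? 1 : 0),
})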
Each case runs multiple times across all variants - and we capture both pass/fail status and relative performance (like cost and latency) per variant.
It should extract a known job title...
—————————————————————————————————————————————————————————————————————————
| Variant | value (ExactMatch = Engineering Lead) | Cost |
—————————————————————————————————————————————————————————————————————————
| openai | ✅ Pass x3 | avg μ$8419.17 |
| gemini | ✅ Pass x3 | avg μ$1811.70 |
—————————————————————————————————————————————————————————————————————————
This makes it trivial to:
A/B test variants on well-understood examples
Spot regressions when prompts or models change
Balance performance vs cost
Ship changes with actual confidence - not “it looked good locally”
Debugging with full visibility
Of course, even with tests, things break - or, more annoyingly, get slow.
That’s why every agent run in production is fully traceable inside our internal admin tooling. The thread viewer gives engineers a step-by-step breakdown of what happened: each model call, each tool invocation, and even sub-agent runs (yes, agents can trigger other agents 😵💫).
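To give a feel for the shape of that data (the names here are illustrative, not our actual trace schema), each step in a run can be thought of as one structured event the viewer replays in order:

// Illustrative only - not the real trace schema. The viewer walks a list of
// events like these to reconstruct a run step by step, including nested
// sub-agent runs.
type ThreadTraceEvent =
  | {
      kind: "model-call"
      variant: string
      model: string
      startedAt: Date
      completedAt: Date
      inputTokens: number
      outputTokens: number
    }
  | {
      kind: "tool-call"
      tool: string
      args: unknown
      result: unknown
      startedAt: Date
      completedAt: Date
    }
  | {
      kind: "sub-agent-run"
      agent: string
      threadId: string
      runId: string
    }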
The viewer is tightly integrated into our auditing and support tooling - so engineers can get scoped, secure access fast, without compromising customer data.
It helps us debug issues quickly. Was this a tool failure, or did the agent just misinterpret the response? Did the model hallucinate a schema? Or did the tool return something unexpected?
It’s also helped us improve the framework itself. In one case, trace timings clearly showed unnecessary dead time between tool calls and model hops - helping us shave literal seconds off multiple critical workflows.
When you can see exactly what happened, it’s a lot easier to make things faster, more predictable, and less weird.
Wrap up
Call Intelligence might have been the first major feature to run on Thread Agent - but it won’t be the last.
What started as a simple way to orchestrate model calls quickly evolved into something much deeper: a shared foundation for how we build, test, and ship AI features across the product. One that’s consistent, observable, and scalable by design.
We didn’t want LLMs to be treated like magic or one-off experiments. We wanted engineers to be able to use AI the same way they’d use a mutation or a database transaction: confidently, repeatably, and with guardrails that actually scale.
We’re still evolving the framework - but the foundations are already powering numerous AI-native features across the platform.
If this kind of work sounds fun - if you’re into well-structured systems, fast-moving product work, or just really hate string-parsing LLM output - we’re hiring.